GANDALF: a General Character Name Description Dataset for Long Fiction

Abstract

This paper introduces a long-range multiple-choice Question Answering (QA) dataset based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.

Cite

Carlsson, F., Olsson, F., Gyllensten, A. C., & Sahlgren, M. (2021). GANDALF: a General Character Name Description Dataset for Long Fiction. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, MRQA 2021 (pp. 119–132). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.mrqa-1.13
