GANDALF: a General Character Name Description Dataset for Long Fiction

Fredrik Carlsson; Fredrik Olsson; Amaru Cuba Gyllensten; Magnus Sahlgren

Conference Proceedings (Open Access)

Proceedings of the 3rd Workshop on Machine Reading for Question Answering, MRQA 2021 (2021) 119-132

DOI: 10.18653/v1/2021.mrqa-1.13

Abstract

This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.
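
The question format described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' construction code: the names `Question`, `make_question` and `questions_for_description` are hypothetical, the distractor sampling strategy is an assumption, and the anonymization step is left out because the abstract does not specify it. The sketch assumes that each annotated description yields one question per direction (description to name, and name to description), which would be consistent with 20,000 questions from 10,000 descriptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str          # text shown to the model (a description or a name)
    choices: list[str]   # the candidate answers, exactly one of them correct
    answer_index: int    # position of the correct answer in `choices`


def make_question(prompt, correct, distractor_pool, rng, n_choices=10):
    """Build one n-way multiple-choice question (assumed sampling scheme)."""
    # Deduplicate the pool and exclude the correct answer from the distractors.
    candidates = sorted(set(distractor_pool) - {correct})
    if len(candidates) < n_choices - 1:
        raise ValueError("not enough distinct distractors for this question")
    choices = rng.sample(candidates, n_choices - 1) + [correct]
    rng.shuffle(choices)
    return Question(prompt, choices, choices.index(correct))


def questions_for_description(name, description, all_names, all_descriptions, rng):
    """Return two questions for one annotated description, one per direction."""
    description_to_name = make_question(description, name, all_names, rng)
    name_to_description = make_question(name, description, all_descriptions, rng)
    return [description_to_name, name_to_description]


# Usage with placeholder data (not taken from the dataset).
rng = random.Random(0)
names = [f"Name_{i}" for i in range(12)]
descriptions = [f"Placeholder description {i}" for i in range(12)]
questions = questions_for_description(names[0], descriptions[0], names, descriptions, rng)
```

With ten choices per question, uniform random guessing is correct 10% of the time, which is the natural floor against which baseline results on such a task can be read.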

Citation (APA)

Carlsson, F., Olsson, F., Gyllensten, A. C., & Sahlgren, M. (2021). GANDALF: a General Character Name Description Dataset for Long Fiction. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, MRQA 2021 (pp. 119–132). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.mrqa-1.13
