CHAPTERBREAK: A Challenge Dataset for Long-Range Language Models

Abstract

While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce CHAPTERBREAK, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on CHAPTERBREAK show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We publicly release our CHAPTERBREAK dataset to spur more principled future research into LRLMs.
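
The task format described in the abstract can be illustrated with a minimal sketch: given a prefix that ends at a chapter boundary and a list of candidate segments, a system picks the candidate it scores highest, and accuracy is the fraction of examples where that pick is the true next-chapter beginning. The dictionary keys (`prefix`, `candidates`, `gold`), the function names, and the toy `overlap_score` below are all assumptions made for illustration; they are not the released dataset schema or the authors' evaluation code. An actual LRLM evaluation would presumably replace `overlap_score` with a model-based score of each candidate conditioned on the prefix.

```python
from typing import Callable, Dict, List, Sequence


def pick_continuation(
    prefix: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> int:
    """Return the index of the candidate that `score` rates highest for `prefix`."""
    return max(range(len(candidates)), key=lambda i: score(prefix, candidates[i]))


def accuracy(examples: List[Dict], score: Callable[[str, str], float]) -> float:
    """Fraction of examples where the top-scored candidate is the gold one."""
    correct = sum(
        pick_continuation(ex["prefix"], ex["candidates"], score) == ex["gold"]
        for ex in examples
    )
    return correct / len(examples)


def overlap_score(prefix: str, candidate: str) -> float:
    """Toy stand-in scorer (hypothetical): share of candidate words seen in the prefix."""
    prefix_words = set(prefix.lower().split())
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    return sum(w in prefix_words for w in cand_words) / len(cand_words)


# Hypothetical example: one gold continuation and one negative segment.
examples = [
    {
        "prefix": "the door creaked open and mara froze",
        "candidates": [
            "the market was busy that morning",
            "mara stepped through the door",
        ],
        "gold": 1,
    }
]

print(accuracy(examples, overlap_score))
```

Keeping the scorer as a plain callable means the same loop can compare any two systems on identical examples, whether they see the whole prefix or only a short window of it.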

Cite

Sun, S., Thai, K., & Iyyer, M. (2022). CHAPTERBREAK: A Challenge Dataset for Long-Range Language Models. In NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 3704–3714). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.naacl-main.271
