Can Transformer Models Measure Coherence in Text? Re-Thinking the Shuffle Test

Philippe Laban; Luke Dai; Lucas Bandarkar; Marti A. Hearst

Conference Proceedings, Open Access

ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2021), Vol. 2, pp. 1058-1064

DOI: 10.18653/v1/2021.acl-short.134

Abstract

The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near-perfect accuracy of 97.8%, a new state of the art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high performance out-of-the-box. Finally, we suggest the k-Block Shuffle Test, a modification of the original that increases the size of the blocks being shuffled. Even though human reader performance remains high (around 95% accuracy), model performance drops from 94% to 78% as block size increases, creating a conceptually simple challenge to benchmark NLP models.
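The block shuffling and the zero-shot protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a document is already split into sentences, that a block is a run of k consecutive sentences, and that a model exposes some coherence score where higher means more coherent. The names `k_block_shuffle`, `zero_shot_accuracy`, and the `score` callable are hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, List, Sequence


def k_block_shuffle(sentences: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Group sentences into consecutive blocks of k and permute the blocks.

    Assumption: the last block may be shorter than k when the document
    length is not a multiple of k. With k=1 this is a plain sentence shuffle.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    blocks = [list(sentences[i:i + k]) for i in range(0, len(sentences), k)]
    if len(blocks) < 2:
        return list(sentences)  # a single block cannot be reordered
    order = list(range(len(blocks)))
    # Re-draw until the block order differs from the original order.
    while order == sorted(order):
        rng.shuffle(order)
    return [sentence for i in order for sentence in blocks[i]]


def zero_shot_accuracy(
    documents: Sequence[Sequence[str]],
    k: int,
    score: Callable[[Sequence[str]], float],
    seed: int = 0,
) -> float:
    """Fraction of documents whose original order outscores its shuffled copy.

    `score` is a hypothetical stand-in for any model-derived coherence
    score (higher = more coherent); no training on the task is involved.
    """
    rng = random.Random(seed)
    correct = 0
    total = 0
    for sentences in documents:
        shuffled = k_block_shuffle(sentences, k, rng)
        if shuffled == list(sentences):
            continue  # too short to shuffle at this block size
        total += 1
        if score(sentences) > score(shuffled):
            correct += 1
    return correct / total if total else float("nan")
```

Larger values of k keep more local context intact inside each block, which is consistent with the abstract's observation that the task becomes harder for models as block size increases.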

Cite

Laban, P., Dai, L., Bandarkar, L., & Hearst, M. A. (2021). Can Transformer Models Measure Coherence in Text? Re-Thinking the Shuffle Test. In ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (Vol. 2, pp. 1058–1064). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.acl-short.134
