A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks mostly due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and we show that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order. Our models also perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and they underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
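The word-order perturbation the abstract describes can be illustrated with a minimal sketch. The function name `shuffle_word_order` and the whitespace tokenization are illustrative assumptions, not the paper's actual pipeline (the paper applies several shuffling variants during pre-training); the point shown is simply that shuffling preserves the bag of words while destroying order.

```python
import random

def shuffle_word_order(sentence, seed=None):
    """Return the sentence with its tokens in random order.

    Illustrative only: the multiset of words (and hence the
    co-occurrence statistics within the sentence) is preserved,
    while the syntactic signal carried by word order is removed.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

original = "the cat sat on the mat"
shuffled = shuffle_word_order(original, seed=0)
# Same bag of words, different order.
assert sorted(shuffled.split()) == sorted(original.split())
```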
CITATION STYLE
Sinha, K., Jia, R., Hupkes, D., Pineau, J., Williams, A., & Kiela, D. (2021). Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little. In EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 2888–2913). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.230