The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining


Abstract

We analyze the masked language modeling pretraining objective from the perspective of the distributional hypothesis. We investigate whether the better sample efficiency and generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that the distributional property indeed leads to the better sample efficiency of pretrained masked language models, but it does not fully explain their generalization capability. We also conduct analyses over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions.
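For readers unfamiliar with the objective under analysis, the sketch below shows the standard way masked language modeling inputs and labels are constructed (roughly 15% of token positions are replaced with a mask token, and the loss is computed only at those positions). This is a minimal illustration, not the authors' implementation; names such as MASK_ID, IGNORE_INDEX, and the example token ids are illustrative assumptions.

    import random

    MASK_ID = 103        # hypothetical [MASK] token id
    IGNORE_INDEX = -100  # label value for positions excluded from the loss
    MASK_PROB = 0.15     # commonly used masking rate

    def make_mlm_example(token_ids):
        """Build (inputs, labels) for masked language modeling.

        Labels keep the original id at masked positions and IGNORE_INDEX
        elsewhere, so cross-entropy is computed only on masked tokens.
        """
        inputs, labels = [], []
        for tok in token_ids:
            if random.random() < MASK_PROB:
                inputs.append(MASK_ID)   # hide the token from the model
                labels.append(tok)       # the model must predict it here
            else:
                inputs.append(tok)
                labels.append(IGNORE_INDEX)
        return inputs, labels

    # Example with made-up token ids for a short sentence.
    inputs, labels = make_mlm_example([7592, 2088, 2003, 2307, 1012])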

Citation (APA)

Chiang, T. R., & Yogatama, D. (2023). The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 10305–10321). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.637
