Downstream Datasets Make Surprisingly Good Pretraining Corpora


Abstract

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around 10×–500× less data), outperforming the latter on 7 and 5 datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Self-pretraining also provides benefits on structured output prediction tasks such as question answering and commonsense inference, often recovering more than 50% of the performance gains provided by standard pretraining. Our results suggest that performance gains attributable to pretraining are often driven primarily by the pretraining objective itself, and are not always attributable to the use of external pretraining data in massive amounts. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.
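To make the self-pretraining recipe concrete, the sketch below illustrates the two-stage setup, assuming the Hugging Face transformers and datasets libraries. It is a minimal illustration under stated assumptions, not the authors' released code: the AG News dataset, the RoBERTa-base configuration and tokenizer, and all hyperparameters are stand-ins chosen for brevity.

    # Minimal sketch of self-pretraining: MLM-pretrain on the downstream
    # training text only, then finetune on the same data with labels.
    # Dataset, model size, and hyperparameters are illustrative stand-ins.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        DataCollatorForLanguageModeling,
        RobertaConfig,
        RobertaForMaskedLM,
        Trainer,
        TrainingArguments,
    )

    dataset = load_dataset("ag_news")  # stand-in downstream classification dataset
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # reused tokenizer for simplicity

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    tokenized = dataset.map(tokenize, batched=True)

    # Stage 1: masked language modeling on the downstream training text only,
    # starting from a randomly initialized encoder (no external corpus).
    mlm_model = RobertaForMaskedLM(RobertaConfig())  # random init, no BookWiki pretraining
    mlm_trainer = Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir="self_pretrained", num_train_epochs=3),
        train_dataset=tokenized["train"].remove_columns(["text", "label"]),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    )
    mlm_trainer.train()
    mlm_model.save_pretrained("self_pretrained")

    # Stage 2: finetune the self-pretrained encoder on the same data, now with labels.
    clf = AutoModelForSequenceClassification.from_pretrained("self_pretrained", num_labels=4)
    clf_trainer = Trainer(
        model=clf,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
    )
    clf_trainer.train()
    print(clf_trainer.evaluate())

In the paper's comparison, the standard-pretraining baseline would instead start Stage 2 from a checkpoint pretrained on the BookWiki corpus; in this sketch both stages see only the downstream training split.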

Citation (APA)

Krishna, K., Garg, S., Bigham, J. P., & Lipton, Z. C. (2023). Downstream Datasets Make Surprisingly Good Pretraining Corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 12207–12222). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.682
