Cost-effective selection of pretraining data: a case study of pretraining BERT on social media

20 citations · 87 Mendeley readers

Abstract

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
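The abstract mentions using similarity measures to nominate in-domain pretraining data. The paper's exact measures are not given here, so the following is only an illustrative sketch of the general idea, using a simple Jaccard overlap between corpus vocabularies (a hypothetical choice, not necessarily the authors' method) to rank candidate pretraining corpora by closeness to the target task data:

```python
from collections import Counter

def vocab_overlap(target_texts, candidate_texts, top_k=1000):
    """Jaccard overlap between the top-k token vocabularies of two corpora.

    Illustrative only: the paper may use different similarity measures.
    """
    def top_vocab(texts):
        counts = Counter(tok for t in texts for tok in t.lower().split())
        return {w for w, _ in counts.most_common(top_k)}

    a, b = top_vocab(target_texts), top_vocab(candidate_texts)
    return len(a & b) / len(a | b)

# Toy example: rank candidate corpora by similarity to downstream task data.
target = ["gotta luv this new phone lol", "anyone else seeing lag today"]
candidates = {
    "tweets": ["omg this app is so buggy lol", "luv the new update"],
    "pubmed": ["the protein binds to the receptor", "dosage was increased"],
}
ranked = sorted(candidates,
                key=lambda name: vocab_overlap(target, candidates[name]),
                reverse=True)
```

With these toy corpora, the tweet corpus ranks first because its informal vocabulary ("luv", "lol") overlaps with the target, matching the paper's intuition that social media text is a distinct language variety.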

Citation (APA)

Dai, X., Karimi, S., Hachey, B., & Paris, C. (2020). Cost-effective selection of pretraining data: A case study of pretraining BERT on social media. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1675–1681). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.151
