Pretraining language models (PLMs) on domain-specific data has proven effective for in-domain natural language processing (NLP) tasks. Our work aimed to develop a language model that is effective for NLP tasks on data from diverse social media platforms. We pretrained a language model on English Twitter and Reddit posts, comprising 929M sequence blocks, for 112K steps. We benchmarked our model against three transformer-based models (BERT, BERTweet, and RoBERTa) on 40 social media text classification tasks. Although our model did not perform best on every task, it outperformed the baseline model, BERT, on most of them, illustrating its effectiveness. Our work also provides insights into how to improve the efficiency of training PLMs.
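The benchmarking described above follows the standard recipe of fine-tuning a pretrained encoder with a sequence-classification head on each downstream task. The sketch below illustrates that recipe with the Hugging Face Transformers Trainer; the model identifier, toy dataset, and hyperparameters are placeholders for illustration only and are not the authors' released checkpoint or settings.

```python
# Minimal sketch of fine-tuning a pretrained encoder on a social media text
# classification task. MODEL_NAME and the tiny in-memory dataset are
# illustrative placeholders, not artifacts released with the paper.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

MODEL_NAME = "bert-base-uncased"  # placeholder; swap in the social-media PLM checkpoint

# Toy binary classification data standing in for one of the 40 benchmark tasks.
train_data = Dataset.from_dict({
    "text": ["great thread, totally agree", "this post is misleading"],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Truncate/pad posts to a fixed length before feeding them to the encoder.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

# Attach a randomly initialized classification head on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=train_data,
)
trainer.train()
```

In practice, the same loop would be repeated per task, with a held-out test set and a task-appropriate metric (e.g., F1) used to compare the pretrained models.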
Guo, Y., & Sarker, A. (2023). SocBERT: A Pretrained Model for Social Media Text. In ACL 2023 - 4th Workshop on Insights from Negative Results in NLP, Proceedings (pp. 45–52). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.insights-1.5