Abstract
Pre-trained Transformer-based models have become immensely popular among NLP practitioners. We present TrelBERT, the first Polish language model tailored to the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by further pre-training on a large collection of Twitter data. We demonstrate its usefulness by evaluating it on the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.
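Since the model is released publicly, it can be loaded with the Hugging Face `transformers` library like any other masked-language-model checkpoint. The sketch below is a minimal usage example, not code from the paper; the model identifier `deepsense-ai/trelBERT` is an assumption and should be checked against the authors' actual release.

```python
# Minimal sketch: querying TrelBERT as a masked language model.
# Assumes the checkpoint is published on the Hugging Face Hub under
# "deepsense-ai/trelBERT" -- verify the identifier before use.
from transformers import pipeline

MODEL_ID = "deepsense-ai/trelBERT"  # assumed model id; may differ

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Read the mask token from the model's own tokenizer rather than
# hard-coding it, since BERT-family checkpoints differ ([MASK] vs <mask>).
mask = fill_mask.tokenizer.mask_token

# A colloquial Polish sentence ("What [MASK] weather today!"), in the
# informal register the model was adapted to on Twitter data.
for pred in fill_mask(f"Ale dzisiaj {mask} pogoda!"):
    print(pred["token_str"], round(pred["score"], 3))
```

For a downstream task such as cyberbullying detection, the same checkpoint would instead be loaded with a sequence-classification head (e.g. `AutoModelForSequenceClassification`) and fine-tuned on labeled data.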
Cite
Szmyd, W., Kotyla, A., Zobniów, M., Falkiewicz, P., Bartczuk, J., & Zygadło, A. (2023). TrelBERT: A pre-trained encoder for Polish Twitter. In Proceedings of the 9th Workshop on Slavic Natural Language Processing (SlavicNLP 2023) (pp. 17–24). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bsnlp-1.3