Abstract
Pre-trained Transformer-based models have become immensely popular among NLP practitioners. We present TrelBERT, the first Polish language model tailored to the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by further pre-training on a large collection of Twitter data. We demonstrate its usefulness by evaluating it on the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.
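Since the model is released publicly, it can be loaded with the Hugging Face `transformers` library like any other masked-language-model checkpoint. The sketch below is a minimal usage example, not code from the paper; the model identifier `deepsense-ai/trelBERT` is an assumption and should be checked against the authors' actual release.

```python
# Minimal sketch: querying TrelBERT as a masked language model.
# Assumes the checkpoint is published on the Hugging Face Hub under
# "deepsense-ai/trelBERT" -- verify the identifier before use.
from transformers import pipeline

MODEL_ID = "deepsense-ai/trelBERT"  # assumed model id; may differ

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Read the mask token from the model's own tokenizer rather than
# hard-coding it, since BERT-family checkpoints differ ([MASK] vs <mask>).
mask = fill_mask.tokenizer.mask_token

# A colloquial Polish sentence ("What [MASK] weather today!"), in the
# informal register the model was adapted to on Twitter data.
for pred in fill_mask(f"Ale dzisiaj {mask} pogoda!"):
    print(pred["token_str"], round(pred["score"], 3))
```

For a downstream task such as cyberbullying detection, the same checkpoint would instead be loaded with a sequence-classification head (e.g. `AutoModelForSequenceClassification`) and fine-tuned on labeled data.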
Cite
Szmyd, W., Kotyla, A., Zobniów, M., Falkiewicz, P., Bartczuk, J., & Zygadło, A. (2023). TrelBERT: A pre-trained encoder for Polish Twitter. In Proceedings of the 9th Workshop on Slavic Natural Language Processing (SlavicNLP 2023) (pp. 17–24). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bsnlp-1.3