LISAC FSDM-USMBA Team at SemEval-2020 Task 12: Overcoming AraBERT's pretrain-finetune discrepancy for Arabic offensive language identification

Abstract

AraBERT is an Arabic version of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model, which has achieved strong performance across a variety of Natural Language Processing (NLP) tasks. In this paper, we propose an effective AraBERT embeddings-based method for dealing with offensive Arabic language on Twitter. First, we pre-process tweets by handling emojis and including their Arabic meanings. Next, to overcome the pretrain-finetune discrepancy, we substitute each detected emoji with the special token [MASK] in both the fine-tuning and inference phases. Then, we represent tweet tokens using the AraBERT model. Finally, we feed the tweet representation into a sigmoid function to decide whether a tweet is offensive or not. The proposed method achieved the best results on the OffensEval 2020 Arabic task, reaching a macro F1 score of 90.17%.
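The sketch below illustrates the pipeline the abstract describes: replace detected emojis with the [MASK] token (at both fine-tuning and inference time), encode the tweet with AraBERT, and apply a sigmoid over the tweet representation. It is a minimal illustration rather than the authors' code; the "aubmindlab/bert-base-arabert" checkpoint name, the emoji regex, and the use of the [CLS] vector as the tweet representation are assumptions, and the step that appends the emojis' Arabic meanings during pre-processing is omitted.

```python
# Minimal sketch (not the authors' implementation) of emoji-to-[MASK]
# substitution followed by AraBERT encoding and a sigmoid classifier.
import re
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Rough emoji ranges, illustrative only; the paper's emoji handling is richer.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Assumed AraBERT checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
encoder = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")


def preprocess(tweet: str) -> str:
    # Replace every detected emoji with the [MASK] token, applied identically
    # during fine-tuning and inference to avoid a pretrain-finetune mismatch.
    return EMOJI_RE.sub(tokenizer.mask_token, tweet)


class OffensiveTweetClassifier(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # tweet representation ([CLS])
        return torch.sigmoid(self.classifier(cls))   # probability of "offensive"


model = OffensiveTweetClassifier(encoder)
batch = tokenizer([preprocess("tweet text 😡")], return_tensors="pt",
                  padding=True, truncation=True)
prob = model(batch["input_ids"], batch["attention_mask"])
print("offensive" if float(prob) > 0.5 else "not offensive")
```

In practice the linear layer and the encoder would be fine-tuned jointly on the labeled OffensEval data with a binary cross-entropy loss; the snippet only shows the forward pass.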

Cite

Alami, H., El Alaoui, S. O., Benlahbib, A., & En-Nahnahi, N. (2020). LISAC FSDM-USMBA Team at SemEval-2020 Task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification. In 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings (pp. 2080–2085). International Committee for Computational Linguistics. https://doi.org/10.18653/v1/2020.semeval-1.275
