LISAC FSDM-USMBA Team at SemEval-2020 Task 12: Overcoming AraBERT's pretrain-finetune discrepancy for Arabic offensive language identification

Abstract

AraBERT is an Arabic version of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model, which has achieved strong performance across a variety of Natural Language Processing (NLP) tasks. In this paper, we propose an effective AraBERT embeddings-based method for dealing with offensive Arabic language on Twitter. First, we pre-process tweets by handling emojis and including their Arabic meanings. Next, to overcome the pretrain-finetune discrepancy, we substitute each detected emoji with the special token [MASK] in both the fine-tuning and inference phases. Then, we represent tweet tokens using the AraBERT model. Finally, we feed the tweet representation into a sigmoid function to decide whether a tweet is offensive or not. The proposed method achieved the best results on the OffensEval 2020 Arabic task, reaching a macro F1 score of 90.17%.
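The sketch below illustrates the pipeline the abstract describes: replace detected emojis with the [MASK] token (at both fine-tuning and inference time), encode the tweet with AraBERT, and apply a sigmoid over the tweet representation. It is a minimal illustration rather than the authors' code; the "aubmindlab/bert-base-arabert" checkpoint name, the emoji regex, and the use of the [CLS] vector as the tweet representation are assumptions, and the step that appends the emojis' Arabic meanings during pre-processing is omitted.

```python
# Minimal sketch (not the authors' implementation) of emoji-to-[MASK]
# substitution followed by AraBERT encoding and a sigmoid classifier.
import re
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Rough emoji ranges, illustrative only; the paper's emoji handling is richer.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Assumed AraBERT checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
encoder = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")


def preprocess(tweet: str) -> str:
    # Replace every detected emoji with the [MASK] token, applied identically
    # during fine-tuning and inference to avoid a pretrain-finetune mismatch.
    return EMOJI_RE.sub(tokenizer.mask_token, tweet)


class OffensiveTweetClassifier(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # tweet representation ([CLS])
        return torch.sigmoid(self.classifier(cls))   # probability of "offensive"


model = OffensiveTweetClassifier(encoder)
batch = tokenizer([preprocess("tweet text 😡")], return_tensors="pt",
                  padding=True, truncation=True)
prob = model(batch["input_ids"], batch["attention_mask"])
print("offensive" if float(prob) > 0.5 else "not offensive")
```

In practice the linear layer and the encoder would be fine-tuned jointly on the labeled OffensEval data with a binary cross-entropy loss; the snippet only shows the forward pass.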

Cite

Alami, H., El Alaoui, S. O., Benlahbib, A., & En-Nahnahi, N. (2020). LISAC FSDM-USMBA Team at SemEval-2020 Task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification. In 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings (pp. 2080–2085). International Committee for Computational Linguistics. https://doi.org/10.18653/v1/2020.semeval-1.275
