MaroBERTa: Multilabel Classification Language Model for Darija Newspaper


Abstract

A large amount of valuable digital text, audio, and video data is available on the web, and many machine learning applications based on Natural Language Processing (NLP) have taken advantage of it. Transformers, especially models based on Bidirectional Encoder Representations from Transformers (BERT), have become the state of the art for downstream NLP tasks. Non-normalized languages such as Moroccan Arabic, also known as Darija, increase the complexity of natural language processing. Furthermore, text written in Darija has no standard spelling, and resources are lacking, especially for multilabel classification. In this paper, we introduce a multilabel classification model for Moroccan Arabic (Darija) newspapers. First, we created a dataset from 400,000 collected newspaper articles with their titles, written in Darija, and pre-trained our model, MaroBERTa, on it. Second, we implemented a crowd-sourcing platform to help create a novel corpus called the Darija Multilabel Dataset for News classification (DMDNews). This dataset contains 28 different classes representing the most frequent topics in Moroccan newspapers. Finally, we fine-tuned MaroBERTa and two multilingual models (AraBERT and CAMelBert) for the multilabel classification task using DMDNews. Experiments show that our dedicated pretrained Darija model, MaroBERTa, outperforms the existing multilingual models despite the large amount of data they were trained on.
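To make the multilabel setting concrete, the sketch below shows how a classification head over the 28 DMDNews classes could turn logits into a set of labels. It is an illustration only, not the authors' implementation: it assumes the common setup of one independent sigmoid per label trained with binary cross-entropy, and the 0.5 threshold, the function names, and the example logits are hypothetical.

```python
import math

NUM_LABELS = 28  # number of DMDNews classes, taken from the abstract


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.5):
    """Return the indices of all labels whose probability reaches the threshold.

    Each label is scored independently, so an article can receive zero,
    one, or several topics (unlike softmax, which picks exactly one).
    The threshold value is an assumption, not a figure from the paper.
    """
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical logits for one article: two topics score high, the rest low.
logits = [-3.0] * NUM_LABELS
logits[2] = 2.0
logits[7] = 1.5

# Matching multi-hot target vector for the same article.
targets = [0.0] * NUM_LABELS
targets[2] = 1.0
targets[7] = 1.0

predicted = predict_labels(logits)
loss = bce_with_logits(logits, targets)
print(predicted, round(loss, 4))
```

Scoring each label independently is what distinguishes this task from single-label news categorization, where exactly one class is chosen per article.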

Cite

Hamza, L., & Mohammed, R. (2022). MaroBERTa: Multilabel Classification Language Model for Darija Newspaper. In Communications in Computer and Information Science (Vol. 1677 CCIS, pp. 388–401). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20490-6_31
