A New Corpus and Lexicon for Offensive Tamazight Language Detection

Kheireddine Abainia; Kenza Kara; Tassadit Hamouni

Conference Proceedings | Open Access

Proceedings of the 7th International Workshop on Social Media World Sensors, SIDEWAYS 2022 (2022)

DOI: 10.1145/3544795.3544852


Abstract

In this paper, we address offensive language detection for Tamazight, one of the under-resourced languages that are still in their infancy and lack a standard orthography. We are particularly interested in the Kabyle dialect, mainly spoken in some cities of northern Algeria (i.e. Tizi-Ouzou and Bejaïa). We propose a new corpus of offensive Tamazight language (the OTAM corpus) comprising 6.2k texts, as well as a new lexicon of offensive and abusive Tamazight words with 12.6k entries. We evaluated several machine learning and deep learning baseline classifiers, and the results show that acceptable performance can be achieved without feature engineering.

Citation (APA)

Abainia, K., Kara, K., & Hamouni, T. (2022). A New Corpus and Lexicon for Offensive Tamazight Language Detection. In Proceedings of the 7th International Workshop on Social Media World Sensors, SIDEWAYS 2022. Association for Computing Machinery, Inc. https://doi.org/10.1145/3544795.3544852
