Exploring Amharic Hate Speech Data Collection and Classification Approaches

Abstract

In this paper, we study efficient data selection and annotation strategies for Amharic hate speech. We also build several classification models and investigate the challenges of hate speech data selection, annotation, and classification for the Amharic language. From a Twitter corpus of over 18 million tweets, 15.1k tweets are annotated by two independent native speakers, who reach a Cohen's kappa score of 0.48. A third annotator, acting as curator, decides the final gold labels. We employ both classical machine learning and deep learning approaches, including fine-tuning the AmFLAIR and AmRoBERTa contextual embedding models. Among all the models, AmFLAIR achieves the best performance, with an F1-score of 72%. We publicly release the annotation guidelines, keywords/lexicon entries, datasets, models, and associated scripts under a permissive license.
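For readers unfamiliar with the agreement measure quoted in the abstract, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators. It is not the authors' released script. The function name `cohens_kappa`, the label names, and the toy labels are assumptions made up for illustration, so the resulting value says nothing about the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, NOT taken from the paper's dataset.
annotator_1 = ["hate", "normal", "normal", "offensive", "hate", "normal"]
annotator_2 = ["hate", "normal", "offensive", "offensive", "hate", "normal"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be much lower than the plain percentage of matching labels. In a setup like the one described, items on which the two annotators disagree would then go to the curator for a final label.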

Cite

Ayele, A. A., Yimam, S. M., Belay, T. D., Asfaw, T. T., & Biemann, C. (2023). Exploring Amharic Hate Speech Data Collection and Classification Approaches. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 49–59). Incoma Ltd. https://doi.org/10.26615/978-954-452-092-2_006
