The Effect of Arabic Dialect Familiarity on Data Annotation

Ibrahim Abu Farha; Walid Magdy

Conference Proceedings

WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (2022) 399-408

DOI: 10.18653/v1/2022.wanlp-1.39

Abstract

Data annotation is the foundation of most natural language processing (NLP) tasks. However, data annotation is complex and there is often no specific correct label, especially in subjective tasks. Data annotation is affected by the annotators' ability to understand the provided data. In the case of Arabic, this is important due to the large dialectal variety. In this paper, we analyse how Arabic speakers understand other dialects in written text. Also, we analyse the effect of dialect familiarity on the quality of data annotation, focusing on Arabic sarcasm detection. This is done by collecting third-party labels and comparing them to high-quality first-party labels. Our analysis shows that annotators tend to better identify their own dialect and they are prone to confuse dialects they are unfamiliar with. For task labels, annotators tend to perform better on their dialect or dialects they are familiar with. Finally, females tend to perform better than males on the sarcasm detection task. We suggest that to guarantee high-quality labels, researchers should recruit native dialect speakers for annotation.
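
The comparison the abstract describes, third-party labels checked against first-party labels, can be illustrated with a minimal sketch. It is not the authors' code: the function name, the tuple layout, the group names, and the dialect strings are all assumptions made for illustration, and the toy data is invented, not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_familiarity(annotations, first_party):
    """Score third-party labels against first-party labels.

    annotations: iterable of (item_id, annotator_dialect, label) tuples
                 (hypothetical layout, chosen for this sketch).
    first_party: dict mapping item_id -> (text_dialect, first_party_label).

    Returns accuracy per group, where the group is "same_dialect" if the
    annotator's dialect matches the dialect of the text, else "other_dialect".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item_id, annotator_dialect, label in annotations:
        text_dialect, gold_label = first_party[item_id]
        group = (
            "same_dialect" if annotator_dialect == text_dialect else "other_dialect"
        )
        total[group] += 1
        correct[group] += int(label == gold_label)
    return {group: correct[group] / total[group] for group in total}


# Invented toy data: two texts, each labelled by two annotators.
toy_first_party = {1: ("egyptian", True), 2: ("gulf", False)}
toy_annotations = [
    (1, "egyptian", True),
    (1, "gulf", False),
    (2, "gulf", False),
    (2, "egyptian", False),
]
toy_scores = accuracy_by_familiarity(toy_annotations, toy_first_party)
```

Grouping by whether the annotator's dialect matches the text's dialect is only one simple proxy for familiarity; the abstract also mentions dialects that annotators are familiar with but do not speak natively, which this sketch does not model.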

Cite

Farha, I. A., & Magdy, W. (2022). The Effect of Arabic Dialect Familiarity on Data Annotation. In WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (pp. 399–408). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.wanlp-1.39
