Data annotation is the foundation of most natural language processing (NLP) tasks. However, annotation is complex, and there is often no single correct label, especially in subjective tasks. Annotation is also affected by the annotators' ability to understand the provided data. For Arabic, this is particularly important because of its large dialectal variety. In this paper, we analyse how Arabic speakers understand other dialects in written text. We also analyse the effect of dialect familiarity on the quality of data annotation, focusing on Arabic sarcasm detection. We do this by collecting third-party labels and comparing them to high-quality first-party labels. Our analysis shows that annotators tend to identify their own dialect better and are prone to confusing dialects they are unfamiliar with. For task labels, annotators tend to perform better on their own dialect or on dialects they are familiar with. Finally, female annotators tend to perform better than male annotators on the sarcasm detection task. We suggest that, to guarantee high-quality labels, researchers should recruit native dialect speakers for annotation.
Farha, I. A., & Magdy, W. (2022). The Effect of Arabic Dialect Familiarity on Data Annotation. In WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (pp. 399–408). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.wanlp-1.39