Identifying Code-switching in Arabizi

Safaa Shehadi; Shuly Wintner

Conference Proceedings (Open Access)

WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (2022) 194-204

DOI: 10.18653/v1/2022.wanlp-1.18

Abstract

We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
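The step from word-level predictions to sentence-level decisions can be illustrated with a minimal sketch. It assumes each sentence is already available as a list of per-word language tags; the tag names (`arabizi`, `english`, `french`, `arabic`, `other`) and the function names are hypothetical placeholders, not the tag set or the method used in the paper, and the rule shown (a change of language between consecutive language-bearing words) is only one plausible way to flag code-switching.

```python
from typing import List, Tuple

# Hypothetical tag names, chosen for illustration; the paper's tag set may differ.
ARABIZI = "arabizi"
LANGUAGE_TAGS = {ARABIZI, "english", "french", "arabic"}


def contains_arabizi(tags: List[str]) -> bool:
    """True if at least one word in the sentence is tagged as Arabizi."""
    return ARABIZI in tags


def switch_points(tags: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, from_language, to_language) for each language change.

    Words whose tag is not a language (e.g. punctuation, tagged "other")
    are skipped, so they never start or end a switch.
    """
    switches = []
    previous = None
    for index, tag in enumerate(tags):
        if tag not in LANGUAGE_TAGS:
            continue
        if previous is not None and tag != previous:
            switches.append((index, previous, tag))
        previous = tag
    return switches


def is_code_switched_arabizi(tags: List[str]) -> bool:
    """True if the sentence switches into or out of Arabizi at least once."""
    return any(ARABIZI in (src, dst) for _, src, dst in switch_points(tags))


if __name__ == "__main__":
    example = ["arabizi", "arabizi", "other", "english", "english", "arabizi"]
    print(contains_arabizi(example))
    print(switch_points(example))
    print(is_code_switched_arabizi(example))
```

Applied to the output of a word-level classifier over a raw corpus, a filter of this kind would keep only the sentences that contain Arabizi and record where the language changes.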

Cite

Shehadi, S., & Wintner, S. (2022). Identifying Code-switching in Arabizi. In WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (pp. 194–204). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.wanlp-1.18
