Identifying Code-switching in Arabizi

Abstract

We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
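The step from word-level language IDs to sentence-level decisions can be illustrated with a minimal sketch. The abstract does not give the actual tag set or aggregation rule, so everything here is an assumption made for illustration: the tag names (`arabizi`, `english`, `french`, `arabic`, plus any other tag for non-language tokens) and the helper functions are hypothetical, and the rule used (a sentence is code-switched if it contains Arabizi and at least one other language) is one plausible reading, not necessarily the authors' method.

```python
from typing import List

# Hypothetical tag names; the paper's real tag set is not given in the abstract.
ARABIZI = "arabizi"
OTHER_LANGS = {"english", "french", "arabic"}


def contains_arabizi(tags: List[str]) -> bool:
    """True if at least one word in the sentence is tagged as Arabizi."""
    return ARABIZI in tags


def is_code_switched(tags: List[str]) -> bool:
    """True if the sentence mixes Arabizi with at least one other language."""
    return ARABIZI in tags and any(t in OTHER_LANGS for t in tags)


def switch_points(tags: List[str]) -> List[int]:
    """Indices of words whose language differs from the previous language-tagged word.

    Tokens with any tag outside the language set (e.g. punctuation) are skipped.
    """
    points = []
    prev = None
    for i, tag in enumerate(tags):
        if tag != ARABIZI and tag not in OTHER_LANGS:
            continue  # not a language tag; ignore for switching purposes
        if prev is not None and tag != prev:
            points.append(i)
        prev = tag
    return points
```

Under these assumptions, a sentence tagged `["arabizi", "arabizi", "english", "other", "english", "arabizi"]` contains Arabizi, counts as code-switched, and has switch points at word indices 2 and 5.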

Cite

Shehadi, S., & Wintner, S. (2022). Identifying Code-switching in Arabizi. In WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (pp. 194–204). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.wanlp-1.18
