Arabic Dialect Identification

Abstract

The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a nontrivial manner from the various spoken regional dialects of Arabic, which are the true "native languages" of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic Online Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and the dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from online Arabic newspapers. © 2014 Association for Computational Linguistics.
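
The task stated in the abstract (given the word sequence of a sentence, determine its variety of Arabic) can be illustrated with a minimal sketch. The abstract does not specify the classifier design, so everything here is an assumption made for illustration: one add-one-smoothed word unigram model per variety, the hypothetical class name UnigramDialectClassifier, the labels MSA/EGY/LEV, and the three toy sentences, which are not taken from the Arabic Online Commentary data.

```python
import math
from collections import Counter


class UnigramDialectClassifier:
    """Toy classifier: one add-one-smoothed word unigram model per label.

    Illustrative assumption only; not the classifier described in the article.
    """

    def __init__(self):
        self.counts = {}    # label -> Counter of word frequencies
        self.totals = {}    # label -> total number of training tokens
        self.vocab = set()  # all word types seen during training

    def train(self, labeled_sentences):
        """labeled_sentences: iterable of (sentence, label) pairs."""
        for sentence, label in labeled_sentences:
            tokens = sentence.split()
            self.counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}

    def log_likelihood(self, sentence, label):
        """Log-probability of the word sequence under the label's model."""
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        counts = self.counts[label]
        total = self.totals[label]
        return sum(
            math.log((counts[token] + 1) / (total + vocab_size))
            for token in sentence.split()
        )

    def classify(self, sentence):
        """Return the label whose model assigns the highest likelihood."""
        return max(
            self.counts,
            key=lambda label: self.log_likelihood(sentence, label),
        )


# Hypothetical toy training data (not from the annotated data set).
TOY_DATA = [
    ("ماذا تريد ان تفعل", "MSA"),  # Modern Standard Arabic
    ("عايز تعمل ايه", "EGY"),      # Egyptian
    ("شو بدك تعمل", "LEV"),        # Levantine
]

clf = UnigramDialectClassifier()
clf.train(TOY_DATA)
print(clf.classify("عايز ايه"))
```

A realistic system would train on many thousands of annotated sentences and would likely use richer features than single words; the sketch only shows the shape of the input (a word sequence) and the output (a variety label).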

Cite

Zaidan, O. F., & Callison-Burch, C. (2014). Arabic Dialect Identification. Computational Linguistics, 40(1), 171–202. https://doi.org/10.1162/COLI_a_00169
