Automatic Identification of Arabic Language Varieties and Dialects in Social Media

Abstract

Modern Standard Arabic (MSA) is the formal language of most Arabic-speaking countries. Arabic dialects (AD), the varieties used in daily communication, differ from MSA, especially in social media. Most Arabic social media text, however, mixes forms and shows many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using character n-gram Markov language models and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with an overall accuracy of 98%.
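To make the idea of a Naive Bayes classifier over character bi-grams concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class name `CharNgramNaiveBayes`, the helper `char_ngrams`, the add-one smoothing, the space padding, and the placeholder training strings and labels (`variety_1`, `variety_2`) are all assumptions made for illustration. The abstract does not describe the paper's preprocessing, smoothing, or data, so none of that is reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a space-padded string."""
    padded = " " + text + " "  # padding marks word boundaries (an assumption)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                          # n-gram order (2 = bi-gram)
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(n-gram | label) per label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label in self.doc_counts:
            total = sum(self.counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                smoothed = self.counts[label][gram] + self.alpha
                score += math.log(smoothed / (total + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Placeholder strings, NOT real dialect data: two artificial "varieties"
# that differ only in their character distributions.
train_texts = ["abab aab", "baab ab", "xyxy yzx", "zyx xy"]
train_labels = ["variety_1", "variety_1", "variety_2", "variety_2"]

model = CharNgramNaiveBayes(n=2).fit(train_texts, train_labels)
print(model.predict("abba"))
print(model.predict("xyzy"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing constant keeps an unseen bi-gram from driving a label's score to negative infinity. A real experiment would replace the placeholder strings with labeled social media text and evaluate on held-out data.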

Cite

Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. In SocialNLP 2014 - 2nd Workshop on Natural Language Processing for Social Media, in conjunction with COLING 2014 (pp. 22–27). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5904
