In this paper, we introduce a neural-network-based sequence learning approach to Arabic dialect classification. We propose character models based on recurrent neural networks with Long Short-Term Memory (LSTM) to classify short texts, such as tweets, written in different Arabic dialects. The LSTM-based character models can handle long-term dependencies in character sequences and do not require a set of word-level linguistic rules, which is especially useful given the rich morphology of the Arabic language and the lack of strict orthographic rules for its dialects. On the Tunisian Election Twitter dataset, our system achieves an average accuracy of 92.2% in distinguishing Modern Standard Arabic from the Tunisian dialect. On the Multidialectal Parallel Corpus of Arabic, the proposed character models distinguish six classes (Modern Standard Arabic and five Arabic dialects) with an average accuracy of 63.4%. They clearly outperform a standard word-level approach based on statistical n-grams as well as several other existing systems.
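To make the idea of a character-level LSTM classifier concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the vocabulary construction, the embedding and hidden sizes, the omission of bias terms, and all names (`build_vocab`, `encode`, `TinyCharLSTM`) are assumptions made for illustration, and the weights are random and untrained. It only shows how a short text can be consumed one character at a time, with no word-level rules, and mapped to a probability distribution over dialect classes.

```python
import math
import random


def build_vocab(texts):
    """Map each character seen in the texts to an id; id 0 is reserved for unknown characters."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into a list of character ids (no tokenization into words)."""
    return [vocab.get(ch, 0) for ch in text]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyCharLSTM:
    """Single-layer LSTM over character ids, followed by a softmax layer (hypothetical sizes)."""

    def __init__(self, vocab_size, hidden, n_classes, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden = hidden
        self.embed = mat(vocab_size, hidden)  # one vector per character id
        # One weight matrix per gate (input, forget, output, candidate),
        # each applied to the concatenation [embedding; previous hidden state].
        self.gates = {name: mat(hidden, 2 * hidden) for name in "ifog"}
        self.out = mat(n_classes, hidden)

    def forward(self, ids):
        h = [0.0] * self.hidden  # hidden state
        c = [0.0] * self.hidden  # cell state, carries long-term information
        for idx in ids:
            xh = self.embed[idx] + h  # list concatenation
            i = [sigmoid(z) for z in matvec(self.gates["i"], xh)]
            f = [sigmoid(z) for z in matvec(self.gates["f"], xh)]
            o = [sigmoid(z) for z in matvec(self.gates["o"], xh)]
            g = [math.tanh(z) for z in matvec(self.gates["g"], xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        # Classify from the final hidden state with a numerically stable softmax.
        logits = matvec(self.out, h)
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Usage: two toy strings stand in for a training corpus; six classes mirror
# the six-way setting mentioned in the abstract.
texts = ["مرحبا", "برشا"]
vocab = build_vocab(texts)
model = TinyCharLSTM(vocab_size=len(vocab) + 1, hidden=8, n_classes=6)
probs = model.forward(encode(texts[0], vocab))
```

In a real system the weights would be learned from labeled text, typically with a deep learning framework rather than hand-written loops; the sketch only illustrates the data flow from characters to class probabilities.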
CITATION STYLE
Sayadi, K., Hamidi, M., Bui, M., Liwicki, M., & Fischer, A. (2018). Character-level dialect identification in Arabic using long short-term memory. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10762 LNCS, pp. 324–337). Springer Verlag. https://doi.org/10.1007/978-3-319-77116-8_24