Unsupervised Arabic dialect segmentation for machine translation

Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich, and yet resource-poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.




Salloum, W., & Habash, N. (2022). Unsupervised Arabic dialect segmentation for machine translation. Natural Language Engineering, 28(2), 223–248. https://doi.org/10.1017/S1351324920000455
