Arabic-Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation



Abstract

Morphologically rich and complex languages such as Arabic pose a major challenge to neural machine translation (NMT) because they contain many rare words that NMT cannot translate. Because NMT typically operates with a fixed-size vocabulary, out-of-vocabulary words are represented by unknown-word (UNK) symbols. Algorithms such as byte pair encoding (BPE) tackle the UNK problem by encoding rare words as sequences of subword units. However, for highly inflected languages with rich morphological variation, such as Arabic, this method has limitations that reduce its effectiveness for translation quality. To alleviate the UNK problem and address the undesirable behavior of BPE when translating Arabic, we propose a romanization system that converts Arabic script into subword units. We investigate the effect of our approach on NMT performance under various segmentation scenarios and compare the results with systems trained on the original Arabic form. In addition, we integrate Romanized Arabic as an input factor for Arabic-sourced NMT and compare it with well-known factors, namely lemma, part-of-speech tags, and morphological features. Extensive experiments on Arabic-Chinese translation demonstrate that the proposed approaches effectively tackle the UNK problem and significantly improve translation quality for Arabic-sourced translation. Additional experiments focus on developing an NMT system for Chinese-Arabic translation. Before running our experiments, we first propose standard criteria for filtering a parallel corpus, which help remove noise.
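The idea of romanizing Arabic before subword segmentation can be illustrated with a minimal sketch. The abstract does not name the romanization scheme or the BPE implementation used, so everything here is an assumption: the mapping is a partial Buckwalter-style table chosen only for illustration, and `romanize` and `learn_bpe` are hypothetical helper names, with `learn_bpe` being a toy version of the usual merge-learning loop rather than the authors' pipeline.

```python
from collections import Counter

# ASSUMPTION: a partial Buckwalter-style table, used for illustration only.
# The abstract does not say which romanization scheme the paper uses.
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ث": "v", "ج": "j", "ح": "H", "خ": "x",
    "د": "d", "ذ": "*", "ر": "r", "ز": "z", "س": "s", "ش": "$", "ص": "S",
    "ض": "D", "ط": "T", "ظ": "Z", "ع": "E", "غ": "g", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
    "ة": "p", "ى": "Y", "ء": "'",
}


def romanize(text):
    """Map each Arabic letter to its Latin symbol; leave other characters as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Hypothetical helper for illustration; ties are broken by first-seen order.
    """
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    # U+0643 U+062A U+0627 U+0628 spells the Arabic word for "book".
    print(romanize("\u0643\u062a\u0627\u0628"))  # ktAb
    print(learn_bpe(["ktAb", "ktb", "kt"], 1))
```

In this sketch the merges are learned over the romanized word forms rather than the original script, which is one plausible reading of "Romanized Arabic as subword unit"; the paper itself should be consulted for the actual segmentation scenarios.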

Cite


Aqlan, F., Fan, X., Alqwbani, A., & Al-Mansoub, A. (2019). Arabic-Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation. IEEE Access, 7, 133122–133135. https://doi.org/10.1109/ACCESS.2019.2941161
