A diverse data augmentation strategy for low-resource neural machine translation


Abstract

One important factor affecting the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is insufficient, which results in poor translation quality. In this paper, we propose a diverse data augmentation method that does not require extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on both the source and target sides. To generate diverse data, a restricted sampling strategy is employed at each decoding step. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In experiments, the proposed approach achieved an improvement of 1.96 BLEU points on the IWSLT2014 German-English translation task, which was used to simulate a low-resource setting. Our approach also consistently and substantially obtained improvements of 1.0 to 2.0 BLEU points on three other low-resource translation tasks: English-Turkish, Nepali-English, and Sinhala-English.
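The abstract does not spell out the restricted sampling strategy, but a common way to realize it is to sample each output token from only the top-k candidates of the decoder's distribution rather than greedily taking the argmax, which yields diverse yet plausible pseudo-parallel sentences. The sketch below illustrates this idea; the function name `restricted_sample` and the choice of top-k truncation are assumptions for illustration, not the paper's exact method.

```python
import math
import random

def restricted_sample(logprobs, k=5, rng=random):
    """Sample a token index from only the top-k candidates of a
    decoder's log-probability distribution (a hypothetical stand-in
    for the paper's restricted sampling at each decoding step)."""
    # Keep the indices of the k highest-scoring candidates.
    top = sorted(range(len(logprobs)),
                 key=lambda i: logprobs[i], reverse=True)[:k]
    # Convert their log-probabilities back to weights and sample,
    # so diversity is restricted to plausible tokens only.
    weights = [math.exp(logprobs[i]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Example: with k=2, only the two most likely tokens can ever be drawn.
rng = random.Random(0)
logprobs = [-0.1, -2.0, -3.0, -0.5, -4.0]
samples = {restricted_sample(logprobs, k=2, rng=rng) for _ in range(100)}
```

Repeating such sampled decoding over the training corpus produces multiple distinct synthetic translations per sentence, which are then filtered and merged with the original bitext.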

Citation (APA)
Li, Y., Li, X., Yang, Y., & Dong, R. (2020). A diverse data augmentation strategy for low-resource neural machine translation. Information (Switzerland), 11(5). https://doi.org/10.3390/INFO11050255
