One important factor affecting the performance of neural machine translation is the amount of available parallel data. For low-resource languages, parallel data is scarce, which results in poor translation quality. In this paper, we propose a diverse data augmentation method that requires no extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on both the source and target sides. To generate diverse data, a restricted sampling strategy is employed at each decoding step. Finally, we filter the synthetic parallel corpus and merge it with the original data to train the final model. In experiments, the proposed approach achieved a 1.96 BLEU point improvement on the IWSLT2014 German-English translation task, which was used to simulate a low-resource setting. Our approach also consistently and substantially obtained improvements of 1.0 to 2.0 BLEU points on three other low-resource translation tasks: English-Turkish, Nepali-English, and Sinhala-English.
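As a rough illustration of the restricted sampling idea mentioned above, the sketch below samples each output token only from the k most probable candidates instead of always taking the argmax, which is one common way to inject diversity into decoded pseudo-translations. This is a minimal assumption-laden sketch, not the paper's exact strategy: the function name `restricted_sample` and the cutoff `k` are hypothetical, and the paper's actual restriction may differ.

```python
import math
import random

def restricted_sample(logits, k=3, rng=None):
    """Sample a token index from only the top-k most probable candidates.

    Hypothetical sketch of a restricted sampling step: limiting choices to
    high-probability tokens keeps generated pseudo-translations plausible
    while still adding diversity compared to greedy (argmax) decoding.
    """
    rng = rng or random.Random(0)
    # Softmax over the logits (shift by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep only the k most probable indices and renormalize their mass.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    # Draw from the truncated, renormalized distribution.
    r = rng.random() * mass
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]

# Every sampled index stays inside the top-3 candidate set.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = {restricted_sample(logits, k=3, rng=random.Random(s)) for s in range(50)}
```

At each decoding step the translation model's output distribution would play the role of `logits`; running the sampler several times per source sentence yields the diverse pseudo-parallel pairs that are then filtered and merged with the original corpus.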
Citation: Li, Y., Li, X., Yang, Y., & Dong, R. (2020). A diverse data augmentation strategy for low-resource neural machine translation. Information (Switzerland), 11(5). https://doi.org/10.3390/INFO11050255