One important factor affecting the performance of neural machine translation is the amount of available parallel data. For low-resource languages, parallel data is scarce, which results in poor translation quality. In this paper, we propose a diverse data augmentation method that requires no extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on both the source and target sides. To generate diverse data, a restricted sampling strategy is employed at each decoding step. Finally, we filter the synthetic parallel corpus and merge it with the original data to train the final model. In experiments, the proposed approach achieved a 1.96 BLEU point improvement on the IWSLT2014 German-English translation task, which was used to simulate a low-resource setting. Our approach also consistently and substantially obtained improvements of 1.0 to 2.0 BLEU points on three other low-resource translation tasks: English-Turkish, Nepali-English, and Sinhala-English.
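As a rough illustration of the restricted sampling idea mentioned above, the sketch below samples each output token only from the k most probable candidates instead of always taking the argmax, which is one common way to inject diversity into decoded pseudo-translations. This is a minimal assumption-laden sketch, not the paper's exact strategy: the function name `restricted_sample` and the cutoff `k` are hypothetical, and the paper's actual restriction may differ.

```python
import math
import random

def restricted_sample(logits, k=3, rng=None):
    """Sample a token index from only the top-k most probable candidates.

    Hypothetical sketch of a restricted sampling step: limiting choices to
    high-probability tokens keeps generated pseudo-translations plausible
    while still adding diversity compared to greedy (argmax) decoding.
    """
    rng = rng or random.Random(0)
    # Softmax over the logits (shift by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep only the k most probable indices and renormalize their mass.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    # Draw from the truncated, renormalized distribution.
    r = rng.random() * mass
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]

# Every sampled index stays inside the top-3 candidate set.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = {restricted_sample(logits, k=3, rng=random.Random(s)) for s in range(50)}
```

At each decoding step the translation model's output distribution would play the role of `logits`; running the sampler several times per source sentence yields the diverse pseudo-parallel pairs that are then filtered and merged with the original corpus.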
Citation: Li, Y., Li, X., Yang, Y., & Dong, R. (2020). A diverse data augmentation strategy for low-resource neural machine translation. Information (Switzerland), 11(5). https://doi.org/10.3390/INFO11050255