Copied monolingual data improves low-resource neural machine translation


Abstract

We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus, and the NMT system is trained as usual, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish–English and Romanian–English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
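The data-preparation step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the sentences and the simple concatenation strategy are assumptions for demonstration.

```python
# Sketch of the "copied corpus" construction: each target-language
# monolingual sentence is paired with itself as the pseudo-source,
# then mixed with the true parallel data with no distinguishing metadata.

def make_copied_corpus(monolingual_targets):
    """Pair each target sentence with an identical pseudo-source sentence."""
    return [(sent, sent) for sent in monolingual_targets]

def mix_corpora(parallel_pairs, copied_pairs):
    """Concatenate parallel and copied data into one training set."""
    return parallel_pairs + copied_pairs

# Hypothetical toy data (Turkish-English parallel + English monolingual).
parallel = [("bir elma", "an apple"), ("iki kedi", "two cats")]
mono_en = ["Ankara is the capital.", "The model copies named entities."]

training_data = mix_corpora(parallel, make_copied_corpus(mono_en))
# The copied pairs have source == target, e.g.
# ("Ankara is the capital.", "Ankara is the capital.")
```

In practice the copied corpus and the parallel corpus would be shuffled together before training, so the model sees both tasks interleaved.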

Cite

Currey, A., Barone, A. V. M., & Heafield, K. (2017). Copied monolingual data improves low-resource neural machine translation. In WMT 2017 - 2nd Conference on Machine Translation, Proceedings (pp. 148–156). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4715
