Generalized data augmentation for low-resource translation

Abstract

Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language (HRL). Specifically, we experiment with a two-step pivoting method to convert high-resource data to the LRL, making use of available resources to better approximate the true data distribution of the LRL. First, we inject LRL words into HRL sentences through an induced bilingual dictionary. Second, we further edit these modified sentences using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by 1.5 to 8 BLEU points compared to supervised back-translation baselines.
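The first pivoting step can be illustrated with a minimal sketch, assuming whitespace tokenization and a one-to-one word dictionary. The function name `inject_lrl_words` and the toy dictionary entries are hypothetical and are not taken from the paper, which induces its dictionary automatically and may handle tokenization and ambiguity differently.

```python
def inject_lrl_words(hrl_sentence, hrl_to_lrl):
    """Replace each HRL token that has a dictionary entry with its LRL word.

    Tokens without an entry are left unchanged, so the output is a mixed
    HRL/LRL sentence (a rough approximation of LRL text).
    """
    tokens = hrl_sentence.split()  # assumption: simple whitespace tokenization
    return " ".join(hrl_to_lrl.get(token, token) for token in tokens)


# Hypothetical toy dictionary with invented entries, for illustration only.
toy_dictionary = {"casa": "kaza", "grande": "granda"}

print(inject_lrl_words("la casa es grande", toy_dictionary))
# prints "la kaza es granda"
```

Under these assumptions the output still contains HRL words wherever the dictionary has no entry, which is the kind of residue the second step described above (editing with an unsupervised translation model) is meant to reduce.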

Cite

Xia, M., Kong, X., Anastasopoulos, A., & Neubig, G. (2020). Generalized data augmentation for low-resource translation. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 5786–5796). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-1579
