Abstract
Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for such models by designing a strategy for converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an mBERT-based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
Cite
Gupta, A., Vavre, A., & Sarawagi, S. (2021). Training Data Augmentation for Code-Mixed Translation. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 5760–5766). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.naacl-main.459