Training Data Augmentation for Code-Mixed Translation

Abstract

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications such as web search and targeted advertising. We address the scarcity of parallel data for training such models by designing a strategy for converting existing non-code-mixed parallel data sources into code-mixed parallel data. We present an mBERT-based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
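
To make the conversion idea concrete, the following is a minimal sketch of how per-token labels could turn a monolingual source sentence into a code-mixed one while leaving the English target untouched. The abstract says only that the labeler is ternary; the three labels used here (KEEP, SWITCH, DROP), the word-level lexicon, and the function names code_mix and augment_pair are assumptions made for illustration, not the authors' procedure. In the paper the labels come from a trained mBERT-based model, whereas here they are supplied by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical label set: the abstract states only that the labeler is ternary.
KEEP, SWITCH, DROP = 0, 1, 2


def code_mix(src_tokens: List[str], labels: List[int],
             lexicon: Dict[str, str]) -> List[str]:
    """Apply one label per source token to build a code-mixed token list."""
    if len(src_tokens) != len(labels):
        raise ValueError("expected exactly one label per source token")
    mixed = []
    for token, label in zip(src_tokens, labels):
        if label == KEEP:
            mixed.append(token)  # leave the source-language word as is
        elif label == SWITCH:
            # Replace with an English word; fall back to the original token
            # when the (assumed) lexicon has no entry for it.
            mixed.append(lexicon.get(token, token))
        elif label == DROP:
            continue  # omit the token from the code-mixed sentence
        else:
            raise ValueError(f"unknown label: {label}")
    return mixed


def augment_pair(pair: Tuple[str, str], labels: List[int],
                 lexicon: Dict[str, str]) -> Tuple[str, str]:
    """Turn a (source, English) pair into a (code-mixed, English) pair."""
    source, target = pair
    return " ".join(code_mix(source.split(), labels, lexicon)), target


if __name__ == "__main__":
    pair = ("mujhe yeh kitaab pasand hai", "I like this book")
    labels = [KEEP, KEEP, SWITCH, KEEP, KEEP]
    print(augment_pair(pair, labels, {"kitaab": "book"}))
```

Applied to every pair in an existing parallel corpus, a routine of this shape would yield synthetic code-mixed training pairs, which is the role the abstract assigns to the augmentation strategy.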

Cite

Gupta, A., Vavre, A., & Sarawagi, S. (2021). Training Data Augmentation for Code-Mixed Translation. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 5760–5766). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.naacl-main.459
