Training Data Augmentation for Code-Mixed Translation

Abhirut Gupta; Aditya Vavre; Sunita Sarawagi

Conference Proceedings

NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (2021) 5760-5766

DOI: 10.18653/v1/2021.naacl-main.459

25 Citations

67 Readers

Abstract

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for such models by designing a strategy for converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an mBERT-based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
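
As a rough illustration of how per-token labels from a ternary sequence labeler could turn a non-code-mixed sentence into a code-mixed one, consider the minimal sketch below. The abstract does not define the three labels or the substitution step, so everything here is an assumption made for illustration: the label names (KEEP, SWITCH, DROP), the function code_mix, and the toy lexicon that stands in for the mBERT-based component are all hypothetical and are not taken from the paper.

```python
from enum import Enum


class Label(Enum):
    # Hypothetical three-way label set; the abstract does not name the classes.
    KEEP = 0    # keep the source-language token unchanged
    SWITCH = 1  # replace the token with an English counterpart
    DROP = 2    # omit the token from the code-mixed output


def code_mix(tokens, labels, lexicon):
    """Apply one label per token to build a code-mixed token sequence.

    `lexicon` is a toy dictionary standing in for whatever model
    proposes the English replacement (an assumption, not the paper's method).
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    mixed = []
    for token, label in zip(tokens, labels):
        if label is Label.KEEP:
            mixed.append(token)
        elif label is Label.SWITCH:
            # Fall back to the original token when no replacement is known.
            mixed.append(lexicon.get(token, token))
        # Label.DROP: append nothing.
    return mixed


# Romanized Hindi source sentence with its existing English translation.
hindi = ["mujhe", "yeh", "kitaab", "pasand", "hai"]
english = "I like this book"
labels = [Label.KEEP, Label.KEEP, Label.SWITCH, Label.KEEP, Label.KEEP]
lexicon = {"kitaab": "book"}

mixed = code_mix(hindi, labels, lexicon)
# The code-mixed source is paired with the unchanged English target.
augmented_pair = (" ".join(mixed), english)
print(augmented_pair)
```

Under these assumptions, each existing Hindi-English pair yields a new pair whose source side is code-mixed while the English target stays untouched, which is the general shape of augmentation the abstract describes.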

Cite

APA

Gupta, A., Vavre, A., & Sarawagi, S. (2021). Training Data Augmentation for Code-Mixed Translation. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 5760–5766). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.naacl-main.459
