CHIA: CHoosing Instances to Annotate for Machine Translation

Rajat Bhatnagar; Ananya Ganesh; Katharina Kann

Conference Proceedings, Open Access

Findings of the Association for Computational Linguistics: EMNLP 2022 (2022) 7328-7344

DOI: 10.18653/v1/2022.findings-emnlp.540

Abstract

Neural machine translation (MT) systems have been shown to perform poorly on low-resource language pairs, for which large-scale parallel data is unavailable. Making the data annotation process faster and cheaper is therefore important to ensure equitable access to MT systems. To make optimal use of a limited annotation budget, we present CHIA (choosing instances to annotate), a method for selecting instances to annotate for machine translation. Using an existing multi-way parallel dataset of high-resource languages, we first identify instances, based on model training dynamics, that are most informative for training MT models for high-resource languages. We find that there are cross-lingual commonalities in instances that are useful for MT model training, which we use to identify instances that will be useful to train models on a new target language. Evaluating on 20 languages from two corpora, we show that training on instances selected using our method provides an average performance improvement of 1.59 BLEU over training on randomly selected instances of the same size.
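The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the training-dynamics scores are computed or how they are combined across languages, so everything below is an assumption made for illustration: the function name `select_instances`, the per-language score lists, and the choice to average scores across high-resource languages are all hypothetical and are not the authors' published procedure.

```python
from statistics import mean


def select_instances(scores_by_language, budget):
    """Pick the indices of the `budget` instances with the highest mean score.

    scores_by_language: dict mapping a high-resource language code to a list
    of per-instance informativeness scores. Lists are index-aligned, as in a
    multi-way parallel dataset. How the scores are obtained from training
    dynamics is NOT specified here (assumption: higher means more useful).
    """
    if not scores_by_language:
        raise ValueError("need scores for at least one language")
    lengths = {len(scores) for scores in scores_by_language.values()}
    if len(lengths) != 1:
        raise ValueError("score lists must be index-aligned and equally long")
    n = lengths.pop()

    # Assumption: average each instance's score across languages to capture
    # the cross-lingual commonality the abstract refers to.
    averaged = [
        mean(scores[i] for scores in scores_by_language.values())
        for i in range(n)
    ]

    # Sort by descending average score; break ties by original index.
    ranked = sorted(range(n), key=lambda i: (-averaged[i], i))
    return ranked[:budget]


# Toy usage with made-up scores for two high-resource languages.
toy_scores = {
    "de": [0.1, 0.9, 0.5, 0.3],
    "fr": [0.2, 0.8, 0.6, 0.1],
}
selected = select_instances(toy_scores, budget=2)
print(selected)  # [1, 2]
```

Under these assumptions, the selected indices would then be the instances sent for annotation in the new target language, with a random sample of the same size serving as the baseline.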

Cite

Bhatnagar, R., Ganesh, A., & Kann, K. (2022). CHIA: CHoosing Instances to Annotate for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 7328–7344). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-emnlp.540
