Discriminative corpus weight estimation for machine translation

Spyros Matsoukas; Antti Veikko I. Rosti; Bing Zhang

Conference Proceedings, Open Access

EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 (2009) 708-717

DOI: 10.3115/1699571.1699605

Abstract

Current statistical machine translation (SMT) systems are trained on sentence-aligned and word-aligned parallel text collected from various sources. Translation model parameters are estimated from the word alignments, and the quality of the translations on a given test set depends on the parameter estimates. There are at least two factors affecting the parameter estimation: domain match and training data quality. This paper describes a novel approach for automatically detecting and down-weighing certain parts of the training corpus by assigning a weight to each sentence in the training bitext so as to optimize a discriminative objective function on a designated tuning set. This way, the proposed method can limit the negative effects of low-quality training data, and can adapt the translation model to the domain of interest. It is shown that such discriminative corpus weights can provide significant improvements in Arabic-English translation on various conditions, using a state-of-the-art SMT system. © 2009 ACL and AFNLP.
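
As a rough illustration of how per-sentence weights could enter translation model estimation, the sketch below computes phrase translation probabilities from weighted counts, so that a down-weighted sentence contributes less to the estimate. This is only an assumption about the general mechanism, not the paper's actual formulation: the function name `weighted_phrase_probs`, the toy phrase pairs, and the weight values are all hypothetical, and the step that tunes the weights against a discriminative objective on a tuning set is not shown.

```python
from collections import defaultdict


def weighted_phrase_probs(phrase_pairs_per_sentence, sentence_weights):
    """Estimate p(target | source) from sentence-weighted phrase-pair counts.

    Hypothetical helper: each training sentence contributes its extracted
    phrase pairs with a count equal to that sentence's weight.
    """
    pair_counts = defaultdict(float)
    source_counts = defaultdict(float)
    for pairs, weight in zip(phrase_pairs_per_sentence, sentence_weights):
        for source, target in pairs:
            pair_counts[(source, target)] += weight
            source_counts[source] += weight
    # Relative frequency over weighted counts; skip sources whose total
    # weight is zero to avoid division by zero.
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
        if source_counts[source] > 0
    }


# Toy data (made up): phrase pairs extracted from three training sentences.
extracted = [
    [("kitab", "book")],
    [("kitab", "book")],
    [("kitab", "letter")],  # imagine this sentence is noisy or out of domain
]

uniform = weighted_phrase_probs(extracted, [1.0, 1.0, 1.0])
down_weighted = weighted_phrase_probs(extracted, [1.0, 1.0, 0.2])
```

With uniform weights the third sentence's translation receives one third of the probability mass for its source phrase; with its weight reduced to 0.2, its share drops to 0.2 / 2.2, which is the kind of effect the abstract attributes to down-weighting low-quality or mismatched data.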

Cite

APA

Matsoukas, S., Rosti, A. V. I., & Zhang, B. (2009). Discriminative corpus weight estimation for machine translation. In EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 (pp. 708–717). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1699571.1699605
