Accurate evaluation of segment-level machine translation metrics

Abstract

Evaluation of segment-level machine translation metrics is currently hampered by: (1) low inter-annotator agreement levels in human assessments; (2) lack of an effective mechanism for evaluation of translations of equal quality; and (3) lack of methods of significance testing improvements over a baseline. In this paper, we provide solutions to each of these challenges and outline a new human evaluation methodology aimed specifically at assessment of segment-level metrics. We replicate the human evaluation component of WMT-13 and reveal that the current state-of-the-art performance of segment-level metrics is better than previously believed. Three segment-level metrics (METEOR, NLEPOR, and SENTBLEUMOSES) are found to correlate with human assessment at a level not significantly outperformed by any other metric, both in the individual language pair assessment for Spanish-to-English and in the aggregated set of 9 language pairs.
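To make the evaluation setup concrete, the sketch below illustrates the general idea of comparing two segment-level metrics by their Pearson correlation with human assessment scores, with a paired bootstrap over segments as one possible significance test. This is a hedged illustration, not the authors' exact procedure: the data here is synthetic, and the variable names (`human`, `metric_a`, `metric_b`) are hypothetical stand-ins for per-segment human scores and metric scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic stand-in data: one human assessment score and two
# metric scores per translated segment (hypothetical, for illustration).
n = 500
human = rng.normal(size=n)
metric_a = human + rng.normal(scale=0.8, size=n)  # candidate metric
metric_b = human + rng.normal(scale=1.2, size=n)  # baseline metric

r_a = pearson(human, metric_a)
r_b = pearson(human, metric_b)

# Paired bootstrap over segments: resample segment indices and count
# how often the candidate's correlation does not exceed the baseline's.
B = 1000
wins = 0
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    if pearson(human[idx], metric_a[idx]) > pearson(human[idx], metric_b[idx]):
        wins += 1
p_value = 1.0 - wins / B  # small value -> candidate reliably beats baseline
```

Because both metrics are scored against the same human judgments, the bootstrap must resample segment indices jointly (a paired test); resampling each metric's scores independently would ignore their shared dependence on the human scores.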

Citation

Graham, Y., Mathur, N., & Baldwin, T. (2015). Accurate evaluation of segment-level machine translation metrics. In NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 1183–1191). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/n15-1124
