Evaluation of segment-level machine translation metrics is currently hampered by: (1) low inter-annotator agreement levels in human assessments; (2) the lack of an effective mechanism for evaluating translations of equal quality; and (3) the lack of methods for significance testing of improvements over a baseline. In this paper, we provide solutions to each of these challenges and outline a new human evaluation methodology aimed specifically at the assessment of segment-level metrics. We replicate the human evaluation component of WMT-13 and reveal that the current state-of-the-art performance of segment-level metrics is better than previously believed. Three segment-level metrics (METEOR, NLEPOR, and SENTBLEU-MOSES) are found to correlate with human assessment at a level not significantly outperformed by any other metric, both for the individual Spanish-to-English language pair and for the aggregated set of 9 language pairs.
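The third point, significance testing of correlation improvements over a baseline metric, lends itself to a concrete illustration. Below is a minimal sketch, assuming the setup typical of this line of work: Pearson correlation of each metric's segment-level scores against human assessment scores, compared via the Williams test for dependent correlations. The abstract itself does not name a specific test, and the scores below are synthetic placeholders, not data from the paper.

import numpy as np
from scipy import stats

def williams_test(r12, r13, r23, n):
    # One-tailed Williams test: is r12 (metric A vs. human) significantly
    # greater than r13 (metric B vs. human), given r23 (A vs. B correlation)
    # and n segments? Returns the one-sided p-value with n - 3 degrees of freedom.
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    denom = 2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    t = (r12 - r13) * np.sqrt((n - 1) * (1 + r23) / denom)
    return stats.t.sf(t, df=n - 3)

# Hypothetical per-segment scores: two metrics scoring the same 500 segments,
# alongside human assessment scores for those segments.
rng = np.random.default_rng(0)
human = rng.normal(size=500)
metric_a = human + rng.normal(scale=0.8, size=500)
metric_b = human + rng.normal(scale=1.0, size=500)

r12 = stats.pearsonr(metric_a, human)[0]      # metric A vs. human
r13 = stats.pearsonr(metric_b, human)[0]      # metric B vs. human
r23 = stats.pearsonr(metric_a, metric_b)[0]   # metric A vs. metric B

p = williams_test(r12, r13, r23, n=len(human))
print(f"r(A,human)={r12:.3f}, r(B,human)={r13:.3f}, p={p:.4f}")

Because the two metrics score the same segments, their correlations with human assessment are themselves correlated; the Williams test accounts for this dependence, which a naive comparison of two independent correlations would not.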
Graham, Y., Mathur, N., & Baldwin, T. (2015). Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015) (pp. 1183–1191). Association for Computational Linguistics. https://doi.org/10.3115/v1/n15-1124