Results of the WMT16 Metrics Shared Task


Abstract

This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT16 Shared Translation Task. We collected scores of 16 metrics from 9 research groups. In addition, we computed scores of 9 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT16 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence). This year there are several additions to the setup: a large number of language pairs (18 in total), datasets from different domains (news, IT and medical), and different kinds of judgments: relative ranking (RR), direct assessment (DA) and HUME manual semantic judgments. Finally, the generation of a large number of hybrid systems was trialed to provide more conclusive system-level metric rankings.
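The two evaluation views described above can be sketched in a few lines of Python. This is not the official WMT16 evaluation code; it is a minimal illustration under simplifying assumptions: system-level correlation is shown as a plain Pearson correlation between a metric's per-system scores and human per-system scores, and segment-level agreement as a Kendall-tau-like statistic over segment pairs where humans judged one translation better than the other. All scores in the usage example are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists
    (e.g. one metric score and one human score per MT system)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def kendall_like(pairs):
    """Kendall-tau-like agreement for segment-level evaluation.
    Each pair is (metric score of the human-preferred translation,
    metric score of the dispreferred one); ties are discarded."""
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical scores for three MT systems (not real WMT16 numbers).
metric_scores = [0.31, 0.28, 0.35]
human_scores = [0.62, 0.55, 0.70]
print(round(pearson(metric_scores, human_scores), 3))

# Four segment pairs: humans preferred the first translation in each.
segment_pairs = [(0.9, 0.5), (0.3, 0.4), (0.8, 0.2), (0.7, 0.1)]
print(kendall_like(segment_pairs))  # 3 concordant, 1 discordant -> 0.5
```

A metric with perfect system-level correlation scores 1.0 under `pearson`; a metric that agrees with every human pairwise preference scores 1.0 under `kendall_like`, and one that always disagrees scores -1.0.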

Citation (APA)

Bojar, O., Graham, Y., Kamran, A., & Stanojević, M. (2016). Results of the WMT16 Metrics Shared Task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (pp. 199–231). Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2302
