Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges


Abstract

This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less “metrics” and constitute submissions to the joint task with the WMT19 Quality Estimation Task, “QE as a Metric”. In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated at the system level (how well a given metric correlates with the WMT19 official manual ranking) and at the segment level (how well the metric correlates with human judgements of segment quality). This year, we use direct assessment (DA) as our only form of manual evaluation.
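The system-level evaluation described in the abstract boils down to computing a Pearson correlation between a metric's scores for the competing systems and the human DA scores. A minimal sketch, with entirely hypothetical metric and DA values (the real scores are in the paper, not reproduced here):

```python
# Illustrative sketch only: Pearson correlation between hypothetical
# automatic-metric scores and hypothetical human DA scores for five
# MT systems, as in system-level metric evaluation.
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system scores (not from the paper)
metric_scores = [0.31, 0.42, 0.38, 0.55, 0.47]   # e.g. a chrF-like metric
human_da      = [62.1, 70.4, 66.8, 78.2, 73.5]   # direct assessment averages

r = pearson(metric_scores, human_da)
print(f"system-level Pearson r = {r:.3f}")
```

A metric whose r is close to 1 ranks systems much like the human judges do; segment-level evaluation applies the same idea per translated sentence, which is far noisier and is one of the "big challenges" named in the title.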

Citation (APA)

Ma, Q., Wei, J. T. Z., Bojar, O., & Graham, Y. (2019). Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In WMT 2019 - 4th Conference on Machine Translation, Proceedings of the Conference (Vol. 2, pp. 62–90). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-5302
