Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics

Abstract

Many modern machine translation evaluation metrics, such as BERTScore, BLEURT, COMET, MonoTransQuest, and XMoverScore, are based on black-box language models. Hence, it is difficult to explain why these metrics return certain scores. The Eval4NLP 2021 shared task tackles this challenge by searching for methods that can extract feature importance scores that correlate well with human word-level error annotations. In this paper we show that unsupervised metrics based on token matching can intrinsically provide such scores. The submitted system interprets the similarities of the contextualized word embeddings that are used to compute (X)BERTScore as word-level importance scores. We make our code available.
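
To illustrate the idea in the abstract, here is a minimal sketch of how greedy token matching can yield both a sentence-level score and word-level scores. It is not the authors' implementation: the toy 2-dimensional vectors stand in for contextualized embeddings from a multilingual encoder, and the function name `token_matching_scores` and the mapping from similarity to error score (`1 - max similarity`) are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def token_matching_scores(src_vecs, hyp_vecs):
    """Greedy token matching in the style of BERTScore (hypothetical helper).

    Returns a sentence-level F1 and one error score per hypothesis token.
    """
    # Pairwise similarities: one row per hypothesis token, one column per source token.
    sim = [[cosine(h, s) for s in src_vecs] for h in hyp_vecs]

    # Each hypothesis token is matched to its most similar source token, and vice versa.
    hyp_best = [max(row) for row in sim]
    src_best = [max(row[j] for row in sim) for j in range(len(src_vecs))]

    precision = sum(hyp_best) / len(hyp_best)
    recall = sum(src_best) / len(src_best)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0

    # Assumption: a poorly matched token is more likely to be a translation error.
    word_scores = [1.0 - s for s in hyp_best]
    return f1, word_scores


# Toy embeddings (assumed values, not real model output).
src = [[1.0, 0.0], [0.0, 1.0]]
hyp = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

sentence_score, word_scores = token_matching_scores(src, hyp)
print(sentence_score, word_scores)
```

In this toy case the third hypothesis token has no good match in the source, so it receives the highest error score, while the sentence-level score is the F1 of the averaged best matches.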

Cite

Leiter, C. W. (2021). Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics. In Eval4NLP 2021 - Evaluation and Comparison of NLP Systems, Proceedings of the 2nd Workshop (pp. 157–164). Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-056-4_016
