On the Effectiveness of Automated Metrics for Text Generation Systems


Abstract

A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
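The abstract mentions determining how many test samples are needed to reliably distinguish a set of systems. As a rough illustration of that kind of calculation (not the paper's actual method), the sketch below uses a standard two-sided power analysis under a normal approximation: given an expected gap `delta` in mean metric score between two systems and a per-sample score standard deviation `sigma` (both hypothetical inputs), it estimates the samples needed per system.

```python
import math
from statistics import NormalDist

def required_sample_size(delta: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Estimate samples per system to detect a mean-score gap `delta`
    between two systems whose per-sample metric scores have standard
    deviation `sigma`, via a two-sided z-test (normal approximation).

    A generic power-analysis sketch, not the method proposed in the paper.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # significance threshold (two-sided)
    z_beta = nd.inv_cdf(power)           # power requirement
    # Classic two-sample formula: n = 2 * ((z_alpha + z_beta) * sigma / delta)^2
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Example: a 0.02 gap in mean metric score, score std-dev 0.1
print(required_sample_size(delta=0.02, sigma=0.1))
```

Smaller metric gaps or noisier metrics drive the required test-set size up quadratically, which is one intuition behind the paper's concern with insufficiently sized test sets.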

Citation (APA)

von Däniken, P., Deriu, J., Tuggener, D., & Cieliebak, M. (2022). On the Effectiveness of Automated Metrics for Text Generation Systems. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 1503–1522). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-emnlp.108
