Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Abstract

A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally overconfident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that using this combination, we require only about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.
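The abstract describes combining error-prone automated-metric preference ratings with human ratings, but does not spell out the statistical model. The sketch below is only an illustration of the general idea, not the paper's method: it assumes each human vote picks the truly better system with probability `human_acc` and each automated-metric vote with probability `metric_acc` (both hypothetical parameters), with votes independent given the true preference, and computes the posterior probability that system A is better under a uniform prior.

```python
import math


def posterior_a_better(human_a, human_b, metric_a, metric_b,
                       human_acc=0.9, metric_acc=0.7):
    """Posterior probability that system A is truly preferred over system B.

    Illustrative sketch only (not the paper's model): each human vote selects
    the truly better system with probability `human_acc`, each automated-metric
    vote with probability `metric_acc`, votes are independent given the true
    preference, and the prior over the two hypotheses is uniform.
    """
    def log_lik(acc, votes_for, votes_against):
        # Log-likelihood of the observed votes for one hypothesis.
        return votes_for * math.log(acc) + votes_against * math.log(1.0 - acc)

    # Log-likelihood of all votes under "A is truly better".
    ll_a = log_lik(human_acc, human_a, human_b) + log_lik(metric_acc, metric_a, metric_b)
    # Log-likelihood of all votes under "B is truly better".
    ll_b = log_lik(human_acc, human_b, human_a) + log_lik(metric_acc, metric_b, metric_a)

    # Normalize in log space for numerical stability.
    m = max(ll_a, ll_b)
    return math.exp(ll_a - m) / (math.exp(ll_a - m) + math.exp(ll_b - m))


if __name__ == "__main__":
    # 20 human votes (14 for A) combined with 200 cheap metric votes (120 for A):
    # the metric votes sharpen the conclusion without extra human annotation.
    print(round(posterior_a_better(14, 6, 120, 80), 3))
```

In this toy setup, a small pool of human annotations can be reinforced by many cheap but noisy metric judgments, which is the intuition behind reducing the required human annotation budget.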

Cite

APA

Deriu, J., von Däniken, P., Tuggener, D., & Cieliebak, M. (2023). Correction of Errors in Preference Ratings from Automated Metrics for Text Generation. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 6456–6474). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.404
