MUTT: Metric Unit TesTing for Language Generation Tasks


Abstract

Precise evaluation metrics are important for assessing progress in high-level language generation tasks such as machine translation or image captioning. Historically, these metrics have been evaluated using correlation with human judgment. However, human-derived scores are often alarmingly inconsistent and are also limited in their ability to identify precise areas of weakness. In this paper, we perform a case study for metric evaluation by measuring the effect that systematic sentence transformations (e.g. active to passive voice) have on the automatic metric scores. These sentence "corruptions" serve as unit tests for precisely measuring the strengths and weaknesses of a given metric. We find that not only are human annotations heavily inconsistent in this study, but that the Metric Unit TesT analysis is able to capture precise shortcomings of particular metrics (e.g. comparing passive and active sentences) better than a simple correlation with human judgment can.
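The abstract describes measuring how a metric's score responds when a candidate sentence is systematically transformed. The sketch below shows one simplified way to run such a unit test, assuming NLTK's sentence-level BLEU as the metric under test; the helper names, toy triple, and pass criterion are illustrative assumptions, not the paper's exact protocol.

```python
# A simplified sketch of the "metric unit test" idea from the abstract: score a
# candidate sentence and a systematically transformed variant against the same
# references, and measure how often the metric's score drops.  NLTK's
# sentence-level BLEU stands in for "a given metric"; the helper names
# (score_drop_rate, bleu_metric) and the toy triple are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_metric(references, candidate):
    """Sentence-level BLEU with smoothing over whitespace-tokenized strings."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([ref.split() for ref in references],
                         candidate.split(),
                         smoothing_function=smooth)


def score_drop_rate(triples, metric):
    """Fraction of (references, candidate, transformed) triples where the
    metric scores the transformed sentence lower than the original."""
    drops = sum(metric(refs, cand) > metric(refs, transformed)
                for refs, cand, transformed in triples)
    return drops / len(triples)


# Toy unit test: an active-to-passive transformation of the candidate.
triples = [
    (["a man is riding a horse"],
     "a man rides a horse",           # original candidate
     "a horse is ridden by a man"),   # passive-voice transformation
]
print(score_drop_rate(triples, bleu_metric))
```

Whether a score drop is the desired behavior depends on the transformation: a meaning-preserving change such as active-to-passive that causes a large drop exposes exactly the kind of metric shortcoming the abstract mentions.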

Cite (APA)

Boag, W., Campos, R., Saenko, K., & Rumshisky, A. (2016). MUTT: Metric unit TesTing for language generation tasks. In 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers (Vol. 4, pp. 1935–1943). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p16-1182
