How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation

Chia Wei Liu; Ryan Lowe; Iulian V. Serban; Michael Noseworthy; Laurent Charlin; Joelle Pineau

Conference Proceedings (Open Access)

EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings (2016) 2122-2132

DOI: 10.18653/v1/d16-1230

775 Citations

793 Readers

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
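The comparison the abstract describes, scoring each generated response against a single target response and then checking how well those scores track human judgements, can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, clipped unigram precision is only a simplified stand-in for the machine-translation metrics mentioned, and the example responses and ratings are invented toy values, not data from the paper.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (a simplified word-overlap score, no brevity penalty)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / len(cand_tokens)


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy (generated response, target response, human rating) triples, invented for illustration.
    examples = [
        ("try restarting the network service", "restart the network service", 4),
        ("i do not know", "run sudo apt-get update first", 1),
        ("have you checked the logs", "check the logs in /var/log", 3),
        ("that sounds great", "restart the network service", 1),
    ]
    metric_scores = [unigram_precision(gen, ref) for gen, ref, _ in examples]
    human_scores = [rating for _, _, rating in examples]
    print("metric scores:", metric_scores)
    print("Spearman correlation with human ratings:", spearman(metric_scores, human_scores))
```

A correlation near zero on a realistic sample would mean the overlap score carries little information about how humans rate the responses, which is the kind of outcome the abstract reports for existing metrics.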

Citation (APA)

Liu, C. W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 2122–2132). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/d16-1230
