Towards holistic and automatic evaluation of open-domain dialogue generation

Citations: 64 · Mendeley readers: 145

Abstract

Open-domain dialogue generation has gained increasing attention in Natural Language Processing, and its evaluation requires a holistic approach. Human ratings are deemed the gold standard, but human evaluation is inefficient and costly, so an automated substitute is highly desirable. In this paper, we propose holistic evaluation metrics that capture different aspects of open-domain dialogues: (1) GPT-2-based context coherence between sentences in a dialogue, (2) GPT-2-based fluency in phrasing, (3) n-gram-based diversity in responses to augmented queries, and (4) textual-entailment-inference-based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open-source the code and relevant materials.
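The abstract's metric (3) measures n-gram diversity across generated responses. The paper's exact formulation is in its released code; a common stand-in for this kind of measure is distinct-n (the ratio of unique n-grams to total n-grams), sketched below as an illustration, not as the authors' implementation.

```python
from collections import Counter


def distinct_n(responses, n=2):
    """Distinct-n: unique n-grams divided by total n-grams over a set of
    responses. Higher values indicate more diverse generations.
    Illustrative only; the paper's actual diversity metric may differ."""
    ngrams = []
    for resp in responses:
        tokens = resp.split()  # simple whitespace tokenization for the sketch
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Two responses sharing the bigram "i am": 3 unique bigrams out of 4 total.
score = distinct_n(["i am fine", "i am good"], n=2)  # → 0.75
```

A dialogue model that always emits the same safe reply ("I don't know") would score near zero here, which is why diversity metrics of this family complement coherence and fluency scores.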

Cite

CITATION STYLE

APA

Pang, B., Nijkamp, E., Han, W., Zhou, L., Liu, Y., & Tu, K. (2020). Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 3619–3629). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-main.333
