Beyond user self-reported Likert scale ratings: A comparison model for automatic dialog evaluation

21 citations · 153 Mendeley readers

Abstract

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they compute the difference between a generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance across different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn a better dialog feature representation, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task. Our implementation is available at https://github.com/Weixin-Liang/dialog_evaluation_CMADE.
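
The repository linked above contains the authors' implementation; the snippet below is only a minimal, illustrative sketch of the rating-cleaning idea, assuming each dialog has already been encoded as a fixed-size feature vector by the self-supervised representation step and carries a noisy self-reported rating as its label. The function name knn_shapley, the synthetic data, and the 10% removal threshold are illustrative assumptions; the closed-form KNN Shapley recursion follows the standard nearest-neighbor data-valuation formulation rather than CMADE's exact procedure.

# Minimal sketch (not the authors' code): flag likely-noisy dialog ratings
# with closed-form KNN Shapley values, then drop the lowest-valued examples.
import numpy as np

def knn_shapley(train_x, train_y, test_x, test_y, k=10):
    """Average contribution of each training dialog to KNN accuracy on a
    held-out set. Low or negative values mark examples whose labels disagree
    with their neighborhood in feature space, i.e. likely noisy ratings."""
    n_train, n_test = len(train_x), len(test_x)
    values = np.zeros(n_train)
    for x, y in zip(test_x, test_y):
        # Sort training dialogs by distance to this held-out dialog.
        order = np.argsort(np.linalg.norm(train_x - x, axis=1))
        s = np.zeros(n_train)
        # Recursion from the farthest neighbor back to the nearest one.
        s[order[-1]] = float(train_y[order[-1]] == y) / n_train
        for i in range(n_train - 2, -1, -1):
            a, b = order[i], order[i + 1]
            s[a] = s[b] + (float(train_y[a] == y) - float(train_y[b] == y)) \
                   / k * min(k, i + 1) / (i + 1)
        values += s
    return values / n_test

if __name__ == "__main__":
    # Hypothetical usage with stand-in dialog embeddings and binary ratings.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 32))        # stand-in dialog feature vectors
    labels = (feats[:, 0] > 0).astype(int)    # stand-in "good/bad" ratings
    labels[:10] = 1 - labels[:10]             # inject some noisy ratings
    vals = knn_shapley(feats[:150], labels[:150], feats[150:], labels[150:])
    keep = np.argsort(vals)[int(0.1 * len(vals)):]   # discard the bottom 10%
    print("kept", len(keep), "of", len(vals), "training dialogs")

Removing the lowest-valued examples targets dialogs whose self-reported ratings disagree with their nearest neighbors in the learned feature space, which is the kind of confusing sample the cleaning step is meant to filter before training the comparison model.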

Citation (APA)

Liang, W., Zou, J., & Yu, Z. (2020). Beyond user self-reported Likert scale ratings: A comparison model for automatic dialog evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1363–1374). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-main.126