Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable?

Abstract

Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal covers task design, annotator recruitment, task execution, and annotation reporting. Such a protocol and process can be used as-is, as a whole, in part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) on the complex task of personalized response generation. We invite the community to use this protocol (or its future community-amended versions) as a transparent, replicable, and comparable approach to HE of generated responses.
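The abstract names four protocol components (task design, annotator recruitment, task execution, annotation reporting) and two evaluated systems (GPT-2 and T5). As a minimal sketch of how such a protocol might be packaged for sharing and reuse, the Python snippet below encodes these components as a single serializable specification. All field names and example values (evaluation dimensions, agreement metric, recruitment settings) are illustrative assumptions, not the paper's actual schema or released artifact.

```python
# Sketch: a shareable, machine-readable human-evaluation protocol spec.
# Every concrete value below is a placeholder assumption for illustration.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class TaskDesign:
    # Evaluation dimensions and scale are placeholders, not the paper's.
    dimensions: list[str] = field(
        default_factory=lambda: ["appropriateness", "contextualization"]
    )
    scale: str = "5-point Likert"
    items_per_annotator: int = 50


@dataclass
class AnnotatorRecruitment:
    source: str = "crowdsourcing platform"   # e.g. in-house vs. crowd workers
    qualification_test: bool = True
    annotators_per_item: int = 3


@dataclass
class TaskExecution:
    interface: str = "web form"
    attention_checks: bool = True
    payment_policy: str = "at or above minimum wage"


@dataclass
class AnnotationReporting:
    agreement_metric: str = "Krippendorff's alpha"
    release_raw_annotations: bool = True
    release_guidelines: bool = True


@dataclass
class HumanEvalProtocol:
    # The two systems come from the abstract; everything else is assumed.
    systems: list[str] = field(default_factory=lambda: ["GPT-2", "T5"])
    task_design: TaskDesign = field(default_factory=TaskDesign)
    recruitment: AnnotatorRecruitment = field(default_factory=AnnotatorRecruitment)
    execution: TaskExecution = field(default_factory=TaskExecution)
    reporting: AnnotationReporting = field(default_factory=AnnotationReporting)


if __name__ == "__main__":
    # Serialize the protocol so it can be published, reused as-is,
    # adopted in part, or extended by overriding individual fields.
    print(json.dumps(asdict(HumanEvalProtocol()), indent=2))
```

Publishing such a specification alongside the annotations is one way a group could make its exact evaluation settings replicable, letting another group rerun the study as-is or change individual fields while keeping the rest comparable.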

Citation (APA)

Mousavi, S. M., Roccabruna, G., Lorandi, M., Caldarella, S., & Riccardi, G. (2022). Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable? In GEM 2022 - 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, Proceedings of the Workshop (pp. 136–147). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.gem-1.12
