Abstract
Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols, and developing them is difficult. As a result, researchers either perform non-replicable, non-transparent and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal covers task design, annotator recruitment, task execution, and annotation reporting. This protocol and process can be used as-is, in whole or in part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) on the complex task of personalized response generation. We invite the community to use this protocol, or its future community-amended versions, as a transparent, replicable, and comparable approach to HE of generated responses.
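As a rough illustration only (not taken from the paper), the four protocol components named in the abstract could be captured as a structured, machine-readable record that travels with the evaluation results. All class and field names below are hypothetical assumptions; a minimal Python sketch:

```python
# Hypothetical sketch: recording a human-evaluation protocol (task design,
# annotator recruitment, task execution, annotation reporting) as a shareable
# artifact. Field names are illustrative assumptions, not the authors' schema.
from dataclasses import dataclass, asdict
from typing import List
import json


@dataclass
class TaskDesign:
    dimensions: List[str]      # qualities judged per response, e.g. fluency
    scale: str                 # e.g. "5-point Likert"
    items_per_annotator: int


@dataclass
class Recruitment:
    source: str                # e.g. crowdsourcing platform or in-lab pool
    qualification: str         # screening criteria applied to annotators
    compensation: str          # how annotators are paid


@dataclass
class Execution:
    annotators_per_item: int   # redundancy used to compute agreement
    guidelines_shared: bool    # whether annotation guidelines are released


@dataclass
class Reporting:
    agreement_metric: str      # e.g. an inter-annotator agreement coefficient
    released_artifacts: List[str]  # e.g. raw judgments, guidelines, scripts


@dataclass
class EvaluationProtocol:
    task_design: TaskDesign
    recruitment: Recruitment
    execution: Execution
    reporting: Reporting

    def to_json(self) -> str:
        # Serialize so the protocol can be shared and reused verbatim.
        return json.dumps(asdict(self), indent=2)
```

Publishing such a record alongside the scores is one way a protocol could be reused as-is or extended by other groups, in the spirit the abstract describes.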
Citation
Mousavi, S. M., Roccabruna, G., Lorandi, M., Caldarella, S., & Riccardi, G. (2022). Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable? In GEM 2022 - 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, Proceedings of the Workshop (pp. 136–147). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.gem-1.12