Abstract
Evaluation in non-factoid question answering typically takes the form of computing automatic metric scores for systems over a sample test set of questions against human-generated reference answers. Conclusions drawn from the scores produced by automatic metrics inevitably lead to important decisions about future research directions. Commonly applied metrics include ROUGE, adopted from the related field of summarization, as well as BLEU and Meteor, both originally developed for the evaluation of machine translation. In this paper, we pose an important question: given that question answering is evaluated with automatic metrics originally designed for other tasks, to what degree do the conclusions drawn from those metrics correspond to human opinion of system-generated answers? We take the task of machine reading comprehension (MRC) as a case study and, to address this question, provide a new method of human evaluation developed specifically for the task at hand.
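As a minimal sketch of the kind of automatic scoring the abstract refers to, the following computes unigram-overlap precision, recall, and F1 between a system answer and a reference answer — the idea underlying ROUGE-1. This is an illustration only, not the authors' evaluation pipeline, and real ROUGE implementations add stemming, stopword handling, and multi-reference aggregation.

```python
from collections import Counter

def unigram_overlap(hypothesis: str, reference: str):
    """Precision, recall, and F1 of unigram overlap (ROUGE-1 style).

    hypothesis: the system-generated answer.
    reference:  the human-written reference answer.
    """
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as in each side.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Toy example with a hypothetical answer pair.
p, r, f = unigram_overlap("the cat is on the mat", "the cat sat on the mat")
```

A high overlap score here can coexist with a semantically wrong answer (e.g. "is" vs. "sat"), which is exactly the gap between automatic scores and human opinion that the paper investigates.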
Ji, T., Graham, Y., & Jones, G. J. F. (2020). Contrasting human opinion of non-factoid question answering with automatic evaluation. In CHIIR 2020 - Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (pp. 348–352). Association for Computing Machinery, Inc. https://doi.org/10.1145/3343413.3377996