Contrasting human opinion of non-factoid question answering with automatic evaluation


Abstract

Evaluation in non-factoid question answering tasks generally takes the form of computing automatic metric scores for systems on a sample test set of questions against human-generated reference answers. Conclusions drawn from the scores produced by automatic metrics inevitably lead to important decisions about future directions. Metrics commonly applied include ROUGE, adopted from the related field of summarization, and BLEU and Meteor, both originally developed for the evaluation of machine translation. In this paper, we pose the important question: given that question answering is evaluated by applying automatic metrics originally designed for other tasks, to what degree do the conclusions drawn from such metrics correspond to human opinion about system-generated answers? We take the task of machine reading comprehension (MRC) as a case study and, to address this question, provide a new method of human evaluation developed specifically for the task at hand.
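The abstract refers to automatic metrics such as ROUGE being computed against human-generated reference answers. As a rough illustration only (not the paper's evaluation code, and not the official ROUGE toolkit, which adds stemming and F-measure variants), ROUGE-1 recall can be sketched in a few lines of Python:

```python
from collections import Counter

def rouge1_recall(reference: str, hypothesis: str) -> float:
    """Unigram recall: clipped overlapping tokens / total reference tokens.

    A minimal sketch for illustration -- real ROUGE implementations
    support stemming, stopword removal, and precision/F-score variants.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Clip each overlap by the hypothesis count (standard n-gram clipping).
    overlap = sum(min(cnt, hyp_counts[tok]) for tok, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Hypothetical reference answer vs. a system-generated answer:
ref = "the cat sat on the mat"
hyp = "the cat lay on the mat"
print(rouge1_recall(ref, hyp))  # 5 of the 6 reference unigrams are matched
```

A metric like this rewards lexical overlap with the reference, which is exactly the property the paper questions: a fluent, correct answer phrased differently from the reference can score poorly, motivating the comparison against direct human judgments.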

Citation (APA)

Ji, T., Graham, Y., & Jones, G. J. F. (2020). Contrasting human opinion of non-factoid question answering with automatic evaluation. In CHIIR 2020 - Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (pp. 348–352). Association for Computing Machinery, Inc. https://doi.org/10.1145/3343413.3377996
