Abstract
Evaluation in non-factoid question answering typically takes the form of computing automatic metric scores for systems over a sample test set of questions against human-generated reference answers. Conclusions drawn from the scores produced by automatic metrics inevitably lead to important decisions about future research directions. Commonly applied metrics include ROUGE, adopted from the related field of summarization, as well as BLEU and Meteor, both originally developed for the evaluation of machine translation. In this paper, we pose an important question: given that question answering is evaluated with automatic metrics originally designed for other tasks, to what degree do the conclusions drawn from those metrics correspond to human opinion of system-generated answers? We take the task of machine reading comprehension (MRC) as a case study and, to address this question, provide a new method of human evaluation developed specifically for the task at hand.
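As a minimal sketch of the kind of automatic scoring the abstract refers to, the following computes unigram-overlap precision, recall, and F1 between a system answer and a reference answer — the idea underlying ROUGE-1. This is an illustration only, not the authors' evaluation pipeline, and real ROUGE implementations add stemming, stopword handling, and multi-reference aggregation.

```python
from collections import Counter

def unigram_overlap(hypothesis: str, reference: str):
    """Precision, recall, and F1 of unigram overlap (ROUGE-1 style).

    hypothesis: the system-generated answer.
    reference:  the human-written reference answer.
    """
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as in each side.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Toy example with a hypothetical answer pair.
p, r, f = unigram_overlap("the cat is on the mat", "the cat sat on the mat")
```

A high overlap score here can coexist with a semantically wrong answer (e.g. "is" vs. "sat"), which is exactly the gap between automatic scores and human opinion that the paper investigates.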
Ji, T., Graham, Y., & Jones, G. J. F. (2020). Contrasting human opinion of non-factoid question answering with automatic evaluation. In CHIIR 2020 - Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (pp. 348–352). Association for Computing Machinery, Inc. https://doi.org/10.1145/3343413.3377996