Metric and relevance mismatch in retrieval evaluation


Abstract

Recent investigations of search performance have shown that, even when presented with two systems, one superior and one inferior according to a Cranfield-style batch experiment, real users may perform equally well with either system. In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics and their relationship with user performance on a common web search task. Our results show that batch-experiment predictions based on P@1 or DCG@1 translate directly to user search effectiveness. However, marginally relevant documents are not strongly differentiable from non-relevant documents. Therefore, when folding multiple relevance levels into a binary scale, marginally relevant documents should be grouped with non-relevant documents, rather than with highly relevant documents, as is currently done in standard IR evaluations. We then investigate relevance mismatch, classifying users based on relevance profiles: the likelihood with which they will judge documents of different relevance levels to be useful. When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks. © 2009 Springer Berlin Heidelberg.
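
The sketch below (not taken from the paper) illustrates the two metrics discussed in the abstract and the binarisation choice the results motivate: P@1 over binary judgments, DCG@1 over graded judgments, and a threshold that controls whether marginally relevant documents are folded in with relevant or non-relevant ones. The three-level relevance scale (0 = non-relevant, 1 = marginally relevant, 2 = highly relevant), the 2^rel − 1 gain function, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of P@1, DCG@1, and graded-to-binary relevance folding.
# Relevance scale and gain function are assumptions for illustration only.

import math
from typing import List


def p_at_1(binary_relevance: List[int]) -> float:
    """Precision at rank 1: 1.0 if the top-ranked document is relevant, else 0.0."""
    return float(binary_relevance[0]) if binary_relevance else 0.0


def dcg_at_k(graded_relevance: List[int], k: int = 1) -> float:
    """Discounted cumulative gain at rank k, using the common (2^rel - 1) gain."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)   # rank is 0-based, so discount is log2(rank + 2)
        for rank, rel in enumerate(graded_relevance[:k])
    )


def binarize(graded_relevance: List[int], threshold: int = 2) -> List[int]:
    """Fold a graded scale (0 = non-relevant, 1 = marginal, 2 = highly relevant) into binary.

    With threshold=2, marginally relevant documents are grouped with non-relevant
    documents, as the paper's results suggest; threshold=1 is the lenient folding
    used in many standard evaluations.
    """
    return [1 if rel >= threshold else 0 for rel in graded_relevance]


if __name__ == "__main__":
    ranked_judgments = [1, 2, 0]  # hypothetical graded labels for a ranked list
    print("DCG@1:", dcg_at_k(ranked_judgments, k=1))                                  # 1.0
    print("P@1, strict folding (marginal = non-relevant):",
          p_at_1(binarize(ranked_judgments, threshold=2)))                            # 0.0
    print("P@1, lenient folding (marginal = relevant):",
          p_at_1(binarize(ranked_judgments, threshold=1)))                            # 1.0
```

As the example output shows, the folding threshold alone can flip P@1 for the same ranked list, which is why the choice of where marginally relevant documents land matters when comparing batch metrics against user performance.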

Citation (APA)

Scholer, F., & Turpin, A. (2009). Metric and relevance mismatch in retrieval evaluation. In Lecture Notes in Computer Science (Vol. 5839, pp. 50–62). Springer. https://doi.org/10.1007/978-3-642-04769-5_5
