NDCG and similar measures remain standard for the offline evaluation of search, recommendation, question answering, and similar systems. These measures require definitions for two or more relevance levels, which human assessors then apply to judge individual documents. Due to this dependence on a definition of relevance, it can be difficult to extend these measures to account for factors beyond relevance. Rather than propose extensions to these measures, we instead propose a radical simplification to replace them. For each query, we define a set of ideal rankings and compute the maximum rank similarity between members of this set and an actual ranking generated by a system. This maximum similarity to an ideal ranking becomes our effectiveness measure, replacing NDCG and similar measures. We propose rank-biased overlap (RBO) to compute this rank similarity, since it was specifically designed to measure the similarity of search result rankings. As examples, we explore ideal rankings that account for document length, diversity, and correctness.
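To make the proposal concrete, the sketch below shows one way the measure could be computed. It uses a truncated, prefix-based form of RBO (Webber et al., 2010), which weights the overlap between two ranking prefixes by a geometrically decaying persistence parameter p; without the extrapolation term of the full measure, this truncated sum is a lower bound on RBO. The function names rbo and max_rbo_to_ideal, the default p = 0.9, and the example document identifiers are illustrative assumptions, not artifacts of the paper.

```python
def rbo(run, ideal, p=0.9, depth=None):
    """Truncated rank-biased overlap between two rankings (lists of doc ids).

    Prefix-based sketch: the agreement at each depth d is the proportion of
    documents shared by the two depth-d prefixes, and agreements are combined
    with weights (1 - p) * p**(d - 1). For identical rankings the truncated
    score is 1 - p**depth, approaching 1 as the evaluation depth grows.
    """
    k = depth or max(len(run), len(ideal))
    seen_run, seen_ideal = set(), set()
    score = 0.0
    for d in range(1, k + 1):
        if d <= len(run):
            seen_run.add(run[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        overlap = len(seen_run & seen_ideal)
        score += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * score


def max_rbo_to_ideal(run, ideal_rankings, p=0.9, depth=None):
    """Effectiveness of a run: its maximum RBO to any member of the ideal set."""
    return max(rbo(run, ideal, p=p, depth=depth) for ideal in ideal_rankings)


# Hypothetical example: one system ranking scored against a set containing
# two equally ideal orderings; the better-matching ideal determines the score.
run = ["d3", "d1", "d5", "d2", "d4"]
ideals = [["d1", "d2", "d3", "d4", "d5"],
          ["d2", "d1", "d3", "d4", "d5"]]
print(max_rbo_to_ideal(run, ideals, p=0.9))
```

Taking the maximum over the ideal set is what frees the measure from a single relevance definition: any ideal ranking, whether built from relevance levels, document length, diversity, or correctness, can stand in without changing the computation.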
CITATION
Clarke, C. L. A., Smucker, M. D., & Vtyurina, A. (2020). Offline evaluation by maximum similarity to an ideal ranking. In Proceedings of the International Conference on Information and Knowledge Management (pp. 225–234). Association for Computing Machinery. https://doi.org/10.1145/3340531.3411915