Offline metrics for IR evaluation are often derived from a user model that seeks to capture the interaction between the user and the ranking, thereby conflating the user’s interaction with a ranking of documents with their interaction with the search results page. A desirable property of any effectiveness metric is that the scores it generates over a set of rankings correlate well with the “satisfaction” or “goodness” scores attributed to those same rankings by a population of searchers. Using data from a large-scale web search engine, we find that offline effectiveness metrics do not correlate well with a behavioural measure of satisfaction that can be inferred from user activity logs. We then examine three mechanisms to improve the correlation: tuning the model parameters; improving the label coverage, so that more kinds of item are labelled and hence included in the evaluation; and modifying the underlying user models that describe the metrics. In combination, these three mechanisms transform a wide range of common metrics into “card-aware” variants which allow for the gain from cards (or snippets), varying probabilities of clickthrough, and good abandonment.
Thomas, P., Moffat, A., Bailey, P., Scholer, F., & Craswell, N. (2018). Better effectiveness metrics for SERPs, cards, and rankings. In ACM International Conference Proceeding Series. Association for Computing Machinery. https://doi.org/10.1145/3291992.3292002