Evaluating the robustness of off-policy evaluation

Abstract

Off-policy evaluation (OPE), and offline evaluation in general, assesses the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Because many OPE estimators have been proposed, and some of them have hyperparameters that must be tuned, practitioners face an emerging challenge in selecting and tuning OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult, because the current experimental procedure evaluates and compares estimators on only a narrow set of hyperparameters and evaluation policies, making it hard to know which estimator is safe and reliable to use. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure for evaluating OPE estimators' robustness to changes in hyperparameters and/or evaluation policies in an interpretable manner. Using the IEOE procedure, we then perform an extensive evaluation of a wide variety of existing estimators on Open Bandit Dataset, a large-scale public real-world dataset for OPE. We demonstrate that our procedure can evaluate estimators' robustness to hyperparameter choices, helping us avoid unsafe estimators. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.
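
As a rough illustration of the kind of hyperparameter sensitivity IEOE is designed to expose, the sketch below computes a clipped inverse propensity weighting (IPW) estimate of a policy's value over a grid of clipping thresholds and reports the spread of the resulting estimates. This is a minimal, hypothetical sketch, not the authors' IEOE implementation; the synthetic log data and the clipping grid are assumptions made purely for illustration.

```python
import numpy as np

def clipped_ipw(rewards, behavior_probs, eval_probs, clip):
    """Clipped inverse propensity weighting (IPW) OPE estimate.

    V_hat = mean( min(pi_e(a|x) / pi_b(a|x), clip) * r )

    `clip` is the hyperparameter whose choice the estimate may be
    sensitive to -- the kind of fragility IEOE aims to surface.
    """
    weights = np.minimum(eval_probs / behavior_probs, clip)
    return float(np.mean(weights * rewards))

# Synthetic logged bandit feedback (purely illustrative).
rng = np.random.default_rng(0)
n = 10_000
behavior_probs = rng.uniform(0.05, 0.95, size=n)  # pi_b(a_i | x_i)
eval_probs = rng.uniform(0.05, 0.95, size=n)      # pi_e(a_i | x_i)
rewards = rng.binomial(1, 0.3, size=n).astype(float)

# Re-estimate the policy value over a grid of clipping thresholds;
# a wide spread across the grid signals hyperparameter sensitivity.
estimates = {c: clipped_ipw(rewards, behavior_probs, eval_probs, c)
             for c in (1.0, 2.0, 5.0, 10.0, np.inf)}
for c, v in estimates.items():
    print(f"clip={c:>5}: V_hat={v:.4f}")
print("spread:", max(estimates.values()) - min(estimates.values()))
```

Roughly speaking, IEOE summarizes robustness via the distribution of estimation errors across many such hyperparameter and evaluation-policy configurations; the single spread statistic above is only a stand-in for that idea.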

Citation (APA)

Saito, Y., Udagawa, T., Kiyohara, H., Mogi, K., Narita, Y., & Tateno, K. (2021). Evaluating the robustness of off-policy evaluation. In RecSys 2021 - 15th ACM Conference on Recommender Systems (pp. 114–123). Association for Computing Machinery. https://doi.org/10.1145/3460231.3474245
