Value Penalized Q-Learning for Recommender Systems

Abstract

Scaling reinforcement learning (RL) to recommender systems (RS) is promising because maximizing the expected cumulative reward for an RL agent matches the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data rather than from expensive online interactions. In this paper, we propose Value Penalized Q-learning (VPQ), a novel uncertainty-based offline RL algorithm that penalizes unstable Q-values in the regression target with uncertainty-aware weights. This yields a conservative Q-function without estimating the behavior policy, making the method suitable for RS with a large number of items. Experiments on two real-world datasets show that the proposed method serves as a gain plug-in for existing RS models.
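To make the core idea concrete, below is a minimal sketch of an uncertainty-penalized regression target for Q-learning, built from a Q-ensemble. It is an illustration under stated assumptions, not the authors' exact formulation: the use of the ensemble standard deviation as the uncertainty measure, the penalty coefficient lam, and the function name penalized_target are all assumptions made for this example.

```python
# Hedged sketch: a conservative Q-learning target that penalizes uncertain
# (unstable) next-state values. The uncertainty measure (ensemble std) and
# the penalty weight `lam` are illustrative assumptions, not the paper's
# exact formulation.
import torch

def penalized_target(reward, done, next_q_ensemble, gamma=0.99, lam=1.0):
    """Compute a conservative regression target for Q-learning.

    reward:           (batch,) tensor of immediate rewards
    done:             (batch,) tensor, 1.0 where the episode ended
    next_q_ensemble:  (n_ensemble, batch) tensor of Q(s', a') estimates,
                      one row per ensemble member, for the next action
    """
    q_mean = next_q_ensemble.mean(dim=0)   # ensemble mean of next-state value
    q_std = next_q_ensemble.std(dim=0)     # ensemble disagreement as uncertainty
    # Subtracting the uncertainty term keeps the learned Q-function
    # conservative on actions poorly covered by the logged data.
    next_value = q_mean - lam * q_std
    return reward + gamma * (1.0 - done) * next_value
```

In practice, each ensemble member would regress its Q-value toward a target of this form; note that no estimate of the behavior policy is required, which is what makes this style of penalty attractive when the item catalog is large.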

Citation (APA)

Gao, C., Xu, K., Zhou, K., Li, L., Wang, X., Yuan, B., & Zhao, P. (2022). Value Penalized Q-Learning for Recommender Systems. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2008–2012). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3531796
