Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

Abstract

In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible non-parametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change in the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the point-wise optimal action at each period. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (a temporal uncertainty budget), we characterize the complexity of this problem via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K.
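The two quantities the abstract relies on, temporal variation and regret against a dynamic oracle, can be made concrete with a small sketch. The abstract does not spell out the formulas, so the code below assumes that temporal variation sums, over consecutive periods, the largest absolute change in any arm's mean reward. It also computes regret directly on the mean rewards for one fixed sequence of pulls, rather than taking an expectation over a policy's randomness. All function names and the example numbers are illustrative, not taken from the paper.

```python
from typing import Sequence

Means = Sequence[Sequence[float]]  # means[t][k]: mean reward of arm k in period t


def temporal_variation(means: Means) -> float:
    """Sum over consecutive periods of the largest absolute change in any arm's mean.

    Assumed form; the abstract only describes it as the cumulative mean change.
    """
    return sum(
        max(abs(curr_k - prev_k) for prev_k, curr_k in zip(prev, curr))
        for prev, curr in zip(means, means[1:])
    )


def dynamic_oracle_regret(means: Means, actions: Sequence[int]) -> float:
    """Shortfall relative to an oracle that plays the best arm in every period."""
    return sum(max(row) - row[a] for row, a in zip(means, actions))


def static_oracle_regret(means: Means, actions: Sequence[int]) -> float:
    """Shortfall relative to an oracle that commits to the single best fixed arm."""
    num_arms = len(means[0])
    best_fixed = max(sum(row[k] for row in means) for k in range(num_arms))
    return best_fixed - sum(row[a] for row, a in zip(means, actions))


# Hypothetical instance: 3 periods, 2 arms, with the best arm switching after period 1.
example_means = [[0.5, 0.25], [0.5, 0.75], [0.25, 0.75]]
example_actions = [0, 0, 1]  # arm pulled in each period

print(temporal_variation(example_means))                      # 0.75
print(dynamic_oracle_regret(example_means, example_actions))  # 0.25
print(static_oracle_regret(example_means, example_actions))   # 0.0
```

In this instance the pulls incur zero regret against the static oracle but positive regret against the dynamic one, which illustrates why the dynamic benchmark is the more demanding of the two when rewards change over time.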

Cite

Besbes, O., Gur, Y., & Zeevi, A. (2019). Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards. Stochastic Systems, 9(4), 319–337. https://doi.org/10.1287/stsy.2019.0033
