On upper-confidence bound policies for switching bandit problems

Abstract

Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope, or internet advertisement, can be modelled as non-stationary bandit problems in which the reward distributions change abruptly at unknown time instants. In this paper, we analyse two algorithms designed for this setting: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played. The proof relies on a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and we show that both D-UCB and SW-UCB match this lower bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing softmax-based methods such as EXP3.S.
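For concreteness, the sliding-window variant can be sketched in a few lines. The Python sketch below is an illustrative rendering, not the authors' reference implementation: the function name sw_ucb, the reward interface (zero-argument callables returning rewards in [0, B]), and the default values of the window length tau and exploration constant xi are all assumptions; the paper analyses how tau and xi should be chosen.

```python
import math
from collections import deque

def sw_ucb(arms, horizon, window=1000, xi=0.6, b=1.0):
    # Hypothetical interface: arms is a list of zero-argument callables
    # returning rewards in [0, b]. window (tau), xi and b are tuning
    # parameters; the defaults here are illustrative only.
    history = deque()            # (arm, reward) pairs inside the window
    sums = [0.0] * len(arms)     # windowed reward sum per arm
    counts = [0] * len(arms)     # windowed play count per arm
    total = 0.0

    def index(i, t):
        if counts[i] == 0:       # not played within the window: explore
            return float("inf")
        mean = sums[i] / counts[i]
        bonus = b * math.sqrt(xi * math.log(min(t, window)) / counts[i])
        return mean + bonus      # windowed mean + exploration padding

    for t in range(1, horizon + 1):
        arm = max(range(len(arms)), key=lambda i: index(i, t))
        r = arms[arm]()
        total += r
        history.append((arm, r))
        sums[arm] += r
        counts[arm] += 1
        if len(history) > window:   # slide the window: forget the oldest play
            old_arm, old_r = history.popleft()
            sums[old_arm] -= old_r
            counts[old_arm] -= 1
    return total
```

D-UCB differs structurally only in how the statistics are maintained: instead of a hard window, past rewards and play counts are geometrically discounted at each step, so older observations fade out gradually rather than being dropped.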

Citation (APA)

Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Lecture Notes in Computer Science (Vol. 6925 LNAI, pp. 174–188). https://doi.org/10.1007/978-3-642-24412-4_16
