This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy whose regret is of order log n with probability at least 1 - 1/n. They have also shown that this property is not shared by the popular UCB1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy, that is, any policy that does not use knowledge of n. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms. © 2011 Springer-Verlag.
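For readers unfamiliar with the UCB1 policy mentioned above, the following is a minimal sketch, not taken from the paper. It assumes Bernoulli arms and uses the standard UCB1 index of Auer et al. (2002), namely the empirical mean plus sqrt(2 ln t / n_i). The function name `ucb1`, the arm means, the horizon, and the seed are all hypothetical choices made for illustration. Note that the index depends only on the current round t and never on the total number of plays n, which is what makes the policy anytime.

```python
import math
import random


def ucb1(means, n, seed=0):
    """Run UCB1 for n plays on Bernoulli arms with the given (hypothetical) means.

    Returns (pull counts per arm, pseudo-regret of this run).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of times each arm was played
    sums = [0.0] * k   # cumulative reward collected from each arm

    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # index = empirical mean + exploration bonus sqrt(2 ln t / n_i)
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    # pseudo-regret: expected reward lost relative to always playing the best arm
    best = max(means)
    regret = sum((best - m) * c for m, c in zip(means, counts))
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.5, 0.6], n=10_000)
    print(counts, regret)
```

Repeating the call over many seeds and inspecting the upper tail of the resulting regrets is one way to look at deviations empirically; a single run such as this one says nothing about the high-probability guarantees discussed in the abstract.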
Salomon, A., & Audibert, J. Y. (2011). Deviations of stochastic bandit regret. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6925 LNAI, pp. 159–173). https://doi.org/10.1007/978-3-642-24412-4_15