In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion, which measures the trade-off between policy information and obtainable rewards. High amounts of policy information correspond to exploration-dominant searches of the arm space and can yield high rewards; low amounts favor the exploitation of existing knowledge. The amount of policy information is quantified by a single parameter that can be varied during the search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to regret that grows logarithmically with the number of arm pulls, which is optimal for stochastic bandits.
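To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation. It pairs a soft-max (Boltzmann) arm-selection rule, standing in for the value-of-information policy, with a geometric cooling schedule standing in for the simulated-annealing-like parameter update. The names `voi_bandit`, `pull_arm`, `tau0`, and `cooling` are hypothetical, and the paper's logarithmic-regret guarantee depends on its specific cooling schedule, which this sketch does not reproduce.

```python
import numpy as np

def voi_bandit(pull_arm, n_arms, n_pulls, tau0=1.0, cooling=0.99):
    """Sketch of value-of-information-style exploration for a
    stochastic, discrete multi-armed bandit.

    `pull_arm(k)` is assumed to return a stochastic reward for arm k.
    The parameter `tau` plays the role of the paper's information-
    quantifying parameter: large `tau` spreads probability mass across
    arms (exploration), small `tau` concentrates it on the empirically
    best arm (exploitation). The geometric cooling here is illustrative.
    """
    counts = np.zeros(n_arms)   # number of times each arm was pulled
    means = np.zeros(n_arms)    # running mean reward per arm
    tau = tau0                  # exploration "temperature"
    total_reward = 0.0

    for _ in range(n_pulls):
        # Soft-max selection probabilities over the mean-reward estimates.
        logits = means / max(tau, 1e-8)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()

        k = np.random.choice(n_arms, p=probs)
        r = pull_arm(k)

        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]  # incremental mean update
        total_reward += r

        tau *= cooling          # anneal toward exploitation
    return total_reward, means

# Hypothetical usage: three Bernoulli arms with unknown success rates.
rates = [0.2, 0.5, 0.8]
reward, estimates = voi_bandit(
    lambda k: float(np.random.rand() < rates[k]), n_arms=3, n_pulls=5000)
```

As `tau` shrinks, the soft-max distribution sharpens around the arm with the highest estimated mean, so the pull sequence transitions from exploration-dominant to exploitation-dominant, mirroring the annealed trade-off described above.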
Citation: Sledge, I. J., & Príncipe, J. C. (2018). An analysis of the value of information when exploring stochastic, discrete multi-armed bandits. Entropy, 20(3), 155. https://doi.org/10.3390/e20030155