An analysis of the value of information when exploring stochastic, discrete multi-armed bandits

Abstract

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.
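The strategy described in the abstract can be illustrated with a minimal sketch: a soft-max (Boltzmann) arm-selection rule whose exploration parameter is annealed over time, so early pulls are exploration-dominant and later pulls exploit the estimated means. This is an illustrative stand-in, not the authors' exact value-of-information criterion; the `beta = log(t + 1)` schedule, the Bernoulli reward model, and all function names here are assumptions made for the example.

```python
import math
import random

def annealed_softmax_bandit(true_means, horizon, seed=0):
    """Illustrative soft-max exploration with an annealed parameter.

    NOTE: this is a hypothetical sketch of the general idea in the
    abstract (temperature-like parameter varied during search), not the
    paper's value-of-information update or its cooling schedule.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    estimates = [0.0] * k     # running mean reward per arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        # Anneal the exploration parameter: small beta early means a
        # near-uniform (exploration-dominant) policy; large beta later
        # concentrates pulls on the empirically best arm.
        beta = math.log(t + 1)

        # Boltzmann distribution over arms given current estimates.
        weights = [math.exp(beta * q) for q in estimates]
        z = sum(weights)

        # Sample an arm from that distribution.
        r = rng.random() * z
        arm, acc = 0, weights[0]
        while acc < r:
            arm += 1
            acc += weights[arm]

        # Bernoulli reward from the chosen arm (assumed reward model).
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts
```

Under a schedule like this, the policy transitions from exploration-dominant to exploitation-dominant as pulls accumulate, which is the qualitative behavior the abstract's annealing argument relies on.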

Citation (APA)
Sledge, I. J., & Príncipe, J. C. (2018). An analysis of the value of information when exploring stochastic, discrete multi-armed bandits. Entropy, 20(3). https://doi.org/10.3390/e20030155
