A modern Bayesian look at the multi-armed bandit

  • Steven L. Scott

A multi-armed bandit is an experiment with the goal of accumulating rewards from a payoff distribution with unknown parameters that are to be learned sequentially. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods that are ‘optimal’ in simpler contexts. I summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.
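
As an illustration of the allocation rule described in the abstract, the following is a minimal sketch of randomized probability matching for the simplest case: binary rewards with an independent uniform Beta(1, 1) prior on each arm's success rate. The Bernoulli payoff model, the prior, and the names `choose_arm` and `run_bandit` are assumptions made for this example and are not taken from the article. Drawing one value from each arm's posterior and playing the arm with the largest draw selects each arm with probability equal to the posterior probability that it is optimal.

```python
import random


def choose_arm(successes, failures, rng):
    """Pick an arm by randomized probability matching.

    Assumes Bernoulli rewards and a Beta(1, 1) prior per arm (illustrative
    choices), so the posterior for arm i is Beta(successes[i] + 1,
    failures[i] + 1). One posterior draw is taken per arm and the arm with
    the largest draw is returned.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run_bandit(true_rates, n_rounds, seed=0):
    """Simulate a Bernoulli bandit with hypothetical true success rates.

    Returns the per-arm success and failure counts after n_rounds plays.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_rounds):
        arm = choose_arm(successes, failures, rng)
        # Simulated reward: 1 with probability true_rates[arm], else 0.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = run_bandit([0.1, 0.5, 0.9], n_rounds=2000, seed=1)
    for i, (w, l) in enumerate(zip(wins, losses)):
        print(f"arm {i}: pulls={w + l}, successes={w}")
```

For other payoff distributions, the same loop would apply with the Beta draw replaced by a draw from whatever posterior the chosen model yields, for example one obtained by simulation.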

Author-supplied keywords

  • Bayesian adaptive design
  • exploration vs exploitation
  • probability matching
  • sequential design
