A Linear Response Bandit Problem

  • Goldenshluger A
  • Zeevi A

Abstract

We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon n and no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like √n.

1. Introduction

Sequential allocation problems, otherwise known as multi-armed bandit problems, arise frequently in various areas of statistics, adaptive control, marketing, economics and machine learning. The problem can be described as that of choosing between arms of a slot machine, where each time an arm is pulled a random, arm-dependent reward is realized. The goal is to maximize the cumulative expected reward. Since the mean reward rate for each arm is not known, the gambler is faced with the classical dilemma between exploration and exploitation. The first instance of these sequential allocation problems was introduced by Robbins (1952), and since then numerous variants thereof have been studied extensively in many different contexts; we refer to Berry and Fristedt (1985), Gittins (1989), Lai (2001) and the recent book by Cesa-Bianchi and Lugosi (2006), as well as references therein. A stream of such literature has focused on the characterization of optimal procedures under Bayesian formulations, but the complexity of the problem has led many researchers to seek approximate solutions that perform well in a suitable asymptotic sense;
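The abstract describes a policy that mixes a myopic, least-squares-based choice with forced sampling. Below is a minimal Python sketch of that idea for a two-armed linear response bandit. It is an illustration, not the authors' rate-optimal policy: the forcing schedule (each arm is guaranteed roughly c·log t pulls), the constant c, the Gaussian covariates and noise, and the small ridge term are all assumptions made for the sake of a runnable example.

import numpy as np

rng = np.random.default_rng(0)

d, n, c = 3, 5000, 5.0                          # covariate dim, horizon, forcing constant (assumed)
beta = [rng.normal(size=d) for _ in range(2)]   # unknown arm parameter vectors

# Per-arm OLS sufficient statistics; a tiny ridge keeps the solves well posed
# before an arm has accumulated d linearly independent covariates.
A = [1e-6 * np.eye(d) for _ in range(2)]
b = [np.zeros(d) for _ in range(2)]
counts = [0, 0]
regret = 0.0

for t in range(1, n + 1):
    x = rng.normal(size=d)                      # covariate revealed before acting

    # Forced sampling: guarantee each arm about c*log(t) pulls, so both OLS
    # estimates remain consistent regardless of what the greedy rule does.
    due = [k for k in range(2) if counts[k] < c * np.log(t + 1)]
    if due:
        arm = min(due, key=lambda k: counts[k])
    else:
        # Myopic step: play the arm with the larger estimated mean reward
        # x . beta_hat_k for the current covariate x.
        est = [np.linalg.solve(A[k], b[k]) for k in range(2)]
        arm = int(x @ est[1] > x @ est[0])

    y = x @ beta[arm] + rng.normal()            # noisy linear response
    A[arm] += np.outer(x, x)
    b[arm] += y * x
    counts[arm] += 1
    regret += max(x @ beta[0], x @ beta[1]) - x @ beta[arm]

print(f"cumulative regret over {n} rounds: {regret:.1f}")

With this kind of schedule the number of forced pulls grows only logarithmically in n, consistent with the logarithmic regret rate stated in the abstract; the paper's actual policy chooses the schedule and estimators more carefully to achieve rate optimality.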

Citation (APA)

Goldenshluger, A., & Zeevi, A. (2013). A Linear Response Bandit Problem. Stochastic Systems, 3(1), 230–261. https://doi.org/10.1287/11-ssy032
