Observe before play: Multi-armed bandit with pre-observations

Abstract

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for K arms with Bernoulli rewards, and prove a T-round regret upper bound of O(K² log T). In the multi-player setting, collisions occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its T-round regret relative to an offline greedy strategy is upper bounded by O((K⁴/M²) log T) for K arms and M players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, which achieve logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
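To make the single-player round structure concrete, below is a minimal Python simulation sketch of observe-before-play. It assumes Bernoulli arms, a fixed cost c per pre-observation, a standard UCB1 index, and a stop-at-first-success rule; the names (simulate, mu, c) and these simplifications are illustrative assumptions, not the paper's OBP-UCB algorithm, which more carefully optimizes the observation sequence and stopping decision.

import math
import random

def simulate(mu, c=0.05, T=10_000, seed=0):
    """Simulate T rounds of observe-before-play with a UCB-style heuristic.

    Assumptions (not from the paper): Bernoulli arms with means `mu`, a
    fixed cost `c` per pre-observation, and stop at the first observed 1.
    """
    rng = random.Random(seed)
    K = len(mu)
    pulls = [0] * K   # observations per arm
    wins = [0] * K    # observed 1-rewards per arm
    net_reward = 0.0

    for t in range(1, T + 1):
        def ucb(i):
            # UCB1 index; unseen arms get +inf to force initial exploration.
            if pulls[i] == 0:
                return float("inf")
            return wins[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])

        order = sorted(range(K), key=ucb, reverse=True)

        # Pre-observe arms in decreasing UCB order, paying c per observation.
        # With Bernoulli rewards, stop at the first observed 1: paying to
        # observe further cannot improve on a reward of 1. Observing more
        # arms raises the chance of playing a good arm but raises the cost,
        # which is exactly the dilemma described in the abstract.
        chosen, cost = order[-1], 0.0
        observed = {}
        for i in order:
            cost += c
            r = 1 if rng.random() < mu[i] else 0
            observed[i] = r
            pulls[i] += 1
            wins[i] += r
            if r == 1:
                chosen = i
                break

        net_reward += observed[chosen] - cost  # play the chosen (observed) arm

    return net_reward

print(simulate([0.9, 0.5, 0.2]))

In the multi-player variants, the same per-player round structure applies, with the added complication that players selecting the same arm in the same round collide; C-MP-OBP coordinates players centrally to avoid this, while D-MP-OBP and D-MP-Adapt-OBP do so in a distributed fashion.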

Citation (APA)

Zuo, J., Zhang, X., & Joe-Wong, C. (2020). Observe before play: Multi-armed bandit with pre-observations. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 7023–7030). AAAI Press. https://doi.org/10.1609/aaai.v34i04.6187
