Observe before play: Multi-armed bandit with pre-observations

Abstract

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for K arms with Bernoulli rewards, and prove a T-round regret upper bound of O(K² log T). In the multi-player setting, collisions occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its T-round regret relative to an offline greedy strategy is upper bounded by O((K⁴/M²) log T) for K arms and M players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, which achieve logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
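To make the single-player round structure concrete, below is a minimal Python simulation sketch of observe-before-play. It assumes Bernoulli arms, a fixed cost c per pre-observation, a standard UCB1 index, and a stop-at-first-success rule; the names (simulate, mu, c) and these simplifications are illustrative assumptions, not the paper's OBP-UCB algorithm, which more carefully optimizes the observation sequence and stopping decision.

import math
import random

def simulate(mu, c=0.05, T=10_000, seed=0):
    """Simulate T rounds of observe-before-play with a UCB-style heuristic.

    Assumptions (not from the paper): Bernoulli arms with means `mu`, a
    fixed cost `c` per pre-observation, and stop at the first observed 1.
    """
    rng = random.Random(seed)
    K = len(mu)
    pulls = [0] * K   # observations per arm
    wins = [0] * K    # observed 1-rewards per arm
    net_reward = 0.0

    for t in range(1, T + 1):
        def ucb(i):
            # UCB1 index; unseen arms get +inf to force initial exploration.
            if pulls[i] == 0:
                return float("inf")
            return wins[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])

        order = sorted(range(K), key=ucb, reverse=True)

        # Pre-observe arms in decreasing UCB order, paying c per observation.
        # With Bernoulli rewards, stop at the first observed 1: paying to
        # observe further cannot improve on a reward of 1. Observing more
        # arms raises the chance of playing a good arm but raises the cost,
        # which is exactly the dilemma described in the abstract.
        chosen, cost = order[-1], 0.0
        observed = {}
        for i in order:
            cost += c
            r = 1 if rng.random() < mu[i] else 0
            observed[i] = r
            pulls[i] += 1
            wins[i] += r
            if r == 1:
                chosen = i
                break

        net_reward += observed[chosen] - cost  # play the chosen (observed) arm

    return net_reward

print(simulate([0.9, 0.5, 0.2]))

In the multi-player variants, the same per-player round structure applies, with the added complication that players selecting the same arm in the same round collide; C-MP-OBP coordinates players centrally to avoid this, while D-MP-OBP and D-MP-Adapt-OBP do so in a distributed fashion.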

Citation (APA)

Zuo, J., Zhang, X., & Joe-Wong, C. (2020). Observe before play: Multi-armed bandit with pre-observations. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 7023–7030). AAAI Press. https://doi.org/10.1609/aaai.v34i04.6187
