Abstract
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as high as a given baseline at any time. We propose a general one-size-fits-all solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Unlike previous works, which consider high-probability constraints on the expected reward, our algorithms guarantee sample-path constraints on the actual received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance in both the expected reward and variability. We design a novel algorithm with O(1/T) normalized additive regret (T-independent in the cumulative form) and validate this result through empirical evaluation.
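The sample-path constraint described above can be illustrated with a minimal sketch: at every round, the learner's cumulative realized reward must stay at least (1 − α) times the cumulative baseline reward. The simulation below is a hypothetical illustration, not the paper's actual algorithm (which selects arms via confidence bounds); the function name, the worst-case fallback rule, and the inputs are all assumptions for exposition.

```python
def conservative_play(rewards, baseline_reward, alpha):
    """Simulate a sample-path conservative constraint: after every round t,
    cumulative learner reward >= (1 - alpha) * cumulative baseline reward.

    Hypothetical illustration only; `rewards` are the realized rewards the
    exploratory arm would yield, `baseline_reward` is the per-round baseline,
    and alpha in [0, 1] is the allowed budget fraction.
    """
    cum_learner, cum_base = 0.0, 0.0
    constraint_held = []
    for r in rewards:
        prospective_base = cum_base + baseline_reward
        # Explore only if the constraint survives even a zero-reward round;
        # otherwise fall back to the safe baseline arm.
        if cum_learner >= (1 - alpha) * prospective_base:
            cum_learner += r                # exploratory arm's realized reward
        else:
            cum_learner += baseline_reward  # baseline arm's (known) reward
        cum_base = prospective_base
        constraint_held.append(cum_learner >= (1 - alpha) * cum_base)
    return constraint_held
```

With nonnegative rewards, the fallback rule makes the constraint hold on every sample path by construction, which is the distinction the abstract draws against expected-reward (high-probability) constraints.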
Du, Y., Wang, S., & Huang, L. (2021). A One-Size-Fits-All Solution to Conservative Bandit Problems. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 8B, pp. 7254–7261). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i8.16891