A One-Size-Fits-All Solution to Conservative Bandit Problems


Abstract

In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a general one-size-fits-all solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Unlike previous works, which consider high-probability constraints on the expected reward, our algorithms guarantee sample-path constraints on the actual received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) as well as better empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures learning performance in terms of both expected reward and variability. We design a novel algorithm with O(1/T) normalized additive regret (T-independent in the cumulative form) and validate this result through empirical evaluation.
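To illustrate the sample-path constraint described above, the following is a minimal sketch (not the paper's exact algorithm) of the safety check a conservative bandit learner might apply each round: play the exploratory arm suggested by the base algorithm only if the cumulative reward actually received so far, even under a worst-case outcome for the current round, stays above a (1 - alpha) fraction of the baseline's cumulative reward; otherwise fall back to the baseline arm. The names `conservative_play`, `baseline_mean`, and `worst_case_reward` are illustrative assumptions, not identifiers from the paper.

```python
def conservative_play(ucb_arm, default_arm, cum_reward, t,
                      baseline_mean, alpha, worst_case_reward=0.0):
    """Choose the arm to play at round t (1-indexed).

    Sketch of a sample-path conservative check: the exploratory
    (e.g., UCB-suggested) arm is played only if the actual cumulative
    reward so far, plus a worst-case reward for this round, still meets
    the conservative budget (1 - alpha) * t * baseline_mean.  Otherwise
    the safe baseline arm is played to protect the constraint.
    """
    budget = (1.0 - alpha) * t * baseline_mean
    if cum_reward + worst_case_reward >= budget:
        return ucb_arm  # safe to explore this round
    return default_arm  # constraint at risk: play the baseline arm
```

Because the check uses the actual received rewards rather than expectations, the constraint holds on every sample path, which is the distinction the abstract draws from prior high-probability formulations.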

Citation (APA)

Du, Y., Wang, S., & Huang, L. (2021). A One-Size-Fits-All Solution to Conservative Bandit Problems. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 8B, pp. 7254–7261). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i8.16891
