Abstract
We propose a PAC formulation for identifying an arm in an n-armed bandit whose mean is within a fixed tolerance of the mth highest mean. This setup generalises a previous formulation with m = 1, and differs from another that requires m such arms to be identified. The key implication of our approach is that we can derive upper bounds on the sample complexity that depend on n/m in place of n. Consequently, even when the number of arms is infinite, only a finite number of samples is needed to identify an arm that compares favourably with a fixed reward quantile. This property makes our approach attractive for applications such as drug discovery, wherein the number of arms (molecular configurations) may run into a few thousand. We present sampling algorithms for both the finite- and infinite-armed cases, and validate their efficiency through theoretical and experimental analysis. We also present a lower bound on the worst-case sample complexity of PAC algorithms for our problem, which matches our upper bound up to a logarithmic factor.
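One way to see why a finite budget can suffice with infinitely many arms is a standard counting argument, sketched below. It is an illustration under stated assumptions, not the paper's algorithm: we assume arms can be drawn independently and uniformly at random from the arm population, and the names `subset_size`, `rho` (the target quantile fraction, playing the role of m/n) and `delta` (the allowed failure probability) are hypothetical. The sketch covers only the subset-selection step and does not model the tolerance on the means.

```python
import math


def subset_size(rho: float, delta: float) -> int:
    """Smallest k such that k arms drawn i.i.d. uniformly at random
    contain at least one arm from the top rho-fraction with
    probability at least 1 - delta.

    Each draw misses the top rho-fraction with probability 1 - rho,
    so all k draws miss with probability (1 - rho)**k. We need that
    to be at most delta. (Hypothetical helper for illustration.)
    """
    if not (0.0 < rho < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("rho and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - rho))


# Example: target the top 10% of arms with failure probability 5%.
k = subset_size(rho=0.1, delta=0.05)
print(k)  # 29
```

The required subset size grows roughly as (1/rho) log(1/delta) and does not involve the total number of arms, which is consistent with a dependence on n/m rather than n. Any finite-armed procedure could then, in principle, be run on the sampled subset.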
Cite
Chaudhuri, A. R., & Kalyanakrishnan, S. (2017). PAC identification of a bandit arm relative to a reward quantile. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017 (pp. 1777–1783). AAAI Press. https://doi.org/10.1609/aaai.v31i1.10802