Motivated by edge computing with artificial intelligence, in this paper we study a bandit-learning problem with switching costs. Existing results in the literature either incur $\widetilde{\Theta}(T^{2/3})$ regret with bandit feedback, or rely on free full feedback in order to reduce the regret to $\widetilde{O}(\sqrt{T})$. In contrast, we expand our study to incorporate two new factors. First, full feedback could incur a cost. Second, the player may choose 2 (or more) arms at a time, in which case she is free to use any one of the chosen arms to calculate her loss, and switching costs are incurred only when she changes the set of chosen arms. For the setting where the player pulls only one arm at a time, our new regret lower bound shows that, even when costly full feedback is added, the $\widetilde{\Theta}(T^{2/3})$ regret still cannot be improved. However, the dependence on the number of arms may be improved when the full-feedback cost is small. In contrast, for the setting where the player can choose 2 (or more) arms at a time, we provide a novel online-learning algorithm that achieves a lower $\widetilde{O}(\sqrt{T})$ regret. Further, our new algorithm does not need any full feedback at all. This sharp difference therefore reveals the surprising power of choosing 2 (or more) arms for this type of bandit-learning problem with switching costs. Both our new algorithm and regret analysis involve several new ideas, which may be of independent interest.
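To make the setting concrete, the regret objective described above can be formalized as follows. This is a plausible sketch based only on the abstract, not the paper's exact notation: $K$ is the number of arms, $\ell_t(i) \in [0,1]$ is the (adversarially chosen) loss of arm $i$ in round $t$, $S_t$ is the set of 2 (or more) arms chosen in round $t$, $\beta > 0$ is the per-switch cost, and we assume the player incurs the loss of the better arm she uses among the chosen set:

% Assumed formalization of regret with switching costs; notation is
% illustrative and may differ from the paper's definitions.
\[
R_T \;=\; \mathbb{E}\Bigg[\sum_{t=1}^{T} \min_{i \in S_t} \ell_t(i)
\;+\; \beta \sum_{t=2}^{T} \mathbf{1}\{S_t \neq S_{t-1}\}\Bigg]
\;-\; \min_{1 \le i \le K} \sum_{t=1}^{T} \ell_t(i).
\]

With $|S_t| = 1$ this reduces to the classical bandit-with-switching-costs setting, where the $\widetilde{\Theta}(T^{2/3})$ bound applies. With $|S_t| \ge 2$, one intuition (suggested by the result, though not stated in the abstract) is that the player can hold one arm fixed as an anchor while varying the other, gathering information without triggering the switching cost, which is consistent with the improvement to $\widetilde{O}(\sqrt{T})$.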
CITATION STYLE
Shi, M., Lin, X., & Jiao, L. (2022). Power-of-2-arms for bandit learning with switching costs. In Proceedings of the International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc) (pp. 131–140). Association for Computing Machinery. https://doi.org/10.1145/3492866.3549720