Infinite horizon multi-armed bandits with reward vectors: Exploration/exploitation trade-off

Abstract

We focus on the effect of exploration/exploitation trade-off strategies on the algorithmic design of multi-armed bandits (MAB) with reward vectors. The Pareto dominance relation assesses the quality of reward vectors in infinite-horizon MAB algorithms such as UCB1 and UCB2. In single-objective MABs, there is a trade-off between exploration of the suboptimal arms and exploitation of the single optimal arm. Pareto-dominance-based MABs fairly exploit all Pareto-optimal arms while also exploring the suboptimal arms. We study the exploration vs. exploitation trade-off for two UCB-like algorithms for reward vectors. We analyse the properties of the proposed MAB algorithms in terms of upper regret bounds, and we experimentally compare their exploration vs. exploitation trade-offs on a bi-objective Bernoulli environment from control theory.
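
To make the setting concrete, below is a minimal Python sketch of a Pareto-dominance-based UCB policy for reward vectors, in the spirit of the algorithms the paper studies. It is an illustration under assumptions, not the paper's exact method: the exploration bonus below is the standard single-objective UCB1 term, whereas the paper's variants use refined bonuses that yield the stated regret bounds. The callback pull_arm and all other names are hypothetical.

import math
import random

def pareto_dominates(u, v):
    # u dominates v: at least as good in every objective, strictly better in at least one.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_ucb(pull_arm, n_arms, n_objectives, horizon):
    # pull_arm(i) is assumed to return a reward vector in [0, 1]^n_objectives.
    counts = [0] * n_arms
    sums = [[0.0] * n_objectives for _ in range(n_arms)]
    for i in range(n_arms):  # initialization: pull each arm once
        reward = pull_arm(i)
        counts[i] = 1
        sums[i] = [s + x for s, x in zip(sums[i], reward)]
    for t in range(n_arms, horizon):
        # Per-arm UCB index vector: empirical mean plus an exploration bonus per objective.
        index = [[s / counts[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
                  for s in sums[i]] for i in range(n_arms)]
        # Pareto front of the index vectors: arms whose index is not dominated by any other arm's.
        front = [i for i in range(n_arms)
                 if not any(pareto_dominates(index[j], index[i])
                            for j in range(n_arms) if j != i)]
        # Fair exploitation: choose uniformly at random among the Pareto-optimal arms.
        arm = random.choice(front)
        reward = pull_arm(arm)
        counts[arm] += 1
        sums[arm] = [s + x for s, x in zip(sums[arm], reward)]
    return counts

For a bi-objective Bernoulli environment like the one in the paper's experiments, one could pass, for example, pull_arm = lambda i: tuple(float(random.random() < p) for p in means[i]) for a hypothetical list of mean vectors means.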

Citation (APA)

Drugan, M. M. (2015). Infinite horizon multi-armed bandits with reward vectors: Exploration/exploitation trade-off. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9494, pp. 128–144). Springer Verlag. https://doi.org/10.1007/978-3-319-27947-3_7
