ASYMPTOTICALLY OPTIMAL MULTI-ARMED BANDIT POLICIES UNDER A COST CONSTRAINT

Published: 2016-10-05
Volume: 31
Issue: 3
Pages: 284-310
ISSN: 0269-9648
Container-title: Probability in the Engineering and Informational Sciences
Language: en
Short-container-title: Prob. Eng. Inf. Sci.
Author: Burnetas Apostolos, Kanavetas Odysseas, Katehakis Michael N.
Abstract
We consider the multi-armed bandit problem under a cost constraint. Successive samples from each population are i.i.d. with unknown distribution, and each sample incurs a known population-dependent cost. The objective is to design an adaptive sampling policy to maximize the expected sum of n samples such that the average cost does not exceed a given bound sample-path wise. We establish an asymptotic lower bound for the regret of feasible uniformly fast convergent policies, and construct a class of policies that achieve the bound. We also provide their explicit form under Normal distributions with unknown means and known variances.
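The abstract does not give the explicit form of the authors' policies, so the following is only a rough illustration of the problem setup: a minimal sketch of a UCB-style index rule for Normal arms with known variances and per-sample costs, which plays the highest-index arm whose selection keeps the running average cost within the bound, falling back to the cheapest arm otherwise. The function name constrained_ucb, the bound C0, and the fallback rule are illustrative assumptions, not the paper's construction.

```python
import math
import random

def constrained_ucb(means, sigmas, costs, C0, horizon, seed=0):
    """Sketch of a cost-constrained UCB-style policy (illustrative only,
    not the paper's policy). Rewards are Normal(means[i], sigmas[i]) with
    known sigmas; each pull of arm i costs costs[i]; the running average
    cost should stay at or below C0."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    total_cost = 0.0
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: sample each arm once (the constraint may be
            # loose during this phase in this simplified sketch).
            arm = t - 1
        else:
            def index(i):
                # Inflated sample mean for a Normal arm with known variance.
                return sums[i] / counts[i] + sigmas[i] * math.sqrt(
                    2.0 * math.log(t) / counts[i])
            order = sorted(range(k), key=index, reverse=True)
            # Keep the average cost feasible sample-path wise: only consider
            # arms whose cost keeps the running average at or below C0.
            feasible = [i for i in order
                        if (total_cost + costs[i]) / t <= C0]
            arm = feasible[0] if feasible else min(
                range(k), key=lambda i: costs[i])
        reward = rng.gauss(means[arm], sigmas[arm])
        counts[arm] += 1
        sums[arm] += reward
        total_cost += costs[arm]
        total_reward += reward
    return total_reward, total_cost / horizon

if __name__ == "__main__":
    reward, avg_cost = constrained_ucb(
        means=[1.0, 1.5], sigmas=[1.0, 1.0], costs=[0.5, 2.0],
        C0=1.0, horizon=10_000)
    print(f"total reward {reward:.1f}, average cost {avg_cost:.3f}")
```

In this toy instance the better arm is also the more expensive one, so the policy must mix pulls of both arms to respect the average-cost bound; the paper studies exactly this trade-off and characterizes the asymptotically optimal regret.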
Publisher
Cambridge University Press (CUP)
Subject
Industrial and Manufacturing Engineering; Management Science and Operations Research; Statistics, Probability and Uncertainty; Statistics and Probability
Cited by: 4 articles.