Authors:
Caro, Felipe; Yoo, Onesun Steve
Abstract
This article considers an important class of discrete-time restless bandits: discounted multiarmed bandit problems with response delays. The delays in each period are independent random variables, and the delayed responses do not cross over. For a bandit arm in this class, we use a coupling argument to show that in each state there is a unique subsidy that equates the pulling and nonpulling actions (i.e., the bandit satisfies the indexability criterion introduced by Whittle (1988)). The result holds for both finite and infinite horizons, arbitrary delay lengths, and infinite state spaces. We compute the resulting marginal productivity indexes (MPI) for the Beta-Bernoulli Bayesian learning model, formulate and compute a tractable upper bound, and compare the suboptimality gap of the MPI policy to those of other heuristics derived from different closed-form indexes. The MPI policy performs near optimally and provides a theoretical justification for the use of the other heuristics.
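The core of the indexability argument is a per-period subsidy W for the passive (nonpulling) action that exactly balances the two choices in a given state. The sketch below illustrates that calibration for a single Beta-Bernoulli arm; it is written for this page rather than taken from the paper (the helper name `whittle_index_beta_bernoulli` is hypothetical), it truncates the horizon, and it omits the response delays that are the paper's focus. It bisects on the subsidy until pulling and not pulling have equal value.

```python
def whittle_index_beta_bernoulli(a, b, gamma=0.9, horizon=50, tol=1e-6):
    """Subsidy (Whittle-style) index of one Beta-Bernoulli arm in posterior
    state (a, b) = (prior + observed successes, prior + observed failures).
    Horizon-truncated sketch; the paper's response delays are not modeled."""

    def q_values(W):
        # Backward induction over posterior states reachable from (a, b):
        # (s, f) counts additional successes/failures, so s + f <= t at period t.
        V = {(s, f): 0.0 for s in range(horizon + 2)
             for f in range(horizon + 2 - s)}           # terminal value-to-go
        q_pull0 = q_pass0 = 0.0
        for t in range(horizon - 1, -1, -1):
            V_new = {}
            for s in range(t + 1):
                for f in range(t + 1 - s):
                    p = (a + s) / (a + s + b + f)        # posterior mean success prob.
                    q_pull = (p * (1.0 + gamma * V[(s + 1, f)])
                              + (1.0 - p) * gamma * V[(s, f + 1)])
                    q_pass = W + gamma * V[(s, f)]       # passive action earns the subsidy
                    V_new[(s, f)] = max(q_pull, q_pass)
                    if t == 0:
                        q_pull0, q_pass0 = q_pull, q_pass
            V = V_new
        return q_pull0, q_pass0

    # Indexability: the pull-minus-pass value crosses zero at a unique subsidy,
    # so a simple bisection recovers it.
    lo, hi = 0.0, 1.0                                    # Bernoulli rewards keep the index in [0, 1]
    while hi - lo > tol:
        W = 0.5 * (lo + hi)
        q_pull, q_pass = q_values(W)
        if q_pull > q_pass:
            lo = W                                       # subsidy too small: pulling still preferred
        else:
            hi = W
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # A fresh Beta(1, 1) arm: the index exceeds the posterior mean of 0.5
    # because pulling also buys information.
    print(whittle_index_beta_bernoulli(1, 1))
```

In the delayed-response setting studied in the paper, the state would additionally carry the pipeline of outstanding observations, but the subsidy-calibration idea is the same.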
Publisher
Cambridge University Press (CUP)
Subject
Industrial and Manufacturing Engineering; Management Science and Operations Research; Statistics, Probability and Uncertainty; Statistics and Probability
References: 24 articles.
1. On an index policy for restless bandits
2. One-armed bandit models with continuous and delayed responses
3. New adaptive designs for delayed response models
4. Gittins, J.C. (1989). Multi-armed bandit allocation indices. Chichester, UK: John Wiley.
5. Eick (1988). Gittins procedures for bandits with delayed responses. Journal of the Royal Statistical Society B.
Cited by: 7 articles.