Abstract
Given a finite number of different experiments with unknown probabilities p1, p2, …, pk of success, the multi-armed bandit problem is concerned with maximising the expected number of successes in a sequence of trials. There are many policies which ensure that, in the long run, the proportion of successes converges to p = max(p1, p2, …, pk). This property is established for a class of decision procedures which rely on randomisation, at each stage, in selecting the experiment for the next trial. Further, it is suggested that some of these procedures might perform well over any finite sequence of trials.
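The kind of policy the abstract describes can be illustrated with a minimal sketch. The code below is not the paper's specific procedure; it is a standard epsilon-greedy policy, which likewise randomises the choice of experiment at each stage (with probability epsilon it picks an arm uniformly at random, otherwise the arm with the best empirical success rate). The function name `run_bandit` and all parameter values are illustrative assumptions.

```python
import random

def run_bandit(probs, trials, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy randomised policy on Bernoulli arms.

    probs   -- true (unknown to the policy) success probabilities p1..pk
    trials  -- number of sequential trials
    epsilon -- probability of exploring a uniformly random arm each stage
    Returns the overall proportion of successes.
    """
    rng = random.Random(seed)
    k = len(probs)
    pulls = [0] * k   # times each arm was tried
    wins = [0] * k    # successes observed on each arm
    successes = 0
    for _ in range(trials):
        if rng.random() < epsilon:
            # randomisation step: explore an arm chosen at random
            arm = rng.randrange(k)
        else:
            # exploit: arm with best empirical rate; untried arms are
            # treated optimistically (rate 1.0) so every arm is sampled
            arm = max(range(k),
                      key=lambda i: wins[i] / pulls[i] if pulls[i] else 1.0)
        pulls[arm] += 1
        if rng.random() < probs[arm]:
            wins[arm] += 1
            successes += 1
    return successes / trials
```

With a fixed exploration rate the proportion of successes approaches roughly (1 - epsilon) * max(p) plus a small exploration term, rather than max(p) itself; policies of the kind the paper studies let the randomisation diminish so that the proportion converges to max(p) exactly.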
Publisher
Cambridge University Press (CUP)
Subject
Applied Mathematics, Statistics and Probability
Cited by
15 articles.