Abstract
We consider two variants of the upper confidence bound (UCB) strategy for Gaussian two-armed bandits whose arm rewards have unknown expected values and unknown variances. It is demonstrated that the expected regret of both strategies is a continuous function of the reward variance. A set of Monte Carlo simulations was performed to examine how the quality of the variance estimate affects the losses. The regret is shown to grow only slightly even when the estimation error is fairly large, which makes it possible to estimate the variance during the initial steps of the control and to stop the estimation afterwards.
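The abstract does not give the exact form of the UCB indices analyzed in the paper. The sketch below is a minimal illustration only: it assumes the standard sample-mean-plus-confidence-bonus index with a plug-in variance estimate, and the `var_est_steps` parameter is a hypothetical knob illustrating the "estimate the variance early, then stop" idea; the regret is approximated by Monte Carlo simulation as in the experiments described above.

```python
import numpy as np

def ucb_two_armed(means, sigmas, horizon, var_est_steps=None, rng=None):
    """Run one UCB trajectory on a Gaussian two-armed bandit.

    A generic UCB variant, not necessarily the one analyzed in the
    paper: each arm's index is its sample mean plus a confidence term
    scaled by the estimated standard deviation. If var_est_steps is
    given, the variance estimate of an arm is frozen after that many
    pulls, mimicking the "estimate early, then stop" idea.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(2, dtype=int)
    sums = np.zeros(2)
    sq_sums = np.zeros(2)
    var_hat = np.ones(2)            # running variance estimates
    frozen = [False, False]

    for t in range(horizon):
        if t < 2:                   # pull each arm once to initialize
            arm = t
        else:
            mean_hat = sums / counts
            bonus = np.sqrt(2.0 * var_hat * np.log(t + 1) / counts)
            arm = int(np.argmax(mean_hat + bonus))
        x = rng.normal(means[arm], sigmas[arm])
        counts[arm] += 1
        sums[arm] += x
        sq_sums[arm] += x * x
        if not frozen[arm] and counts[arm] >= 2:
            m = sums[arm] / counts[arm]
            v = (sq_sums[arm] - counts[arm] * m * m) / (counts[arm] - 1)
            var_hat[arm] = max(v, 1e-12)
            if var_est_steps is not None and counts[arm] >= var_est_steps:
                frozen[arm] = True
    # pseudo-regret: pulls of the inferior arm times the mean gap
    gap = abs(means[0] - means[1])
    worse = int(np.argmin(means))
    return counts[worse] * gap

if __name__ == "__main__":
    # Monte Carlo estimate of the expected regret over many runs.
    rng = np.random.default_rng(0)
    runs = [ucb_two_armed((0.0, 0.2), (1.0, 1.0), horizon=5000,
                          var_est_steps=50, rng=rng) for _ in range(200)]
    print("mean regret:", np.mean(runs))
```

Varying `var_est_steps` (or the true `sigmas` against a deliberately misspecified initial `var_hat`) in such a simulation is one way to observe the effect the abstract reports: the averaged regret changes only mildly even for fairly coarse variance estimates.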
Subject
General Physics and Astronomy