Optimal Learning for Structured Bandits

Author:

Van Parys Bart (1), Golrezaei Negin (1)

Affiliation:

1. Massachusetts Institute of Technology, Sloan School of Management, Cambridge, Massachusetts 02142

Abstract

We study structured multiarmed bandits, the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain convex structural information regarding the reward distributions; that is, the decision-maker knows that the reward distributions of the arms belong to a convex compact set. In the presence of such structural information, the decision-maker would like to minimize regret by exploiting this information, where the regret is the decision-maker's performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer minimal regret. However, as recently pointed out by Russo and Van Roy (2018) and Lattimore and Szepesvari (2017), neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call “DUSA,” whose regret matches the information-theoretic regret lower bound up to a constant factor and which can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. We show that this idea leads to the first computationally viable learning policy with asymptotically minimal regret for various structural information, including well-known structured bandits such as linear, Lipschitz, and convex bandits, as well as novel structured bandits that have not been studied in the literature because of the lack of a unified and flexible framework.

Funding: N. Golrezaei was supported in part by the Young Investigator Program (YIP) Award from the Office of Naval Research (ONR) [Grant N00014-21-1-2776] and the MIT Research Support Award.

This paper was accepted by Chung Piaw Teo, Optimization.
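To make the notion of regret in the abstract concrete, the following is a minimal sketch of the classical UCB baseline in the unstructured setting. It is not DUSA and does not exploit any structural information. The function name `run_ucb1`, the Bernoulli reward model, the arm means, and the horizon are all assumptions chosen for illustration, and the exploration bonus is the standard UCB1 form.

```python
import math
import random


def run_ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms (hypothetical helper, for illustration).

    Returns the pull count of each arm and the cumulative expected regret
    against a benchmark that always plays the arm with the highest mean.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm
    best_mean = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull each arm once so every empirical mean is defined.
            arm = t - 1
        else:
            # Empirical mean plus a bonus that shrinks as an arm is pulled.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Regret grows by the gap between the best mean and the pulled arm's mean.
        regret += best_mean - means[arm]

    return counts, regret


# Assumed example instance: three Bernoulli arms, 5000 rounds.
counts, regret = run_ucb1(means=[0.2, 0.5, 0.8], horizon=5000)
print("pulls per arm:", counts)
print("cumulative expected regret:", round(regret, 2))
```

Because this baseline treats the arms as unrelated, every suboptimal arm has to be explored on its own; the abstract's point is that known structure linking the arms can reduce that exploration.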

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Subject

Management Science and Operations Research, Strategy and Management
