1. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems. 2312–2320.
2. Shipra Agrawal and Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning. 127–135.
3. Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, 397–422.
4. Peter Auer and Ronald Ortner. 2010. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61, 1–2, 55–65.