1. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning. PMLR, 1638–1646.
2. Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32, 1 (2002), 48–77.
3. Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. 2013. Sequential transfer in multi-armed bandit with finite set of models. In Proceedings of the 26th International Conference on Neural Information Processing Systems. 2220–2228.