Affiliation:
1. Graduate School of Business, Stanford University, Stanford, California 94305
Abstract
We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multiarmed bandit problems. In an experiment with n time steps, we let the mean reward gaps between actions scale to the order 1/√n to preserve the difficulty of the learning task as n grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments—adapted to this scaling regime and with arm selection probabilities that vary continuously with state—converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive refined, instance-specific characterization of stochastic dynamics and to obtain several insights on the regret and belief evolution of a number of sequential experiments including Thompson sampling (but not upper-confidence bound, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from suboptimal regret performance when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including with large reward gaps, but these good regret properties come at the cost of highly unstable posterior beliefs. This paper was accepted by Baris Ata, stochastic models and simulation. Supplemental Material: The data and online appendix are available at https://doi.org/10.1287/mnsc.2023.4964.
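As a rough illustration of the weak-signal scaling regime described above (not taken from the paper), the following Python sketch simulates two-armed Gaussian Thompson sampling with a mean reward gap of order 1/√n; the function name, gap scale, prior variance, and unit noise level are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def thompson_sampling_weak_signal(n, delta_scale=1.0, prior_var=1.0, seed=0):
    """Two-armed Gaussian Thompson sampling with a reward gap of order 1/sqrt(n).

    Illustrative sketch only: parameter choices are assumptions, not the
    paper's specification. Rewards have unit observation noise.
    """
    rng = np.random.default_rng(seed)
    gap = delta_scale / np.sqrt(n)        # reward gap scales as 1/sqrt(n)
    true_means = np.array([gap, 0.0])     # arm 0 is better by `gap`
    reward_sum = np.zeros(2)              # sufficient statistics per arm
    pulls = np.zeros(2)
    regret = 0.0
    for _ in range(n):
        # Gaussian posterior under a N(0, prior_var) prior and unit noise:
        # precision adds up, posterior mean is shrunk sample mean.
        post_var = 1.0 / (1.0 / prior_var + pulls)
        post_mean = post_var * reward_sum
        draw = rng.normal(post_mean, np.sqrt(post_var))  # posterior sample per arm
        arm = int(np.argmax(draw))
        reward = rng.normal(true_means[arm], 1.0)
        reward_sum[arm] += reward
        pulls[arm] += 1
        regret += true_means.max() - true_means[arm]
    return regret, pulls

if __name__ == "__main__":
    # Under this scaling, per-step regret shrinks as n grows, keeping the
    # learning problem comparably hard across horizons.
    for n in [1_000, 10_000, 100_000]:
        reg, pulls = thompson_sampling_weak_signal(n)
        print(f"n={n:>7}  total regret={reg:.3f}  pulls={pulls}")
```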
Publisher
Institute for Operations Research and the Management Sciences (INFORMS)
Subject
Management Science and Operations Research, Strategy and Management