Mostly Exploration-Free Algorithms for Contextual Bandits

Published:2021-03 Issue:3 Volume:67 Page:1329-1349
ISSN:0025-1909
Container-title:Management Science
language:en

Author:

Bastani Hamsa¹, Bayati Mohsen², Khosravi Khashayar³

Affiliation:

1. Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

2. Stanford Graduate School of Business, Stanford University, Stanford, California 94305;

3. Stanford University Electrical Engineering, Stanford University, Stanford, California 94305

Abstract

The contextual bandit literature has traditionally focused on algorithms that address the exploration–exploitation tradeoff. In particular, greedy algorithms that exploit current estimates without any exploration may be suboptimal in general. However, exploration-free greedy algorithms are desirable in practical settings where exploration may be costly or unethical (e.g., clinical trials). Surprisingly, we find that a simple greedy algorithm can be rate optimal (achieves asymptotically optimal regret) if there is sufficient randomness in the observed contexts (covariates). We prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition we term covariate diversity. Furthermore, even absent this condition, we show that a greedy algorithm can be rate optimal with positive probability. Thus, standard bandit algorithms may unnecessarily explore. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only observed contexts and rewards to determine whether to follow a greedy algorithm or to explore. We prove that this algorithm is rate optimal without any additional assumptions on the context distribution or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces exploration and outperforms existing (exploration-based) contextual bandit algorithms such as Thompson sampling or upper confidence bound. This paper was accepted by J. George Shanthikumar, big data analytics.
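To make the exploration-free idea in the abstract concrete, the following is a minimal sketch of a purely greedy linear contextual bandit. It is not the authors' implementation and it does not implement Greedy-First; the linear reward model, the per-arm least-squares estimates, the tiny ridge term, and every name (`greedy_linear_bandit`, `true_betas`, `noise_sd`, `ridge`) are assumptions made for illustration.

```python
import numpy as np


def greedy_linear_bandit(contexts, true_betas, noise_sd=0.1, ridge=1e-6, seed=0):
    """Simulate a purely greedy K-armed linear contextual bandit.

    contexts:   (T, d) array of observed covariates.
    true_betas: (K, d) array of arm parameters (hypothetical; used only
                to simulate rewards and to measure regret).
    Returns the cumulative regret as an array of shape (T,).
    """
    rng = np.random.default_rng(seed)
    T, d = contexts.shape
    K = true_betas.shape[0]

    # Per-arm least-squares statistics: A = X^T X (+ small ridge), b = X^T y.
    # The ridge term is an assumption that keeps A invertible before any data.
    A = np.stack([ridge * np.eye(d) for _ in range(K)])
    b = np.zeros((K, d))

    per_step_regret = np.zeros(T)
    for t in range(T):
        x = contexts[t]

        # Current parameter estimates for every arm.
        beta_hat = np.stack([np.linalg.solve(A[k], b[k]) for k in range(K)])

        # Greedy step: play the arm with the highest estimated reward.
        # No exploration bonus, no randomization; ties go to the lowest index.
        arm = int(np.argmax(beta_hat @ x))

        expected = true_betas @ x
        reward = expected[arm] + noise_sd * rng.standard_normal()

        # Update only the arm that was played.
        A[arm] += np.outer(x, x)
        b[arm] += reward * x

        per_step_regret[t] = expected.max() - expected[arm]

    return np.cumsum(per_step_regret)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 3))  # Gaussian contexts, chosen for illustration
    betas = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
    print(greedy_linear_bandit(X, betas)[-1])
```

In this sketch, the randomness of the contexts is the only source of information about arms the current estimates do not favor, which is the intuition behind the abstract's claim about sufficiently random covariates.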

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Subject

Management Science and Operations Research, Strategy and Management

References (30 articles)

1. Dynamic Pricing Under a General Parametric Choice Model

2. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs

3. Simultaneously Learning and Optimizing Using Controlled Variance Pricing

Cited by 74 articles.

1. Incentive-Aware Recommender Systems in Two-Sided Markets;ACM Transactions on Recommender Systems;2024-07-31

2. Investigating Consumers’ Purchase Resistance Behavior to AI-Based Content Recommendations on Short-Video Platforms: A Study of Greedy and Biased Recommendations;Journal of Internet Commerce;2024-07-02

3. Optimizing contextual bandit hyperparameters: A dynamic transfer learning-based framework;INT J IND ENG COMP;2024

4. A systematic literature review of solutions for cold start problem;International Journal of System Assurance Engineering and Management;2024-05-14

5. The (Surprising) Sample Optimality of Greedy Procedures for Large-Scale Ranking and Selection;Management Science;2024-05-07