Affiliation:
1. School of Mathematics, Institute for Advanced Study, Princeton, New Jersey 08540;
2. Microsoft Research, New York, New York 10012
Abstract
We consider “incentivized exploration”: a version of multiarmed bandits where the choice of arms is controlled by self-interested agents. The algorithm can only issue recommendations and needs to incentivize the agents to explore, even though they prefer to exploit. The algorithm controls the flow of information and relies on information asymmetry to create incentives. This line of work, motivated by misaligned incentives in recommendation systems, has received considerable attention in the “economics and computation” community. We focus on the “price of incentives”: the loss in performance, broadly construed, incurred for the sake of incentive compatibility. We prove that Thompson sampling, a standard bandit algorithm, is incentive compatible if initialized with sufficiently many data points. The performance loss because of incentives is, therefore, limited to the initial rounds when these data points are collected. The problem largely reduces to one of sample complexity: how many rounds are needed to collect even one sample of each arm? We zoom in on this question, perhaps the most basic one could ask about incentivized exploration. We characterize the dependence on agents' beliefs and the number of arms (which was essentially ignored in prior work), providing matching upper and lower bounds. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the “strength of beliefs.”
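To make the warm-started Thompson sampling idea concrete, here is a minimal sketch assuming a Beta-Bernoulli bandit model; the names `true_means` and `n_init` are illustrative placeholders, and the value of `n_init` is arbitrary here, whereas the paper's incentive-compatibility guarantee requires a number of initial samples that depends on the agents' prior beliefs.

```python
import numpy as np

def thompson_sampling(true_means, n_init, horizon, rng=None):
    """Beta-Bernoulli Thompson sampling, warm-started with n_init pulls
    of each arm before the recommendation phase begins."""
    rng = np.random.default_rng(rng)
    k = len(true_means)
    successes = np.zeros(k)
    failures = np.zeros(k)

    # Initial data collection: pull every arm n_init times.
    # (In the incentivized-exploration setting, this phase is where the
    # "price of incentives" is paid; the required n_init depends on the
    # strength of the agents' beliefs.)
    for arm in range(k):
        wins = (rng.random(n_init) < true_means[arm]).sum()
        successes[arm] += wins
        failures[arm] += n_init - wins

    rewards = []
    for _ in range(horizon):
        # Sample a mean estimate for each arm from its Beta posterior
        # and recommend the arm with the highest sampled value.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        rewards.append(reward)
    return np.mean(rewards)

# Example: 4 arms, 20 warm-start samples per arm (illustrative value only).
print(thompson_sampling([0.3, 0.5, 0.6, 0.7], n_init=20, horizon=10_000, rng=0))
```

After the warm-start phase, the recommendation loop is standard Thompson sampling; the performance loss relative to an unconstrained bandit algorithm is confined to the initial `n_init * k` rounds, matching the abstract's framing of the problem as one of sample complexity.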
Publisher
Institute for Operations Research and the Management Sciences (INFORMS)
Subject
Management Science and Operations Research, Computer Science Applications
Cited by
3 articles.