MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

Published:2014-10-10 Issue:1 Volume:29 Page:51-76
ISSN:0269-9648
Container-title:Probability in the Engineering and Informational Sciences
language:en
Short-container-title:Prob. Eng. Inf. Sci.

Author:

Cowan Wesley,Katehakis Michael N.

Abstract

Generally, the multi-armed bandit problem has been studied under the setting that at each time step over an infinite horizon a controller chooses to activate a single process or bandit out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process, and in doing so advancing the chosen process. Classically, rewards are discounted by a constant factor β∈(0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as “decisions over state space” but rather “decisions over time”.
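
For orientation, the classical special case mentioned in the abstract (a finite-state Markov bandit with a constant discount factor β) admits a well-known restart-in-state characterization of the index: the index of a state equals (1 − β) times the optimal value of a problem in which, at every step, one may either continue from the current state or restart from that state. The sketch below illustrates only this classical case by plain value iteration; it is not the paper's generalized method for non-uniform discounting or stochastic commitment periods. The function name `restart_in_state_index` and the inputs `P`, `r`, and `beta` are hypothetical names chosen for the illustration.

```python
def restart_in_state_index(P, r, beta, state, tol=1e-10, max_iter=100_000):
    """Classical restart-in-state index of `state` for one finite Markov bandit.

    Assumptions (illustrative, not taken from the paper):
      P    -- row-stochastic transition matrix as a list of lists
      r    -- list of one-step rewards, one per state
      beta -- constant discount factor in (0, 1)
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        # Value of continuing the process from each state j.
        cont = [r[j] + beta * sum(P[j][k] * V[k] for k in range(n))
                for j in range(n)]
        # Value of restarting: behave as if currently in `state`.
        restart = cont[state]
        new_V = [max(c, restart) for c in cont]
        converged = max(abs(a - b) for a, b in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    # Scale by (1 - beta) so the index is in units of reward per period.
    return (1.0 - beta) * V[state]


# Two-state example: state 0 pays 0 and moves to state 1,
# which pays 1 forever (absorbing).
P = [[0.0, 1.0],
     [0.0, 1.0]]
r = [0.0, 1.0]
print(restart_in_state_index(P, r, 0.9, 0))  # approximately 0.9
print(restart_in_state_index(P, r, 0.9, 1))  # approximately 1.0
```

In this toy example the index of state 0 is approximately β, since the best the controller can do from state 0 is wait one period and then collect the unit reward indefinitely.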

Publisher

Cambridge University Press (CUP)

Subject

Industrial and Manufacturing Engineering,Management Science and Operations Research,Statistics, Probability and Uncertainty,Statistics and Probability

References: 52 articles.

1. Replacement of periodically inspected equipment. (An optimal optional stopping rule)

2. INDEXABILITY OF BANDIT PROBLEMS WITH RESPONSE DELAYS

3. Optimal Adaptive Policies for Sequential Allocation Problems

4. Sequential choice from several populations.

5. Filippi S. , Cappé O. & Garivier A. (2010). Optimism in reinforcement learning and Kullback–Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing, pp. 115–122, Monticello, IL: IEEE.

Cited by 11 articles.

1. Optimal activation of halting multi‐armed bandit models;Naval Research Logistics (NRL);2023-08-16

2. Scheduling in wireless networks with spatial reuse of spectrum as restless bandits;Performance Evaluation;2021-09

3. A General Theory of MultiArmed Bandit Processes with Constrained Arm Switches;SIAM Journal on Control and Optimization;2021-01

4. A Verification Theorem for Threshold-Indexability of Real-State Discounted Restless Bandits;Mathematics of Operations Research;2020-05

5. Reinforcement learning: a comparison of UCB versus alternative adaptive policies;First Congress of Greek Mathematicians;2020-03-23