Tabular and Deep Learning for the Whittle Index

Published: 2024-08-13 Issue: 3 Volume: 9 Page: 1-21
ISSN: 2376-3639
Container-title: ACM Transactions on Modeling and Performance Evaluation of Computing Systems
Language: en
Short-container-title: ACM Trans. Model. Perform. Eval. Comput. Syst.

Author:

Robledo Relaño Francisco¹, Borkar Vivek², Ayesta Urtzi³, Avrachenkov Konstantin⁴

Affiliation:

1. UPV/EHU, Bilbao, Spain and UPPA, Pau, France

2. Indian Institute of Technology, Mumbai, India

3. Institut de Recherche en Informatique de Toulouse, Toulouse, France, UPV/EHU, Donostia, Spain and Ikerbasque, Bilbao, Spain

4. Inria, Sophia Antipolis, France

Abstract

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this article, we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales: a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result, we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm that uses neural networks to compute the Q-values on the faster time-scale; it is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network-based approximate Q-learning, and other state-of-the-art algorithms.
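
The two time-scale idea described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' QWI implementation: the function name `two_timescale_whittle`, the step-size schedules, the uniform exploration rule, and the exact form of the updates are assumptions chosen for illustration. The sketch keeps one Q-table per reference state `x`, updates it quickly, and slowly adjusts a passivity subsidy `lam[x]` toward the value that makes both actions equally attractive in state `x`.

```python
import numpy as np

def two_timescale_whittle(P0, P1, r0, r1, gamma=0.9, steps=20000, seed=0):
    """Illustrative two time-scale index learning for a single restless arm.

    P0, P1: (S, S) transition matrices under the passive (0) / active (1) action.
    r0, r1: length-S rewards under the passive / active action.
    Returns (lam, Q): index estimates lam[x] and Q-tables Q[x, s, a].
    """
    rng = np.random.default_rng(seed)
    P = np.stack([np.asarray(P0, dtype=float), np.asarray(P1, dtype=float)])  # P[a, s, s']
    r = np.stack([np.asarray(r0, dtype=float), np.asarray(r1, dtype=float)])  # r[a, s]
    S = r.shape[1]
    Q = np.zeros((S, S, 2))   # one Q-table per reference state x
    lam = np.zeros(S)         # subsidy (index estimate) per reference state x
    idx = np.arange(S)
    s = 0
    for n in range(1, steps + 1):
        alpha = 1.0 / n ** 0.6   # faster schedule (assumed form)
        beta = 1.0 / n           # slower schedule, beta / alpha -> 0 (assumed form)
        a = int(rng.integers(2))              # uniform exploration (assumption)
        s_next = int(rng.choice(S, p=P[a, s]))
        # Fast time-scale: Q-learning step for every reference state x at once.
        # The passive action earns the extra subsidy lam[x].
        reward = r[a, s] + (1 - a) * lam                     # shape (S,)
        target = reward + gamma * Q[:, s_next, :].max(axis=1)
        Q[:, s, a] += alpha * (target - Q[:, s, a])
        # Slow time-scale: nudge lam[x] so that Q[x, x, 1] == Q[x, x, 0].
        lam += beta * (Q[idx, idx, 1] - Q[idx, idx, 0])
        s = s_next
    return lam, Q

if __name__ == "__main__":
    # Toy arm (hypothetical): two states that alternate regardless of the action.
    P = [[0.0, 1.0], [1.0, 0.0]]
    lam, _ = two_timescale_whittle(P, P, r0=[0.0, 0.0], r1=[-1.0, 1.0], gamma=0.5)
    print(lam)
```

In the toy arm the passive and active transitions are identical (an assumption made only to obtain a simple check), so the action affects only the immediate reward and the subsidy that equalizes the two actions in state `x` is `r1[x] - r0[x]`. The learned estimates can be compared against that value.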

Funder

Department of Education of the Basque Government through the Consolidated Research Group MATHMODE

French “Agence Nationale de la Recherche (ANR)”

S. S. Bhatnagar Fellowship from Council of Scientific and Industrial Research, Government of India and Google Research

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3670686

Reference: 28 articles.

1. Learning Algorithms for Markov Decision Processes with Average Cost

2. Whittle index based Q-learning for restless bandits with average reward

3. Stochastic Approximation

4. Q-Learning for Bandit Problems

5. Towards Q-learning the Whittle Index for Restless Bandits