A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation

Published:2021-05 Issue:3 Volume:69 Page:950-973
ISSN:0030-364X
Container-title:Operations Research
language:en

Author:

Bhandari Jalaj¹, Russo Daniel², Singal Raghav¹

Affiliation:

1. Operations Research, Columbia University, New York, New York 10027;

2. Graduate School of Business, Columbia University, New York, New York 10027

Abstract

Temporal difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite time convergence rates for TD learning with linear function approximation. The analysis follows using a key insight that establishes rigorous connections between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case when TD is applied to a single Markovian data stream where the algorithm’s updates can be severely biased. Their analysis seamlessly extends to the study of TD learning with eligibility traces and Q-learning for high-dimensional optimal stopping problems.
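As a rough illustration of the algorithm the abstract describes, the sketch below runs TD(0) with linear function approximation under the i.i.d. observation model. It is a minimal sketch, not the authors' code: the two-state Markov reward process, the feature vectors, the constant step size, and the tail averaging of iterates are all illustrative assumptions, and the names `td0`, `value`, and `bellman_residual` are hypothetical. Because the assumed features span every value function on two states, the TD fixed point here coincides with the true value function, so the Bellman residual of the averaged iterate should be close to zero.

```python
import random

# Hypothetical two-state Markov reward process (all numbers are illustrative).
P = [[0.9, 0.1], [0.2, 0.8]]      # transition probabilities P[s][s_next]
R = [1.0, 0.0]                    # reward collected in state s
PI = [2.0 / 3.0, 1.0 / 3.0]       # stationary distribution of P
GAMMA = 0.5                       # discount factor
PHI = [[1.0, 0.0], [1.0, 1.0]]    # assumed feature vector phi(s) per state


def value(theta, s):
    """Linear value estimate V_theta(s) = phi(s)^T theta."""
    return sum(f * w for f, w in zip(PHI[s], theta))


def td0(num_steps=50000, step_size=0.05, seed=0):
    """Run TD(0) on i.i.d. samples; return the average of the last half of the iterates."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    avg = [0.0, 0.0]
    burn_in = num_steps // 2
    for t in range(num_steps):
        # i.i.d. model: draw s from the stationary distribution, then s_next from P[s].
        s = 0 if rng.random() < PI[0] else 1
        s_next = 0 if rng.random() < P[s][0] else 1
        # Temporal difference error for the observed transition.
        delta = R[s] + GAMMA * value(theta, s_next) - value(theta, s)
        # Stochastic update along the feature vector, similar in form to a gradient step.
        theta = [w + step_size * delta * f for w, f in zip(theta, PHI[s])]
        if t >= burn_in:
            avg = [a + w / (num_steps - burn_in) for a, w in zip(avg, theta)]
    return avg


def bellman_residual(theta):
    """Largest |R(s) + GAMMA * E[V(s_next)] - V(s)| over states; zero at the true values."""
    worst = 0.0
    for s in range(2):
        expected_next = sum(P[s][n] * value(theta, n) for n in range(2))
        worst = max(worst, abs(R[s] + GAMMA * expected_next - value(theta, s)))
    return worst


if __name__ == "__main__":
    theta_bar = td0()
    print("averaged parameters:", theta_bar)
    print("Bellman residual:", bellman_residual(theta_bar))
```

Averaging only the later iterates is a common variance-reduction device chosen here for simplicity; the step-size schedules and averaging schemes analyzed in the paper may differ.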

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Subject

Management Science and Operations Research, Computer Science Applications

Cited by 18 articles.

1. Decentralized Adaptive TD(λ) Learning With Linear Function Approximation: Nonasymptotic Analysis;IEEE Transactions on Systems, Man, and Cybernetics: Systems;2024-08

2. High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation;IEEE Transactions on Information Theory;2024-08

3. Finite-Time Analysis of Asynchronous Multi-Agent TD Learning;2024 American Control Conference (ACC);2024-07-10

4. On the Convergence of TD-Learning on Markov Reward Processes with Hidden States;2024 European Control Conference (ECC);2024-06-25

5. Finite-Time High-Probability Bounds for Polyak–Ruppert Averaged Iterates of Linear Stochastic Approximation;Mathematics of Operations Research;2024-04-16