Approximation Benefits of Policy Gradient Methods with Aggregated States

Published: 2023-11
Issue: 11
Volume: 69
Pages: 6898-6911
ISSN: 0025-1909
Container-title: Management Science
Language: en

Author:

Daniel Russo¹

Affiliation:

1. Graduate School of Business, Columbia University, New York, New York 10027

Abstract

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, in which the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per period is bounded by ϵ, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ϵ/(1 − γ), where γ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust.

This paper was accepted by Hamid Nazerzadeh, data science.

Supplemental Material: Data are available at https://doi.org/10.1287/mnsc.2023.4788.
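
As an illustration of the quantity ϵ described in the abstract, the following is a minimal sketch that computes the largest within-partition difference for one fixed table of state-action values. It rests on assumptions that are not taken from the paper: the function name `aggregation_gap`, the toy values, and the choice to compare values of the same action across states in a common partition cell are all hypothetical, and the paper's formal definition may differ (for example, by ranging over the value functions of many policies).

```python
from collections import defaultdict


def aggregation_gap(q, partition):
    """Largest difference between Q(s, a) and Q(s', a) over states s, s'
    that share a partition cell (assumed reading of the abstract's epsilon).

    q         -- list of rows, one per state; each row holds one value per action
    partition -- list giving the partition cell label of each state
    """
    # Group the rows of Q by the partition cell of their state.
    cells = defaultdict(list)
    for state, cell in enumerate(partition):
        cells[cell].append(q[state])

    gap = 0.0
    for rows in cells.values():
        # zip(*rows) yields, for each action, its values across the cell's states.
        for action_values in zip(*rows):
            gap = max(gap, max(action_values) - min(action_values))
    return gap


# Hypothetical toy problem: 4 states, 2 actions, states {0, 1} and {2, 3} aggregated.
q_table = [
    [1.0, 0.0],
    [1.2, 0.1],
    [3.0, 2.0],
    [3.0, 2.5],
]
cell_of_state = [0, 0, 1, 1]

epsilon = aggregation_gap(q_table, cell_of_state)
print(epsilon)
```

In this toy table the second cell dominates, because its two states disagree most on the value of the second action. A coarser partition can only keep this gap the same or make it larger.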

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Subject

Management Science and Operations Research, Strategy and Management

Link

https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2023.4788

References (38 articles)

1. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

2. Aggregation in Dynamic Programming

3. First-Order Methods in Optimization

4. Adaptive aggregation methods for infinite horizon dynamic programming

Cited by 1 article

1. Multi-Timescale Ensemble $Q$-Learning for Markov Decision Process Policy Optimization; IEEE Transactions on Signal Processing; 2024