Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Published: 2023-05-19 Issue: 2 Volume: 7 Page: 1-60
ISSN: 2476-1249
Container-title: Proceedings of the ACM on Measurement and Analysis of Computing Systems
Language: en
Short-container-title: Proc. ACM Meas. Anal. Comput. Syst.

Author:

Sam Tyler¹, Chen Yudong², Yu Christina Lee¹

Affiliation:

1. Cornell University, Ithaca, NY, USA

2. University of Wisconsin-Madison, Madison, WI, USA

Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an ε-optimal policy is Ω(|S||A|H/ε²) over worst case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in |S| and |A| due to the low rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon H to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of Õ((|S|+|A|) poly(d,H)/ε²) for a rank d setting, which is minimax optimal with respect to the scaling of |S|, |A|, and ε. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
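To illustrate why low rank structure suggests a sample complexity that scales with |S|+|A| rather than |S||A|, the following is a minimal sketch of recovering an exactly rank-d matrix from d of its rows and d of its columns. It is not LR-MCPI or LR-EVI: it assumes noiseless entries, a randomly generated matrix standing in for Q*, and randomly chosen anchor states and actions, whereas the abstract's setting involves noisy estimates obtained from a generative model. All names (`reconstruct_from_anchors`, `anchor_states`, `anchor_actions`) are hypothetical.

```python
import numpy as np


def reconstruct_from_anchors(q_cols, q_rows, q_core):
    """Rebuild a rank-d matrix from d anchor columns, d anchor rows,
    and the d x d block where they intersect.

    q_cols: (|S|, d) entries for every state at the anchor actions
    q_rows: (d, |A|) entries for the anchor states at every action
    q_core: (d, d)   entries for anchor states at anchor actions
    """
    # If q_core has rank d, then Q = q_cols @ inv(q_core) @ q_rows.
    return q_cols @ np.linalg.pinv(q_core) @ q_rows


rng = np.random.default_rng(0)
num_states, num_actions, rank = 50, 40, 3

# Assumption: a synthetic, exactly rank-3 matrix standing in for Q*.
Q = rng.standard_normal((num_states, rank)) @ rng.standard_normal((rank, num_actions))

# Assumption: anchors are picked uniformly at random for illustration.
anchor_states = rng.choice(num_states, size=rank, replace=False)
anchor_actions = rng.choice(num_actions, size=rank, replace=False)

q_cols = Q[:, anchor_actions]
q_rows = Q[anchor_states, :]
q_core = Q[np.ix_(anchor_states, anchor_actions)]

Q_hat = reconstruct_from_anchors(q_cols, q_rows, q_core)

# Distinct entries read: d rows plus d columns, minus the shared block.
entries_observed = rank * (num_states + num_actions) - rank**2
max_error = np.max(np.abs(Q - Q_hat))

print(f"entries observed: {entries_observed} of {num_states * num_actions}")
print(f"max reconstruction error: {max_error:.2e}")
```

The number of entries read grows as d(|S|+|A|), matching the linear dependence on |S| and |A| that the abstract describes as the desired scaling. With noisy entries, the reconstruction error depends on how well conditioned the anchor block is, which this sketch does not address.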

Funder

NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications; Hardware and Architecture; Safety, Risk, Reliability and Quality; Computer Science (miscellaneous)

Link

https://dl.acm.org/doi/pdf/10.1145/3589973

References: 57 articles.

1. Entrywise eigenvector analysis of random matrices with low expected rank; Abbe Emmanuel; Annals of Statistics, 2020

2. Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. 2020. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 20095--20107. https://proceedings.neurips.cc/paper/2020/file/e894d787e2fd6c133af47140aa156f00-Paper.pdf

3. Alekh Agarwal, Sham Kakade, and Lin F. Yang. 2020. Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal. In Proceedings of Thirty Third Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 125), Jacob Abernethy and Shivani Agarwal (Eds.). PMLR, 67--83.

4. Martin Anthony and Peter L. Bartlett. 2009. Neural Network Learning: Theoretical Foundations (1st ed.). Cambridge University Press, USA.

5. Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning Topic Models -- Going beyond SVD. 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (2012), 1--10.

Cited by 2 articles.

1. Tensor and Matrix Low-Rank Value-Function Approximation in Reinforcement Learning; IEEE Transactions on Signal Processing; 2024

2. Matrix Low-Rank Trust Region Policy Optimization; 2023 IEEE 9th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP); 2023-12-10