Affiliation:
1. College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China
2. State Key Laboratory of Astronautic Dynamics, Xi'an Satellite Control Center, Xi'an, China
Abstract
Policy evaluation (PE) is a critical sub-problem in reinforcement learning: it estimates the value function for a given policy and can be used for policy improvement. However, current PE methods still suffer from limitations such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST2D) is proposed. In LST2D, an adaptive truncation mechanism is designed that effectively combines the fast convergence of Least-Squares Temporal Difference learning (LSTD) with the asymptotic convergence of Temporal Difference learning (TD). Two feature pre-training methods are then utilised to improve the approximation ability of LST2D. Furthermore, an Actor-Critic algorithm based on LST2D and pre-trained feature representations (ACLPF) is proposed, in which LST2D is integrated into the critic network to improve learning-prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the results illustrate the effectiveness of LST2D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, demonstrating that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
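To make the LSTD/TD combination concrete, below is a minimal sketch of a hybrid policy-evaluation step under linear value approximation V(s) ≈ w·φ(s). Note the truncation rule used here (switching from LSTD to TD after a fixed sample count, `truncate_after`) is a hypothetical placeholder: the abstract does not specify the paper's adaptive truncation mechanism, and the class name and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class HybridLSTDTD:
    """Illustrative LSTD-then-TD value estimator (not the paper's LST2D)."""

    def __init__(self, n_features, gamma=0.99, alpha=0.01, truncate_after=500):
        self.gamma = gamma                    # discount factor
        self.alpha = alpha                    # TD step size
        self.truncate_after = truncate_after  # hypothetical fixed switch point
        self.t = 0
        self.w = np.zeros(n_features)
        # LSTD statistics: A ~ E[phi(s)(phi(s) - gamma*phi(s'))^T], b ~ E[r*phi(s)]
        self.A = np.eye(n_features) * 1e-3    # small ridge term for invertibility
        self.b = np.zeros(n_features)

    def update(self, phi_s, reward, phi_next):
        self.t += 1
        if self.t <= self.truncate_after:
            # LSTD phase: accumulate statistics and solve w = A^{-1} b
            # (fast, data-efficient convergence early in learning)
            self.A += np.outer(phi_s, phi_s - self.gamma * phi_next)
            self.b += reward * phi_s
            self.w = np.linalg.solve(self.A, self.b)
        else:
            # TD(0) phase: incremental update from the LSTD warm start
            # (cheap per-step cost, asymptotic convergence)
            td_error = reward + self.gamma * phi_next @ self.w - phi_s @ self.w
            self.w += self.alpha * td_error * phi_s

    def value(self, phi_s):
        return phi_s @ self.w
```

The design intuition is that LSTD extracts more information per sample early on (at O(d^2) cost per step), while switching to TD afterwards keeps the long-run per-step cost low; an adaptive switch, as the paper proposes, would replace the fixed `truncate_after` threshold.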
Funder
National Natural Science Foundation of China
Publisher
Institution of Engineering and Technology (IET)
Subject
Artificial Intelligence, Computer Networks and Communications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Information Systems
Cited by
2 articles.