PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Authors:

Li Shilei¹, Li Meng², Su Jiongming³, Chen Shaofei³, Yuan Zhimin¹, Ye Qing¹

Affiliation:

1. Department of Information Security, Naval University of Engineering, Wuhan, China

2. Army Academy of Artillery and Air Defense, Hefei, China

3. College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China

Abstract

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional state and action spaces. Recently, a promising line of work has combined exploration in the action space with exploration in the parameter space to get the best of both approaches. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores directly in the parameter space in a gradient-free manner, with the actor-critic deep deterministic policy gradient (DDPG) algorithm, which explores in the action space in a gradient-based manner, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter-perturbation part) evolve in a guided manner by exploiting the gradient information provided by DDPG, while the policy-gradient part (DDPG) is used only to fine-tune the best individual in the EA population, improving sample efficiency. In particular, we propose a criterion for determining how many training steps DDPG should take, which ensures that useful gradient information can be extracted from the EA-generated samples and that the DDPG and EA parts work together in a balanced way in each generation. Furthermore, within the DDPG part, the algorithm can flexibly switch between continuing to fine-tune the previous RL-Actor and fine-tuning a new one produced by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks show that our algorithm outperforms related methods and offers a satisfactory trade-off between stability and sample efficiency.
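To make the abstract's outer loop concrete, the sketch below illustrates one plausible reading of the PP-PG cycle: an EA population explores in parameter space, the best individual (or the previous RL actor) is fine-tuned with a budget of gradient steps set by a simple criterion, and the fine-tuned actor is injected back into the population. This is a minimal, self-contained toy in NumPy, not the authors' implementation; the environment, the finite-difference stand-in for DDPG updates, and names such as rollout_return, finetune_with_ddpg, and ddpg_steps_criterion are all hypothetical.

```python
# Minimal sketch of the PP-PG loop described in the abstract.
# Assumptions: a toy linear policy, a placeholder reward, and a crude
# finite-difference update standing in for real DDPG critic gradients.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, POP_SIZE, GENERATIONS = 4, 2, 10, 20


def rollout_return(theta, horizon=50):
    """Toy stand-in for an environment rollout: score a linear policy
    theta (obs -> action) on a placeholder task."""
    W = theta.reshape(ACT_DIM, OBS_DIM)
    obs = rng.normal(size=OBS_DIM)
    total = 0.0
    for _ in range(horizon):
        action = np.tanh(W @ obs)
        total += -np.sum(action ** 2)          # placeholder reward
        obs = 0.9 * obs + rng.normal(scale=0.1, size=OBS_DIM)
    return total


def finetune_with_ddpg(theta, n_steps):
    """Placeholder for DDPG fine-tuning of one actor: here a crude
    finite-difference ascent step, not actual critic gradients."""
    for _ in range(n_steps):
        direction = rng.normal(size=theta.shape)
        gain = rollout_return(theta + 0.05 * direction) - rollout_return(theta)
        theta = theta + 0.01 * np.sign(gain) * direction
    return theta


def ddpg_steps_criterion(replay_size, batch_size=32):
    """Hypothetical criterion: scale the number of gradient updates with
    the amount of fresh EA-generated experience to keep both parts balanced."""
    return max(1, replay_size // batch_size)


# EA population: perturbed copies of randomly initialized parameter vectors.
population = [rng.normal(scale=0.1, size=OBS_DIM * ACT_DIM) for _ in range(POP_SIZE)]
rl_actor = population[0].copy()                # actor fine-tuned by the RL part

for gen in range(GENERATIONS):
    fitness = np.array([rollout_return(theta) for theta in population])
    best_idx = int(np.argmax(fitness))

    # Simplified switch rule: adopt the EA's best individual when it beats
    # the current RL actor, otherwise keep refining the same actor.
    if fitness[best_idx] > rollout_return(rl_actor):
        rl_actor = population[best_idx].copy()
    rl_actor = finetune_with_ddpg(rl_actor, ddpg_steps_criterion(POP_SIZE * 50))

    # Guided evolution: mutate elites and inject the fine-tuned actor back.
    elites = [population[i] for i in np.argsort(fitness)[-POP_SIZE // 2:]]
    population = [e + rng.normal(scale=0.05, size=e.shape) for e in elites]
    population += [rl_actor.copy() for _ in range(POP_SIZE - len(population))]

print("final return of RL actor:", round(rollout_return(rl_actor), 3))
```

In a full implementation, the two placeholders would be replaced by a real DDPG agent (replay buffer, critic, target networks) and by MuJoCo-style continuous control environments; the overall shape of the generation loop, however, follows the iterative, closed-loop cooperation the abstract describes.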

Funder

China Postdoctoral Science Foundation

National Natural Science Foundation of China

National Defense Science and Technology Foundation Enhancement Plan

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Theoretical Computer Science

