An efficient and robust gradient reinforcement learning: Deep comparative policy

Author:

Wang Jiaguo (1), Li Wenheng (2), Lei Chao (3), Yang Meng (4), Pei Yang (1)

Affiliation:

1. Northwestern Polytechnical University, Xi’an, China

2. AVIC Xi’an Aeronautics Computing Technique Research Institute, Xi’an, China

3. School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia

4. Faculty of Information Technology, Monash University, Clayton Victoria, Australia

Abstract

Recently, actor-critic architectures such as deep deterministic policy gradient (DDPG) have proved able to learn higher-level concepts when searching for rich rewards and to generate complex actions in continuous action spaces, and they are widely used in practical applications. However, when the action space is bounded and has dynamic hard margins, training DDPG can be problematic and inefficient. Because real-world actuators always have margins and interference, the actor network is likely, after initialization, to become stuck at a local optimum on the action-space margin: the actor gradient points outside the action space, but the actuators stop at the margin. If the hard margins are complex, dynamic, and unknown to the DDPG agent, the agent cannot use penalty functions to recover from the local optimum. Enlarging the random process to widen local exploration puts training at risk of failure. Relying solely on the gradient of the critic network to train the actor network is therefore not a robust method in real environments. To solve this problem, this paper modifies DDPG into deep comparative policy (DCP). Rather than leveraging the critic-to-actor gradient, the core training process of DCP is regulated by a T-fold comparison among randomly proposed adjacent actions. The performance of DDPG, DCP, and related algorithms is tested and compared in two experiments. The results show that DCP is effective and efficient and can perform all the tasks that DDPG can perform. More importantly, DCP is less likely to be influenced by the action-space margins: it provides more safety in avoiding training failure and local optima, and it is more robust in applications with dynamic hard margins in the action space. A further advantage is that no complex penalty for detecting margin contact is required, so the reward function can remain brief.
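
The abstract does not spell out the T-fold comparison, so the following Python sketch is only one plausible reading of it, not the authors' algorithm. It assumes that T random actions are proposed near the actor's current action, that each proposal is clamped to the actuator margins, and that the proposal with the highest critic value becomes a regression target for the actor. The names `comparative_target`, `q_value`, `t_fold`, and `sigma` are hypothetical and chosen for illustration.

```python
import random


def clip(action, low, high):
    """Clamp each action component to the actuator margins."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


def comparative_target(q_value, state, action, low, high,
                       t_fold=8, sigma=0.1, rng=random):
    """Pick the best of the current action and t_fold nearby proposals.

    Assumption (not from the paper): proposals are Gaussian perturbations
    of the current action, and the critic q_value(state, action) ranks them.
    No critic-to-actor gradient is used, only comparisons of critic values.
    """
    center = clip(action, low, high)
    best, best_q = center, q_value(state, center)
    for _ in range(t_fold):
        # Propose an adjacent action and keep it inside the hard margins.
        proposal = clip([a + rng.gauss(0.0, sigma) for a in center], low, high)
        q = q_value(state, proposal)
        if q > best_q:
            best, best_q = proposal, q
    return best
```

Under these assumptions the actor would be trained by supervised regression toward the returned action. Because every proposal is clamped before it is scored, a critic whose optimum lies outside the margins can only pull the target up to the margin, never beyond it.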

Publisher

IOS Press

Subject

Artificial Intelligence, General Engineering, Statistics and Probability

