Affiliation:
1. Department of Computer Science, City University of Hong Kong, China
Abstract
In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner; each sub-task is trained separately, and the resulting policies are then fused to accomplish the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods treat all sub-tasks as equally important throughout the task, ruling out the possibility of the agent relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), which automates the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks that dynamically selects a suitable sub-task and masks the rest throughout the task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-task policies weighted by this dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, together with the current reward, is used to train the policy fusion network; this reduces fusion bias by leveraging prior experience. Experimental results on three popular learning tasks demonstrate that the proposed method significantly improves on three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
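The abstract describes two mechanisms: a gating module that emits a time-varying one-hot vector to select one sub-task policy per step while masking the rest, and a prior reward collected during pre-training that is combined with the current reward when training the fusion network. The following is a minimal PyTorch sketch of how such a gate and reward mix could look; the names FusionGate, fused_action_logits, fusion_reward, and the Gumbel-softmax relaxation are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionGate(nn.Module):
    """Maps the current state to a (near) one-hot weight vector over sub-task policies."""

    def __init__(self, state_dim: int, num_subtasks: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_subtasks),
        )

    def forward(self, state: torch.Tensor, hard: bool = True) -> torch.Tensor:
        logits = self.net(state)
        # Gumbel-softmax is a differentiable relaxation of a one-hot sample;
        # hard=True returns an exact one-hot mask in the forward pass.
        return F.gumbel_softmax(logits, tau=1.0, hard=hard)

def fused_action_logits(state, sub_policies, gate):
    """Select one sub-task policy per time step and mask the others."""
    weights = gate(state)                                               # (batch, K), one-hot
    per_policy = torch.stack([p(state) for p in sub_policies], dim=1)  # (batch, K, A)
    return (weights.unsqueeze(-1) * per_policy).sum(dim=1)             # (batch, A)

def fusion_reward(current_reward, prior_reward, beta: float = 0.5):
    """Mix the environment reward with a pre-training (prior) reward to reduce fusion bias."""
    return current_reward + beta * prior_reward

The hard Gumbel-softmax used here is one common way to obtain an exact one-hot selection at inference time while keeping the gating network trainable by gradient descent; the weighting coefficient beta on the prior reward is likewise an assumed hyperparameter.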
Funder
Hong Kong Research Grant Council
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence, Theoretical Computer Science
Cited by
1 article.
1. Strengthening Cooperative Consensus in Multi-Robot Confrontation. ACM Transactions on Intelligent Systems and Technology (2023-12-29).