Authors:
Zhao Li-yang, Chang Tian-qing, Guo Li-bin, Zhang Jie, Zhang Lei, Ma Jin-dun
Abstract
The joint action-value function (JAVF) plays a key role in the centralized training of value function decomposition (VFD)-based multi-agent deep reinforcement learning (MADRL) algorithms and in generating a collaborative policy among agents. However, under the influence of factors such as environmental noise, inadequate exploration, and the iterative update mechanism, estimation bias is inevitably introduced, causing the JAVF to be overestimated; this in turn prevents agents from obtaining accurate reward signals during learning and from correctly approximating the optimal policy. To address this problem, this paper first analyzes the causes of JAVF overestimation, gives the corresponding mathematical proofs and theoretical derivations, and obtains a lower bound on the overestimation error. It then proposes λWD QMIX, an MADRL overestimation-reduction method based on multi-step weighted double estimation. Specifically, λWD QMIX achieves more stable and accurate JAVF estimates by combining a bias-correction mechanism based on weighted double estimation with multi-step updates based on eligibility-trace backups, without adding or changing any network structure. Experiments on the StarCraft II micromanagement benchmark show that the proposed λWD QMIX algorithm effectively improves the final performance and learning efficiency of the baseline algorithm and can be seamlessly integrated with some communication-learning-based MADRL algorithms.
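The abstract names the two ingredients of λWD QMIX (weighted double estimation and multi-step, eligibility-trace-style backups) but does not give the update rule. The sketch below is only an illustration of those two ideas, not the authors' implementation: the weight beta blending the single- and double-estimator targets, the mixing parameter lam, and the function names weighted_double_target and lambda_returns are assumptions introduced here for clarity.

import numpy as np

# Sketch of a weighted double-estimation bootstrap target: beta blends the
# (overestimating) max target with the (underestimating) double target.
# q_online and q_target are next-state action values from the online and
# target networks, respectively.
def weighted_double_target(q_online, q_target, reward, gamma=0.99, beta=0.5):
    a_star = int(np.argmax(q_online))      # action selected by the online network
    single = float(np.max(q_target))       # plain max target, biased upward
    double = float(q_target[a_star])       # double-estimation target, biased downward
    return reward + gamma * (beta * single + (1.0 - beta) * double)

# Sketch of a lambda-style multi-step return computed backward over a short
# trajectory, in the spirit of eligibility-trace backups: each step mixes the
# one-step bootstrap value with the longer return accumulated so far.
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.8):
    # next_values[t] is the bootstrap estimate for the state reached at step t+1
    g = next_values[-1]
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        out[t] = g
    return out

In a VFD setting such as QMIX, one would expect this kind of target construction to be applied to the mixed joint action value produced by the mixing network rather than to individual agent utilities, but that is an assumption about the method, not a detail given in the abstract.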
Publisher
Springer Science and Business Media LLC
Cited by
1 article.