Reducing Q-Value Estimation Bias via Mutual Estimation and Softmax Operation in MADRL

Published: 2024-01-16 Issue: 1 Volume: 17 Page: 36
ISSN: 1999-4893
Container-title: Algorithms
Language: en

Author:

Li Zheng¹, Chen Xinkai¹, Fu Jiaqing¹, Xie Ning¹, Zhao Tingting²,³

Affiliation:

1. Center for Future Media, School of Computer Science and Engineering, and Yibin Park, University of Electronic Science and Technology of China, Chengdu 611731, China

2. School of Computer Science and Technology, Tianjin University of Science and Technology, Tianjin 300457, China

3. RIKEN Center for Advanced Intelligence Project (AIP), Tokyo 103-0027, Japan

Abstract

As electronic game technology has developed, games have come to feature more units, richer unit attributes, more complex mechanics, and more diverse team strategies. Multi-agent deep reinforcement learning (MADRL) has performed strongly in team-based games of this kind, achieving results that surpass professional human players. Reinforcement learning algorithms based on Q-value estimation often suffer from Q-value overestimation, which can seriously degrade the performance of AI agents in multi-agent scenarios. We propose a multi-agent mutual evaluation method and a multi-agent softmax method to reduce Q-value estimation bias in multi-agent scenarios, and we test both in the particle multi-agent environment and in a multi-agent tank environment that we constructed. The tank environment strikes a good balance between experimental verification efficiency and the simulation of multi-agent game tasks, and it can be easily extended to different multi-agent cooperation or competition tasks. We hope it will be adopted in multi-agent deep reinforcement learning research.
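This record does not give the paper's formulas, so the following is only a rough sketch of why a softmax operator can temper Q-value overestimation. It shows the generic Boltzmann softmax operator applied to one agent's action values, not the authors' multi-agent method. The function name `softmax_value`, the inverse temperature `beta`, and the sample Q-values are all assumptions made for illustration.

```python
import math


def softmax_value(q_values, beta=1.0):
    """Boltzmann-softmax-weighted average of Q-values (illustrative only).

    beta is a hypothetical inverse temperature: beta = 0 gives the plain
    mean, and large beta approaches the hard max.
    """
    m = max(q_values)  # shift by the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


# Made-up noisy estimates for four actions in one state.
noisy_q = [1.0, 1.2, 0.8, 1.5]

hard_max = max(noisy_q)                      # target used by standard Q-learning
soft = softmax_value(noisy_q, beta=5.0)      # softer target, never above the max
mean_q = softmax_value(noisy_q, beta=0.0)    # degenerates to the average

print(hard_max, soft, mean_q)
```

Because the softmax value is a weighted average, it can never exceed the hard maximum, so a single noisy, inflated estimate pulls the bootstrapped target up less than it would under a plain max. How the paper carries this over to joint actions in the multi-agent setting is not described in this record.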

Funder

National Key R&D Program of China

Chengdu Science and Technology Project

National Natural Science Foundation of China

Intelligent Terminal Key Laboratory of SiChuan Province

Publisher

MDPI AG

Subject

Computational Mathematics, Computational Theory and Mathematics, Numerical Analysis, Theoretical Computer Science

Link

https://www.mdpi.com/1999-4893/17/1/36/pdf