Efficient experience replay architecture for offline reinforcement learning

Published:2023-03-21 Issue:1 Volume:43 Page:35-43
ISSN:2754-6969
Container-title:Robotic Intelligence and Automation
language:en
Short-container-title:RIA

Author:

Zhang Longfei,Feng Yanghe,Wang Rongxiao,Xu Yue,Xu Naifu,Liu Zeyi,Du Hang

Abstract

Purpose: Offline reinforcement learning (RL) learns effective policies from previously collected large-scale data. In some scenarios, however, such as health care and autonomous driving, collecting data is time-consuming, expensive or dangerous, which calls for a more sample-efficient offline RL method. This study introduces an algorithm that samples high-value transitions from a prioritized buffer and samples uniformly from a normal experience buffer, improving the sample efficiency of offline RL and alleviating the “extrapolation error” that commonly arises in offline RL.

Design/methodology/approach: The authors propose a new experience replay architecture consisting of two experience replays, a prioritized experience replay and a normal experience replay, which supply samples for policy updates in different training phases. In the first training stage, the authors sample from the prioritized experience replay according to the calculated priority of each transition. In the second training stage, they sample uniformly from the normal experience replay. Both experience replays are initialized with the same offline data set.

Findings: The proposed method eliminates the out-of-distribution problem in the offline RL regime and promotes training by leveraging a new, efficient experience replay. The authors evaluate their method on the D4RL benchmark, and the results show that the algorithm achieves superior performance over the state-of-the-art offline RL algorithm. The ablation study shows that the authors’ experience replay architecture plays an important role in improving final performance, data efficiency and training stability.

Research limitations/implications: Because of the additional prioritized experience replay, the proposed method increases the computational burden and risks changing the data distribution through its combined sampling strategy. Researchers are therefore encouraged to explore more effective and efficient uses of the experience replay block.

Practical implications: Offline RL is sensitive to the quality and coverage of the pre-collected data, which may not be easy to collect from a specific environment, so practitioners may need to handcraft a behavior policy that interacts with the environment to gather data.

Originality/value: The proposed approach focuses on the experience replay architecture for offline RL and empirically demonstrates the algorithm's superiority in data efficiency and final performance over conservative Q-learning across diverse D4RL tasks. In particular, the authors compare different variants of their experience replay block, and the experiments show that the stage at which samples are drawn from the priority buffer plays an important role in the algorithm. The algorithm is easy to implement and can be combined with any offline RL method based on Q-value approximation with minor adjustments.
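The two-stage sampling scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TwoStageReplay`, the `switch_step` and `alpha` parameters, and the rule that turns a priority into a sampling weight are all assumptions made for illustration, since the abstract does not say how priorities are calculated or when the stages switch. The sketch only shows one offline data set feeding two sampling modes, prioritized in the first stage and uniform in the second.

```python
import random
from typing import List, Sequence, Tuple

# (state, action, reward, next_state, done); a toy layout assumed for illustration
Transition = Tuple[int, int, float, int, bool]


class TwoStageReplay:
    """Hypothetical sketch: two sampling modes over one shared offline data set."""

    def __init__(
        self,
        dataset: Sequence[Transition],
        priorities: Sequence[float],
        switch_step: int,
        alpha: float = 0.6,
        seed: int = 0,
    ) -> None:
        if len(dataset) != len(priorities):
            raise ValueError("need exactly one priority per transition")
        # Both stages read from the same stored transitions.
        self.data: List[Transition] = list(dataset)
        # Assumed weighting rule: (|priority| + eps) ** alpha.
        # The abstract does not specify how priorities are computed.
        self.weights = [(abs(p) + 1e-6) ** alpha for p in priorities]
        self.switch_step = switch_step  # assumed boundary between the stages
        self.rng = random.Random(seed)

    def sample(self, step: int, batch_size: int) -> List[Transition]:
        indices = range(len(self.data))
        if step < self.switch_step:
            # Stage 1: draw with probability proportional to the weights.
            chosen = self.rng.choices(indices, weights=self.weights, k=batch_size)
        else:
            # Stage 2: draw uniformly (with replacement).
            chosen = self.rng.choices(indices, k=batch_size)
        return [self.data[i] for i in chosen]
```

A training loop would call `sample(step, batch_size)` once per update and pass the batch to whichever Q-value-based offline RL learner is in use. Since the abstract reports that the timing of prioritized sampling matters, `switch_step` is the value one would expect to tune in such a sketch.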

Publisher

Emerald

References: 40 articles.

1. An optimistic perspective on offline reinforcement learning,2020

2. Uncertainty-based offline reinforcement learning with diversified q-ensemble,2021

3. Generalized prioritized sweeping,1997

4. Hindsight experience replay;arXiv,2017

5. Improving experience replay through modeling of similar transitions’ sets,2021

Cited by 3 articles.

1. One-shot sim-to-real transfer policy for robotic assembly via reinforcement learning with visual demonstration;Robotica;2024-01-24

2. Robust Output Regulation for a Flexible Wing System with Output Disturbances;2023 42nd Chinese Control Conference (CCC);2023-07-24

3. Fixed-Time Control for a Flexible Smart Structure With Actuator Failure: A Broad Learning System Approach;IEEE Transactions on Cybernetics;2023