Goal-conditioned offline reinforcement learning through state space partitioning-Reference-Cited by-同舟云学术

Goal-conditioned offline reinforcement learning through state space partitioning

Published:2024-02-05 Issue:5 Volume:113 Page:2435-2465
ISSN:0885-6125
Container-title:Machine Learning
language:en
Short-container-title:Mach Learn

Author:

Wang Mianchu,Jin Yue,Montana Giovanni^ORCID

Abstract

AbstractOffline reinforcement learning (RL) aims to create policies for sequential decision-making using exclusively offline datasets. This presents a significant challenge, especially when attempting to accomplish multiple distinct goals or outcomes within a given scenario while receiving sparse rewards. Prior methods using advantage weighting for offline goal-conditioned learning improve policies monotonically. However, they still face challenges from distribution shift and multi-modality that arise due to conflicting ways to reach a goal. This issue is especially challenging in long-horizon tasks, where the presence of multiple, often conflicting, solutions makes it hard to identify a single optimal policy for transitioning from a state to a desired goal. To address these challenges, we introduce a complementary advantage-based weighting scheme that incorporates an additional source of inductive bias. Given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Our proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL, outperforms several competing offline algorithms in widely used benchmarks. Furthermore, we provide a theoretical guarantee that the learned policy will not be inferior to the underlying behavior policy.

Funder

Engineering and Physical Sciences Research Council

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10994-023-06500-z.pdf

Reference46 articles.

1. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., Zaremba, W. (2017). Hindsight experience replay. In: Advances in neural information processing systems.

2. Bain, M., & Sammut, C. (1995). A framework for behavioural cloning. Machine Intelligence, 15, 103–129.

3. Chane-Sane, E., Schmid, C., Laptev, I. (2021). Goal-conditioned reinforcement learning with imagined subgoals. In: International conference on machine learning.

4. Charlesworth, H., & Montana, G. (2020). Plangan: Model-based planning with sparse rewards and multiple goals. Advances in Neural Information Processing Systems, 33, 8532–8542.

5. Chebotar, Y., Hausman, K., Lu, Y., Xiao, T., Kalashnikov, D., Varley, J., Irpan, A., Eysenbach, B., Julian, R., Finn, C., Levine, S. (2021). Actionable models: Unsupervised offline reinforcement learning of robotic skills. In: International conference on machine learning.