Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition-Reference-Cited by-同舟云学术

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Published:2000-11-01 Issue: Volume:13 Page:227-303
ISSN:1076-9757
Container-title:Journal of Artificial Intelligence Research
language:
Short-container-title:jair

Author:

Dietterich T. G.

Abstract

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics---as a subroutine hierarchy---and a declarative semantics---as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

Publisher

AI Access Foundation

Subject

Artificial Intelligence

Cited by 585 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Model-free aperiodic tracking for discrete-time systems using hierarchical reinforcement learning;Neurocomputing;2024-12

2. The AI circular hydrogen economist: Hydrogen supply chain design via hierarchical deep multi-agent reinforcement learning;Chemical Engineering Journal;2024-10

3. Policy Testing with MDPFuzz (Replicability Study);Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis;2024-09-11

4. Makine Çizelgeleme Problemlerinin Çözümünde Pekiştirmeli Öğrenme Etkisinin Analizi;ALKÜ Fen Bilimleri Dergisi;2024-08-30

5. A model of how hierarchical representations constructed in the hippocampus are used to navigate through space;Adaptive Behavior;2024-08-28