Affiliation:
1. Technische Universität München
2. École Polytechnique
3. ETH Zürich
Abstract
In the realm of stochastic control, particularly in economics and engineering, Markov Decision Processes (MDPs) are employed to represent processes ranging from asset management to transportation logistics. Upon closer examination, these constrained MDPs often exhibit specific causal structure in their transition and reward dynamics, and leveraging this structure can simplify the computation of the optimal policy. This study introduces a framework, which we denote SD-MDP, that disentangles the causal structure of the state transition and reward dynamics. Through this method, we establish theoretical guarantees on improvements in computational efficiency over standard MDP solvers (such as linear programming). We further derive error bounds on the optimal value approximation via Monte Carlo simulation for this family of stochastic control problems.
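Since the abstract refers to approximating the optimal value via Monte Carlo simulation, the following is a minimal sketch of that general idea for a tabular MDP, assuming a fixed policy and a toy two-state model; the function name, parameters, and example model are hypothetical illustrations, not the paper's SD-MDP implementation.

    # Hypothetical sketch: Monte Carlo value estimation for a tabular MDP.
    # Not the paper's SD-MDP method; illustrates the general technique only.
    import random

    def mc_value_estimate(transition, reward, policy, start_state,
                          gamma=0.95, horizon=50, num_rollouts=1000, seed=0):
        """Estimate V^pi(start_state) by averaging truncated discounted
        returns over independent simulated rollouts."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(num_rollouts):
            s, ret, discount = start_state, 0.0, 1.0
            for _ in range(horizon):
                a = policy[s]
                # Sample the next state from the distribution P(. | s, a).
                next_states, probs = zip(*transition[(s, a)].items())
                ret += discount * reward[(s, a)]
                discount *= gamma
                s = rng.choices(next_states, weights=probs)[0]
            total += ret
        return total / num_rollouts

    # Toy two-state MDP (hypothetical data): the estimate concentrates
    # around the true value at rate O(1/sqrt(num_rollouts)) by standard
    # concentration arguments, which is the flavor of error bound cited.
    transition = {(0, 'stay'): {0: 0.9, 1: 0.1}, (1, 'stay'): {1: 1.0}}
    reward = {(0, 'stay'): 1.0, (1, 'stay'): 0.0}
    policy = {0: 'stay', 1: 'stay'}
    print(mc_value_estimate(transition, reward, policy, start_state=0))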
Publisher
Association for Computing Machinery (ACM)