Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Published: 2023-09-26
ISSN:0030-364X
Container-title:Operations Research
language:en
Short-container-title:Operations Research

Author:

Bennett Andrew¹, Kallus Nathan¹

Affiliation:

1. Cornell Tech, Cornell University, New York, New York 10044

Abstract

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived assuming a perfect Markov decision process (MDP) model. In “Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes,” A. Bennett and N. Kallus tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, they consider estimating the value of a given target policy in an unknown POMDP, given observations of trajectories generated by a different and unknown policy, which may depend on the unobserved states. They consider both when the target policy value can be identified from the observed data and, given identification, how best to estimate it. Both these problems are addressed by extending the framework of proximal causal inference to POMDP settings, using sequences of so-called bridge functions. This results in a novel framework for off-policy evaluation in POMDPs that they term proximal reinforcement learning, which they validate in various empirical settings.
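The confounding concern described in the abstract can be illustrated with a minimal one-step sketch. This is not the authors' proximal method and involves no bridge functions; it only shows why an estimate that ignores an unobserved state can be biased. Every name and number below (`P_U`, `P_A1_GIVEN_U`, `mean_reward`, the probabilities and the reward rule) is a hypothetical assumption chosen for illustration.

```python
# Hypothetical one-step example; all numbers are illustrative assumptions.
P_U = {0: 0.5, 1: 0.5}            # distribution of the unobserved state U
P_A1_GIVEN_U = {0: 0.1, 1: 0.9}   # behavior policy: P(A=1 | U), depends on U


def mean_reward(u, a):
    """Expected reward given unobserved state u and action a (assumed form)."""
    return u + 0.5 * a


def true_value(action):
    """Value of the target policy that always plays `action`."""
    return sum(P_U[u] * mean_reward(u, action) for u in P_U)


def naive_value(action):
    """E[R | A=action] in the logged data, i.e. an estimate that ignores U."""
    def p_a(u):
        return P_A1_GIVEN_U[u] if action == 1 else 1.0 - P_A1_GIVEN_U[u]

    # Joint probability of (U=u, A=action) under the behavior policy.
    joint = {u: P_U[u] * p_a(u) for u in P_U}
    total = sum(joint.values())
    # Average reward over the posterior of U given the observed action.
    return sum(joint[u] / total * mean_reward(u, action) for u in P_U)
```

Under these assumed numbers, `true_value(1)` is 1.0 while `naive_value(1)` is 1.4, up to floating-point rounding: the behavior policy plays action 1 mostly when U = 1, so conditioning on the observed action overstates the value of always playing it.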

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Subject

Management Science and Operations Research, Computer Science Applications

Link

https://pubsonline.informs.org/doi/pdf/10.1287/opre.2021.0781

References: 19 articles.

1. The variational method of moments

2. Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration With Application to Autonomous Sequential Repair Problems

3. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals

4. Estimation of Nonparametric Conditional Moment Models With Possibly Nonsmooth Generalized Residuals

Cited by 3 articles.

1. A confounding bridge approach for double negative control inference on causal effects; Statistical Theory and Related Fields; 2024-08-30

2. Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions; International Statistical Review; 2024-07-29

3. Statistical Reinforcement Learning and Dynamic Treatment Regimes; ICSA Book Series in Statistics; 2024