Projected state-action balancing weights for offline reinforcement learning
Publication: 6183753
DOI: 10.1214/23-AOS2302
MaRDI QID: Q6183753
FDO: Q6183753
Authors: Jiayi Wang, Zhengling Qi, Raymond Wong
Publication date: 4 January 2024
Published in: The Annals of Statistics
Abstract: Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
Full work available at URL: https://arxiv.org/abs/2109.04640
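For context, the "marginal importance sampling" and "balancing" ideas named in the abstract can be sketched via the following standard off-policy evaluation identity. This is a sketch in common notation, not reproduced from the paper itself; the discount factor \(\gamma\), the discounted stationary state-action distributions \(d^\pi\) (target) and \(d^b\) (behavior data), and the initial state distribution \(\nu\) are assumed notation.
\[
V(\pi) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{(S,A)\sim d^b}\!\bigl[\,w_\pi(S,A)\,R\,\bigr],
\qquad
w_\pi(S,A) \;=\; \frac{d^\pi(S,A)}{d^b(S,A)},
\]
where the unknown ratio \(w_\pi\) is characterized by the balancing condition
\[
\mathbb{E}_{(S,A,S')\sim d^b}\!\Bigl[\,w_\pi(S,A)\bigl(f(S,A)-\gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid S')}f(S',a')\bigr)\Bigr]
\;=\;(1-\gamma)\,\mathbb{E}_{s_0\sim\nu,\;a\sim\pi(\cdot\mid s_0)}\!\bigl[f(s_0,a)\bigr]
\]
for all test functions \(f\) in a suitable class. Requiring this condition only approximately, over a projected function class, is the essence of the "approximately projected state-action balancing weights" described in the abstract.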
Recommendations
- Double reinforcement learning for efficient off-policy evaluation in Markov decision processes
- Proximal reinforcement learning: efficient off-policy evaluation in partially observed Markov decision processes
- Reliable off-policy evaluation for reinforcement learning
- Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
- Gradient temporal-difference learning for off-policy evaluation using emphatic weightings
Cites Work
- A distribution-free theory of nonparametric regression
- Approximate residual balancing: debiased inference of average treatment effects in high dimensions
- Batch policy learning in average reward Markov decision processes
- Covariate Balancing Propensity Score
- Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data
- Dynamic treatment regimes: technical challenges and applications
- Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
- Estimating dynamic treatment regimes in mobile health using V-learning
- Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
- Generalized optimal matching methods for causal inference
- Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting
- High-dimensional A-learning for optimal dynamic treatment regimes
- Instrumental Variable Estimation of Nonparametric Models
- Kernel-based covariate functional balancing for observational studies
- Large Sample Properties of Generalized Method of Moments Estimators
- Marginal Mean Models for Dynamic Regimes
- Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations
- New statistical learning methods for estimating optimal dynamic treatment regimes
- Nonparametric estimation of an additive model with a link function
- Off-policy estimation of long-term average outcomes with applications to mobile health
- Optimal Dynamic Treatment Regimes
- Optimal global rates of convergence for nonparametric regression
- Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression
- Personalized Policy Learning Using Longitudinal Mobile Health Data
- Quantile-optimal treatment regimes
- Regularized least-squares regression: learning from a β-mixing sequence
- Regularized policy iteration with nonparametric function spaces
- Some new asymptotic theory for least squares series: pointwise and uniform results
Cited In (5)
- Reliable off-policy evaluation for reinforcement learning
- Proximal reinforcement learning: efficient off-policy evaluation in partially observed Markov decision processes
- Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
- Exploiting action impact regularity and exogenous state variables for offline reinforcement learning
- Offline reinforcement learning with representations for actions