Projected state-action balancing weights for offline reinforcement learning
Publication: 6183753
DOI: 10.1214/23-AOS2302 | arXiv: 2109.04640 | MaRDI QID: Q6183753
Authors: Jiayi Wang, Zhengling Qi, Raymond Wong
Publication date: 4 January 2024
Published in: The Annals of Statistics
Abstract: Off-policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated by a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semiparametrically efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points in each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
Full work available at URL: https://arxiv.org/abs/2109.04640
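For context, the marginal importance sampling idea the abstract builds on can be sketched as follows (the notation here is standard in this literature but illustrative, not taken verbatim from the paper). Writing d_b for the discounted state-action visitation distribution of the behavior data and d^π for that of the target policy π, the policy value admits the density-ratio representation

\[
V(\pi) \;=\; \frac{1}{1-\gamma}\, \mathbb{E}_{(S,A)\sim d_b}\big[\omega^{\pi}(S,A)\, R\big],
\qquad
\omega^{\pi}(s,a) \;=\; \frac{d^{\pi}(s,a)}{d_b(s,a)}.
\]

The weight ω^π is characterized by a balancing (moment) condition: for every test function f,

\[
\mathbb{E}_{(S,A,S')\sim d_b}\Big[\omega^{\pi}(S,A)\big(f(S,A) - \gamma\, \mathbb{E}_{A'\sim \pi(\cdot\mid S')} f(S',A')\big)\Big]
\;=\; (1-\gamma)\, \mathbb{E}_{S_0\sim \nu_0,\, A_0\sim \pi(\cdot\mid S_0)}\big[f(S_0,A_0)\big],
\]

where ν_0 is the initial state distribution. Estimators in the balancing-weights family enforce this condition approximately over a restricted (projected) class of test functions, which is the idea behind the projected state-action balancing weights of the title.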
Cites Work
- Large Sample Properties of Generalized Method of Moments Estimators
- Personalized Policy Learning Using Longitudinal Mobile Health Data
- Title not available
- Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data
- Title not available
- Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
- Approximate Residual Balancing: Debiased Inference of Average Treatment Effects in High Dimensions
- Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression
- Instrumental Variable Estimation of Nonparametric Models
- Optimal global rates of convergence for nonparametric regression
- Title not available
- High-dimensional A-learning for optimal dynamic treatment regimes
- A distribution-free theory of nonparametric regression
- Marginal Mean Models for Dynamic Regimes
- Optimal Dynamic Treatment Regimes
- Some new asymptotic theory for least squares series: pointwise and uniform results
- Globally Efficient Non-Parametric Inference of Average Treatment Effects by Empirical Balancing Calibration Weighting
- Covariate Balancing Propensity Score
- New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes
- Title not available
- Dynamic treatment regimes: technical challenges and applications
- Nonparametric estimation of an additive model with a link function
- Regularized least-squares regression: learning from a β-mixing sequence
- Regularized policy iteration with nonparametric function spaces
- Generalized Optimal Matching Methods for Causal Inference
- Kernel-based covariate functional balancing for observational studies
- Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning
- Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations
- Batch policy learning in average reward Markov decision processes
- Title not available
- Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health
- Quantile-Optimal Treatment Regimes
- Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning