Online Bootstrap Inference for Policy Evaluation in Reinforcement Learning
From MaRDI portal
Publication:6185586
DOI: 10.1080/01621459.2022.2096620
arXiv: 2108.03706
OpenAlex: W3191746168
MaRDI QID: Q6185586
FDO: Q6185586
Authors: Zhuoran Yang, Zhaoran Wang, Wei Sun, Guang Cheng
Publication date: 8 January 2024
Published in: Journal of the American Statistical Association
Abstract: The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.
Full work available at URL: https://arxiv.org/abs/2108.03706
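The abstract describes running a multiplier bootstrap online alongside a linear stochastic approximation iterate such as TD(0). A minimal sketch of that idea, not the authors' exact algorithm: in addition to the main TD iterate, maintain B perturbed copies whose updates are scaled by i.i.d. random weights with mean 1 and variance 1, and read confidence intervals off the spread of the (Polyak-averaged) perturbed copies around the averaged main iterate. All numerical values below (the toy Markov reward process, step size, weight distribution) are illustrative assumptions.

```python
# Hypothetical sketch of online multiplier-bootstrap inference for TD(0).
# Not the paper's exact algorithm; the toy MRP and constants are made up.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state Markov reward process.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition matrix
R = np.array([1.0, -1.0])         # expected reward per state
gamma = 0.9
phi = np.eye(2)                   # tabular features

B = 50          # number of bootstrap copies
T = 20000       # online steps
alpha = 0.05    # constant step size (Polyak averaging below)

theta = np.zeros(2)               # main TD iterate
theta_boot = np.zeros((B, 2))     # perturbed copies
theta_bar = np.zeros(2)           # running average of main iterate
theta_boot_bar = np.zeros((B, 2)) # running averages of the copies

s = 0
for t in range(1, T + 1):
    s_next = rng.choice(2, p=P[s])
    r = R[s] + 0.1 * rng.standard_normal()   # noisy observed reward

    # Standard TD(0) update on the main iterate.
    td_err = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta = theta + alpha * td_err * phi[s]

    # Perturbed updates: same transition, step scaled by W_b ~ N(1, 1),
    # a mean-1 variance-1 multiplier (one common choice; others work too).
    W = 1.0 + rng.standard_normal(B)
    td_err_b = r + gamma * theta_boot @ phi[s_next] - theta_boot @ phi[s]
    theta_boot = theta_boot + alpha * (W * td_err_b)[:, None] * phi[s]

    # Online Polyak averages.
    theta_bar += (theta - theta_bar) / t
    theta_boot_bar += (theta_boot - theta_boot_bar) / t

    s = s_next

# Basic bootstrap 95% CI per coordinate of the value-function parameter:
# quantiles of the perturbed deviations around the main average.
q_lo, q_hi = np.quantile(theta_boot_bar - theta_bar, [0.025, 0.975], axis=0)
ci_lower = theta_bar - q_hi
ci_upper = theta_bar - q_lo
print("point estimate:", theta_bar)
print("95% CI:", ci_lower, ci_upper)
```

Because the perturbed copies reuse the same Markovian data stream and differ only through the random multipliers, the whole procedure stays fully online: no trajectories are stored or resampled, which is the practical appeal the abstract points to.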
Recommendations
- Online reinforcement learning using a probability density estimation
- On-line policy gradient estimation with multi-step sampling
- Bayesian optimization for policy search via online-offline experimentation
- Online bootstrap confidence intervals for the stochastic gradient descent estimator
- An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
- Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes
- A basic formula for online policy gradient algorithms
Keywords: asymptotic normality; multiplier bootstrap; stochastic approximation; statistical inference; reinforcement learning
Cites Work
- Markov Chains and Stochastic Stability
- Acceleration of Stochastic Approximation by Averaging
- A Stochastic Approximation Method
- The bootstrap and Edgeworth expansion
- Inference and uncertainty quantification for noisy matrix completion
- Stability of Stochastic Approximation under Verifiable Conditions
- 10.1162/1532443041827907
- Moment consistency of the exchangeably weighted bootstrap for semiparametric M-estimation
- An analysis of temporal-difference learning with function approximation
- Reinforcement learning. An introduction
- Constructing dynamic treatment regimes over indefinite time horizons
- Markov Chains
- Generalized TD learning
- Markov chains and mixing times. With a chapter on "Coupling from the past" by James G. Propp and David B. Wilson.
- Trajectory averaging for stochastic approximation MCMC algorithms
- An emphatic approach to the problem of off-policy temporal-difference learning
- Estimating dynamic treatment regimes in mobile health using V-learning
- Challenges of real-world reinforcement learning: definitions, benchmarks and analysis
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework
- Statistical inference for model parameters in stochastic gradient descent
- Online bootstrap confidence intervals for the stochastic gradient descent estimator
- Statistical inference for online decision making via stochastic gradient descent
- Title not available
- Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning