On the sample complexity of actor-critic method for reinforcement learning with function approximation
Publication: 6134324
DOI: 10.1007/S10994-023-06303-2 · arXiv: 1910.08412 · OpenAlex: W2981237928 · MaRDI QID: Q6134324 · FDO: Q6134324
Authors: Harshat Kumar, Alec Koppel, Alejandro Ribeiro
Publication date: 22 August 2023
Published in: Machine Learning
Abstract: Reinforcement learning, mathematically described by Markov decision problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps that estimate the value function and policy gradient updates. Because these updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic has been characterized, by connecting it to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, resulting in a controllable bias that depends on the number of critic evaluations. As a result, we provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the cases where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. The derived rates are corroborated on a navigation problem involving an obstacle and on the pendulum problem, which together provide insight into the interplay between optimization and generalization in reinforcement learning.
Full work available at URL: https://arxiv.org/abs/1910.08412
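The scheme the abstract describes, alternating a linear-function-approximation critic with Monte Carlo rollout-based policy gradient steps, can be summarized in a short sketch. The Python below is a minimal illustration only, assuming a toy scalar environment, a Gaussian policy with linear features, a TD(0) critic, and a geometric random rollout horizon; the names `featurize` and `env_step`, all step sizes, and the estimator details are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95   # discount factor
d = 4          # feature dimension

def featurize(s):
    # illustrative features for a scalar state (hypothetical choice)
    return np.array([1.0, s, s ** 2, np.tanh(s)])

def env_step(s, a):
    # toy 1D regulation task (hypothetical): drive the state toward zero
    s_next = 0.9 * s + 0.1 * a + 0.01 * rng.normal()
    reward = -s_next ** 2 - 0.01 * a ** 2
    return s_next, reward

w = np.zeros(d)      # critic weights:  V(s) ~ w @ featurize(s)
theta = np.zeros(d)  # actor weights:   Gaussian mean mu(s) = theta @ featurize(s)
sigma = 0.5          # fixed exploration standard deviation

def policy_sample(s):
    return theta @ featurize(s) + sigma * rng.normal()

def grad_log_pi(s, a):
    # score function of the Gaussian policy
    return (a - theta @ featurize(s)) / sigma ** 2 * featurize(s)

for k in range(2000):
    # Critic phase: a few TD(0) evaluations of the current policy.
    # More inner steps give a better critic and hence less bias in the
    # actor step, mirroring the bias/evaluation trade-off in the abstract.
    s = rng.normal()
    for _ in range(10):
        a = policy_sample(s)
        s_next, r = env_step(s, a)
        td_error = r + gamma * w @ featurize(s_next) - w @ featurize(s)
        w += 0.05 * td_error * featurize(s)
        s = s_next

    # Actor phase: Monte Carlo rollout of geometric random length, so the
    # terminal state approximates a draw from the discounted occupancy.
    T = int(rng.geometric(1.0 - gamma))
    s = rng.normal()
    for _ in range(T):
        s, _ = env_step(s, policy_sample(s))
    a = policy_sample(s)
    s_next, r = env_step(s, a)
    q_hat = r + gamma * w @ featurize(s_next)   # critic-based Q estimate
    theta += 0.01 * q_hat * grad_log_pi(s, a) / (1.0 - gamma)
```

In this sketch the bias of the actor update is governed by the accuracy of the critic estimate `q_hat`, which is exactly the mechanism the abstract identifies as the main sample-complexity bottleneck.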
Recommendations
- An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes
- Reinforcement learning via approximation of the Q-function
- Sample Complexity and Overparameterization Bounds for Temporal-Difference Learning With Neural Network Approximation
- An analysis of temporal-difference learning with function approximation
- Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning
- Actor-critic algorithms based on symmetric perturbation sampling
- An online actor-critic algorithm with function approximation for constrained Markov decision processes
- Estimation and approximation bounds for gradient-based reinforcement learning
- On tight bounds for function approximation error in risk-sensitive reinforcement learning
Keywords: non-convex optimization, Markov decision process, stochastic programming, reinforcement learning, actor-critic
Cites Work
- TD-regularized actor-critic methods
- DOI: 10.1162/153244302760200704
- Markov chains and stochastic stability
- Q-learning
- Dynamic programming and optimal control. Vol. 1.
- Approximation by superpositions of a sigmoidal function
- Approximate Dynamic Programming
- Natural actor-critic algorithms
- On Actor-Critic Algorithms
- Actor-Critic-Type Learning Algorithms for Markov Decision Processes
- Reinforcement learning. An introduction
- Asynchronous stochastic approximation and Q-learning
- Stochastic approximation with two time scales
- The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning
- The theory of dynamic programming
- Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions
- Policy gradient in Lipschitz Markov decision processes
- A concentration bound for stochastic approximation via Alekseev's formula
- Global convergence of policy gradient methods to (almost) locally optimal policies
- Policy Evaluation in Continuous MDPs With Efficient Kernelized Gradient Temporal Difference
- A convergent online single time scale actor critic algorithm
- A Small Gain Analysis of Single Timescale Actor Critic
Cited In (2)