Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

DOI: 10.1007/s10994-007-5038-2 ⋮ zbMath: 1470.68072 ⋮ OpenAlex: W2104753538 ⋮ MaRDI QID: Q1009248

Csaba Szepesvári, András Antos, Rémi Munos

Publication date: 31 March 2009

Published in: Machine Learning

Full work available at URL: https://doi.org/10.1007/s10994-007-5038-2

zbMATH Keywords

reinforcement learning ⋮ nonparametric regression ⋮ policy iteration ⋮ finite-sample bounds ⋮ off-policy learning ⋮ least-squares regression ⋮ Bellman-residual minimization ⋮ least-squares temporal difference learning

Mathematics Subject Classification ID

Nonparametric regression and quantile regression (62G08) ⋮ Learning and adaptive systems in artificial intelligence (68T05)

Related Items (19)

Least squares policy iteration with instrumental variables vs. direct policy search: comparison against optimal benchmarks using energy storage ⋮ A Two-Timescale Stochastic Algorithm Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic ⋮ A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications ⋮ Policy space identification in configurable environments ⋮ Deep reinforcement trading with predictable returns ⋮ Unnamed Item ⋮ Model selection in reinforcement learning ⋮ Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning ⋮ Off-policy evaluation in partially observed Markov decision processes under sequential ignorability ⋮ Hybrid least-squares algorithms for approximate policy evaluation ⋮ Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains ⋮ Rollout sampling approximate policy iteration ⋮ Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling ⋮ Estimating optimal shared-parameter dynamic regimens with application to a multistage depression clinical trial ⋮ Unnamed Item ⋮ A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation ⋮ Learning When-to-Treat Policies ⋮ Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health ⋮ Batch policy learning in average reward Markov decision processes

