\(\text{Q}(\lambda)\) with off-policy corrections
Publication:2831390
Abstract: We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions are met. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD(\(\lambda\)). We illustrate this theoretical relationship empirically on a continuous-state control task.
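As a minimal sketch of the correction described in the abstract (the notation and the exact form of the contraction condition are assumed from the standard formulation, not stated on this page): given a trajectory \((x_t, a_t, r_t)\) generated by a behavior policy \(\mu\), the corrected return bootstraps on the expected Q-value under the target policy \(\pi\) instead of reweighting transitions with importance-sampling ratios,
\[
G_t^{\lambda} = Q(x_t, a_t) + \sum_{s \ge t} (\gamma\lambda)^{s-t} \Big( r_s + \gamma \sum_{a} \pi(a \mid x_{s+1})\, Q(x_{s+1}, a) - Q(x_s, a_s) \Big),
\]
and the tradeoff mentioned above takes the form of a contraction condition of roughly the shape \(\lambda < \frac{1-\gamma}{\gamma \epsilon}\), where \(\epsilon = \max_x \lVert \pi(\cdot \mid x) - \mu(\cdot \mid x) \rVert_1\) measures the distance between the target and behavior policies.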
Cites work
- scientific article; zbMATH DE number 3126094 (title not available)
- scientific article; zbMATH DE number 1321699 (title not available)
- scientific article; zbMATH DE number 700091 (title not available)
- Analytical mean squared error curves for temporal difference learning
- \(\text{Q}(\lambda)\) with off-policy corrections
- \({\mathcal Q}\)-learning
Cited in (13 documents)
- TD-regularized actor-critic methods
- Gradient temporal-difference learning for off-policy evaluation using emphatic weightings
- Deep Reinforcement Learning: A State-of-the-Art Walkthrough
- Off-policy linear temporal difference learning algorithms with a generalized oblique projection
- Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization
- Off-policy temporal difference learning with distribution adaptation in fast mixing chains
- Off-policy learning with eligibility traces: a survey
- Classification with costly features as a sequential decision-making problem
- scientific article; zbMATH DE number 7306868 (title not available)
- Reinforcement learning in sparse-reward environments with hindsight policy gradients
- \(\text{Q}(\lambda)\) with off-policy corrections
- An emphatic approach to the problem of off-policy temporal-difference learning
- Deep exploration via randomized value functions