\(\text{Q}(\lambda)\) with off-policy corrections
Publication:2831390
DOI: 10.1007/978-3-319-46379-7_21
zbMATH Open: 1466.68067
arXiv: 1602.04951
OpenAlex: W2962766894
MaRDI QID: Q2831390
Authors: Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, Rémi Munos
Publication date: 9 November 2016
Published in: Lecture Notes in Computer Science
Abstract: We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions are met. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and formalize an underlying tradeoff in off-policy TD(\(\lambda\)). We illustrate this theoretical relationship empirically on a continuous-state control task.
Full work available at URL: https://arxiv.org/abs/1602.04951
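The correction described in the abstract can be sketched concretely: instead of reweighting returns with importance ratios, each multi-step TD error is computed against the expected value of the current Q-function under the target policy. The following is a minimal, hypothetical tabular sketch of this idea (state/action indices, a `q` table, and a `pi` probability table are all illustrative assumptions, not code from the paper):

```python
import numpy as np

def q_lambda_off_policy_corrections(q, trajectory, pi, gamma=0.99, lam=0.9):
    """Sketch of the Q^pi(lambda) update idea from Harutyunyan et al. (2016):
    off-policy returns are corrected with the current Q-function (via the
    expected Q under the target policy pi), not with importance ratios.

    q          : array of shape (num_states, num_actions), current Q-table
    trajectory : list of (state, action, reward, next_state) from behavior policy
    pi         : array of shape (num_states, num_actions), target-policy probs
    Returns an array of accumulated lambda-return corrections per (s, a).
    """
    # TD errors corrected toward pi through E_pi[Q(s', .)], one per step
    deltas = []
    for (s, a, r, s_next) in trajectory:
        expected_q_next = np.dot(pi[s_next], q[s_next])
        deltas.append(r + gamma * expected_q_next - q[s, a])

    # Backward pass: each step accumulates (gamma * lambda)-discounted
    # sums of all future corrected TD errors
    g = 0.0
    updates = np.zeros_like(q)
    for (s, a, _, _), delta in zip(reversed(trajectory), reversed(deltas)):
        g = delta + gamma * lam * g
        updates[s, a] += g
    return updates
```

Note the tradeoff the paper formalizes: since no importance weighting is applied, convergence requires the behavior and target policies to be sufficiently close relative to λ and the discount factor.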
Cited In (13)
- TD-regularized actor-critic methods
- Gradient temporal-difference learning for off-policy evaluation using emphatic weightings
- Deep Reinforcement Learning: A State-of-the-Art Walkthrough
- Off-policy linear temporal difference learning algorithms with a generalized oblique projection
- Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization
- Off-policy temporal difference learning with distribution adaptation in fast mixing chains
- Off-policy learning with eligibility traces: a survey
- Classification with costly features as a sequential decision-making problem
- Title not available
- Reinforcement learning in sparse-reward environments with hindsight policy gradients
- \(\text{Q}(\lambda)\) with off-policy corrections
- An emphatic approach to the problem of off-policy temporal-difference learning
- Deep exploration via randomized value functions