\(\text{Q}(\lambda)\) with off-policy corrections


DOI: 10.1007/978-3-319-46379-7_21
zbMATH Open: 1466.68067
arXiv: 1602.04951
OpenAlex: W2962766894
MaRDI QID: Q2831390
FDO: Q2831390


Authors: Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, Rémi Munos


Publication date: 9 November 2016

Published in: Lecture Notes in Computer Science

Abstract: We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and formalize an underlying tradeoff in off-policy TD(\(\lambda\)). We illustrate this theoretical relationship empirically on a continuous-state control task.


Full work available at URL: https://arxiv.org/abs/1602.04951
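The correction described in the abstract admits a compact backward-view implementation: the TD error bootstraps on the expected Q-value under the target policy \(\pi\) (a reward-space correction), while eligibility traces decay at rate \(\gamma\lambda\) and are never cut by importance-sampling ratios. Below is a minimal tabular sketch of the policy-evaluation update under these assumptions; the function name, the `episodes` format, and the array layout of `pi` are hypothetical illustration choices, not the authors' reference implementation.

```python
import numpy as np

def q_lambda_off_policy_evaluation(episodes, pi, n_states, n_actions,
                                   alpha=0.1, gamma=0.95, lam=0.5):
    """Tabular backward-view sketch of Q(lambda) with off-policy
    corrections (policy evaluation).

    episodes : iterable of trajectories [(x, a, r, x_next, done), ...]
               generated by some behavior policy mu (mu itself is not
               needed: no importance-sampling ratios are used).
    pi       : target policy as an array, pi[x, a] = pi(a | x).
    """
    Q = np.zeros((n_states, n_actions))
    for episode in episodes:
        e = np.zeros_like(Q)  # accumulating eligibility trace
        for (x, a, r, x_next, done) in episode:
            # Off-policy correction in reward space: bootstrap with the
            # *expected* Q-value under the target policy pi, rather than
            # reweighting transitions by importance-sampling ratios.
            target = r if done else r + gamma * pi[x_next] @ Q[x_next]
            delta = target - Q[x, a]
            e *= gamma * lam      # traces decay but are never cut
            e[x, a] += 1.0
            Q += alpha * delta * e
    return Q
```

Consistent with the tradeoff formalized in the paper, this update is only guaranteed to converge when \(\lambda\) is small relative to the distance between the target and behavior policies (and to the discount factor); with uncorrected traces, larger \(\lambda\) combined with more off-policy behavior can cause divergence.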









This page was built for publication: \(\text{Q}(\lambda)\) with off-policy corrections
