Learning Policies for Markov Decision Processes From Data

From MaRDI portal
Publication:5223736

DOI: 10.1109/TAC.2018.2866455
zbMATH Open: 1482.93721
arXiv: 1701.05954
OpenAlex: W2581566809
Wikidata: Q129352002 (Scholia: Q129352002)
MaRDI QID: Q5223736
FDO: Q5223736


Authors: Manjesh Kumar Hanawal, Hao Liu, Henghui Zhu, Ioannis Ch. Paschalidis


Publication date: 18 July 2019

Published in: IEEE Transactions on Automatic Control

Abstract: We consider the problem of learning a policy for a Markov decision process consistent with data captured on the state-action pairs followed by the policy. We assume that the policy belongs to a class of parameterized policies which are defined using features associated with the state-action pairs. The features are known a priori; however, only an unknown subset of them may be relevant. The policy parameters that correspond to an observed target policy are recovered using ℓ1-regularized logistic regression that best fits the observed state-action samples. We establish bounds on the difference between the average reward of the estimated and the original policy (regret) in terms of the generalization error and the ergodic coefficient of the underlying Markov chain. To that end, we combine sample complexity theory and sensitivity analysis of the stationary distribution of Markov chains. Our analysis suggests that to achieve regret within order O(√ε), it suffices to use a training sample size on the order of Ω(log n · poly(1/ε)), where n is the number of features. We demonstrate the effectiveness of our method on a synthetic robot navigation example.
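
A minimal sketch (not the authors' reference code) of the estimation step described in the abstract: fitting sparse policy parameters from observed state-action samples via ℓ1-regularized logistic regression. For simplicity it uses a binary-action policy π(a=1|s) = sigmoid(θᵀφ(s)); the feature map φ, the constants, and the scikit-learn estimator choice below are illustrative assumptions, not taken from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_states, n_features, n_samples = 50, 40, 5000

# Known state features; only a small unknown subset of coordinates is relevant.
phi = rng.normal(size=(n_states, n_features))
theta_true = np.zeros(n_features)
theta_true[:4] = [2.0, -1.5, 1.0, -2.5]

# Observed state-action pairs generated by the target (logistic) policy.
# States are drawn i.i.d. here for simplicity; the paper works with
# trajectories of the underlying Markov chain.
states = rng.integers(n_states, size=n_samples)
p_action1 = 1.0 / (1.0 + np.exp(-phi[states] @ theta_true))
actions = (rng.random(n_samples) < p_action1).astype(int)

# L1-regularized logistic regression fit to the state-action samples;
# the L1 penalty encourages a sparse estimate of the relevant features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0,
                         fit_intercept=False)
clf.fit(phi[states], actions)
theta_hat = clf.coef_.ravel()

print("recovered nonzero coefficients:", np.flatnonzero(np.abs(theta_hat) > 1e-3))
print("parameter estimation error:", np.linalg.norm(theta_hat - theta_true))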


Full work available at URL: https://arxiv.org/abs/1701.05954







Cited In (6)





This page was built for publication: Learning Policies for Markov Decision Processes From Data
