Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model


arXiv: 1806.01492 · MaRDI QID: Q6302549 · FDO: Q6302549


Authors: Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, Yinyu Ye


Publication date: 5 June 2018

Abstract: In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP), provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma \in (0,1)$, and rewards in the range $[0,1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1-\delta$ where \emph{both} the time spent and the number of samples taken are upper bounded by
\[
O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log\left(\frac{|S||A|}{(1-\gamma)\delta\epsilon}\right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right] ~.
\]
For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1-\gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
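For orientation, the sketch below illustrates the generative-model access pattern the abstract refers to: a tabular DMDP exposed only through next-state sampling, with plain sampled Q-value iteration that estimates each Bellman backup by Monte Carlo. This is a minimal illustration, not the paper's variance-reduced algorithm that attains the stated bound; the names (`GenerativeModel`, `sampled_qvi`) and all parameter choices are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's variance-reduced method): plain sampled
# Q-value iteration for a discounted MDP, using only generative-model access,
# i.e. the ability to draw next-state samples for any (state, action) pair.

class GenerativeModel:
    """Wraps a known tabular MDP and exposes only sampling access to it."""

    def __init__(self, P, R, rng=None):
        self.P = P                      # transition probabilities, shape (S, A, S)
        self.R = R                      # rewards in [0, 1], shape (S, A)
        self.rng = rng or np.random.default_rng(0)
        self.num_states, self.num_actions = R.shape

    def sample_next_state(self, s, a):
        """O(1)-time sample from the transition distribution P(. | s, a)."""
        return self.rng.choice(self.num_states, p=self.P[s, a])


def sampled_qvi(model, gamma, num_iters, num_samples):
    """Approximate value iteration: each Bellman backup is estimated from samples."""
    S, A = model.num_states, model.num_actions
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)               # greedy value estimate from current Q
        Q_new = np.zeros_like(Q)
        for s in range(S):
            for a in range(A):
                # Monte Carlo estimate of E[V(s')] via the generative model.
                nxt = [model.sample_next_state(s, a) for _ in range(num_samples)]
                Q_new[s, a] = model.R[s, a] + gamma * np.mean(V[nxt])
        Q = Q_new
    return Q.argmax(axis=1), Q          # greedy policy and its Q estimates


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A = 5, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
    R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]
    model = GenerativeModel(P, R, rng)
    policy, Q = sampled_qvi(model, gamma=0.9, num_iters=50, num_samples=200)
    print("greedy policy:", policy)
```

The naive per-backup sampling shown here is what the paper improves upon; its variance-reduction technique is what removes the extra $(1-\gamma)^{-1}$ factor from the sample complexity.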




Has companion code repository: https://github.com/uclaopt/AsyncQVI








