Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model


arXiv: 1806.01492 · MaRDI QID: Q6302549 · FDO: Q6302549


Authors: Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, Yinyu Ye


Publication date: 5 June 2018

Abstract: In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP), provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma \in (0,1)$, and rewards in the range $[0,1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1-\delta$ where \emph{both} the time spent and the number of samples taken are upper bounded by
\[
O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log\left(\frac{|S||A|}{(1-\gamma)\delta\epsilon}\right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right] ~.
\]
For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1-\gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
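For orientation, the sketch below illustrates the generative-model access pattern the abstract refers to: a tabular DMDP exposed only through next-state sampling, with plain sampled Q-value iteration that estimates each Bellman backup by Monte Carlo. This is a minimal illustration, not the paper's variance-reduced algorithm that attains the stated bound; the names (`GenerativeModel`, `sampled_qvi`) and all parameter choices are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's variance-reduced method): plain sampled
# Q-value iteration for a discounted MDP, using only generative-model access,
# i.e. the ability to draw next-state samples for any (state, action) pair.

class GenerativeModel:
    """Wraps a known tabular MDP and exposes only sampling access to it."""

    def __init__(self, P, R, rng=None):
        self.P = P                      # transition probabilities, shape (S, A, S)
        self.R = R                      # rewards in [0, 1], shape (S, A)
        self.rng = rng or np.random.default_rng(0)
        self.num_states, self.num_actions = R.shape

    def sample_next_state(self, s, a):
        """O(1)-time sample from the transition distribution P(. | s, a)."""
        return self.rng.choice(self.num_states, p=self.P[s, a])


def sampled_qvi(model, gamma, num_iters, num_samples):
    """Approximate value iteration: each Bellman backup is estimated from samples."""
    S, A = model.num_states, model.num_actions
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)               # greedy value estimate from current Q
        Q_new = np.zeros_like(Q)
        for s in range(S):
            for a in range(A):
                # Monte Carlo estimate of E[V(s')] via the generative model.
                nxt = [model.sample_next_state(s, a) for _ in range(num_samples)]
                Q_new[s, a] = model.R[s, a] + gamma * np.mean(V[nxt])
        Q = Q_new
    return Q.argmax(axis=1), Q          # greedy policy and its Q estimates


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A = 5, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
    R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]
    model = GenerativeModel(P, R, rng)
    policy, Q = sampled_qvi(model, gamma=0.9, num_iters=50, num_samples=200)
    print("greedy policy:", policy)
```

The naive per-backup sampling shown here is what the paper improves upon; its variance-reduction technique is what removes the extra $(1-\gamma)^{-1}$ factor from the sample complexity.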




Has companion code repository: https://github.com/uclaopt/AsyncQVI








