Blackwell optimality in the class of stationary policies in Markov decision chains with a Borel state space and unbounded rewards (Q1299918)

    Statements

    Publication date: 22 November 1999
    Consider a discrete-time Markov decision process with the discounted reward criterion. A policy is called Blackwell (B-) optimal with respect to a reference set \(\Pi\) of policies if, for all states \(x\) and all \(\pi\in\Pi\), its expected discounted reward is maximal for all discount factors \(\beta\in(\beta_0(x,\pi),1)\). Building on their earlier papers [\textit{R. Dekker} and \textit{A. Hordijk}, Math. Oper. Res. 17, 271-289 (1992; Zbl 0773.90088); \textit{A. A. Yushkevich}, SIAM J. Control Optim. 35, 2157-2182 (1997; Zbl 0892.93059)], the authors have undertaken a larger joint work, under milder conditions and with stronger results, of which the present paper is Part I. In a forthcoming paper, already available as a preprint, B-optimality with respect to several reference sets is compared and more easily verifiable conditions are given. Because of the rich state and action spaces, the proofs of the main results require six assumptions; these concern measurability and continuity of the reward function, the existence of a bounding function \(\mu\) used to construct a weighted norm (the \(\mu\)-norm), convergence of the powers of the transition laws together with integrability of their densities, and the existence of a reference measure for all transition laws. Under these assumptions there exists a deterministic stationary policy that is B-optimal with respect to all stationary policies and with respect to all policies. Using Laurent series in \(\rho=(1-\beta)/\beta\), in powers \(\rho^n\), \(n=-1,0,1,\ldots\), with a growth condition on the coefficients, and the resulting lexicographic partial order, the authors generalize the famous Howard/Blackwell/Veinott policy improvement method to B-optimality. The proofs are carried out carefully and in detail, and the authors give many arguments showing what fails to hold when one of the assumptions is dropped. There are also remarks on special cases. For applications of the results to hydrology, inventory theory and queueing theory the reader is referred to the introduction.
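    In notation chosen here for illustration (the paper's own symbols may differ), the definitions sketched above can be recorded as follows: a policy \(\pi^*\) is B-optimal with respect to \(\Pi\) if for every state \(x\) and every \(\pi\in\Pi\) there is a \(\beta_0(x,\pi)<1\) such that
    \[
    V_\beta(x,\pi^*)\ \ge\ V_\beta(x,\pi)\qquad\text{for all }\beta\in(\beta_0(x,\pi),1),
    \]
    where \(V_\beta(x,\pi)\) denotes the expected total \(\beta\)-discounted reward. The comparison rests on the Laurent expansion
    \[
    V_\beta(x,\pi)\ =\ \sum_{n=-1}^{\infty}\rho^{\,n}\,c_n(x,\pi),\qquad \rho=\frac{1-\beta}{\beta},
    \]
    so that, for \(\beta\) close to \(1\), comparing discounted rewards amounts to comparing the coefficient sequences \((c_{-1}(x,\pi),c_0(x,\pi),c_1(x,\pi),\ldots)\) in the lexicographic order on which the policy improvement method operates.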
    Keywords: Markov decision processes; Blackwell optimality