Blackwell optimality in the class of stationary policies in Markov decision chains with a Borel state space and unbounded rewards (Q1299918)

    Statements

    Publication date: 22 November 1999
    Consider a discrete-time Markov decision process with the discounted reward criterion. A policy is called Blackwell (B-) optimal with respect to a reference set \(\Pi\) of policies if, for all states \(x\) and all \(\pi\in\Pi\), its expected discounted reward is maximal for all discount factors \(\beta\in(\beta_0(x,\pi),1)\). Building on their earlier papers [\textit{R. Dekker} and \textit{A. Hordijk}, Math. Oper. Res. 17, 271-289 (1992; Zbl 0773.90088); \textit{A. A. Yushkevich}, SIAM J. Control Optim. 35, 2157-2182 (1997; Zbl 0892.93059)], the authors have undertaken a larger joint work, under milder conditions and with stronger results, of which the present paper is Part I. In a forthcoming paper, already available as a preprint, B-optimality with respect to several reference sets is compared and more easily verifiable conditions are given. Because of the rich state and action spaces, the proofs of the main results require six assumptions; these concern measurability and continuity of the reward function, the existence of a bounding function \(\mu\) used to construct a weighted norm (the \(\mu\)-norm), convergence of the powers of the transition laws together with integrability of their densities, and the existence of a reference measure for all transition laws. Under these assumptions there exists a deterministic stationary policy that is B-optimal with respect to all stationary policies and with respect to all policies. Using Laurent series in \(\rho=(1-\beta)/\beta\), in powers \(\rho^n\), \(n=-1,0,1,\ldots\), with a growth condition on the coefficients, and the resulting lexicographic partial order, the authors generalize the famous Howard/Blackwell/Veinott policy improvement method to B-optimality. The proofs are carried out carefully and in detail, and the authors give many arguments showing what fails to hold when one of the assumptions is dropped. There are also remarks on special cases. For applications of the results to hydrology, inventory theory and queueing theory the reader is referred to the introduction.
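    In notation chosen here for illustration (the paper's own symbols may differ), the definitions sketched above can be recorded as follows: a policy \(\pi^*\) is B-optimal with respect to \(\Pi\) if for every state \(x\) and every \(\pi\in\Pi\) there is a \(\beta_0(x,\pi)<1\) such that
    \[
    V_\beta(x,\pi^*)\ \ge\ V_\beta(x,\pi)\qquad\text{for all }\beta\in(\beta_0(x,\pi),1),
    \]
    where \(V_\beta(x,\pi)\) denotes the expected total \(\beta\)-discounted reward. The comparison rests on the Laurent expansion
    \[
    V_\beta(x,\pi)\ =\ \sum_{n=-1}^{\infty}\rho^{\,n}\,c_n(x,\pi),\qquad \rho=\frac{1-\beta}{\beta},
    \]
    so that, for \(\beta\) close to \(1\), comparing discounted rewards amounts to comparing the coefficient sequences \((c_{-1}(x,\pi),c_0(x,\pi),c_1(x,\pi),\ldots)\) in the lexicographic order on which the policy improvement method operates.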
    Keywords: Markov decision processes; Blackwell optimality