Stability estimation of some Markov controlled processes (Q2111840)
From MaRDI portal
scientific article
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Stability estimation of some Markov controlled processes | scientific article | |
Statements
Stability estimation of some Markov controlled processes (English)
17 January 2023
The paper analyses a discrete-time controlled Markov process under the expected total discounted reward criterion. The distribution \(G\) of the underlying randomness is unknown, but is assumed to be approximated by a known distribution \(\tilde G\), obtained, for example, from statistical data. The "real" process \(X\) is driven by \(G\); the "approximating" process \(\tilde X\) is driven by \(\tilde G\). The aim is to develop stability estimates comparing the two processes. Future rewards are discounted by a factor \(\alpha \in [0,1)\), and the reward function \(r\) is assumed to be uniformly bounded, say by \(b < \infty\). Let \(V(\pi)\) and \(\tilde V(\pi)\) be the value functions when policy \(\pi\) is followed in the "real" and "approximating" process, respectively: \[ \textstyle V(\pi) := \mathbb E^\pi\bigl( \sum_{t\ge1} \alpha^{t-1} r(X_{t-1}, a_t) \bigr) \quad\text{and}\quad \tilde V(\pi) := \mathbb E^\pi\bigl( \sum_{t\ge1} \alpha^{t-1} r(\tilde X_{t-1}, a_t) \bigr); \] here, \(a_t\) is the *action* at time \(t\), chosen according to the policy and the information revealed up to that time.

It is known that, under certain conditions, there exist *stationary* policies, that is, policies choosing the next action as a function of the current state only, \(f_\star\) and \(\tilde{f_\star}\), which optimise the respective value functions: \[ \textstyle V(f_\star) = \sup_\pi V(\pi) \quad\text{and}\quad \tilde V(\tilde{f_\star}) = \sup_{\tilde \pi} \tilde V(\tilde \pi), \] where the suprema are over *all* policies, not just stationary ones. The goal is then to bound the *stability index* \[ \Delta := V(f_\star) - V(\tilde{f_\star}) \ge 0, \] which measures how far from optimal the "approximating" policy is when used in the "real" process.
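To make the objective concrete, here is a minimal Monte Carlo sketch (not from the paper; the random-walk dynamics, the reward, and the policy are illustrative assumptions) estimating the discounted value \(V(\pi)\) of a stationary policy by truncating the infinite sum at a finite horizon.

```python
import random

def value_estimate(policy, sample_noise, alpha=0.9, horizon=200,
                   n_paths=2000, x0=0.0, seed=0):
    """Monte Carlo estimate of V(policy) = E[sum_{t>=1} alpha^{t-1} r(X_{t-1}, a_t)],
    truncated at `horizon` steps; the discarded tail is at most
    b * alpha**horizon / (1 - alpha) since |r| <= b."""
    rng = random.Random(seed)

    def r(x, a):
        # Illustrative uniformly bounded reward, |r| <= b = 1:
        # penalise the state's distance from the origin.
        return -min(1.0, abs(x))

    total = 0.0
    for _ in range(n_paths):
        x, disc, acc = x0, 1.0, 0.0
        for _ in range(horizon):
            a = policy(x)                   # action from the stationary policy
            acc += disc * r(x, a)
            x = x + a + sample_noise(rng)   # X_t = X_{t-1} + a_t + xi_t, xi_t ~ G
            disc *= alpha
        total += acc
    return total / n_paths
```

With the stationary policy \(f(x) = -x\) (drive the state back to the origin) and standard normal noise playing the role of \(G\), the estimate necessarily lies in \([-b/(1-\alpha), 0] = [-10, 0]\), illustrating how the uniform bound on \(r\) controls the value function.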
Theorem 1 establishes the bound \[ \Delta \le \frac{2 \alpha b}{(1 - \alpha)^2} d_\mathsf{TV}(G, \tilde G), \] under certain conditions, where \(d_\mathsf{TV}\) denotes the *total-variation distance*. The conditions are relatively simple, as are the statement and the proof, making the theorem appealing. However, the authors note a shortcoming: in many situations \(G\) is continuous while \(\tilde G\) is an empirical distribution, which is inherently discrete; the two measures are then mutually singular, so \(d_\mathsf{TV}(G, \tilde G)\) attains its maximal value and the bound is vacuous. Theorem 2 rectifies this issue with a more intricate bound \[ \Delta \lesssim d_\mathsf{Dudley}(G, \tilde G), \] under certain conditions, such as the reward function being Lipschitz, where \(d_\mathsf{Dudley}\) denotes the *Dudley metric* (the bounded-Lipschitz metric). The "\(\lesssim\)" symbol hides factors depending on the reward function, such as its Lipschitz constant. The proof of this theorem is considerably more detailed and nuanced.

The example where \(\tilde G\) is an empirical distribution is then worked out: very roughly, if \(G\) has finite exponential moments in an interval around \(0\), then \(d_\mathsf{Dudley}(G, \tilde G_n) \to 0\) as \(n \to \infty\), where \(\tilde G_n\) is the empirical distribution of \(n\) iid samples. By contrast, \(d_\mathsf{TV}(G, \tilde G_n)\) remains maximal for every \(n\). The assumptions of Theorem 2 are, naturally, still required. The paper closes with a few example applications. The first demonstrates the necessity of certain conditions in Theorem 2. The next two are particularly pleasing: the second models dam operations and stocks of water; the third models a controlled environmental stochastic process.
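The contrast between the two metrics can be seen numerically. The sketch below (illustrative, not from the paper) compares the empirical distribution \(\tilde G_n\) of standard normal samples against \(G = N(0,1)\) via the Wasserstein-1 distance \(\int |F_n - \Phi|\); since every function in the bounded-Lipschitz unit ball is in particular 1-Lipschitz, this quantity upper-bounds the Dudley metric, and it shrinks as \(n\) grows, whereas the total-variation distance between the continuous \(G\) and the discrete \(\tilde G_n\) stays maximal for all \(n\).

```python
import bisect
import math
import random

def w1_to_std_normal(sample, lo=-6.0, hi=6.0, steps=4000):
    """Numerical Wasserstein-1 distance between the empirical law of `sample`
    and N(0,1), computed as the integral of |F_n(x) - Phi(x)| on [lo, hi]
    (the mass outside [-6, 6] is negligible). This upper-bounds the
    Dudley (bounded-Lipschitz) metric."""
    xs = sorted(sample)
    n = len(xs)
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        f_emp = bisect.bisect_right(xs, x) / n               # empirical CDF F_n(x)
        f_true = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Phi(x)
        total += abs(f_emp - f_true) * dx
    return total

rng = random.Random(1)
d_50 = w1_to_std_normal([rng.gauss(0.0, 1.0) for _ in range(50)])
d_5000 = w1_to_std_normal([rng.gauss(0.0, 1.0) for _ in range(5000)])
```

Here `d_5000` comes out much smaller than `d_50`, mirroring the convergence \(d_\mathsf{Dudley}(G, \tilde G_n) \to 0\) discussed above; no analogous computation is possible for total variation, which cannot distinguish empirical approximations of different quality.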
optimal control policy
stability inequality
total variation
Dudley metrics