Stability estimation of some Markov controlled processes (Q2111840)
From MaRDI portal
scientific article
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Stability estimation of some Markov controlled processes | scientific article | |
Statements
Stability estimation of some Markov controlled processes (English)
17 January 2023
The paper analyses a discrete-time controlled Markov process under the expected total discounted reward criterion. The distribution \(G\) of the underlying randomness is unknown, but is assumed to be approximated by a known distribution \(\tilde G\), obtained, for example, from statistical data. The "real" process \(X\) is driven by \(G\); the "approximating" process \(\tilde X\) is driven by \(\tilde G\). The aim is to develop stability estimates comparing the two processes. Future rewards are discounted by a factor \(\alpha \in [0,1)\), and the reward function \(r\) is assumed to be uniformly bounded, say by \(b < \infty\). Let \(V(\pi)\) and \(\tilde V(\pi)\) be the value functions when policy \(\pi\) is followed in the "real" and "approximating" process, respectively: \[ \textstyle V(\pi) := \mathbb E^\pi\bigl( \sum_{t\ge1} \alpha^{t-1} r(X_{t-1}, a_t) \bigr) \quad\text{and}\quad \tilde V(\pi) := \mathbb E^\pi\bigl( \sum_{t\ge1} \alpha^{t-1} r(\tilde X_{t-1}, a_t) \bigr); \] here, \(a_t\) is the *action* at time \(t\), chosen according to the policy and the information revealed up to that time.

It is known that, under certain conditions, there exist *stationary* policies, that is, policies choosing the next action as a function of the current state only, \(f_\star\) and \(\tilde{f_\star}\), which optimise the respective value functions: \[ \textstyle V(f_\star) = \sup_\pi V(\pi) \quad\text{and}\quad \tilde V(\tilde{f_\star}) = \sup_{\tilde \pi} \tilde V(\tilde \pi), \] where the suprema are over *all* policies, not just stationary ones. The goal is then to bound the *stability index* \[ \Delta := V(f_\star) - V(\tilde{f_\star}) \ge 0, \] which measures how far from optimal the "approximating" policy is when used in the "real" process.
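To make the objective concrete, here is a minimal Monte Carlo sketch (not from the paper; the random-walk dynamics, the reward, and the policy are illustrative assumptions) estimating the discounted value \(V(\pi)\) of a stationary policy by truncating the infinite sum at a finite horizon.

```python
import random

def value_estimate(policy, sample_noise, alpha=0.9, horizon=200,
                   n_paths=2000, x0=0.0, seed=0):
    """Monte Carlo estimate of V(policy) = E[sum_{t>=1} alpha^{t-1} r(X_{t-1}, a_t)],
    truncated at `horizon` steps; the discarded tail is at most
    b * alpha**horizon / (1 - alpha) since |r| <= b."""
    rng = random.Random(seed)

    def r(x, a):
        # Illustrative uniformly bounded reward, |r| <= b = 1:
        # penalise the state's distance from the origin.
        return -min(1.0, abs(x))

    total = 0.0
    for _ in range(n_paths):
        x, disc, acc = x0, 1.0, 0.0
        for _ in range(horizon):
            a = policy(x)                   # action from the stationary policy
            acc += disc * r(x, a)
            x = x + a + sample_noise(rng)   # X_t = X_{t-1} + a_t + xi_t, xi_t ~ G
            disc *= alpha
        total += acc
    return total / n_paths
```

With the stationary policy \(f(x) = -x\) (drive the state back to the origin) and standard normal noise playing the role of \(G\), the estimate necessarily lies in \([-b/(1-\alpha), 0] = [-10, 0]\), illustrating how the uniform bound on \(r\) controls the value function.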
Theorem 1 establishes the bound \[ \Delta \le \frac{2 \alpha b}{(1 - \alpha)^2} d_\mathsf{TV}(G, \tilde G), \] under certain conditions, where \(d_\mathsf{TV}\) denotes the *total-variation distance*. The conditions are relatively simple, as are the statement and the proof, making the theorem appealing. However, the authors note a shortcoming: in many situations \(G\) is continuous while \(\tilde G\) is an empirical distribution, which is inherently discrete; the two measures are then mutually singular, so \(d_\mathsf{TV}(G, \tilde G)\) attains its maximal value and the bound is vacuous. Theorem 2 rectifies this issue with a more intricate bound \[ \Delta \lesssim d_\mathsf{Dudley}(G, \tilde G), \] under certain conditions, such as the reward function being Lipschitz, where \(d_\mathsf{Dudley}\) denotes the *Dudley metric* (the bounded-Lipschitz metric). The "\(\lesssim\)" symbol hides factors depending on the reward function, such as its Lipschitz constant. The proof of this theorem is considerably more detailed and nuanced.

The example where \(\tilde G\) is an empirical distribution is then worked out: very roughly, if \(G\) has finite exponential moments in an interval around \(0\), then \(d_\mathsf{Dudley}(G, \tilde G_n) \to 0\) as \(n \to \infty\), where \(\tilde G_n\) is the empirical distribution of \(n\) iid samples. By contrast, \(d_\mathsf{TV}(G, \tilde G_n)\) remains maximal for every \(n\). The assumptions of Theorem 2 are, naturally, still required. The paper closes with a few example applications. The first demonstrates the necessity of certain conditions in Theorem 2. The next two are particularly pleasing: the second models dam operations and stocks of water; the third models a controlled environmental stochastic process.
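The contrast between the two metrics can be seen numerically. The sketch below (illustrative, not from the paper) compares the empirical distribution \(\tilde G_n\) of standard normal samples against \(G = N(0,1)\) via the Wasserstein-1 distance \(\int |F_n - \Phi|\); since every function in the bounded-Lipschitz unit ball is in particular 1-Lipschitz, this quantity upper-bounds the Dudley metric, and it shrinks as \(n\) grows, whereas the total-variation distance between the continuous \(G\) and the discrete \(\tilde G_n\) stays maximal for all \(n\).

```python
import bisect
import math
import random

def w1_to_std_normal(sample, lo=-6.0, hi=6.0, steps=4000):
    """Numerical Wasserstein-1 distance between the empirical law of `sample`
    and N(0,1), computed as the integral of |F_n(x) - Phi(x)| on [lo, hi]
    (the mass outside [-6, 6] is negligible). This upper-bounds the
    Dudley (bounded-Lipschitz) metric."""
    xs = sorted(sample)
    n = len(xs)
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        f_emp = bisect.bisect_right(xs, x) / n               # empirical CDF F_n(x)
        f_true = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Phi(x)
        total += abs(f_emp - f_true) * dx
    return total

rng = random.Random(1)
d_50 = w1_to_std_normal([rng.gauss(0.0, 1.0) for _ in range(50)])
d_5000 = w1_to_std_normal([rng.gauss(0.0, 1.0) for _ in range(5000)])
```

Here `d_5000` comes out much smaller than `d_50`, mirroring the convergence \(d_\mathsf{Dudley}(G, \tilde G_n) \to 0\) discussed above; no analogous computation is possible for total variation, which cannot distinguish empirical approximations of different quality.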
optimal control policy
stability inequality
total variation
Dudley metrics