Estimating the value of a discounted reward process (Q1196212)

    Statements

    Publication date: 17 December 1992
    The paper considers a discounted reward process defined by a sequence of random variables \(\{r_t\}\), \(t\in\{1,2,\dots\}\), and a discount factor \(\lambda\) with \(0\leq\lambda<1\). Under the assumption \(|E(r_t)|\leq M<\infty\) for some \(M>0\), the expected total discounted reward, or value, of the process is defined by \(f(\lambda)\equiv E\sum_{t=1}^{\infty}\lambda^{t-1} r_t\). When \(t\in\{1,2,\dots,T\}\) for some stopping time \(T\), a terminating reward process results. For all but the simplest models, evaluating \(f(\lambda)\) requires simulation or experimentation. An unbiased estimator of \(f(\lambda)\) is obtained by sampling cumulative sums of the rewards up to independent negative binomial stopping times. When the rewards are positive, this estimator proves to be monotone in the sample variate. Motivated by results of \textit{C. Derman}, \textit{B. L. Fox} and \textit{P. W. Glynn}, the approach rests on a differential equation that relates the expected total discounted return of a reward process to the expected total undiscounted return of the process terminated at a negative binomial stopping time \(T^*\). The advantages of this procedure for practical applications are discussed, for instance in designing experiments where the cost per experimental unit is high and the cost per time unit of observation is low. A further advantage is that it yields an easily computed estimator of the derivative of \(f(\lambda)\) with respect to \(\lambda\), which describes the sensitivity of \(f(\lambda)\) to changes in \(\lambda\). Additional results concern the variance properties of the estimator and simulations.
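    To make the mechanism concrete: in the simplest negative binomial case (one success, i.e. a geometric stopping time \(T\) with \(P(T=t)=(1-\lambda)\lambda^{t-1}\)), one has \(P(T\geq t)=\lambda^{t-1}\), so if \(T\) is drawn independently of the rewards, \(E\sum_{t=1}^{T} r_t=\sum_{t\geq 1}E(r_t)\,P(T\geq t)=f(\lambda)\). The following Python code is a minimal sketch of that special case on a toy i.i.d. reward process; the function names and the reward model are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def estimate_value(sample_reward, lam, n_reps=10_000, rng=None):
    """Monte Carlo sketch: unbiased estimate of f(lam) = E sum_t lam^(t-1) r_t.

    Draws a geometric stopping time T with P(T = t) = (1 - lam) * lam**(t - 1),
    so that P(T >= t) = lam**(t - 1), and averages the undiscounted cumulative
    reward sum_{t=1}^{T} r_t over n_reps independent replications.
    `sample_reward(t, rng)` is a hypothetical user-supplied reward sampler.
    """
    rng = np.random.default_rng() if rng is None else rng
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        # numpy's geometric distribution is supported on {1, 2, ...}
        T = rng.geometric(1.0 - lam)
        estimates[i] = sum(sample_reward(t, rng) for t in range(1, T + 1))
    return estimates.mean()

if __name__ == "__main__":
    lam = 0.9
    # Toy reward process: i.i.d. rewards with mean 1, so f(lam) = 1/(1 - lam) = 10.
    reward = lambda t, rng: rng.normal(loc=1.0, scale=0.5)
    print(estimate_value(reward, lam))
```

    The sketch covers only the geometric (one-success) case; the paper's general construction uses negative binomial stopping times, which also yield the derivative estimator for \(f(\lambda)\) mentioned above.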
    Keywords: discounted reward process; expected total discounted reward; unbiased estimator; sampling cumulative sums of the rewards; independent negative binomial stopping times; differential equation; expected total discounted return; expected total undiscounted return; variance properties; simulations