On Bayesian index policies for sequential resource allocation (Q1750289)
From MaRDI portal
scientific article
Language | Label | Description | Also known as
---|---|---|---
English | On Bayesian index policies for sequential resource allocation | scientific article |
Statements
On Bayesian index policies for sequential resource allocation (English)
18 May 2018
Consider the stochastic multi-armed bandit problem: an agent interacts sequentially with \(K\) (\(K \geq 2\)) arms over a control horizon \(T\). Choosing arm \(a\) yields a random reward whose distribution depends only on the chosen arm and is unknown to the agent. The goal is to maximize (in an appropriate sense) the total expected reward. If the agent knew the reward distributions, it would follow the oracle policy and always choose the arm with the largest expected reward. For any policy \(A\) actually used, the total expected reward falls short of the oracle policy's by the regret \(R(T,A)\). The article proposes asymptotically optimal Bayesian index policies for the multi-armed bandit problem for a large class of reward distributions belonging to a one-dimensional exponential family. For these strategies, \(R(T,A)\), considered in the frequentist view (i.e., for fixed unknown reward distributions), is of order \(\log T\). The proposed policies are Bayes-UCB algorithms, which rely on quantiles of posterior distributions. Bayesian insight into alternative exploration rates is also presented; to this end, the proposed algorithms are compared with finite-horizon Gittins indices and with the kl-UCB\(^+\) and kl-UCB-H\(^{+}\) algorithms. Numerical results are given.
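To illustrate the quantile-based index described above, here is a minimal sketch of a Bayes-UCB-style policy for Bernoulli arms. It is not the paper's implementation: the function name, the quantile level \(1 - 1/(t(\log T)^c)\) for the Beta posterior, and the Monte Carlo approximation of the posterior quantile (used here instead of an exact inverse-Beta computation, to stay within the standard library) are all assumptions made for this example.

```python
import math
import random

def bayes_ucb_bernoulli(arm_means, T, c=0, n_samples=500, seed=0):
    """Illustrative Bayes-UCB-style simulation for Bernoulli bandits.

    At round t, each arm's index is an (approximate) quantile of order
    1 - 1/(t * (log T)^c) of its Beta posterior; the quantile is
    estimated here by Monte Carlo sampling from the posterior.
    """
    rng = random.Random(seed)
    K = len(arm_means)
    successes = [0] * K   # observed rewards per arm
    pulls = [0] * K       # number of times each arm was chosen
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1     # initialisation: pull each arm once
        else:
            level = 1.0 - 1.0 / (t * max(math.log(T), 1.0) ** c)
            indices = []
            for k in range(K):
                alpha = successes[k] + 1            # Beta(1,1) prior
                beta = pulls[k] - successes[k] + 1
                samples = sorted(rng.betavariate(alpha, beta)
                                 for _ in range(n_samples))
                # empirical quantile of the posterior samples
                q = samples[min(int(level * n_samples), n_samples - 1)]
                indices.append(q)
            a = max(range(K), key=indices.__getitem__)
        r = 1 if rng.random() < arm_means[a] else 0
        successes[a] += r
        pulls[a] += 1
        total_reward += r
    # empirical regret against always playing the best arm
    regret = T * max(arm_means) - total_reward
    return regret, pulls

# Example run: with a clear gap between the arms, the better arm
# should accumulate the vast majority of the pulls.
regret, pulls = bayes_ucb_bernoulli([0.3, 0.6], T=500)
```

The quantile level rising toward 1 as \(t\) grows is what drives exploration: an arm pulled rarely has a wide posterior, hence a high upper quantile, and keeps being tried until its posterior concentrates.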
multi-armed bandit problems
Bayesian methods
upper-confidence bounds
Gittins indices