On Bayesian index policies for sequential resource allocation (Q1750289)

!

WARNING

This is the item page for this Wikibase entity, intended for internal use and editing purposes.

Please use the normal view instead:

On Bayesian index policies for sequential resource allocation

scientific article; zbMATH DE number 6870281

Language	Label	Description	Also known as
default for all languages	No label defined
English	On Bayesian index policies for sequential resource allocation	scientific article; zbMATH DE number 6870281

Statements

instance of

scholarly article

0 references

title

On Bayesian index policies for sequential resource allocation (English)

0 references

author

Emilie Kaufmann

0 references

published in

The Annals of Statistics

0 references

publication date

18 May 2018

0 references

full work available at URL

https://arxiv.org/abs/1601.01190

0 references

review text

Consider the stochastic multi-armed bandit problem. So, it is assumed that there are $K$ ($K \geq 2$) arms with which an agent interacts in a sequential way on the control horizon $T$. The choice of the $a$-th arm is accompanied with a random reward the distribution of which depends only on the chosen arm and is unknown to the agent. The goal is to maximize (in some sense) the total expected reward. If the agent knew distributions of the rewards, he should always use the oracle policy and choose the arm corresponding to the largest mathematical expectation of the reward. However, for the actually used policy $A$ the total expected reward is less than corresponding to the oracle policy by a value of the regret $R(T,A)$. In the article, asymptotically optimal Bayesian index policies for the multi-armed bandit problem are proposed for a large class of distributions of the rewards belonging to a one-dimensional exponential family. For these strategies $R(T,A)$, considered in the frequentist view (i.e. for some fixed unknown distributions of the rewards), has the order $\log T$. The proposed policies are the Bayes-UCB algorithms which rely on quantiles of posterior distributions. Bayesian insight on alternative exploration rates is also presented. To this end, proposed algorithms are compared with finite-horizon Gittins indices, kl-UCB$^+$ and kl-UCB-H$^{+}$ algorithms. Numerical results are given.

0 references

reviewed by

Alex V. Kolnogorov

0 references

zbMATH Keywords

multi-armed bandit problems

0 references

Bayesian methods

0 references

upper-confidence bounds

0 references

Gittins indices