Exploration-exploitation policies with almost sure, arbitrarily slow growing asymptotic regret
From MaRDI portal
Publication:5070864
Abstract: The purpose of this paper is to provide further understanding into the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies . For any slowly increasing function , subject to mild regularity constraints, we construct two policies (the -Forcing, and the -Inflated Sample Mean) that achieve a measure of regret of order almost surely as , bound from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.
Recommendations
- Finite-time analysis of the multiarmed bandit problem
- Sample mean based index policies by O(log n) regret for the multi-armed bandit problem
- Explore first, exploit next: the true shape of regret in bandit problems
- On Bayesian index policies for sequential resource allocation
- Pure exploration in multi-armed bandits problems
Cites work
- scientific article; zbMATH DE number 6982311 (Why is no real title available?)
- An asymptotically optimal policy for finite support models in the multiarmed bandit problem
- Asymptotically efficient adaptive allocation rules
- Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
- Explore first, exploit next: the true shape of regret in bandit problems
- Finite-time analysis of the multiarmed bandit problem
- Multi-armed bandits under general depreciation and commitment
- Normal bandits of unknown means and variances
- Optimal adaptive policies for sequential allocation problems
- Regret analysis of stochastic and nonstochastic multi-armed bandit problems
- Regret bounds for reinforcement learning via Markov chain concentration
- Some aspects of the sequential design of experiments
This page was built for publication: Exploration-exploitation policies with almost sure, arbitrarily slow growing asymptotic regret
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q5070864)