Sample mean based index policies by O(log n) regret for the multi-armed bandit problem

DOI10.2307/1427934zbMATH Open0840.90129OpenAlexW2000080679MaRDI QIDQ4862097FDOQ4862097

Publication date: 9 July 1996

Published in: Advances in Applied Probability (Search for Journal in Brave)

Full work available at URL: https://doi.org/10.2307/1427934

Recommendations

On the bias, risk, and consistency of sample means in multi-armed bandits
An index-based deterministic convergent optimal algorithm for constrained multi-armed bandit problems
On an index policy for restless bandits
The sample complexity of exploration in the multi-armed bandit problem
Lower bounds on the sample complexity of exploration in the multi-armed bandit problem.
Index-based policies for discounted multi-armed bandits on parallel machines.

large deviations asymptotically efficient multi-armed bandit problem upper confidence bounds Kullback-Leibler number stochastic adaptive control non-Bayesian infinite horizon version

Mathematics Subject Classification ID

Markov and semi-Markov decision processes (90C40) Optimal stochastic control (93E20) Stochastic learning and adaptive control (93E35)

Cited In (48)

Robustness of stochastic bandit policies
Learning the distribution with largest mean: two bandit frameworks
Geiringer theorems: from population genetics to computational intelligence, memory evolutive systems and Hebbian learning
Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards
The multi-armed bandit problem: an efficient nonparametric solution
Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem
Kullback-Leibler upper confidence bounds for optimal sequential allocation
How fragile are information cascades?
On Bayesian index policies for sequential resource allocation
UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem
Exploration-exploitation policies with almost sure, arbitrarily slow growing asymptotic regret
Boundary crossing probabilities for general exponential families
Functional Sequential Treatment Allocation
Dealing with expert bias in collective decision-making
An online algorithm for the risk-aware restless bandit
An asymptotically optimal policy for finite support models in the multiarmed bandit problem
Approximate indexability and bandit problems with concave rewards and delayed feedback
An improved upper bound on the expected regret of UCB-type policies for a matching-selection bandit problem
Finite-time analysis for the knowledge-gradient policy
Boundary crossing for general exponential families
A confirmation of a conjecture on Feldman’s two-armed bandit problem
Normal bandits of unknown means and variances
Factorial Designs for Online Experiments
Explore first, exploit next: the true shape of regret in bandit problems
The sample complexity of exploration in the multi-armed bandit problem
Nonasymptotic Analysis of Monte Carlo Tree Search
Continuous Assortment Optimization with Logit Choice Probabilities and Incomplete Information
Tuning Bandit Algorithms in Stochastic Environments
A revised approach for risk-averse multi-armed bandits under CVaR criterion
Convergence rate analysis for optimal computing budget allocation algorithms
Infinite Arms Bandit: Optimality via Confidence Bounds
Multi-armed bandits based on a variant of simulated annealing
Wisdom of crowds versus groupthink: learning in groups and in isolation
Optimistic Gittins Indices
Multi-agent reinforcement learning: a selective overview of theories and algorithms
Empirical Gittins index strategies with \(\varepsilon\)-explorations for multi-armed bandit problems
Efficient crowdsourcing of unknown experts using bounded multi-armed bandits
Some memoryless bandit policies
Linearly parameterized bandits
Title not available (Why is that?)
Gittins' theorem under uncertainty
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
Lower bounds on the sample complexity of exploration in the multi-armed bandit problem.
Exploration and exploitation of scratch games
Derivative-free optimization methods
Deviations of stochastic bandit regret
A non-parametric solution to the multi-armed bandit problem with covariates
Finite-time lower bounds for the two-armed bandit problem

This page was built for publication: Sample mean based index policies by O(log n) regret for the multi-armed bandit problem

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q4862097)