An Efficient Algorithm for Learning with Semi-bandit Feedback
From MaRDI portal
Abstract: We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem based on combining the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described with d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m sqrt(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full information setting to O(m^(3/2) sqrt(T log d)), gaining a factor of sqrt(d/m) over previous bounds for this algorithm.
Recommendations
- A better resource allocation algorithm with semi-bandit feedback
- A learning algorithm for the finite-time two-armed bandit problem
- Mechanisms with learning for stochastic multi-armed bandit problems
- Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards
- scientific article; zbMATH DE number 6542809
- A minimax and asymptotically optimal algorithm for stochastic bandits
- Improved algorithms for bandit with graph feedback via regret decomposition
- Stochastic convex optimization with bandit feedback
- A randomized stochastic approximation algorithm for self-learning
- Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits
Cited in
(13)- An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback
- scientific article; zbMATH DE number 6542809 (Why is no real title available?)
- Regret in online combinatorial optimization
- Combinatorial bandits
- Beyond the hazard rate: more perturbation algorithms for adversarial multi-armed bandits
- An efficient algorithm for cooperative semi-bandits
- A better resource allocation algorithm with semi-bandit feedback
- An improved upper bound on the expected regret of UCB-type policies for a matching-selection bandit problem
- Online learning over a finite action set with limited switching
- Efficient algorithms for combinatorial online prediction
- A Survey of Preference-Based Online Learning with Bandit Algorithms
- Importance weighting without importance weights: an efficient algorithm for combinatorial semi-bandits
- scientific article; zbMATH DE number 7064051 (Why is no real title available?)
This page was built for publication: An Efficient Algorithm for Learning with Semi-bandit Feedback
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q2859220)