An Efficient Algorithm for Learning with Semi-bandit Feedback

DOI: 10.1007/978-3-642-40935-6_17
MaRDI QID: Q2859220

Publication date 6 November 2013

Published in Lecture Notes in Computer Science

Full work available at URL https://arxiv.org/abs/1305.2732

Keywords: combinatorial optimization; bandit problems; online learning; follow-the-perturbed-leader

Mathematics Subject Classification: Learning and adaptive systems in artificial intelligence (68T05); Combinatorial optimization (90C27); General considerations in statistical decision theory (62C05)

Abstract: We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem based on combining the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described with d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m sqrt(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full information setting to O(m^(3/2) sqrt(T log d)), gaining a factor of sqrt(d/m) over previous bounds for this algorithm.
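The abstract names the two ingredients but does not spell out the procedure. The following is a minimal sketch of how FPL and Geometric Resampling might fit together, not the authors' implementation. It rests on several assumptions that are not stated above: the decision set is the toy case of all binary vectors with exactly m ones (so the offline optimizer is a sort), the perturbations are i.i.d. exponential, and the number of resampling draws is capped by a parameter `cap`. All function and parameter names (`offline_oracle`, `fpl_draw`, `geometric_resampling`, `run_fpl_gr`, `eta`, `cap`) are hypothetical.

```python
import random

def offline_oracle(scores, m):
    """Toy offline optimizer: binary vector selecting the m smallest scores."""
    chosen = set(sorted(range(len(scores)), key=lambda i: scores[i])[:m])
    return [1 if i in chosen else 0 for i in range(len(scores))]

def fpl_draw(cum_est, eta, m, rng):
    """One FPL decision: perturb the estimated cumulative losses, then optimize."""
    # Assumption: i.i.d. exponential perturbations with rate 1.
    perturbed = [eta * c - rng.expovariate(1.0) for c in cum_est]
    return offline_oracle(perturbed, m)

def geometric_resampling(cum_est, eta, m, played, cap, rng):
    """For each played component, count fresh FPL draws until it is selected
    again, stopping after at most `cap` draws."""
    counts = [0] * len(played)
    waiting = {i for i, bit in enumerate(played) if bit}
    for k in range(1, cap + 1):
        if not waiting:
            break
        redraw = fpl_draw(cum_est, eta, m, rng)
        for i in [i for i in waiting if redraw[i]]:
            counts[i] = k          # first draw in which component i reappears
            waiting.discard(i)
    for i in waiting:
        counts[i] = cap            # never reappeared: truncate at the cap
    return counts

def run_fpl_gr(losses, m, eta, cap, seed=0):
    """Play one action per round; only losses of played components are used."""
    rng = random.Random(seed)
    d = len(losses[0])
    cum_est = [0.0] * d            # estimated cumulative loss per component
    total = 0.0
    for loss in losses:
        v = fpl_draw(cum_est, eta, m, rng)
        total += sum(loss[i] for i in range(d) if v[i])
        k = geometric_resampling(cum_est, eta, m, v, cap, rng)
        for i in range(d):
            if v[i]:               # semi-bandit feedback: observed entries only
                cum_est[i] += k[i] * loss[i]
    return total, cum_est

# Usage: 6 components, pick 2 per round; the first two have small losses.
data_rng = random.Random(1)
losses = [[0.1 * data_rng.random() if i < 2 else data_rng.random()
           for i in range(6)] for _ in range(300)]
total, cum_est = run_fpl_gr(losses, m=2, eta=0.05, cap=50)
print(total, [round(c, 1) for c in cum_est])
```

The resampling count stands in for the inverse of the probability that a component is selected, which is why the sketch never needs to compute that probability explicitly. It only calls the offline optimizer repeatedly, in line with the abstract's claim that efficient offline optimization is enough.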

Cited in: 13
