Bandits with switching costs, \(T^{2/3}\) regret
From MaRDI portal
Publication:5259581
- Learning and adaptive systems in artificial intelligence (68T05)
- Online algorithms; streaming algorithms (68W27)
- Randomized algorithms (68W20)
- Sums of independent random variables; random walks (60G50)
- Computational difficulty of problems (lower bounds, completeness, difficulty of approximation, etc.) (68Q17)
- Rationality and learning in game theory (91A26)
Abstract: We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's \(T\)-round minimax regret in this setting is \(\widetilde{\Theta}(T^{2/3})\), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of \(\Theta(\sqrt{T})\). The difference between these two rates provides the \emph{first} indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on \(T\)). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of \(\widetilde{\Theta}(T^{2/3})\). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is \(\widetilde{\Theta}(T^{2/3})\). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
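The matching \(\widetilde{O}(T^{2/3})\) upper bound comes from a standard mini-batching idea: run Exp3 over batches of roughly \(T^{1/3}\) rounds, playing one fixed arm per batch, so that the number of switches is bounded by the number of batches. The sketch below illustrates this batching strategy under stated assumptions; the function name `batched_exp3` and the loss-oracle interface `loss_fn(t, arm)` returning a value in \([0,1]\) are illustrative choices, not from the paper.

```python
import math
import random

def batched_exp3(T, k, loss_fn, batch=None):
    """Mini-batched Exp3 sketch: one arm per batch limits switches.

    With batch length ~ T**(1/3), the number of switches is at most
    T**(2/3), which is the regime where the paper's tight
    Theta~(T**(2/3)) bound lives.  `loss_fn(t, arm)` is an assumed
    oracle returning a loss in [0, 1].
    """
    if batch is None:
        batch = max(1, round(T ** (1 / 3)))
    n_batches = (T + batch - 1) // batch
    eta = math.sqrt(math.log(k) / (n_batches * k))  # Exp3 learning rate
    weights = [1.0] * k
    total_loss, switches, prev_arm, t = 0.0, 0, None, 0
    for _ in range(n_batches):
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        if prev_arm is not None and arm != prev_arm:
            switches += 1  # unit switching cost incurred here
        prev_arm = arm
        # Play the same arm for the whole batch; only the averaged
        # loss of the chosen arm is observed (bandit feedback).
        batch_loss = 0.0
        for _ in range(batch):
            if t >= T:
                break
            batch_loss += loss_fn(t, arm)
            t += 1
        # Importance-weighted update on the batch-averaged loss.
        est = (batch_loss / batch) / probs[arm]
        weights[arm] *= math.exp(-eta * est)
        total_loss += batch_loss
    return total_loss, switches
```

Against an oblivious adversary this batching pays \(\widetilde{O}(T^{2/3})\) total (regret plus switching costs); the paper's contribution is the matching lower bound, built from the multi-scale random walk, showing no strategy can do better.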
Cited in (6)
- Average optimality in a Poissonian bandit with switching arms
- Chasing Ghosts: Competing with Stateful Policies
- Online learning over a finite action set with limited switching
- Multi-armed Bandits with Metric Switching Costs
- Constrained no-regret learning
- Sharp dichotomies for regret minimization in metric spaces