Abstract: We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after T steps achieves Õ(√T) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
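To make the "restless" property concrete, here is a minimal toy simulation, assuming hypothetical 2-state arms: every arm's state advances at every step, whether or not that arm is played, and the learner only observes the reward of the arm it pulls. The transition probabilities, rewards, and round-robin learner below are illustrative placeholders, not the paper's algorithm.

```python
import random

def step_state(state, p_stay):
    """One Markov step for a 2-state chain: stay with prob p_stay, else switch."""
    return state if random.random() < p_stay else 1 - state

def run(T, seed=0):
    random.seed(seed)
    p_stay = [0.9, 0.6]                # per-arm self-transition probabilities (illustrative)
    reward = [(0.0, 1.0), (0.2, 0.5)]  # reward of arm i in state 0 / state 1 (illustrative)
    states = [0, 0]
    total = 0.0
    for t in range(T):
        arm = t % 2                    # naive round-robin learner, a placeholder policy
        total += reward[arm][states[arm]]
        # restless: BOTH arms' states evolve, not just the pulled one
        states = [step_state(s, p_stay[i]) for i, s in enumerate(states)]
    return total / T

avg = run(10_000)
```

The regret the abstract refers to compares such a learner's cumulative reward over T steps against the best policy that knows all arms' transition distributions in advance.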
Cites work
- scientific article; zbMATH DE number 4087408
- scientific article; zbMATH DE number 3638998
- scientific article; zbMATH DE number 700091
- scientific article; zbMATH DE number 2086977
- Asymptotically efficient adaptive allocation rules
- Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part II: Markovian rewards
- Equivalence notions and model minimization in Markov decision processes
- Finite-time analysis of the multiarmed bandit problem
- Near-optimal regret bounds for reinforcement learning
- On Chebyshev-Type Inequalities for Primes
- On the possibility of learning in reactive environments with arbitrary dependence
- Pseudometrics for State Aggregation in Average Reward Markov Decision Processes
- Regret analysis of stochastic and nonstochastic multi-armed bandit problems
- The Nonstochastic Multiarmed Bandit Problem
- Threshold limits for cover times
Cited in (12)
- Bounded Regret for Finitely Parameterized Multi-Armed Bandits
- Regret bounds for Narendra-Shapiro bandit algorithms
- Whittle index based Q-learning for restless bandits with average reward
- Approximations of the restless bandit problem
- An online algorithm for the risk-aware restless bandit
- Regret Bounds for Restless Markov Bandits
- Learning unknown service rates in queues: a multiarmed bandit approach
- Multi-armed bandit problem with online clustering as side information
- Learning in structured MDPs with convex cost functions: improved regret bounds for inventory management
- A new bandit setting balancing information from state evolution and corrupted context
- Approximation algorithms for restless bandit problems
- Optimal exploration-exploitation in a multi-armed bandit problem with non-stationary rewards
This page was built for publication: Regret bounds for restless Markov bandits
MaRDI item Q465253