Whittle index based Q-learning for restless bandits with average reward
From MaRDI portal
Publication:2116660
DOI10.1016/J.AUTOMATICA.2022.110186zbMATH Open1485.93341arXiv2004.14427OpenAlexW3022217175MaRDI QIDQ2116660FDOQ2116660
Konstantin Avrachenkov, Vivek Borkar
Publication date: 18 March 2022
Published in: Automatica (Search for Journal in Brave)
Abstract: A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme.
Full work available at URL: https://arxiv.org/abs/2004.14427
Artificial neural networks and deep learning (68T07) Discrete event control/observation systems (93C65)
Cites Work
- Title not available (Why is that?)
- Title not available (Why is that?)
- Title not available (Why is that?)
- Title not available (Why is that?)
- Title not available (Why is that?)
- Title not available (Why is that?)
- Title not available (Why is that?)
- Bandit Algorithms
- Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges
- Multi‐Armed Bandit Allocation Indices
- On an index policy for restless bandits
- Simulation-based optimization of Markov reward processes
- Learning algorithms for Markov decision processes with average cost
- Asynchronous Stochastic Approximations
- Index policies for the maintenance of a collection of machines by a set of repairmen
- The complexity of optimal queuing network control
- Indexability of Restless Bandit Problems and Optimality of Whittle Index for Dynamic Multichannel Access
- Admission and routing of soft real-time jobs to multiclusters: design and comparison of index policies
- Convergence results for single-step on-policy reinforcement-learning algorithms
- An actor-critic algorithm for constrained Markov decision processes
- The Borkar-Meyn theorem for asynchronous stochastic approximations
- Index Policies for the Admission Control and Routing of Impatient Customers to Heterogeneous Service Stations
- Indexability and Index Heuristics for a Simple Class of Inventory Routing Problems
- Asymptotically optimal index policies for an abandonment queue with convex holding cost
- A stability criterion for two timescale stochastic approximation schemes
- An index policy for dynamic pricing in cloud computing under price commitments
- Whittle Index Policy for Crawling Ephemeral Content
- Two Time-Scale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal-Difference Learning
Cited In (2)
This page was built for publication: Whittle index based Q-learning for restless bandits with average reward
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q2116660)