Infomax strategies for an optimal balance between exploration and exploitation
From MaRDI portal
Publication:310029
Abstract: Proper balance between exploitation and exploration is what makes good decisions, which achieve high rewards like payoff or evolutionary fitness. The Infomax principle postulates that maximization of information directs the function of diverse systems, from living systems to artificial neural networks. While specific applications are successful, the validity of information as a proxy for reward remains unclear. Here, we consider the multi-armed bandit decision problem, which features arms (slot-machines) of unknown probabilities of success and a player trying to maximize cumulative payoff by choosing the sequence of arms to play. We show that an Infomax strategy (Info-p) which optimally gathers information on the highest mean reward among the arms saturates known optimal bounds and compares favorably to existing policies. The highest mean reward considered by Info-p is not the quantity actually needed for the choice of the arm to play, yet it allows for optimal tradeoffs between exploration and exploitation.
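The bandit setting described in the abstract can be made concrete with a small simulation. The sketch below is not Info-p, since the abstract does not specify that policy in enough detail to implement it. It assumes Bernoulli arms and uses Thompson sampling with uniform Beta(1, 1) priors as a stand-in baseline policy. The function name `simulate_bernoulli_bandit`, its parameters, and the arm probabilities in the usage note are hypothetical choices made for illustration.

```python
import random


def simulate_bernoulli_bandit(success_probs, horizon, seed=0):
    """Play a Bernoulli multi-armed bandit for `horizon` rounds.

    Assumption: the policy is Thompson sampling with Beta(1, 1) priors,
    used here only as a stand-in baseline. It is NOT the Info-p strategy.
    Returns (pulls per arm, total reward, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    successes = [0] * n_arms   # observed wins per arm
    failures = [0] * n_arms    # observed losses per arm
    pulls = [0] * n_arms       # number of times each arm was played
    best_mean = max(success_probs)
    total_reward = 0
    pseudo_regret = 0.0

    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior over its success probability.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled success probability is largest.
        arm = max(range(n_arms), key=samples.__getitem__)

        # The environment returns a 0/1 payoff; the player never sees success_probs.
        reward = 1 if rng.random() < success_probs[arm] else 0

        pulls[arm] += 1
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
        # Pseudo-regret: expected payoff lost by not playing the best arm.
        pseudo_regret += best_mean - success_probs[arm]

    return pulls, total_reward, pseudo_regret
```

For example, `simulate_bernoulli_bandit([0.2, 0.5, 0.8], horizon=2000, seed=1)` returns the per-arm pull counts, the cumulative payoff, and the pseudo-regret. Exploration shows up as the pulls spent on the two weaker arms, and exploitation as the pulls concentrated on the arm that appears best.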
Recommendations
- A dynamic programming strategy to balance exploration and exploitation in the bandit problem
- scientific article; zbMATH DE number 1907146
- Learning to optimize via information-directed sampling
- Optimal selection with alternative information
- Reinforcement learning: exploration-exploitation dilemma in multi-agent foraging task
Cites work
- scientific article; zbMATH DE number 3889341
- scientific article; zbMATH DE number 4078557
- scientific article; zbMATH DE number 3638998
- scientific article; zbMATH DE number 1168332
- scientific article; zbMATH DE number 1983334
- scientific article; zbMATH DE number 2061729
- scientific article; zbMATH DE number 236854
- scientific article; zbMATH DE number 3316587
- A Mathematical Theory of Communication
- A bound on the financial value of information
- Adaptive treatment allocation and the multi-armed bandit problem
- An asymptotically optimal policy for finite support models in the multiarmed bandit problem
- Asymptotically efficient adaptive allocation rules
- Elements of Information Theory
- Finite-time analysis of the multiarmed bandit problem
- Information, Physics, and Computation
- Kullback-Leibler upper confidence bounds for optimal sequential allocation
- Multi-armed bandit allocation indices. With a foreword by Peter Whittle.
- Optimal Adaptive Policies for Markov Decision Processes
- Optimal stopping and dynamic allocation
- Rényi Divergence and Kullback-Leibler Divergence
- The value of information for populations in varying environments
- Thompson sampling: an asymptotically optimal finite-time analysis
Cited in (2)