Learning to optimize via information-directed sampling
Publication: 4969321
DOI: 10.1287/OPRE.2017.1663
zbMATH Open: 1458.90497
arXiv: 1403.5556
OpenAlex: W2765733960
MaRDI QID: Q4969321
FDO: Q4969321
Daniel Russo, Benjamin Van Roy
Publication date: 5 October 2020
Published in: Operations Research
Abstract: We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. We illustrate through simple analytic examples how information-directed sampling accounts for kinds of information that alternative approaches do not adequately address and that this can lead to dramatic performance gains. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate state-of-the-art simulation performance.
Full work available at URL: https://arxiv.org/abs/1403.5556
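To make the selection rule described in the abstract concrete, here is a minimal, illustrative Python sketch of sample-based information-directed sampling for a K-armed Bernoulli bandit. It assumes posterior samples of the arm means are available; the function name `ids_action`, the grid search over mixing probabilities, and the sample-based estimates of regret and mutual information are simplifications for illustration, not the paper's exact algorithm.

```python
import numpy as np

def ids_action(theta_samples, rng=None, n_q=101, eps=1e-12):
    """One step of sample-based information-directed sampling (IDS) for a
    K-armed Bernoulli bandit, given an (M, K) array of posterior samples of
    the arm means. Returns the index of the arm to play."""
    rng = np.random.default_rng() if rng is None else rng
    M, K = theta_samples.shape
    best_arm = theta_samples.argmax(axis=1)           # optimal arm under each posterior sample
    p_star = np.bincount(best_arm, minlength=K) / M   # P(A* = a)

    # Expected single-period regret: Delta(a) = E[theta_{A*} - theta_a].
    Delta = (theta_samples.max(axis=1)[:, None] - theta_samples).mean(axis=0)

    # Information gain g(a) = I(A*; Y_a) for Bernoulli observations:
    # g(a) = sum_{a*} P(A* = a*) * KL( Bern(E[theta_a | A* = a*]) || Bern(E[theta_a]) ).
    def kl_bern(p, q):
        p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    mu = theta_samples.mean(axis=0)
    g = np.zeros(K)
    for a_star in range(K):
        if p_star[a_star] > 0:
            mu_cond = theta_samples[best_arm == a_star].mean(axis=0)
            g += p_star[a_star] * kl_bern(mu_cond, mu)

    # Minimize the information ratio Psi(pi) = Delta(pi)^2 / g(pi).  The
    # minimizer is supported on at most two actions, so it suffices to search
    # over action pairs and a grid of mixing probabilities q.
    q_grid = np.linspace(0.0, 1.0, n_q)
    best = (np.inf, 0, 0, 1.0)
    for a in range(K):
        for b in range(K):
            d = q_grid * Delta[a] + (1 - q_grid) * Delta[b]
            info = q_grid * g[a] + (1 - q_grid) * g[b]
            ratio = d ** 2 / np.maximum(info, eps)
            i = int(ratio.argmin())
            if ratio[i] < best[0]:
                best = (ratio[i], a, b, q_grid[i])
    _, a, b, q = best
    return a if rng.random() < q else b
```

In a Beta-Bernoulli model, for example, `theta_samples` could be drawn as `np.random.default_rng().beta(alpha, beta, size=(1000, K))` with per-arm posterior parameters `alpha` and `beta` (hypothetical names used only for this usage note).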
Cites Work
- Title not available
- Bayesian experimental design: A review
- Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting
- Title not available
- An informational approach to the global optimization of expensive-to-evaluate functions
- Title not available
- Asymptotically efficient adaptive allocation rules
- Computing a classic index for finite-horizon bandits
- Multi-Armed Bandit Allocation Indices
- Finite-time analysis of the multiarmed bandit problem
- Entropy and Information Theory
- Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint
- Title not available
- On a Measure of the Information Provided by an Experiment
- Kullback-Leibler upper confidence bounds for optimal sequential allocation
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
- Title not available
- Partial Monitoring—Classification, Regret Bounds, and Algorithms
- Adaptive treatment allocation and the multi-armed bandit problem
- Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
- Dynamic Pricing Under a General Parametric Choice Model
- Title not available
- Near-optimal regret bounds for reinforcement learning
- The knowledge gradient algorithm for a general class of online learning problems
- A Knowledge-Gradient Policy for Sequential Information Collection
- Title not available
- Learning to Optimize via Posterior Sampling
- Twenty Questions with Noise: Bayes Optimal Policies for Entropy Loss
- Linearly Parameterized Bandits
- Asymptotically efficient adaptive allocation schemes for controlled Markov chains: finite parameter space
- Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: finite parameter space
- Refined knowledge-gradient policy for learning probabilities
- An information-theoretic analysis of Thompson sampling
- Bisection search with noisy responses
- Asymptotically Efficient Adaptive Choice of Control Laws in Controlled Markov Chains
- Regret in Online Combinatorial Optimization
Cited In (11)
- Reinforcement Learning, Bit by Bit
- Title not available
- Finding the optimal exploration-exploitation trade-off online through Bayesian risk estimation and minimization
- Simple Bayesian Algorithms for Best-Arm Identification
- Approximating the operating characteristics of Bayesian uncertainty directed trial designs
- Exploratory distributions for convex functions
- Title not available
- Title not available
- Dynamic Programs with Shared Resources and Signals: Dynamic Fluid Policies and Asymptotic Optimality
- Optimistic Gittins Indices
- Online team formation under different synergies