Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

From MaRDI portal
Publication: Q5396763

DOI: 10.1561/2200000024
zbMath: 1281.91051
arXiv: 1204.5721
OpenAlex: W2950929549
Wikidata: Q59538563
Scholia: Q59538563
MaRDI QID: Q5396763

Sébastien Bubeck, Nicolò Cesa-Bianchi

Publication date: 3 February 2014

Published in: Foundations and Trends® in Machine Learning

Full work available at URL: https://arxiv.org/abs/1204.5721

Related Items

Functional Sequential Treatment Allocation, Multi-Item Nontruthful Auctions Achieve Good Revenue, Best Arm Identification for Contaminated Bandits, Greedy Algorithm Almost Dominates in Smoothed Contextual Bandits, Bayesian Exploration: Incentivizing Exploration in Bayesian Games, An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization, Ranking and Selection with Covariates for Personalized Decision Making, Practical Nonparametric Sampling Strategies for Quantile-Based Ordinal Optimization, Always Valid Inference: Continuous Monitoring of A/B Tests, Unifying mirror descent and dual averaging, Self-adjusting grid networks, Daisee: Adaptive importance sampling by balancing exploration and exploitation, Technical note—Knowledge gradient for selection with covariates: Consistency and computation, Online learning for scheduling MIP heuristics, Multi-armed bandit-based hyper-heuristics for combinatorial optimization problems, Distributed online bandit linear regressions with differential privacy, Dynamic Resource Allocation in the Cloud with Near-Optimal Efficiency, Multi-armed bandit problem with online clustering as side information, Nonparametric learning for impulse control problems -- exploration vs. exploitation, Asymptotic optimality of myopic ranking and selection procedures, Convergence rate analysis for optimal computing budget allocation algorithms, User-friendly Introduction to PAC-Bayes Bounds, Universal regression with adversarial responses, On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables, Reinforcement Learning, Bit by Bit, MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT, Control-data separation and logical condition propagation for efficient inference on probabilistic programs, Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection, Constrained regret minimization for multi-criterion multi-armed bandits, Treatment recommendation with distributional targets, Online Debiasing for Adaptively Collected High-Dimensional Data With Applications to Time Series Analysis, Efficient and generalizable tuning strategies for stochastic gradient MCMC, Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards, MNL-Bandit: A Dynamic Learning Approach to Assortment Selection, Bandits with Global Convex Constraints and Objective, Online Decision Making with High-Dimensional Covariates, Bandit-Based Task Assignment for Heterogeneous Crowdsourcing, Online Network Revenue Management Using Thompson Sampling, Tractable Sampling Strategies for Ordinal Optimization, A Knowledge Gradient Policy for Sequencing Experiments to Identify the Structure of RNA Molecules Using a Sparse Additive Belief Model, Adaptive Matching for Expert Systems with Uncertain Task Types, Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors, Optimistic Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds, Randomized allocation with arm elimination in a bandit problem with covariates, Improvements and Generalizations of Stochastic Knapsack and Markovian Bandits Approximation Algorithms, Explore First, Exploit Next: The True Shape of Regret in Bandit Problems, Learning‐based iterative modular adaptive control for nonlinear systems, Derivative-free optimization methods, Bayesian Uncertainty Directed Trial Designs, Dynamic Pricing with Multiple Products and Partially Specified Demand Distribution, Learning to Optimize via Posterior Sampling, Online Learning of Nash Equilibria in Congestion Games, Nested-Batch-Mode Learning and Stochastic Optimization with An Application to Sequential MultiStage Testing in Materials Science, Learning in Repeated Auctions, Small-Loss Bounds for Online Learning with Partial Information, A Primal–Dual Learning Algorithm for Personalized Dynamic Pricing with an Inventory Constraint, Satisficing in Time-Sensitive Bandit Learning, Gradient-free two-point methods for solving stochastic nonsmooth convex optimization problems with small non-random noises, Optimal control with learning on the fly: a toy problem, Approximation algorithms for stochastic combinatorial optimization problems, Strategic conversations under imperfect information: epistemic message exchange games, LinUCB applied to Monte Carlo tree search, Adaptive large neighborhood search for mixed integer programming, Smoothness-Adaptive Contextual Bandits, Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes, Setting Reserve Prices in Second-Price Auctions with Unobserved Bids, Learning in auctions: regret is hard, envy is easy, EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET, Tracking and Regret Bounds for Online Zeroth-Order Euclidean and Riemannian Optimization, Cloud Conveyors System: A Versatile Application for Exploring Cyber-Physical Systems, Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case, On minimaxity of follow the leader strategy in the stochastic setting, Kullback-Leibler upper confidence bounds for optimal sequential allocation, Distributed cooperative decision making in multi-agent multi-armed bandits, Personalized optimization with user's feedback, Worst-case regret analysis of computationally budgeted online kernel selection, Adaptive-treed bandits, Combining multiple strategies for multiarmed bandit problems and asymptotic optimality, Budget-limited distribution learning in multifidelity problems, Bandit-based Monte-Carlo structure learning of probabilistic logic programs, Regret minimization in online Bayesian persuasion: handling adversarial receiver's types under full and partial feedback models, A quality assuring, cost optimal multi-armed bandit mechanism for expertsourcing, Indifference-Zone-Free Selection of the Best, Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback, Truthful learning mechanisms for multi-slot sponsored search auctions with externalities, Controlling unknown linear dynamics with bounded multiplicative regret, Efficient Ranking and Selection in Parallel Computing Environments, Multi-channel transmission scheduling with hopping scheme under uncertain channel states, Learning the distribution with largest mean: two bandit frameworks, Scale-free online learning, Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points, A reliability-aware multi-armed bandit approach to learn and select users in demand response, An adversarial model for scheduling with testing, Undiscounted bandit games, An optimal bidimensional multi-armed bandit auction for multi-unit procurement, Learning in games with continuous action sets and unknown payoff functions, A unified framework for stochastic optimization, Regret bounds for restless Markov bandits, Exploratory distributions for convex functions, Ballooning multi-armed bandits, Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex, A continuous-time approach to online optimization, Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards, Boundary crossing probabilities for general exponential families, Mistake bounds on the noise-free multi-armed bandit game, An online algorithm for the risk-aware restless bandit, Multi-Armed Bandit for Species Discovery: A Bayesian Nonparametric Approach, Active ranking from pairwise comparisons and when parametric assumptions do not help, Adaptive policies for perimeter surveillance problems, New bounds on the price of bandit feedback for mistake-bounded online multiclass learning, A note on a tight lower bound for capacitated MNL-bandit assortment selection models, Concentration bounds for empirical conditional value-at-risk: the unbounded case, Multi-armed bandits based on a variant of simulated annealing, Mechanisms with learning for stochastic multi-armed bandit problems, On Monte-Carlo tree search for deterministic games with alternate moves and complete information, Learning to Optimize via Information-Directed Sampling, Sequential Interdiction with Incomplete Information and Learning, Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games, Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback, Differentially Private and Budget-Limited Bandit Learning over Matroids, Dynamic Assortment Personalization in High Dimensions, Bayesian Incentive-Compatible Bandit Exploration, Multi-armed bandit with sub-exponential rewards, A revised approach for risk-averse multi-armed bandits under CVaR criterion, Online Collaborative Filtering on Graphs, On the Prior Sensitivity of Thompson Sampling, Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models, On testing transitivity in online preference learning, Online Learning over a Finite Action Set with Limited Switching, Response prediction for low-regret agents, Learning Unknown Service Rates in Queues: A Multiarmed Bandit Approach, Matching While Learning, Meta-inductive prediction based on attractivity weighting: mathematical and empirical performance evaluation, Online Allocation and Pricing: Constant Regret via Bellman Inequalities, Nonparametric Pricing Analytics with Customer Covariates, From reinforcement learning to optimal control: a unified framework for sequential decisions, Trading utility and uncertainty: applying the value of information to resolve the exploration-exploitation dilemma in reinforcement learning, Noisy zeroth-order optimization for non-smooth saddle point problems, A Bandit-Learning Approach to Multifidelity Approximation, Optimal Policy for Dynamic Assortment Planning Under Multinomial Logit Models, On the efficiency of a randomized mirror descent algorithm in online optimization problems, Order scoring, bandit learning and order cancellations