Exploration-exploitation policies with almost sure, arbitrarily slow growing asymptotic regret


DOI: 10.1017/S0269964818000529
zbMATH Open: 1484.62039
arXiv: 1505.02865
OpenAlex: W2914435863
Wikidata: Q128505023 (Scholia: Q128505023)
MaRDI QID: Q5070864
FDO: Q5070864


Authors: Wesley Cowan, Michael N. Katehakis


Publication date: 14 April 2022

Published in: Probability in the Engineering and Informational Sciences

Abstract: The purpose of this paper is to provide further understanding into the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies π. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing, and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.
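To illustrate the idea behind a forcing-style policy, the following is a minimal sketch, not the paper's exact g-Forcing construction: at round n the policy forces a uniformly random exploration pull whenever the number of forced rounds so far falls below g(n), and otherwise exploits the arm with the highest sample mean. The schedule g(n) = log(n + 1) and all function names here are illustrative assumptions; the paper admits any slowly increasing g under mild regularity constraints.

```python
import math
import random

def g(n):
    # Hypothetical slowly increasing exploration schedule; the paper
    # allows any slowly increasing g under mild regularity constraints.
    return math.log(n + 1)

def g_forcing_policy(arms, horizon, g, rng=None):
    """Sketch of a forcing-style bandit policy (illustrative, not the
    paper's exact construction). Each arm is a callable rng -> reward.
    Returns (total reward, pull counts per arm)."""
    rng = rng or random.Random(0)
    counts = [0] * len(arms)   # pulls per arm
    sums = [0.0] * len(arms)   # cumulative reward per arm
    forced = 0                 # number of forced exploration rounds so far
    total = 0.0
    for n in range(1, horizon + 1):
        if forced < g(n) or 0 in counts:
            # Force exploration: pull a uniformly random arm.
            i = rng.randrange(len(arms))
            forced += 1
        else:
            # Exploit: pull the arm with the highest sample mean.
            means = [s / c for s, c in zip(sums, counts)]
            i = max(range(len(arms)), key=lambda k: means[k])
        reward = arms[i](rng)
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total, counts
```

Because g grows slowly, forced exploration rounds occur roughly when g(n) crosses each integer, so only O(g(n)) rounds are spent exploring; the "exploration" side of the tradeoff is thereby controlled directly by g, as the abstract describes.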


Full work available at URL: https://arxiv.org/abs/1505.02865










Cited In (1)





