Chasing Ghosts: Competing with Stateful Policies
From MaRDI portal
Publication:2968152
Abstract: We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called "bandit" setting). If either the reference policies are stateless rather than stateful, or the feedback includes the rewards of all actions (the so-called "expert" setting), previous work shows that the optimal regret grows like \(\Theta(\sqrt{T})\) in terms of the number of decision rounds \(T\). The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies, and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total reward. Nevertheless, we design an algorithm that achieves expected regret that is sublinear in \(T\), of the form \(O(T/\log^{1/4}T)\). Our algorithm is based on a certain local repetition lemma that may be of independent interest. We also show that no algorithm can guarantee expected regret better than \(\Omega(T/\log^{3/2}T)\).
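The core difficulty described in the abstract can be illustrated with a small sketch. The toy below (all names, policies, and sizes are hypothetical illustrations, not constructions from the paper) runs two stateful reference policies whose internal state evolves with the reward of *their own* chosen action, and a naive bandit learner that observes only the reward of the action it actually plays, so it can never reliably track the policies' states:

```python
import random

random.seed(0)
T, K = 1000, 2  # rounds and actions (toy sizes, for illustration only)

# A fixed reward table: rewards[t][a] in {0, 1}.
rewards = [[random.randint(0, 1) for _ in range(K)] for _ in range(T)]

def run_policy(transition):
    """Total reward of a stateful reference policy whose internal state
    is also the action it plays; the state evolves with the reward of
    its own action -- which a bandit learner playing a different action
    in that round never observes."""
    state, total = 0, 0
    for t in range(T):
        r = rewards[t][state]
        total += r
        state = transition(state, r)
    return total

# Two hypothetical reference policies.
win_stay_lose_shift = lambda s, r: s if r == 1 else (s + 1) % K
always_first        = lambda s, r: 0

best_policy_total = max(run_policy(win_stay_lose_shift),
                        run_policy(always_first))

# A bandit learner: it sees only rewards[t][a] for the action a it plays,
# so it cannot attribute the round's rewards to the reference policies.
learner_total = 0
for t in range(T):
    a = random.randrange(K)  # naive uniform play, for illustration
    learner_total += rewards[t][a]

regret = best_policy_total - learner_total
```

In this sketch the regret is measured against the best stateful policy in hindsight; the paper's contribution is an algorithm that keeps this quantity sublinear in \(T\) despite losing track of the policies' internal states.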
Recommendations
- State-policy dynamics in evolutionary games
- Foundations of Software Science and Computation Structures
- Extreme state aggregation beyond MDPs
- On the Complexity of Reasoning About Dynamic Policies
- A state space distribution policy based on abstract interpretation
- State observation accuracy and finite-memory policy performance
- Learning with policy prediction in continuous state-action multi-agent decision processes
Cites work
- A decision-theoretic generalization of on-line learning and an application to boosting
- Algorithmic Game Theory
- Bandits with switching costs: \(T^{2/3}\) regret
- Combining expert advice in reactive environments
- Differential privacy under continual observation
- High-confidence predictions under adversarial uncertainty
- How to use expert advice
- Markov chains and mixing times. With a chapter on ``Coupling from the past'' by James G. Propp and David B. Wilson
- On sequential strategies for loss functions with memory
- Online Markov Decision Processes Under Bandit Feedback
- Online Markov decision processes
- Prediction, Learning, and Games
- The Nonstochastic Multiarmed Bandit Problem
- Universal prediction
- Universal prediction of individual sequences
- Why are images smooth?
Cited in (3)
This page was built for publication: Chasing Ghosts: Competing with Stateful Policies