Doubly robust policy evaluation and optimization
From MaRDI portal
Abstract: We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of the observed context and the action chosen by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation, and Internet advertising. A central task is the evaluation of a new policy given historical data consisting of contexts, actions, and received rewards. The key challenge is that the past data typically do not faithfully represent the proportions of actions a new policy would take. Previous approaches rely either on models of rewards or on models of the past policy: the former suffer from large bias, whereas the latter suffer from large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of the past policy. Extensive empirical comparison demonstrates that doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
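The estimator described in the abstract combines a reward model (the direct method) with an inverse-propensity correction on the logged data. A minimal sketch in plain Python, assuming a discrete action space; the function and argument names (`reward_model`, `target_policy`, etc.) are illustrative placeholders, not notation from the paper:

```python
def doubly_robust_value(contexts, actions, rewards, propensities,
                        reward_model, target_policy):
    """Doubly robust off-policy value estimate for a contextual bandit.

    contexts:      logged contexts x_i
    actions:       logged actions a_i
    rewards:       observed rewards r_i
    propensities:  logging probabilities p(a_i | x_i)
    reward_model:  callable (context, action) -> estimated reward
    target_policy: callable (context) -> action of the policy to evaluate
    """
    n = len(rewards)
    total = 0.0
    for i in range(n):
        pi_a = target_policy(contexts[i])
        # Direct-method term: the model's predicted reward under the
        # action the target policy would choose.
        dm = reward_model(contexts[i], pi_a)
        # Importance-weighted correction, nonzero only when the logged
        # action matches the target policy's action; it removes the
        # reward model's bias on matching samples.
        correction = 0.0
        if pi_a == actions[i]:
            correction = (rewards[i]
                          - reward_model(contexts[i], actions[i])) / propensities[i]
        total += dm + correction
    return total / n
```

As the abstract notes, the estimate stays accurate if either ingredient is good: with a perfect `reward_model` the correction term averages to zero, and with exact `propensities` and a zero reward model the estimator reduces to plain inverse propensity scoring.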
Recommendations
Cites work
- scientific article (zbMATH DE number 6253908; title unavailable)
- A Generalization of Sampling Without Replacement From a Finite Universe
- A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect
- A robust method for estimating optimal treatment regimes
- Counterfactual reasoning and learning systems: the example of computational advertising
- Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data
- Doubly robust policy evaluation and optimization
- Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
- Improving predictive inference under covariate shift by weighting the log-likelihood function
- Marginal Mean Models for Dynamic Regimes
- On model selection and model misspecification in causal inference
- On tail probabilities for martingales
- Optimal Dynamic Treatment Regimes
- Pattern-recognizing stochastic learning automata
- Semiparametric Efficiency in Multivariate Regression Models with Missing Data
- Semiparametric regression estimation in the presence of dependent censoring
- Some aspects of the sequential design of experiments
- Some results on generalized difference estimation and generalized regression estimation for finite populations
- The Nonstochastic Multiarmed Bandit Problem
Cited in (24)
- scientific article (zbMATH DE number 7415076; title unavailable)
- Constructing effective personalized policies using counterfactual inference from biased data sets with many features
- Offline Multi-Action Policy Learning: Generalization and Optimization
- A Single-Index Model With a Surface-Link for Optimizing Individualized Dose Rules
- Importance sampling in reinforcement learning with an estimated behavior policy
- scientific article (zbMATH DE number 7306868; title unavailable)
- Debiasing in-sample policy performance for small-data, large-scale optimization
- PAC-Bayesian lifelong learning for multi-armed bandits
- Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
- Nonparametric causal effects based on incremental propensity score interventions
- Batch policy learning in average reward Markov decision processes
- Toward theoretical understandings of robust Markov decision processes: sample complexity and asymptotics
- Augmented direct learning for conditional average treatment effect estimation with double robustness
- Statistical inference for online decision making: in a contextual bandit setting
- Doubly robust policy evaluation and optimization
- Learning when-to-treat policies
- More efficient policy learning via optimal retargeting
- Partially observable environment estimation with uplift inference for reinforcement learning based recommendation
- A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets
- Optimal policy trees
- Doubly Robust Crowdsourcing
- Selecting and ranking individualized treatment rules with unmeasured confounding
- Constrained Bayesian optimization with noisy experiments
- Bayesian optimization for policy search via online-offline experimentation