Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
Publication: 5060503
DOI: 10.1287/OPRE.2021.2249
OpenAlex: W2994709386
MaRDI QID: Q5060503
Authors: Nathan Kallus, Masatoshi Uehara
Publication date: 10 January 2023
Published in: Operations Research
Full work available at URL: https://arxiv.org/abs/1909.05850
Recommendations
- Double reinforcement learning for efficient off-policy evaluation in Markov decision processes
- Doubly robust policy evaluation and optimization
- scientific article; zbMATH DE number 1753153
- Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning
- Policy learning for time-bounded reachability in continuous-time Markov decision processes via doubly-stochastic gradient ascent
- An emphatic approach to the problem of off-policy temporal-difference learning
- scientific article; zbMATH DE number 1753152
- Off-policy linear temporal difference learning algorithms with a generalized oblique projection
- Reinforcement learning in sparse-reward environments with hindsight policy gradients
Cites Work
- Introduction to empirical processes and semiparametric inference
- Asymptotic Statistics
- Markov Chains and Stochastic Stability
- Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models
- Double/debiased machine learning for treatment and structural parameters
- Efficient estimation of panel data models with sequential moment restrictions
- Semiparametric theory and missing data
- Semiparametric efficiency bounds
- Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
- Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score
- Title not available
- Title not available
- On the Markov chain central limit theorem
- Basic properties of strong mixing conditions. A survey and some open questions
- Irregular identification, support conditions, and inverse weight estimation
- Marginal Mean Models for Dynamic Regimes
- Comment: Understanding OR, PS and DR
- Sieve Extremum Estimates for Weakly Dependent Data
- Optimal Dynamic Treatment Regimes
- Doubly robust policy evaluation and optimization
- Least squares policy evaluation algorithms with linear function approximation
- DOI: 10.1162/1532443041827907
- Reinforcement learning. An introduction
- Dynamic programming and optimal control. Vol. 2
- Generalized TD learning
- Consistent estimation of the influence function of locally asymptotically linear estimators
- Least squares temporal difference methods: An analysis under general conditions
- Estimating dynamic treatment regimes in mobile health using V-learning
- Title not available
- Characterization of parameters with a mixed bias property
Cited In (9)
- Predicting and optimizing marketing performance in dynamic markets
- A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets
- Reliable off-policy evaluation for reinforcement learning
- Proximal reinforcement learning: efficient off-policy evaluation in partially observed Markov decision processes
- Off-policy evaluation in partially observed Markov decision processes under sequential ignorability
- Projected state-action balancing weights for offline reinforcement learning
- Online Bootstrap Inference For Policy Evaluation In Reinforcement Learning
- Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
- Deep spectral Q-learning with application to mobile health