Off-Policy Evaluation¶
Many real-world recommender systems need to be highly scalable: matching millions of items with billions of users, with millisecond latency. This scalability requirement has led to the widely used two-stage recommender system design, consisting of efficient candidate generation model(s) in the first stage and a more powerful ranking model in the second stage.
Logged user feedback, e.g., user clicks or dwell time, is often used to build both candidate generation and ranking models for recommender systems. While it is easy to collect large amounts of such data, the data are inherently biased because feedback can only be observed on items recommended by the previous systems. Recently, off-policy correction of such biases has attracted increasing interest in recommender system research.
In machine learning jargon, decision making systems are called “policies”. A policy simply takes in some context (e.g. time of day) and outputs a decision (e.g. send a push notification). A perfectly data-driven company would measure the impact of any and every change to a policy.
Decision-making algorithms (e.g. contextual bandit algorithms) are designed to interact with the real world in an automated way. However, care must be taken in how these algorithms are deployed, because their automated nature means that errors can lead to runaway algorithms, with adverse consequences. We would like to have the ability to evaluate how an algorithm will perform, without the risk of deploying a poorly performing algorithm to a production system. The ability to rapidly iterate such a process would also help optimise the algorithm, as different parameters could be tuned quickly without the risk of harm to the production system.
While the most accurate demonstration of how an automated algorithm will perform is to actually deploy it, there are often circumstances where this is impractical. An alternative approach is to create a simulation of the environment that the algorithm will interact with. However, these environments are typically highly complex, and creating unbiased simulations of them is not feasible.
In contextual bandits, a learning algorithm repeatedly observes a context, takes an action, and observes a reward for the chosen action. An example is content personalization: the context describes a user, actions are candidate stories, and the reward measures how much the user liked the recommended story. In essence, the algorithm is a policy that picks the best action given a context.
Given different policies, the metric of interest is their reward. One way to measure the reward is to deploy such a policy online and let it choose actions (for example, recommend stories to users). However, such online evaluation can be costly for two reasons: it exposes users to an untested, experimental policy, and it doesn't scale to evaluating multiple target policies.
The alternative is off-policy evaluation: Given data logs collected by using a logging policy, off-policy evaluation can estimate the expected rewards for different target policies and provide confidence intervals around such estimates.
Instead of relying on simulation, a simpler approach is to apply the algorithm to historical data, which is a technique called off-policy evaluation.
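As a concrete, deliberately simplified illustration, the sketch below builds a synthetic log in NumPy and computes a plain inverse-propensity-scoring (IPS) estimate of a target policy's value, with a bootstrap confidence interval; the data arrays and the `target_policy` function are hypothetical stand-ins, not part of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data: context features, the action the logging
# policy took, the observed reward, and the probability (propensity) with
# which the logging policy chose that action.
n, n_actions = 10_000, 5
contexts = rng.normal(size=(n, 3))
logged_actions = rng.integers(0, n_actions, size=n)
logging_propensities = np.full(n, 1.0 / n_actions)            # uniform logging policy
rewards = rng.binomial(1, 0.1 + 0.1 * (logged_actions == 0))  # action 0 is slightly better

def target_policy(context: np.ndarray) -> int:
    """A hypothetical deterministic target policy we want to evaluate offline."""
    return 0 if context[0] > 0 else 1

target_actions = np.array([target_policy(x) for x in contexts])

# Inverse propensity scoring: keep rewards only where the target policy agrees
# with the logged action, and reweight them by the logging propensity.
ips_terms = (target_actions == logged_actions) / logging_propensities * rewards
ips_estimate = ips_terms.mean()

# Bootstrap confidence interval around the IPS estimate.
boot = [rng.choice(ips_terms, size=n, replace=True).mean() for _ in range(1000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"IPS estimate: {ips_estimate:.3f}  (95% bootstrap CI: [{low:.3f}, {high:.3f}])")
```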
You’re probably wasting time, resources, and revenue running unnecessary A/B tests. Offline policy evaluation can predict how changes to your production systems will affect metrics and help you A/B test only the most promising changes. Just like A/B tests became standard practice in the 2010s, offline policy evaluation (OPE) is going to become standard practice in the 2020s as part of every experimentation stack.
Offline policy evaluation (OPE) is an active area of research in reinforcement learning. The aim, in a contextual bandit setting, is to take bandit data generated by some policy (let’s call it the production policy) and estimate the value of a new candidate policy offline. The use case is clear: before you deploy a policy, you want to estimate its performance, and compare to what’s already deployed.
The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although standard OPE and OPL assume the same covariate distribution in the historical and evaluation data, a covariate shift often exists in practice, i.e., the covariate distribution of the historical data differs from that of the evaluation data.
Some domains need OPE more than others as the risk level of deploying a bad policy varies by problem space. While sending some annoying push notifications might not be too big of a deal, deploying a bad fraud policy or loan approval policy can lead to severely negative outcomes like financial loss or unfair bias.
The earliest OPE methods rely on classical importance sampling to handle the distribution mismatch between the target and behavior policies (Precup et al., 2000). Many advanced OPE methods have since been proposed for both contextual bandits and reinforcement learning settings.

The setting is of particular interest in recommendation systems (e.g., movie recommendations at Netflix), in which the logged actions are recommended items (e.g., movies) and the logged rewards capture a metric of interest (e.g., watch time). Off-policy evaluation allows testing a much larger number of candidate policies than would be possible by online A/B testing.
Off-policy evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems.
In supervised learning settings, the standard approach to offline evaluation is to train on a train set and estimate generalisation performance on a holdout set. In online learning settings, one typically uses progressive validation. In contextual bandit settings, neither is directly possible because, as in all reinforcement learning, there is a partial information problem: you never get to see the rewards of actions you didn't take. Your only source of information is the bandit data generated by your production policy, which might make entirely different choices than your candidate policy.
It doesn’t, then, seem possible to reliably evaluate contextual bandit policies offline. But it is! The key is to use estimators that fill in fake rewards for actions that weren’t taken, thereby creating a “fake” supervised learning dataset, against which you can estimate performance, either using progressive validation (more on that later) or a holdout set.
The VW library implements several estimators that reduce policy evaluation to supervised learning-style evaluation. The simplest, the direct method (DM), trains a regression model that estimates the cost (negative reward) of an (action, context) pair. As you might suspect, this method is generally biased, because the partial information problem means you typically see many more rewards for good actions than for bad ones (assuming your production policy is working normally). Biased estimators should not be used for offline policy evaluation, but VW also implements provably unbiased estimators, such as inverse propensity scoring (IPS) and doubly robust (DR), that can be used for this purpose.
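The snippet below is a minimal, library-agnostic sketch of the three estimators on a hypothetical log (NumPy plus scikit-learn); it illustrates the idea of filling in estimated rewards and reweighting by propensities, not VW's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical logged bandit data: contexts, logged actions, observed rewards,
# and the propensity with which the logging policy chose each logged action.
n, n_actions, d = 5_000, 4, 3
X = rng.normal(size=(n, d))
a_log = rng.integers(0, n_actions, size=n)
p_log = np.full(n, 1.0 / n_actions)  # uniform-random logging policy
r = rng.binomial(1, 0.2 + 0.1 * (a_log == (X[:, 0] > 0).astype(int)))

# Hypothetical deterministic target (candidate) policy we want to evaluate.
a_tgt = (X[:, 1] > 0).astype(int)

def make_features(X: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Concatenate context features with a one-hot encoding of the action."""
    return np.hstack([X, np.eye(n_actions)[a]])

# Direct method (DM): fit a reward model on logged (context, action) pairs,
# then score the actions the target policy would have taken. Biased whenever
# the reward model is wrong.
reward_model = LogisticRegression(max_iter=1000).fit(make_features(X, a_log), r)
dm = reward_model.predict_proba(make_features(X, a_tgt))[:, 1].mean()

# Inverse propensity scoring (IPS): unbiased given correct propensities,
# but can have high variance.
match = (a_tgt == a_log).astype(float)
ips = (match / p_log * r).mean()

# Doubly robust (DR): DM baseline plus an IPS-style correction on the residual.
q_tgt = reward_model.predict_proba(make_features(X, a_tgt))[:, 1]
q_log = reward_model.predict_proba(make_features(X, a_log))[:, 1]
dr = (q_tgt + match / p_log * (r - q_log)).mean()

print(f"DM: {dm:.3f}  IPS: {ips:.3f}  DR: {dr:.3f}")
```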
Finally, before we get into how to run offline policy evaluation in VW, note that in this tutorial, by policies we mean contextual bandit models, not the exploration layer (e.g. epsilon-greedy) that is usually part of a contextual bandit system to tackle the explore-exploit tradeoff.
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance.
We study decision making in environments where we receive feedback only for chosen actions. For example, in Internet advertising, we find only whether a user clicked on some of the presented ads, but receive no information about the ads that were not presented. In health care, we only find out success rates for patients who received the treatments, but not for the alternatives. Both of these problems are instances of contextual bandits (Auer et al., 2002; Langford & Zhang, 2008). The context refers to additional information about the user or patient. Here, we focus on the offline version: we assume access to historic data, but no ability to gather new data (Langford et al., 2008; Strehl et al., 2011).
Two basic kinds of approaches address offline learning in contextual bandits. The first, which we call the direct method (DM), estimates the reward function from given data and uses this estimate in place of actual reward to evaluate the policy value on a set of contexts. The second kind, called inverse propensity score (IPS) (Horvitz & Thompson, 1952), uses importance weighting to correct for the incorrect proportions of actions in the historic data. The first approach requires an accurate model of rewards, whereas the second approach requires an accurate model of the past policy. In general, it might be difficult to accurately model rewards, so the first assumption can be too restrictive. On the other hand, it is usually possible to model the past policy quite well. However, the second kind of approach often suffers from large variance especially when the past policy differs significantly from the policy being evaluated.
Doubly robust (DR) estimation is a statistical approach for estimation from incomplete data with an important property: if either of the two underlying models (the reward model used by DM or the past-policy model used by IPS) is correct, then the estimate is unbiased. This method thus increases the chances of drawing reliable inference.
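As a sketch of the estimator's common form (using the notation of the problem definition below, with \(\hat{r}(x,a)\) the reward model underlying DM and \(\hat{p}(a \mid x)\) the estimated past-policy propensity underlying IPS), the doubly robust value estimate of a deterministic target policy \(\pi\) is

\[ \hat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \left[ \hat{r}\big(x_t, \pi(x_t)\big) + \frac{\mathbb{1}\{a_t = \pi(x_t)\}}{\hat{p}(a_t \mid x_t)} \big( r_{t,a_t} - \hat{r}(x_t, a_t) \big) \right]. \]

The first term is the DM prediction under the target policy; the second is an IPS-style correction applied on the rounds where the logged action matches the target policy's choice, and it vanishes in expectation whenever either \(\hat{r}\) or \(\hat{p}\) is correct.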
For this discussion, the decision-making algorithms we are primarily interested in are contextual bandit algorithms. Here, the task is to sequentially observe a context and choose an appropriate action, with the goal of maximising some reward. The context refers to the input data available to the algorithm to make a decision, and action refers to an option chosen by the algorithm (based on the input context). The reward is the measure by which the algorithm is evaluated.
For example, the algorithm’s task on a news website might be to observe user demographic and behavioural attributes (context) and recommend news articles (actions) with the goal of maximising the click-through rate (reward).
For decision-making algorithms, a policy is a mapping from the algorithm's past observations and current context to an action recommendation. Note that policies can be deterministic (the same context receives the same action every time) or probabilistic (the same context only has some probability of receiving the same action every time). The goal of policy evaluation is to estimate the expected total reward of a given policy.
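A toy sketch of that distinction (hypothetical functions, not tied to any particular library):

```python
import numpy as np

def deterministic_policy(context: np.ndarray) -> int:
    # The same context always maps to the same action.
    return int(context[0] > 0)

def probabilistic_policy(context: np.ndarray, rng: np.random.Generator) -> int:
    # The same context only yields a given action with some probability.
    p_action_1 = 1.0 / (1.0 + np.exp(-context[0]))  # logistic score for action 1
    return int(rng.random() < p_action_1)

rng = np.random.default_rng(0)
ctx = np.array([0.3, -1.2])
print(deterministic_policy(ctx))                            # always the same action
print([probabilistic_policy(ctx, rng) for _ in range(5)])   # can vary from draw to draw
```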
Suppose we have a stream of instances (with context data associated with each instance) and policy 1 has been deployed on this stream. This means that it recommends actions for each instance (based on the associated context data), and there is a log of the corresponding rewards for each instance. Calculating the total reward in this situation would be called "on-policy evaluation". Now suppose we apply a different policy 2 to this stream of instances (with associated context data). Since it is a different policy, it may or may not recommend the same action for instances with the same context data. We are effectively creating a synthetic dataset, representing the counterfactual log we would have collected if policy 2 had been deployed. Estimating the total reward of this synthetic dataset is called "off-policy evaluation", as illustrated below.
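One simple way to build such a counterfactual log is the "replay" idea sketched below; the data and the `policy_2` function are hypothetical, and the simple averaging at the end is only an unbiased estimate when, as assumed here, policy 1 logged its actions uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical log produced by policy 1: context, chosen action, observed reward.
n, n_actions = 20_000, 3
contexts = rng.normal(size=(n, 2))
actions_p1 = rng.integers(0, n_actions, size=n)               # policy 1 acted uniformly at random
rewards_p1 = rng.binomial(1, 0.1 + 0.2 * (actions_p1 == 0))   # action 0 pays off more often

def policy_2(context: np.ndarray) -> int:
    """The new policy we would like to evaluate offline."""
    return 0 if context.sum() > 0 else 2

# Replay: keep only the logged rounds where policy 2 would have chosen the same
# action policy 1 actually took; on those rounds the logged reward is exactly
# the reward policy 2 would have received.
actions_p2 = np.array([policy_2(x) for x in contexts])
matched = actions_p2 == actions_p1
estimated_value_p2 = rewards_p1[matched].mean()
print(f"Matched {matched.sum()} of {n} rounds; estimated per-round reward: {estimated_value_p2:.3f}")
```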

Practitioners can improve their automated decision-making systems using the online/batch bandit policies implemented in the policy module. Moreover, they can easily evaluate such bandit policies using historical logged bandit feedback data and OPE, without A/B testing. Specifically, one can implement OPE of batch bandit algorithms with the standard OPE procedure.
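With the Open Bandit Pipeline (OBP), whose policy and ope modules the paragraph above refers to, the procedure looks roughly like the snippet below, adapted from OBP's quickstart; exact class names, arguments, and defaults may differ across OBP versions, so treat it as a sketch rather than copy-paste code:

```python
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Logged bandit feedback collected by a behavior (logging) policy.
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) An evaluation (target) policy and its action distribution over the logged rounds.
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_rounds=bandit_feedback["n_rounds"],
)

# (3) Estimate the evaluation policy's value from the logged data with an OPE estimator.
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_policy_value)
```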
OPE on real world problems can get hard fast¶
As with any good data science problem, OPE can get much more challenging to do well as you apply it to more complicated problems. Real-world problems may have some — or god help us, all — of the elements below:
Huge action spaces: Many recommender systems have tons of content that could be recommended at any given time, meaning the action space could be really large (not just “send” or “don’t send”). In these cases actions are typically described by a set of features, just as the context is. Figuring out how to score actions, which actions to score, and how many actions to score in this case is another “fun” research problem.
Slate ranking: Often in the case of ranking problems you don't just pick the single best piece of content. Usually, you are picking a "slate" of items (e.g. the top "N"). Facebook newsfeed, Netflix, Reddit, and YouTube all work like this. The OPE methods above work well when you are trying to evaluate the effect of picking 1 action, but what happens when you need to pick the best slate of "N" actions? Luckily, enough crazy researchers work on this stuff, and Swaminathan, Langford, and colleagues come to the rescue with Off-policy evaluation for slate recommendation 🙏.
Sequential problems: Above we focused on non-sequential decision making problems, typically solved with contextual bandits or rules-based systems. That is, we ignored optimizing over a “sequence” of decisions as you would do in a reinforcement learning setup. OPE methods must be adjusted when policies are optimizing over a sequence of actions.
Intuition¶
Suppose we have a policy P1 in production and the following data was logged:
All labels (ground truth): 0️⃣1️⃣1️⃣0️⃣1️⃣0️⃣0️⃣1️⃣1️⃣0️⃣. These rewards are based on the actions taken by P1.
Now, if we want to train a new policy, we will split this logged data into train and test as follows:
0️⃣1️⃣1️⃣0️⃣1️⃣0️⃣0️⃣ 1️⃣1️⃣0️⃣. The new policy will use the first 7 labels as training labels and try to minimize the loss when predicting the last 3 labels. Whatever the performance, though, it is measured on the historical dataset, and the environment may have changed since that data was collected. So how do we know whether our policy will be as effective as it was on the historical data? The best option at the moment is to run A/B and MAB tests.
But these are costly and time-consuming, so techniques have been devised to reduce the time and cost: they heuristically provide insight into whether a policy is worth taking to an A/B or MAB test, or whether there is still room for improvement in the policy design. This new and emerging family of techniques is known as off-policy evaluation (OPE), a.k.a. counterfactual policy evaluation (CPE).

Problem definition¶
Let \(\mathcal{X}\) be an input space and \(\mathcal{A} = \{1,...,k\}\) a finite action space. A contextual bandit problem is specified by a distribution \(\mathcal{D}\) over pairs \((x,\vec{r})\) where \(x \in \mathcal{X}\) is the context and \(\vec{r} \in [0,1]^\mathcal{A}\) is a vector of rewards. The input data has been generated using some unknown logging policy \(p\) (possibly adaptive and randomized) as follows: in each round \(t = 1, \ldots, n\), a pair \((x_t, \vec{r}_t)\) is drawn i.i.d. from \(\mathcal{D}\); the logging policy chooses an action \(a_t \in \mathcal{A}\) based on \(x_t\) (and possibly the history of previous rounds); and only the reward of the chosen action, \(r_{t,a_t}\), is revealed and recorded in the log.

Note that neither the distribution \(\mathcal{D}\) nor the logging policy \(p\) is known. Given a data set \(S\) collected as above, we are interested in two tasks: policy evaluation and policy optimization. In policy evaluation, we are interested in estimating the value of a stationary policy \(\pi\). On the other hand, the goal of policy optimization is to find an optimal policy with maximum value.
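For concreteness, the value of a (deterministic) stationary policy \(\pi\) is typically defined as the expected reward of the action it picks,

\[ V(\pi) = \mathbb{E}_{(x,\vec{r}) \sim \mathcal{D}}\big[ r_{\pi(x)} \big], \]

and off-policy evaluation seeks an estimate \(\hat{V}(\pi)\) of this quantity from the logged data set \(S\).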
Evaluating Online Metrics Offline¶
Online: On-policy A/B Test

Offline: Off-policy Counterfactual Estimates

Tutorials¶
- Offline Policy Evaluation with VW Command Line
- OBP Library Workshop Tutorials
- Evaluating a New Fraud Policy with DM, IPW, and DR Methods
- Evaluating Standard Off-Policy Estimators with Small Sample Open-Bandit Dataset
- Evaluation of Multiple Off-Policy Estimators on Synthetic Dataset
- Off-Policy Learning in Two-stage Recommender Systems
- Batch Learning from Bandit Feedback
- Evaluating the Robustness of Off-Policy Evaluation
- Adaptive Estimator Selection for Off-Policy Evaluation
- Optimal Off-Policy Evaluation from Multiple Logging Policies
References¶
Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. Nathan Kallus and Masatoshi Uehara. 2019. arXiv. CausalML (2019) Intrinsically Efficient Stable OPE [Source code].
Representation Balancing MDPs for Off-Policy Policy Evaluation. Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. 2019. arXiv. StanfordAI4HI (2018) Representation Balancing MDPs for Off-Policy Policy Evaluation [Source code].
Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. Masahiro Kato, Masatoshi Uehara, and Shota Yasui. 2020. arXiv. MasaKat0 (2021) Off-policy Evaluation and Learning for External Validity under a Covariate Shift [Source code].
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning. Cameron Voloshin, Hoang M. Le, Nan Jiang, and Yisong Yue. 2020. arXiv. clvoloshin (2021) Caltech OPE Benchmarking Suite (COBS) [Source code].
Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling. Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. 2020. arXiv. https://tengyangxie.github.io/resources/MIS_poster_2019 [Poster]
More Robust Doubly Robust Off-policy Evaluation. Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. arXiv.
On the Design of Estimators for Bandit Off-Policy Evaluation. Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, and Tony Jebara. 2019. MLR.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. 2017. arXiv.
A Practical Guide of Off-Policy Evaluation for Bandit Problems. Masahiro Kato, Kenshi Abe, Kaito Ariu, and Shota Yasui. 2020. arXiv.
Reliable Off-policy Evaluation for Reinforcement Learning. Jie Wang, Rui Gao, and Hongyuan Zha. 2020. arXiv.
Benchmarks for Deep Off-policy Evaluation. Fu et al. 2021. ICLR. Google Research (2021) Benchmarks for Deep Off-Policy Evaluation [Source code]
Off-Policy Evaluation via Off-Policy Classification. Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, and Sergey Levine. 2019. NeurIPS.
Toward Minimax Off-policy Value Estimation. Lihong Li, Rémi Munos, and Csaba Szepesvári. 2015. AISTATS. http://proceedings.mlr.press/v38/li15b.pdf
Off-Policy Estimation for Infinite-Horizon Reinforcement Learning. Ali Mousavi, Lihong Li, Qiang Liu, and Dengyong Zhou. 2020. ICLR. Google AI Blog (2020) Off-Policy Estimation for Infinite-Horizon Reinforcement Learning [Blog].
Off-policy Learning in Two-stage Recommender Systems. 2020. WWW. https://dl.acm.org/doi/abs/10.1145/3366423.3380130
Universal Off-Policy Evaluation. Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. 2021. arXiv.
Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting. Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. 2020. arXiv.
Approaching OPE as a regression problem using meta-learning. QasimWani (2021) Approaching OPE as a regression problem using meta-learning [Source code]
Implementations and examples of common offline policy evaluation methods in Python. Bandit ML (2021) Implementations and examples of common offline policy evaluation methods in Python [Source code]
Average-Reward Off-Policy Policy Evaluation with Function Approximation. Shangtong Zhang, Yi Wan, Richard S. Sutton, and Shimon Whiteson. 2021. arXiv. Shangtong Zhang (2021) Modularized Implementation of Deep RL Algorithms in PyTorch [Source code]
Off-policy evaluation for slate recommendation. Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. 2017. NeurIPS. adith387 (2017) Semi-synthetic experiments to test several approaches for off-policy evaluation and optimization of slate recommenders [Source code].
Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes. Nathan Kallus, and Masatoshi Uehara. 2019. arXiv. CausalML (2019) Double Reinforcement Learning MDP [Source code].
State Relevance for Off-Policy Evaluation. Simon P. Shen, Yecheng Jason Ma, Omer Gottesman, and Finale Doshi-Velez. 2021. arXiv. dtak (2021) OSIRIS [Source code].
Counterfactual Reasoning and Learning Systems. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson. 2012. arXiv.
Off-policy policy gradient with state distribution correction. Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill. 2019. arXiv.
https://www.cs.cornell.edu/courses/cs7792/2020sp/lectures/01-intro.pdf
https://www.cs.cornell.edu/courses/cs7792/2020sp/lectures/02-causalml.pdf
https://www.cs.cornell.edu/courses/cs7792/2020sp/lectures/03-unbiasedLTR.pdf
https://www.cs.cornell.edu/courses/cs7792/2016fa/lectures/01-intro_6up.pdf
https://www.cs.cornell.edu/courses/cs7792/2016fa/lectures/02-dueling_coactive_6up.pdf
https://www.cs.cornell.edu/courses/cs7792/2016fa/lectures/03-counterfactualmodel_6up.pdf
https://www.cs.cornell.edu/courses/cs7792/2016fa/lectures/06-blbf_poem_6up.pdf
http://www.cs.cornell.edu/courses/cs7792/2018fa/lectures/01-intro.pdf
http://www.cs.cornell.edu/courses/cs7792/2018fa/lectures/02-counterfactualmodel.pdf
http://www.cs.cornell.edu/courses/cs7792/2018fa/lectures/04-blbf.pdf
http://www.cs.cornell.edu/courses/cs7792/2018fa/lectures/05-unbiasedLTR.pdf