Online vs Offline Evaluation¶
If we deploy the evaluation policy \(\pi_e\) on a real system for a while (say, a week), we can use the resulting log data \(\mathcal{D}_e\) to obtain an online (A/B test) estimate of its performance as the average observed reward:
\[\hat{V}_{A/B} (\pi_e;\mathcal{D}_e) := \dfrac{1}{n}\sum_{i=1}^nr_i\]
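Computationally, this is just the empirical mean of the logged rewards. A minimal sketch, assuming the rewards observed while \(\pi_e\) was deployed are available as a NumPy array (the name `rewards` is illustrative):

```python
import numpy as np

def estimate_value_ab(rewards: np.ndarray) -> float:
    """Online (A/B test) value estimate: the average reward
    observed while the evaluation policy pi_e was deployed."""
    return float(np.mean(rewards))

# toy binary rewards logged under pi_e
rewards = np.array([0, 1, 0, 0, 1, 1, 0, 1])
print(estimate_value_ab(rewards))  # 0.5
```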
In off-policy evaluation (OPE), by contrast, the goal is to evaluate the policy \(\pi_e\) using only data collected by a different policy \(\pi_0\):
\[V_{OPE} (\pi_e) \approx \hat{V} (\pi_e;\mathcal{D}_0)\]
where:

- \(\hat{V}\) is the off-policy estimator,
- \(\pi_e\) is the new evaluation policy, and
- \(\mathcal{D}_0\) is the logged data collected with the logging policy \(\pi_0\).
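As one concrete example of such an estimator \(\hat{V}\), the sketch below uses Inverse Propensity Scoring (IPS). It assumes \(\mathcal{D}_0\) provides, for each logged round, the observed reward \(r_i\), the logging-policy propensity \(\pi_0(a_i \mid x_i)\), and the evaluation-policy probability \(\pi_e(a_i \mid x_i)\) of the logged action; the variable names are illustrative, not part of any particular library.

```python
import numpy as np

def estimate_value_ips(
    rewards: np.ndarray,            # observed rewards r_i
    pscore: np.ndarray,             # logging-policy propensities pi_0(a_i | x_i)
    evaluation_probs: np.ndarray,   # evaluation-policy probabilities pi_e(a_i | x_i)
) -> float:
    """IPS estimate of V(pi_e) from data logged under pi_0:
    (1/n) * sum_i [pi_e(a_i|x_i) / pi_0(a_i|x_i)] * r_i
    """
    importance_weights = evaluation_probs / pscore
    return float(np.mean(importance_weights * rewards))

# toy logged data collected under pi_0
rewards = np.array([1.0, 0.0, 1.0, 0.0])
pscore = np.array([0.5, 0.5, 0.25, 0.25])
evaluation_probs = np.array([0.25, 0.25, 0.5, 0.5])
print(estimate_value_ips(rewards, pscore, evaluation_probs))  # 0.625
```

The importance weights reweight each logged reward by how much more (or less) likely \(\pi_e\) is to take the logged action than \(\pi_0\) was, which is what lets data from \(\pi_0\) stand in for data from \(\pi_e\).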
Stages¶

