Online vs Offline Evaluation¶
If we deploy the evaluation policy \(\pi_e\) on a real system for a while (say, a week), we can use the resulting log data \(\mathcal{D}_e\) to obtain an online (A/B test) estimate of its performance as the average observed reward:
\[\hat{V}_{A/B} (\pi_e;\mathcal{D}_e) := \dfrac{1}{n}\sum_{i=1}^nr_i\]
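Computationally, this is just the empirical mean of the logged rewards. A minimal sketch, assuming the rewards observed while \(\pi_e\) was deployed are available as a NumPy array (the name `rewards` is illustrative):

```python
import numpy as np

def estimate_value_ab(rewards: np.ndarray) -> float:
    """Online (A/B test) value estimate: the average reward
    observed while the evaluation policy pi_e was deployed."""
    return float(np.mean(rewards))

# toy binary rewards logged under pi_e
rewards = np.array([0, 1, 0, 0, 1, 1, 0, 1])
print(estimate_value_ab(rewards))  # 0.5
```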
In off-policy evaluation (OPE), by contrast, the goal is to evaluate the policy \(\pi_e\) using only data collected by a different policy \(\pi_0\):
\[V_{OPE} (\pi_e) \approx \hat{V} (\pi_e;\mathcal{D}_0)\]
where:

- \(\hat{V}\) is the off-policy estimator,
- \(\pi_e\) is the new evaluation policy, and
- \(\mathcal{D}_0\) is the logged data collected with the logging policy \(\pi_0\).
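As one concrete example of such an estimator \(\hat{V}\), the sketch below uses Inverse Propensity Scoring (IPS). It assumes \(\mathcal{D}_0\) provides, for each logged round, the observed reward \(r_i\), the logging-policy propensity \(\pi_0(a_i \mid x_i)\), and the evaluation-policy probability \(\pi_e(a_i \mid x_i)\) of the logged action; the variable names are illustrative, not part of any particular library.

```python
import numpy as np

def estimate_value_ips(
    rewards: np.ndarray,            # observed rewards r_i
    pscore: np.ndarray,             # logging-policy propensities pi_0(a_i | x_i)
    evaluation_probs: np.ndarray,   # evaluation-policy probabilities pi_e(a_i | x_i)
) -> float:
    """IPS estimate of V(pi_e) from data logged under pi_0:
    (1/n) * sum_i [pi_e(a_i|x_i) / pi_0(a_i|x_i)] * r_i
    """
    importance_weights = evaluation_probs / pscore
    return float(np.mean(importance_weights * rewards))

# toy logged data collected under pi_0
rewards = np.array([1.0, 0.0, 1.0, 0.0])
pscore = np.array([0.5, 0.5, 0.25, 0.25])
evaluation_probs = np.array([0.25, 0.25, 0.5, 0.5])
print(estimate_value_ips(rewards, pscore, evaluation_probs))  # 0.625
```

The importance weights reweight each logged reward by how much more (or less) likely \(\pi_e\) is to take the logged action than \(\pi_0\) was, which is what lets data from \(\pi_0\) stand in for data from \(\pi_e\).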
Stages¶

