Online vs Offline Evaluation

If we deploy the evaluation policy \(\pi_e\) on a real system for a while (say, a week), we can collect the resulting log data \(\mathcal{D}_e\) and obtain an online (A/B test) estimate of its value as the average observed reward:

\[\hat{V}_{A/B} (\pi_e;\mathcal{D}_e) := \dfrac{1}{n}\sum_{i=1}^nr_i\]
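As a minimal sketch, the online estimate is just the empirical mean of the rewards logged while \(\pi_e\) was deployed; the reward values below are illustrative.

```python
import numpy as np

def estimate_value_ab(rewards: np.ndarray) -> float:
    """Online (A/B test) value estimate: the average reward observed
    while the evaluation policy pi_e was actually deployed."""
    return float(np.mean(rewards))

# Hypothetical rewards logged during a one-week deployment of pi_e.
rewards_e = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(estimate_value_ab(rewards_e))  # 3/7 ~= 0.43
```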

In off-policy evaluation (OPE), the goal is instead to evaluate the policy \(\pi_e\) using only data collected under a different logging policy \(\pi_0\):

\[V_{OPE} (\pi_e) \approx \hat{V} (\pi_e;\mathcal{D}_0)\]

where,

  • \(\hat{V}\) is the off-policy estimator

  • \(\pi_e\) is the new evaluation policy

  • \(\mathcal{D}_0\) is the logged data collected with logging policy \(\pi_0\)
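The section does not fix a particular choice of \(\hat{V}\); the sketch below uses one common estimator, inverse propensity scoring (IPS), assuming the logged data \(\mathcal{D}_0\) records the reward \(r_i\), the logging propensity \(\pi_0(a_i \mid x_i)\), and the evaluation policy's probability \(\pi_e(a_i \mid x_i)\) for each logged action. The function and variable names are illustrative.

```python
import numpy as np

def ips_estimate(rewards: np.ndarray,
                 pscores: np.ndarray,
                 pi_e_probs: np.ndarray) -> float:
    """IPS estimate of the value of pi_e from data logged under pi_0.

    rewards    : observed rewards r_i
    pscores    : pi_0(a_i | x_i), propensity of the logged action under pi_0
    pi_e_probs : pi_e(a_i | x_i), probability pi_e assigns to the same action
    """
    importance_weights = pi_e_probs / pscores
    return float(np.mean(importance_weights * rewards))

# Hypothetical logged data D_0 (three rounds) for illustration.
rewards = np.array([1.0, 0.0, 1.0])
pscores = np.array([0.5, 0.25, 0.5])     # pi_0's propensities
pi_e_probs = np.array([0.8, 0.1, 0.4])   # pi_e's probabilities for the logged actions
print(ips_estimate(rewards, pscores, pi_e_probs))  # 0.8
```

The key idea is that reweighting each logged reward by \(\pi_e(a_i \mid x_i) / \pi_0(a_i \mid x_i)\) corrects for the distribution shift between the two policies, so the estimate approximates the value \(\pi_e\) would have achieved online.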

Flow

Policy Learning -> Offline Evaluation -> Online Evaluation

Stages