Evaluation
Online vs Offline Evaluation
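If we deploy the policy on a real system for a while (say, a week), we can collect the resulting log data and obtain an online estimate of its performance by computing the value as the average observed reward:

$$V(\pi) \approx \hat{V}_{\mathrm{online}}(\pi; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} r_i$$

Here $r_i$ is the reward observed in round $i$ of the $n$ logged rounds in $\mathcal{D}$.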
In off-policy evaluation (OPE), the goal is to evaluate a policy $\pi_e$ using data collected under a different policy $\pi_0$:

$$V(\pi_e) \approx \hat{V}(\pi_e; \mathcal{D}_0)$$

where,
- $\hat{V}$ is the off-policy estimator
- $\pi_e$ is the new evaluation policy
- $\mathcal{D}_0$ is the logged data collected with logging policy $\pi_0$
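
To make the contrast between the two estimates concrete, here is a minimal sketch on synthetic data. It assumes inverse propensity scoring (IPS) as the off-policy estimator, which is one common choice; the notes above do not name a specific estimator. The function names, the three-action context-free setup, and all probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np


def online_value(rewards):
    """Online estimate: average reward observed while the policy itself was deployed."""
    return float(np.mean(rewards))


def ips_value(rewards, eval_probs, logging_probs):
    """IPS estimate of the evaluation policy's value from logged data.

    eval_probs[i] and logging_probs[i] are the probabilities that the
    evaluation and logging policies assign to the action logged in round i.
    """
    weights = eval_probs / logging_probs  # importance weights
    return float(np.mean(weights * rewards))


# Synthetic, context-free bandit with three actions (all numbers are made up).
rng = np.random.default_rng(0)
n_rounds = 100_000
expected_reward = np.array([0.1, 0.3, 0.6])  # per-action reward probability
logging_policy = np.array([0.5, 0.3, 0.2])   # policy that produced the log
eval_policy = np.array([0.1, 0.2, 0.7])      # new policy we want to evaluate

# Logged data: actions drawn from the logging policy, with binary rewards.
actions = rng.choice(3, size=n_rounds, p=logging_policy)
rewards = rng.binomial(1, expected_reward[actions])

# Ground truth is known here only because the data is simulated.
true_value = float(eval_policy @ expected_reward)

logging_value = online_value(rewards)
ips_estimate = ips_value(rewards, eval_policy[actions], logging_policy[actions])

print(f"average reward in the log (value of the logging policy): {logging_value:.3f}")
print(f"IPS estimate for the evaluation policy: {ips_estimate:.3f}")
print(f"true value of the evaluation policy: {true_value:.3f}")
```

The plain average over the log estimates the value of the logging policy, not the evaluation policy. Reweighting each reward by the ratio of the two policies' action probabilities corrects for that mismatch, provided the logging policy gives nonzero probability to every action the evaluation policy can take.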
Flow
Policy Learning -> Offline Evaluation -> Online Evaluation

Stages
