Evaluation

Online vs Offline Evaluation

If we deploy the evaluation policy $\pi_e$ on a real system for a while (say, a week), we can collect the resulting log data $\mathcal{D}_e$ and obtain an online estimate of its performance as the average observed reward:

$$\hat{V}_{A/B}(\pi_e; \mathcal{D}_e) := \frac{1}{n}\sum_{i=1}^{n} r_i$$
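As a minimal sketch of this estimate (assuming the logged rewards are available as a NumPy array; the names and data below are hypothetical), the online value is just the sample mean of the rewards observed during deployment:

```python
import numpy as np

def online_value_estimate(rewards: np.ndarray) -> float:
    """A/B-test style online estimate: the average reward observed
    while the evaluation policy pi_e was deployed."""
    return float(np.mean(rewards))

# Hypothetical rewards logged during one week of deploying pi_e.
rewards = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
print(online_value_estimate(rewards))  # -> 0.4
```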

In off-policy evaluation (OPE), the goal is instead to estimate the value of the evaluation policy $\pi_e$ using data collected by a different policy, the logging policy $\pi_0$:

$$V_{OPE}(\pi_e) \approx \hat{V}(\pi_e; \mathcal{D}_0)$$

where,

  • $\hat{V}$ is the off-policy estimator (a sketch of one common choice follows this list)
  • $\pi_e$ is the new evaluation policy
  • $\mathcal{D}_0$ is the logged data collected with the logging policy $\pi_0$
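The estimator $\hat{V}$ can be instantiated in many ways. Below is a minimal sketch of one standard choice, inverse propensity scoring (IPS), which reweights each logged reward by $\pi_e(a_i \mid x_i) / \pi_0(a_i \mid x_i)$; it assumes the logging propensities were recorded, and the function name and data are hypothetical.

```python
import numpy as np

def ips_estimate(rewards, pi0_probs, pie_probs):
    """Inverse propensity scoring (IPS) estimate of V(pi_e) from logged data D_0.

    rewards   : r_i observed under the logging policy pi_0
    pi0_probs : pi_0(a_i | x_i), the logged propensities
    pie_probs : pi_e(a_i | x_i), the evaluation policy's action probabilities
    """
    importance_weights = pie_probs / pi0_probs
    return float(np.mean(importance_weights * rewards))

# Hypothetical logged data D_0 collected under pi_0.
rewards   = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
pi0_probs = np.array([0.5, 0.2, 0.4, 0.5, 0.3])
pie_probs = np.array([0.7, 0.1, 0.6, 0.2, 0.5])
print(ips_estimate(rewards, pi0_probs, pie_probs))
```

IPS is unbiased when $\pi_0$ assigns nonzero probability to every action $\pi_e$ might take, but its variance grows with the importance weights; this tradeoff is what motivates the range of estimators covered later.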

Flow

Policy Learning -> Offline Evaluation -> Online Evaluation

Stages