Evaluating non-stationary policies

Many decision-making algorithms perform online learning, where an internal model is continuously updated. Such policies are non-stationary, because their decisions depend on the history of past interactions as well as on the current context. This interactive nature makes them more difficult to evaluate than stationary policies.

The current best practice for evaluating non-stationary policies is called doubly robust non-stationary, as it extends the doubly robust technique. The non-stationarity is handled by replaying the target policy iteratively over the historical data with rejection sampling, a technique for simulating draws from one distribution using samples collected from another.
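As a reminder of the general idea behind rejection sampling, the sketch below draws samples from a proposal distribution and accepts each one with probability proportional to the ratio of target to proposal density, so the accepted samples follow the target distribution. The particular distributions and the bound M are illustrative choices only, not part of the evaluator itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proposal: uniform on [0, 1]. Target: Beta(2, 5), whose density is
# 30 * x * (1 - x)**4. Both are illustrative choices.
def target_pdf(x):
    return 30.0 * x * (1.0 - x) ** 4

M = 2.5  # upper bound on target_pdf(x) / proposal_pdf(x); the proposal density is 1 here

proposal_samples = rng.uniform(0.0, 1.0, size=10_000)
u = rng.uniform(0.0, 1.0, size=proposal_samples.size)

# Accept x with probability target_pdf(x) / (M * proposal_pdf(x)).
accepted = proposal_samples[u < target_pdf(proposal_samples) / M]
# `accepted` is now distributed approximately according to the target density.
```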

The technique iterates over the historical data and probabilistically accepts or rejects each instance for inclusion in a synthetic dataset. Because the replay is sequential, the non-stationary policy under evaluation can be updated at each step, thereby simulating online learning, as sketched below. In contrast, the techniques for stationary policies in the previous sections involved non-iterative, batch calculations over the entire historical dataset.
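To make the replay concrete, here is a minimal sketch of the iterative loop. It assumes a hypothetical logged-data schema (dicts with 'context', 'action', 'reward', and 'exploration_prob' keys) and a hypothetical policy interface exposing action_probs and update methods; neither comes from the source. The acceptance rule is referenced as a placeholder here, and a concrete version is sketched after the next paragraph.

```python
import numpy as np

def replay_evaluate(history, policy, rng=None):
    """Replay a non-stationary policy over logged data, simulating online learning.

    `history` is a list of logged instances; each is assumed to be a dict with
    'context', 'action', 'reward', and 'exploration_prob' keys (hypothetical
    schema). `policy` is assumed to expose `action_probs(context)` and
    `update(context, action, reward)` (hypothetical interface).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    accepted_rewards = []
    for instance in history:
        target_probs = policy.action_probs(instance['context'])
        # Rejection-sampling step; see the acceptance rule sketched later.
        if accept(instance, target_probs, rng):
            accepted_rewards.append(instance['reward'])
            # Updating the policy on each accepted instance simulates online learning.
            policy.update(instance['context'], instance['action'], instance['reward'])
    # Simplified value estimate over accepted instances only; the full doubly
    # robust non-stationary estimator also uses rejected instances, as
    # described in the following paragraphs.
    return float(np.mean(accepted_rewards)) if accepted_rewards else float('nan')
```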

The rejection sampling is done in proportion to the exploration and target action probabilities. Specifically, instances whose actions are more likely under the exploration policy than under the target policy tend to be rejected, so that the distribution of accepted instances more closely matches the target policy rather than the exploration policy.
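One simple form of the acceptance rule, under the same hypothetical schema as above, accepts an instance with probability proportional to the ratio of the target policy's probability for the logged action to the exploration policy's probability, clipped at one. The `cap` parameter is an illustrative stand-in for the kind of tunable bias control discussed in the next paragraph, not a parameter named in the source.

```python
def accept(instance, target_probs, rng, cap=1.0):
    """Accept a logged instance with probability min(1, ratio / cap).

    `ratio` is the target policy's probability for the logged action divided
    by the exploration policy's probability for that action. `cap` is a
    hypothetical tuning knob: if it is at least the largest possible ratio,
    the accepted instances are an unbiased sample of what the target policy
    would have generated; a smaller cap accepts more instances (using the
    data more efficiently) at the cost of some bias.
    """
    ratio = target_probs[instance['action']] / instance['exploration_prob']
    return rng.uniform() < min(1.0, ratio / cap)
```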

To minimise variance, the technique uses all historical instances, accepted or rejected, in the doubly robust reward estimate, and these per-instance estimates are averaged within each round to lower the variance further. To control bias, tunable parameters trade the level of bias in the estimate against how efficiently the historical data is used. The result is an evaluator that balances bias and variance, enabling non-stationary policies to be evaluated on historical data.
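For reference, a per-instance doubly robust reward estimate might look like the sketch below, assuming the same hypothetical schema as earlier and a hypothetical reward_model.predict(context, action) regressor. The non-stationary evaluator would compute such an estimate for each instance replayed in a round, whether accepted or rejected, and average the results.

```python
def doubly_robust_estimate(instance, target_probs, reward_model):
    """Doubly robust reward estimate for a single logged instance.

    Combines the reward model's prediction under the target policy with an
    importance-weighted correction based on the observed reward.
    `reward_model.predict(context, action)` is a hypothetical regressor.
    """
    context, logged_action = instance['context'], instance['action']
    # Model-based part: expected predicted reward under the target policy.
    model_value = sum(p * reward_model.predict(context, a)
                      for a, p in enumerate(target_probs))
    # Correction part: importance-weighted residual on the logged action.
    weight = target_probs[logged_action] / instance['exploration_prob']
    residual = instance['reward'] - reward_model.predict(context, logged_action)
    return model_value + weight * residual
```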