Open Bandit Dataset

The Open Bandit Dataset (OBD) was constructed through an A/B test of two multi-armed bandit policies on ZOZOTOWN, a large-scale fashion e-commerce platform. It currently consists of about 26M rows, each representing a user impression with feature values, the selected item as the action, the true propensity score, and a click indicator as the outcome. This makes it especially suitable for evaluating off-policy evaluation (OPE) methods, which attempt to estimate the counterfactual performance of a hypothetical policy using data generated by a different policy.

OBD is a set of logged bandit feedback datasets collected on a large-scale fashion e-commerce platform, provided by Saito et al. There are three campaigns: “ALL”, “Men”, and “Women”. We use randomly sub-sampled subsets of size 30,000 and 300,000 from the “ALL” campaign. The dataset contains the user context as a feature vector \(x \in \mathcal{X}\), a fashion item recommendation as the action \(a \in \mathcal{A}\), and a click indicator as the reward \(r \in \{0, 1\}\). The dimension of the feature vector \(x\) is 20, and the number of actions is 80.
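To make the data format concrete, the following is a minimal NumPy sketch of the arrays in one sub-sample under the stated dimensions (a 20-dimensional context, 80 actions, binary reward). The Gaussian contexts, the placeholder click rate, and all variable names are illustrative assumptions, not the actual dataset contents.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rounds = 30_000   # size of the smaller sub-sample
dim_context = 20    # dimension of the feature vector x
n_actions = 80      # number of candidate fashion items

# Context vectors x, one 20-dimensional row per impression
# (standard-normal placeholders, not real user features).
context = rng.normal(size=(n_rounds, dim_context))

# Logged actions a, drawn here uniformly at random, so the
# true propensity score of every logged action is 1 / n_actions.
action = rng.integers(n_actions, size=n_rounds)
pscore = np.full(n_rounds, 1.0 / n_actions)

# Binary click indicators r in {0, 1} (placeholder click rate).
reward = rng.binomial(1, p=0.005, size=n_rounds)

print(context.shape, action.shape, reward.mean())
```

Each row of `context` pairs with one entry of `action`, `pscore`, and `reward`, which is exactly the (context, action, propensity, reward) tuple structure described above.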

The dataset consists of subsets of data collected by two different policies: the uniform random policy and the Bernoulli Thompson Sampling policy. We let \(\mathcal{D}_A\) denote the dataset collected by the uniform random policy \(\pi_A\), and \(\mathcal{D}_B\) denote that collected by the Bernoulli Thompson Sampling policy \(\pi_B\).
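Because the true propensity scores are logged, the value of one policy can be estimated from data collected by the other via inverse propensity scoring (IPS), the simplest OPE estimator. The sketch below is a self-contained NumPy illustration under assumed shapes; the toy uniform data, the hypothetical evaluation-policy distribution `pi_e`, and the function name are ours, not part of the dataset release.

```python
import numpy as np

def ips_value(reward, action, pscore, pi_e_probs):
    """IPS estimate of an evaluation policy's value from logged feedback.

    reward:     (n,) click indicators r
    action:     (n,) logged actions a
    pscore:     (n,) true propensities of the logged actions, e.g. pi_A(a|x)
    pi_e_probs: (n, n_actions) evaluation-policy distribution, e.g. pi_B(.|x)
    """
    n = reward.shape[0]
    # Importance weight pi_B(a|x) / pi_A(a|x) for each logged action.
    iw = pi_e_probs[np.arange(n), action] / pscore
    return float(np.mean(iw * reward))

# Toy stand-in for D_A: 80 actions logged uniformly at random,
# so every true propensity score is exactly 1/80.
rng = np.random.default_rng(0)
n, n_actions = 10_000, 80
action = rng.integers(n_actions, size=n)
reward = rng.binomial(1, 0.01, size=n)
pscore = np.full(n, 1.0 / n_actions)

# Hypothetical evaluation policy that also happens to be uniform;
# the importance weights then all equal 1, and the IPS estimate
# reduces to the empirical mean click rate.
pi_e = np.full((n, n_actions), 1.0 / n_actions)
print(ips_value(reward, action, pscore, pi_e))
```

In the actual evaluation setting, \(\pi_B\)'s action distribution would replace the uniform `pi_e`, and the estimate from \(\mathcal{D}_A\) could be checked against \(\pi_B\)'s on-policy click rate observed in \(\mathcal{D}_B\).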