Evaluating a New Fraud Policy with IPS, DM, and DR Methods

IPS (Inverse Propensity Scoring)

from typing import Callable, Dict, List
import pandas as pd
import statistics


P95_Z_SCORE = 1.96


def compute_list_stats(values: List[float]) -> Dict:
    """Compute the mean and a 95% confidence interval of the mean for a list of floats."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values) if len(values) > 1 else None
    ci_low = round(mean - P95_Z_SCORE * std_dev, 2) if std_dev is not None else None
    ci_high = round(mean + P95_Z_SCORE * std_dev, 2) if std_dev is not None else None

    return {"mean": round(mean, 2), "ci_low": ci_low, "ci_high": ci_high}


def evaluate(
    df: pd.DataFrame, action_prob_function: Callable, num_bootstrap_samples: int = 0
) -> Dict[str, Dict[str, float]]:

    results = [
        evaluate_raw(df, action_prob_function, sample=True)
        for _ in range(num_bootstrap_samples)
    ]

    if not results:
        results = [evaluate_raw(df, action_prob_function, sample=False)]

    logging_policy_rewards = [result["logging_policy"] for result in results]
    new_policy_rewards = [result["new_policy"] for result in results]

    return {
        "expected_reward_logging_policy": compute_list_stats(logging_policy_rewards),
        "expected_reward_new_policy": compute_list_stats(new_policy_rewards),
    }


def evaluate_raw(
    df: pd.DataFrame, action_prob_function: Callable, sample: bool
) -> Dict[str, float]:

    tmp_df = df.sample(df.shape[0], replace=True) if sample else df

    cum_reward_new_policy = 0
    for _, row in tmp_df.iterrows():
        action_probabilities = action_prob_function(row["context"])
        cum_reward_new_policy += (
            action_probabilities[row["action"]] / row["action_prob"]
        ) * row["reward"]

    return {
        "logging_policy": tmp_df.reward.sum() / len(tmp_df),
        "new_policy": cum_reward_new_policy / len(tmp_df),
    }
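
In words, evaluate_raw importance-weights each logged reward by how much more (or less) likely the new policy is to take the logged action than the logging policy was, then averages over the logs:

    expected_reward_new_policy = (1/n) * Σ_i [ P_new(action_i | context_i) / P_logged(action_i | context_i) ] * reward_i

This is the standard IPS estimator; the bootstrap loop in evaluate just adds 95% confidence intervals around it.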

Scenario

Assume we have a fraud model in production that blocks a transaction if P(fraud) > 0.05.

Let’s build some sample logs from that policy running in production. One thing to note: we need some basic exploration in the production logs (e.g., epsilon-greedy with ε = 0.1), meaning 10% of the time we take a random action. Rewards represent the revenue gained from allowing a transaction; a negative reward means the transaction was fraudulent and resulted in a chargeback.
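
For reference, here is a rough sketch of what the logging policy’s action probabilities look like under this ε-greedy scheme. The logging_action_probabilities helper is hypothetical and not part of the evaluation code; it just mirrors the structure of the action_probabilities function defined later and shows where the 0.90 / 0.10 values in the action_prob column below come from:

def logging_action_probabilities(context):
    # hypothetical helper: the production policy blocks when P(fraud) > 0.05,
    # and exploration gives the non-greedy action probability epsilon
    epsilon = 0.10
    if context["p_fraud"] > 0.05:
        return {"allowed": epsilon, "blocked": 1 - epsilon}

    return {"allowed": 1 - epsilon, "blocked": epsilon}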

import pandas as pd

logs_df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 10},    
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20}, # only allowed due to exploration 
])

logs_df
context action action_prob reward
0 {'p_fraud': 0.08} blocked 0.9 0
1 {'p_fraud': 0.03} allowed 0.9 20
2 {'p_fraud': 0.01} allowed 0.9 10
3 {'p_fraud': 0.09} allowed 0.1 -20

Now let’s use IPS to score a more lenient fraud model that blocks a transaction only if P(fraud) > 0.10.

IPS requires that we know P(action | context) for the new policy. We can easily describe our new policy:

def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}    
    
    return {"allowed": 1 - epsilon, "blocked": epsilon}

We can now get the probability that the new policy takes the same action that was taken in the production logs above.

logs_df["new_action_prob"] = logs_df.apply(
    lambda row: action_probabilities(row["context"])[row["action"]],
    axis=1
)
logs_df
context action action_prob reward new_action_prob
0 {'p_fraud': 0.08} blocked 0.9 0 0.1
1 {'p_fraud': 0.03} allowed 0.9 20 0.9
2 {'p_fraud': 0.01} allowed 0.9 10 0.9
3 {'p_fraud': 0.09} allowed 0.1 -20 0.9

We see that the new policy lets the fraudulent transaction in row 3 through with a much higher probability. This should penalize the new policy in offline evaluation. We also see that for row 0 the new policy has a 90% chance of allowing the transaction, but we don’t have the counterfactual knowledge of whether it would have been a non-fraud transaction, since in production the transaction was blocked. This demonstrates one of the drawbacks of offline policy evaluation; with more data we would ideally see a different action taken in the same situation (thanks to exploration).

Now we will score the new model using IPS:

evaluate(logs_df, action_probabilities, num_bootstrap_samples=100)
{'expected_reward_logging_policy': {'ci_high': 17.59,
  'ci_low': -11.19,
  'mean': 3.2},
 'expected_reward_new_policy': {'ci_high': 46.68,
  'ci_low': -109.88,
  'mean': -31.6}}

The expected reward per observation for the new policy is much worse than for the logging policy (driven by the fraudulent transaction it lets through in row 3), so we wouldn’t roll this new policy out to an A/B test or production; instead we should test some different policies offline.
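
As a sanity check, we can reproduce the non-bootstrapped point estimates by hand from the four logged rows. These are also what evaluate(logs_df, action_probabilities) returns when num_bootstrap_samples is left at its default of 0 (a single pass over the full logs, with ci_low / ci_high reported as None); the bootstrapped means above differ a bit because of resampling noise:

# hand calculation of the IPS point estimate: (new_action_prob / action_prob) * reward per row
ips_terms = [
    (0.1 / 0.9) * 0,     # row 0: blocked by the logging policy
    (0.9 / 0.9) * 20,    # row 1: allowed
    (0.9 / 0.9) * 10,    # row 2: allowed
    (0.9 / 0.1) * -20,   # row 3: allowed only due to exploration
]
print(sum(ips_terms) / 4)        # -37.5, new policy (IPS point estimate)
print((0 + 20 + 10 - 20) / 4)    # 2.5, logging policy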

However, the confidence intervals around the expected rewards for our old and new policies overlap. If we want to be really certain, it might be best to gather some more data to ensure the difference is signal and not noise. In this case we fortunately have strong reason to suspect the new policy is worse, but these confidence intervals can be important in cases where we have less prior certainty.

DM (Direct Method)

from typing import Callable, Dict, Tuple

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn import ensemble
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.linear_model import Ridge, RidgeClassifier


def fit_gbdt_regression(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn GBDT regressor."""

    clf = ensemble.GradientBoostingRegressor()
    clf.fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, clf.predict(X_train))

    mse_test = None
    if X_test is not None and y_test is not None:
        mse_test = mean_squared_error(y_test, clf.predict(X_test))

    return {"model": clf, "mse_train": mse_train, "mse_test": mse_test}


def fit_gbdt_classification(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn GBDT classifier."""

    clf = ensemble.GradientBoostingClassifier()
    clf.fit(X_train, y_train)

    acc_train = accuracy_score(y_train, clf.predict(X_train))

    acc_test = None
    if X_test is not None and y_test is not None:
        acc_test = accuracy_score(y_test, clf.predict(X_test))

    return {"model": clf, "acc_train": acc_train, "acc_test": acc_test}


def fit_ridge_regression(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn Ridge regression."""

    clf = Ridge()
    clf.fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, clf.predict(X_train))

    mse_test = None
    if X_test is not None and y_test is not None:
        mse_test = mean_squared_error(y_test, clf.predict(X_test))

    return {"model": clf, "mse_train": mse_train, "mse_test": mse_test}


def fit_ridge_classification(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn Ridge classifier."""

    clf = RidgeClassifier()
    clf.fit(X_train, y_train)

    acc_train = accuracy_score(y_train, clf.predict(X_train))

    acc_test = None
    if X_test is not None and y_test is not None:
        acc_test = accuracy_score(y_test, clf.predict(X_test))

    return {"model": clf, "acc_train": acc_train, "acc_test": acc_test}


class Predictor:
    def _preprocess_data(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        # preprocess context
        context_df = df.context.apply(pd.Series)
        self.context_column_order = list(context_df.columns)

        # preprocess actions
        # note: scikit-learn >= 1.2 renames this argument to sparse_output
        self.action_preprocessor = OneHotEncoder(sparse=False)
        action_values = df.action.values.reshape(-1, 1)
        self.possible_actions = set(action_values.squeeze().tolist())
        one_hot_action_values = self.action_preprocessor.fit_transform(action_values)

        X_train = np.concatenate((context_df.values, one_hot_action_values), axis=1)
        y_train = df.reward.values

        return X_train, y_train

    def fit(self, df: pd.DataFrame) -> None:
        X_train, y_train = self._preprocess_data(df)
        results = fit_gbdt_regression(X_train, y_train)
        self.model = results.pop("model")
        self.training_stats = results

    def predict(self, features: np.ndarray) -> float:
        return self.model.predict(features)[0]


def evaluate(
    df: pd.DataFrame, action_prob_function: Callable, num_bootstrap_samples: int = 0
) -> Dict[str, Dict[str, float]]:

    # train a model that predicts reward given (context, action)
    reward_model = Predictor()
    reward_model.fit(df)

    results = [
        evaluate_raw(df, action_prob_function, sample=True, reward_model=reward_model)
        for _ in range(num_bootstrap_samples)
    ]

    if not results:
        results = [
            evaluate_raw(
                df, action_prob_function, sample=False, reward_model=reward_model
            )
        ]

    logging_policy_rewards = [result["logging_policy"] for result in results]
    new_policy_rewards = [result["new_policy"] for result in results]

    return {
        "expected_reward_logging_policy": compute_list_stats(logging_policy_rewards),
        "expected_reward_new_policy": compute_list_stats(new_policy_rewards),
    }


def evaluate_raw(
    df: pd.DataFrame,
    action_prob_function: Callable,
    sample: bool,
    reward_model: Predictor,
) -> Dict[str, float]:

    tmp_df = df.sample(df.shape[0], replace=True) if sample else df
    # reset the index so that iterrows positions line up with context_array below
    tmp_df = tmp_df.reset_index(drop=True)

    context_df = tmp_df.context.apply(pd.Series)
    context_array = context_df[reward_model.context_column_order].values
    cum_reward_new_policy = 0

    for idx, row in tmp_df.iterrows():
        observation_expected_reward = 0
        action_probabilities = action_prob_function(row["context"])
        for action, action_probability in action_probabilities.items():
            one_hot_action = reward_model.action_preprocessor.transform(
                np.array(action).reshape(-1, 1)
            )
            observation = np.concatenate((context_array[idx], one_hot_action.squeeze()))
            predicted_reward = reward_model.predict(observation.reshape(1, -1))
            observation_expected_reward += action_probability * predicted_reward
        cum_reward_new_policy += observation_expected_reward

    return {
        "logging_policy": tmp_df.reward.sum() / len(tmp_df),
        "new_policy": cum_reward_new_policy / len(tmp_df),
    }
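
In words, for every logged context the direct method asks the reward model what it expects each possible action to earn, and weights those predictions by the new policy's action probabilities:

    expected_reward_new_policy = (1/n) * Σ_i Σ_a P_new(a | context_i) * predicted_reward(context_i, a)

The logged rewards only enter through the trained reward model, which keeps the variance low but means the estimate inherits any bias in that model.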

Scenario

Assume we have a fraud model in production that blocks a transaction if P(fraud) > 0.05.

Let’s build some sample logs from that policy running in production. One thing to note: we need some basic exploration in the production logs (e.g., epsilon-greedy with ε = 0.1), meaning 10% of the time we take a random action. Rewards represent the revenue gained from allowing a transaction; a negative reward means the transaction was fraudulent and resulted in a chargeback.

logs_df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10}, 
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},     
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20}, # only allowed due to exploration 
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10}, # only allowed due to exploration     
])

logs_df
context action action_prob reward
0 {'p_fraud': 0.08} blocked 0.9 0
1 {'p_fraud': 0.03} allowed 0.9 20
2 {'p_fraud': 0.02} allowed 0.9 10
3 {'p_fraud': 0.01} allowed 0.9 20
4 {'p_fraud': 0.09} allowed 0.1 -20
5 {'p_fraud': 0.4} allowed 0.1 -10

Now let’s use the direct method to score a more lenient fraud model that blocks a transaction only if P(fraud) > 0.10.

The direct method requires a function that computes P(action | context) for all possible actions under our new policy. We can easily define that for our new policy:

def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}    
    
    return {"allowed": 1 - epsilon, "blocked": epsilon}

We will use the same production logs above and run them through the new policy:

evaluate(logs_df, action_probabilities, num_bootstrap_samples=100)
{'expected_reward_logging_policy': {'ci_high': 15.89,
  'ci_low': -7.89,
  'mean': 4.0},
 'expected_reward_new_policy': {'ci_high': 19.34,
  'ci_low': -11.12,
  'mean': 4.11}}

The direct method estimates that the expected reward per observation for the new policy is slightly better than for the logging policy, so we would consider rolling this new policy out to an A/B test or production.
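
Since the direct method leans entirely on the reward model, it is worth inspecting what that model predicts for each (context, action) pair before trusting the estimate. A rough sketch using the Predictor class defined above (the printed numbers depend on the fitted GBDT, so treat them as illustrative):

import numpy as np

reward_model = Predictor()
reward_model.fit(logs_df)

# predicted reward for a few contexts under each possible action
for p_fraud in [0.01, 0.08, 0.40]:
    for action in ["allowed", "blocked"]:
        one_hot = reward_model.action_preprocessor.transform(
            np.array(action).reshape(-1, 1)
        )
        features = np.concatenate(([p_fraud], one_hot.squeeze())).reshape(1, -1)
        print(p_fraud, action, round(reward_model.predict(features), 2))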

However, the confidence intervals around the expected rewards for our old and new policies overlap heavily. If we want to be really certain, it might be best to gather some more data to ensure the difference is signal and not noise. In this case we actually have strong reason to suspect the new policy is worse, which is a caution against trusting the direct method's estimate on its own; these confidence intervals can be especially important in cases where we have less prior certainty.

DR (Doubly Robust)

from typing import Callable, Dict, Tuple

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn import ensemble
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.linear_model import Ridge, RidgeClassifier


def fit_gbdt_regression(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn GBDT regressor."""

    clf = ensemble.GradientBoostingRegressor()
    clf.fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, clf.predict(X_train))

    mse_test = None
    if X_test is not None and y_test is not None:
        mse_test = mean_squared_error(y_test, clf.predict(X_test))

    return {"model": clf, "mse_train": mse_train, "mse_test": mse_test}


def fit_gbdt_classification(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn GBDT classifier."""

    clf = ensemble.GradientBoostingClassifier()
    clf.fit(X_train, y_train)

    acc_train = accuracy_score(y_train, clf.predict(X_train))

    acc_test = None
    if X_test is not None and y_test is not None:
        acc_test = accuracy_score(y_test, clf.predict(X_test))

    return {"model": clf, "acc_train": acc_train, "acc_test": acc_test}


def fit_ridge_regression(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn Ridge regression."""

    clf = Ridge()
    clf.fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, clf.predict(X_train))

    mse_test = None
    if X_test is not None and y_test is not None:
        mse_test = mean_squared_error(y_test, clf.predict(X_test))

    return {"model": clf, "mse_train": mse_train, "mse_test": mse_test}


def fit_ridge_classification(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray = None,
    y_test: np.ndarray = None,
) -> Dict:
    """Off the shelf sklearn Ridge classifier."""

    clf = RidgeClassifier()
    clf.fit(X_train, y_train)

    acc_train = accuracy_score(y_train, clf.predict(X_train))

    acc_test = None
    if X_test is not None and y_test is not None:
        acc_test = accuracy_score(y_test, clf.predict(X_test))

    return {"model": clf, "acc_train": acc_train, "acc_test": acc_test}


class Predictor:
    def _preprocess_data(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        # preprocess context
        context_df = df.context.apply(pd.Series)
        self.context_column_order = list(context_df.columns)

        # preprocess actions
        # note: scikit-learn >= 1.2 renames this argument to sparse_output
        self.action_preprocessor = OneHotEncoder(sparse=False)
        action_values = df.action.values.reshape(-1, 1)
        self.possible_actions = set(action_values.squeeze().tolist())
        one_hot_action_values = self.action_preprocessor.fit_transform(action_values)

        X_train = np.concatenate((context_df.values, one_hot_action_values), axis=1)
        y_train = df.reward.values

        return X_train, y_train

    def fit(self, df: pd.DataFrame) -> None:
        X_train, y_train = self._preprocess_data(df)
        results = fit_gbdt_regression(X_train, y_train)
        self.model = results.pop("model")
        self.training_stats = results

    def predict(self, features: np.ndarray) -> float:
        return self.model.predict(features)[0]


def evaluate(
    df: pd.DataFrame, action_prob_function: Callable, num_bootstrap_samples: int = 0
) -> Dict[str, Dict[str, float]]:

    # train a model that predicts reward given (context, action)
    reward_model = Predictor()
    reward_model.fit(df)

    results = [
        evaluate_raw(df, action_prob_function, sample=True, reward_model=reward_model)
        for _ in range(num_bootstrap_samples)
    ]

    if not results:
        results = [
            evaluate_raw(
                df, action_prob_function, sample=False, reward_model=reward_model
            )
        ]

    logging_policy_rewards = [result["logging_policy"] for result in results]
    new_policy_rewards = [result["new_policy"] for result in results]

    return {
        "expected_reward_logging_policy": compute_list_stats(logging_policy_rewards),
        "expected_reward_new_policy": compute_list_stats(new_policy_rewards),
    }


def evaluate_raw(
    df: pd.DataFrame,
    action_prob_function: Callable,
    sample: bool,
    reward_model: Predictor,
) -> Dict[str, float]:

    tmp_df = df.sample(df.shape[0], replace=True) if sample else df
    # reset the index so that iterrows positions line up with context_array below
    tmp_df = tmp_df.reset_index(drop=True)

    context_df = tmp_df.context.apply(pd.Series)
    context_array = context_df[reward_model.context_column_order].values
    cum_reward_new_policy = 0

    for idx, row in tmp_df.iterrows():
        observation_expected_reward = 0
        processed_context = context_array[idx]

        # first compute the left hand term, which is the direct method
        action_probabilities = action_prob_function(row["context"])
        for action, action_probability in action_probabilities.items():
            one_hot_action = reward_model.action_preprocessor.transform(
                np.array(action).reshape(-1, 1)
            )
            observation = np.concatenate((processed_context, one_hot_action.squeeze()))
            predicted_reward = reward_model.predict(observation.reshape(1, -1))
            observation_expected_reward += action_probability * predicted_reward

        # then compute the right hand term, which is similar to IPS
        logged_action = row["action"]
        new_action_probability = action_probabilities[logged_action]
        weight = new_action_probability / row["action_prob"]
        one_hot_action = reward_model.action_preprocessor.transform(
            np.array(row["action"]).reshape(-1, 1)
        )
        observation = np.concatenate((processed_context, one_hot_action.squeeze()))
        predicted_reward = reward_model.predict(observation.reshape(1, -1))
        observation_expected_reward += weight * (row["reward"] - predicted_reward)

        cum_reward_new_policy += observation_expected_reward

    return {
        "logging_policy": tmp_df.reward.sum() / len(tmp_df),
        "new_policy": cum_reward_new_policy / len(tmp_df),
    }
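
Putting the two terms in evaluate_raw together, the doubly robust estimate for one logged observation is the direct-method prediction plus an importance-weighted correction for how wrong the reward model was on the action that was actually taken:

    DR_i = Σ_a P_new(a | context_i) * predicted_reward(context_i, a)
           + [ P_new(action_i | context_i) / P_logged(action_i | context_i) ] * (reward_i - predicted_reward(context_i, action_i))

The final estimate is the average of DR_i over the logs. It remains accurate if either the reward model or the logged action probabilities are good, which is where the name "doubly robust" comes from.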

Scenario

Assume we have a fraud model in production that blocks a transaction if P(fraud) > 0.05.

Let’s build some sample logs from that policy running in production. One thing to note: we need some basic exploration in the production logs (e.g., epsilon-greedy with ε = 0.1), meaning 10% of the time we take a random action. Rewards represent the revenue gained from allowing a transaction; a negative reward means the transaction was fraudulent and resulted in a chargeback.

logs_df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10}, 
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},     
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20}, # only allowed due to exploration 
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10}, # only allowed due to exploration     
])

logs_df
context action action_prob reward
0 {'p_fraud': 0.08} blocked 0.9 0
1 {'p_fraud': 0.03} allowed 0.9 20
2 {'p_fraud': 0.02} allowed 0.9 10
3 {'p_fraud': 0.01} allowed 0.9 20
4 {'p_fraud': 0.09} allowed 0.1 -20
5 {'p_fraud': 0.4} allowed 0.1 -10

Now let’s use the doubly robust method to score a more lenient fraud model that blocks a transaction only if P(fraud) > 0.10.

The doubly robust method requires a function that computes P(action | context) for all possible actions under our new policy. We can easily define that for our new policy:

def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}    
    
    return {"allowed": 1 - epsilon, "blocked": epsilon}

We will use the same production logs above and run them through the new policy.

evaluate(logs_df, action_probabilities, num_bootstrap_samples=50)
{'expected_reward_logging_policy': {'ci_high': 10.58,
  'ci_low': -8.24,
  'mean': 1.17},
 'expected_reward_new_policy': {'ci_high': 52.14,
  'ci_low': -110.68,
  'mean': -29.27}}

The doubly robust method estimates that the expected reward per observation for the new policy is much worse than for the logging policy, so we wouldn’t roll this new policy out to an A/B test or production; instead we should test some different policies offline.
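
To see why the estimate swings so far negative, here is an illustrative breakdown of the DR term for row 4 (p_fraud = 0.09, allowed, reward = -20). The r_hat values below are hypothetical stand-ins for the reward model's predictions (the real numbers depend on the fitted GBDT), but the structure of the calculation is exactly what evaluate_raw does:

# hypothetical reward-model predictions for the row 4 context (p_fraud = 0.09);
# the real values depend on the fitted GBDT
r_hat_allowed = 5.0
r_hat_blocked = 0.0

pi_new = 0.9   # new policy: P(allowed | p_fraud = 0.09)
pi_log = 0.1   # logging policy's probability of the logged action ("allowed")

dm_term = 0.9 * r_hat_allowed + 0.1 * r_hat_blocked      # direct-method part: 4.5
correction = (pi_new / pi_log) * (-20 - r_hat_allowed)   # 9 * (-25) = -225.0
print(dm_term + correction)                              # -220.5 for this one row

The 9x importance weight amplifies the reward model's miss on the exploration row, which is what drags the DR mean for the new policy down.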

However, the confidence intervals around the expected rewards for our old and new policies overlap heavily. If we want to be really certain, it might be best to gather some more data to ensure the difference is signal and not noise. In this case we fortunately have strong reason to suspect the new policy is worse, but these confidence intervals can be important in cases where we have less prior certainty.