Bayesian AB Testing with PyMC

Author

Matthew Harris

Published

November 2, 2024

Abstract
Using PyMC to analyze A/B test results.

Intro

Many organizations use A/B tests to determine which design and product changes yield the best performance. But how can you tell if the effect seen in your A/B test are significant? And, more importantly, do your stakeholders even care about identifying statistically significant effects? These are difficult questions, but the goal is to provide answers that stakeholders can easily digest, reason, and act on.

One extremely powerful approach is Bayesian inference. Bayesian models allow us to tackle complex processes and provide intuitive outputs. At the heart of the Bayesian framework is the idea of using our prior knowledge and observed data to update our beliefs on uncertainty.

My posts on Bayesian inference will mostly focus on the practical aspects of how this framework can be used to answer real world questions. Below are links to some incredible free resources that provide more detailed information on the subject:

Background

You work for a pet focused E-commerce site called WagMart. You primarily support the mobile app development team. One of the key metrics for the app is average add-to-cart actions per session. Product changes can impact this metric by either effecting the add-to-cart rate per session (0% to 100%) and/or the average add-to-cart actions per conversion (1 to Inf).

Package Imports

Code
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import random
import statsmodels.api as sm
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

Synthetic Data

Normally at this point you would query your data to begin your analysis. Since I don’t have any data for this example I’ll need to create some. The specifics of the code below aren’t important for this post, but we will revisit this process of generating synthetic data in the future. Even though I’m creating my own test data, in subsequent sections we will treat this data as “real” and assume that it is generated solely from the actions of users who user the app.

Code
# atc_rate = add-to-cart rate
# avg_apc = average add-to-cart actions per conversion

seed = 1119
sample_size_per_variant = 300_000
ctrl_p_atc_rate = 0.07
ctrl_psi_avg_apc = 0.80
ctrl_mu_avg_apc = 2.5
chal_p_atc_rate_prc_delta = 0.07
chal_psi_avg_apc_prc_delta = 0.05
chal_mu_avp_apc_prc_delta = -0.05

def make_synthetic_dist(unique_conv_rate, psi, mu, sample_size, seed):
    num_convs = int(np.floor(unique_conv_rate * sample_size))
    num_zeros = sample_size - num_convs
    dist = pm.draw(
        pm.ZeroInflatedPoisson.dist(psi=psi, mu=mu), draws=num_convs, 
        random_seed=seed
    )
    dist = [c + 1 for c in dist]
    dist.extend([0] * num_zeros)
    random.seed(seed)
    random.shuffle(dist)

    return dist


ctrl_total_atc_dist = make_synthetic_dist(
    unique_conv_rate=ctrl_p_atc_rate,
    psi=ctrl_psi_avg_apc,
    mu=ctrl_mu_avg_apc,
    sample_size=sample_size_per_variant,
    seed=seed,
)

chal_total_atc_dist = make_synthetic_dist(
    unique_conv_rate=ctrl_p_atc_rate * (1 + chal_p_atc_rate_prc_delta),
    psi=ctrl_psi_avg_apc * (1 + chal_psi_avg_apc_prc_delta),
    mu=ctrl_mu_avg_apc * (1 + chal_mu_avp_apc_prc_delta),
    sample_size=sample_size_per_variant,
    seed=seed,
)

pop_data_dict = {
    "variant": (["control"] * sample_size_per_variant)
    + (["challenger"] * sample_size_per_variant),
    "total_atc": ctrl_total_atc_dist + chal_total_atc_dist,
}

pop_data_df = pd.DataFrame(pop_data_dict)

sample_size = 30_000 # 12_000
pre_df = pop_data_df[pop_data_df["variant"] == "control"].sample(
    n=35_000, random_state=seed
).reset_index(drop=True)
ctrl_sample = pop_data_df[pop_data_df["variant"] == "control"].sample(
    n=sample_size, random_state=seed
)
chal_sample = pop_data_df[pop_data_df["variant"] == "challenger"].sample(
    n=sample_size, random_state=seed
)

obs_data_df = pd.concat([ctrl_sample, chal_sample]).reset_index(drop=True)
obs_sample_size_str = f'{sample_size:,}'
Code
test_results = (
  obs_data_df
  .groupby("variant")["total_atc"]
  .mean()
  )

prc_diff = f"{(test_results[0] / test_results[1]) - 1:.1%}"

The Problem

The product team has a series of changes that they are looking to implement. They are currently running an A/B test on the first phase of these changes.

The team is more interested in moving forward with the next phase than having assurances of a higher performing experience. You have some additional conversations and determine that they are comfortable with at most a -2.0% drop in current performance. Any loss of performance beyond -2.0% would be unacceptable for the challenger, and would require them to end the test and design a new experience.

It’s now two weeks since the test started. The challenger experience is seeing a 5.7% lift in average add-to-cart actions per session, with 30,000 sessions per variant. The stakeholders are understandably excited with the performance as they believe a lift in the point estimate should be enough to say that the challenger variant isn’t worse than control. They want to launch this new experience as soon as possible but want your input before moving forward. Their main question is deceptively simple:

Is it “safe” for us to launch our challenger experience given the data that we have?

Frequentist Approach

Prospective Power Analysis

Code
pre_mean = pre_df["total_atc"].mean()
pre_std = pre_df["total_atc"].std()
exp_prc_change = 0.10

effect_size = ((pre_mean * (1 + exp_prc_change)) - pre_mean) / pre_std

req_sample_size = sm.stats.tt_ind_solve_power(
  effect_size=effect_size,
  power=0.8,
  alpha=0.05,
  alternative="two-sided",
  ratio=1.0,
  nobs1=None
)

req_sample_size_str = f'{round(int(req_sample_size), -3):,}'
exp_prc_change_str = f'{exp_prc_change:.0%}'

In this situation you’ve decided to analyze the test results using a t-test. You ran a power analysis before the start of the test and determined that you would need at least 29,000 session per variant to detect a 10% change in average add-to-cart actions per session.

We have met our sample size requirement as determined by the power analysis but our observed lift is lower than our expected lift. Let’s take a look at the p-value of our observed data.

Code
_, obs_p_value, _ = sm.stats.ttest_ind(
  x1=obs_data_df.loc[obs_data_df["variant"] == "challenger", "total_atc"],
  x2=obs_data_df.loc[obs_data_df["variant"] == "control", "total_atc"],
  alternative="two-sided",
  usevar="unequal"
)

print(f'p-value = {round(obs_p_value, 3)}')
p-value = 0.101

Analyzing p-values

The p-value only allows us to speak about the probability of the observed data given that there is no difference in the performance between the control and challenger experiences. A p-value is extremely easy calculate but also just as easy to misinterpret.

In our situation, the p-value > 0.05 which means we are unable to reject the null hypothesis. We could wait for more data and perform more significance tests but we would need to make adjustments to our alpha to account for the additional “peeks”. Additional significance tests won’t answer the real question: What is the probability that our challenger experience is more than 2% worse than our control experience?

This is where we turn to the Bayesian framework to better understand and quantify our uncertainty.

Bayesian Inference

Using a Bayesian approach we can shift from speaking about the probability of observing our data to speaking about the probability of our beliefs. I’ll save the specifics for how my PyMC model is created for a future post and focus more what inference we can draw from it.

Data Prep

Code
def format_data(df, variant):
    data_dict = {}
    convs = df.loc[
        (df["variant"] == variant) & (df["total_atc"] > 0), "total_atc"
    ].values
    data_dict["unqiue_trials"] = len(df.loc[df["variant"] == variant, "total_atc"])
    data_dict["unqiue_convs"] = len(convs)
    data_dict["total_atc"] = convs - 1
    return data_dict


obs_data_dict_ctrl = format_data(obs_data_df, "control")
obs_data_dict_chal = format_data(obs_data_df, "challenger")

Model Creation

Code
with pm.Model(coords={"variant": ["control", "challenger"]}) as model:
    data_unique_trials_ctrl = pm.MutableData(
        "data_unique_trials_ctrl", obs_data_dict_ctrl["unqiue_trials"]
    )
    data_unique_convs_ctrl = pm.MutableData(
        "data_unique_convs_ctrl", obs_data_dict_ctrl["unqiue_convs"]
    )
    data_total_atc_ctrl = pm.MutableData(
        "data_total_atc_ctrl", obs_data_dict_ctrl["total_atc"]
    )

    data_unique_trials_chal = pm.MutableData(
        "data_unique_trials_chal", obs_data_dict_chal["unqiue_trials"]
    )
    data_unique_convs_chal = pm.MutableData(
        "data_unique_convs_chal", obs_data_dict_chal["unqiue_convs"]
    )
    data_total_atc_chal = pm.MutableData(
        "data_total_atc_chal", obs_data_dict_chal["total_atc"]
    )

    p_atc_rate = pm.Beta("p_atc_rate", alpha=1, beta=1, dims="variant")
    psi_avg_apc = pm.Beta("psi_avg_apc", alpha=1, beta=1, dims="variant")
    mu_avg_apc = pm.Uniform("mu_avg_apc", lower=0, upper=20, dims="variant")

    avg_atca_per_session = pm.Deterministic(
        "avg_atca_per_session",
        p_atc_rate * (psi_avg_apc * mu_avg_apc + 1),
        dims="variant",
    )

    prc_delta_avg_atca_per_session = pm.Deterministic(
        "prc_delta_avg_atca_per_session",
        (avg_atca_per_session[1] / avg_atca_per_session[0]) - 1,
    )

    obs_atc_rate_ctrl = pm.Binomial(
        "obs_atc_rate_ctrl",
        p=p_atc_rate[0],
        n=data_unique_trials_ctrl,
        observed=data_unique_convs_ctrl,
    )
    obs_avg_apc_ctrl = pm.ZeroInflatedPoisson(
        "obs_avg_apc_ctrl",
        psi=psi_avg_apc[0],
        mu=mu_avg_apc[0],
        observed=data_total_atc_ctrl,
    )

    obs_atc_rate_chal = pm.Binomial(
        "obs_atc_rate_chal",
        p=p_atc_rate[1],
        n=data_unique_trials_chal,
        observed=data_unique_convs_chal,
    )
    obs_avg_apc_chal = pm.ZeroInflatedPoisson(
        "obs_avg_apc_chal",
        psi=psi_avg_apc[1],
        mu=mu_avg_apc[1],
        observed=data_total_atc_chal,
    )

Posterior Distribution Sampling

Code
with model:
  idata = pm.sample(chains=4, random_seed = seed)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p_atc_rate, psi_avg_apc, mu_avg_apc]
100.00% [8000/8000 00:05<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.

What is the posterior distribution?

The posterior distribution represents our updated beliefs given the data we have observed. In order to accurately analyze relative differences between our control and challenger variants the posterior distribution has to looks at outcomes from the same chain/draw combination. Luckily PyMC handles this multi-dimensional array math for us within the model.

The priors (beliefs) used in our model assume that there isn’t a difference in the average ATC actions per session between our control and challenger variants. Collecting data allows us to update our priors. Armed with our updated beliefs on the relative performance we can make informed decisions.

The entire posterior distribution is the estimate of the Bayesian inference. It provides a wealth of information but stakeholders will likely be confused if they’re only shown the distribution. We’ll need an intuitive way to guide them through the results. Even if our findings are actionable, they are ultimately meaningless if not communicated well.

Communicating Findings

The plot below provides us with everything we need to answer our original question.

The blue curve is the posterior distribution of the percent difference of our challenger from our control. The red dashed line is our under performance threshold of -2%. The red area to the left of the dashed line captures the probability of our challenger being worse than our threshold. This allows us to say that there is a 1.4% probability that our challenger is more than -2% worse than our control, given our data.

You can inform the stakeholders that the challenger is “safe” to launch as it has a low probability of being worse than your current experience.

Code
rope_low = -0.02

posterior_stacked = az.extract(idata.posterior)
prc_delta_posterior_dist = posterior_stacked["prc_delta_avg_atca_per_session"].values

n_samples = len(prc_delta_posterior_dist)
n_samples_worse = len([s for s in prc_delta_posterior_dist if s < rope_low])

prob_worse = n_samples_worse / n_samples

hist_data = [prc_delta_posterior_dist]
group_labels = ["X"]

fig = ff.create_distplot(
    hist_data, group_labels, show_rug=False, show_hist=False, colors=["#3c62c2"]
)

x_low = [x for x in fig.data[0].x if x < rope_low]
y_low = fig.data[0].y[:len(x_low)]

fig.add_scatter(x=x_low, y=y_low, fill="tozeroy", mode="none", fillcolor="#ebbdb9")
fig.add_vline(
    x=rope_low,
    line_dash="dot",
    line_color="#c9483c",
    line_width=4,
    opacity=0.8,
)

fig.add_vline(
    x=np.mean(prc_delta_posterior_dist),
    line_dash="dot",
    line_color="black",
    line_width=4,
    opacity=0.3,
)
fig.add_annotation(
    x=np.mean(prc_delta_posterior_dist),
    y=9.0,
    font_size=12,
    text=format(f"Mean % Diff<br><b>{np.mean(prc_delta_posterior_dist):.1%}</b>"),
    showarrow=False,
)
fig.add_annotation(
    x=rope_low + ((min(prc_delta_posterior_dist) - rope_low) / 2),
    y=2.0,
    font_size=12,
    text=format(
        f"Probability of being<br>worse than control<br><b>{prob_worse:.1%}</b>"
    ),
    showarrow=False,
)
fig.update_layout(
    template="plotly_white",
    showlegend=False,
    title=dict(
        text="% Diff<br>Avg. Add-to-Cart Actions Per Session",
        x=0.5,
        font=dict(size=15, weight="bold"),
    ),
    xaxis=dict(tickformat=",.0%", nticks=20),
    yaxis=dict(showticklabels=False, showgrid=False, ticks=""),
    width=700,
    height=400
)

fig.show()
−6%−4%−2%0%2%4%6%8%10%12%14%16%
% DiffAvg. Add-to-Cart Actions Per SessionMean % Diff5.8%Probability of beingworse than control1.4%

Outro

That was a lot. If you made it to the end, thanks for sticking with me.

In this example we were able to provide actionable and intuitive business recommendations. There is nothing wrong with using Frequentist methods but a Bayesian approach was better suited for our question.

In future posts I’ll dig more into how to create and test a model. I’ll also examine a case where stakeholders want to look at the probability of existence and significance of an effect.