
A/B Testing ML Models in Production

Why Offline Metrics Lie

Your new model beats the baseline by 2% on the test set. You deploy it Monday. Revenue drops 5% by Wednesday. What happened?

Offline metrics lied. The test set didn't capture the real distribution of Monday-morning users. The proxy metric (accuracy) didn't correlate with the business metric (revenue). And the model had a subtle regression on high-value mobile users that averaged out in aggregate test-set numbers.

If you've been through the ML evaluation process — cross-validation, precision/recall tradeoffs, statistical significance tests — you know that offline evaluation is necessary. But it's not sufficient. The gap between "performs well on held-out data" and "performs well on real users in production" is where A/B testing lives.

A/B testing for ML models follows the same core idea as web A/B tests: split traffic between a control (your current model) and a treatment (the new model), collect outcomes, and compare statistically. But ML A/B tests are trickier than testing button colors. Model predictions cascade through downstream systems, metrics are often delayed (fraud detection, ad conversions), and feature interactions can create unexpected correlations.

Let's build the statistical toolkit from the ground up. First, the hypothesis testing foundations that every A/B test relies on.

Statistical Foundations of A/B Testing

Every A/B test boils down to a question: "Is the observed difference between groups real, or could it be random noise?" We answer with two workhorses: the z-test for proportions (conversion rates, click-through rates) and Welch's t-test (revenue, latency, continuous metrics).

import numpy as np
from scipy import stats

# Simulate A/B test: Model A vs Model B serving real users
np.random.seed(42)
n_users = 5000  # users per group

# True conversion rates (unknown to us in practice)
# Model A (control): 3.1%  |  Model B (treatment): 3.3%
conv_a = np.random.binomial(1, 0.031, n_users)
conv_b = np.random.binomial(1, 0.033, n_users)

# ── Two-sample z-test for proportions ──
p_a, p_b = conv_a.mean(), conv_b.mean()
p_pool = (conv_a.sum() + conv_b.sum()) / (2 * n_users)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_users))
z = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print("=== Proportion Test (Conversion Rate) ===")
print(f"Model A: {p_a:.4f}  |  Model B: {p_b:.4f}")
print(f"Lift: {p_b - p_a:+.4f} ({(p_b - p_a) / p_a:+.1%} relative)")
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")
print(f"Significant at alpha=0.05? {'Yes' if p_value < 0.05 else 'No'}")

# ── Welch's t-test for continuous metric (revenue per user) ──
rev_a = np.where(conv_a, np.random.exponential(45, n_users), 0)
rev_b = np.where(conv_b, np.random.exponential(48, n_users), 0)

t_stat, t_pval = stats.ttest_ind(rev_a, rev_b, equal_var=False)
pooled_std = np.sqrt((rev_a.std()**2 + rev_b.std()**2) / 2)
cohens_d = (rev_b.mean() - rev_a.mean()) / pooled_std

se_diff = np.sqrt(rev_a.var() / n_users + rev_b.var() / n_users)
ci_lo = (rev_b.mean() - rev_a.mean()) - 1.96 * se_diff
ci_hi = (rev_b.mean() - rev_a.mean()) + 1.96 * se_diff

print("\n=== Welch's t-test (Revenue per User) ===")
print(f"Model A: ${rev_a.mean():.2f}  |  Model B: ${rev_b.mean():.2f}")
print(f"t = {t_stat:.3f}, p = {t_pval:.4f}, Cohen's d = {cohens_d:.3f}")
print(f"95% CI for difference: (${ci_lo:.2f}, ${ci_hi:.2f})")

With only 5,000 users per group and a tiny 0.2 percentage-point difference, this test almost certainly comes back "not significant." That's not a bug — it's the reality of small effects and limited data. The z-test catches whether the conversion rate differs; Welch's t-test handles the revenue comparison without assuming equal variances (which real-world data almost never has). Cohen's d tells you the effect size: by the conventional benchmarks, around 0.2 is small, around 0.5 is medium, and 0.8 or above is large.

The confidence interval is the real prize — it tells you the plausible range for the true difference. If the CI includes zero, you can't rule out that Model B is the same as (or worse than) Model A. The question then becomes: how many users do we actually need?

Sample Size — How Long Should You Run the Test?

The most common A/B testing mistake isn't a wrong statistical test — it's running the test for the wrong amount of time. Stop too early, and random fluctuations masquerade as real effects (false positives). Run too long, and you're wasting traffic on a model you could have shipped weeks ago.

The key concept is Minimum Detectable Effect (MDE): the smallest improvement that's worth detecting. If your baseline conversion rate is 5% and a 0.5 percentage-point lift would justify the engineering cost of the new model, your MDE is 0.5pp. A smaller lift isn't worth the traffic to measure. A larger lift is easy to detect with fewer users.

Sample size depends on four things: baseline rate, MDE, significance level (α, typically 0.05), and statistical power (1−β, typically 0.80). Here's the exact formula and a calculator:
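In symbols, the standard two-proportion sample size formula the calculator below implements is:

```latex
n_{\text{per group}} \;=\; \left\lceil \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl(p_1(1-p_1) + p_2(1-p_2)\bigr)}{(p_2 - p_1)^{2}} \right\rceil
```

where p1 is the baseline rate and p2 = p1 + MDE. The numerator grows with the variance of both proportions; the squared difference in the denominator is why halving your MDE roughly quadruples the required sample.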

import numpy as np
from scipy.stats import norm

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Minimum users per group to detect a difference p1 vs p2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    num = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    den = (p2 - p1) ** 2
    return int(np.ceil(num / den))

# Your scenario: baseline 3.1%, hoping for 3.3%
n = required_sample_size(0.031, 0.033)
print(f"To detect 3.1% -> 3.3%: {n:,} users/group ({2*n:,} total)")
print(f"At 1,000 users/day: {2*n / 1000:.0f} days to complete\n")

# Tradeoff table: how MDE affects required sample size
baseline = 0.05  # 5% baseline conversion rate
print(f"Baseline: {baseline:.0%} conversion rate")
print(f"{'MDE (abs)':>12} {'Relative':>10} {'N/group':>10} {'Total':>10}")
print("-" * 46)
for mde in [0.005, 0.01, 0.015, 0.02, 0.03, 0.05]:
    n = required_sample_size(baseline, baseline + mde)
    rel = mde / baseline
    print(f"{mde:>11.1%} {rel:>9.0%} {n:>10,} {2*n:>10,}")

# Output:
# MDE (abs)   Relative    N/group      Total
# ----------------------------------------------
#       0.5%       10%     31,231     62,462
#       1.0%       20%      8,155     16,310
#       1.5%       30%      3,778      7,556
#       2.0%       40%      2,210      4,420
#       3.0%       60%      1,057      2,114
#       5.0%      100%        432        864

The tradeoff is dramatic. Detecting a half-percentage-point lift requires 62,000+ total users. Detecting a five-point lift needs under 900. This is why choosing your MDE is the most important decision in experiment design — it determines whether your test runs for two days or two months.

Try the interactive explorer below to build intuition for how these four parameters interact. Pay attention to how the power curve flattens — there's a point of diminishing returns where adding more users barely increases your confidence.

Try It: Sample Size Explorer
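If the interactive explorer isn't available, here's a sketch of the power calculation behind it — the inverse of the sample-size formula, using the same normal approximation (the specific n values below are just illustrative):

```python
import numpy as np
from scipy.stats import norm

def power_at_n(p1, p2, n, alpha=0.05):
    """Approximate power of a two-proportion z-test with n users per group."""
    se = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p2 - p1) / se - z_alpha)

# Baseline 5%, hoping for 5.5% -- watch power climb, then flatten
for n in [5000, 15000, 31231, 50000, 80000]:
    print(f"n/group = {n:>6,}: power = {power_at_n(0.05, 0.055, n):.1%}")
```

Power climbs steeply at first, hits 80% right at the 31,231 users the earlier calculator reported, then flattens: each additional user buys less certainty.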

Sequential Testing — Don't Peek (Or Peek Correctly)

Here's a scenario every data scientist has lived through: you set up a beautiful A/B test, calculate the required sample size, and tell the product manager it'll take three weeks. On day four, the PM opens the dashboard, sees p=0.03, and declares victory. "Ship it!"

The problem: that p-value is a lie. When you peek at a test before it reaches full sample size and declare significance at the first crossing of 0.05, you inflate your false positive rate well past the promised 5%: to roughly 14% with five evenly spaced peeks, and close to 30% if you check daily for a month. Every additional peek is another lottery ticket; eventually you'll see significance by chance even when there's no real effect.

The solution is sequential testing. Instead of a fixed significance threshold, you use boundaries that start very strict and relax as data accumulates. O'Brien-Fleming is the gold standard: across a pre-planned schedule of interim analyses, it spends almost none of your error budget (α) on early looks and saves most of it for the final one.

import numpy as np
from scipy.stats import norm

def obrien_fleming_bounds(n_looks, alpha=0.05):
    """O'Brien-Fleming spending function boundaries."""
    z_final = norm.ppf(1 - alpha / 2)
    info_fracs = np.linspace(1 / n_looks, 1.0, n_looks)
    boundaries = z_final / np.sqrt(info_fracs)
    return info_fracs, boundaries

# Show how boundaries change at each interim look
fracs, bounds = obrien_fleming_bounds(5)
print("O'Brien-Fleming boundaries (5 interim looks):")
print(f"{'Look':>6} {'Data%':>8} {'Z-bound':>10} {'p-thresh':>12}")
print("-" * 40)
for i, (f, b) in enumerate(zip(fracs, bounds)):
    p_thresh = 2 * (1 - norm.cdf(b))
    print(f"{i+1:>6} {f:>7.0%} {b:>10.3f} {p_thresh:>12.6f}")

# Simulate: naive peeking vs sequential testing on A/A tests
np.random.seed(123)
n_sims, n_total = 10000, 5000
false_pos_naive, false_pos_seq = 0, 0

for _ in range(n_sims):
    # A/A test: BOTH groups from the same distribution (no real effect)
    a = np.random.binomial(1, 0.05, n_total)
    b = np.random.binomial(1, 0.05, n_total)

    naive_hit = seq_hit = False
    for look in range(5):
        end = n_total * (look + 1) // 5
        p_a, p_b = a[:end].mean(), b[:end].mean()
        p_pool = (a[:end].sum() + b[:end].sum()) / (2 * end)
        if p_pool == 0 or p_pool == 1:
            continue
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / end)
        z = abs(p_b - p_a) / se

        # Naive: standard alpha at every peek
        if not naive_hit and z > norm.ppf(0.975):
            false_pos_naive += 1
            naive_hit = True
        # Sequential: use O'Brien-Fleming boundary
        if not seq_hit and z > bounds[look]:
            false_pos_seq += 1
            seq_hit = True

print(f"\nA/A test false positive rate ({n_sims:,} simulations):")
print(f"  Naive peeking:   {false_pos_naive/n_sims:.1%}  (should be 5%!)")
print(f"  O'Brien-Fleming: {false_pos_seq/n_sims:.1%}  (correctly controlled)")

Look at that first interim boundary: at 20% of the data, you'd need a z-score above 4.38, which corresponds to a p-value of about 0.00001. Essentially, O'Brien-Fleming says "you can peek early, but only an overwhelming signal should stop you." By the final look, the boundary relaxes to exactly the standard 1.96.

The simulation drives it home: peeking 5 times at a test with no real effect inflates your false positive rate to around 14% with naive peeking. O'Brien-Fleming keeps it close to the promised 5%. The cost is a slightly larger required sample size (roughly 3–5% more), which is almost always worth the peace of mind.

Rule of thumb: plan 3–5 interim looks, space them evenly across your planned sample. This gives stakeholders regular updates while maintaining statistical validity.
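To see what extra looks cost, here's the boundary calculation for 3 versus 5 looks side by side — a sketch using the same classic boundary shape as the code above, not a full Lan-DeMets spending implementation:

```python
import numpy as np
from scipy.stats import norm

def obf_bounds(n_looks, alpha=0.05):
    """Classic O'Brien-Fleming shape: z_final / sqrt(information fraction)."""
    fracs = np.linspace(1 / n_looks, 1.0, n_looks)
    return norm.ppf(1 - alpha / 2) / np.sqrt(fracs)

for k in (3, 5):
    print(f"{k} looks: {np.round(obf_bounds(k), 2)}")
```

With 3 looks the first boundary sits near 3.39; with 5 it jumps to about 4.38. The more often you plan to peek, the more overwhelming the early evidence has to be.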

Multi-Armed Bandits — Beyond Simple A/B

Standard A/B testing has an uncomfortable property: it deliberately sends 50% of traffic to the worse model for the entire experiment. If Model B is clearly better after 500 requests, you still keep sending half your traffic to the inferior Model A for the remaining 9,500 requests. That's a lot of wasted conversions.

Multi-armed bandits solve this by adaptively shifting traffic toward the winner as evidence accumulates. Instead of a fixed split, you maintain a belief about each model's quality and use those beliefs to decide where to route each request. The most elegant variant is Thompson Sampling: maintain a Beta posterior for each model's conversion rate, sample from each posterior, and route traffic to whichever sample is higher.

If you've read the Bayesian inference or reinforcement learning posts, this connects two ideas: Beta posteriors from Bayesian updating and the exploration–exploitation tradeoff from bandits.

import numpy as np

np.random.seed(42)
n_requests = 10000
true_rates = {"Model A": 0.031, "Model B": 0.038}  # B is better

# ── Thompson Sampling ──
alpha_ts = {"Model A": 1, "Model B": 1}  # Beta(1,1) = uniform prior
beta_ts = {"Model A": 1, "Model B": 1}
ts_choices = []

for i in range(n_requests):
    # Sample from each model's posterior
    samples = {m: np.random.beta(alpha_ts[m], beta_ts[m]) for m in true_rates}
    choice = max(samples, key=samples.get)

    # Observe outcome
    reward = np.random.binomial(1, true_rates[choice])
    if reward:
        alpha_ts[choice] += 1
    else:
        beta_ts[choice] += 1
    ts_choices.append(choice)

# ── Fixed 50/50 split (standard A/B) ──
ab_choices = ["Model A" if i % 2 == 0 else "Model B" for i in range(n_requests)]

# Compare cumulative regret
best_rate = max(true_rates.values())
ts_regret = sum(best_rate - true_rates[c] for c in ts_choices)
ab_regret = sum(best_rate - true_rates[c] for c in ab_choices)

ts_b_pct = sum(1 for c in ts_choices if c == "Model B") / n_requests
print(f"After {n_requests:,} requests:")
print(f"  Thompson routed {ts_b_pct:.1%} traffic to Model B (the winner)")
print(f"  Cumulative regret -- A/B: {ab_regret:.1f} | Thompson: {ts_regret:.1f}")
print(f"  Regret reduction: {(1 - ts_regret / ab_regret):.0%}")

# When did Thompson figure it out? (>90% traffic to B)
window = 200
for i in range(window, n_requests):
    recent = ts_choices[i - window:i]
    if sum(1 for c in recent if c == "Model B") / window > 0.9:
        print(f"  Thompson routing >90% to B by request {i}")
        break

Thompson Sampling discovers the winner surprisingly fast. Within a few hundred requests, it figures out Model B is better and starts routing the vast majority of traffic there. The cumulative regret — the total "loss" from sending traffic to the inferior model — is dramatically lower than the fixed 50/50 split.

The tradeoff? Bandits provide weaker statistical guarantees. Because the traffic split is adaptive, calculating a clean p-value is harder. In practice, most teams use bandits for optimization (maximize reward during the experiment) and switch to fixed-horizon A/B tests when they need rigorous statistical proof for a launch decision. The demo below lets you see both approaches racing head-to-head.

Try It: Bandit vs A/B Race


ML-Specific Testing Challenges

Web A/B tests are (relatively) simple: change a button color, measure clicks, done. ML model A/B tests have unique complications that can invalidate your results if you're not careful.

Metric lag. Some outcomes take time to materialize. Fraud detection models might not know if a prediction was correct until a chargeback arrives weeks later. Ad conversion attribution can take hours or days. If you end your test before lagged outcomes arrive, you'll undercount the treatment effect.
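A toy simulation makes the undercounting concrete. Assume (hypothetically) that conversions attribute with an exponential delay averaging 7 days, and you read the dashboard 3 days after the last user enrolled:

```python
import numpy as np

np.random.seed(0)
n = 20000
converted = np.random.binomial(1, 0.05, n)   # which users eventually convert
delay_days = np.random.exponential(7, n)     # attribution delay per conversion

# Reading results only 3 days in: late-arriving conversions are invisible
observed = converted * (delay_days <= 3)
print(f"True conversion rate: {converted.mean():.2%}")
print(f"Observed at day 3:    {observed.mean():.2%}")
print(f"Undercount:           {1 - observed.sum() / converted.sum():.0%}")
```

With a 7-day mean delay, roughly two-thirds of conversions haven't landed by day 3. Any comparison made then is measuring a heavily censored metric — and the censoring can even differ between arms if the models shift who converts.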

Segment effects. Aggregate metrics can hide that Model B is worse for a significant user segment. Maybe the new recommendation model improves clicks by 3% overall, but it's 8% worse for mobile users — and you only discover this after launch when the mobile team files a bug. The solution is segment-level analysis with Bonferroni correction to control for multiple comparisons.

Guardrail metrics. Your primary metric might improve while secondary metrics degrade. The new model catches more fraud but adds 40ms of latency. The improved search ranking gets more clicks but increases page load time. Guardrail metrics are red lines that must not be crossed, regardless of how good the primary metric looks.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Simulated segment-level A/B results
segments = {
    "Desktop":  {"n": 3000, "rate_a": 0.052, "rate_b": 0.058,
                 "lat_a": 120, "lat_b": 125},
    "Mobile":   {"n": 4000, "rate_a": 0.038, "rate_b": 0.035,
                 "lat_a": 200, "lat_b": 240},
    "Tablet":   {"n": 1000, "rate_a": 0.045, "rate_b": 0.049,
                 "lat_a": 180, "lat_b": 185},
}

n_segments = len(segments)
alpha_bonf = 0.05 / n_segments  # Bonferroni correction

print(f"Segment analysis (Bonferroni alpha = {alpha_bonf:.4f})")
print(f"{'Segment':>10} {'A':>8} {'B':>8} {'Lift':>8} {'p-val':>10} {'Sig':>5}")
print("-" * 53)

for seg, d in segments.items():
    n = d["n"]
    obs_a = np.random.binomial(n, d["rate_a"]) / n
    obs_b = np.random.binomial(n, d["rate_b"]) / n
    p_pool = (obs_a + obs_b) / 2
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n) if p_pool > 0 else 1
    z = (obs_b - obs_a) / se
    p_val = 2 * (1 - norm.cdf(abs(z)))
    sig = "Yes" if p_val < alpha_bonf else "No"
    print(f"{seg:>10} {obs_a:>7.1%} {obs_b:>7.1%} {obs_b-obs_a:>+7.1%} "
          f"{p_val:>10.4f} {sig:>5}")

# ── Guardrail check ──
print("\n=== Guardrail Check (max 10% latency increase) ===")
for seg, d in segments.items():
    inc = d["lat_b"] - d["lat_a"]
    pct = inc / d["lat_a"]
    status = "FAIL" if pct > 0.10 else "Pass"
    print(f"  {seg}: {d['lat_a']}ms -> {d['lat_b']}ms "
          f"({pct:+.0%}) [{status}]")

This is exactly the scenario that sinks production deployments. The aggregate numbers look fine — Model B improves conversion overall. But the segment analysis reveals that Mobile users are converting worse under Model B. And the guardrail check catches a 20% latency increase on mobile (200ms → 240ms), which crosses the 10% threshold.

Without segment analysis, you ship Model B and wonder why your mobile metrics tanked the following week. Bonferroni correction ensures you're not flagging spurious segment effects — with three segments, each comparison uses α/3 = 0.0167 instead of 0.05, which is stricter but prevents false alarms from multiple testing.

Putting It All Together — A Complete Testing Pipeline

Now let's combine everything into a reusable testing pipeline. The ABTestRunner class handles the full lifecycle: deterministic user assignment (so a user always sees the same model), outcome recording, sequential analysis at interim looks, and guardrail monitoring.

The key design decisions: user assignment uses MD5 hashing for consistency (the same user_id always maps to the same group), and the runner pre-computes O'Brien-Fleming boundaries at initialization so each analyze() call knows its stopping threshold.

import hashlib
import numpy as np
from scipy.stats import norm
from collections import defaultdict

class ABTestRunner:
    def __init__(self, alpha=0.05, n_looks=5):
        self.alpha = alpha
        self.n_looks = n_looks
        self.outcomes = defaultdict(list)
        self.guardrails = {}
        # Pre-compute O'Brien-Fleming boundaries
        fracs = np.linspace(1 / n_looks, 1.0, n_looks)
        z_final = norm.ppf(1 - alpha / 2)
        self.boundaries = z_final / np.sqrt(fracs)

    def assign(self, user_id):
        """Deterministic assignment: same user always gets same group."""
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return "treatment" if h % 2 == 0 else "control"

    def record(self, user_id, metric, value):
        group = self.assign(user_id)
        self.outcomes[group].append({"metric": metric, "value": value})

    def add_guardrail(self, metric, max_regression):
        self.guardrails[metric] = max_regression

    def analyze(self, look_index, metric="primary"):
        ctrl = [o["value"] for o in self.outcomes["control"]
                if o["metric"] == metric]
        treat = [o["value"] for o in self.outcomes["treatment"]
                 if o["metric"] == metric]
        mean_c, mean_t = np.mean(ctrl), np.mean(treat)
        se = np.sqrt(np.var(ctrl)/len(ctrl) + np.var(treat)/len(treat))
        z = (mean_t - mean_c) / se if se > 0 else 0

        boundary = self.boundaries[min(look_index, self.n_looks - 1)]
        sig = abs(z) > boundary

        # Check guardrails
        guardrails_ok = True
        for gm, threshold in self.guardrails.items():
            gc = [o["value"] for o in self.outcomes["control"]
                  if o["metric"] == gm]
            gt = [o["value"] for o in self.outcomes["treatment"]
                  if o["metric"] == gm]
            if gc and gt and (np.mean(gt) - np.mean(gc)) > threshold:
                guardrails_ok = False

        return {"z": z, "boundary": boundary, "significant": sig,
                "lift": mean_t - mean_c, "guardrails_pass": guardrails_ok,
                "verdict": ("Ship it!" if sig and guardrails_ok
                            else "Blocked by guardrail" if sig
                            else "Keep testing")}

# ── Simulated fraud detection test ──
runner = ABTestRunner(alpha=0.05, n_looks=5)
runner.add_guardrail("latency", max_regression=10)

np.random.seed(42)
for i in range(5000):
    uid = f"user_{i}"
    group = runner.assign(uid)
    # Fraud v2 catches more fraud (+3.5pp) but is slower (+12ms avg)
    fraud_rate = 0.12 if group == "control" else 0.155
    base_lat = 50 if group == "control" else 62
    runner.record(uid, "fraud_caught", np.random.binomial(1, fraud_rate))
    runner.record(uid, "latency", np.random.exponential(base_lat))

result = runner.analyze(look_index=4, metric="fraud_caught")  # Final look
print(f"Z = {result['z']:.3f} (boundary: {result['boundary']:.3f})")
print(f"Significant: {result['significant']}")
print(f"Guardrails pass: {result['guardrails_pass']}")
print(f"Verdict: {result['verdict']}")

In this simulation, the new fraud model catches significantly more fraud (the z-statistic exceeds the boundary), but it adds enough latency to trip the guardrail. The verdict: "Blocked by guardrail." The model is effective but not ready to ship — the latency regression needs to be fixed first. That single guardrail check just saved you from a production incident where fraud detection improves but user experience degrades.

This pattern — sequential significance + guardrail gates — is how production ML teams at companies like Google, Netflix, and Uber actually run experiments. The class is simple enough to modify: add the segment-level breakdowns from the previous section, swap in Thompson Sampling from the bandit section, or connect it to your actual model serving infrastructure.
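One common modification is an unequal split for risky launches, ramping a new model to just 10% of traffic first. A standalone sketch (this `assign` is a hypothetical variant, not the method in the class above) only needs a different modulus:

```python
import hashlib

def assign(user_id: str, treatment_pct: int = 10) -> str:
    """Deterministic unequal split: ~treatment_pct% of users see the new model."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "treatment" if h % 100 < treatment_pct else "control"

# Same user always lands in the same bucket, and the split is close to 10/90
share = sum(assign(f"user_{i}") == "treatment" for i in range(10_000)) / 10_000
print(f"Treatment share over 10,000 users: {share:.1%}")
```

Note the statistical cost: for a fixed total traffic budget, a 50/50 split minimizes the variance of the estimated difference, so an unequal ramp needs more total users to reach the same power.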

Conclusion

A/B testing ML models isn't just "regular A/B testing with models." The statistical foundations are the same, but the practice diverges in important ways: you need proper sample size planning (not just "run it for a week"), sequential testing for valid early stopping, awareness of segment-level effects that aggregate metrics hide, and guardrail metrics that prevent shipping improvements that break something else.

The key takeaways:

- Offline metrics are necessary but not sufficient; only live traffic reveals how a model performs for real users on real business metrics.
- Pick your Minimum Detectable Effect before anything else; it drives sample size and test duration more than any other choice.
- Never declare significance at an early peek with a fixed 0.05 threshold; use sequential boundaries like O'Brien-Fleming if stakeholders need interim reads.
- Use bandits to minimize regret while optimizing; switch to fixed-horizon A/B tests when you need clean statistical evidence for a launch decision.
- Check segments (with multiple-comparison correction) and guardrail metrics before shipping; aggregate wins can hide segment losses and latency regressions.

The full pipeline — sample size calculation, sequential testing, adaptive routing, segment analysis, and guardrail monitoring — is what separates a rigorous experiment from "we ran it for a week and the numbers looked good." Your users deserve better than crossed fingers and cached dashboard screenshots.
