
Differential Privacy from Scratch

Why "Just Remove Names" Fails

In 2006, Netflix released 100 million "anonymized" movie ratings from 480,000 users, offering a $1 million prize to anyone who could improve their recommendation algorithm by 10%. Names were stripped. User IDs were replaced with random numbers. Surely that was enough?

It wasn't. Arvind Narayanan and Vitaly Shmatikov at UT Austin showed that by cross-referencing the "anonymous" Netflix ratings with public IMDb reviews, they could re-identify individual users. If you rated a handful of obscure movies on both platforms, you were uniquely fingerprintable. Netflix eventually settled a class-action lawsuit and cancelled a planned sequel competition.

The same year, AOL released 20 million "anonymized" search queries for research. A New York Times reporter identified User #4417749 as 62-year-old Thelma Arnold of Lilburn, Georgia, by piecing together her searches for "landscapers in Lilburn, GA" and "homes sold in shadow lake subdivision." The dataset was pulled within days, but the damage was done.

These disasters exposed a fundamental problem: syntactic anonymization doesn't work. Techniques like k-anonymity (making every record look like at least k-1 others) and l-diversity (ensuring diversity in sensitive attributes) all fail against an attacker who brings outside knowledge. If an attacker knows you're in the dataset and has even a few auxiliary data points about you, removing your name isn't enough.

What we need is not a better anonymization trick, but a mathematical definition of privacy — one that holds regardless of what the attacker already knows. That definition is differential privacy.

But before we get to the formal definition, let's look at an older technique that contains the seed of the right idea. Randomized response was invented by Stanley Warner in 1965 for surveying sensitive topics (drug use, tax evasion) where respondents might lie. The trick: inject randomness before collecting the answer, so no individual response can be trusted — but the aggregate statistics can still be recovered.

import numpy as np

def randomized_response(true_answer):
    """Classic randomized response (Warner 1965).
    Flip coin 1: heads (p=0.5) -> answer truthfully.
    Tails -> flip coin 2, report that coin instead."""
    if np.random.random() < 0.5:
        return true_answer       # truth
    else:
        return np.random.random() < 0.5  # random coin

# Simulate: 1000 people, 40% truly answer "Yes"
np.random.seed(42)
n = 1000
true_rate = 0.40
true_answers = np.random.random(n) < true_rate
noisy = [randomized_response(a) for a in true_answers]
observed_yes = sum(noisy) / n

# Recover the true rate from noisy responses:
# E[observed] = 0.5 * true_rate + 0.5 * 0.5
#             = 0.5 * true_rate + 0.25
estimated_rate = (observed_yes - 0.25) / 0.5

print(f"True rate:      {true_rate:.3f}")
print(f"Observed rate:  {observed_yes:.3f}")
print(f"Estimated rate: {estimated_rate:.3f}")
# Privacy: max ratio = P(Yes|true=Yes)/P(Yes|true=No)
#        = 0.75 / 0.25 = 3, so epsilon = ln(3) ~ 1.10

Each person has plausible deniability: even if they reported "Yes," there's a 25% chance that was just a random coin flip, not their true answer. Yet across 1000 people, the noise averages out and we recover a close estimate of the true rate. This is the core intuition behind differential privacy: calibrated noise protects individuals while preserving aggregate truth.

The Definition of Differential Privacy

The breakthrough came from Cynthia Dwork and her collaborators Frank McSherry, Kobi Nissim, and Adam Smith, who in 2006 formalized what "privacy" should actually mean for statistical databases. The definition is elegant and surprisingly simple:

A randomized mechanism M is ε-differentially private if for all neighboring datasets D1, D2 (differing in exactly one record) and all possible outputs S:

P[M(D1) ∈ S] ≤ e^ε · P[M(D2) ∈ S]

In plain English: whether or not your data is in the dataset, the output looks almost the same. The parameter ε (epsilon) is the privacy budget — smaller ε means stronger privacy. At ε = 0, the mechanism reveals nothing about any individual (but is also useless). At ε = ln(3) ≈ 1.1, you get the randomized response guarantee from above. In practice, ε ≤ 1 is considered strong privacy, ε ≤ 10 is moderate, and ε > 100 is effectively non-private.
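
The ε = ln(3) claim for randomized response can be checked empirically by simulating both conditional response probabilities and taking their ratio (a quick sketch; the trial count and simulation setup are mine, not from the original):

```python
import numpy as np

def randomized_response(true_answer, rng):
    # Coin 1: heads -> answer truthfully; tails -> report coin 2 instead
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

rng = np.random.default_rng(0)
trials = 200_000

# Estimate P(report Yes | truth = Yes) and P(report Yes | truth = No)
p_yes_given_yes = np.mean([randomized_response(True, rng) for _ in range(trials)])
p_yes_given_no = np.mean([randomized_response(False, rng) for _ in range(trials)])

ratio = p_yes_given_yes / p_yes_given_no
print(f"P(Yes|Yes) ~ {p_yes_given_yes:.3f}  (theory: 0.75)")
print(f"P(Yes|No)  ~ {p_yes_given_no:.3f}  (theory: 0.25)")
print(f"ratio ~ {ratio:.2f}, epsilon ~ {np.log(ratio):.2f}  (theory: ln 3 = {np.log(3):.2f})")
```

The worst-case output ratio is what the definition bounds: no single reported "Yes" shifts the odds about your true answer by more than a factor of e^ε ≈ 3.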

The definition has a beautiful consequence: it holds regardless of what the attacker already knows. Even if an adversary has the entire dataset except your record, the mechanism's output doesn't tell them much more about you. This is qualitatively different from anonymization, which crumbles against background knowledge.

There's also a relaxed version, (ε, δ)-differential privacy, which allows a small probability δ of catastrophic failure. The bound becomes:

P[M(D1) ∈ S] ≤ e^ε · P[M(D2) ∈ S] + δ

Typically δ is set to something tiny like 1/n² (much less than 1/n, where n is the dataset size), so the "catastrophic leak" event is vanishingly unlikely. This relaxation turns out to be crucial for making Gaussian noise work and for training neural networks with DP.

Two more properties make differential privacy a joy to work with. Post-processing immunity: any computation you apply to a DP output remains DP — you can't make it less private by staring at it harder. And group privacy: if a mechanism is ε-DP for one record, it's kε-DP for a group of k records, so privacy degrades gracefully.

The Laplace and Gaussian Mechanisms

The definition tells us what privacy means, but how do we actually achieve it? The key insight is the sensitivity framework. If we know how much a single record can change a query's output, we know how much noise to add.

The global sensitivity of a function f is the maximum it can change when one record is added or removed:

Δf = max |f(D1) - f(D2)| over all neighboring D1, D2

For a counting query ("how many records satisfy X?"), the sensitivity is 1 — one person joining or leaving changes the count by at most 1. For a mean query on values in [a, b] with n records, the sensitivity is (b - a) / n — one person can shift the average by at most this amount.
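
As a sanity check, sensitivity can be brute-forced on a tiny dataset by removing each record in turn and measuring how much the query answer moves (a sketch; the toy data and helper are my own, not from the original):

```python
import numpy as np

def empirical_sensitivity(f, data):
    """Max |f(D) - f(D with one record removed)| over all removals."""
    base = f(data)
    return max(abs(base - f(np.delete(data, i))) for i in range(len(data)))

data = np.array([30_000, 55_000, 72_000, 120_000, 200_000], dtype=float)

# Counting query: one person joining or leaving changes the count by <= 1
count_over_50k = lambda d: float(np.sum(d > 50_000))
print("count sensitivity:", empirical_sensitivity(count_over_50k, data))

# Mean of values clipped to [30k, 200k]: bounded per-record influence,
# close to the (b - a)/n back-of-envelope (exactly (b - a)/(n - 1) in
# the worst case under this remove-one convention)
mean_q = lambda d: float(np.mean(np.clip(d, 30_000, 200_000)))
print("mean sensitivity:", empirical_sensitivity(mean_q, data))
```

This brute-force check only works on toy data, but it is a useful way to catch a wrong sensitivity bound before calibrating noise to it.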

The Laplace mechanism adds noise drawn from a Laplace distribution with scale Δf / ε. The Laplace density has exactly the right exponential shape: shifting it by Δf changes the probability of any output by a factor of at most e^ε, which is precisely the pure ε-DP condition.

The Gaussian mechanism adds noise from N(0, σ²) where σ = Δf · √(2 ln(1.25/δ)) / ε (the standard calibration, valid for ε < 1). It achieves (ε, δ)-DP. The trade-off: Gaussian noise is analytically nicer (sums of Gaussians stay Gaussian, which matters for composition), but requires the relaxed definition.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace noise for pure epsilon-DP."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise

def gaussian_mechanism(true_value, sensitivity, epsilon, delta=1e-5):
    """Add Gaussian noise for (epsilon, delta)-DP."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    noise = np.random.normal(0, sigma)
    return true_value + noise

# Example: private mean salary query
np.random.seed(42)
salaries = np.random.lognormal(mean=11.0, sigma=0.4, size=100)
salaries = np.clip(salaries, 30000, 200000)
true_mean = np.mean(salaries)
sensitivity = (200000 - 30000) / 100  # max change from one record

print(f"True mean salary: ${true_mean:,.0f}\n")
for eps in [0.1, 1.0, 10.0]:
    lap = laplace_mechanism(true_mean, sensitivity, eps)
    gau = gaussian_mechanism(true_mean, sensitivity, eps)
    print(f"  epsilon={eps:>4}  Laplace: ${lap:,.0f}  Gaussian: ${gau:,.0f}")

At ε = 0.1 (strong privacy), the noise dwarfs the signal — estimates are off by tens of thousands of dollars. At ε = 10 (weak privacy), the noise is just a few hundred dollars. This is the fundamental privacy-utility tradeoff: more privacy costs more accuracy, always.

Composition Theorems — The Privacy Budget

Here's the catch that makes differential privacy genuinely hard in practice: every query you ask spends from your privacy budget, and budgets accumulate. Ask one ε-DP query, and you've spent ε. Ask two, and a smart attacker could combine the answers to learn more.

Basic (sequential) composition says the simplest thing: k queries at ε each cost kε total. This is tight — there exist mechanisms where the privacy loss really does grow linearly.

But in 2010, Cynthia Dwork, Guy Rothblum, and Salil Vadhan proved something better. Advanced composition shows that for k adaptive queries, each at ε-DP, the total privacy cost is approximately ε · √(2k · ln(1/δ)) (plus a lower-order term kε(e^ε − 1), small when ε is small), which grows as √k instead of k. For a thousand queries, that's roughly a 7x improvement over linear scaling — and the savings grow with more queries.

There's another powerful trick: subsampling amplification. If each query only touches a random fraction q of the data, the effective privacy cost drops to roughly O(q · ε). This is why minibatch SGD is naturally more private than full-batch gradient descent — each step only uses a random subset of examples.

import numpy as np

def basic_composition(epsilon, k):
    """Total privacy cost under basic composition: linear."""
    return k * epsilon

def advanced_composition(epsilon, k, delta=1e-5):
    """Dominant sqrt(k) term of advanced composition.
    From Dwork, Rothblum, Vadhan (2010); the full theorem
    adds a lower-order k * eps * (e^eps - 1) term."""
    return epsilon * np.sqrt(2 * k * np.log(1.0 / delta))

epsilon_per_query = 0.1

# Compare at k = 100 and k = 1000 queries
for k in [100, 1000]:
    basic = basic_composition(epsilon_per_query, k)
    advanced = advanced_composition(epsilon_per_query, k)
    print(f"After {k} queries at epsilon={epsilon_per_query} each:")
    print(f"  Basic composition:    epsilon_total = {basic:.2f}")
    print(f"  Advanced composition: epsilon_total = {advanced:.2f}")
    print(f"  Improvement: {basic / advanced:.1f}x tighter\n")

# Subsampling amplification: eps' = ln(1 + q*(e^eps - 1)),
# which is at most 2*q*eps for eps <= 1 (the loose bound used here)
q = 0.01  # use 1% of data per query
effective_eps = 2 * q * epsilon_per_query
print(f"With {q:.0%} subsampling:")
print(f"  Effective epsilon per query: {effective_eps:.4f}")
print(f"  100 queries (basic): {100 * effective_eps:.3f} total")

The composition story gets even tighter with modern tools like Rényi Differential Privacy (Mironov, 2017), which tracks privacy loss via Rényi divergences, and the moments accountant, which gives the tightest known bounds for iterative algorithms like SGD. These are the tools that make training billion-parameter models with DP feasible.
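
To make the RDP idea concrete: the Gaussian mechanism with sensitivity 1 and noise σ has Rényi divergence α/(2σ²) at every order α, RDP composes by simple addition across steps, and an RDP guarantee converts back to (ε, δ)-DP via ε = ρ(α) + ln(1/δ)/(α − 1). A minimal accountant sketch under those standard formulas (no subsampling, which the real moments accountant also handles; the σ and step count are illustrative):

```python
import numpy as np

def rdp_gaussian(alpha, sigma):
    """Renyi DP of order alpha for the Gaussian mechanism, sensitivity 1."""
    return alpha / (2 * sigma ** 2)

def rdp_to_eps(rdp, alpha, delta):
    """Convert an (alpha, rdp)-RDP guarantee to (eps, delta)-DP."""
    return rdp + np.log(1.0 / delta) / (alpha - 1)

def account(sigma, steps, delta=1e-5, alphas=range(2, 128)):
    """Compose additively over steps, then optimize over alpha."""
    return min(rdp_to_eps(steps * rdp_gaussian(a, sigma), a, delta)
               for a in alphas)

sigma, steps, delta = 5.0, 100, 1e-5
eps_rdp = account(sigma, steps, delta)

# Compare with per-step (eps, delta)-DP fed into advanced composition
eps_step = np.sqrt(2 * np.log(1.25 / delta)) / sigma
eps_adv = eps_step * np.sqrt(2 * steps * np.log(1 / delta))

print(f"RDP accountant:       eps = {eps_rdp:.2f}")
print(f"Advanced composition: eps = {eps_adv:.2f}")
```

For these parameters the RDP route is several times tighter than converting each step to (ε, δ) and composing, which is exactly why modern DP training loops track Rényi divergences internally.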

Try It: Privacy Budget Explorer

Issue queries against a synthetic salary database of 100 people. Each query costs ε = 0.5 from your total budget of ε = 5.0. Watch the budget deplete and the noise grow relative to each answer.


DP-SGD — Private Deep Learning

In 2016, Martín Abadi and colleagues at Google asked the big question: can we train a deep neural network while guaranteeing differential privacy for the training data? Their answer was DP-SGD (Differentially Private Stochastic Gradient Descent), which modifies standard SGD in three ways:

  1. Per-example gradient clipping: Compute the gradient for each training example individually, then clip each gradient vector to have L2 norm at most C. This bounds the influence any single example can have on the update — it's the "sensitivity" of one gradient step.
  2. Noise addition: After averaging the clipped gradients, add Gaussian noise calibrated to the clipping threshold: N(0, σ²C²I). The noise masks any individual's contribution.
  3. Privacy accounting: Track the cumulative privacy loss across all training steps using a moments accountant, which gives much tighter bounds than basic composition.

Why clipping? Without it, a single outlier could produce an enormous gradient that dominates the entire batch update. Clipping ensures bounded sensitivity, which is the prerequisite for adding calibrated noise.

import numpy as np

def dp_sgd_step(X_batch, y_batch, weights, clip_norm, noise_sigma, lr):
    """One step of DP-SGD with per-example clipping and noise."""
    clipped_grads = []
    for xi, yi in zip(X_batch, y_batch):
        # Per-example logistic regression gradient
        pred = 1.0 / (1.0 + np.exp(-xi @ weights))
        grad = (pred - yi) * xi

        # Clip to bound sensitivity
        grad_norm = np.linalg.norm(grad)
        if grad_norm > clip_norm:
            grad = grad * (clip_norm / grad_norm)
        clipped_grads.append(grad)

    # Average clipped gradients + calibrated Gaussian noise
    avg_grad = np.mean(clipped_grads, axis=0)
    noise = np.random.normal(0, noise_sigma * clip_norm / len(X_batch),
                             size=weights.shape)
    return weights - lr * (avg_grad + noise)

# Synthetic 2D classification data
np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + [2, 2],
               np.random.randn(50, 2) + [-2, -2]])
y = np.array([1]*50 + [0]*50, dtype=float)

# Train both models
w_sgd = np.zeros(2)
w_dp = np.zeros(2)
for epoch in range(50):
    idx = np.random.permutation(100)
    for i in range(0, 100, 20):
        batch = idx[i:i+20]
        # Standard SGD
        preds = 1.0 / (1.0 + np.exp(-X[batch] @ w_sgd))
        grad = X[batch].T @ (preds - y[batch]) / 20
        w_sgd -= 0.1 * grad
        # DP-SGD (clip=1.0, sigma=1.0)
        w_dp = dp_sgd_step(X[batch], y[batch], w_dp,
                           clip_norm=1.0, noise_sigma=1.0, lr=0.1)

acc_sgd = np.mean((1/(1+np.exp(-X @ w_sgd)) > 0.5) == y)
acc_dp = np.mean((1/(1+np.exp(-X @ w_dp)) > 0.5) == y)
print(f"Standard SGD accuracy: {acc_sgd:.1%}")
print(f"DP-SGD accuracy:       {acc_dp:.1%}")

In practice, DP-SGD requires tuning. Larger batch sizes improve the privacy-utility tradeoff (each example's contribution is diluted). Pre-training on public data and then fine-tuning with DP-SGD reduces the number of private training steps needed. And some architectural choices matter: group normalization works better than batch normalization under DP, since batch norm leaks information across examples.
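
The batch-size effect is simple arithmetic: the Gaussian noise added to the averaged gradient has standard deviation σC/B per coordinate, so growing the batch shrinks each example's share of the noise (a tiny sketch with illustrative values):

```python
# DP-SGD adds noise with std sigma*C/B to the averaged gradient,
# so a bigger batch B dilutes the noise relative to the signal.
sigma, C = 1.0, 1.0
noise_stds = {B: sigma * C / B for B in [32, 256, 2048]}
for B, s in noise_stds.items():
    print(f"batch {B:>5}: noise std on averaged gradient = {s:.5f}")
# Caveat: a bigger batch also raises the sampling rate q, so the full
# privacy accounting (subsampled RDP) is subtler than this ratio alone.
```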

Try It: DP-SGD Training Visualizer

Compare standard SGD (left) vs DP-SGD (right) training a logistic classifier on two clusters. DP-SGD clips per-example gradients and adds noise — watch the decision boundary wobble.


The Exponential Mechanism — Private Selection

The Laplace and Gaussian mechanisms work for numerical outputs. But what if you need to select something — the best model from a set of candidates, or the most common category in a dataset? You can't just add noise to a category label.

The exponential mechanism (McSherry & Talwar, 2007) solves this elegantly. Given a set of candidates, each with a quality score, it samples a candidate with probability proportional to exp(ε · score / (2Δ)), where Δ is the sensitivity of the scoring function. Higher-quality candidates are exponentially more likely to be chosen, but the randomness provides DP.

This is equivalent to "report noisy max" — add Gumbel noise to each candidate's score and return the argmax. Applications include private hyperparameter tuning, private feature selection, and private model selection.

import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity):
    """Sample a candidate proportional to exp(eps * score / (2*sensitivity)).
    Returns the index of the selected candidate."""
    log_probs = (epsilon * np.array(scores)) / (2 * sensitivity)
    log_probs -= np.max(log_probs)  # numerical stability
    probs = np.exp(log_probs)
    probs /= probs.sum()
    return np.random.choice(len(scores), p=probs)

# Private mode: find the most common color without revealing individuals
np.random.seed(42)
candidates = ["red", "blue", "green", "yellow"]
true_counts = [5, 4, 3, 2]  # 14 people surveyed

# Run 1000 trials to see the selection distribution
results = np.zeros(4)
for _ in range(1000):
    idx = exponential_mechanism(true_counts, epsilon=1.0, sensitivity=1)
    results[idx] += 1

print("True counts:", dict(zip(candidates, true_counts)))
print(f"\nSelection frequency (epsilon=1.0, 1000 trials):")
for i, c in enumerate(candidates):
    print(f"  {c:>6}: {results[i]/10:.1f}%")

At ε = 1.0, "red" (the true mode with 5 votes out of 14) is selected around 45% of the time, while "yellow" (2 votes) is chosen about 10% of the time. The mechanism finds the right answer most often while providing plausible deniability. Increase ε to 10.0 and the mechanism almost always returns "red." Decrease it to 0.1 and the selection becomes nearly uniform — maximum privacy, minimum utility.
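
The "report noisy max" equivalence mentioned above is easy to verify by simulation: add Gumbel noise with scale 2Δ/ε to each raw score, take the argmax, and the selection frequencies match the exponential mechanism's exact probabilities (a quick sketch; the trial count is my choice):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([5.0, 4.0, 3.0, 2.0])
epsilon, sensitivity = 1.0, 1.0
trials = 100_000

# Exponential mechanism probabilities, computed exactly
logits = epsilon * scores / (2 * sensitivity)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Gumbel-max trick: Gumbel(0, 2*sensitivity/epsilon) noise, then argmax
noisy = scores + rng.gumbel(0, 2 * sensitivity / epsilon,
                            size=(trials, len(scores)))
freq = np.bincount(noisy.argmax(axis=1), minlength=len(scores)) / trials

for i, (p, f) in enumerate(zip(probs, freq)):
    print(f"candidate {i}: exponential {p:.3f}  gumbel-max {f:.3f}")
```

The Gumbel-max form is often preferred in implementations because it needs only independent per-candidate noise and an argmax, with no normalization step.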

Differential Privacy in Practice

Differential privacy has moved from theory papers to production systems at massive scale: Google's RAPPOR collected Chrome usage statistics under local DP, Apple ships local DP for keyboard and emoji telemetry on iOS and macOS, and the U.S. Census Bureau protected the 2020 Census with a DP-based disclosure avoidance system.

For your own projects, several open-source libraries implement DP correctly: Opacus (PyTorch DP-SGD), TensorFlow Privacy, Google's DP library (C++/Java/Go), and OpenDP (a community-driven framework). These handle the tricky parts — privacy accounting, noise calibration, composition tracking — so you don't have to roll your own.

The honest tension: DP with truly small ε (strong privacy) often significantly degrades model quality. The field is actively working on tighter composition bounds, better pre-training strategies, and architectural innovations to close this gap. Differential privacy isn't free, but it's the only framework with provable guarantees — and that matters when you're handling people's data.

References & Further Reading

See also: Information Theory from Scratch (KL divergence connects directly to ε), Bayesian Inference from Scratch (posterior attacks that DP defends against), Implicit Bias of Gradient Descent and Optimizers from Scratch (the SGD mechanics that DP-SGD modifies), Regularization from Scratch (clipping as implicit regularization), Loss Functions from Scratch (quality scores in the exponential mechanism), and Monte Carlo Methods from Scratch (noise sampling foundations).