Conformal Prediction from Scratch: Distribution-Free Uncertainty with Guaranteed Coverage
The Promise of Guaranteed Uncertainty
What if your model could say "the answer is one of {cat, dog}" and you could mathematically prove that the true label is in that set at least 90% of the time — regardless of the model architecture, training procedure, or data distribution?
In our uncertainty quantification post, we built MC Dropout, Deep Ensembles, and temperature scaling — all useful tools for estimating how much to trust a model's predictions. But they're fundamentally heuristic: an MC Dropout variance of 0.3 tells you the model is somewhat uncertain, but it doesn't come with a mathematical guarantee about what that means for prediction correctness.
Conformal prediction flips the question entirely. Instead of asking "how confident is my model?", it asks: "what set of predictions is guaranteed to contain the true answer?"
The answer relies on a single assumption — exchangeability — which is weaker than the i.i.d. assumption most of machine learning already makes. If calibration data and test data are exchangeable (their joint distribution doesn't change under permutation), then conformal prediction produces prediction sets with exact finite-sample coverage guarantees. Not asymptotic. Not approximate. Exact.
Let's build it from scratch.
Nonconformity Scores — Measuring How Strange a Prediction Is
The core building block of conformal prediction is the nonconformity score: a number that measures how "surprising" a particular label would be for a given input, according to your model.
For classification, the most natural score is:
s(x, y) = 1 - f̂(x)_y
That's one minus the model's predicted probability for the true class. If the model assigns 95% probability to the correct label, the score is 0.05 (low — not surprising). If it assigns 10%, the score is 0.90 (high — very surprising).
For regression, the standard score is the absolute residual:
s(x, y) = |y - f̂(x)|
Here's the crucial insight: the choice of score function is entirely up to you. Any function that maps (input, label) to a real number works. Better scores produce smaller prediction sets, but the coverage guarantee holds regardless of score quality. You could use a random number generator as your score and still get valid coverage — you'd just get enormous, useless prediction sets.
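That claim is easy to check empirically. Below is a small sketch (synthetic setup of my own, not from any real model) where the "score" is pure noise: coverage still lands near the 90% target, but the prediction sets are nearly the whole label space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 1000, 5, 0.1

# "Score" is pure noise, uniform on [0, 1] -- it ignores the model entirely
cal_scores = rng.uniform(size=n_cal)
q_hat = np.quantile(cal_scores, np.ceil((1 - alpha) * (n_cal + 1)) / n_cal,
                    method="higher")

# Test scores drawn from the same distribution (exchangeable with calibration);
# treat column 0 as the true label's score, the rest as the other labels'
n_test = 5000
test_scores = rng.uniform(size=(n_test, n_classes))
true_label_scores = test_scores[:, 0]

coverage = np.mean(true_label_scores <= q_hat)        # true label in the set?
avg_size = np.mean((test_scores <= q_hat).sum(axis=1))
print(f"coverage: {coverage:.3f}")                    # ~0.90 despite the useless score
print(f"avg set size: {avg_size:.2f} of {n_classes}") # ~4.5: nearly everything
```

Validity survives; efficiency doesn't. That separation is the whole design space of conformal prediction: the guarantee is free, small sets are what you earn with a good score.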
Let's compute nonconformity scores on calibration data and visualize what they look like:
import numpy as np
# Simulate a 5-class classifier's softmax outputs on 500 calibration points
np.random.seed(42)
n_cal = 500
n_classes = 5
true_labels = np.random.randint(0, n_classes, n_cal)
# Generate fake softmax outputs (model is decent but not perfect)
logits = np.random.randn(n_cal, n_classes)
logits[np.arange(n_cal), true_labels] += 2.0 # boost true class
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Nonconformity scores: 1 - P(true class)
scores = 1 - softmax[np.arange(n_cal), true_labels]
print(f"Score statistics:")
print(f" Mean: {scores.mean():.3f}")
print(f" Median: {np.median(scores):.3f}")
print(f" 90th percentile: {np.quantile(scores, 0.9):.3f}")
print(f" 95th percentile: {np.quantile(scores, 0.95):.3f}")
# The quantile threshold for alpha=0.1 (90% coverage)
alpha = 0.1
q_hat = np.quantile(scores, np.ceil((1 - alpha) * (n_cal + 1)) / n_cal,
                    method="higher")
print(f"\nConformal quantile (alpha={alpha}): {q_hat:.3f}")
print(f" Include label y if 1 - f(x)_y <= {q_hat:.3f}")
print(f" Equivalently: include if f(x)_y >= {1 - q_hat:.3f}")
The score distribution tells us how well-calibrated our model is. A peak near zero means the model is usually confident and correct. A heavy right tail means there are cases where the model badly misses. The quantile threshold q̂ is where conformal prediction draws its line: include any label whose score falls below this threshold.
Split Conformal Prediction — The Core Algorithm
Here's the full algorithm in four steps:
- Split your data into a training set and a calibration set
- Train your model on the training set (any model, any procedure)
- Score the calibration set: compute s_i = s(x_i, y_i) for each calibration point
- Predict: for a new test point x, include label y in the prediction set C(x) if s(x, y) ≤ q̂, where q̂ is the ⌈(1-α)(n+1)⌉/n quantile of the calibration scores
That "+1" in the quantile formula is easy to miss but mathematically critical — it accounts for the test point itself joining the pool of exchangeable scores.
The One-Line Proof
Why does this work? The proof is almost comically short:
By exchangeability, the rank of the test score s_{n+1} among all n+1 scores {s_1, ..., s_n, s_{n+1}} is uniformly distributed over {1, 2, ..., n+1}. Since q̂ is the ⌈(1-α)(n+1)⌉-th smallest calibration score, P(s_{n+1} ≤ q̂) ≥ ⌈(1-α)(n+1)⌉/(n+1) ≥ 1-α.
That's it. No distributional assumptions. No model assumptions. No asymptotic arguments. The test point's score has no special position among the n+1 exchangeable scores, so it's equally likely to land at any rank. The quantile threshold ensures at least a (1-α) fraction of ranks fall below it.
This is a finite-sample guarantee: it works for n=50 just as well as n=50,000. The only price you pay is "wasting" calibration data that could have been used for training.
Let's implement it:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Generate synthetic 2D classification data (3 classes)
np.random.seed(42)
centers = np.array([[0, 2], [-2, -1], [2, -1]])
X, y = [], []
for c in range(3):
    pts = centers[c] + np.random.randn(300, 2) * 0.8
    X.append(pts)
    y.append(np.full(300, c))
X, y = np.vstack(X), np.concatenate(y)
# Split: 60% train, 20% calibration, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_cal, X_test, y_cal, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
# Train any classifier
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
# Step 1: Compute calibration scores
cal_probs = clf.predict_proba(X_cal)
cal_scores = 1 - cal_probs[np.arange(len(y_cal)), y_cal]
# Step 2: Find conformal quantile
alpha = 0.10
n = len(cal_scores)
q_hat = np.quantile(cal_scores, np.ceil((1 - alpha) * (n + 1)) / n,
                    method="higher")
# Step 3: Build prediction sets for test data
test_probs = clf.predict_proba(X_test)
prediction_sets = []
for i in range(len(X_test)):
    pset = [c for c in range(3) if 1 - test_probs[i, c] <= q_hat]
    prediction_sets.append(pset)
# Verify coverage
covered = sum(y_test[i] in prediction_sets[i] for i in range(len(y_test)))
print(f"Target coverage: {1-alpha:.0%}")
print(f"Empirical coverage: {covered/len(y_test):.1%}")
print(f"Average set size: {np.mean([len(s) for s in prediction_sets]):.2f}")
print(f"Singleton sets: {sum(len(s)==1 for s in prediction_sets)}/{len(y_test)}")
Run this and you'll see the empirical coverage land at or near 90% — and the guarantee says it will be at least 90% on average over draws of the calibration set, so individual runs fluctuate by a few points around the target but stay anchored to it. Change the model, change the data distribution, change the random seed — the guarantee holds. That's the power of conformal prediction.
Adaptive Prediction Sets — Smarter Sets for Harder Examples
The basic conformal method above has a limitation: it uses the same threshold for every test point. A clearly recognizable digit "7" gets the same threshold as an ambiguous scrawl that could be a "3", "5", or "8". This produces uniform-sized prediction sets, which is wasteful.
Adaptive Prediction Sets (APS), introduced by Romano, Sesia, and Candès in 2020, fix this with a clever score function:
- Sort classes by decreasing predicted probability: π_1, π_2, ...
- If the true class y sits at position k in this ordering (y = π_k), the score is the cumulative probability through position k: s(x, y) = f̂(x)_{π_1} + f̂(x)_{π_2} + ... + f̂(x)_{π_k}
For prediction, you include classes in decreasing probability order until the cumulative probability exceeds the conformal quantile. Easy examples where the model is very confident get singleton sets. Ambiguous examples where probability is spread across multiple classes get larger sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Same data setup as before
np.random.seed(42)
centers = np.array([[0, 2], [-2, -1], [2, -1]])
X = np.vstack([c + np.random.randn(300, 2) * 0.8 for c in centers])
y = np.concatenate([np.full(300, c) for c in range(3)])
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4)
X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
alpha = 0.10
# APS calibration scores
cal_probs = clf.predict_proba(X_cal)
cal_scores_aps = []
for i in range(len(y_cal)):
    sorted_idx = np.argsort(-cal_probs[i])  # descending probability
    cumsum = np.cumsum(cal_probs[i, sorted_idx])
    rank = np.where(sorted_idx == y_cal[i])[0][0]
    cal_scores_aps.append(cumsum[rank])  # cumulative prob up to true class
cal_scores_aps = np.array(cal_scores_aps)
n = len(cal_scores_aps)
q_aps = np.quantile(cal_scores_aps, np.ceil((1-alpha)*(n+1))/n, method="higher")
# APS prediction sets
te_probs = clf.predict_proba(X_te)
sets_aps = []
for i in range(len(X_te)):
    sorted_idx = np.argsort(-te_probs[i])
    cumsum = np.cumsum(te_probs[i, sorted_idx])
    # Include classes until the cumulative probability reaches the threshold
    k = np.searchsorted(cumsum, q_aps) + 1
    sets_aps.append(sorted_idx[:k].tolist())
covered = sum(y_te[i] in sets_aps[i] for i in range(len(y_te)))
print(f"APS coverage: {covered/len(y_te):.1%}")
print(f"APS avg set size: {np.mean([len(s) for s in sets_aps]):.2f}")
sizes = [len(s) for s in sets_aps]
print(f" Singletons: {sizes.count(1)}, Pairs: {sizes.count(2)}, Triples: {sizes.count(3)}")
APS typically produces smaller average set sizes than naive conformal while maintaining the same coverage guarantee. The sets adapt to difficulty: easy examples get tight, informative sets; hard examples get appropriately larger ones.
Conformal Regression — Prediction Intervals with Coverage Guarantees
Conformal prediction isn't just for classification. For regression, instead of prediction sets, we produce prediction intervals.
The basic approach uses absolute residuals as scores: s(x, y) = |y - f̂(x)|. The conformal quantile q̂ then gives constant-width intervals: [f̂(x) - q̂, f̂(x) + q̂]. Simple, but the intervals have the same width everywhere — which is wrong for heteroscedastic data where noise varies across the input space.
Conformalized Quantile Regression (CQR), from Romano, Patterson, and Candès (2019), solves this by combining quantile regression with conformal calibration:
- Train a quantile regression model that predicts lower and upper bounds: [q̂_lo(x), q̂_hi(x)]
- Score: s(x, y) = max(q̂_lo(x) - y, y - q̂_hi(x)) — how far y falls outside the predicted interval (negative when y is inside)
- The conformal quantile q adjusts the interval: [q̂_lo(x) - q, q̂_hi(x) + q]
The result: intervals that are narrow in low-noise regions and wide in high-noise regions, all with guaranteed coverage.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
# Heteroscedastic data: noise grows with x
np.random.seed(42)
X = np.random.uniform(0, 5, 600).reshape(-1, 1)
noise_scale = 0.3 + 0.5 * X.ravel() # noise increases with x
y = np.sin(X.ravel()) * 2 + np.random.randn(600) * noise_scale
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4)
X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5)
alpha = 0.10
# --- Method 1: Basic conformal regression (constant width) ---
reg = GradientBoostingRegressor(n_estimators=100, random_state=0)
reg.fit(X_tr, y_tr)
cal_residuals = np.abs(y_cal - reg.predict(X_cal))
n = len(cal_residuals)
q_basic = np.quantile(cal_residuals, np.ceil((1-alpha)*(n+1))/n, method="higher")
# --- Method 2: CQR (adaptive width) ---
lo_model = GradientBoostingRegressor(n_estimators=100, loss="quantile",
                                     alpha=alpha/2, random_state=0)
hi_model = GradientBoostingRegressor(n_estimators=100, loss="quantile",
                                     alpha=1-alpha/2, random_state=0)
lo_model.fit(X_tr, y_tr)
hi_model.fit(X_tr, y_tr)
cal_lo = lo_model.predict(X_cal)
cal_hi = hi_model.predict(X_cal)
cqr_scores = np.maximum(cal_lo - y_cal, y_cal - cal_hi)
q_cqr = np.quantile(cqr_scores, np.ceil((1-alpha)*(n+1))/n, method="higher")
# Evaluate both on test set
te_pred = reg.predict(X_te)
te_lo = lo_model.predict(X_te) - q_cqr
te_hi = hi_model.predict(X_te) + q_cqr
basic_cov = np.mean((y_te >= te_pred - q_basic) & (y_te <= te_pred + q_basic))
cqr_cov = np.mean((y_te >= te_lo) & (y_te <= te_hi))
basic_width = 2 * q_basic
cqr_width = np.mean(te_hi - te_lo)
print(f"Basic conformal: coverage={basic_cov:.1%}, avg width={basic_width:.2f}")
print(f"CQR: coverage={cqr_cov:.1%}, avg width={cqr_width:.2f}")
Both methods achieve the target coverage. But CQR's intervals are narrow on the left (where noise is low) and wide on the right (where noise is high) — exactly matching the data's heteroscedasticity. Compare this to Gaussian process uncertainty bands, which require specific kernel assumptions. Conformal intervals make no distributional assumptions at all.
The Conditional Coverage Challenge
Before you declare conformal prediction the solution to all uncertainty problems, there's an important caveat: the guarantee is marginal, not conditional.
Conformal prediction guarantees P(Y ∈ C(X)) ≥ 1-α on average over X. But for a specific subgroup — say, a rare class or a particular region of input space — the coverage might be much lower. An overall 90% coverage could mean 99% on common classes and 50% on rare ones.
This is not a bug in conformal prediction — it's a fundamental impossibility. Lei and Wasserman (2014) proved that exact conditional coverage P(Y ∈ C(X) | X = x) ≥ 1-α for all x is impossible without distributional assumptions. Any method that claims to achieve it must be making hidden assumptions or producing trivially large prediction sets.
Mondrian conformal prediction offers a practical middle ground: run separate conformal procedures for each group. This guarantees per-group marginal coverage at the cost of needing enough calibration data in each group.
import numpy as np
from sklearn.neural_network import MLPClassifier
# Imbalanced 3-class problem: 70% / 20% / 10% split
np.random.seed(42)
n_samples = [700, 200, 100]
centers = np.array([[0, 2], [-2, -1], [2, -1]])
X = np.vstack([centers[c] + np.random.randn(n, 2) * 0.7 for c, n in enumerate(n_samples)])
y = np.concatenate([np.full(n, c) for c, n in enumerate(n_samples)])
# Shuffle and split
perm = np.random.permutation(len(y))
X, y = X[perm], y[perm]
X_tr, X_cal, X_te = X[:600], X[600:800], X[800:]
y_tr, y_cal, y_te = y[:600], y[600:800], y[800:]
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
alpha = 0.10
# Standard conformal: one threshold for all classes
cal_probs = clf.predict_proba(X_cal)
cal_scores = 1 - cal_probs[np.arange(len(y_cal)), y_cal]
n = len(cal_scores)
q_global = np.quantile(cal_scores, np.ceil((1-alpha)*(n+1))/n, method="higher")
# Mondrian conformal: separate threshold per class
q_per_class = {}
for c in range(3):
    mask = y_cal == c
    sc = cal_scores[mask]
    nc = len(sc)
    q_per_class[c] = np.quantile(sc, np.ceil((1-alpha)*(nc+1))/nc, method="higher")
# Evaluate per-class coverage
te_probs = clf.predict_proba(X_te)
for c in range(3):
    mask = y_te == c
    if mask.sum() == 0:
        continue
    # Standard: same threshold for every class
    std_cov = np.mean(1 - te_probs[mask, c] <= q_global)
    # Mondrian: per-class threshold
    mon_cov = np.mean(1 - te_probs[mask, c] <= q_per_class[c])
    print(f"Class {c} (n={mask.sum()}): Standard={std_cov:.0%}, Mondrian={mon_cov:.0%}")
You'll typically see that the standard method over-covers common classes and under-covers rare classes. Mondrian conformal restores per-class coverage guarantees — at the cost of needing sufficient calibration data in each class.
Putting It All Together — Conformal Prediction in Practice
Here's when to reach for conformal prediction:
- Any pre-trained model — conformal prediction is a post-hoc wrapper. No retraining required.
- Classification or regression — prediction sets or prediction intervals, your choice.
- You need reliability guarantees — medical diagnosis, autonomous driving, financial risk.
- Calibration set size — rule of thumb: n ≥ 1/α is the bare minimum for the quantile to be well-defined (about 10 points for 90% coverage, 100 for 99%); in practice a few hundred points give far more stable set sizes.
Conformal prediction also pairs naturally with selective prediction: if |C(x)| is large (many labels in the prediction set), the model is confused and you should defer to a human. If |C(x)| = 1, the model is confident and you can trust it.
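Here's a minimal sketch of that deferral policy, reusing the split-conformal recipe from earlier (synthetic three-class data; the auto-accept rule and thresholds are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic three-class data, same flavor as the earlier examples
np.random.seed(0)
centers = np.array([[0, 2], [-2, -1], [2, -1]])
X = np.vstack([c + np.random.randn(400, 2) * 0.9 for c in centers])
y = np.concatenate([np.full(400, c) for c in range(3)])
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)

# Standard split-conformal calibration
alpha = 0.10
cal_scores = 1 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
n = len(cal_scores)
q = np.quantile(cal_scores, np.ceil((1 - alpha) * (n + 1)) / n, method="higher")

# Set size per test point, without materializing the sets
te_probs = clf.predict_proba(X_te)
set_sizes = (1 - te_probs <= q).sum(axis=1)

# Deferral policy: auto-accept singleton sets, send everything else to a human
auto = set_sizes == 1
auto_acc = np.mean(te_probs[auto].argmax(axis=1) == y_te[auto])
print(f"auto-answered: {auto.mean():.0%}, accuracy on those: {auto_acc:.1%}")
print(f"deferred to human: {(~auto).mean():.0%}")
```

On data like this, accuracy on the auto-accepted points is typically well above overall accuracy — the deferred minority is exactly where the model is confused.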
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Build a full conformal pipeline with selective prediction
np.random.seed(42)
centers = np.array([[0, 2], [-2, -1], [2, -1]])
X = np.vstack([c + np.random.randn(400, 2) * 0.9 for c in centers])
y = np.concatenate([np.full(400, c) for c in range(3)])
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4)
X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
# Sweep alpha to see the coverage-efficiency tradeoff
for alpha in [0.01, 0.05, 0.10, 0.20, 0.50]:
    cal_probs = clf.predict_proba(X_cal)
    scores = 1 - cal_probs[np.arange(len(y_cal)), y_cal]
    n = len(scores)
    q = np.quantile(scores, np.ceil((1-alpha)*(n+1))/n, method="higher")
    te_probs = clf.predict_proba(X_te)
    sets = []
    for i in range(len(X_te)):
        pset = [c for c in range(3) if 1 - te_probs[i, c] <= q]
        sets.append(pset)
    coverage = np.mean([y_te[i] in sets[i] for i in range(len(y_te))])
    avg_size = np.mean([len(s) for s in sets])
    singletons = sum(len(s) == 1 for s in sets) / len(sets)
    print(f"alpha={alpha:.2f}: coverage={coverage:.0%}, "
          f"avg_size={avg_size:.2f}, singletons={singletons:.0%}")
As α increases (less coverage demanded), prediction sets shrink and more become singletons. As α decreases (more coverage demanded), sets grow to maintain the guarantee. This is the fundamental tradeoff: tighter coverage demands larger, less informative prediction sets.
Conclusion
Conformal prediction gives you something rare in machine learning: a guarantee. Not an asymptotic result that requires infinite data. Not a heuristic that works "most of the time." A finite-sample, distribution-free, model-agnostic guarantee that your prediction sets contain the true answer with probability at least 1-α.
The toolbox we built:
- Nonconformity scores — the bridge between any model and conformal guarantees
- Split conformal prediction — the core algorithm, with a one-line proof
- Adaptive Prediction Sets (APS) — smarter sets that adapt to difficulty
- Conformalized Quantile Regression — adaptive-width intervals for regression
- Mondrian conformal — per-group coverage for fairness
And the intellectual honesty: exact conditional coverage is impossible without distributional assumptions. Conformal prediction isn't magic — it's mathematics. The guarantee is marginal, prediction sets can be large when the model is bad, and you sacrifice calibration data to get the guarantee.
But when a doctor needs to know "which diagnoses should I consider?", or an autonomous vehicle needs to know "am I confident enough to proceed?", conformal prediction turns any black-box model into one that knows what it doesn't know — with a proof, not a prayer.
References & Further Reading
- Vovk, Gammerman & Shafer — Algorithmic Learning in a Random World (2nd ed.) — the original conformal prediction book, now updated
- Lei, G'Sell, Rinaldo, Tibshirani & Wasserman 2018 — Distribution-Free Predictive Inference for Regression — foundational theory for conformal regression
- Romano, Patterson & Candès 2019 — Conformalized Quantile Regression — adaptive-width conformal intervals
- Romano, Sesia & Candès 2020 — Classification with Valid and Adaptive Coverage — the APS method
- Angelopoulos & Bates 2021 — A Gentle Introduction to Conformal Prediction — the definitive tutorial, highly recommended
- Barber, Candès, Ramdas & Tibshirani 2023 — Conformal Prediction Beyond Exchangeability — extending guarantees to distribution shift
- Gibbs & Candès 2021 — Adaptive Conformal Inference Under Distribution Shift — online conformal prediction with shifting data