Uncertainty Quantification from Scratch: Teaching Neural Networks to Say "I Don't Know"
The Overconfidence Problem
You train a classifier that reports 99% confidence on a test image—then discover it's completely wrong. The problem isn't the prediction; it's that the model never learned to express doubt.
Softmax outputs look like probabilities—they're positive and sum to one—but they're just normalized logits. And ReLU networks produce piecewise linear logit surfaces, which means confidence grows unboundedly as you move away from the decision boundary. A model that's 99.99% confident about an input it has never seen before isn't certain. It's delusional.
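You can see this failure mode in a few lines of numpy. The logit direction below is made up, but the mechanism is exactly the one just described: inside one linear region of a ReLU network, logits grow roughly linearly with distance from the boundary, and softmax turns linear growth into saturating confidence:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logit direction: class 0 wins, the others lose symmetrically.
# Moving distance t along a ray scales the logits roughly linearly.
direction = np.array([1.0, -0.5, -0.5])
for t in [1, 5, 20, 100]:
    conf = softmax(t * direction).max()
    print(f"distance t={t:>3}: max softmax confidence = {conf:.6f}")
```

At t=1 the model is appropriately unsure; by t=100 it is essentially certain, even though nothing about the input became more familiar.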
This isn't a minor nuisance. In 2017, Chuan Guo and colleagues showed that modern deep networks are systematically overconfident. The deeper and wider the network, the worse the calibration. That's a problem when your model is making decisions about medical diagnoses, autonomous driving, or financial risk. A wrong prediction is bad. A confidently wrong prediction is dangerous.
What we actually need is a model that's confident when it's right, uncertain when it's wrong, and says "I don't know" when it genuinely doesn't. In this post, we'll build that capability from scratch. We'll start by measuring overconfidence with calibration, then decompose uncertainty into two types, implement three practical methods (MC Dropout, Deep Ensembles, Temperature Scaling), and finish with a system that knows when to abstain. If you've read our posts on softmax temperature or adversarial examples, you've already seen what happens when models are overconfident. Now let's fix it.
Calibration — When Confidence Matches Reality
A model is calibrated if its confidence matches its actual accuracy. Formally: when the model says "80% confident," it should be right about 80% of the time. That sounds obvious, but most modern deep networks violate it badly.
The standard tool for diagnosing calibration is a reliability diagram. You bin all predictions by their confidence level (e.g., 0–10%, 10–20%, ..., 90–100%), then plot the actual accuracy within each bin. Perfect calibration produces a diagonal line—50% confident predictions are right 50% of the time, 90% confident predictions are right 90% of the time. Bars below the diagonal mean the model is overconfident. Bars above mean it's underconfident.
We condense this into a single number: Expected Calibration Error (ECE), the weighted average gap between accuracy and confidence across all bins. Smaller ECE = better calibration. There's also Maximum Calibration Error (MCE)—the worst-case bin—which matters for safety-critical applications.
Historically, simpler models like logistic regression—and even the shallow neural networks of the 1990s—tended to be naturally well-calibrated. Modern deep nets lost this property as architectures grew deeper, adopted batch normalization, and trained with massive overparameterization. Cross-entropy loss is a proper scoring rule (it is minimized at the true probabilities), but finite data and excessive capacity push calibration off course. Let's see this in action.
import numpy as np

np.random.seed(42)
n_samples = 500
n_classes = 3

# Simulate an overconfident 3-class classifier
# Generate logits that are TOO sharp (model is too sure)
true_labels = np.random.randint(0, n_classes, n_samples)
logits = np.random.randn(n_samples, n_classes) * 0.8
# Boost the correct class logit (model is decent but not perfect)
for i in range(n_samples):
    logits[i, true_labels[i]] += 2.0
# Overconfidence: scale logits up so softmax is too peaky
logits *= 2.5

# Softmax
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)
predictions = probs.argmax(axis=1)
confidences = probs.max(axis=1)

# Build reliability diagram with 10 bins
n_bins = 10
bin_boundaries = np.linspace(0, 1, n_bins + 1)
bin_accs, bin_confs, bin_counts = [], [], []
for b in range(n_bins):
    lo, hi = bin_boundaries[b], bin_boundaries[b + 1]
    mask = (confidences > lo) & (confidences <= hi)
    if mask.sum() == 0:
        bin_accs.append(0); bin_confs.append((lo + hi) / 2); bin_counts.append(0)
        continue
    bin_accs.append((predictions[mask] == true_labels[mask]).mean())
    bin_confs.append(confidences[mask].mean())
    bin_counts.append(mask.sum())

# Expected Calibration Error
total = sum(bin_counts)
ece = sum(n * abs(a - c) for a, c, n in zip(bin_accs, bin_confs, bin_counts)) / total
mce = max(abs(a - c) for a, c, n in zip(bin_accs, bin_confs, bin_counts) if n > 0)

print("Reliability Diagram (10 bins):")
print(f"{'Bin':>8} {'Count':>6} {'Confidence':>11} {'Accuracy':>9} {'Gap':>8}")
for b in range(n_bins):
    if bin_counts[b] == 0: continue
    gap = bin_confs[b] - bin_accs[b]
    print(f" {b+1:>4} {bin_counts[b]:>4} {bin_confs[b]:>7.3f} {bin_accs[b]:>7.3f} {gap:>+7.3f}")
print(f"\nECE = {ece:.4f} -- predictions are {ece*100:.1f}% more confident than accurate")
print(f"MCE = {mce:.4f} -- worst bin is off by {mce*100:.1f}%")
You'll see that nearly every bin has confidence exceeding accuracy—the telltale signature of overconfidence. The ECE tells us how far off we are overall. Now let's understand what kinds of uncertainty we're dealing with.
Two Types of Uncertainty
Not all uncertainty is created equal. Machine learning distinguishes two fundamentally different kinds:
Aleatoric uncertainty (data uncertainty) is the irreducible noise inherent in the problem. Overlapping class boundaries, label noise, measurement error—even with infinite data, this uncertainty stays. If two classes truly overlap in feature space, no model can perfectly separate them there.
Epistemic uncertainty (model uncertainty) comes from limited data. The model doesn't know because it hasn't seen enough examples in this region of input space. Collect more data, and this uncertainty shrinks. This is the kind we can actually do something about.
The distinction matters for practical decisions. High aleatoric uncertainty means the task is inherently ambiguous here—don't expect a better answer. High epistemic uncertainty means you should either collect more data in this region or refuse to make a prediction. The law of total variance gives us the formal decomposition:
Var(Y|X) = E_θ[Var(Y|X,θ)] + Var_θ(E[Y|X,θ])
First term = aleatoric (average noise within each model). Second term = epistemic (disagreement across models).
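A quick numeric check of this identity, using a toy mixture of two Bernoulli "models" (the two probabilities are invented for illustration):

```python
import numpy as np

# Two equally likely "models" predict P(Y=1|X): one says 0.9, one says 0.3.
p = np.array([0.9, 0.3])

total = p.mean() * (1 - p.mean())       # Var(Y|X) for the Bernoulli mixture
aleatoric = (p * (1 - p)).mean()        # E[Var(Y|X,theta)]: avg noise per model
epistemic = p.var()                     # Var(E[Y|X,theta]): model disagreement

print(f"total={total:.4f}  aleatoric={aleatoric:.4f}  epistemic={epistemic:.4f}")
# total = 0.24, aleatoric = 0.15, epistemic = 0.09 -- the terms add up exactly
```

Had both models said 0.6, the total variance would be the same 0.24, but entirely aleatoric: the models would agree, yet each would be individually unsure.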
Let's make this concrete. We'll train a small ensemble on two-moon data and measure both types of uncertainty across the input space.
import numpy as np

def make_moons(n=300, noise=0.15, seed=42):
    rng = np.random.RandomState(seed)
    n_each = n // 2
    # Top moon
    theta1 = np.linspace(0, np.pi, n_each)
    x1 = np.column_stack([np.cos(theta1), np.sin(theta1)])
    # Bottom moon (shifted)
    theta2 = np.linspace(0, np.pi, n_each)
    x2 = np.column_stack([1 - np.cos(theta2), 1 - np.sin(theta2) - 0.5])
    X = np.vstack([x1, x2]) + rng.randn(n, 2) * noise
    y = np.array([0]*n_each + [1]*n_each)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_mlp(X, y, seed, hidden=16, epochs=300, lr=0.05):
    """Train a tiny 2-layer MLP for binary classification."""
    rng = np.random.RandomState(seed)
    W1 = rng.randn(2, hidden) * 0.5
    b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, 1) * 0.5
    b2 = np.zeros(1)
    for _ in range(epochs):
        # Forward
        h = np.maximum(0, X @ W1 + b1)  # ReLU
        logit = (h @ W2 + b2).ravel()
        pred = sigmoid(logit)
        # Backward (binary cross-entropy gradient)
        grad_logit = (pred - y) / len(y)
        grad_W2 = h.T @ grad_logit.reshape(-1, 1)
        grad_b2 = grad_logit.sum(keepdims=True)
        grad_h = grad_logit.reshape(-1, 1) * W2.T
        grad_h[h <= 0] = 0  # ReLU gradient
        grad_W1 = X.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        W2 -= lr * grad_W2; b2 -= lr * grad_b2
    return W1, b1, W2, b2

def predict_mlp(X, W1, b1, W2, b2):
    h = np.maximum(0, X @ W1 + b1)
    return sigmoid((h @ W2 + b2).ravel())

# Train 5 MLPs with different seeds
X, y = make_moons(300, noise=0.15, seed=42)
models = [train_mlp(X, y, seed=s) for s in range(5)]

# Probe 3 diagnostic points
test_points = np.array([[0.5, 0.7], [0.5, 0.25], [-1.5, 0.5]])
labels = ["Near class-0 center", "Overlap region", "Far from data (OOD)"]
for pt, label in zip(test_points, labels):
    preds = np.array([predict_mlp(pt.reshape(1,-1), *m)[0] for m in models])
    mean_pred = preds.mean()
    epistemic = preds.var()  # Disagreement across models
    entropies = np.array([-(p*np.log(p+1e-10) + (1-p)*np.log(1-p+1e-10)) for p in preds])
    aleatoric = entropies.mean()  # Average individual uncertainty
    print(f"{label}: mean_p={mean_pred:.3f}, epistemic={epistemic:.4f}, aleatoric={aleatoric:.3f}")
    print(f"  Individual predictions: {[f'{p:.3f}' for p in preds]}")
Near a class center, all five models agree—low epistemic, low aleatoric. In the overlap region, individual models are uncertain (high aleatoric) but they roughly agree on how uncertain to be (moderate epistemic). Far from the data, models disagree wildly—high epistemic uncertainty signals "I've never seen anything like this."
MC Dropout — Bayesian Inference for Free
Here's one of the cleverest ideas in modern deep learning. In 2016, Yarin Gal and Zoubin Ghahramani showed that training with dropout can be interpreted as approximate variational Bayesian inference—and that sampling dropout masks at test time draws predictions from the resulting approximate posterior. You've already trained with dropout (our regularization post covers the mechanics). The twist: keep dropout on during inference.
Run T forward passes, each with a different random dropout mask. Each pass randomly zeros a different subset of neurons, effectively creating a different "model." The distribution of T predictions captures your uncertainty:
- Mean of T predictions → better point estimate (implicit ensemble averaging)
- Variance across T predictions → epistemic uncertainty (model disagreement)
- Entropy of the mean → total uncertainty
- Mean entropy of individuals → aleatoric uncertainty
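The entropy bookkeeping in the last two bullets fits in a few lines. This is a toy sketch with four invented pass outputs for a single input; the gap between total and aleatoric uncertainty is the epistemic term (the mutual information, sometimes called the BALD score):

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Hypothetical outputs of T=4 stochastic forward passes for one input
preds = np.array([0.95, 0.10, 0.85, 0.20])

total = binary_entropy(preds.mean())      # entropy of the mean prediction
aleatoric = binary_entropy(preds).mean()  # mean entropy of individual passes
epistemic = total - aleatoric             # mutual information (BALD score)

print(f"total={total:.3f}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}")
```

Because the passes disagree strongly, the mean sits near 0.5 (high total entropy) while each pass is individually fairly sure, leaving a large epistemic gap.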
The beauty: zero extra training cost. Any model trained with dropout already supports this. You just pay T× inference cost at test time. Let's implement it from scratch.
import numpy as np

def make_moons(n=300, noise=0.15, seed=42):
    rng = np.random.RandomState(seed)
    n_each = n // 2
    theta1 = np.linspace(0, np.pi, n_each)
    x1 = np.column_stack([np.cos(theta1), np.sin(theta1)])
    theta2 = np.linspace(0, np.pi, n_each)
    x2 = np.column_stack([1 - np.cos(theta2), 1 - np.sin(theta2) - 0.5])
    X = np.vstack([x1, x2]) + rng.randn(n, 2) * noise
    y = np.array([0]*n_each + [1]*n_each)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_dropout_mlp(X, y, seed, hidden=32, epochs=400, lr=0.05, drop_p=0.2):
    """Train MLP with dropout on hidden layers."""
    rng = np.random.RandomState(seed)
    W1 = rng.randn(2, hidden) * 0.5; b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, hidden) * 0.5; b2 = np.zeros(hidden)
    W3 = rng.randn(hidden, 1) * 0.5; b3 = np.zeros(1)
    for ep in range(epochs):
        # Forward with dropout
        h1 = np.maximum(0, X @ W1 + b1)
        mask1 = (rng.rand(*h1.shape) > drop_p) / (1 - drop_p)
        h1d = h1 * mask1
        h2 = np.maximum(0, h1d @ W2 + b2)
        mask2 = (rng.rand(*h2.shape) > drop_p) / (1 - drop_p)
        h2d = h2 * mask2
        logit = (h2d @ W3 + b3).ravel()
        pred = sigmoid(logit)
        # Backward
        g = (pred - y) / len(y)
        gW3 = h2d.T @ g.reshape(-1,1); gb3 = g.sum(keepdims=True)
        gh2 = g.reshape(-1,1) * W3.T; gh2[h2 <= 0] = 0; gh2 *= mask2
        gW2 = h1d.T @ gh2; gb2 = gh2.sum(axis=0)
        gh1 = gh2 @ W2.T; gh1[h1 <= 0] = 0; gh1 *= mask1
        gW1 = X.T @ gh1; gb1 = gh1.sum(axis=0)
        W1 -= lr*gW1; b1 -= lr*gb1; W2 -= lr*gW2; b2 -= lr*gb2
        W3 -= lr*gW3; b3 -= lr*gb3
    return W1, b1, W2, b2, W3, b3

def mc_predict(X, W1, b1, W2, b2, W3, b3, T=30, drop_p=0.2, seed=0):
    """Run T stochastic forward passes with dropout at test time."""
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(T):
        h1 = np.maximum(0, X @ W1 + b1)
        h1 *= (rng.rand(*h1.shape) > drop_p) / (1 - drop_p)
        h2 = np.maximum(0, h1 @ W2 + b2)
        h2 *= (rng.rand(*h2.shape) > drop_p) / (1 - drop_p)
        preds.append(sigmoid((h2 @ W3 + b3).ravel()))
    return np.array(preds)  # shape: (T, n_samples)

X, y = make_moons(300, noise=0.15, seed=42)
W1, b1, W2, b2, W3, b3 = train_dropout_mlp(X, y, seed=7)

test_points = np.array([[0.5, 0.7], [0.5, 0.25], [-1.5, 0.5]])
labels = ["In-distribution", "Overlap region", "Far OOD"]
all_preds = mc_predict(test_points, W1, b1, W2, b2, W3, b3, T=30, seed=99)
for i, label in enumerate(labels):
    pt_preds = all_preds[:, i]
    mean_p = pt_preds.mean()
    epistemic = pt_preds.var()
    ents = -(pt_preds*np.log(pt_preds+1e-10) + (1-pt_preds)*np.log(1-pt_preds+1e-10))
    aleatoric = ents.mean()
    votes = (pt_preds > 0.5).sum()
    agree = max(votes, len(pt_preds) - votes)  # size of the majority vote
    print(f"{label}: {agree}/30 passes agree, "
          f"epistemic={epistemic:.4f}, aleatoric={aleatoric:.3f}")
For in-distribution points, almost all 30 passes agree. In the overlap region, passes are split—reflecting genuine ambiguity. Far from the data, passes diverge wildly—the model is telling us it doesn't have enough information. That's exactly the signal we want.
Deep Ensembles — Uncertainty Through Disagreement
If MC Dropout gives us approximate Bayesian uncertainty "for free," Deep Ensembles take the more direct approach: just train M separate networks and see where they disagree. Lakshminarayanan, Pritzel, and Blundell (2017) showed this embarrassingly simple method often produces better uncertainty estimates than MC Dropout.
The key insight is that different random initializations cause networks to converge to different modes of the loss landscape. Each member learns a slightly different function, and their disagreement is a natural measure of epistemic uncertainty. Unlike MC Dropout, each ensemble member operates at full capacity—no neurons are zeroed out.
The tradeoffs: M× training cost, M× storage, M× inference. But M=5 captures most of the benefit, and members train in parallel. In practice, ensembles consistently outperform MC Dropout on calibration benchmarks.
import numpy as np

def make_moons(n=300, noise=0.15, seed=42):
    rng = np.random.RandomState(seed)
    n_each = n // 2
    theta1 = np.linspace(0, np.pi, n_each)
    x1 = np.column_stack([np.cos(theta1), np.sin(theta1)])
    theta2 = np.linspace(0, np.pi, n_each)
    x2 = np.column_stack([1 - np.cos(theta2), 1 - np.sin(theta2) - 0.5])
    X = np.vstack([x1, x2]) + rng.randn(n, 2) * noise
    y = np.array([0]*n_each + [1]*n_each)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_mlp(X, y, seed, hidden=24, epochs=350, lr=0.05):
    rng = np.random.RandomState(seed)
    W1 = rng.randn(2, hidden)*0.5; b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, 1)*0.5; b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)
        logit = (h @ W2 + b2).ravel(); pred = sigmoid(logit)
        g = (pred - y) / len(y)
        gW2 = h.T @ g.reshape(-1,1); gb2 = g.sum(keepdims=True)
        gh = g.reshape(-1,1) * W2.T; gh[h <= 0] = 0
        W1 -= lr*(X.T @ gh); b1 -= lr*gh.sum(axis=0)
        W2 -= lr*gW2; b2 -= lr*gb2
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return sigmoid((np.maximum(0, X @ W1 + b1) @ W2 + b2).ravel())

# Train ensemble of M=5 networks
X, y = make_moons(300, noise=0.15, seed=42)
ensemble = [train_mlp(X, y, seed=s*17+3) for s in range(5)]

# Evaluate on diagnostic points
test_points = np.array([[0.5, 0.7], [0.5, 0.25], [-1.5, 0.5]])
labels = ["In-distribution", "Overlap region", "Far OOD"]
for pt, label in zip(test_points, labels):
    preds = np.array([predict(pt.reshape(1,-1), *m)[0] for m in ensemble])
    mean_p = preds.mean()
    epistemic = preds.var()
    print(f"{label}: mean_p={mean_p:.3f}, ensemble_var={epistemic:.4f}")
    print(f"  Members: {[f'{p:.3f}' for p in preds]}")

# Compare accuracy: single model vs ensemble
X_test, y_test = make_moons(200, noise=0.15, seed=99)
single_acc = ((predict(X_test, *ensemble[0]) > 0.5) == y_test).mean()
ens_preds = np.mean([predict(X_test, *m) for m in ensemble], axis=0)
ens_acc = ((ens_preds > 0.5) == y_test).mean()
print(f"\nSingle model accuracy: {single_acc:.1%}")
print(f"Ensemble accuracy: {ens_acc:.1%}")
The ensemble typically beats any single member by 1–3% on accuracy—averaging over diverse members is a natural form of error correction. But the real win is the uncertainty estimates: ensemble variance is highest exactly where we want it, far from the training data.
Temperature Scaling — The Simplest Calibration Fix
Sometimes you don't need a new model. You just need to fix the one you have. Temperature scaling is the simplest post-hoc calibration method: learn a single scalar T that divides logits before softmax. That's it. One parameter.
When T > 1, predictions are softened (less confident). When T < 1, predictions are sharpened (more confident). You find the optimal T by minimizing negative log-likelihood on a validation set. Guo et al. (2017) showed this often beats more complex calibration methods despite having just one degree of freedom.
Why does it work? Modern networks have good ranking—they know which class is most likely—but bad calibration—they're too confident about it. Temperature preserves the ranking (argmax doesn't change when you divide all logits by the same T) while correcting the confidence magnitudes. If you've read our softmax temperature post, you've seen how T controls sharpness. Here we learn the right T for calibration.
import numpy as np

np.random.seed(42)
n_samples = 500
n_classes = 3

# Same overconfident model as before
true_labels = np.random.randint(0, n_classes, n_samples)
logits = np.random.randn(n_samples, n_classes) * 0.8
for i in range(n_samples):
    logits[i, true_labels[i]] += 2.0
logits *= 2.5  # Overconfident scaling

def softmax_with_temp(logits, T):
    scaled = logits / T
    exp_l = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    return exp_l / exp_l.sum(axis=1, keepdims=True)

def compute_ece(probs, labels, n_bins=10):
    confs = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for b in range(n_bins):
        mask = (confs > bins[b]) & (confs <= bins[b+1])
        if mask.sum() == 0: continue
        acc = (preds[mask] == labels[mask]).mean()
        conf = confs[mask].mean()
        ece += mask.sum() * abs(acc - conf)
    return ece / len(labels)

def nll(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-10).mean()

# Split into calibration and test sets
val_logits, test_logits = logits[:250], logits[250:]
val_labels, test_labels = true_labels[:250], true_labels[250:]

# Grid search for optimal temperature
best_T, best_nll = 1.0, float('inf')
for T in np.arange(0.5, 5.01, 0.05):
    val_probs = softmax_with_temp(val_logits, T)
    loss = nll(val_probs, val_labels)
    if loss < best_nll:
        best_T, best_nll = T, loss

# Compare before and after
before = softmax_with_temp(test_logits, T=1.0)
after = softmax_with_temp(test_logits, T=best_T)
acc_before = (before.argmax(axis=1) == test_labels).mean()
acc_after = (after.argmax(axis=1) == test_labels).mean()
print(f"Optimal temperature: T = {best_T:.2f}")
print(f"\nBefore (T=1.0): ECE = {compute_ece(before, test_labels):.4f}, Accuracy = {acc_before:.1%}")
print(f"After (T={best_T:.1f}): ECE = {compute_ece(after, test_labels):.4f}, Accuracy = {acc_after:.1%}")
print(f"\nAccuracy is IDENTICAL -- temperature only changes confidence, not predictions")
The magic: ECE drops dramatically while accuracy stays exactly the same. Temperature scaling doesn't change what the model predicts, only how confident it is. One number, huge improvement. For anything beyond this, look into Platt scaling (logistic regression on logits) or vector scaling (per-class temperatures), but temperature scaling is remarkably hard to beat.
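For the curious, here is a rough sketch of Platt scaling in the binary case—fitting sigmoid(a·z + b) to validation logits by gradient descent on the NLL. The synthetic data and training loop are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# Hypothetical binary validation set: weakly informative logits,
# then scaled up so the raw model looks far too sure of itself.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, 400)
logits = ((labels * 2 - 1) + rng.randn(400)) * 4.0

# Platt scaling: learn scalar a and intercept b by minimizing the NLL.
# Note a generalizes temperature (a = 1/T) and b adds a class-prior shift.
a, b, lr = 1.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(a * logits + b)
    g = (p - labels) / len(labels)   # dNLL / d(a*z + b)
    a -= lr * (g * logits).sum()
    b -= lr * g.sum()

print(f"Learned a={a:.3f}, b={b:.3f}  (a < 1 softens, like T > 1)")
```

On this overconfident toy data the fitted slope comes out well below 1, softening the probabilities exactly as dividing by a temperature T = 1/a would.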
Putting It All Together — When to Trust Your Model
We now have calibrated uncertainty estimates. The practical question is: how do we use them? The most powerful application is selective prediction—letting the model abstain when it's not confident enough.
Set a confidence threshold: classify if confidence exceeds it, otherwise say "I don't know" and defer to a human or a fallback system. This creates a risk-coverage tradeoff: higher thresholds mean fewer predictions (lower coverage) but more accurate ones (lower risk). A risk-coverage curve plots error rate against the fraction of data you're willing to classify. Good uncertainty estimates produce steep curves—error drops fast as you exclude the most uncertain predictions.
Beyond selective prediction, epistemic uncertainty enables active learning (label the data the model is most uncertain about) and out-of-distribution detection (flag inputs that don't resemble training data). These connect directly to our anomaly detection post—but here the model's own uncertainty serves as the anomaly score.
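The active-learning side is almost a one-liner once you have per-model predictions. A toy sketch with simulated ensemble outputs on an unlabeled pool (the array shape and numbers are invented):

```python
import numpy as np

# Hypothetical ensemble predictions on an unlabeled pool: (M models, N points)
rng = np.random.RandomState(0)
pool_preds = rng.beta(2, 2, size=(5, 12))

# Acquisition: query the points the ensemble disagrees on most
# (ensemble variance as the epistemic proxy, as in the sections above)
scores = pool_preds.var(axis=0)
query = np.argsort(scores)[::-1][:3]   # top-3 most informative pool indices
print("Label these pool indices next:", query)
```

The same `scores` array doubles as an out-of-distribution score: thresholding it flags inputs where the models have no consensus.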
import numpy as np

def make_moons(n, noise, seed):
    rng = np.random.RandomState(seed)
    n_each = n // 2
    theta1 = np.linspace(0, np.pi, n_each)
    x1 = np.column_stack([np.cos(theta1), np.sin(theta1)])
    theta2 = np.linspace(0, np.pi, n_each)
    x2 = np.column_stack([1 - np.cos(theta2), 1 - np.sin(theta2) - 0.5])
    X = np.vstack([x1, x2]) + rng.randn(n, 2) * noise
    y = np.array([0]*n_each + [1]*n_each)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_mlp(X, y, seed, hidden=24, epochs=350, lr=0.05):
    rng = np.random.RandomState(seed)
    W1 = rng.randn(2, hidden)*0.5; b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, 1)*0.5; b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)
        logit = (h @ W2 + b2).ravel(); pred = sigmoid(logit)
        g = (pred - y) / len(y)
        gW2 = h.T @ g.reshape(-1,1); gb2 = g.sum(keepdims=True)
        gh = g.reshape(-1,1) * W2.T; gh[h <= 0] = 0
        W1 -= lr*(X.T @ gh); b1 -= lr*gh.sum(axis=0)
        W2 -= lr*gW2; b2 -= lr*gb2
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return sigmoid((np.maximum(0, X @ W1 + b1) @ W2 + b2).ravel())

# Train ensemble and evaluate selective prediction
X_train, y_train = make_moons(300, 0.15, seed=42)
X_test, y_test = make_moons(200, 0.15, seed=99)
ensemble = [train_mlp(X_train, y_train, seed=s*17+3) for s in range(5)]

# Get ensemble predictions and confidence
ens_preds = np.array([predict(X_test, *m) for m in ensemble])  # (5, 200)
mean_pred = ens_preds.mean(axis=0)
confidence = np.maximum(mean_pred, 1 - mean_pred)  # Confidence in predicted class
predicted = (mean_pred > 0.5).astype(int)

# Risk-coverage curve
print(f"{'Coverage':>10} {'Classified':>11} {'Error Rate':>11}")
print("-" * 36)
for cov_target in [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3]:
    threshold = np.percentile(confidence, (1 - cov_target) * 100)
    mask = confidence >= threshold
    n_classified = mask.sum()
    if n_classified == 0: continue
    errors = (predicted[mask] != y_test[mask]).sum()
    error_rate = errors / n_classified
    print(f" {cov_target:>6.0%} {n_classified:>4} {error_rate:>6.1%}")
print("\nBy refusing the 50% most uncertain predictions, error drops significantly.")
The risk-coverage curve reveals the value of good uncertainty: as we restrict to more confident predictions, error plummets. A model that "knows what it doesn't know" can dramatically improve reliability by simply staying quiet when it's unsure. In safety-critical applications, this tradeoff is everything.
Conclusion
Neural networks are powerful but overconfident. We've built a toolkit to diagnose and fix this:
- Calibration (reliability diagrams, ECE) tells us how overconfident a model is
- MC Dropout gives us Bayesian uncertainty for free from any dropout model
- Deep Ensembles capture uncertainty through model disagreement
- Temperature Scaling fixes calibration with a single parameter
The epistemic/aleatoric decomposition tells us whether uncertainty comes from limited data (fixable) or inherent ambiguity (not fixable). And selective prediction turns uncertainty estimates into a practical tool: abstain when uncertain, classify when confident.
Gaussian processes have built-in uncertainty—they know what they don't know by construction. Neural networks need to be taught. But with the methods in this post, they can learn. And a model that says "I don't know" when it should is far more trustworthy than one that's always certain.
Try It: Uncertainty Explorer
Try It: Calibration Workshop
References & Further Reading
- Guo et al. — On Calibration of Modern Neural Networks (ICML 2017) — The seminal paper showing modern deep networks are systematically miscalibrated
- Gal & Ghahramani — Dropout as a Bayesian Approximation (ICML 2016) — The theoretical foundation for MC Dropout
- Lakshminarayanan, Pritzel & Blundell — Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles (NeurIPS 2017) — Deep Ensembles: embarrassingly simple, surprisingly effective
- Kendall & Gal — What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? (NeurIPS 2017) — Formalizes the epistemic/aleatoric decomposition
- Ovadia et al. — Can You Trust Your Model's Uncertainty? (NeurIPS 2019) — Evaluating UQ methods under distribution shift
- Minderer et al. — Revisiting the Calibration of Modern Neural Networks (NeurIPS 2021) — Updated calibration survey with newer architectures