
Double Descent from Scratch: Why Bigger Models Generalize Better (And Classical Statistics Got It Wrong)

The Classical Wisdom — And Where It Breaks

Every machine learning textbook teaches the same lesson: don't make your model too big. Increase model complexity and test error follows a U-shaped curve — first decreasing as the model gains enough capacity to capture the true pattern, then increasing as it starts memorizing noise. This is the bias-variance tradeoff, and it's the central organizing principle of classical statistical learning theory.

There's just one problem: modern deep learning violates it spectacularly.

Models like GPT-4 have hundreds of billions of parameters, far more capacity than classical theory says any training set could justify, and they generalize beautifully. ResNets with 10x more parameters than training examples outperform their smaller counterparts. The textbook says this shouldn't work. And yet it does.

In 2019, Mikhail Belkin and colleagues published a paper that resolved this paradox. They showed that the classical U-shaped curve is only half the story. If you keep increasing model complexity past the point where the model can perfectly fit the training data — the interpolation threshold — test error starts decreasing again. The full picture has two descents: the classical one, and a modern one in the overparameterized regime.

This is double descent, and it explains why scaling up works. Let's build the whole story from scratch.

We'll start with the simplest possible setting — polynomial regression — where double descent is analytically tractable and visually obvious. Then we'll show it appears in neural networks too, explain why the second descent happens, and connect it to the scaling laws that govern modern AI.

Seeing It with Polynomial Regression

The cleanest way to observe double descent is with polynomial regression. We generate n=30 noisy data points from a sine wave, then fit polynomials of increasing degree. Degree 1 is a line (underfitting). Degree 5 captures the curve. Degree 29 (= n-1, giving 30 coefficients, one per data point) can fit every point exactly — this is the interpolation threshold. What happens when we go further?

import numpy as np

np.random.seed(42)
n = 30
x_train = np.linspace(0, 2 * np.pi, n)
y_train = np.sin(x_train) + np.random.randn(n) * 0.3
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)

degrees = list(range(1, 25)) + list(range(25, 120, 3))
train_errors, test_errors = [], []

for d in degrees:
    # Vandermonde matrix: each column is x^k
    X_tr = np.vander(x_train, d + 1, increasing=True)
    X_te = np.vander(x_test, d + 1, increasing=True)

    if d < n:
        # Overdetermined: least-squares solution
        w, _, _, _ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    else:
        # Underdetermined: minimum-norm solution
        # w = X^T (X X^T)^{-1} y
        w = X_tr.T @ np.linalg.solve(X_tr @ X_tr.T + 1e-12 * np.eye(n), y_train)

    train_errors.append(np.mean((X_tr @ w - y_train) ** 2))
    test_errors.append(np.mean((X_te @ w - y_test) ** 2))

# Plot degrees vs test_errors to see the double descent curve
# Peak occurs near degree = n (interpolation threshold)
for d, tr, te in zip(degrees, train_errors, test_errors):
    regime = "UNDER" if d < n - 5 else ("CRITICAL" if d <= n + 5 else "OVER")
    if d in [3, 10, 28, 30, 50, 100]:
        print(f"degree={d:>3d}  train={tr:.4f}  test={te:.4f}  [{regime}]")

The output reveals the characteristic double descent shape. Low degrees have moderate error (underfitting). Around degree 28–30 — the interpolation threshold — test error spikes dramatically. Then, as we keep adding parameters well past n, test error drops again. The model has more parameters than data points, perfectly memorizes every training example, and yet generalizes well.

The Three Regimes — Under, Critical, Over

Double descent carves the model complexity axis into three distinct regimes, each with fundamentally different behavior:

Underfitting regime (p ≪ n): The model has far fewer parameters p than data points n. It can't fit the training data perfectly, so it compromises — capturing the broad pattern while ignoring noise. This is the classical territory where the bias-variance tradeoff applies. Increasing complexity reduces bias and improves generalization, up to a point.

Critical regime (p ≈ n): The model has just barely enough parameters to interpolate — to pass through every training point exactly. This is the interpolation threshold, and it's where the worst generalization occurs. Think of it like solving a system of n equations with n unknowns: there's exactly one solution, and it must pass through every noisy point. The model has no freedom to choose a smooth interpolant — it's forced to contort itself through every data point, including the noise.

Overparameterized regime (p ≫ n): The model has many more parameters than data points. Now there are infinitely many solutions that perfectly fit the training data. Among these, gradient descent (or the pseudoinverse for linear models) finds the minimum-norm interpolator — the simplest solution that still passes through every point. This solution is smooth, and smooth functions generalize well.

The critical insight: it's not the parameter count that causes overfitting — it's the ratio p/n near 1.

Let's visualize this directly. For each regime, we'll fit a polynomial and plot the learned function:

import numpy as np

np.random.seed(42)
n = 20
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x) + np.random.randn(n) * 0.3
x_dense = np.linspace(0, 2 * np.pi, 500)

for label, d in [("UNDERFIT (degree 5)", 5), ("CRITICAL (degree = n)", n), ("OVERPARAM (degree = 5n)", 5 * n)]:
    X = np.vander(x, d + 1, increasing=True)
    X_dense = np.vander(x_dense, d + 1, increasing=True)

    if d < n:
        w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    else:
        w = X.T @ np.linalg.solve(X @ X.T + 1e-12 * np.eye(n), y)

    y_pred = X_dense @ w
    test_err = np.mean((y_pred - np.sin(x_dense)) ** 2)
    train_err = np.mean((X @ w - y) ** 2)
    print(f"{label:>25s}  train={train_err:.4f}  test={test_err:.4f}  ||w||={np.linalg.norm(w):.2f}")

The underfit model (degree 5) captures the sine wave's general shape but misses details — its training error is nonzero. The critical model (degree = n) passes through every point exactly but oscillates wildly between them — its weight norm ||w|| is enormous and test error spikes. The overparameterized model (degree = 5n) also passes through every point, but it does so smoothly — the minimum-norm solution keeps the weights small, and the function generalizes nearly as well as the underfit model.

The Interpolation Threshold — Where Everything Goes Wrong

Why is the critical regime (p ≈ n) so catastrophic? The answer lies in linear algebra.

For overparameterized linear regression (p > n), the minimum-norm solution is:

w = Xᵀ(XXᵀ)⁻¹y

When p = n, X itself is square and XXᵀ is barely invertible. Its condition number — the ratio of the largest to smallest singular value — explodes. Inverting an ill-conditioned matrix amplifies noise in y, producing enormous weights that oscillate wildly. The weight norm ||w|| diverges at the interpolation threshold.

As p grows well past n, the system becomes increasingly underdetermined. There are infinitely many solutions, and the minimum-norm constraint selects the smoothest one. The matrix XXᵀ becomes well-conditioned again, weights stay small, and generalization recovers.
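This conditioning story is easy to check directly. The sketch below uses random Gaussian features rather than the monomial basis above (whose extreme scaling would drown out the effect); the sizes and the grid of p values are illustrative choices, not canonical ones:

```python
import numpy as np

# How well-conditioned is X X^T as p grows past n?
# Gaussian features, normalized so the spectrum stays O(1).
np.random.seed(0)
n = 30

for p in [30, 33, 45, 90, 300, 900]:
    X = np.random.randn(n, p) / np.sqrt(p)
    s = np.linalg.svd(X @ X.T, compute_uv=False)  # eigenvalues of the Gram matrix
    print(f"p={p:>3d}  p/n={p/n:>5.1f}  cond(XX^T)={s.max() / s.min():.1e}")
```

The condition number is worst at p = n and falls steadily as p grows, which is exactly why the minimum-norm formula stops amplifying noise deep in the overparameterized regime.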

Here's the connection to regularization: adding an L2 penalty (ridge regression) changes the solution to w = Xᵀ(XXᵀ + λI)⁻¹y. The λI term prevents the condition number from exploding, smoothing away the interpolation peak entirely. This is why classical ML practitioners rarely observed double descent — they almost always regularized.

import numpy as np

np.random.seed(42)
n = 30
x = np.sort(np.random.uniform(0, 2 * np.pi, n))
y = np.sin(x) + np.random.randn(n) * 0.4
x_test = np.linspace(0, 2 * np.pi, 300)
y_test = np.sin(x_test)

ratios = np.concatenate([np.linspace(0.1, 0.9, 9), np.linspace(0.95, 1.05, 11), np.linspace(1.1, 5.0, 20)])
results = {lam: [] for lam in [0, 0.01, 0.1]}

for ratio in ratios:
    p = max(2, int(ratio * n))
    X_tr = np.vander(x, p + 1, increasing=True)
    X_te = np.vander(x_test, p + 1, increasing=True)

    for lam in [0, 0.01, 0.1]:
        if p <= n:
            w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p + 1), X_tr.T @ y)
        else:
            w = X_tr.T @ np.linalg.solve(X_tr @ X_tr.T + lam * np.eye(n), y)
        test_mse = np.mean((X_te @ w - y_test) ** 2)
        results[lam].append((ratio, min(test_mse, 5.0)))  # cap for display

# Print key values showing the peak and its smoothing
for lam in [0, 0.01, 0.1]:
    vals = results[lam]
    peak = max(vals, key=lambda v: v[1])
    final = vals[-1]
    print(f"lambda={lam:.2f}  peak_error={peak[1]:.3f} at p/n={peak[0]:.2f}  "
          f"final_error={final[1]:.3f} at p/n={final[0]:.2f}")

With no regularization (λ=0), the test error peak at p/n ≈ 1 is enormous. Adding even a tiny λ=0.01 dramatically reduces the peak. At λ=0.1, the peak nearly vanishes — the curve looks like the classical smooth U-shape. Regularization masks double descent, which explains why the phenomenon went undiscovered for decades despite being present in the simplest models.

Double Descent in Neural Networks

Polynomial regression is a clean sandbox, but the real excitement is that double descent appears in neural networks too — the models we actually care about. Nakkiran et al. (2020) showed that the phenomenon manifests along three axes:

  1. Model-wise: Fix the dataset, increase network width → test error shows the double descent pattern
  2. Epoch-wise: Fix the model, train longer → test error decreases, then increases (classical overfitting), then decreases again
  3. Sample-wise: Fix the model, add more training data → near the interpolation threshold, more data can temporarily hurt generalization

The sample-wise result is the most counterintuitive: there exist situations where giving the model more data makes it worse. The threshold sits at n ≈ p, so for a fixed model, adding data moves n toward p — a safely overparameterized model can be pushed into the critical regime by the extra samples.
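Here's a minimal sketch of sample-wise double descent (my construction, not Nakkiran et al.'s setup): a fixed degree-20 polynomial fit to datasets of growing size. It uses a Chebyshev basis instead of raw monomials so that floating-point conditioning doesn't muddy the picture.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

np.random.seed(0)
d = 20                # fixed model capacity: degree-20 polynomial, p = 21 coefficients
p = d + 1
x_test = np.linspace(-1, 1, 400)
y_test = np.sin(np.pi * x_test)
X_te = chebvander(x_test, d)

for n in [8, 14, 21, 40, 120]:
    x = np.linspace(-1, 1, n)
    y = np.sin(np.pi * x) + np.random.randn(n) * 0.3
    X = chebvander(x, d)   # Chebyshev features stay well-scaled, unlike monomials
    if n <= p:
        # fewer samples than parameters: minimum-norm interpolator
        w = X.T @ np.linalg.solve(X @ X.T + 1e-12 * np.eye(n), y)
    else:
        # more samples than parameters: ordinary least squares
        w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    err = np.mean((X_te @ w - y_test) ** 2)
    print(f"n={n:>3d}  n/p={n / p:.2f}  test_MSE={err:.3g}")
```

Going from n = 8 to n = 21 is strictly more data, yet test error blows up, because n has caught up with p. Only past the threshold does more data help again.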

Let's demonstrate model-wise double descent with a simple two-layer neural network. We'll train MLPs of increasing width on a small noisy dataset and watch the characteristic spike appear:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def train_mlp(X, y, hidden, lr=0.01, epochs=2000):
    """Train a 2-layer ReLU MLP: input -> hidden -> 1 output."""
    np.random.seed(7)
    d = X.shape[1]
    W1 = np.random.randn(d, hidden) * np.sqrt(2.0 / d)
    b1 = np.zeros(hidden)
    W2 = np.random.randn(hidden, 1) * np.sqrt(2.0 / hidden)
    b2 = np.zeros(1)

    for _ in range(epochs):
        # Forward
        h = relu(X @ W1 + b1)
        pred = h @ W2 + b2

        # MSE loss gradient
        err = pred - y.reshape(-1, 1)
        grad_W2 = h.T @ err / len(X)
        grad_b2 = err.mean(axis=0)
        grad_h = err @ W2.T
        grad_h[h == 0] = 0  # ReLU backward
        grad_W1 = X.T @ grad_h / len(X)
        grad_b1 = grad_h.mean(axis=0)

        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1

    return W1, b1, W2, b2

# Generate a small noisy dataset
np.random.seed(42)
n = 40
X_train = np.random.randn(n, 5)
true_w = np.array([1, -0.5, 0.3, 0, 0.8])
y_train = X_train @ true_w + np.random.randn(n) * 0.5
X_test = np.random.randn(200, 5)
y_test = X_test @ true_w

widths = [3, 5, 10, 20, 40, 60, 100, 200, 500]
for h in widths:
    W1, b1, W2, b2 = train_mlp(X_train, y_train, h)
    pred_tr = relu(X_train @ W1 + b1) @ W2 + b2
    pred_te = relu(X_test @ W1 + b1) @ W2 + b2
    tr_err = np.mean((pred_tr.flatten() - y_train) ** 2)
    te_err = np.mean((pred_te.flatten() - y_test) ** 2)
    params = 5 * h + h + h + 1  # W1 + b1 + W2 + b2
    print(f"width={h:>3d} params={params:>4d} (p/n={params/n:.1f})  "
          f"train={tr_err:.4f}  test={te_err:.4f}")

As width increases, total parameters grow from ~20 (underfitting) through ~40 (near the interpolation threshold, where params ≈ n) to ~3500 (heavily overparameterized). Test error spikes when p/n ≈ 1, then drops as the network enters the overparameterized regime. The same double descent curve we saw with polynomials emerges in neural networks.

Why Does the Second Descent Work? — Implicit Regularization

Here's the central theoretical question: in the overparameterized regime, infinitely many models perfectly fit the training data. Why does gradient descent find a good one?

Minimum-Norm Bias

For linear models, the answer is precise: gradient descent initialized at zero converges to the minimum-norm interpolating solution — the set of weights with the smallest ||w|| that still achieves zero training loss. Minimum-norm solutions are "smooth" in a precise mathematical sense: they don't assign large coefficients to any single feature, spreading the model's capacity evenly.

Think of it like choosing a path through a mountain range. There are infinitely many paths from A to B, but starting from the valley floor and taking the smallest steps leads you along the gentlest route — no unnecessary cliff scaling.
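For linear least squares this claim can be verified in a few lines. The sketch below (the sizes and learning rate are arbitrary choices) runs plain gradient descent from a zero initialization on an overparameterized problem and compares the result to the closed-form minimum-norm interpolator:

```python
import numpy as np

np.random.seed(0)
n, p = 20, 80                 # 4x overparameterized linear regression
X = np.random.randn(n, p)
y = np.random.randn(n)

# Closed-form minimum-norm interpolator: w = X^T (X X^T)^{-1} y
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# Plain gradient descent on MSE, started at zero. Each gradient
# X^T (X w - y) lies in the row space of X, so the iterate never
# leaves that subspace -- and the only interpolator in it is w_min.
w = np.zeros(p)
lr = 0.1
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y) / n

print(f"||w_gd - w_min|| = {np.linalg.norm(w - w_min):.2e}")
print(f"||w_gd|| = {np.linalg.norm(w):.4f}   ||w_min|| = {np.linalg.norm(w_min):.4f}")
print(f"train MSE of GD solution: {np.mean((X @ w - y) ** 2):.2e}")
```

Gradient descent lands on the minimum-norm solution to numerical precision, with no explicit penalty anywhere in the loop.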

SGD Noise as a Regularizer

For neural networks, the story is richer. Stochastic gradient descent — with its noisy mini-batch estimates — introduces additional implicit regularization. The noise biases optimization toward wide, flat minima in the loss landscape. Wide minima generalize better than sharp ones because small perturbations to the weights (which inevitably happen with new test data) don't change the predictions much. Full-batch gradient descent, by contrast, can settle into sharp minima that generalize poorly.

This connects directly to the optimizers post: the very thing that makes SGD "noisier" than full-batch GD is what makes it better in the overparameterized regime.

Seeing It in Numbers

Let's verify the minimum-norm intuition. In overparameterized linear regression, we'll compare the unregularized minimum-norm solution to optimally-tuned ridge regression:

import numpy as np

np.random.seed(42)
n, p = 30, 150  # 5x overparameterized
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:5] = [1, -0.5, 0.3, -0.8, 0.6]  # only 5 features matter
y = X @ true_w + np.random.randn(n) * 0.3

X_test = np.random.randn(500, p)
y_test = X_test @ true_w

# Minimum-norm (no regularization): w = X^T (X X^T)^{-1} y
w_mn = X.T @ np.linalg.solve(X @ X.T + 1e-12 * np.eye(n), y)

# Ridge regression: sweep lambda, keep the best test error (an oracle choice, flattering ridge)
best_lam, best_err = 0, float('inf')
for lam in [0.001, 0.01, 0.1, 0.5, 1.0, 5.0]:
    w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    err = np.mean((X_test @ w_ridge - y_test) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err

w_ridge = X.T @ np.linalg.solve(X @ X.T + best_lam * np.eye(n), y)

print(f"Minimum-norm:  test_MSE={np.mean((X_test @ w_mn - y_test)**2):.4f}  ||w||={np.linalg.norm(w_mn):.3f}")
print(f"Ridge (lam={best_lam}): test_MSE={np.mean((X_test @ w_ridge - y_test)**2):.4f}  ||w||={np.linalg.norm(w_ridge):.3f}")
print(f"Train error:   min-norm={np.mean((X @ w_mn - y)**2):.6f}  ridge={np.mean((X @ w_ridge - y)**2):.6f}")

The minimum-norm solution achieves zero training error (it interpolates) while maintaining a test error close to the optimally-regularized ridge solution. Its weight norm is naturally small — not because we penalized it, but because gradient descent converges to the minimum-norm solution by default. The regularization post taught explicit regularization; here we see that implicit regularization from the optimizer achieves a similar effect for free.

Practical Implications — What This Means for You

Double descent isn't just a curiosity for theorists. It has direct practical consequences:

1. Don't fear large models. The interpolation peak is narrow. If your model is "too big," it's almost certainly in the overparameterized regime (past the peak), not at the peak. The danger zone is models that are just barely big enough to fit the data — a regime practitioners rarely land in.

2. More data can temporarily hurt. Near the interpolation threshold, adding training data shifts the threshold and can push the model into the critical regime. If you see test performance degrade after adding data, you might be near the peak. The fix: make the model bigger so it's safely overparameterized for the new dataset size.

3. Regularization masks the peak. Standard techniques — weight decay, dropout, early stopping — smooth the interpolation spike. This is why double descent went unobserved for decades: classical ML always regularized, and the peak vanished. If you're using any regularization, you probably won't see double descent at all.

4. This explains "just scale it up." The scaling laws show that bigger models, more data, and more compute all improve performance along predictable power laws. Double descent explains why: scaling pushes the model deeper into the overparameterized regime, further from the dangerous interpolation threshold and deeper into the benign second descent.

5. Train longer, even past overfitting. Epoch-wise double descent means that if your model overfits midway through training, continuing to train can actually recover generalization. The test loss curve goes down, then up, then down again. Don't always stop at the first minimum.

The Frontier — Beyond Double Descent

The double descent framework has opened several active research directions:

Multiple descent. In some settings — particularly with structured data or specific architectures — the test error curve shows three or more dips. The double descent may itself be a simplified picture of a richer phenomenon.

Grokking. Power et al. (2022) discovered that small transformers trained on algorithmic tasks (like modular arithmetic) first memorize the training data perfectly, then — thousands of epochs later — suddenly generalize. This looks like a dramatic version of epoch-wise double descent: the model sits in the critical regime for a long time, then snaps into the overparameterized-like regime where it generalizes.

Benign overfitting. Bartlett et al. (2020) formalized conditions under which interpolating noisy data doesn't hurt generalization. The key requirement is that the data lives in a high-dimensional space where most directions are "benign" — noise gets distributed across many irrelevant dimensions and doesn't corrupt the few signal dimensions.

Label noise amplifies the peak. Double descent is most dramatic when the training labels contain noise. With clean labels, the interpolation threshold is mild — the model just fits the true function. With noisy labels, the model is forced to memorize garbage at the threshold, producing wild overfitting. This explains why label quality matters more for medium-sized models than for very large ones.
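A small experiment makes the noise dependence concrete (my sketch, using the same Chebyshev trick as earlier to keep the linear algebra well-behaved): fit the unique interpolating polynomial exactly at the threshold, reusing one noise pattern scaled to different levels so the comparison is apples-to-apples.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

np.random.seed(0)
n = 21
d = 20                         # p = n: exactly at the interpolation threshold
x = np.linspace(-1, 1, n)
x_test = np.linspace(-1, 1, 400)
y_test = np.sin(np.pi * x_test)
X = chebvander(x, d)
X_te = chebvander(x_test, d)

eps = np.random.randn(n)       # one fixed noise pattern, scaled below
for sigma in [0.0, 0.05, 0.3]:
    y = np.sin(np.pi * x) + sigma * eps
    w = np.linalg.solve(X, y)  # square system: the unique interpolant
    err = np.mean((X_te @ w - y_test) ** 2)
    print(f"label_noise={sigma:.2f}  test_MSE_at_threshold={err:.3g}")
```

With clean labels the interpolant is essentially the true function and the threshold is harmless; the height of the peak grows with the square of the noise level.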

The deep question remains: should we rewrite the bias-variance tradeoff for the overparameterized age? Perhaps the real picture isn't a U-shape or even a double-U, but something more nuanced — a function of model size, data quality, optimization method, and architecture that classical theory was never equipped to describe.

Try It: Double Descent Explorer

Drag the sliders to see how noise, regularization, and dataset size affect the double descent curve. The red dashed line marks the interpolation threshold (p/n = 1).


Try It: Epoch-Wise Double Descent

Three models trained on the same noisy data. The small model follows the classical curve. The medium model (near the interpolation threshold) overfits then recovers — epoch-wise double descent. The large model descends smoothly.


Conclusion

Double descent reconciles the two most important claims in machine learning. The regularization post says "constrain your model to prevent overfitting." The scaling laws post says "bigger models generalize better." Both are correct — they're just describing different sides of the interpolation threshold.

In the classical regime (p < n), the bias-variance tradeoff reigns: more capacity can hurt. At the interpolation threshold (p ≈ n), generalization is at its worst. In the overparameterized regime (p ≫ n), implicit regularization from gradient descent ensures that bigger models find smoother solutions and generalize better.

Modern deep learning lives almost entirely in the overparameterized regime. That's why scaling works. That's why GPT-4 generalizes despite having more parameters than any training set could justify. And that's why the classical bias-variance U-curve, while correct in its own domain, was never the full story.

The textbook wasn't wrong. It just stopped too early.

References & Further Reading