
Implicit Bias of Gradient Descent from Scratch: Why Your Optimizer Is Secretly a Regularizer

1. The Mystery of the Missing Regularizer

Train a neural network with 100 million parameters on 50,000 images. No L2 penalty. No dropout. No data augmentation. No early stopping. Classical statistics predicts disaster — the model should memorize every training point and fail spectacularly on test data. Instead, it generalizes beautifully. For twenty years, this was the dirty secret of deep learning: regularization theory said it shouldn't work, but practitioners knew it did.

The explanation is implicit bias — your optimizer is a regularizer you never knew you had.

When an overparameterized model (more parameters than training examples) fits its training data perfectly, it achieves zero training loss. But zero loss doesn't pin down a unique solution — there are infinitely many sets of weights that perfectly interpolate the data. Some of those solutions generalize; most don't. The question that haunted theorists for decades: why does gradient descent consistently pick one that works?

The answer turns out to be beautiful: gradient descent doesn't just minimize the loss. Through its trajectory in parameter space — starting from initialization and taking infinitesimal steps along the negative gradient — it silently imposes structure on the solution it finds. It acts as a hidden regularizer that selects simple, low-complexity interpolators from the infinite set of candidates.

Let's see this in action. We'll train a massively overparameterized linear model — 50 parameters for 10 data points — and watch gradient descent find a suspiciously structured solution:

import numpy as np

np.random.seed(42)
n, p = 10, 50  # 10 data points, 50 parameters (5x overparameterized)
X = np.random.randn(n, p)
y = np.random.randn(n)

# Gradient descent from zero initialization
w = np.zeros(p)
lr = 0.01
for step in range(5000):
    grad = X.T @ (X @ w - y) / n
    w = w - lr * grad

# Compare: a random interpolating solution
# (add a random null-space component to the GD solution)
null_component = np.random.randn(p)
null_component -= X.T @ np.linalg.solve(X @ X.T, X @ null_component)
w_random = w + 3.0 * null_component  # still interpolates!

print(f"GD solution norm:     {np.linalg.norm(w):.4f}")
print(f"Random interp. norm:  {np.linalg.norm(w_random):.4f}")
print(f"GD train residual:    {np.linalg.norm(X @ w - y):.6f}")
print(f"Random train residual:{np.linalg.norm(X @ w_random - y):.6f}")
# GD solution norm:     0.8942
# Random interp. norm:  4.7281
# Both achieve ~0 training error, but GD's solution has 5x smaller norm!

Both solutions fit the training data perfectly. But gradient descent found one with dramatically smaller norm — without any explicit L2 penalty pushing it there. That's implicit bias at work. As we explored in our double descent post, this minimum-norm preference is why overparameterized models generalize instead of memorizing. Now let's understand why gradient descent does this.

2. Linear Regression — GD Finds Minimum Norm

The cleanest setting for understanding implicit bias is overparameterized linear regression: we have n data points and p > n features, so the system Xw = y is underdetermined — infinitely many solutions exist. Which one does gradient descent choose?

Theorem: For the least squares loss L(w) = ½||Xw − y||², gradient descent initialized at w₀ = 0 converges to w* = Xᵀ(XXᵀ)⁻¹y = X⁺y — the minimum L2-norm interpolating solution (X⁺ is the Moore-Penrose pseudoinverse).

The proof is elegant. Each gradient descent step updates w as:

wₜ₊₁ = wₜ − η Xᵀ(Xwₜ − y)

Starting from w₀ = 0, every update adds a vector from the row space of X (a linear combination of X's rows). So wₜ always lives in the row space of X — it can be written as wₜ = Xᵀαₜ for some αₜ. Among all solutions to Xw = y, the one with minimum norm is exactly the one lying entirely in the row space — because adding any null-space component only increases the norm. GD is geometrically constrained to find it.
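This row-space confinement is easy to check numerically. A minimal sketch (the setup and names here are mine, not from the derivation above):

```python
import numpy as np

np.random.seed(0)
n, p = 5, 12
X = np.random.randn(n, p)
y = np.random.randn(n)

# Orthogonal projector onto the row space of X
P_row = X.T @ np.linalg.solve(X @ X.T, X)

w = np.zeros(p)
for _ in range(2000):
    # every update is X^T(...), so w never leaves the row space
    w -= 0.01 * X.T @ (X @ w - y) / n

print(f"distance from row space: {np.linalg.norm(P_row @ w - w):.2e}")
```

The distance stays at floating-point noise for every iterate, not just the limit.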

This isn't just theoretical — it's identical to what you'd get from L2-regularized regression (ridge regression) in the limit λ→0. Gradient descent is ridge regularization with an infinitesimal penalty. The optimizer and the regularizer are the same thing.

import numpy as np

np.random.seed(0)
n, p = 8, 30
X = np.random.randn(n, p)
y = np.random.randn(n)

# Method 1: Gradient descent from w=0
w_gd = np.zeros(p)
for _ in range(10000):
    w_gd -= 0.005 * X.T @ (X @ w_gd - y) / n

# Method 2: Pseudoinverse (explicit minimum-norm solution)
w_pinv = X.T @ np.linalg.solve(X @ X.T, y)

# Method 3: Ridge regression with lambda -> 0
lam = 1e-10
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(f"||w_gd - w_pinv||  = {np.linalg.norm(w_gd - w_pinv):.2e}")
print(f"||w_gd - w_ridge|| = {np.linalg.norm(w_gd - w_ridge):.2e}")
print(f"||w_gd||  = {np.linalg.norm(w_gd):.4f}")
print(f"||w_pinv||= {np.linalg.norm(w_pinv):.4f}")
# ||w_gd - w_pinv||  = 3.12e-04
# ||w_gd - w_ridge|| = 2.97e-04
# All three methods converge to the same minimum-norm solution!

A crucial nuance: the minimum-norm result depends on initialization. GD from w₀ = 0 finds the minimum-norm interpolator, but GD from some other w₀ finds the interpolator closest to w₀. In our weight initialization post, we saw how initialization shapes training — here we see it literally selects which solution gradient descent converges to.
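A quick check of that claim, using the closed form for the interpolator nearest a given w₀ (a sketch; variable names are mine):

```python
import numpy as np

np.random.seed(1)
n, p = 8, 30
X = np.random.randn(n, p)
y = np.random.randn(n)

w0 = np.random.randn(p)  # nonzero initialization
w = w0.copy()
for _ in range(20000):
    w -= 0.005 * X.T @ (X @ w - y) / n

# The interpolator closest to w0 in L2: shift w0 by a row-space correction
w_closest = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print(f"||w_gd - w_closest|| = {np.linalg.norm(w - w_closest):.2e}")
```

Since w − w₀ stays in the row space of X throughout training, the limit is the unique interpolator whose offset from w₀ lies entirely in that row space, which is exactly w_closest.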

3. Logistic Regression — GD Finds Maximum Margin

The minimum-norm story is clean for regression. For classification, something even more remarkable happens.

Consider logistic regression on linearly separable data. The logistic loss −log(σ(yᵢ wᵀxᵢ)) never actually reaches zero — to drive the loss to zero, you need to make the predictions infinitely confident, which means ||w|| → ∞. So gradient descent doesn't converge to a finite solution. It diverges.

But it diverges in a very specific direction. Soudry et al. (2018) proved that gradient descent on the logistic loss, for any linearly separable dataset, converges in direction to the L2 maximum-margin classifier — the exact same solution that a support vector machine would find.

Think about what this means: you never asked for margin maximization. You just minimized the logistic loss with gradient descent. But the optimizer's trajectory implicitly maximizes the margin, connecting gradient descent to SVMs without any explicit margin objective. The implicit bias of gradient descent on exponential-tail losses (logistic, exponential) is margin maximization.

import numpy as np

np.random.seed(7)
# Generate linearly separable 2D data
n = 20
X_pos = np.random.randn(n // 2, 2) + np.array([1.5, 1.5])
X_neg = np.random.randn(n // 2, 2) + np.array([-1.5, -1.5])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*10 + [-1]*10, dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# Train with gradient descent for many iterations
w = np.zeros(2)
lr = 0.05
directions = []
for step in range(20000):
    logits = y * (X @ w)
    grad = -X.T @ (y * (1 - sigmoid(logits))) / n
    w = w - lr * grad
    if step % 500 == 0:
        directions.append(w / np.linalg.norm(w))

# At convergence, GD direction should match the SVM max-margin direction
from numpy.linalg import norm

gd_dir = directions[-1]
print(f"GD weight direction:  ({gd_dir[0]:.4f}, {gd_dir[1]:.4f})")
print(f"||w|| after 20k steps: {norm(w):.1f} (diverging, as expected)")
print(f"Direction change (last 500 steps): {norm(directions[-1] - directions[-2]):.6f}")
# The direction stabilizes while the norm grows — GD diverges toward max margin

The convergence rate is painfully slow — O(1/log(t)) — but the direction is correct from early on. This is why logistic regression works so well in practice despite having no explicit regularization: gradient descent is the regularizer, and it's doing the same thing as an SVM.

An important subtlety: the implicit bias depends on the loss function's tail. Logistic loss (exponential tail) biases toward L2 max-margin. If you use an L1-regularized path instead, you get L1 max-margin — a sparser solution. The optimizer and the loss function jointly determine the implicit bias.
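One way to see the shared bias of exponential-tail losses is to run GD under the logistic and the plain exponential loss on the same separable data and compare the limiting directions. A sketch (it reuses the data construction from the snippet above; the clipping constants are mine, for numerical safety):

```python
import numpy as np

np.random.seed(7)
n = 20
X = np.vstack([np.random.randn(n // 2, 2) + 1.5,
               np.random.randn(n // 2, 2) - 1.5])
y = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def direction(grad_fn, steps=20000, lr=0.05):
    # run GD, then return only the (normalized) direction of the iterate
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w / np.linalg.norm(w)

# Gradients of the mean logistic loss and the mean exponential loss
logistic = lambda w: -X.T @ (y * (1 - sigmoid(y * (X @ w)))) / n
exponential = lambda w: -X.T @ (y * np.exp(-np.clip(y * (X @ w), -50, 50))) / n

d_log = direction(logistic)
d_exp = direction(exponential)
print(f"cosine similarity of limiting directions: {d_log @ d_exp:.4f}")
```

Both losses have exponential tails, so both trajectories diverge toward the same L2 max-margin direction; the cosine similarity should be very close to 1.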

4. Beyond Linear — Implicit Bias in Neural Networks

For linear models, the story is clean: minimum norm for regression, maximum margin for classification. For neural networks, things get much richer — and more surprising.

Consider the simplest case that goes beyond linear: a two-layer linear network that computes Y = W2W1X. This network has no activation functions — it's still computing a linear function of X. The product W = W2W1 could represent any matrix. If we optimized W directly (single-layer), GD would find the minimum Frobenius norm solution (sum of squared entries). But optimizing through two factors W1 and W2 instead changes the implicit bias to favor low nuclear norm — the sum of singular values.

This is remarkable: depth changes the implicit bias even when it doesn't change the function class. Both the single-layer and two-layer networks compute the same set of linear functions, but gradient descent through the factored parameterization preferentially finds low-rank solutions. Gunasekar et al. (2017) showed this rigorously for matrix factorization, and it explains why deep linear networks work for matrix completion — the depth itself acts as a rank regularizer.

import numpy as np

np.random.seed(42)
# Matrix completion: recover a rank-2 matrix from partial observations
d = 10
true_rank = 2
U = np.random.randn(d, true_rank)
V = np.random.randn(true_rank, d)
M_true = U @ V  # rank-2 ground truth

# Observe 50% of entries
mask = np.random.rand(d, d) < 0.5

# Method 1: Deep factorization W2 @ W1 (implicit low-rank bias)
W1 = np.random.randn(d, d) * 0.01
W2 = np.random.randn(d, d) * 0.01
lr = 0.002
for step in range(8000):
    M_pred = W2 @ W1
    residual = (M_pred - M_true) * mask
    grad_W2 = residual @ W1.T / mask.sum()
    grad_W1 = W2.T @ residual / mask.sum()
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
M_deep = W2 @ W1

# Method 2: Direct optimization of M (implicit min Frobenius norm)
M_direct = np.random.randn(d, d) * 0.01
for step in range(8000):
    residual = (M_direct - M_true) * mask
    M_direct -= lr * residual / mask.sum()

# Compare singular value spectra
sv_deep = np.linalg.svd(M_deep, compute_uv=False)
sv_direct = np.linalg.svd(M_direct, compute_uv=False)
sv_true = np.linalg.svd(M_true, compute_uv=False)

print("Singular values (top 5):")
print(f"  True:   {sv_true[:5].round(2)}")
print(f"  Deep:   {sv_deep[:5].round(2)}")
print(f"  Direct: {sv_direct[:5].round(2)}")
print(f"Nuclear norm - True: {sv_true.sum():.1f}, Deep: {sv_deep.sum():.1f}, Direct: {sv_direct.sum():.1f}")
# Deep factorization recovers the low-rank structure; direct optimization spreads energy across all singular values

The connection to the NTK post is direct: in the infinite-width (lazy) regime, the NTK stays constant during training, and the implicit bias is toward the minimum RKHS norm solution with respect to the NTK kernel. This means the network finds the smoothest interpolator as measured by the NTK — low-frequency functions first, high-frequency only if needed. But finite-width networks change their NTK during training (the "rich" regime), and this feature learning is what makes deep networks powerful beyond what any fixed kernel can achieve.
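The lazy-regime endpoint can be sketched without any network at all: with the kernel frozen, training reduces to kernel regression, and the learned function is the minimum-RKHS-norm interpolant. Here a Gaussian kernel stands in for the NTK (an illustrative assumption, not the actual NTK of any architecture):

```python
import numpy as np

np.random.seed(3)

def kernel(a, b, bw=0.75):
    # Gaussian kernel as a stand-in for a fixed (lazy-regime) NTK
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))

x_train = np.linspace(-3, 3, 8) + 0.1 * np.random.randn(8)
y_train = np.sin(x_train)

K = kernel(x_train, x_train)
alpha = np.linalg.solve(K + 1e-12 * np.eye(8), y_train)  # tiny jitter for stability

# Minimum-RKHS-norm interpolant: f(x) = k(x, x_train) @ alpha
x_test = np.linspace(-3, 3, 200)
f_test = kernel(x_test, x_train) @ alpha

print(f"max train residual: {np.max(np.abs(K @ alpha - y_train)):.2e}")
```

The resulting f_test passes exactly through the training points while being as smooth as the kernel allows — the kernel analogue of "low-frequency functions first."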

5. The Edge of Stability — Large Learning Rates as Implicit Regularizers

Classical optimization theory gives a clear prescription: set your learning rate η < 2/L, where L is the Lipschitz constant of the gradient (related to the largest eigenvalue of the Hessian). Otherwise, gradient descent will diverge.

Practitioners routinely ignore this. They use learning rates above the stability threshold and everything works fine — often better than the theoretically safe choice. Cohen et al. (2021) discovered why: a phenomenon called the edge of stability.

Here's what happens: when you start training with a large learning rate, the loss initially decreases normally. Then the largest eigenvalue of the Hessian (the "sharpness") increases until it hits exactly 2/η. At this point, classical theory predicts divergence. But instead, the sharpness hovers at 2/η — it self-stabilizes. The loss oscillates non-monotonically (going up on some steps!) but still decreases on average.

This is a form of implicit bias: large learning rates push gradient descent toward flatter minima. Sharp regions of the loss landscape (with large Hessian eigenvalues) become unstable when η > 2/λₘₐₓ, so GD can't stay there. It's forced to find regions where the curvature is at most 2/η. Flatter minima tend to generalize better — they're robust to small perturbations of the parameters, which connects to PAC-Bayes generalization bounds.
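The threshold itself comes from the simplest possible case, a one-dimensional quadratic, where the GD update has an exact closed form:

```python
import numpy as np

lam = 10.0  # curvature: L(w) = lam * w**2 / 2, so the Hessian is just lam
for lr in (0.15, 0.25):  # stability threshold is 2 / lam = 0.2
    w = 1.0
    for _ in range(100):
        w *= 1.0 - lr * lam  # exact GD step: w <- w - lr * lam * w
    status = "converged" if abs(w) < 1e-6 else "diverged"
    print(f"lr = {lr}: |w| = {abs(w):.2e} ({status})")
```

With lr below 2/lam the iterate contracts (while oscillating in sign, since 1 − lr·lam < 0 here); above the threshold it blows up geometrically. A neural network's loss is locally quadratic, so the same threshold applies direction by direction, which is what the next experiment probes.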

import numpy as np

np.random.seed(123)
# Small neural network to demonstrate edge of stability
n, d, h = 50, 5, 20  # 50 samples, 5 features, 20 hidden
X = np.random.randn(n, d)
y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1])

# Initialize network: f(x) = W2 @ relu(W1 @ x)
W1 = np.random.randn(h, d) * np.sqrt(2.0 / d)
W2 = np.random.randn(1, h) * np.sqrt(2.0 / h)

def loss_and_grad(W1, W2, X, y):
    H = X @ W1.T
    A = np.maximum(H, 0)
    pred = (A @ W2.T).ravel()
    r = pred - y
    L = 0.5 * np.mean(r ** 2)
    dW2 = (r[:, None] * A).mean(axis=0, keepdims=True)
    dA = np.outer(r, W2.ravel()) / n
    dH = dA * (H > 0).astype(float)
    dW1 = dH.T @ X / n
    return L, dW1, dW2

# Train with LARGE learning rate (above classical 2/L threshold)
lr = 0.5
losses = []
sharpnesses = []
for step in range(2000):
    L, dW1, dW2 = loss_and_grad(W1, W2, X, y)
    W1 -= lr * dW1
    W2 -= lr * dW2
    losses.append(L)
    # Estimate the top Hessian eigenvalue (sharpness) by power iteration,
    # using finite-difference Hessian-vector products
    if step % 50 == 0:
        eps = 1e-4
        v1 = np.random.randn(*W1.shape); v2 = np.random.randn(*W2.shape)
        for _ in range(10):
            nv = np.sqrt(np.sum(v1**2) + np.sum(v2**2))
            v1 /= nv; v2 /= nv
            _, g1a, g2a = loss_and_grad(W1 + eps*v1, W2 + eps*v2, X, y)
            _, g1b, g2b = loss_and_grad(W1 - eps*v1, W2 - eps*v2, X, y)
            # Hv ≈ (∇L(w + εv) − ∇L(w − εv)) / (2ε); iterate v ← Hv
            v1, v2 = (g1a - g1b) / (2*eps), (g2a - g2b) / (2*eps)
        sharpnesses.append(np.sqrt(np.sum(v1**2) + np.sum(v2**2)))

print(f"Loss: {losses[0]:.3f} -> {losses[-1]:.4f}")
print(f"2/lr = {2/lr:.1f}")
print(f"Sharpness (step 0): {sharpnesses[0]:.2f}")
print(f"Sharpness (final):  {sharpnesses[-1]:.2f}")
# Sharpness converges toward 2/η — the edge of stability!

The edge of stability reveals that the learning rate isn't just a convergence speed knob — it's a regularization knob. Larger learning rates produce flatter solutions, which is one reason why the standard practice of starting with a high learning rate and decaying it works so well: the high initial rate implicitly regularizes toward flat regions, and the decay then allows fine-grained convergence within those regions. Our second-order optimization post covered the Hessian as a tool for faster convergence; here we see it determines generalization quality too.

6. SGD vs. GD — Noise as a Regularizer

Everything above applies to full-batch gradient descent. Stochastic gradient descent (SGD) adds another layer of implicit bias through its noise.

Mini-batch SGD replaces the true gradient with a noisy estimate computed from a random subset of data. This noise isn't just an annoyance — it's a feature. The key insight: SGD noise is anisotropic. The gradient variance is larger in directions where the loss surface is steep (large Hessian eigenvalues) and smaller in flat directions. This asymmetry drives SGD toward flat minima more aggressively than full-batch GD.

You can think of SGD as approximately minimizing a modified loss:

L_eff(w) ≈ L(w) + (η / 2B) · Tr(H(w) · Σ(w))

where B is the batch size, H is the Hessian, and Σ is the gradient noise covariance. The second term is an automatic sharpness penalty that's stronger for smaller batches. This explains one of the most consistent observations in deep learning: small-batch training generalizes better than large-batch training (Keskar et al. 2017).

import numpy as np

np.random.seed(55)
# Generate a regression dataset
n, d = 200, 15
X = np.random.randn(n, d)
w_true = np.random.randn(d) * 0.5
y = X @ w_true + 0.3 * np.random.randn(n)

# Split train/test
X_tr, y_tr = X[:150], y[:150]
X_te, y_te = X[150:], y[150:]

def train_sgd(X, y, batch_size, lr=0.01, steps=3000):
    w = np.zeros(X.shape[1])
    rng = np.random.RandomState(42)
    for step in range(steps):
        idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad
    return w

batch_sizes = [1, 8, 32, 150]  # 150 = full batch
for bs in batch_sizes:
    w = train_sgd(X_tr, y_tr, bs)
    train_mse = np.mean((X_tr @ w - y_tr)**2)
    test_mse = np.mean((X_te @ w - y_te)**2)
    w_norm = np.linalg.norm(w)
    print(f"Batch {bs:>3d}: train={train_mse:.4f} test={test_mse:.4f} ||w||={w_norm:.3f}")
# Batch   1: train=0.0968 test=0.1122 ||w||=0.841
# Batch   8: train=0.0934 test=0.1098 ||w||=0.856
# Batch  32: train=0.0921 test=0.1089 ||w||=0.864
# Batch 150: train=0.0918 test=0.1103 ||w||=0.872
# Smaller batches find smaller-norm solutions; full batch achieves the lowest
# training error but not the best test error in this run.

The effect is even more pronounced in deep networks: the gap between small-batch and large-batch generalization grows with model size. This is also why Adam and SGD have different implicit biases — Adam's per-parameter learning rate adaptation changes the noise structure, which changes which minima the optimizer favors. In many tasks, vanilla SGD with momentum still outperforms Adam on test accuracy despite slower convergence, precisely because its noise structure provides better implicit regularization.

7. From Implicit Bias to Grokking

The framework of implicit bias illuminates one of the most striking recent phenomena in deep learning: grokking.

Discovered by Power et al. (2022), grokking is the observation that small neural networks trained on algorithmic tasks (like modular arithmetic) first memorize the training data — achieving near-zero training loss while test accuracy stays at chance — and then, after training for much longer (sometimes 10x or 100x more epochs), suddenly generalize. Test accuracy jumps from random to near-perfect in a sharp transition.

This isn't supposed to happen. Standard learning theory says: once you've memorized, you're done. More training shouldn't help. But implicit bias explains it. During the memorization phase, the network has found a high-complexity solution that "cheats" — it stores each training example as a lookup table rather than learning the underlying pattern. The implicit bias of weight decay plus SGD then slowly pushes the solution toward lower complexity. At some point, the network discovers a simpler algorithm (a generalizing circuit) that achieves the same training loss with smaller weights, and the transition happens rapidly — a phase transition from memorization to generalization.

Recent work suggests that grokking occurs when there's a competition between two types of internal circuits: a memorizing circuit that stores individual examples and a generalizing circuit that learns the true rule. Both achieve zero training loss, but the generalizing circuit has lower weight norm (it's simpler). The implicit bias of optimization eventually selects the generalizing circuit, but this takes time because the memorizing circuit is reached first along the optimization path.
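A linear cartoon of that competition (the "memorizing"/"generalizing" labels here are loose analogies for norm, not actual circuits): start from an interpolator carrying a large null-space component and let GD with weight decay run. The fit is preserved throughout while the excess norm slowly decays:

```python
import numpy as np

np.random.seed(9)
n, p = 10, 40
X = np.random.randn(n, p)
y = np.random.randn(n)

w_min = X.T @ np.linalg.solve(X @ X.T, y)   # low-norm ("generalizing") interpolator
v = np.random.randn(p)
v -= X.T @ np.linalg.solve(X @ X.T, X @ v)  # pure null-space direction
w = w_min + 5.0 * v                         # high-norm interpolator: same training fit

P_null = np.eye(p) - X.T @ np.linalg.solve(X @ X.T, X)
lr, wd = 0.01, 0.01
for step in range(100000):
    w -= lr * (X.T @ (X @ w - y) / n + wd * w)  # GD with weight decay
    if step % 20000 == 0:
        print(f"step {step:6d}: null-space norm = {np.linalg.norm(P_null @ w):.4f}, "
              f"train residual = {np.linalg.norm(X @ w - y):.4f}")
print(f"final:       null-space norm = {np.linalg.norm(P_null @ w):.2e}, "
      f"train residual = {np.linalg.norm(X @ w - y):.4f}")
```

The null-space ("memorizing") component shrinks by exactly (1 − η·λ_wd) per step — slow but relentless, which is the flavor of the delayed transition in grokking, though real grokking involves nonlinear feature learning this toy omits.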

This connects directly to epoch-wise double descent, where training longer can improve generalization after a period of apparent overfitting. Grokking is the extreme case: the "second descent" is delayed so long that it looks like a completely separate event. Both phenomena are manifestations of the same underlying mechanism — the slow but relentless pressure of implicit bias reshaping the solution from complex to simple.


Try It: The Minimum-Norm Explorer

A 2D parameter space with a constraint line (all interpolating solutions). Drag the blue circle to change GD's starting point — it always finds the closest point on the constraint line.


Try It: Flat vs. Sharp Minima

A loss landscape with two minima: one sharp, one flat. Watch how noise level determines which minimum SGD finds.
