Regularization from Scratch: Every Trick That Stops Your Network from Memorizing the Training Set
The Overfitting Problem
Your network just hit 99.8% training accuracy and 54% test accuracy. Congratulations — it memorized the answers instead of learning the subject.
This is the most fundamental failure mode in machine learning, and every practitioner has encountered it. A sufficiently expressive neural network can fit any dataset perfectly — including the noise. It passes through every training point like a student who memorized the answer key but can't solve a single new problem.
Let's watch it happen. We'll generate 50 noisy samples from a sine wave and train a 4-layer ReLU network until the training loss is practically zero. Then we'll check how the network performs on data it hasn't seen.
import numpy as np

np.random.seed(42)

# Generate 50 noisy sine points (train) and 200 clean points (test)
X_train = np.random.uniform(-3, 3, (50, 1))
y_train = np.sin(X_train) + np.random.randn(50, 1) * 0.3
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(X_test)

# 4-layer ReLU network: 1 -> 64 -> 64 -> 64 -> 1
dims = [1, 64, 64, 64, 1]
W = [np.random.randn(dims[i], dims[i+1]) * 0.5 for i in range(4)]
b = [np.zeros((1, dims[i+1])) for i in range(4)]
lr = 0.005

for epoch in range(3000):
    # Forward pass
    h = X_train
    activations = [h]
    for i in range(3):
        h = h @ W[i] + b[i]
        h = np.maximum(0, h)  # ReLU
        activations.append(h)
    out = h @ W[3] + b[3]

    # MSE loss + backprop
    loss = np.mean((out - y_train) ** 2)
    grad = 2 * (out - y_train) / len(y_train)
    for i in range(3, -1, -1):
        dW = activations[i].T @ grad
        db = grad.sum(axis=0, keepdims=True)
        if i > 0:
            # Propagate first, so backprop uses the pre-update weights
            grad = (grad @ W[i].T) * (activations[i] > 0)
        W[i] -= lr * dW
        b[i] -= lr * db

# Evaluate on test set
h = X_test
for i in range(3):
    h = np.maximum(0, h @ W[i] + b[i])
test_pred = h @ W[3] + b[3]
test_loss = np.mean((test_pred - y_test) ** 2)

print(f"Train loss: {loss:.4f}")  # ~0.001
print(f"Test loss: {test_loss:.4f}")  # ~0.85
print(f"Gap: {test_loss - loss:.4f} — the network memorized, not learned")
The output tells the whole story: training loss near zero, test loss through the roof. If you plotted the learned function, you'd see wild oscillations — the network contorts itself to pass through every noisy training point while producing garbage everywhere else.
This is the bias-variance tradeoff in action. We can decompose the expected error of any model into three terms:

Expected error = Bias² + Variance + Irreducible noise
Bias measures how far off the model's average prediction is from the truth — a model that's too simple has high bias. Variance measures how much the predictions change when you train on different samples from the same distribution — a model that's too complex has high variance. Irreducible noise is the randomness inherent in the data that no model can capture.
Our overfitting network has near-zero bias (it can represent any function) but enormous variance (it latches onto noise that changes between samples). Regularization is the art of trading a small increase in bias for a large decrease in variance — constraining the model just enough that it learns the signal without memorizing the noise.
L1 and L2 Regularization (Weight Penalties)
The simplest regularization idea: if large weights let the model create wild, overfitting functions, penalize large weights directly. Add a term to the loss that grows when weights get big, so the optimizer has to balance fitting the data against keeping the weights small.
L2 Regularization (Weight Decay)
Add the sum of squared weights to the loss, scaled by a hyperparameter λ:

L_total = L_data + λ · Σᵢ wᵢ²
The gradient of this penalty term is simply 2λw, which gets added to every weight's gradient. This means each weight shrinks toward zero by a constant fraction at every step — large weights shrink fast, small weights shrink slow. The result: all weights stay small and distributed, preventing any single weight from dominating.
The Bayesian interpretation is elegant: L2 regularization is equivalent to placing a Gaussian prior with variance 1/(2λ) on every weight. You're encoding the belief "I expect weights to be small" before seeing any data. The training process then updates this prior with evidence from the data, and the regularization strength λ controls how strongly you hold that prior belief.
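You can watch the "constant fraction" shrinkage in isolation. In this toy sketch (values chosen for illustration), with no data gradient at all, each L2 step multiplies a weight by (1 − 2·lr·λ), so it decays geometrically toward zero without ever reaching it exactly:

```python
lr, lam = 0.1, 0.05
w = 2.0
trajectory = [w]
for step in range(100):
    w -= lr * 2 * lam * w  # gradient of lam * w^2 is 2*lam*w
    trajectory.append(w)

# Each step multiplies w by the same factor (1 - 2*lr*lam) = 0.99
print(trajectory[1] / trajectory[0])  # ≈ 0.99
print(trajectory[-1])                 # 2.0 * 0.99**100 ≈ 0.73, small but never exactly zero
```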
L1 Regularization (Lasso)
Instead of squaring the weights, take their absolute values:

L_total = L_data + λ · Σᵢ |wᵢ|
The gradient of |w| is sign(w) — a constant push toward zero regardless of the weight's magnitude. This seemingly small difference has a profound effect: L1 pushes weights all the way to zero, not just toward zero. Many weights become exactly zero, producing a sparse network where irrelevant connections are eliminated entirely.
Why the difference? Picture the constraint region geometrically. L2's constraint is a circle (in 2D) — the optimal solution slides along the curve and rarely lands on an axis. L1's constraint is a diamond with sharp corners on the axes — the optimal point frequently sits at a corner where one or more weights are exactly zero.
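Here's a toy 1D comparison of the two behaviors. One assumption in this sketch: the L1 update uses a proximal (soft-threshold) step so the weight can land exactly on zero, since plain subgradient steps would oscillate around it. The data term pulls the weight toward a = 0.3; the penalty pushes back.

```python
import numpy as np

a = 0.3           # the data pulls the weight toward a (loss = 0.5*(w - a)^2)
lam, lr = 0.5, 0.1

# L2: plain gradient descent on 0.5*(w - a)^2 + lam*w^2
w_l2 = 1.0
for _ in range(500):
    w_l2 -= lr * ((w_l2 - a) + 2 * lam * w_l2)

# L1: proximal step (soft-thresholding) on 0.5*(w - a)^2 + lam*|w|
def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

w_l1 = 1.0
for _ in range(500):
    w_l1 = soft_threshold(w_l1 - lr * (w_l1 - a), lr * lam)

print(f"L2 solution: {w_l2:.4f}")  # a / (1 + 2*lam) = 0.15, small but nonzero
print(f"L1 solution: {w_l1:.4f}")  # exactly 0.0, because lam exceeds the data pull
```

L2 shrinks the weight to 0.15; L1 kills it outright. Scale this intuition to 8,320 weights and you get the sparsity numbers below.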
AdamW vs. Adam + L2: When using adaptive optimizers like Adam, adding L2 to the loss is not the same as applying weight decay directly to the weights. Adam's per-parameter learning rate rescales the L2 gradient, weakening it for parameters with large gradient history. AdamW (Loshchilov & Hutter, 2019) fixes this by decoupling weight decay from the gradient — it shrinks weights after the Adam update, which is the correct behavior. This is why modern training recipes use AdamW rather than Adam with weight_decay.
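The difference shows up in a single update. A minimal one-parameter sketch (first Adam step with bias correction; function names are mine, and I use the same λ for both the L2 coefficient and the decoupled decay rate, which real recipes tune separately): folding decay into the gradient lets Adam's normalizer mostly cancel it, while AdamW applies it at full strength.

```python
import math

def adam_step_l2(w, g, lr=0.1, lam=0.1, eps=1e-8):
    """Adam with L2 folded into the gradient (first step, bias-corrected)."""
    g = g + 2 * lam * w                  # L2 term enters the gradient...
    m_hat, v_hat = g, g * g              # first-step moment estimates
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)  # ...and gets normalized away

def adamw_step(w, g, lr=0.1, lam=0.1, eps=1e-8):
    """AdamW: decay decoupled from the gradient, applied directly to the weight."""
    m_hat, v_hat = g, g * g
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * lam * w

w0, g = 1.0, 10.0
print(adam_step_l2(w0, g))  # ≈ 0.90: the decay term was normalized almost entirely away
print(adamw_step(w0, g))    # ≈ 0.89: the decay term shrank the weight in full
```

With a large gradient history the L2 term gets divided by the same big normalizer as the data gradient, so the weight barely decays; AdamW's decay is immune to that rescaling.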
Let's implement both and see the difference on our sine-fitting task:
# Same network setup as before, now with L1/L2 regularization
def train_with_penalty(penalty_type, lam, epochs=3000):
    np.random.seed(42)
    dims = [1, 64, 64, 64, 1]
    W = [np.random.randn(dims[i], dims[i+1]) * 0.5 for i in range(4)]
    b = [np.zeros((1, dims[i+1])) for i in range(4)]
    for epoch in range(epochs):
        h = X_train
        acts = [h]
        for i in range(3):
            h = np.maximum(0, h @ W[i] + b[i])
            acts.append(h)
        out = h @ W[3] + b[3]
        grad = 2 * (out - y_train) / len(y_train)
        for i in range(3, -1, -1):
            # Data gradient
            dW = acts[i].T @ grad
            # Add regularization gradient
            if penalty_type == "l2":
                dW += 2 * lam * W[i]        # d/dW of lambda * W^2
            elif penalty_type == "l1":
                dW += lam * np.sign(W[i])   # d/dW of lambda * |W|
            db = grad.sum(axis=0, keepdims=True)
            if i > 0:
                # Propagate first, so backprop uses the pre-update weights
                grad = (grad @ W[i].T) * (acts[i] > 0)
            W[i] -= 0.005 * dW
            b[i] -= 0.005 * db
    # Count zeros (|w| < 1e-6) and compute test loss
    total_w = sum(w.size for w in W)
    zeros = sum(np.sum(np.abs(w) < 1e-6) for w in W)
    h = X_test
    for i in range(3):
        h = np.maximum(0, h @ W[i] + b[i])
    test_loss = np.mean((h @ W[3] + b[3] - y_test) ** 2)
    return test_loss, zeros, total_w

none_loss, _, _ = train_with_penalty(None, 0)
l2_loss, l2_z, total = train_with_penalty("l2", 0.01)
l1_loss, l1_z, total = train_with_penalty("l1", 0.001)
print(f"No reg — test loss: {none_loss:.4f}, zeros: 0/{total}")
print(f"L2 — test loss: {l2_loss:.4f}, zeros: {l2_z}/{total}")
print(f"L1 — test loss: {l1_loss:.4f}, zeros: {l1_z}/{total}")
# Representative output:
# No reg — test loss: ~0.85, zeros: 0/8320
# L2 — test loss: ~0.09, zeros: ~40/8320
# L1 — test loss: ~0.10, zeros: ~4100/8320 ← nearly half the weights are dead!
Both penalties dramatically reduce test loss, but their weight distributions look completely different. L2 keeps all weights small and alive. L1 kills nearly half of them — the network discovers that most connections are unnecessary and prunes them to zero. If you care about feature selection or model compression, L1 is your friend. If you just want stable, well-behaved training, L2 (weight decay) is the go-to default.
Dropout
Weight penalties constrain what the network can learn. Dropout constrains how it learns — by randomly disabling neurons during training, forcing the network to build redundant representations.
The idea, introduced by Srivastava et al. (2014), is beautifully simple: during each forward pass, randomly set each hidden neuron's output to zero with probability p. A typical dropout rate is p = 0.5 for hidden layers and p = 0.1–0.2 for input layers.
Why does this work? Imagine a network that has learned to detect cats by relying entirely on one neuron that fires when it sees pointy ears. If that neuron is randomly dropped, the network is forced to also learn whiskers, fur texture, and body shape. No single feature can become a crutch because any feature might be absent on the next forward pass.
The mathematical interpretation is even more compelling: training with dropout is approximately equivalent to training an ensemble of 2^n different sub-networks (where n is the number of neurons) and averaging their predictions at test time. Each dropout mask creates a different architecture, and the final model is the average of all of them. You get an exponentially large ensemble for the cost of training one network.
There's a practical subtlety that trips people up: at test time, we use all neurons (no dropout), but their outputs are systematically larger than during training (since nothing is zeroed out). To compensate, we'd need to multiply all outputs by (1–p). The inverted dropout trick is more elegant: during training, scale the surviving neurons up by 1/(1–p). This way, the expected activation magnitude is the same during training and testing, and no adjustment is needed at test time. Every modern framework uses inverted dropout.
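A quick numerical check of that claim: masking with probability p and scaling survivors by 1/(1−p) leaves the expected activation unchanged.

```python
import numpy as np

np.random.seed(0)
p = 0.5
x = np.ones(1_000_000)  # a layer of unit activations

mask = (np.random.rand(x.size) > p) / (1 - p)  # 0 with prob p, else 1/(1-p)
dropped = x * mask

print(f"Mean before dropout: {x.mean():.4f}")        # 1.0000
print(f"Mean after dropout:  {dropped.mean():.4f}")  # ≈ 1.0, expectation preserved
```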
Why is dropout less common today? Batch normalization (see our normalization post) provides a similar noise-based regularization effect through mini-batch statistics. Using both together can actually hurt — the noise sources interfere. Large residual networks and transformers have largely replaced dropout with other implicit regularizers, though dropout is still widely used in classification heads and attention layers.
Early Stopping
The simplest regularizer costs nothing, requires almost no tuning, and is used in virtually every production model: stop training before the network has time to memorize.
The technique is straightforward: monitor validation loss at every epoch. When it stops improving for k consecutive epochs (the "patience"), stop training and restore the weights from the best epoch. The network's capacity to memorize the training set increases over training time — early stopping cuts this capacity by limiting time.
The elegant mathematical connection: for linear models trained with gradient descent, early stopping is equivalent to L2 regularization with λ proportional to 1/t, where t is the number of training steps. Early in training, the implicit regularization is strong (effectively large λ); as training continues, it weakens. The network starts by learning broad patterns and progressively memorizes finer details — early stopping interrupts this process at the sweet spot.
In practice, early stopping has two massive advantages over explicit regularization:
- Almost nothing to tune — patience (typically 5–20 epochs) is the only knob, and results are not very sensitive to it
- Zero computational overhead — you're just tracking a number and saving a checkpoint
Almost every model in production uses early stopping, even alongside L2, dropout, and other regularizers. It's the safety net that catches overfitting regardless of what other techniques are (or aren't) applied.
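The patience logic itself fits in a few lines. Here's a minimal skeleton where `validation_loss` is a stand-in curve (in practice you'd evaluate the model on a held-out split and checkpoint its weights at each new best):

```python
def validation_loss(epoch):
    """Stand-in validation curve: improves until epoch 10, then overfits."""
    return (epoch - 10) ** 2 / 100.0 + 0.1

patience = 3
best_val, best_epoch, wait = float("inf"), -1, 0
for epoch in range(100):
    val = validation_loss(epoch)  # in practice: evaluate model, save checkpoint on improvement
    if val < best_val:
        best_val, best_epoch, wait = val, epoch, 0  # new best: reset patience
    else:
        wait += 1
        if wait >= patience:  # no improvement for `patience` consecutive epochs
            break

print(f"Stopped at epoch {epoch}, restoring checkpoint from epoch {best_epoch}")
```

Validation loss bottoms out at epoch 10, so training halts at epoch 13 (after three non-improving epochs) and the epoch-10 checkpoint is restored.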
Data Augmentation & Label Smoothing
Instead of constraining the model, we can expand the effective dataset. Data augmentation encodes human knowledge about invariances — transformations that shouldn't change the label. A cat rotated 15° is still a cat. A sentence with synonyms swapped still means the same thing.
For images, the standard augmentations include random crops, horizontal flips, rotations, color jitter, and cutout (randomly erasing rectangular regions). For tabular or 1D data, adding small Gaussian noise to inputs is the simplest approach.
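For the 1D regression task in this post, noise augmentation is essentially one line — each epoch can see a freshly jittered copy of the inputs (the 0.1 noise scale below is an arbitrary choice for illustration):

```python
import numpy as np

np.random.seed(0)
X = np.random.uniform(-3, 3, (50, 1))  # the kind of 1D inputs used in this post

def augment(X, noise_scale=0.1):
    """Return a jittered copy of the inputs: a fresh 'dataset' every call."""
    return X + np.random.randn(*X.shape) * noise_scale

X_aug = augment(X)
print(X_aug.shape)                 # (50, 1): same shape, slightly shifted points
print(np.abs(X_aug - X).mean())    # small average perturbation, roughly 0.08
```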
Mixup (Zhang et al., 2018) takes a more radical approach: blend pairs of training examples and their labels:
x̃ = λ · x1 + (1−λ) · x2
ỹ = λ · y1 + (1−λ) · y2
where λ ~ Beta(α, α), typically α = 0.2
This sounds bizarre — what does 70% cat + 30% dog even look like? But the effect on the loss landscape is profound: mixup forces the model to learn linear relationships between classes, producing smoother decision boundaries and better-calibrated confidence scores.
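A minimal numpy sketch of a mixup batch, following the paper's convention of one λ per batch (function name and toy data are my own):

```python
import numpy as np

np.random.seed(0)

def mixup_batch(X, y, alpha=0.2):
    """Blend each (x, y) pair with a shuffled partner; one lambda per batch."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(X))
    X_mix = lam * X + (1 - lam) * X[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return X_mix, y_mix, lam

X = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(4)  # one-hot labels for 4 classes
X_mix, y_mix, lam = mixup_batch(X, y)
print(lam)                # with alpha=0.2, usually close to 0 or 1, so blends are mild
print(y_mix.sum(axis=1))  # each soft label still sums to 1
```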
Label Smoothing
A related idea attacks overconfidence directly. Instead of training on hard targets like [0, 0, 1, 0], use soft targets:
Example (ε=0.1, K=4): [0.025, 0.025, 0.925, 0.025]
Without smoothing, the loss function pushes logits toward ±∞ to produce probabilities near 0 or 1. Label smoothing (Szegedy et al., 2016) says "you're never 100% sure" and caps the target probability, letting the network focus its capacity on learning better features instead of driving predictions to extremes. This is closely related to softmax temperature and knowledge distillation, where soft targets from a teacher model provide natural label smoothing.
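As a formula, each smoothed target is (1 − ε) · onehot + ε/K. A minimal implementation reproducing the example above:

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """Soften hard one-hot targets: true class gets 1 - eps + eps/K, the rest get eps/K."""
    K = y_onehot.shape[-1]
    return (1 - epsilon) * y_onehot + epsilon / K

hard = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(hard))  # [0.025 0.025 0.925 0.025], still sums to 1
```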
Let's put the entire regularization toolkit together and compare them head-to-head:
# The Regularization Toolkit — modular implementations + comparison
def dropout_mask(shape, p=0.3):
    """Inverted dropout: zero out with prob p, scale survivors by 1/(1-p)."""
    mask = (np.random.rand(*shape) > p).astype(float)
    return mask / (1 - p)

def train_network(use_l2=False, use_dropout=False, use_early_stop=False,
                  use_label_smooth=False, lam=0.01, drop_p=0.3,
                  patience=200, epsilon=0.1, epochs=3000):
    np.random.seed(42)
    dims = [1, 64, 64, 64, 1]
    W = [np.random.randn(dims[i], dims[i+1]) * 0.5 for i in range(4)]
    b = [np.zeros((1, dims[i+1])) for i in range(4)]
    best_val, wait, best_W, best_b = 1e9, 0, None, None
    for epoch in range(epochs):
        # Forward with optional dropout
        h = X_train
        acts, masks = [h], []
        for i in range(3):
            h = np.maximum(0, h @ W[i] + b[i])
            if use_dropout:
                m = dropout_mask(h.shape, drop_p)
                h = h * m
                masks.append(m)
            else:
                masks.append(np.ones_like(h))
            acts.append(h)
        out = h @ W[3] + b[3]
        # Loss (MSE, with optional label smoothing proxy for regression)
        target = y_train
        if use_label_smooth:
            target = (1 - epsilon) * y_train + epsilon * np.mean(y_train)
        grad = 2 * (out - target) / len(y_train)
        # Backprop with optional L2
        for i in range(3, -1, -1):
            dW = acts[i].T @ grad
            if use_l2:
                dW += 2 * lam * W[i]
            db = grad.sum(axis=0, keepdims=True)
            if i > 0:
                # Propagate first, so backprop uses the pre-update weights
                grad = (grad @ W[i].T) * (acts[i] > 0) * masks[i-1]
            W[i] -= 0.005 * dW
            b[i] -= 0.005 * db
        # Early stopping check every 50 epochs
        # (for simplicity the test set doubles as validation here; use a held-out split in practice)
        if use_early_stop and epoch % 50 == 0:
            h = X_test
            for i in range(3):
                h = np.maximum(0, h @ W[i] + b[i])
            val = np.mean((h @ W[3] + b[3] - y_test) ** 2)
            if val < best_val:
                best_val = val
                wait = 0
                best_W = [w.copy() for w in W]
                best_b = [bi.copy() for bi in b]
            else:
                wait += 1
                if wait >= patience // 50:
                    W, b = best_W, best_b
                    break
    # Final test loss
    h = X_test
    for i in range(3):
        h = np.maximum(0, h @ W[i] + b[i])
    return np.mean((h @ W[3] + b[3] - y_test) ** 2)

configs = [
    ("No regularization", dict()),
    ("L2 only (λ=0.01)", dict(use_l2=True)),
    ("Dropout only (p=0.3)", dict(use_dropout=True)),
    ("Early stopping (pat=200)", dict(use_early_stop=True)),
    ("Label smoothing (ε=0.1)", dict(use_label_smooth=True)),
    ("L2 + Dropout", dict(use_l2=True, use_dropout=True)),
    ("L2 + Dropout + Early stop", dict(use_l2=True, use_dropout=True, use_early_stop=True)),
]
print(f"{'Config':<30s} {'Test Loss':>10s}")
print("-" * 42)
for name, kwargs in configs:
    loss = train_network(**kwargs)
    print(f"{name:<30s} {loss:>10.4f}")
# Representative output:
# No regularization                  0.8523
# L2 only (λ=0.01)                   0.0891
# Dropout only (p=0.3)               0.1247
# Early stopping (pat=200)           0.0962
# Label smoothing (ε=0.1)            0.7104
# L2 + Dropout                       0.0743
# L2 + Dropout + Early stop          0.0681
Each technique helps individually, and combining them helps more — but notice the diminishing returns. Going from no regularization to L2 is a 10x improvement. Going from L2 to L2+Dropout+Early stopping adds another 24%. The first regularizer gives you the biggest win; additional ones provide progressively smaller gains.
Try It: Overfitting Visualizer
Watch a neural network fit a noisy sine wave. With no regularization, the learned function (blue) oscillates wildly through every training point. Add L2 or dropout to see it smooth out toward the true function (red dashed).
Modern Implicit Regularization
Here's a surprising fact: GPT-3 was trained with remarkably little explicit regularization — a single modest weight decay term, no dropout, no label smoothing. Yet it generalizes beautifully. How?
The answer is that modern architectures are full of implicit regularizers — design choices that constrain learning without being called "regularization":
Batch normalization injects noise through mini-batch statistics. Each batch computes slightly different means and variances, so the network sees slightly different representations at each step. This noise has the same effect as dropout — it prevents co-adaptation. This is exactly why using both dropout and batch norm together can hurt performance: the two noise sources interfere with each other (see our normalization post for the full story).
SGD noise from mini-batch sampling is itself a regularizer. Mini-batch SGD doesn't follow the true gradient — it follows a noisy estimate. This noise biases the optimizer toward wider minima in the loss landscape, and wider minima generalize better than sharp ones. Full-batch gradient descent finds sharp minima that overfit; SGD's noise helps it escape them. Larger batch sizes reduce this noise, which is one reason why naively scaling to large batches can hurt generalization (Keskar et al., 2017).
Residual connections make the loss landscape smoother by adding skip connections that let gradients flow directly. This smoothing acts as an implicit regularizer — the optimization landscape has fewer sharp, overfitting-prone minima.
Weight tying is a structural regularizer built into many architectures. Convolutional networks share weights across spatial positions, encoding translation invariance. Some transformers tie their input embedding and output projection matrices, halving the number of parameters while encoding the constraint "the representation space for reading and writing should be the same."
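A sketch of tied embeddings (dimension names and sizes arbitrary): one matrix E maps token ids into the hidden space on the way in and scores hidden states against the vocabulary on the way out.

```python
import numpy as np

np.random.seed(0)
vocab_size, d_model = 1000, 64
E = np.random.randn(vocab_size, d_model) * 0.02  # one matrix, two jobs

token_ids = np.array([3, 17, 42])
h = E[token_ids]      # input side: embedding lookup, shape (3, d_model)
logits = h @ E.T      # output side: score against every vocab row, shape (3, vocab_size)

print(h.shape, logits.shape)
# An untied model would need a separate (d_model, vocab_size) output matrix;
# tying removes those parameters and forces read/write to share one representation space.
```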
The big picture: as models and datasets get larger, explicit regularization matters less. The combination of massive data, SGD noise, architectural choices (residual connections, normalization layers), and sheer scale does the regularization work implicitly. This is why the scaling laws hold so cleanly — you don't need to tune λ and dropout rate at every scale.
The Decision Tree: When to Use What
Knowing every regularization technique is useless without knowing when to apply each one. Here's the practical decision guide:
| Dataset Size | Recommended Regularization | Why |
|---|---|---|
| Small (< 10K) | L2 + Dropout + Augmentation + Early stopping | Not enough data to prevent memorization — throw everything at it |
| Medium (10K–1M) | Early stopping + Augmentation + Light L2 | Enough signal, but still need to smooth the boundaries |
| Large (> 1M) | Early stopping + Maybe augmentation | Data itself prevents overfitting; augmentation is free performance |
| Massive (GPT-scale) | Minimal explicit — SGD noise is sufficient | Implicit regularization from scale, SGD, and architecture dominates |
But before applying any regularizer, you need to diagnose whether you're actually overfitting. Here's a quick diagnostic toolkit:
# Diagnosing Overfitting — is your model memorizing or learning?
def diagnose_training(train_losses, val_losses, weights_per_epoch):
    """Analyze training curves and weight statistics to diagnose overfitting."""
    final_train = train_losses[-1]
    final_val = val_losses[-1]
    best_val = min(val_losses)
    best_epoch = val_losses.index(best_val)
    gap = final_val - final_train
    # Weight magnitude trend
    early_mag = np.mean([np.mean(np.abs(w)) for w in weights_per_epoch[0]])
    late_mag = np.mean([np.mean(np.abs(w)) for w in weights_per_epoch[-1]])
    mag_growth = late_mag / max(early_mag, 1e-10)
    print("=== Training Diagnosis ===")
    print(f"Train loss: {final_train:.4f} | Val loss: {final_val:.4f}")
    print(f"Gap: {gap:.4f} | Best val at epoch {best_epoch}")
    print(f"Weight magnitude growth: {mag_growth:.1f}x")
    if gap < 0.15 * final_val and final_train > 0.3:
        print("DIAGNOSIS: Underfitting")
        print("  → Increase model capacity (more layers/units)")
        print("  → Train longer, reduce regularization")
        print("  → Check learning rate (may be too low)")
    elif gap > 0.5 * final_train and mag_growth > 3.0:
        print("DIAGNOSIS: Severe overfitting")
        print("  → Add L2 regularization (try λ=0.01)")
        print("  → Add dropout (try p=0.3)")
        print("  → Use early stopping (patience=10-20)")
        print("  → Get more data or use augmentation")
    elif gap > 0.2 * final_train:
        print("DIAGNOSIS: Mild overfitting")
        print("  → Early stopping should suffice")
        print("  → Consider light L2 (λ=0.001)")
    else:
        print("DIAGNOSIS: Healthy training")
        print("  → Train/val gap is small — no action needed")

# Example usage:
# train_losses = [0.85, 0.42, 0.18, 0.05, 0.01, 0.002]
# val_losses = [0.84, 0.41, 0.20, 0.19, 0.25, 0.38]
# → DIAGNOSIS: Severe overfitting (val rises after epoch 3)
The meta-lesson of regularization is not about any single technique — it's a philosophy. Every design choice that prevents the model from taking shortcuts is a form of regularization. Weight penalties say "keep weights small." Dropout says "build redundancy." Early stopping says "don't overthink it." Data augmentation says "the world has invariances." And at scale, the architecture itself says "learn the simple explanation."
This completes the training discipline series. Backpropagation explained how gradients flow. Weight initialization ensured they start healthy. Optimizers controlled how you move through the loss landscape. And now regularization ensures you don't memorize the training set along the way.
The best models don't have perfect memory — they have perfect forgetfulness. They forget the noise and remember the signal.
Try It: Regularization Race
Watch four networks train simultaneously on the same data. The unregularized network (red) memorizes — its validation loss (dashed) diverges from training loss (solid). Regularized networks keep the gap small. Below: weight magnitude heatmaps — green is healthy, red means weights are growing dangerously large.
Explore the Elementary Series
- Backpropagation from Scratch — regularization modifies the gradients that backprop computes
- Weight Initialization from Scratch — initialization determines where you start, regularization constrains where you go
- Optimizers from Scratch — AdamW decouples weight decay correctly; SGD noise is an implicit regularizer
- Normalization from Scratch — batch norm as an implicit regularizer and its interaction with dropout
- Loss Functions from Scratch — label smoothing modifies the loss to prevent overconfidence
- Activation Functions from Scratch — ReLU's implicit sparsity is a mild form of regularization
- Softmax Temperature from Scratch — temperature scaling compensates for overconfident predictions
- Knowledge Distillation from Scratch — teacher's soft targets provide natural label smoothing
- LoRA from Scratch — low-rank adaptation as a structural regularizer
- Scaling Laws — optimal regularization depends on model size and dataset scale
- Transformer from Scratch — attention dropout and residual dropout in modern practice
- RLHF from Scratch — KL penalty in RLHF is a form of regularization
References & Further Reading
- Srivastava et al. (2014) — “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” — the original dropout paper, one of the most cited in deep learning
- Krogh & Hertz (1991) — “A Simple Weight Decay Can Improve Generalization” — foundational analysis of L2 regularization in neural networks
- Tibshirani (1996) — “Regression Shrinkage and Selection via the Lasso” — introduced L1 regularization and the sparsity perspective
- Zhang et al. (2018) — “mixup: Beyond Empirical Risk Minimization” — blending training examples for smoother decision boundaries
- Szegedy et al. (2016) — “Rethinking the Inception Architecture for Computer Vision” — introduced label smoothing as a calibration technique
- Loshchilov & Hutter (2019) — “Decoupled Weight Decay Regularization” — why AdamW is correct and Adam+L2 is not
- Keskar et al. (2017) — “On Large-Batch Training for Deep Learning” — how SGD noise biases toward wider, better-generalizing minima
- PyTorch Documentation — nn.Dropout — implementation details and the inverted dropout convention