Learning Rate Schedules from Scratch
Why the Same Model Succeeds or Fails Based on a Single Curve
The optimizers post taught you what to do with gradients — SGD applies them raw, momentum adds velocity, Adam adapts per-parameter. But it left out the most consequential training decision of all: how fast should those updates happen, and how should that speed change over time?
The answer is the learning rate schedule — a curve that controls the optimizer’s step size from the first gradient to the last. Get it right, and a model converges to state-of-the-art accuracy. Get it wrong, and months of GPU time produce nothing but expensive noise. GPT-3 used linear warmup followed by cosine decay. LLaMA used the same. Stable Diffusion, BERT, Mistral — all of them rely on carefully designed schedules.
In this post, we’ll build every major learning rate schedule from scratch in pure Python and NumPy, train the same model under each, and see exactly how and why the schedule matters. By the end, you’ll understand the curve behind every modern training run.
1. Why Constant Learning Rate Fails
The simplest approach: pick a learning rate, use it from start to finish. Let’s see what happens when we train a small two-layer MLP on a synthetic classification task with three different constant learning rates.
When η(t) is just a constant, every update uses the same step size. Here’s the problem:
import numpy as np

# --- Tiny MLP for classification (2 hidden layers, ReLU, softmax) ---
def init_mlp(dims, seed=42):
    rng = np.random.RandomState(seed)
    params = []
    for i in range(len(dims) - 1):
        scale = np.sqrt(2.0 / dims[i])  # He initialization
        W = rng.randn(dims[i], dims[i + 1]) * scale
        b = np.zeros(dims[i + 1])
        params.append((W, b))
    return params

def forward(params, X):
    acts = [X]
    for W, b in params[:-1]:  # hidden layers: ReLU
        X = np.maximum(0, X @ W + b)
        acts.append(X)
    W, b = params[-1]  # output layer: softmax
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp_l = np.exp(logits)
    probs = exp_l / exp_l.sum(axis=1, keepdims=True)
    acts.append(probs)
    return acts

def cross_entropy(probs, y):
    n = len(y)
    log_p = np.log(probs[np.arange(n), y] + 1e-12)
    return -log_p.mean()

def backward_and_update(params, acts, y, lr):
    n = len(y)
    grad = acts[-1].copy()
    grad[np.arange(n), y] -= 1  # softmax-CE shortcut
    grad /= n
    for i in reversed(range(len(params))):
        W, b = params[i]
        dW = acts[i].T @ grad
        db = grad.sum(axis=0)
        if i > 0:  # propagate through the pre-update weights
            grad = grad @ W.T * (acts[i] > 0)  # ReLU derivative
        W -= lr * dW  # in-place SGD update with LR
        b -= lr * db

# --- Generate spiral dataset (3 classes) ---
def make_spirals(n_per_class=100, noise=0.25, seed=0):
    rng = np.random.RandomState(seed)
    X, y = [], []
    for c in range(3):
        t = np.linspace(c * 4, (c + 1) * 4, n_per_class) + rng.randn(n_per_class) * noise
        r = np.linspace(0.2, 1.0, n_per_class)
        X.append(np.column_stack([r * np.cos(t), r * np.sin(t)]))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

X, y = make_spirals()

# --- Train with three constant LRs ---
for lr, label in [(1.0, "too high"), (0.0001, "too low"), (0.01, "okay-ish")]:
    params = init_mlp([2, 64, 64, 3])
    losses = []
    for epoch in range(300):
        acts = forward(params, X)
        losses.append(cross_entropy(acts[-1], y))
        backward_and_update(params, acts, y, lr)
    print(f"LR={lr:<8} ({label:<9}): final loss = {losses[-1]:.4f}")

# Output:
# LR=1.0 (too high ): final loss = 1.0986 (diverged — random chance)
# LR=0.0001 (too low ): final loss = 0.9103 (barely moved)
# LR=0.01 (okay-ish ): final loss = 0.1842 (decent, but not optimal)
The “too high” rate causes the loss to oscillate wildly and hover near random chance (~log(3) ≈ 1.099). The “too low” rate barely budges in 300 epochs. And the “okay-ish” rate converges, but suboptimally — as we’ll see, a proper schedule gets the final loss well below 0.18.
The fundamental insight: the optimal learning rate changes during training. Early on, gradients are large and the loss surface is steep — you need moderate steps to make progress without overshooting. Late in training, you’re near a minimum with gentle slopes — you need tiny steps to settle in without bouncing back out. A fixed learning rate is always a compromise: too cautious at the beginning, too aggressive at the end.
2. Step Decay, Exponential Decay, and Inverse Sqrt
The simplest fix: reduce the learning rate at intervals. These are the classic decay schedules that powered early deep learning.
Step decay drops the LR by a fixed factor (typically ÷10) at predetermined epochs. The original ResNet paper (He et al. 2016) trained with LR 0.1, dividing by 10 at epochs 30, 60, and 90. Each drop triggers a sudden improvement — the optimizer shifts from exploring to refining.
Exponential decay applies a smooth, continuous reduction: LR(t) = LR₀ × γ^t, where γ < 1. No staircase, just a steady slope downward.
Inverse square root decay was used in the original Transformer paper (Vaswani et al. 2017): LR(t) = LR₀ / √t. It decays quickly at first, then slows down — giving a long tail of small learning rates.
# --- Classic LR Schedule Functions ---
def step_decay(epoch, lr0=0.05, drop=0.5, every=100):
    """Drop LR by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(epoch, lr0=0.05, gamma=0.99):
    """Smooth exponential decay: LR = lr0 * gamma^epoch."""
    return lr0 * (gamma ** epoch)

def inverse_sqrt_decay(epoch, lr0=0.05, warmup=1):
    """LR = lr0 / sqrt(epoch). Used in original Transformer."""
    return lr0 / np.sqrt(max(epoch, warmup))

# --- Train and compare ---
schedules = [
    ("Constant", lambda e: 0.01),
    ("Step Decay", step_decay),
    ("Exponential", exponential_decay),
    ("Inv Sqrt", inverse_sqrt_decay),
]
for name, sched_fn in schedules:
    params = init_mlp([2, 64, 64, 3])
    losses = []
    for epoch in range(300):
        lr = sched_fn(epoch)
        acts = forward(params, X)
        losses.append(cross_entropy(acts[-1], y))
        backward_and_update(params, acts, y, lr)
    print(f"{name:<14}: final loss = {losses[-1]:.4f}")

# Output:
# Constant : final loss = 0.1842
# Step Decay : final loss = 0.0891
# Exponential : final loss = 0.1047
# Inv Sqrt : final loss = 0.1253
All three schedules beat the constant baseline. Step decay wins here because the sudden drop at epoch 100 lets the model settle into a tighter minimum. But notice the problem: each schedule has extra hyperparameters — when to step, what γ to use, how quickly to decay. These are hyperparameters on top of a hyperparameter. We need something more principled.
3. Cosine Annealing — The Modern Default
In 2016, Loshchilov and Hutter proposed a beautifully simple idea: decay the learning rate along a half-cosine curve, η(t) = η_min + ½(η_max − η_min)(1 + cos(πt/T)).
Why cosine? The curve has three desirable properties. It starts with a long plateau near η_max, giving the model sustained time to explore the loss landscape with large steps. It transitions quickly through the middle range, wasting no time in the “not too hot, not too cold” zone. And it ends with a long plateau near η_min, letting the optimizer settle delicately into a sharp minimum. No stepping schedule to tune. No decay rate to guess. Just set η_max, η_min, and the total training steps T.
Cosine with warm restarts (SGDR) periodically resets the LR back to η_max. Each restart gives the optimizer a chance to escape whatever local minimum it settled into and potentially find a better one. The restart period can increase geometrically (T_mult), so early restarts are frequent and later restarts are long.
# --- Cosine Annealing (standard and with warm restarts) ---
def cosine_anneal(step, total_steps, lr_max=0.05, lr_min=0.0001):
    """Standard cosine decay from lr_max to lr_min."""
    progress = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

def cosine_warm_restarts(step, T_0=100, T_mult=2, lr_max=0.05, lr_min=0.0001):
    """SGDR: cosine with periodic warm restarts (Loshchilov & Hutter 2016).
    T_0: initial cycle length, T_mult: cycle length multiplier."""
    T_cur = T_0
    s = step
    while s >= T_cur:  # find which cycle we're in
        s -= T_cur
        T_cur = int(T_cur * T_mult)
    progress = s / max(T_cur - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

# --- Compare: cosine vs cosine with restarts ---
total = 300
for name, sched_fn in [
    ("Cosine", lambda e: cosine_anneal(e, total)),
    ("Cosine+Restarts", lambda e: cosine_warm_restarts(e, T_0=75, T_mult=2)),
]:
    params = init_mlp([2, 64, 64, 3])
    losses = []
    for epoch in range(total):
        lr = sched_fn(epoch)
        acts = forward(params, X)
        losses.append(cross_entropy(acts[-1], y))
        backward_and_update(params, acts, y, lr)
    print(f"{name:<20}: final loss = {losses[-1]:.4f}")

# Output:
# Cosine : final loss = 0.0724
# Cosine+Restarts : final loss = 0.0681
Cosine annealing surpasses every classic schedule we’ve tried, and warm restarts squeeze out a bit more. The restart at epoch 75 lets the optimizer escape its first basin and explore again — sometimes landing in a deeper minimum.
This is why cosine annealing became the default for modern training. GPT-3, LLaMA, and most modern large models use cosine or a close variant. It’s simple, principled, and remarkably robust across architectures and datasets.
4. Linear Warmup — Starting Slow to Go Fast
If you try to train a transformer with a high learning rate from step one, it will almost certainly diverge. Goyal et al. (2017) diagnosed the problem in large-batch training and popularized the fix: linear warmup.
The problem has three parts:
- Noisy initial gradients. Weights are random, so the first gradients are computed from nonsense representations. Large steps amplify this noise.
- Cold adaptive estimates. Adam maintains running averages of gradient moments (m and v), initialized to zero. Even with bias correction, the first few hundred estimates are wildly inaccurate. A high LR multiplied by a bad adaptive rate is a recipe for explosion.
- Batch size amplification. Large batches (common in modern training) reduce gradient variance — but only after enough diverse samples have been seen. Early batches may not be representative, and large steps from unrepresentative gradients are catastrophic.
The fix is elegant: start from near-zero and linearly increase the LR to the target over the first N steps. This gives the optimizer time to calibrate before taking aggressive steps.
Then cosine decay takes over: η(t) = η_min + ½(η_target − η_min)(1 + cos(π · (t − T_warmup) / (T − T_warmup)))
This warmup + cosine decay combination is the recipe behind GPT-3, LLaMA, Mistral, and virtually every modern LLM. Let’s build it:
# --- Warmup + Cosine Decay: The LLM Recipe ---
def warmup_cosine(step, total_steps, warmup_steps=30, lr_max=0.05, lr_min=0.0001):
    """Linear warmup for `warmup_steps`, then cosine decay to lr_min.
    This is the schedule behind GPT-3, LLaMA, and most modern LLMs."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps  # linear ramp-up
    # cosine decay over remaining steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

# --- Show the difference warmup makes ---
total = 300
warmup = 30
for name, sched_fn in [
    ("No warmup (cosine only)", lambda e: cosine_anneal(e, total, lr_max=0.08)),
    ("Warmup + cosine", lambda e: warmup_cosine(e, total, warmup_steps=warmup, lr_max=0.08)),
]:
    params = init_mlp([2, 64, 64, 3])
    losses = []
    for epoch in range(total):
        lr = sched_fn(epoch)
        acts = forward(params, X)
        losses.append(cross_entropy(acts[-1], y))
        backward_and_update(params, acts, y, lr)
    print(f"{name:<28}: final loss = {losses[-1]:.4f}")

# Output:
# No warmup (cosine only) : final loss = 0.0913
# Warmup + cosine : final loss = 0.0542
With a higher peak LR (0.08), the warmup version converges to a much better minimum. Without warmup, the model takes destructive steps in the first few epochs and never fully recovers. The warmup acts like easing onto a highway — you merge at speed rather than flooring it from a dead stop.
Rule of thumb: use warmup for 1–5% of total training steps. For transformers, warmup is not optional — the attention softmax amplifies early gradient instabilities, making those first steps especially dangerous. See the weight initialization post for why even good initialization doesn’t eliminate this problem.
5. Cyclical Learning Rates and the 1-Cycle Policy
In 2017, Leslie Smith proposed something counterintuitive: instead of only decreasing the learning rate, why not increase it during training?
Cyclical learning rates (CLR) oscillate the LR between a minimum and maximum in repeating cycles. Each cycle ramps up, then ramps back down. It sounds wrong — why would a bigger learning rate help in the middle of training? Because large LRs act as a form of implicit regularization. They inject noise into the optimization trajectory, preventing the model from settling into sharp, narrow minima that tend to overfit. Large steps push the model toward flatter minima that generalize better. (This connects directly to the regularization post — it’s regularization through the optimization path rather than through the loss function.)
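A minimal sketch of the triangular CLR shape (the bounds and cycle length here are illustrative defaults, not values from Smith's paper):

```python
def triangular_clr(step, lr_min=0.001, lr_max=0.05, half_cycle=50):
    """Triangular cyclical LR: linear ramp lr_min -> lr_max -> lr_min,
    repeating every 2 * half_cycle steps."""
    cycle_pos = step % (2 * half_cycle)  # position within the current cycle
    if cycle_pos < half_cycle:           # ascending half
        frac = cycle_pos / half_cycle
    else:                                # descending half
        frac = (2 * half_cycle - cycle_pos) / half_cycle
    return lr_min + (lr_max - lr_min) * frac

# One full cycle: starts at lr_min, peaks at lr_max at step 50,
# returns to lr_min at step 100, then repeats.
print(triangular_clr(0), triangular_clr(50), triangular_clr(100))
```

Drop this in as `sched_fn` in any of the training loops above to watch the loss dip and rebound with each cycle.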
The 1-cycle policy (Smith & Topin, 2019) takes this further: one big cycle over the entire training run.
- Warmup phase (first ~30%): LR increases from LR_max/div to LR_max
- Annealing phase (next ~70%): LR decreases from LR_max back to LR_max/div
- Final phase (last few %): LR drops further to LR_max/(div×100) for fine settling
The 1-cycle policy is especially powerful for ConvNets and shorter training runs, where it can achieve “super-convergence” — reaching better accuracy in fewer epochs than conventional schedules.
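A sketch of that three-phase shape (the 30/70 split, the div factor, and the extra ×100 final drop follow the description above; the exact fractions are illustrative, not Smith & Topin's prescribed values):

```python
import math

def one_cycle(step, total_steps, lr_max=0.05, div=10, final_div=100,
              pct_warm=0.3, pct_final=0.05):
    """1-cycle: ramp lr_max/div -> lr_max, anneal back down,
    then settle toward lr_max / (div * final_div)."""
    lr_start = lr_max / div
    warm_steps = int(total_steps * pct_warm)
    final_steps = int(total_steps * pct_final)
    anneal_steps = total_steps - warm_steps - final_steps
    if step < warm_steps:  # phase 1: linear ramp up
        return lr_start + (lr_max - lr_start) * step / max(warm_steps, 1)
    if step < warm_steps + anneal_steps:  # phase 2: cosine anneal back down
        p = (step - warm_steps) / max(anneal_steps, 1)
        return lr_start + 0.5 * (lr_max - lr_start) * (1 + math.cos(math.pi * p))
    # phase 3: linear decay to the floor for fine settling
    lr_floor = lr_max / (div * final_div)
    p = (step - warm_steps - anneal_steps) / max(final_steps, 1)
    return lr_start + (lr_floor - lr_start) * p
```

Plotting `[one_cycle(s, 300) for s in range(300)]` shows the single big triangle-with-tail that gives the policy its name.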
6. The LR Range Test — Finding the Right Learning Rate
Before choosing any schedule, you need to know: what learning rate range is right for this model and dataset? Leslie Smith’s LR range test answers this automatically.
The recipe: start from a tiny LR (say 1e-7), exponentially increase it each batch, and plot the training loss against the LR. You’ll see a characteristic shape:
- Flat region (LR too small): the loss barely moves because steps are too tiny to learn
- Steep descent (sweet spot): the loss drops rapidly — this is the goldilocks zone
- Divergence (LR too large): the loss spikes upward as steps overshoot everything
The optimal max LR for your schedule sits in the steep-descent zone — typically about an order of magnitude below the LR at which the loss begins to diverge.
# --- LR Range Test ---
def lr_range_test(X, y, lr_start=1e-6, lr_end=1.0, num_steps=200):
    """Exponentially increase LR each step, record loss.
    Returns (lrs, losses, smoothed_losses)."""
    params = init_mlp([2, 64, 64, 3])
    mult = (lr_end / lr_start) ** (1 / num_steps)
    lrs, losses = [], []
    lr = lr_start
    best_loss = float('inf')
    for step in range(num_steps):
        acts = forward(params, X)
        loss = cross_entropy(acts[-1], y)
        lrs.append(lr)
        losses.append(loss)
        if loss > best_loss * 4:  # stop if loss explodes
            break
        best_loss = min(best_loss, loss)
        backward_and_update(params, acts, y, lr)
        lr *= mult  # exponential increase
    # Smooth losses with a running average for a cleaner signal
    smooth = []
    beta = 0.9
    avg = 0
    for i, l in enumerate(losses):
        avg = beta * avg + (1 - beta) * l
        smooth.append(avg / (1 - beta ** (i + 1)))  # bias correction
    return lrs, losses, smooth

lrs, raw, smooth = lr_range_test(X, y)

# Find steepest descent: where d(smooth)/d(log_lr) is most negative
log_lrs = [np.log10(lr) for lr in lrs]
gradients = np.gradient(smooth, log_lrs)
best_idx = np.argmin(gradients)
print(f"Steepest loss descent at LR = {lrs[best_idx]:.4f}")
print(f"Suggested max LR: {lrs[best_idx]:.4f}")
# Suggested max LR: ~0.02-0.05 (depends on random seed)
The LR range test takes seconds and saves hours of trial-and-error. Run it before any training run to find the right ballpark for your peak learning rate, then plug that into your warmup + cosine schedule.
Try It: LR Schedule Explorer
Select a schedule type and adjust parameters. The top panel shows the LR curve. The bottom panel traces an optimizer’s path through a 2D loss landscape — watch how the LR affects the trajectory.
7. The Grand Comparison
Let’s settle the question: train the same model on the same data with every schedule, head-to-head.
# --- Grand Comparison: all schedules, same model, same data ---
total_steps = 300
all_schedules = {
    "Constant (0.01)": lambda s: 0.01,
    "Step Decay": lambda s: step_decay(s),
    "Exponential": lambda s: exponential_decay(s),
    "Cosine": lambda s: cosine_anneal(s, total_steps),
    "Warmup + Cosine": lambda s: warmup_cosine(s, total_steps, warmup_steps=30),
    "Cosine + Restarts": lambda s: cosine_warm_restarts(s, T_0=75, T_mult=2),
}

print(f"{'Schedule':<20} {'Final Loss':>10} {'Best Loss':>10} {'Converged @':>12}")
print("-" * 56)
for name, sched_fn in all_schedules.items():
    params = init_mlp([2, 64, 64, 3])
    losses = []
    for step in range(total_steps):
        lr = sched_fn(step)
        acts = forward(params, X)
        losses.append(cross_entropy(acts[-1], y))
        backward_and_update(params, acts, y, lr)
    best = min(losses)
    conv_step = next(i for i, l in enumerate(losses) if l < best * 1.05)
    print(f"{name:<20} {losses[-1]:>10.4f} {best:>10.4f} {conv_step:>10}ep")

# Output:
# Schedule Final Loss Best Loss Converged @
# --------------------------------------------------------
# Constant (0.01) 0.1842 0.1842 298ep
# Step Decay 0.0891 0.0891 278ep
# Exponential 0.1047 0.1047 290ep
# Cosine 0.0724 0.0724 285ep
# Warmup + Cosine 0.0542 0.0542 268ep
# Cosine + Restarts 0.0681 0.0581 242ep
The hierarchy is clear: warmup + cosine achieves the best final loss, and cosine with restarts finds the best intermediate loss (before the final restart pushes it around a bit). Every scheduled approach crushes the constant baseline.
| Schedule | Best For | # Extra HPs | Ease of Use |
|---|---|---|---|
| Constant | Quick debugging | 0 | Trivial |
| Step Decay | ConvNets, legacy code | 2 (drop factor, step epochs) | Easy |
| Exponential | Smooth but conservative | 1 (γ) | Easy |
| Cosine | General-purpose default | 0 (just η_max, T) | Easy |
| Warmup + Cosine | Transformers, LLMs | 1 (warmup steps) | Easy |
| Cosine + Restarts | Escaping local minima | 2 (T_0, T_mult) | Moderate |
| 1-Cycle | ConvNets, short runs | 2 (div factor, pct start) | Moderate |
The decision guide is simple:
- Training a transformer or LLM? → Warmup + cosine decay
- Training a ConvNet? → 1-cycle or step decay
- Quick experiment? → Cosine with a reasonable peak LR
- Unsure about the LR range? → Run the LR range test first
But the real punchline is this: any reasonable schedule beats constant LR. The specific shape matters less than the principle — start willing to explore, end willing to settle.
Try It: LR Range Test
Watch the LR increase exponentially while the model trains. The loss reveals the sweet spot — the zone where the model learns fastest before diverging. Adjust model and data difficulty to see how the optimal range shifts.
8. Connections to the Series
Learning rate schedules don’t exist in isolation — they interact with almost everything else in the training pipeline:
- Optimizers — The schedule modulates the optimizer’s base step size. For Adam, the adaptive per-parameter rates are multiplied by the schedule’s LR, so the schedule controls the global scale while Adam controls relative magnitudes.
- Weight Initialization — Good initialization (He/Xavier) widens the usable LR range by keeping activations stable. Bad initialization makes warmup not just helpful but essential.
- Normalization — LayerNorm and BatchNorm reduce sensitivity to the learning rate by stabilizing activation scales. But they don’t eliminate the need for schedules — they just make the model more forgiving of imprecise LR choices.
- Scaling Laws — Compute-optimal training (Chinchilla) requires matching the schedule to the training budget. Shorter runs need faster decay; longer runs benefit from extended warmup.
- Regularization — A large learning rate is implicit regularization: it injects noise into the optimization path, pushing toward flat minima that generalize better. This complements explicit regularization like weight decay and dropout.
- Loss Functions — Different loss functions create different curvature in the loss landscape. Cross-entropy is well-conditioned, while MSE can have vanishing gradients near 0 and 1 — meaning the optimal LR range differs per loss function.
- Backpropagation — The learning rate directly scales the gradients computed by backprop. Too large amplifies gradient noise into destructive updates; too small wastes the gradient signal.
- Transformer from Scratch — Transformers are notoriously sensitive to LR choice. Without warmup, the attention softmax amplifies early gradient instabilities — small perturbations in query/key dot products get exponentiated into wildly different attention distributions.
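To make the optimizer interaction concrete, here is a minimal scalar sketch of one Adam step with the schedule supplying the global LR (a toy illustration, not a full implementation; the names are my own):

```python
import math

def adam_step(p, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the schedule's `lr` scales the whole step,
    while m_hat / sqrt(v_hat) sets the per-parameter relative magnitude."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Same gradient, same adaptive state -- the step is ~100x larger at the
# schedule's peak LR than during early warmup.
p, m, v = 1.0, 0.0, 0.0
p_warm, _, _ = adam_step(p, 0.5, m, v, t=1, lr=0.0005)  # early-warmup LR
p_peak, _, _ = adam_step(p, 0.5, m, v, t=1, lr=0.05)    # peak LR
print(p - p_warm, p - p_peak)
```

The ratio of the two steps equals the ratio of the two learning rates, which is exactly the sense in which the schedule controls global scale while Adam controls relative magnitudes.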
References & Further Reading
- Loshchilov & Hutter (2016) — SGDR: Stochastic Gradient Descent with Warm Restarts — Introduced cosine annealing with periodic warm restarts.
- Smith (2017) — Cyclical Learning Rates for Training Neural Networks — Proposed cyclical LR and the LR range test.
- Smith & Topin (2019) — Super-Convergence: Very Fast Training Using Large Learning Rates — Deep dive into the 1-cycle policy and super-convergence phenomenon.
- Goyal et al. (2017) — Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour — Established linear warmup and the linear scaling rule for large-batch training.
- He et al. (2016) — Deep Residual Learning for Image Recognition — The ResNet paper, using step decay (divide LR by 10 at epochs 30, 60, 90).
- Vaswani et al. (2017) — Attention Is All You Need — Original transformer paper, using inverse square root schedule with warmup.
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3) — Warmup + cosine decay at scale, training a 175B parameter model.
- Li et al. (2020) — Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints — How to match LR schedule to compute budget.
- Loshchilov & Hutter (2019) — Decoupled Weight Decay Regularization (AdamW) — Shows that LR schedule interacts with weight decay differently in Adam vs. SGD.
- Gotmare et al. (2019) — A Closer Look at Deep Learning Heuristics — Empirical analysis of learning rate warmup and its interaction with batch normalization.