Optimizers from Scratch: How Neural Networks Actually Learn
You Have a Gradient. Now What?
In our loss functions post, we built functions that measure exactly how wrong a neural network is. In micrograd, we built an autograd engine that traces every operation backward through a computation graph to compute gradients — the direction each parameter should move to reduce that wrongness. But we glossed over the most critical line of code in all of deep learning:
param -= learning_rate * param.grad
That's it. That single line is where learning actually happens. Everything else — the forward pass, the loss calculation, the backward pass — is just setup for this moment. And this naive version has serious problems that took researchers decades to solve.
Today we're building four optimizers from scratch in NumPy: vanilla SGD, Momentum, RMSProp, and Adam. We'll watch each one navigate a treacherous loss landscape, understand why the simple approach fails spectacularly, and see how three clever ideas combine into the optimizer that conquered modern deep learning.
The learning rate is the single most important hyperparameter in deep learning. Get it wrong and your model either learns nothing or explodes. Optimizers are the engineering that makes almost any learning rate work.
Vanilla Gradient Descent: Following the Slope
Let's start with the simplest possible optimizer. You have parameters, you have gradients, you subtract. The update rule is almost embarrassingly simple:
θ_{t+1} = θ_t − α · ∇L(θ_t)
Where θ is your parameters, α is the learning rate, and ∇L is the gradient of the loss. Here's the entire optimizer in eight lines:
import numpy as np

class SGD:
    """Vanilla stochastic gradient descent."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def step(self, params, grads):
        """Update parameters using raw gradients."""
        return [p - self.lr * g for p, g in zip(params, grads)]
Let's test it on a simple bowl-shaped loss — a quadratic function where the minimum is at the origin. This is the easiest possible optimization problem:
# A simple bowl: L(x, y) = x² + y²
# Gradient: [2x, 2y]
def bowl_loss(pos):
    return pos[0]**2 + pos[1]**2

def bowl_grad(pos):
    return np.array([2*pos[0], 2*pos[1]])

opt = SGD(lr=0.1)
pos = np.array([4.0, 3.0])
for step in range(15):
    loss = bowl_loss(pos)
    grad = bowl_grad(pos)
    print(f"Step {step:2d}: pos=({pos[0]:6.3f}, {pos[1]:6.3f}) loss={loss:.4f}")
    pos = np.array(opt.step([pos[0], pos[1]], [grad[0], grad[1]]))
# Step 0: pos=( 4.000, 3.000) loss=25.0000
# Step 1: pos=( 3.200, 2.400) loss=16.0000
# Step 2: pos=( 2.560, 1.920) loss=10.2400
# ...
# Step 14: pos=( 0.176, 0.132) loss=0.0484
Beautiful. Loss drops steadily, and the position moves straight toward the origin — on a symmetric bowl the gradient always points radially away from the minimum, so every step heads directly home. Vanilla SGD looks perfect — on a perfect problem. Now let's give it something harder.
Where Vanilla SGD Falls Apart
Real loss landscapes don't look like bowls. They look like ravines — narrow valleys where the loss drops steeply in one direction and gently in another. Think of a long, narrow canyon with a river at the bottom. The gradient points mostly across the canyon (the steep walls), not along it (toward the exit).
# A ravine: L(x, y) = 50x² + y²
# Steep across x, gentle along y
# Gradient: [100x, 2y]
def ravine_loss(pos):
    return 50 * pos[0]**2 + pos[1]**2

def ravine_grad(pos):
    return np.array([100 * pos[0], 2 * pos[1]])

opt = SGD(lr=0.01)  # can't use 0.1 — would diverge!
pos = np.array([1.0, 8.0])
for step in range(30):
    loss = ravine_loss(pos)
    grad = ravine_grad(pos)
    if step % 5 == 0:
        print(f"Step {step:2d}: pos=({pos[0]:7.4f}, {pos[1]:7.4f}) loss={loss:.4f}")
    pos = np.array(opt.step([pos[0], pos[1]], [grad[0], grad[1]]))
# Step 0: pos=( 1.0000, 8.0000) loss=114.0000
# Step 5: pos=( 0.0000, 7.2384) loss=52.3944
# Step 10: pos=(-0.0000, 6.5434) loss=42.8159
# Step 15: pos=( 0.0000, 5.9149) loss=34.9864
# Step 20: pos=(-0.0000, 5.3478) loss=28.5988
# Step 25: pos=( 0.0000, 4.8357) loss=23.3838
See the problem? The x-coordinate snaps to zero in a single step (with lr = 0.01, the x-update is x − 0.01 · 100x = 0), but the y-coordinate barely moves, crawling from 8 to just 4.8 over 25 steps. The learning rate is a balancing act: too high and x oscillates wildly; too low and y crawls. One learning rate cannot serve two masters.
The fundamental limitation of vanilla SGD: a single learning rate for all parameters. Parameters with large gradients need small steps. Parameters with small gradients need large steps. You can't have both.
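The balancing act is easy to check numerically. Here's a small standalone sketch that tracks only the ravine's steep x-coordinate under a few learning rates (the helper name is illustrative, not from the code above):

```python
def x_after(lr, steps, x0=1.0):
    """Apply the ravine's x-update, x <- x - lr * 100x, repeatedly."""
    x = x0
    for _ in range(steps):
        x = x - lr * 100 * x   # each step multiplies x by (1 - 100*lr)
    return x

print(x_after(0.005, 10))  # multiplier 0.5 per step: smooth convergence
print(x_after(0.01, 10))   # multiplier 0.0: snaps to zero in one step
print(x_after(0.015, 10))  # multiplier -0.5: oscillates across the ravine
print(x_after(0.1, 5))     # multiplier -9.0: diverges violently
```

Meanwhile at lr = 0.01 the y-update multiplies y by only 0.98 per step — the crawl we just watched.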
This is the problem that launched a thousand papers. Two very different ideas emerged to fix it.
Momentum: The Bowling Ball Rolling Downhill
The first idea is beautifully physical. Vanilla SGD teleports to wherever the gradient points — every step is independent, with no memory of where it's been. Momentum gives the optimizer mass and velocity. It becomes a bowling ball rolling across the loss landscape, building up speed in consistent directions and naturally dampening oscillations.
The update rule introduces a velocity vector v that accumulates past gradients:
v_t = β · v_{t−1} + ∇L(θ_t)
θ_{t+1} = θ_t − α · v_t
The β parameter (typically 0.9) controls how much history to keep. With β=0.9, you're effectively averaging over the last ~10 gradients. Here's why that fixes ravines: the oscillating component (across the narrow dimension) keeps flipping sign — positive, negative, positive, negative — and momentum averages these away. The consistent downhill component (along the valley floor) always points the same direction, so momentum amplifies it.
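Before the full optimizer, here's the averaging effect in isolation — a toy sketch of just the velocity recurrence, fed a consistent gradient versus an oscillating one:

```python
def final_velocity(grads, beta=0.9):
    """Run v = beta*v + g over a sequence of scalar gradients."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

steady = final_velocity([1.0] * 50)        # consistent downhill direction
flippy = final_velocity([1.0, -1.0] * 25)  # sign-flipping direction
print(f"steady: {steady:.3f}")  # approaches 1/(1-0.9) = 10: amplified ~10x
print(f"flippy: {flippy:.3f}")  # stays near ±0.5: oscillations mostly cancel
```

Same per-step gradient magnitude, a 20× difference in accumulated velocity — that's the whole ravine fix in two numbers.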
class SGDMomentum:
    """SGD with momentum — accumulates velocity from past gradients."""
    def __init__(self, lr=0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocity = None

    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = [np.zeros_like(g) for g in grads]
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            # Accumulate velocity: keep β of old direction, add new gradient
            self.velocity[i] = self.beta * self.velocity[i] + g
            # Step in the velocity direction
            updated.append(p - self.lr * self.velocity[i])
        return updated
Let's run it on the same ravine problem:
opt = SGDMomentum(lr=0.01, beta=0.9)
pos = np.array([1.0, 8.0])
for step in range(30):
    loss = ravine_loss(pos)
    grad = ravine_grad(pos)
    if step % 5 == 0:
        print(f"Step {step:2d}: pos=({pos[0]:7.4f}, {pos[1]:7.4f}) loss={loss:.4f}")
    pos = np.array(opt.step([pos[0], pos[1]], [grad[0], grad[1]]))
# Step 0: pos=( 1.0000, 8.0000) loss=114.0000
# Step 5: pos=( 0.0035, 5.6498) loss=32.4208
# Step 10: pos=(-0.0001, 3.3082) loss=10.9445
# Step 15: pos=( 0.0000, 1.7468) loss=3.0514
# Step 20: pos=(-0.0000, 0.8134) loss=0.6616
# Step 25: pos=( 0.0000, 0.3237) loss=0.1048
The difference is dramatic. After 25 steps, momentum has driven the loss down to 0.10 while vanilla SGD was still stuck at 23.38. The bowling ball metaphor is literal: momentum accumulates velocity along the valley floor and blasts through the slow direction that paralyzed vanilla SGD.
Momentum doesn't just go faster — it goes smarter. Oscillations cancel out in the velocity average. Consistent downhill motion accumulates. Same learning rate, radically different behavior.
RMSProp: A Different Fix (Per-Parameter Learning Rates)
Momentum solves the ravine problem by smoothing the direction of movement. Geoffrey Hinton had a completely different idea: instead of fixing the direction, adapt the step size for each parameter individually.
The insight is simple. Parameters with large gradients are oscillating — give them smaller steps. Parameters with small gradients are making slow progress — give them bigger steps. To know which is which, track the average magnitude of recent gradients:
s_t = β · s_{t−1} + (1−β) · g_t²
θ_{t+1} = θ_t − α · g_t / (√s_t + ε)
s is a running average of squared gradients. Dividing by √s normalizes each parameter's update: big gradients get divided by big numbers (smaller steps), small gradients get divided by small numbers (bigger steps). The ε (typically 1e−8) prevents division by zero.
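A quick standalone sketch of that normalization: give two parameters wildly different steady gradients, let s warm up, and compare the effective step sizes (this is the update's scaling in isolation, not the full optimizer):

```python
import numpy as np

lr, beta, eps = 0.01, 0.9, 1e-8
g = np.array([100.0, 0.01])      # one huge gradient, one tiny
s = np.zeros_like(g)
for _ in range(50):              # steady gradients let s converge to g**2
    s = beta * s + (1 - beta) * g**2
step_sizes = lr * np.abs(g) / (np.sqrt(s) + eps)
print(step_sizes)  # both effective steps land near lr = 0.01
```

A 10,000× spread in gradient magnitudes collapses to nearly identical step sizes — exactly the per-parameter adaptation the ravine needs.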
Fun historical fact: Hinton proposed RMSProp in his 2012 Coursera lecture slides. He never wrote a paper. One of the most widely-used algorithms in deep learning was published as a slide deck.
class RMSProp:
    """RMSProp — per-parameter adaptive learning rates."""
    def __init__(self, lr=0.01, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.sq_avg = None

    def step(self, params, grads):
        if self.sq_avg is None:
            self.sq_avg = [np.zeros_like(g) for g in grads]
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            # Track running average of squared gradients
            self.sq_avg[i] = self.beta * self.sq_avg[i] + (1 - self.beta) * g**2
            # Adapt step size: divide by root-mean-square of recent gradients
            updated.append(p - self.lr * g / (np.sqrt(self.sq_avg[i]) + self.eps))
        return updated
RMSProp attacks the ravine from the opposite angle. Instead of building momentum to barrel through slow directions, it shrinks the step size in the oscillating direction and grows it in the slow direction. Both approaches work — and as you might guess, combining them works even better.
Adam: Combining Momentum and Adaptive Rates
In 2014, Diederik Kingma and Jimmy Ba published a paper called "Adam: A Method for Stochastic Optimization" that combined both ideas into one optimizer. The name stands for Adaptive Moment Estimation: it tracks the first moment (mean) of gradients like momentum, and the second moment (uncentered variance) like RMSProp.
The update rules are straightforward — it's momentum plus RMSProp:
m_t = β1 · m_{t−1} + (1−β1) · g_t    (first moment: like momentum)
v_t = β2 · v_{t−1} + (1−β2) · g_t²    (second moment: like RMSProp)
But there's a catch. Both m and v are initialized to zero. In the first few steps, they're severely biased toward zero because the exponential average hasn't had time to warm up. Consider step 1 with β1 = 0.9: m_1 = 0.9 · 0 + 0.1 · g_1 = 0.1 · g_1. The estimate is 10× too small.
Bias Correction: The Detail Everyone Skips
Kingma and Ba's key insight was a clean mathematical fix. After t steps, the expected value of m_t is off by a factor of (1 − β1^t). At step 1 with β1 = 0.9, that factor is 0.1, so we divide by it to correct:

m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
θ_{t+1} = θ_t − α · m̂_t / (√v̂_t + ε)

At step 1: 1 − 0.9^1 = 0.1, so dividing multiplies the estimate by 10. By step 10: 1 − 0.9^10 ≈ 0.65, a mild correction. By step 100: 1 − 0.9^100 ≈ 1.0, no correction needed. The bias naturally vanishes as the running average warms up.
Without bias correction, Adam's first steps are miscalibrated: m_1 is 10× too small, but √v_1 is roughly 30× too small (β2 = 0.999 makes v warm up far more slowly), so the ratio m/√v comes out about 3× too large. The correction factors 1/(1 − β^t) rescale both moments so the update is properly sized from step 1. This is the detail that separates Adam from "just momentum plus RMSProp."
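The correction factors themselves are easy to verify directly (a quick numeric sanity check, nothing new):

```python
beta1, beta2 = 0.9, 0.999
for t in [1, 10, 100, 1000]:
    c1 = 1 - beta1 ** t  # divide m_t by this
    c2 = 1 - beta2 ** t  # divide v_t by this
    print(f"t={t:4d}  1-beta1^t={c1:.4f}  1-beta2^t={c2:.4f}")
```

Note the asymmetry: with β2 = 0.999, the second moment's correction stays far from 1 for roughly a thousand steps — one reason learning rate warmup (covered below) matters in practice.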
Here's the full implementation:
class Adam:
    """Adam — adaptive moment estimation with bias correction."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None  # first moment (mean of gradients)
        self.v = None  # second moment (mean of squared gradients)
        self.t = 0     # step counter for bias correction

    def step(self, params, grads):
        if self.m is None:
            self.m = [np.zeros_like(g) for g in grads]
            self.v = [np.zeros_like(g) for g in grads]
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            # Update first moment (momentum-like)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            # Update second moment (RMSProp-like)
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            # Bias correction — crucial for early steps
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            # Update: momentum direction, adaptive step size
            updated.append(p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps))
        return updated
Twenty lines of code. That's the optimizer behind GPT, DALL-E, AlphaFold, and virtually every transformer ever trained. The defaults (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8) work remarkably well across a vast range of problems — a big reason for Adam's dominance.
The Showdown: Four Optimizers, One Loss Landscape
Let's race all four optimizers on the same challenging surface — a loss function with a narrow ravine and a gentle curve, designed to punish naive approaches:
# Beale-like surface: ravine + curvature
# Minimum near (3, 0.5)
def loss_fn(pos):
    x, y = pos
    return (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2

def grad_fn(pos):
    x, y = pos
    a = 1.5 - x + x*y
    b = 2.25 - x + x*y**2
    dldx = 2*a*(-1 + y) + 2*b*(-1 + y**2)
    dldy = 2*a*x + 2*b*(2*x*y)
    return np.array([dldx, dldy])

start = np.array([0.5, 3.5])
optimizers = {
    "SGD": SGD(lr=0.0005),
    "Momentum": SGDMomentum(lr=0.0005, beta=0.9),
    "RMSProp": RMSProp(lr=0.005, beta=0.9),
    "Adam": Adam(lr=0.01, beta1=0.9, beta2=0.999),
}

results = {}
for name, opt in optimizers.items():
    pos = start.copy()
    trajectory = [pos.copy()]
    for step in range(200):
        grad = grad_fn(pos)
        grad = np.clip(grad, -10, 10)  # clip for stability
        pos = np.array(opt.step([pos[0], pos[1]], [grad[0], grad[1]]))
        trajectory.append(pos.copy())
    results[name] = {"final_loss": loss_fn(pos), "final_pos": pos, "steps": trajectory}

print(f"{'Optimizer':<12} {'Final Loss':>12} {'Final Position':>20}")
print("-" * 48)
for name, r in results.items():
    pos = r['final_pos']
    print(f"{name:<12} {r['final_loss']:12.6f} ({pos[0]:.4f}, {pos[1]:.4f})")
# Optimizer Final Loss Final Position
# ------------------------------------------------
# SGD 12.690121 (0.6856, 2.9379)
# Momentum 0.481937 (1.6862, 0.8754)
# RMSProp 0.002648 (2.8614, 0.5138)
# Adam 0.000001 (2.9998, 0.5001)
Adam finds the minimum with six-decimal precision. SGD barely moves from the starting point — the learning rate that's safe for the steep direction is far too small for the gentle one. Momentum makes real progress but oscillates. RMSProp gets close but without momentum's directional smoothing, it's a little wobbly near the end.
| Optimizer | Final Loss | Key Strength | Key Weakness |
|---|---|---|---|
| SGD | 12.69 | Simplicity | Single learning rate for all params |
| Momentum | 0.48 | Smooths oscillations | Still one step size for everyone |
| RMSProp | 0.003 | Per-parameter adaptation | No directional memory |
| Adam | 0.000001 | Both smoothing AND adaptation | Slightly more memory |
What the Textbooks Don't Tell You
The four optimizers above cover the core ideas. But practical deep learning has sharp edges that trip up even experienced practitioners. Here are the ones that matter most.
Weight Decay vs. L2 Regularization: They're Not the Same
With vanilla SGD, weight decay and L2 regularization are mathematically equivalent. Adding (λ/2)·||θ||² to the loss produces a gradient term λθ that gets subtracted from the parameters (scaled by the learning rate), which is exactly what weight decay does. But with Adam, the L2 gradient term gets divided by √v̂ like everything else — the adaptive scaling couples the regularization strength to the gradient magnitude, which is not what you want.
Ilya Loshchilov and Frank Hutter showed in their 2019 paper that decoupling weight decay from the adaptive update (applying it directly to parameters, not through the gradient) works dramatically better. This is AdamW, and it's now the default for training transformers:
# AdamW: the right way to regularize with Adam
# Run the standard Adam update (same as before), then apply weight decay
# DIRECTLY to the parameters, not through the gradient.

# WRONG (L2 through the gradient — gets divided by √v_hat):
#   grad = grad + weight_decay * param   # couples decay to gradient scale

# RIGHT (AdamW — decoupled):
#   param = param - lr * m_hat / (√v_hat + ε)   # Adam step
#   param = param - lr * weight_decay * param   # separate decay
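Sketched as a full class, mirroring the Adam implementation above with one extra `weight_decay` parameter (an illustrative sketch of the decoupled update, not the canonical reference implementation):

```python
import numpy as np

class AdamW:
    """Adam with decoupled weight decay — a sketch after Loshchilov & Hutter."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.weight_decay = weight_decay
        self.m, self.v, self.t = None, None, 0

    def step(self, params, grads):
        if self.m is None:
            self.m = [np.zeros_like(g) for g in grads]
            self.v = [np.zeros_like(g) for g in grads]
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            p = p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)  # Adam step
            p = p - self.lr * self.weight_decay * p                # decoupled decay
            updated.append(p)
        return updated
```

The decay line never touches m or v, so regularization strength no longer depends on gradient history.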
Learning Rate Warmup
Transformers are famously unstable in early training. The second moment estimate v in Adam needs several hundred steps to calibrate. Before it's ready, the adaptive scaling is unreliable and large early gradients can push parameters into bad regions permanently. The fix is simple: start with a tiny learning rate and linearly ramp it up over the first few thousand steps, giving Adam's moment estimates time to warm up.
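A minimal linear warmup schedule might look like this (the `base_lr` and `warmup_steps` values are illustrative defaults, not prescriptions):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=2000):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it flat."""
    return base_lr * min(1.0, step / warmup_steps)

# In a training loop you'd set opt.lr = warmup_lr(step) before each update.
for s in [0, 500, 2000, 10000]:
    print(f"step {s:5d}: lr = {warmup_lr(s):.6f}")
```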
When to Use What
Despite Adam's dominance, SGD with momentum still has a place:
- AdamW — the universal default for transformers, language models, and most modern architectures
- SGD + Momentum — sometimes generalizes better for CNNs and image classification (well-tuned SGD can beat Adam's final accuracy)
- RMSProp — still popular in reinforcement learning (it's what DeepMind used in the original DQN)
- Vanilla SGD — educational purposes and convex optimization where you want guarantees
Try It: The Optimizer Race
Pick a loss landscape, set a learning rate, and watch SGD, Momentum, RMSProp, and Adam race to the minimum. The star marks the global minimum. Watch how each optimizer handles ravines, saddle points, and multiple valleys differently.
What We Didn't Cover
Optimizers are a deep field. Here's what we left out — each of these could fill its own post:
- Second-order methods (L-BFGS, natural gradient, K-FAC) — use curvature information for even smarter steps, but too expensive for large networks
- Learning rate schedules — cosine annealing, step decay, cyclical learning rates. How you change the learning rate over training matters as much as its initial value
- Gradient clipping — truncating large gradients to prevent exploding updates, essential for RNNs and early transformer training
- Newer optimizers — LAMB and LARS for large-batch training, Lion (Google, 2023), discovered by program search, and Sophia, which uses diagonal Hessian estimates
- Distributed optimization — gradient averaging across GPUs, local SGD, and how to scale batch sizes without killing convergence
We've now covered the full forward-to-backward pipeline of modern deep learning: tokenize → embed → attend → softmax → loss → optimize → backprop. Every piece you need to understand how a neural network transforms text into predictions and improves itself, built from scratch in Python.
References & Further Reading
- Kingma & Ba (2014) — Adam: A Method for Stochastic Optimization — the paper that launched a thousand training runs. One of the most cited ML papers ever.
- Loshchilov & Hutter (2019) — Decoupled Weight Decay Regularization — the AdamW paper showing why L2 and weight decay diverge with adaptive optimizers.
- Hinton (2012) — Neural Networks for Machine Learning, Lecture 6 — the Coursera slides where RMSProp was born.
- Ruder (2016) — An Overview of Gradient Descent Optimization Algorithms — the best survey of optimizers in one place. Covers everything from SGD to Adam and beyond.
- DadOps — Micrograd from Scratch — where we built the autograd engine that computes the gradients these optimizers consume.
- DadOps — Loss Functions from Scratch — the error signal that optimizers are minimizing.
- DadOps — Softmax & Temperature from Scratch — the function that produces the probabilities optimizers refine.