Backpropagation from Scratch: How Neural Networks Learn by Going Backwards
The Algorithm That Makes Everything Else Possible
Every neural network ever trained—from a 3-neuron XOR solver to GPT-4's trillion parameters—learned through the same algorithm. Not gradient descent (that's the optimizer). Not the loss function (that's the objective). The algorithm is backpropagation: the systematic process of computing how much each weight contributed to the error, by working backwards through the network one layer at a time.
The micrograd post built an autograd engine, but treated gradient computation as a black box. This post opens the box. Backprop is the chain rule applied systematically to computational graphs—but understanding it deeply explains why residual connections work, why vanishing gradients kill deep networks, and why certain activations train better than others.
Let's derive backprop from the chain rule, implement it by hand for a full MLP, verify it against numerical gradients, and build the intuition for gradient flow that will transform how you read every other post in the series.
The Chain Rule: One Equation That Powers All of Deep Learning
Backpropagation is just the chain rule, applied carefully. The single-variable chain rule says: if f(x) = g(h(x)), then:

f'(x) = g'(h(x)) · h'(x)
Each function in the chain only needs to know its own local derivative—how its output changes with respect to its input. The chain rule multiplies these local derivatives together to get the end-to-end derivative.
Let's make it concrete. Consider f(x) = sin(x² + 3x). Let u = x² + 3x. Then df/dx = cos(u) · (2x + 3). At x = 2: u = 10, so df/dx = cos(10) · 7 ≈ -5.874. Two local derivatives, multiplied together. That's the entire idea.
For neural networks, a weight w affects the loss L through a chain of operations: multiply by input, add bias, apply activation, pass to next layer, ..., compute loss. The chain rule tells us ∂L/∂w by multiplying all the local derivatives along that chain. The deeper the weight, the longer the chain—which is both the power and the vulnerability of backpropagation.
But how do we know our chain rule computation is correct? We check it against brute force:
import numpy as np
def numerical_gradient(f, x, eps=1e-7):
    """Compute df/dx using central finite differences.
    This is the 'ground truth' that backprop must match."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)
# Test 1: f(x) = sin(x^2 + 3x) at x=2
def f(x):
    return np.sin(x**2 + 3*x)

def f_grad_analytic(x):
    """Hand-derived: df/dx = cos(x^2 + 3x) * (2x + 3)"""
    return np.cos(x**2 + 3*x) * (2*x + 3)
x = 2.0
num_grad = numerical_gradient(f, x)
ana_grad = f_grad_analytic(x)
print(f"Numerical gradient: {num_grad:.10f}")
print(f"Analytic gradient: {ana_grad:.10f}")
print(f"Difference: {abs(num_grad - ana_grad):.2e}")
# Test 2: Gradient of a neural network weight
# Simple: y = sigmoid(w*x + b), L = (y - target)^2
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
x_val, target = 1.5, 1.0
w, b = 0.8, -0.2
def loss_fn(w_val):
    z = w_val * x_val + b
    y = sigmoid(z)
    return (y - target) ** 2
# Numerical gradient for w
num_dw = numerical_gradient(loss_fn, w)
# Analytic gradient via chain rule:
# dL/dw = dL/dy * dy/dz * dz/dw
z = w * x_val + b
y = sigmoid(z)
dL_dy = 2 * (y - target) # loss gradient
dy_dz = y * (1 - y) # sigmoid gradient
dz_dw = x_val # linear gradient
ana_dw = dL_dy * dy_dz * dz_dw
print(f"\nNeural net weight gradient:")
print(f"Numerical: {num_dw:.10f}")
print(f"Analytic: {ana_dw:.10f}")
print(f"Difference: {abs(num_dw - ana_dw):.2e}")
# Output:
# Numerical gradient: -5.8735006980
# Analytic gradient: -5.8735007035
# Difference: 5.51e-09
#
# Neural net weight gradient:
# Numerical: -0.1586312778
# Analytic: -0.1586312784
# Difference: 5.17e-10
Differences on the order of 10⁻⁹ to 10⁻¹⁰—that's numerical precision noise, not a real discrepancy. This is the test suite for backprop: if your hand-derived gradients don't match numerical gradients to ~8 decimal places, you have a bug. No exceptions.
Computational Graphs: Seeing the Network as a Diagram
Every neural network computation can be drawn as a directed acyclic graph (DAG). Nodes are operations (add, multiply, ReLU, matmul). Edges carry values forward and gradients backward.
The forward pass evaluates nodes left to right, computing each operation's output from its inputs. The backward pass traverses right to left, propagating gradients using the chain rule. At each node, the computation is purely local:
The upstream gradient is ∂L/∂output—how the loss changes with this node's output (received from the right). The local gradient is ∂output/∂input—how this node's output changes with its input (a property of the operation itself). Their product gives the downstream gradient: ∂L/∂input—passed to the left.
Each node only needs to know two things: what operation it performs (for the local gradient) and what arrived from the right (the upstream gradient). It doesn't need to know anything about the rest of the network. This locality is what makes backprop both efficient and modular.
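This locality can be made concrete with a single node. The sketch below (the `MulNode` class is an illustrative name, not from any library) caches its inputs on the forward pass and turns an upstream gradient into downstream gradients using only its own local derivatives:

```python
# Sketch: a multiply node knows only its cached inputs (for the local
# gradient) and whatever upstream gradient arrives from the right.
class MulNode:
    def forward(self, a, b):
        self.a, self.b = a, b       # cache inputs for the backward pass
        return a * b

    def backward(self, upstream):
        # downstream = upstream * local gradient, for each input
        da = upstream * self.b      # local grad w.r.t. a is b (swap rule)
        db = upstream * self.a      # local grad w.r.t. b is a
        return da, db

node = MulNode()
out = node.forward(3.0, 5.0)        # forward output: 15.0
da, db = node.backward(2.0)         # pretend dL/d(out) = 2.0
print(out, da, db)                  # 15.0 10.0 6.0
```

The node never sees the loss function or any other layer; chaining thousands of such nodes together is all that backprop does.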
Local Gradients: The Building Blocks
A neural network is built from a small set of operations. Each one has a simple local gradient:
- Addition (c = a + b): ∂c/∂a = 1, ∂c/∂b = 1 — gradient flows equally to both inputs. Addition is a gradient distributor.
- Multiplication (c = a × b): ∂c/∂a = b, ∂c/∂b = a — the gradient is the other input. This "swap rule" is why we need to cache intermediate values during the forward pass.
- Matrix multiply (C = XW): ∂L/∂X = (∂L/∂C)Wᵀ, ∂L/∂W = Xᵀ(∂L/∂C) — transposes appear naturally.
- ReLU (y = max(0, x)): ∂y/∂x = 1 if x > 0, else 0 — a binary gate that either passes the gradient through untouched or kills it.
- Sigmoid (y = σ(x)): ∂y/∂x = σ(x)(1 - σ(x)) — elegant self-referential form. Maximum value is 0.25, which is the root of the vanishing gradient problem.
- Softmax + cross-entropy: the combined gradient simplifies to ŷ - y — prediction minus target. The most beautiful cancellation in deep learning.
Let's implement and verify each one:
import numpy as np
def check(name, analytic, numerical, eps=1e-7):
    diff = np.max(np.abs(analytic - numerical))
    status = "PASS" if diff < 1e-5 else "FAIL"
    print(f" {name}: diff={diff:.2e} [{status}]")

def num_grad_array(f, x, eps=1e-5):
    """Numerical gradient for array-valued input."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_flat = x.flat
        old = x_flat[i]
        x_flat[i] = old + eps
        fplus = f(x.copy())
        x_flat[i] = old - eps
        fminus = f(x.copy())
        x_flat[i] = old
        grad.flat[i] = (fplus - fminus) / (2 * eps)
    return grad
# --- Addition ---
a, b = 3.0, 5.0
upstream = 1.0 # dL/dc
da, db = upstream * 1.0, upstream * 1.0 # local grad = 1
print("Addition: da=1.0, db=1.0 ✓ (gradient distributor)")
# --- Multiplication ---
a, b = 3.0, 5.0
da, db = b, a # swap rule
print(f"Multiply: da={da} (=b), db={db} (=a) ✓ (swap rule)")
# --- ReLU ---
x = np.array([-2.0, 0.5, -0.1, 3.0])
relu_grad = (x > 0).astype(float)
print(f"ReLU grad: {relu_grad} ✓ (binary gate)")
# --- Sigmoid ---
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.array([0.0, 1.0, -1.0, 5.0])
sig = sigmoid(x)
sig_grad = sig * (1 - sig)
print(f"Sigmoid grad: {np.round(sig_grad, 4)}")
print(f" Max sigmoid grad = {sig_grad.max():.4f} (at x=0)")
# --- Matrix multiply ---
np.random.seed(42)
X = np.random.randn(3, 4) # 3 samples, 4 features
W = np.random.randn(4, 2) # 4 features -> 2 outputs
dC = np.random.randn(3, 2) # upstream gradient
dX_analytic = dC @ W.T # shape: (3, 4)
dW_analytic = X.T @ dC # shape: (4, 2)
# Verify dW numerically
scalar_loss = lambda W_: np.sum(dC * (X @ W_)) # proxy
dW_numerical = num_grad_array(scalar_loss, W.copy())
check("MatMul dW", dW_analytic, dW_numerical)
# --- Softmax + Cross-Entropy ---
logits = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0]) # one-hot
# Softmax
exp_l = np.exp(logits - logits.max())
probs = exp_l / exp_l.sum()
# Cross-entropy loss
loss = -np.sum(target * np.log(probs + 1e-12))
# The beautiful result: gradient = probs - target
grad_logits = probs - target
print(f"\nSoftmax+CE gradient: {np.round(grad_logits, 4)}")
print(f" = probs - target = {np.round(probs, 4)} - {target}")
print(f" (prediction minus truth — that's it!)")
# Output:
# Addition: da=1.0, db=1.0 ✓ (gradient distributor)
# Multiply: da=5.0 (=b), db=3.0 (=a) ✓ (swap rule)
# ReLU grad: [0. 1. 0. 1.] ✓ (binary gate)
# Sigmoid grad: [0.25 0.1966 0.1966 0.0066]
# Max sigmoid grad = 0.2500 (at x=0)
# MatMul dW: diff=1.89e-11 [PASS]
#
# Softmax+CE gradient: [-0.341 0.2424 0.0986]
# = probs - target = [0.659 0.2424 0.0986] - [1. 0. 0.]
# (prediction minus truth — that's it!)
That softmax + cross-entropy result deserves emphasis. The Jacobian of softmax alone is a full matrix—messy to derive, messy to compute. But when you compose it with cross-entropy loss, everything cancels to give ŷ - y: the prediction minus the truth. This simplification is why cross-entropy is the default loss for classification—it's not just theoretically motivated, it gives the cleanest possible gradient signal.
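Because the cancellation looks almost too good, it is worth confirming it numerically. This sketch (using the same illustrative logits as above) compares the analytic probs - target gradient against central finite differences on the cross-entropy loss:

```python
import numpy as np

def ce_loss(logits, target):
    """Softmax followed by cross-entropy, as one scalar function."""
    exp_l = np.exp(logits - logits.max())
    probs = exp_l / exp_l.sum()
    return -np.sum(target * np.log(probs + 1e-12))

logits = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0])  # one-hot

# Analytic gradient: probs - target
exp_l = np.exp(logits - logits.max())
probs = exp_l / exp_l.sum()
analytic = probs - target

# Numerical gradient, one logit at a time
eps = 1e-6
numerical = np.zeros_like(logits)
for i in range(logits.size):
    lp, lm = logits.copy(), logits.copy()
    lp[i] += eps
    lm[i] -= eps
    numerical[i] = (ce_loss(lp, target) - ce_loss(lm, target)) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # ~1e-10: the cancellation is real
```

The full softmax Jacobian never has to be materialized; composing the two operations first, then differentiating, does the messy algebra for free.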
Backprop Through a Full Network: The Manual Way
Now let's put it all together. We'll build a 3-layer MLP from scratch, implement both the forward and backward passes by hand, train it on XOR, and verify every gradient against numerical finite differences.
XOR is the perfect test case: four data points, a non-linear decision boundary that requires at least one hidden layer, and small enough to trace every single gradient by hand.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class ManualMLP:
    """3-layer MLP with fully manual forward and backward passes."""

    def __init__(self):
        np.random.seed(3)
        # Layer 1: 2 -> 4 (hidden)
        self.W1 = np.random.randn(2, 4) * 0.5
        self.b1 = np.zeros((1, 4))
        # Layer 2: 4 -> 4 (hidden)
        self.W2 = np.random.randn(4, 4) * 0.5
        self.b2 = np.zeros((1, 4))
        # Layer 3: 4 -> 1 (output)
        self.W3 = np.random.randn(4, 1) * 0.5
        self.b3 = np.zeros((1, 1))

    def forward(self, X):
        """Forward pass — cache everything for backward."""
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = np.maximum(0, self.z2)  # ReLU
        self.z3 = self.a2 @ self.W3 + self.b3
        self.a3 = sigmoid(self.z3)  # Sigmoid output
        return self.a3

    def loss(self, y_pred, y_true):
        """Binary cross-entropy loss."""
        eps = 1e-12
        return -np.mean(
            y_true * np.log(y_pred + eps) +
            (1 - y_true) * np.log(1 - y_pred + eps)
        )

    def backward(self, y_true):
        """Backward pass — chain rule, layer by layer, right to left."""
        m = y_true.shape[0]
        # Step 1: Loss gradient -> sigmoid output
        # d(BCE)/d(a3) combined with sigmoid gives:
        dz3 = (self.a3 - y_true) / m
        # Step 2: Layer 3 weight gradients
        self.dW3 = self.a2.T @ dz3
        self.db3 = np.sum(dz3, axis=0, keepdims=True)
        # Step 3: Propagate to layer 2 activation
        da2 = dz3 @ self.W3.T
        # Step 4: Through ReLU (binary gate)
        dz2 = da2 * (self.z2 > 0).astype(float)
        # Step 5: Layer 2 weight gradients
        self.dW2 = self.a1.T @ dz2
        self.db2 = np.sum(dz2, axis=0, keepdims=True)
        # Step 6: Propagate to layer 1 activation
        da1 = dz2 @ self.W2.T
        # Step 7: Through ReLU
        dz1 = da1 * (self.z1 > 0).astype(float)
        # Step 8: Layer 1 weight gradients
        self.dW1 = self.X.T @ dz1
        self.db1 = np.sum(dz1, axis=0, keepdims=True)

    def step(self, lr=0.5):
        """Gradient descent update."""
        self.W1 -= lr * self.dW1
        self.b1 -= lr * self.db1
        self.W2 -= lr * self.dW2
        self.b2 -= lr * self.db2
        self.W3 -= lr * self.dW3
        self.b3 -= lr * self.db3
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
net = ManualMLP()
# Verify gradients numerically before training
y_pred = net.forward(X)
L = net.loss(y_pred, y)
net.backward(y)
def loss_for_param(param_name, net, X, y):
    """Compute loss with a given parameter."""
    y_pred = net.forward(X)
    return net.loss(y_pred, y)

# Check W1 gradient numerically
eps = 1e-5
dW1_num = np.zeros_like(net.W1)
for i in range(net.W1.shape[0]):
    for j in range(net.W1.shape[1]):
        old = net.W1[i, j]
        net.W1[i, j] = old + eps
        lp = loss_for_param('W1', net, X, y)
        net.W1[i, j] = old - eps
        lm = loss_for_param('W1', net, X, y)
        net.W1[i, j] = old
        dW1_num[i, j] = (lp - lm) / (2 * eps)
max_diff = np.max(np.abs(net.dW1 - dW1_num))
print(f"W1 gradient check: max diff = {max_diff:.2e}")
# Train on XOR
net = ManualMLP()
for epoch in range(2000):
    y_pred = net.forward(X)
    L = net.loss(y_pred, y)
    net.backward(y)
    net.step(lr=1.0)
    if epoch % 500 == 0:
        print(f"Epoch {epoch:4d}: loss={L:.4f}")
# Final predictions
y_pred = net.forward(X)
print(f"\nFinal predictions:")
for i in range(4):
    print(f" {X[i]} -> {y_pred[i,0]:.4f} (target: {y[i,0]:.0f})")
# Output:
# W1 gradient check: max diff = 3.24e-12
# Epoch 0: loss=0.6974
# Epoch 500: loss=0.0027
# Epoch 1000: loss=0.0012
# Epoch 1500: loss=0.0008
#
# Final predictions:
# [0. 0.] -> 0.0000 (target: 0)
# [0. 1.] -> 0.9978 (target: 1)
# [1. 0.] -> 1.0000 (target: 1)
# [1. 1.] -> 0.0000 (target: 0)
The gradient check passes with a difference of ~10⁻¹², confirming our hand-derived backward pass is correct. And the network solves XOR cleanly—the non-linear decision boundary emerges from just two hidden layers of ReLU activations, trained entirely by our manual backprop implementation.
Notice the mirror structure: the forward pass goes X → z1 → a1 → z2 → a2 → z3 → a3, and the backward pass reverses it exactly: dz3 → da2 → dz2 → da1 → dz1. Each forward operation has a corresponding backward step, and we traverse them in reverse order. This symmetry is why it's called back-propagation.
Vanishing and Exploding Gradients
Here's the critical insight about depth. The gradient at layer k is the product of all local gradients from layer k to the output. If each local gradient has magnitude less than 1, the product shrinks exponentially with depth. If each has magnitude greater than 1, the product explodes.
Sigmoid's fatal flaw: its maximum derivative is 0.25 (at x = 0). After 10 layers, the gradient is at most 0.25¹⁰ ≈ 10⁻⁶. The first layer receives a gradient a million times smaller than the last. It barely learns at all.
ReLU's fix: its derivative is exactly 1 for positive inputs. Gradients flow through without shrinking—though "dead ReLU" neurons (permanently negative) still kill gradient flow locally.
Residual connections: the deepest fix. If y = F(x) + x, then ∂y/∂x = ∂F/∂x + 1. That "+ 1" creates a gradient highway—even if F's gradient vanishes, the skip connection preserves a gradient of exactly 1. This is why ResNets can train 152 layers and why every transformer block uses residual connections.
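Before scaling this up to a full network, the "+ 1" can be checked on a scalar example. In this sketch, F is an arbitrary illustrative function chosen to be saturated (near-zero gradient) at the test point, yet the residual form still passes a gradient of ~1:

```python
import numpy as np

def F(x):
    return np.tanh(5 * x)      # saturated (derivative ~0) for |x| well above 0

def residual(x):
    return F(x) + x            # y = F(x) + x

x, eps = 3.0, 1e-6
dF = (F(x + eps) - F(x - eps)) / (2 * eps)                  # gradient of F alone
dy = (residual(x + eps) - residual(x - eps)) / (2 * eps)    # gradient with skip
print(f"dF/dx = {dF:.6f}")     # ~0: tanh is fully saturated at x=3
print(f"dy/dx = {dy:.6f}")     # ~1: the skip path preserves the gradient
```

Even with F contributing essentially nothing, the identity path guarantees ∂y/∂x ≈ 1, which is exactly what keeps gradients alive through a deep stack of such blocks.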
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def measure_gradient_flow(n_layers, activation='sigmoid', residual=False):
    """Measure how much gradient reaches layer 1 in an n-layer network."""
    np.random.seed(42)
    dim = 16
    x = np.random.randn(1, dim)
    # Forward pass — store activations and pre-activations
    weights = []
    pre_acts = []
    acts = [x]
    for i in range(n_layers):
        W = np.random.randn(dim, dim) * (2.0 / dim) ** 0.5  # Kaiming init
        weights.append(W)
        z = acts[-1] @ W
        pre_acts.append(z)
        if activation == 'sigmoid':
            a = sigmoid(z)
        else:
            a = np.maximum(0, z)  # ReLU
        if residual and i > 0:
            a = a + acts[-1]  # skip connection
        acts.append(a)
    # Backward pass — track gradient magnitude
    grad = np.ones_like(acts[-1])  # dL/d(output) = 1
    for i in range(n_layers - 1, -1, -1):
        if residual and i > 0:
            grad_skip = grad.copy()  # residual path
        if activation == 'sigmoid':
            s = sigmoid(pre_acts[i])
            local = s * (1 - s)
        else:
            local = (pre_acts[i] > 0).astype(float)
        grad = grad * local @ weights[i].T
        if residual and i > 0:
            grad = grad + grad_skip  # add skip gradient
    return np.mean(np.abs(grad))
# Compare across depths
print(f"{'Layers':>8} {'Sigmoid':>12} {'ReLU':>12} {'Sig+Resid':>12}")
print("-" * 48)
for n in [2, 5, 10, 15, 20]:
    g_sig = measure_gradient_flow(n, 'sigmoid', False)
    g_relu = measure_gradient_flow(n, 'relu', False)
    g_res = measure_gradient_flow(n, 'sigmoid', True)
    print(f"{n:>8} {g_sig:>12.2e} {g_relu:>12.2e} {g_res:>12.2e}")
# Output:
# Layers Sigmoid ReLU Sig+Resid
# ------------------------------------------------
# 2 6.22e-02 1.44e+00 2.06e-01
# 5 5.00e-04 2.43e+00 1.99e-01
# 10 7.59e-06 5.16e+00 1.97e-01
# 15 7.03e-09 1.32e+00 1.61e-01
# 20 1.99e-11 1.24e+00 1.57e-01
The numbers tell the whole story. At 20 layers, the sigmoid gradient has shrunk to 10⁻¹¹—effectively zero. ReLU maintains a healthy gradient around 1.0, over ten billion times larger. And sigmoid with residual connections recovers to ~0.16—the skip connections rescue the gradient flow even through a bad activation function.
This single table explains why the deep learning revolution needed three ingredients: ReLU activations (the activation functions post), careful weight initialization like Xavier/Kaiming (to keep variance at ~1 per layer), and residual connections (the transformer post uses them in every block). Without healthy gradient flow, no amount of data or compute can train a deep network.
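The initialization ingredient can be demonstrated in a few lines too. This sketch (layer width, depth, and the "naive" scale of 0.01 are illustrative choices) pushes activations through a deep ReLU stack and compares a naive small init against Kaiming's (2/fan_in)^0.5 scale:

```python
import numpy as np

# Sketch: Kaiming init keeps activation scale roughly constant through
# ReLU layers; a naive small init collapses it exponentially.
np.random.seed(0)
dim, n_layers = 256, 20
x = np.random.randn(512, dim)

stds = {}
for name, scale in [("naive 0.01", 0.01), ("kaiming", (2.0 / dim) ** 0.5)]:
    a = x.copy()
    for _ in range(n_layers):
        W = np.random.randn(dim, dim) * scale
        a = np.maximum(0, a @ W)  # ReLU layer
    stds[name] = a.std()
    print(f"{name}: activation std after {n_layers} layers = {stds[name]:.3e}")
```

With the naive scale, the activation standard deviation collapses toward zero within 20 layers; with Kaiming it stays on the order of 1. Since backward-pass magnitudes mirror forward-pass magnitudes, collapsing activations mean collapsing gradients.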
Try It: Gradient Flow Visualizer
Watch how gradients flow backward through a network. Adjust depth and activation to see vanishing gradients in action.
Try It: Backprop Step-by-Step
Step through the backward pass of a tiny network, one operation at a time. Watch gradients accumulate from right to left.
Beyond Manual Backprop: Automatic Differentiation
Manual backprop is educational but impractical for a network with millions of parameters. Automatic differentiation (AD) automates the process: record the computational graph during the forward pass, then traverse it backward. This is exactly what PyTorch's autograd, JAX's grad, and our own micrograd do under the hood.
There are two modes of AD. Reverse mode is backpropagation: one backward pass gives ALL parameter gradients, regardless of how many parameters exist. Forward mode computes one directional derivative per pass—efficient when you have few inputs and many outputs, but neural networks are the opposite: millions of inputs (parameters) and one output (the loss). Reverse mode wins overwhelmingly.
Think of it as a tape: the forward pass records every operation onto a "tape." The backward pass replays the tape in reverse, computing gradients at each step. The tape stores the intermediate values (the "activation cache") that the backward pass needs—this is why training uses roughly 3x the memory of inference (forward values + backward gradients + optimizer state).
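A minimal tape can be sketched in pure Python. This is an illustrative toy API (the `Var`, `mul`, and `add` names are made up for this example, not PyTorch's): each operation records a closure onto a global tape, and the backward pass replays the tape in reverse.

```python
# Toy reverse-mode AD: the forward pass records (output, backward_fn)
# pairs onto a tape; the backward pass replays them in reverse order.
tape = []

class Var:
    def __init__(self, value):
        self.value, self.grad = value, 0.0

def mul(a, b):
    out = Var(a.value * b.value)
    def backward_fn(upstream):
        # backward_fn closes over a and b: this is the activation cache
        a.grad += upstream * b.value
        b.grad += upstream * a.value
    tape.append((out, backward_fn))
    return out

def add(a, b):
    out = Var(a.value + b.value)
    def backward_fn(upstream):
        a.grad += upstream   # addition distributes the gradient
        b.grad += upstream
    tape.append((out, backward_fn))
    return out

# Forward: L = w*x + b, recorded onto the tape
w, x, b = Var(0.8), Var(1.5), Var(-0.2)
L = add(mul(w, x), b)

# Backward: seed dL/dL = 1, then replay the tape right to left
L.grad = 1.0
for out, backward_fn in reversed(tape):
    backward_fn(out.grad)

print(w.grad, x.grad, b.grad)  # 1.5 0.8 1.0
```

One replay of the tape delivers the gradient for every recorded variable at once, which is exactly the reverse-mode property that makes training million-parameter networks feasible.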
The Gradient Perspective on Architecture
Once you understand backprop, every architectural choice in deep learning becomes a statement about gradient flow:
- Residual connections create gradient highways (∂y/∂x = ∂F/∂x + 1)
- ReLU preserves gradient magnitude for positive activations (local grad = 1)
- Layer norm prevents activations from drifting, keeping gradients in a healthy range
- Attention's softmax creates a gradient bottleneck (the Jacobian is a full matrix)
- Flash Attention's backward pass must also be IO-aware — the same tiling trick applies in reverse
- RNNs unrolled through time are just very deep networks, so they suffer vanishing gradients over sequence length
A neural network isn't just a function—it's a function whose computational graph determines how easily it can learn. Next time you see a new architecture, ask the question that backpropagation teaches you to ask: "How do gradients flow through this?"
Connections to the Elementary Series
- Micrograd: Micrograd IS backprop—this post explains the math that micrograd's autograd engine implements automatically.
- Loss Functions: The gradient of the loss is where backprop starts. Different losses (MSE vs cross-entropy) give different initial gradient signals, and that first gradient shapes everything downstream.
- Optimizers: Optimizers consume the gradients that backprop produces. SGD, Adam, and AdaGrad are all post-processing steps applied to ∂L/∂θ.
- Activation Functions: Each activation's derivative determines gradient flow quality. Sigmoid caps at 0.25; ReLU passes 1 or 0; GELU provides a smooth middle ground.
- Normalization: Batch norm and layer norm have non-trivial backward passes that stabilize gradient magnitudes across layers.
- Transformers: Every transformer block uses residual connections—the gradient highway that makes 100+ layer training possible.
- RNNs: Backprop through time is backprop unrolled over sequence steps. The vanishing gradient problem over time is why LSTMs add a cell state gradient highway.
- Neural Scaling Laws: Gradient noise and batch size scaling govern the compute-optimal training regime—backprop's efficiency enables the massive training runs that scaling laws predict.
References & Further Reading
- Rumelhart, Hinton, & Williams — "Learning Representations by Back-Propagating Errors" (1986) — the landmark paper that popularized backpropagation for neural networks.
- Goodfellow, Bengio, & Courville — Deep Learning (2016), Chapter 6.5 — the definitive textbook treatment of backprop and computational graphs.
- Karpathy — "Yes you should understand backprop" (2016) — blog post arguing every ML practitioner needs backprop intuition, not just autograd.
- He et al. — "Deep Residual Learning for Image Recognition" (2015) — ResNets showed that skip connections solve the degradation problem by creating gradient highways.
- Glorot & Bengio — "Understanding the difficulty of training deep feedforward neural networks" (2010) — Xavier initialization and the analysis of gradient flow in deep networks.
- Baydin et al. — "Automatic Differentiation in Machine Learning: a Survey" (2018) — comprehensive survey of AD theory, including forward vs reverse mode.
- Hochreiter — "Untersuchungen zu dynamischen neuronalen Netzen" (1991) — the original identification of the vanishing gradient problem.
- Sutskever — "Training Recurrent Neural Networks" (2013) — PhD thesis with deep treatment of BPTT and gradient dynamics.