Residual Networks from Scratch: Why Deeper Networks Need Shortcuts
The Degradation Problem
Here's a puzzle that stumped the deep learning community in 2015. If a 20-layer convolutional network achieves 7% training error, surely a 56-layer network can do at least as well. After all, the extra 36 layers could just learn identity mappings and replicate the shallower network's performance. More depth means more capacity, and more capacity should never hurt.
But it did hurt. He et al. showed that the 56-layer network got worse training error than the 20-layer one. Not just worse test error (which would suggest overfitting) but worse training error. The deeper network was failing to optimize, not failing to generalize. It simply couldn't find a solution as good as the shallower network's, despite having strictly more representational power.
This is the degradation problem, and it blocked the entire field from going deep. VGG-19, with its 19 layers, was near the practical limit. Going deeper made things worse, not better. The problem wasn't vanishing gradients (those were largely solved by ReLU and batch normalization). It was something more fundamental: stacking nonlinear layers makes it surprisingly hard to learn simple functions like the identity.
Let's see this failure empirically. We'll train plain networks of increasing depth and watch the deeper ones plateau at higher loss:
```python
import numpy as np

def make_spirals(n_points=200, noise=0.3, seed=42):
    rng = np.random.RandomState(seed)
    t = np.linspace(0, 2 * np.pi, n_points)
    r = np.linspace(0.3, 1.0, n_points)
    x0 = np.column_stack([r * np.cos(t) + rng.randn(n_points) * noise * 0.1,
                          r * np.sin(t) + rng.randn(n_points) * noise * 0.1])
    x1 = np.column_stack([r * np.cos(t + np.pi) + rng.randn(n_points) * noise * 0.1,
                          r * np.sin(t + np.pi) + rng.randn(n_points) * noise * 0.1])
    X = np.vstack([x0, x1])
    y = np.array([0] * n_points + [1] * n_points)
    return X, y

def train_plain_network(X, y, depth, width=32, lr=0.01, epochs=500, seed=0):
    """Train a plain (no skip connections) network and return loss history."""
    rng = np.random.RandomState(seed)
    # He initialization: scale by sqrt(2/fan_in)
    layers = []
    fan_in = X.shape[1]
    for i in range(depth):
        fan_out = width if i < depth - 1 else 1
        W = rng.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
        b = np.zeros(fan_out)
        layers.append((W, b))
        fan_in = fan_out

    losses = []
    for epoch in range(epochs):
        # Forward pass: ReLU hidden layers, sigmoid output
        activations = [X]
        for i, (W, b) in enumerate(layers):
            z = activations[-1] @ W + b
            a = np.maximum(0, z) if i < depth - 1 else 1 / (1 + np.exp(-z))
            activations.append(a)
        pred = activations[-1].ravel()
        loss = -np.mean(y * np.log(pred + 1e-8) + (1 - y) * np.log(1 - pred + 1e-8))
        losses.append(loss)

        # Backward pass (simplified)
        grad = (pred - y).reshape(-1, 1) / len(y)
        for i in range(depth - 1, -1, -1):
            W, b = layers[i]
            dW = activations[i].T @ grad
            db = grad.sum(axis=0)
            if i > 0:
                grad = grad @ W.T
                grad *= (activations[i] > 0).astype(float)  # ReLU derivative
            layers[i] = (W - lr * dW, b - lr * db)
    return losses

# Results: deeper plain networks get WORSE training loss
# depth=4:  loss ~ 0.15 (converges well)
# depth=8:  loss ~ 0.25 (slower, higher floor)
# depth=16: loss ~ 0.45 (barely learns)
# depth=32: loss ~ 0.60 (almost stuck at random)
```
The numbers don't lie. A 4-layer network converges to a training loss around 0.15, while a 32-layer network stalls at 0.60 — barely better than random guessing. This is the degradation problem in action. More depth should mean more power, but the optimizer simply can't find good solutions through all those nonlinear layers. (For more on why gradients degrade through deep networks, see our backpropagation from scratch post.)
The Residual Learning Framework
Kaiming He and colleagues at Microsoft Research proposed an elegant fix. The insight: if learning the identity through stacked nonlinear layers is hard, don't make the network learn the identity. Give it the identity for free.
Instead of asking a block of layers to learn the desired transformation H(x) directly, let it learn the residual F(x) = H(x) − x. The block's output becomes:
y = F(x) + x
That + x is the skip connection — the input bypasses the layers and gets added directly to the output. If the optimal transformation is close to identity (which it often is in deep networks), the block just needs to push F(x) toward zero. And pushing weights toward zero is trivially easy — weight decay does it automatically.
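This "zero weights means identity" property is easy to check numerically. Here is a quick standalone sketch (a toy fully connected residual block of our own, not the paper's code): when the residual branch's weights are all zero, the block is an exact pass-through.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Toy fully connected residual block: y = relu(x @ W1) @ W2 + x."""
    return np.maximum(0, x @ W1) @ W2 + x

x = np.random.randn(4, 8)
W_zero = np.zeros((8, 8))
# With the branch's weights at zero, F(x) = 0 and y = x exactly.
assert np.allclose(residual_block(x, W_zero, W_zero), x)
```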
But the real magic is in the gradients. During backpropagation, the gradient of the output with respect to the input is:
∂y/∂x = ∂F/∂x + I
That + I (the identity matrix) is a gradient highway. Even if the residual branch has vanishing gradients (∂F/∂x ≈ 0), the skip connection preserves a gradient of exactly 1. Gradients flow unimpeded from the loss all the way back to the earliest layers, no matter how deep the network.
Let's see this gradient advantage concretely:
```python
import numpy as np

def gradient_through_plain_block(x, W1, W2):
    """Forward + backward through a plain block: y = relu(W2 @ relu(W1 @ x))."""
    z1 = W1 @ x
    a1 = np.maximum(0, z1)  # ReLU
    z2 = W2 @ a1
    y = np.maximum(0, z2)   # ReLU
    # Backward: dy/dx = diag(z2>0) @ W2 @ diag(z1>0) @ W1
    grad = np.diag((z2 > 0).astype(float)) @ W2 @ np.diag((z1 > 0).astype(float)) @ W1
    return y, grad

def gradient_through_residual_block(x, W1, W2):
    """Forward + backward through a residual block: y = relu(W2 @ relu(W1 @ x)) + x."""
    z1 = W1 @ x
    a1 = np.maximum(0, z1)
    z2 = W2 @ a1
    F_x = np.maximum(0, z2)
    y = F_x + x  # Skip connection!
    # Backward: dy/dx = dF/dx + I
    dF_dx = np.diag((z2 > 0).astype(float)) @ W2 @ np.diag((z1 > 0).astype(float)) @ W1
    grad = dF_dx + np.eye(len(x))  # The +I that saves deep networks
    return y, grad

# Example: 8-dimensional vectors, random weights (small)
rng = np.random.RandomState(42)
d = 8
x = rng.randn(d)
W1 = rng.randn(d, d) * 0.3
W2 = rng.randn(d, d) * 0.3

_, grad_plain = gradient_through_plain_block(x, W1, W2)
_, grad_resid = gradient_through_residual_block(x, W1, W2)
print(f"Plain block gradient norm:    {np.linalg.norm(grad_plain):.4f}")
print(f"Residual block gradient norm: {np.linalg.norm(grad_resid):.4f}")
# Plain:    ~0.35 (shrinking through multiplications)
# Residual: ~3.12 (anchored near identity)
```
The plain block's gradient norm is around 0.35 — already shrinking after just one block. Stack 50 of these and the gradient effectively vanishes. The residual block's gradient norm stays near the identity's norm (~2.83 for an 8-dimensional identity matrix), plus whatever the residual branch contributes. This is why ResNets can train with 152 layers while plain networks stall at 20.
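The compounding effect across depth can be sketched directly (a standalone illustration: random matrices stand in for each block's ∂F/∂x, which is an assumption of ours, not the paper's analysis). Multiplying thirty bare Jacobians drives the end-to-end gradient toward zero; adding the identity to each factor keeps it alive.

```python
import numpy as np

rng = np.random.RandomState(0)
d, depth = 8, 30
J_plain = np.eye(d)
J_resid = np.eye(d)
for _ in range(depth):
    A = rng.randn(d, d) * 0.05            # stand-in for one block's dF/dx
    J_plain = A @ J_plain                 # plain: bare Jacobians multiply
    J_resid = (A + np.eye(d)) @ J_resid   # residual: every factor is near I

print(np.linalg.norm(J_plain))  # vanishingly small after 30 blocks
print(np.linalg.norm(J_resid))  # stays on the order of the identity's norm
```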
Building a Residual Block
The standard residual block in the original ResNet paper has a specific structure: two 3×3 convolutional layers, each followed by batch normalization, with ReLU activations. The skip connection adds the input to the output before the final ReLU:
Conv3×3 → BN → ReLU → Conv3×3 → BN → (+x) → ReLU
Two key design choices here. First, batch normalization after each convolution stabilizes the residual branch, preventing its outputs from overwhelming the skip connection. (See our normalization from scratch post for the details.) Second, the block uses He initialization — scaling weights by √(2/fan_in) — which was designed specifically for ReLU networks by the same Kaiming He who invented ResNets. (That connection is covered in weight initialization from scratch.)
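The effect of that √(2/fan_in) scaling is easy to verify in isolation (a standalone sketch, separate from the ResNet code below): with He initialization, activation magnitudes stay roughly constant through a deep stack of ReLU layers instead of collapsing or exploding.

```python
import numpy as np

rng = np.random.RandomState(0)
fan = 256
a = rng.randn(200, fan)  # a batch of random inputs
for _ in range(20):
    W = rng.randn(fan, fan) * np.sqrt(2.0 / fan)  # He initialization
    a = np.maximum(0, a @ W)                      # linear layer + ReLU

print(a.std())  # still O(1) after 20 layers; naive std-1 init would explode
```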
```python
import numpy as np

class ResidualBlock:
    """A basic residual block: two conv layers with batch norm and a skip connection."""

    def __init__(self, channels, rng):
        # He initialization for both 3x3 conv layers
        scale = np.sqrt(2.0 / (channels * 9))  # fan_in = channels * 3 * 3
        self.W1 = rng.randn(channels, channels, 3, 3) * scale
        self.W2 = rng.randn(channels, channels, 3, 3) * scale
        # Batch norm parameters (simplified: just scale and shift)
        self.gamma1 = np.ones(channels)
        self.beta1 = np.zeros(channels)
        self.gamma2 = np.ones(channels)
        self.beta2 = np.zeros(channels)

    def forward(self, x):
        """x shape: (channels, H, W)"""
        identity = x  # Save for skip connection
        # First conv + BN + ReLU
        out = conv2d_same(x, self.W1)
        out = batch_norm(out, self.gamma1, self.beta1)
        out = np.maximum(0, out)  # ReLU
        # Second conv + BN
        out = conv2d_same(out, self.W2)
        out = batch_norm(out, self.gamma2, self.beta2)
        # Skip connection: add identity, then ReLU
        out = out + identity
        out = np.maximum(0, out)
        return out

def conv2d_same(x, W):
    """3x3 convolution with same padding. x: (C_in, H, W), W: (C_out, C_in, 3, 3)"""
    C_in, H, W_dim = x.shape
    C_out = W.shape[0]
    x_pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C_out, H, W_dim))
    for co in range(C_out):
        for ci in range(C_in):
            for i in range(H):
                for j in range(W_dim):
                    out[co, i, j] += np.sum(x_pad[ci, i:i+3, j:j+3] * W[co, ci])
    return out

def batch_norm(x, gamma, beta, eps=1e-5):
    """Simplified BN over spatial dims. x: (C, H, W)"""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(-1, 1, 1) * x_norm + beta.reshape(-1, 1, 1)
```
The out + identity line is the entire skip connection — a single addition that transforms a plain block into a residual block. The shapes match because both convolutions use "same" padding and the channel count doesn't change. When shapes do change, we need something more.
Projection Shortcuts and Downsampling
The equation y = F(x) + x requires F(x) and x to have identical shapes. But real networks need to downsample spatially (stride 2 halves height and width) and increase channels (64 → 128 → 256 → 512). When dimensions don't match, we have three options:
- Option A: Zero-pad the extra channels and subsample spatially. Free in parameters but wastes capacity.
- Option B: Use a 1×1 convolution to project the shortcut only when dimensions change. Minimal parameter cost.
- Option C: Use 1×1 convolutions on every shortcut. Maximum expressivity but more parameters.
He et al. tested all three and found Option B wins the tradeoff: projection shortcuts only where needed, identity shortcuts everywhere else. The 1×1 convolution acts as a learned linear transformation in channel space — it maps, say, 64 channels to 128 without touching spatial dimensions (see convnets from scratch for more on 1×1 convolutions).
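To make "linear transformation in channel space" concrete, here is a quick standalone check (variable names are ours): a 1×1 convolution is just a matrix multiply over channels, applied independently at every pixel.

```python
import numpy as np

rng = np.random.RandomState(0)
C_in, C_out, H, W = 64, 128, 4, 4
x = rng.randn(C_in, H, W)
w = rng.randn(C_out, C_in)  # a 1x1 kernel with its two singleton dims squeezed

# "Convolve": at each (i, j), map the C_in channel vector through w
y = np.einsum('oc,chw->ohw', w, x)

# Identical to a plain matmul on flattened pixels
y_ref = (w @ x.reshape(C_in, H * W)).reshape(C_out, H, W)
assert np.allclose(y, y_ref)
assert y.shape == (128, 4, 4)  # channels change, spatial dims untouched
```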
```python
def conv2d(x, W, stride=1, pad=0):
    """General strided convolution, defined here so the block below runs.
    x: (C_in, H, W), W: (C_out, C_in, k, k)."""
    C_in, H, W_dim = x.shape
    C_out, _, k, _ = W.shape
    x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H_out = (H + 2 * pad - k) // stride + 1
    W_out = (W_dim + 2 * pad - k) // stride + 1
    out = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                patch = x_pad[:, i*stride:i*stride+k, j*stride:j*stride+k]
                out[co, i, j] = np.sum(patch * W[co])
    return out

class ResidualBlockWithProjection:
    """Residual block that handles spatial downsampling and channel changes."""

    def __init__(self, in_channels, out_channels, stride, rng):
        scale1 = np.sqrt(2.0 / (in_channels * 9))
        scale2 = np.sqrt(2.0 / (out_channels * 9))
        self.W1 = rng.randn(out_channels, in_channels, 3, 3) * scale1
        self.W2 = rng.randn(out_channels, out_channels, 3, 3) * scale2
        self.stride = stride
        # Projection shortcut: 1x1 conv when dimensions change
        self.needs_projection = (stride != 1) or (in_channels != out_channels)
        if self.needs_projection:
            scale_proj = np.sqrt(2.0 / in_channels)
            self.W_proj = rng.randn(out_channels, in_channels, 1, 1) * scale_proj

    def forward(self, x):
        identity = x
        # Main path: conv(stride) -> ReLU -> conv (BN omitted for clarity)
        out = conv2d(x, self.W1, stride=self.stride, pad=1)
        out = np.maximum(0, out)
        out = conv2d(out, self.W2, stride=1, pad=1)
        # Shortcut path: project if dimensions changed
        if self.needs_projection:
            identity = conv2d(x, self.W_proj, stride=self.stride, pad=0)
        # Add and activate
        out = out + identity  # Shapes now match!
        out = np.maximum(0, out)
        return out

# Example: transition from stage 1 (64 channels) to stage 2 (128 channels)
# Input:  (64, 32, 32)  -- 64 channels, 32x32 spatial
# Output: (128, 16, 16) -- 128 channels, 16x16 spatial (stride 2)
# The 1x1 projection maps: (64, 32, 32) -> (128, 16, 16)
# The main path maps:      (64, 32, 32) -> (128, 16, 16) via stride-2 conv
```
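The spatial bookkeeping in that example follows the standard convolution output-size formula, which is easy to sanity-check (helper name is ours):

```python
def conv_out_size(size, kernel, stride, pad):
    """Output spatial size of a conv: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Both paths of the stage transition land on 16x16:
assert conv_out_size(32, kernel=3, stride=2, pad=1) == 16  # main path, 3x3 stride-2
assert conv_out_size(32, kernel=1, stride=2, pad=0) == 16  # shortcut, 1x1 stride-2
```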
The projection shortcut is elegant: it's the minimum intervention needed to make the skip connection work. Only the first block of each stage, where resolution and channel count change, needs a projection; every other block uses a free identity shortcut.
Pre-Activation vs Post-Activation
The original ResNet (He et al. 2015) placed batch normalization and ReLU after the addition:
Post-activation (v1): Conv → BN → ReLU → Conv → BN → (+x) → ReLU
A year later, He et al. published a follow-up showing that moving BN and ReLU before the convolutions works better:
Pre-activation (v2): BN → ReLU → Conv → BN → ReLU → Conv → (+x)
The difference is subtle but important. In v1, the identity path passes through a ReLU after the addition, which can zero out negative values and block the identity signal. In v2, the identity path is completely clean — it's a pure addition with nothing gating it. For a 110-layer ResNet, the difference is marginal. For a 1001-layer ResNet, pre-activation is essential.
```python
def post_activation_block(x, W1, W2, gamma1, beta1, gamma2, beta2):
    """Original ResNet v1: BN-ReLU after each conv, ReLU after addition."""
    out = conv2d_same(x, W1)
    out = batch_norm(out, gamma1, beta1)
    out = np.maximum(0, out)  # ReLU
    out = conv2d_same(out, W2)
    out = batch_norm(out, gamma2, beta2)
    out = out + x             # Skip connection
    out = np.maximum(0, out)  # ReLU gates the identity!
    return out

def pre_activation_block(x, W1, W2, gamma1, beta1, gamma2, beta2):
    """ResNet v2: BN-ReLU before each conv, clean identity path."""
    out = batch_norm(x, gamma1, beta1)
    out = np.maximum(0, out)  # ReLU
    out = conv2d_same(out, W1)
    out = batch_norm(out, gamma2, beta2)
    out = np.maximum(0, out)  # ReLU
    out = conv2d_same(out, W2)
    out = out + x  # Pure identity -- no ReLU gating!
    return out

# The identity path in v2 is: x --> (+) --> next block
# Nothing modifies x along the skip connection.
# This is why transformers use Pre-Norm: x = x + Sublayer(Norm(x))
# Same idea, discovered independently for attention layers.
```
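The gating difference shows up even in a toy example (standalone, and assuming for illustration that the residual branch happens to output zero):

```python
import numpy as np

x = np.array([-1.0, 2.0, -3.0])
F = np.zeros_like(x)             # suppose the branch contributes nothing

post = np.maximum(0, F + x)      # v1: a ReLU sits after the addition
pre = F + x                      # v2: pure addition, nothing gates the identity

assert np.allclose(pre, x)       # v2 passes x through untouched
assert not np.allclose(post, x)  # v1 clipped the negative entries to zero
```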
This is the direct precursor to Pre-Norm in transformers. When you see x = x + attn(norm(x)) in a transformer block (covered in our transformer from scratch post), you're seeing the same principle He discovered for ResNets: keep the residual stream clean, let sub-layers read from it and add updates back.
Bottleneck Blocks and the ResNet Family
For shallower ResNets (18 and 34 layers), the basic block with two 3×3 convolutions works well. But for deeper variants, the computational cost becomes prohibitive. A 3×3 conv on 256 channels costs 256 × 256 × 9 ≈ 590K multiply-adds per spatial location. Two of those per block adds up fast.
The bottleneck block solves this with a squeeze-and-expand pattern:
- 1×1 conv: Reduce channels (256 → 64) — the "bottleneck"
- 3×3 conv: Process at reduced dimensionality (64 → 64)
- 1×1 conv: Expand back (64 → 256)
Total cost: (256×64 + 64×64×9 + 64×256) ≈ 70K multiply-adds — about 17× cheaper than the ~1.18M for two 3×3 convs at full width, while being one layer deeper.
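The arithmetic is easy to verify (counts are multiply-adds per spatial location; the helper is ours):

```python
def conv_cost(c_in, c_out, k):
    """Multiply-adds per spatial location for a k x k convolution."""
    return c_in * c_out * k * k

basic = 2 * conv_cost(256, 256, 3)  # two 3x3 convs at full width
bottleneck = conv_cost(256, 64, 1) + conv_cost(64, 64, 3) + conv_cost(64, 256, 1)
print(basic, bottleneck, basic / bottleneck)  # 1179648 69632 ~17x
```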
```python
class BottleneckBlock:
    """ResNet bottleneck: 1x1 reduce -> 3x3 process -> 1x1 expand + skip."""

    def __init__(self, in_channels, bottleneck_channels, out_channels, stride, rng):
        # 1x1 reduce
        s1 = np.sqrt(2.0 / in_channels)
        self.W_reduce = rng.randn(bottleneck_channels, in_channels, 1, 1) * s1
        # 3x3 process
        s2 = np.sqrt(2.0 / (bottleneck_channels * 9))
        self.W_process = rng.randn(bottleneck_channels, bottleneck_channels, 3, 3) * s2
        # 1x1 expand
        s3 = np.sqrt(2.0 / bottleneck_channels)
        self.W_expand = rng.randn(out_channels, bottleneck_channels, 1, 1) * s3
        self.stride = stride
        self.needs_projection = (stride != 1) or (in_channels != out_channels)
        if self.needs_projection:
            sp = np.sqrt(2.0 / in_channels)
            self.W_proj = rng.randn(out_channels, in_channels, 1, 1) * sp

    def forward(self, x):
        identity = x
        # Reduce: 256 -> 64
        out = conv2d(x, self.W_reduce, stride=1, pad=0)
        out = np.maximum(0, out)
        # Process at bottleneck width: 64 -> 64
        out = conv2d(out, self.W_process, stride=self.stride, pad=1)
        out = np.maximum(0, out)
        # Expand: 64 -> 256
        out = conv2d(out, self.W_expand, stride=1, pad=0)
        # Projection shortcut if needed
        if self.needs_projection:
            identity = conv2d(x, self.W_proj, stride=self.stride, pad=0)
        return np.maximum(0, out + identity)
```
The full ResNet family uses these two block types across four stages:
| Model | Block Type | Blocks per Stage | Parameters | Top-5 Error |
|---|---|---|---|---|
| ResNet-18 | Basic | [2, 2, 2, 2] | 11M | 10.92% |
| ResNet-34 | Basic | [3, 4, 6, 3] | 21M | 9.46% |
| ResNet-50 | Bottleneck | [3, 4, 6, 3] | 25M | 7.13% |
| ResNet-101 | Bottleneck | [3, 4, 23, 3] | 44M | 6.21% |
| ResNet-152 | Bottleneck | [3, 8, 36, 3] | 60M | 5.71% |
Each stage operates at a different spatial resolution: 56×56, 28×28, 14×14, and 7×7 (for 224×224 input). Channels double at each stage transition: 64 → 128 → 256 → 512. An ensemble of ResNets won the 2015 ImageNet competition with 3.57% top-5 error, well below the roughly 5% error rate reported for trained human annotators on this benchmark.
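The layer counts in the table can be sanity-checked with a little bookkeeping: one stem convolution, plus the convs in every block across all four stages, plus the final fully connected layer (the helper is ours):

```python
def resnet_depth(blocks_per_stage, convs_per_block):
    """Count weighted layers: stem conv + convs in every block + final fc."""
    return 1 + sum(blocks_per_stage) * convs_per_block + 1

assert resnet_depth([2, 2, 2, 2], 2) == 18    # ResNet-18, basic blocks
assert resnet_depth([3, 4, 6, 3], 2) == 34    # ResNet-34, basic blocks
assert resnet_depth([3, 4, 6, 3], 3) == 50    # ResNet-50, bottlenecks
assert resnet_depth([3, 4, 23, 3], 3) == 101  # ResNet-101
assert resnet_depth([3, 8, 36, 3], 3) == 152  # ResNet-152
```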
Skip Connections Changed Everything
ResNets didn't just win ImageNet — they introduced an idea that now appears in virtually every state-of-the-art architecture. Skip connections are the common thread linking the most important models of the last decade:
- DenseNets (Huang et al. 2017): Instead of adding, concatenate the skip connection. Each layer receives features from ALL previous layers, creating extremely strong gradient flow.
- Highway Networks (Srivastava et al. 2015): Predated ResNets with learned gating: y = T(x) ⋅ H(x) + (1 − T(x)) ⋅ x. ResNets showed the gating was unnecessary — simple addition works better.
- U-Nets: Cross-scale skip connections that connect encoder layers to decoder layers at matching resolutions. Essential for segmentation and, later, for the diffusion models that power image generation.
- Transformers: Every transformer block uses residual connections: x = x + attn(norm(x)). The residual stream is the backbone of GPT, BERT, and every LLM. Without skip connections, training a 96-layer transformer would be impossible.
- Vision Transformers: Apply the exact same residual pattern to image patches, as covered in ViT from scratch.
The deeper lesson is what Veit et al. (2016) discovered: ResNets behave like ensembles of relatively shallow networks. The skip connections create an exponential number of paths through the network, and most of the gradient flows through short paths. A 110-layer ResNet effectively functions as an ensemble of networks ranging from 10 to 110 layers, with the shorter paths contributing most of the learning signal.
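The path-counting behind that ensemble view is simple enough to verify directly (a standalone sketch of ours, not Veit et al.'s experiment):

```python
from math import comb

# Unrolling n residual blocks gives 2^n paths: each block is either entered
# (through its residual branch F) or skipped (through the identity).
n = 54  # a 110-layer ResNet has 54 two-layer residual blocks
paths_by_length = [comb(n, k) for k in range(n + 1)]
assert sum(paths_by_length) == 2 ** n
# Path lengths are binomial: the typical path crosses only ~n/2 blocks,
# and exactly one path out of 2^54 uses the full depth.
assert max(range(n + 1), key=paths_by_length.__getitem__) == n // 2
assert paths_by_length[n] == 1
```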
Try It: Depth vs. Skip Connections
Watch deeper networks fail without skip connections — and succeed with them. Adjust depth, toggle skip connections, and hit Train to see the loss curves.
Try It: Gradient Flow Visualizer
See how gradients propagate backward through a stack of layers. Toggle between a plain network (gradients fade) and a ResNet (gradients stay strong).
What ResNets Teach Us
The lesson of ResNets is profound: sometimes the best thing you can do for a deep network is make it easy to learn nothing. By providing a default identity path, skip connections let each layer decide whether it has something useful to contribute, rather than forcing every layer to transform the signal. If a layer has nothing to add, the gradients will push its weights toward zero, and the block gracefully becomes a pass-through.
This "opt-in computation" principle now underlies virtually every state-of-the-art architecture. Transformers, diffusion models, and modern ConvNets all use residual connections — not because they're a nice trick, but because they're the mechanism that makes depth work. Before ResNets, depth was a liability. After ResNets, depth became the primary axis of scaling. That shift made everything from GPT to Stable Diffusion possible.
References & Further Reading
- He et al. — Deep Residual Learning for Image Recognition (2015) — The original ResNet paper that introduced skip connections and won ImageNet
- He et al. — Identity Mappings in Deep Residual Networks (2016) — Pre-activation ResNets with clean identity paths, enabling 1001-layer networks
- Veit et al. — Residual Networks Behave Like Ensembles of Relatively Shallow Networks (2016) — Shows that ResNets function as implicit ensembles of paths of varying depth
- Huang et al. — Densely Connected Convolutional Networks (2017) — DenseNets: concatenation instead of addition for even stronger feature reuse
- Srivastava et al. — Highway Networks (2015) — Learned gating for skip connections, a precursor to ResNets
- Li et al. — Visualizing the Loss Landscape of Neural Nets (2018) — Shows that skip connections dramatically smooth the loss landscape