Convolutional Neural Networks from Scratch
The Architecture That Taught Machines to See
Before transformers conquered vision, one architecture ruled for an entire decade. From AlexNet's 2012 ImageNet shock — when a convolutional neural network crushed the competition by 10 percentage points — to the rise of Vision Transformers in 2020, CNNs were computer vision.
The Vision Transformers post built a ViT from scratch and showed that a standard transformer can classify images by treating them as sequences of patches. But that post also made claims about the architecture it was replacing — claims about locality, weight sharing, and hierarchical features. We never proved those claims. We just asserted them.
This post fixes that. We're going to build a convolutional neural network from raw NumPy, layer by layer, so you can see exactly what convolutions do, why they dominated vision for so long, and what fundamental limitation opened the door for transformers.
The story starts in 1962, with a cat and two neuroscientists.
David Hubel and Torsten Wiesel discovered that neurons in a cat's visual cortex respond to oriented edges in specific local regions of the visual field. Not the whole image — just a small patch. And the same type of edge detector appeared all across the visual field. These observations map directly onto the three core ideas behind CNNs: local connectivity, weight sharing, and translation equivariance.
What Is a Convolution?
Let's start with the naive approach. You have a 28×28 grayscale image — 784 pixels. If you wire every pixel to every neuron in a hidden layer of size 128, that's 784 × 128 = 100,352 parameters just for the first layer. And this layer has no concept of spatial structure. The pixel at position (0,0) is connected to the same neurons as the pixel at position (27,27). The network has to learn that nearby pixels are related, from scratch, every time.
A convolution bakes in a much smarter assumption: nearby pixels matter more than distant ones. Instead of connecting every pixel to every neuron, we slide a small filter (called a kernel) across the image, computing a weighted sum at each position.
A 5×5 kernel has just 25 learnable parameters. It slides across the entire 28×28 image, producing a 24×24 output (because the 5×5 window can start at positions 0 through 23 in each dimension). The same 25 parameters are reused at every position — this is weight sharing, and it's why CNNs are so parameter-efficient.
```python
import numpy as np

def conv2d(image, kernel):
    """Convolve a 2D image with a 2D kernel (no padding, stride=1).

    Like most deep learning frameworks, this computes cross-correlation
    (no kernel flip); for learned kernels the distinction is immaterial.
    """
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+k_h, j:j+k_w]
            output[i, j] = np.sum(patch * kernel)
    return output

# Sobel edge detector — finds vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# Test on a 7x7 image with a vertical line in the middle
image = np.zeros((7, 7), dtype=np.float32)
image[:, 3] = 1.0  # vertical line at column 3

edges = conv2d(image, sobel_x)
print("Input shape:", image.shape)   # (7, 7)
print("Output shape:", edges.shape)  # (5, 5) — shrank by kernel_size - 1
print("Edge map:\n", edges)          # Strong response at the line's borders
```
That's it: slide, extract a patch, multiply elementwise, sum. Every convolutional layer in every CNN — from LeNet to ResNet to the convolutional stems in hybrid ViT models — is doing exactly this.
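The Python loops above are the clearest way to see the operation, but they're slow. As a sketch of how vectorized implementations express the same computation (the function name `conv2d_vectorized` is mine, not from the post), NumPy's `sliding_window_view` builds all patches at once and `einsum` does every multiply-and-sum in one call:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_vectorized(image, kernel):
    """Same result as the loop-based conv2d, without Python loops.
    sliding_window_view yields a (out_h, out_w, kH, kW) view of all
    patches without copying; einsum multiplies each patch by the kernel
    and sums, producing the (out_h, out_w) feature map."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

# Sanity check on a tiny example: each output cell is a 2x2 window sum
img = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2))
out = conv2d_vectorized(img, k)
print(out.shape)  # (3, 3)
print(out[0, 0])  # 10.0 — sum of the top-left 2x2 window (0+1+4+5)
```

The same trick generalizes to the multi-channel case, which is essentially what im2col-based implementations in real frameworks do.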
Channels and Multiple Filters
Real images aren't grayscale — they have 3 channels (RGB). And one filter isn't enough: we want to detect many different features (horizontal edges, vertical edges, corners, gradients). So a real convolutional layer takes a multi-channel input and applies multiple filters.
Each filter spans all input channels. If the input has 3 channels and we use a 3×3 kernel, each filter has shape (3, 3, 3) = 27 parameters plus 1 bias. With 8 output filters, that's 8 × (27 + 1) = 224 parameters total — compared to 100K+ for a fully connected layer.
```python
def conv2d_multi(x, kernels, biases, stride=1, padding=0):
    """Multi-channel convolution.
    x: (C_in, H, W)
    kernels: (C_out, C_in, kH, kW)
    biases: (C_out,)
    returns: (C_out, H_out, W_out)
    """
    c_out, c_in, k_h, k_w = kernels.shape
    _, h, w = x.shape
    # Apply zero padding
    if padding > 0:
        x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
        _, h, w = x.shape
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    output = np.zeros((c_out, out_h, out_w))
    for f in range(c_out):  # for each output filter
        for i in range(out_h):
            for j in range(out_w):
                si, sj = i * stride, j * stride
                patch = x[:, si:si+k_h, sj:sj+k_w]  # (C_in, kH, kW)
                output[f, i, j] = np.sum(patch * kernels[f]) + biases[f]
    return output

# Example: 3 input channels (RGB), 8 output filters, 3x3 kernels
x = np.random.randn(3, 16, 16)  # RGB image, 16x16
kernels = np.random.randn(8, 3, 3, 3) * 0.1
biases = np.zeros(8)

out = conv2d_multi(x, kernels, biases)
print("Input:", x.shape)    # (3, 16, 16)
print("Output:", out.shape) # (8, 14, 14) — 8 feature maps, each 14x14
```
Notice the output shape: 8 feature maps, each 14×14. Each feature map is the result of one filter scanning the entire image. One filter might learn to detect horizontal edges, another vertical edges, another diagonal gradients. The network decides what to detect — we just provide the mechanism.
Pooling — Compress and Survive
After convolution detects features, we need to downsample. Max pooling takes the maximum value in each small window (typically 2×2), cutting the spatial dimensions in half.
Why does this work? If a filter detects an edge somewhere in a 2×2 region, the exact position doesn't matter much — we just need to know the edge exists. Max pooling provides local translation invariance: a feature can shift within its pooling window without changing the output. (A shift that crosses a window boundary can still change it, which is why the invariance is only local.)
```python
def max_pool2d(x, pool_size=2):
    """Max pooling over spatial dimensions.
    x: (C, H, W)
    returns: (C, H//pool_size, W//pool_size)
    """
    c, h, w = x.shape
    out_h = h // pool_size
    out_w = w // pool_size
    output = np.zeros((c, out_h, out_w))
    for ch in range(c):
        for i in range(out_h):
            for j in range(out_w):
                si, sj = i * pool_size, j * pool_size
                window = x[ch, si:si+pool_size, sj:sj+pool_size]
                output[ch, i, j] = np.max(window)
    return output

# 8 feature maps, 14x14 each
features = np.random.randn(8, 14, 14)
pooled = max_pool2d(features)
print("Before pooling:", features.shape)  # (8, 14, 14)
print("After pooling:", pooled.shape)     # (8, 7, 7) — spatial dims halved
```
Each 2×2 pool reduces spatial dimensions by half and reduces the total number of values by 4×. This aggressive compression is how CNNs gradually distill a large image down to a compact feature vector.
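The local-invariance claim is easy to check directly. Here's a minimal sketch, where `pool2x2` is my own reshape-based equivalent of `max_pool2d` for a single even-sized channel, so the snippet runs on its own:

```python
import numpy as np

def pool2x2(x):
    # 2x2 max pooling via reshape (single channel, H and W even)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0  # feature at top-left of a 2x2 window
b = np.zeros((4, 4)); b[1, 1] = 1.0  # shifted one pixel, same window
print(np.array_equal(pool2x2(a), pool2x2(b)))  # True — pooled output unchanged

c = np.zeros((4, 4)); c[2, 2] = 1.0  # shifted into a different window
print(np.array_equal(pool2x2(a), pool2x2(c)))  # False — invariance is only local
```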
The Feature Hierarchy — Edges to Objects
Here's the insight that made CNNs so powerful: stacking convolutional layers creates a feature hierarchy, where each layer detects increasingly complex patterns.
- Layer 1 (small receptive field): edges, gradients, color patches — the most basic visual features
- Layer 2 (medium receptive field): textures, corners, simple shapes — combinations of edges
- Layer 3+ (large receptive field): object parts, then whole objects — combinations of textures and shapes
The receptive field — how much of the original image a single neuron can "see" — grows with each layer. A 5×5 kernel in Layer 1 sees a 5×5 patch of the input. After a 2×2 max pool, each position in Layer 2 already represents a 2×2 group. Add another 5×5 conv, and each neuron in Layer 2 effectively sees a 14×14 region of the original image.
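The 14×14 figure can be checked with the standard receptive-field recurrence (the helper function here is my own, not from the post): each layer adds `(k - 1) * jump` to the receptive field, where `jump` is the accumulated stride.

```python
def receptive_field(layers):
    """Track the receptive field through a stack of (kernel_size, stride) layers.
    Standard recurrence: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
        print(f"k={k}, s={s} -> receptive field {rf}x{rf}")
    return rf

# LeNet-style stack: 5x5 conv, 2x2 pool, 5x5 conv, 2x2 pool
receptive_field([(5, 1), (2, 2), (5, 1), (2, 2)])
# k=5, s=1 -> receptive field 5x5
# k=2, s=2 -> receptive field 6x6
# k=5, s=1 -> receptive field 14x14  (the Layer 2 figure from above)
# k=2, s=2 -> receptive field 16x16
```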
This is exactly what the ViT post identified as a CNN's inductive bias: hierarchical features emerge naturally from the architecture. But here's the critical limitation — to connect the top-left corner with the bottom-right corner, you need many layers of conv + pool. A Vision Transformer gets global context in layer 1, via self-attention over all patches. CNNs have to earn it the hard way, one receptive field expansion at a time.
Building LeNet: A Complete CNN
Yann LeCun's LeNet (1998) was one of the first CNNs to achieve commercial success — the US Postal Service used it to read handwritten zip codes. The architecture is beautifully simple, and everything we need to build it is already in our toolkit.
Let's trace the tensor shapes through each layer, starting from a 28×28 grayscale image:
- Input: (1, 28, 28) — 1 channel, 28×28 pixels
- Conv1(1→6, 5×5): (6, 24, 24) — 6 feature maps detecting basic patterns
- ReLU + MaxPool(2×2): (6, 12, 12) — spatial dimensions halved
- Conv2(6→16, 5×5): (16, 8, 8) — 16 richer feature maps
- ReLU + MaxPool(2×2): (16, 4, 4) — now just 4×4 spatial
- Flatten: 256-dimensional vector
- FC1(256→120) → FC2(120→84) → FC3(84→10)
Total parameter count: Conv1 has 6 × 1 × 5 × 5 + 6 = 156. Conv2 has 16 × 6 × 5 × 5 + 16 = 2,416. FC1: 256 × 120 + 120 = 30,840. FC2: 120 × 84 + 84 = 10,164. FC3: 84 × 10 + 10 = 850. Grand total: 44,426 parameters. A fully connected network over the same 784 pixels with equivalent hidden layers would need millions.
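The arithmetic above is easy to verify in a few lines:

```python
def conv_params(c_out, c_in, k):
    return c_out * c_in * k * k + c_out  # weights + one bias per filter

def fc_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + one bias per output

total = (conv_params(6, 1, 5)    # 156
         + conv_params(16, 6, 5) # 2,416
         + fc_params(256, 120)   # 30,840
         + fc_params(120, 84)    # 10,164
         + fc_params(84, 10))    # 850
print(total)  # 44426
```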
```python
def relu(x):
    return np.maximum(0, x)

class LeNet:
    def __init__(self):
        # He (Kaiming) initialization for stable training with ReLU:
        # weights scaled by sqrt(2 / fan_in)
        self.conv1_w = np.random.randn(6, 1, 5, 5) * np.sqrt(2.0 / 25)
        self.conv1_b = np.zeros(6)
        self.conv2_w = np.random.randn(16, 6, 5, 5) * np.sqrt(2.0 / 150)
        self.conv2_b = np.zeros(16)
        self.fc1_w = np.random.randn(256, 120) * np.sqrt(2.0 / 256)
        self.fc1_b = np.zeros(120)
        self.fc2_w = np.random.randn(120, 84) * np.sqrt(2.0 / 120)
        self.fc2_b = np.zeros(84)
        self.fc3_w = np.random.randn(84, 10) * np.sqrt(2.0 / 84)
        self.fc3_b = np.zeros(10)

    def forward(self, x):
        """Forward pass. x shape: (1, 28, 28)"""
        # Conv block 1
        x = conv2d_multi(x, self.conv1_w, self.conv1_b)  # (6, 24, 24)
        x = relu(x)
        x = max_pool2d(x)                                # (6, 12, 12)
        # Conv block 2
        x = conv2d_multi(x, self.conv2_w, self.conv2_b)  # (16, 8, 8)
        x = relu(x)
        x = max_pool2d(x)                                # (16, 4, 4)
        # Flatten and fully connected layers
        x = x.reshape(-1)                      # (256,)
        x = relu(x @ self.fc1_w + self.fc1_b)  # (120,)
        x = relu(x @ self.fc2_w + self.fc2_b)  # (84,)
        x = x @ self.fc3_w + self.fc3_b        # (10,) — raw logits
        return x

net = LeNet()
dummy = np.random.randn(1, 28, 28)
logits = net.forward(dummy)
print("Logits shape:", logits.shape)  # (10,)
print("Prediction:", np.argmax(logits))
```
That's a complete convolutional neural network in ~40 lines. The convolutional layers extract spatial features while being parameter-efficient. The fully connected layers at the end are just standard feed-forward networks that map the extracted features to class scores.
Training on Synthetic Digits
Let's train our LeNet on a dataset we generate ourselves — simple 28×28 images of geometric patterns representing digits. We'll implement the full training loop with cross-entropy loss and a basic SGD optimizer.
```python
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_loss(logits, target):
    probs = softmax(logits)
    return -np.log(probs[target] + 1e-10)

def make_synthetic_digit(label, size=28):
    """Generate a simple synthetic image for a digit class."""
    img = np.zeros((size, size), dtype=np.float32)
    cx, cy = size // 2, size // 2
    if label == 0:    # circle
        for y in range(size):
            for x in range(size):
                if 8 <= ((x - cx)**2 + (y - cy)**2)**0.5 <= 11:
                    img[y, x] = 1.0
    elif label == 1:  # vertical line
        img[4:-4, cx-1:cx+1] = 1.0
    elif label == 2:  # horizontal line
        img[cy-1:cy+1, 4:-4] = 1.0
    elif label == 3:  # diagonal (top-left to bottom-right)
        for k in range(-1, 2):
            np.fill_diagonal(img[max(0, k):, max(0, -k):], 1.0)
    elif label == 4:  # cross (+)
        img[cy-1:cy+1, 4:-4] = 1.0
        img[4:-4, cx-1:cx+1] = 1.0
    elif label == 5:  # X shape
        for k in range(-1, 2):
            np.fill_diagonal(img[max(0, k):, max(0, -k):], 1.0)
            np.fill_diagonal(np.fliplr(img)[max(0, k):, max(0, -k):], 1.0)
    elif label == 6:  # square
        img[5:23, 5:7] = img[5:23, 21:23] = 1.0
        img[5:7, 5:23] = img[21:23, 5:23] = 1.0
    elif label == 7:  # triangle (top)
        for row in range(6, 24):
            half_w = (row - 6) * 10 // 18
            img[row, cx-half_w:cx+half_w+1] = 1.0
    elif label == 8:  # diamond
        for row in range(size):
            dist = abs(row - cy)
            half_w = max(0, 10 - dist)
            if half_w > 0:
                img[row, cx-half_w:cx+half_w+1] = 1.0
    elif label == 9:  # small dot
        img[cy-3:cy+3, cx-3:cx+3] = 1.0
    # Add slight noise for variety
    img += np.random.randn(size, size) * 0.05
    return np.clip(img, 0, 1)

# Generate dataset: 100 samples per class
X_train, y_train = [], []
for label in range(10):
    for _ in range(100):
        X_train.append(make_synthetic_digit(label))
        y_train.append(label)
X_train = np.array(X_train).reshape(-1, 1, 28, 28)
y_train = np.array(y_train)

# Shuffle
perm = np.random.permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]
print(f"Dataset: {len(X_train)} images, {len(set(y_train))} classes")
```
For a full training loop, we'd implement backpropagation through the convolutional layers — which involves convolving with flipped kernels to compute gradients (the gradient of a convolution is itself a convolution). This connects directly to the chain rule foundations we built in micrograd. In practice, frameworks like PyTorch handle this automatically, but the principle is the same engine we've been building throughout this series.
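The claim that the gradient of a convolution is itself a convolution can be made concrete with a numerical check. This is a minimal sketch for the single-channel `conv2d` case (redefined here so the snippet runs on its own; `conv2d_backward` is my own helper name): the kernel gradient is the input correlated with the upstream gradient, and the input gradient is a "full" correlation of the upstream gradient with the flipped kernel.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation, same as the conv2d defined earlier."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def conv2d_backward(x, kernel, grad_out):
    """Gradients of the loss w.r.t. kernel and input, given upstream grad_out."""
    kh, kw = kernel.shape
    d_kernel = conv2d(x, grad_out)             # correlate input with upstream grad
    padded = np.pad(grad_out, ((kh - 1,) * 2, (kw - 1,) * 2))
    d_x = conv2d(padded, kernel[::-1, ::-1])   # full correlation, flipped kernel
    return d_kernel, d_x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
k = rng.standard_normal((3, 3))
grad_out = np.ones_like(conv2d(x, k))  # loss = output.sum(), so upstream grad is 1

d_k, d_x = conv2d_backward(x, k, grad_out)

# Numerical check: nudge one kernel entry and one input pixel
eps = 1e-5
k2 = k.copy(); k2[1, 2] += eps
num_dk = (conv2d(x, k2).sum() - conv2d(x, k).sum()) / eps
x2 = x.copy(); x2[2, 3] += eps
num_dx = (conv2d(x2, k).sum() - conv2d(x, k).sum()) / eps
print(np.isclose(d_k[1, 2], num_dk, atol=1e-4),
      np.isclose(d_x[2, 3], num_dx, atol=1e-4))  # True True
```

Extending this to `conv2d_multi` means summing these per-channel gradients over filters, which is exactly what autograd frameworks do under the hood.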
BatchNorm — The CNN Stabilizer
Batch Normalization (Ioffe & Szegedy, 2015) was transformative for CNN training. The idea: normalize activations across the batch at each layer, then apply learnable scale and shift parameters.
For convolutional layers, we normalize per channel across the batch and spatial dimensions. This means each of the 6 or 16 feature maps gets its own mean and variance, computed over all spatial positions and all images in the batch.
```python
def batch_norm_2d(x_batch, gamma, beta, eps=1e-5):
    """Batch normalization for convolutional layers.
    x_batch: (N, C, H, W) — batch of feature maps
    gamma: (C,) — learnable scale per channel
    beta: (C,) — learnable shift per channel
    returns: (N, C, H, W) — normalized feature maps
    """
    n, c, h, w = x_batch.shape
    # Mean and variance per channel, across batch + spatial dims
    mean = x_batch.mean(axis=(0, 2, 3), keepdims=True)  # (1, C, 1, 1)
    var = x_batch.var(axis=(0, 2, 3), keepdims=True)    # (1, C, 1, 1)
    # Normalize
    x_norm = (x_batch - mean) / np.sqrt(var + eps)
    # Scale and shift (reshape gamma/beta to broadcast)
    gamma = gamma.reshape(1, c, 1, 1)
    beta = beta.reshape(1, c, 1, 1)
    return gamma * x_norm + beta

# Example: batch of 4 images, 6 feature maps each
batch = np.random.randn(4, 6, 12, 12)
gamma = np.ones(6)  # initialized to 1 (identity scaling)
beta = np.zeros(6)  # initialized to 0 (no shift)

normed = batch_norm_2d(batch, gamma, beta)
print("Channel means after BN:", normed.mean(axis=(0, 2, 3)).round(4))  # ~0
print("Channel stds after BN:", normed.std(axis=(0, 2, 3)).round(4))    # ~1
```
BatchNorm smooths the loss landscape, allows higher learning rates, and reduces sensitivity to initialization. It was essential for training deep CNNs like ResNet. But it has a key weakness: it normalizes across the batch dimension, which fails when batch sizes are small or sequences have variable lengths. This is exactly why transformers use LayerNorm instead — normalizing within each sample, independent of the batch.
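To make the contrast concrete, here's a minimal sketch of the alternative axis choice. The `layer_norm` below normalizes each sample over all of (C, H, W), one common convention for images; transformer LayerNorm normalizes over the feature dimension. The point is that its statistics never involve the batch axis, so batch size 1 is no problem:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-sample normalization over channels + spatial dims.
    x: (N, C, H, W); statistics never touch the batch axis."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)  # (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(1, 6, 12, 12) * 3 + 2  # batch of ONE image
# BatchNorm statistics here would come from a single sample (noisy, and
# inconsistent between training and inference); LayerNorm doesn't care:
normed = layer_norm(x)
print(normed.mean().round(4), normed.std().round(4))  # ~0.0, ~1.0
```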
CNN vs ViT: The Great Tradeoff
Now that we've built a CNN from scratch, we can see exactly what the ViT post meant by "inductive biases." A CNN has four built-in assumptions about images:
- Locality — each conv kernel only looks at a small patch (we built this in conv2d)
- Weight sharing — the same kernel slides everywhere (the kernel doesn't change between positions)
- Translation equivariance — detect a cat at (5,5), same activation as at (15,15) (consequence of weight sharing)
- Hierarchical features — stacking conv + pool grows receptive fields from local to global
These biases are a gift on small datasets — the network doesn't need to discover spatial structure from data because it's baked into the architecture. But they're a curse on large datasets, because the network can't learn patterns that violate these assumptions.
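The equivariance claim is easy to verify with the single-channel `conv2d` from the first section (redefined here so the snippet runs on its own): shift the input, and the feature map shifts by exactly the same amount.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation, same as the conv2d defined earlier."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.zeros((9, 9)); img[3, 3] = 1.0            # a single bright pixel
shifted = np.roll(img, shift=(2, 2), axis=(0, 1))  # same pixel, now at (5, 5)
k = np.random.randn(3, 3)

out = conv2d(img, k)
out_shifted = conv2d(shifted, k)
# Same kernel, shifted input -> identically shifted output
# (exact here because the responses stay away from the borders)
print(np.allclose(np.roll(out, (2, 2), axis=(0, 1)), out_shifted))  # True
```

A fully connected layer has no such guarantee: shifting the input produces an unrelated activation pattern unless the network has learned the shifted version separately.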
| Dataset Size | Winner | Why |
|---|---|---|
| Small (< 1M images) | CNN | Inductive biases compensate for limited data |
| Medium (1–14M) | CNN (slight edge) | Priors still help; ViT closing the gap with DeiT |
| Large (> 14M images) | ViT | Flexibility beats assumptions — global attention wins |
| Any size + hybrid | CNN stem + ViT body | Best of both: local features early, global attention late |
CNNs aren't dead. They serve as teacher networks for knowledge distillation, power efficient mobile architectures (MobileNet, EfficientNet), and form the stems of hybrid models. And the U-Net architecture, a CNN with skip connections, was the backbone of diffusion models until DiT (Diffusion Transformers) started replacing it.
Try It: Convolutional Neural Networks
Panel 1: Convolution Explorer
Select a kernel and watch it slide across the image. The output feature map builds up one pixel at a time.
Panel 2: Feature Map Visualizer
See what each layer of a trained CNN "sees." Layer 1 detects edges. Layer 2 combines edges into textures. Click a feature map to highlight its filter.
Panel 3: Receptive Field Race — CNN vs ViT
Step through layers and watch how information propagates. CNNs grow their receptive field slowly. ViTs see everything from layer 1.
Connections to the Series
CNNs touch almost every concept we've built in the elementary series:
- Micrograd — backpropagation through conv layers uses the same chain rule; the gradient of a convolution is a convolution with a flipped kernel
- Vision Transformers — the successor architecture; ViT replaces convolutions with self-attention over image patches
- Attention — global connectivity (attention) vs local connectivity (convolution) is the fundamental tradeoff
- Normalization — CNNs use BatchNorm; transformers use LayerNorm; the switch happened because of batch-size dependence
- Optimizers — SGD + momentum was the classic CNN optimizer; well-tuned SGD can beat Adam on image classification
- Loss Functions — cross-entropy loss for classification is identical whether the backbone is a CNN or transformer
- Knowledge Distillation — DeiT distills CNN knowledge into ViT; the teacher CNN is often more sample-efficient
- Diffusion Models — U-Nets (the original diffusion backbone) are CNNs with skip connections; DiT replaces them with transformers
- Embeddings — conv filters learn feature representations; the flattened output is the image's "embedding"
- Feed-Forward Networks — the FC layers at the end of a CNN are exactly the FFNs from the transformer series
References & Further Reading
- LeCun et al. (1998) — Gradient-Based Learning Applied to Document Recognition — the LeNet paper that proved CNNs work at commercial scale
- Krizhevsky, Sutskever & Hinton (2012) — ImageNet Classification with Deep Convolutional Neural Networks — AlexNet, the paper that launched the deep learning revolution
- Hubel & Wiesel (1962) — Receptive fields, binocular interaction and functional architecture in the cat's visual cortex — the biological inspiration for convolutional networks
- Ioffe & Szegedy (2015) — Batch Normalization: Accelerating Deep Network Training — the normalization technique that enabled training deep CNNs
- He et al. (2016) — Deep Residual Learning for Image Recognition — ResNet, which introduced skip connections and scaled CNNs to 152 layers
- Dosovitskiy et al. (2020) — An Image is Worth 16x16 Words — the ViT paper that showed transformers can replace CNNs for vision
- Zeiler & Fergus (2014) — Visualizing and Understanding Convolutional Networks — groundbreaking visualization of what CNN layers learn
- Goodfellow, Bengio & Courville — Deep Learning (Chapter 9) — the definitive textbook treatment of convolutional networks