
Convolutional Neural Networks from Scratch

The Architecture That Taught Machines to See

Before transformers conquered vision, one architecture ruled for an entire decade. From AlexNet's 2012 ImageNet shock — when a convolutional neural network crushed the competition by 10 percentage points — to the rise of Vision Transformers in 2020, CNNs were computer vision.

The Vision Transformers post built a ViT from scratch and showed that a standard transformer can classify images by treating them as sequences of patches. But that post also made claims about the architecture it was replacing — claims about locality, weight sharing, and hierarchical features. We never proved those claims. We just asserted them.

This post fixes that. We're going to build a convolutional neural network from raw NumPy, layer by layer, so you can see exactly what convolutions do, why they dominated vision for so long, and what fundamental limitation opened the door for transformers.

The story starts in 1962, with a cat and two neuroscientists.

David Hubel and Torsten Wiesel discovered that neurons in a cat's visual cortex respond to oriented edges in specific local regions of the visual field. Not the whole image, just a small patch. And the same type of edge detector appeared all across the visual field. These observations map directly onto the three core ideas behind CNNs: local connectivity, weight sharing, and translation equivariance.

What Is a Convolution?

Let's start with the naive approach. You have a 28×28 grayscale image — 784 pixels. If you wire every pixel to every neuron in a hidden layer of size 128, that's 784 × 128 = 100,352 parameters just for the first layer. And this layer has no concept of spatial structure. The pixel at position (0,0) is connected to the same neurons as the pixel at position (27,27). The network has to learn that nearby pixels are related, from scratch, every time.

A convolution bakes in a much smarter assumption: nearby pixels matter more than distant ones. Instead of connecting every pixel to every neuron, we slide a small filter (called a kernel) across the image, computing a weighted sum at each position.

output[i, j] = Σm Σn input[i+m, j+n] · kernel[m, n]

A 5×5 kernel has just 25 learnable parameters. It slides across the entire 28×28 image, producing a 24×24 output (the 5×5 window can start at positions 0 through 23 in each dimension). The same 25 parameters are reused at every position: this is weight sharing, and it's why CNNs are so parameter-efficient. (Strictly speaking, the formula above is cross-correlation; a true convolution flips the kernel first. Deep learning frameworks compute the unflipped version and call it convolution, and so will we.)
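The output-size arithmetic generalizes: with padding P and stride S, each spatial dimension maps to (N + 2P - K) // S + 1. Here is that formula as a sanity-check helper (not part of the original layer code, just the arithmetic in function form):

```python
def conv_out_size(n, k, padding=0, stride=1):
    """Output length along one spatial dimension of a convolution."""
    return (n + 2 * padding - k) // stride + 1

print(conv_out_size(28, 5))             # 24 -- the 5x5-on-28x28 case above
print(conv_out_size(28, 5, padding=2))  # 28 -- "same" padding preserves size
```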

import numpy as np

def conv2d(image, kernel):
    """Convolve a 2D image with a 2D kernel (no padding, stride=1)."""
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1
    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+k_h, j:j+k_w]
            output[i, j] = np.sum(patch * kernel)

    return output

# Sobel edge detector — finds vertical edges
sobel_x = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]], dtype=np.float32)

# Test on a 7x7 image with a vertical line in the middle
image = np.zeros((7, 7), dtype=np.float32)
image[:, 3] = 1.0  # vertical line at column 3

edges = conv2d(image, sobel_x)
print("Input shape:", image.shape)   # (7, 7)
print("Output shape:", edges.shape)  # (5, 5) — shrank by kernel_size - 1
print("Edge map:\n", edges)          # Strong response at the line's borders

That's it. Four nested operations: slide, extract patch, multiply, sum. Every convolutional layer in every CNN — from LeNet to ResNet to the convolutional stems in hybrid ViT models — is doing exactly this.
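The double loop is clear but slow. If you want speed while staying in NumPy, the same valid cross-correlation can be vectorized with sliding_window_view; this is a sketch, not part of the original conv2d, but it computes identical results:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_fast(image, kernel):
    """Vectorized valid cross-correlation using sliding windows."""
    # windows: (out_h, out_w, kH, kW) -- every patch, as a view (no copy)
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijmn,mn->ij', windows, kernel)

image = np.zeros((7, 7), dtype=np.float32)
image[:, 3] = 1.0
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
print(conv2d_fast(image, kernel).shape)  # (5, 5)
```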

Channels and Multiple Filters

Real images aren't grayscale — they have 3 channels (RGB). And one filter isn't enough: we want to detect many different features (horizontal edges, vertical edges, corners, gradients). So a real convolutional layer takes a multi-channel input and applies multiple filters.

Each filter spans all input channels. If the input has 3 channels and we use a 3×3 kernel, each filter has shape (3, 3, 3) = 27 parameters plus 1 bias. With 8 output filters, that's 8 × (27 + 1) = 224 parameters total — compared to 100K+ for a fully connected layer.

def conv2d_multi(x, kernels, biases, stride=1, padding=0):
    """Multi-channel convolution.
    x:       (C_in, H, W)
    kernels: (C_out, C_in, kH, kW)
    biases:  (C_out,)
    returns: (C_out, H_out, W_out)
    """
    c_out, c_in, k_h, k_w = kernels.shape
    _, h, w = x.shape

    # Apply zero padding
    if padding > 0:
        x = np.pad(x, ((0,0), (padding,padding), (padding,padding)))
        _, h, w = x.shape

    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    output = np.zeros((c_out, out_h, out_w))

    for f in range(c_out):            # for each output filter
        for i in range(out_h):
            for j in range(out_w):
                si, sj = i * stride, j * stride
                patch = x[:, si:si+k_h, sj:sj+k_w]   # (C_in, kH, kW)
                output[f, i, j] = np.sum(patch * kernels[f]) + biases[f]

    return output

# Example: 3 input channels (RGB), 8 output filters, 3x3 kernels
x = np.random.randn(3, 16, 16)           # RGB image, 16x16
kernels = np.random.randn(8, 3, 3, 3) * 0.1
biases = np.zeros(8)

out = conv2d_multi(x, kernels, biases)
print("Input:", x.shape)    # (3, 16, 16)
print("Output:", out.shape)  # (8, 14, 14) — 8 feature maps, each 14x14

Notice the output shape: 8 feature maps, each 14×14. Each feature map is the result of one filter scanning the entire image. One filter might learn to detect horizontal edges, another vertical edges, another diagonal gradients. The network decides what to detect — we just provide the mechanism.

Pooling — Compress and Survive

After convolution detects features, we need to downsample. Max pooling takes the maximum value in each small window (typically 2×2), cutting the spatial dimensions in half.

Why does this work? If a filter detects an edge somewhere in a 2×2 region, the exact position doesn't matter much — we just need to know the edge exists. Max pooling provides local translation invariance: the feature can shift by a pixel and the output doesn't change.

def max_pool2d(x, pool_size=2):
    """Max pooling over spatial dimensions.
    x:       (C, H, W)
    returns: (C, H//pool_size, W//pool_size)
    """
    c, h, w = x.shape
    out_h = h // pool_size
    out_w = w // pool_size
    output = np.zeros((c, out_h, out_w))

    for ch in range(c):
        for i in range(out_h):
            for j in range(out_w):
                si, sj = i * pool_size, j * pool_size
                window = x[ch, si:si+pool_size, sj:sj+pool_size]
                output[ch, i, j] = np.max(window)

    return output

# 8 feature maps, 14x14 each
features = np.random.randn(8, 14, 14)
pooled = max_pool2d(features)
print("Before pooling:", features.shape)  # (8, 14, 14)
print("After pooling:", pooled.shape)     # (8, 7, 7) — spatial dims halved

Each 2×2 pool reduces spatial dimensions by half and reduces the total number of values by 4×. This aggressive compression is how CNNs gradually distill a large image down to a compact feature vector.
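The translation-invariance claim from earlier is easy to verify. A minimal single-channel sketch, using a reshape trick instead of explicit loops (it assumes the spatial dimensions divide evenly by the pool size):

```python
import numpy as np

def max_pool_1ch(x, pool=2):
    """Single-channel max pooling via reshape (H, W must divide by pool)."""
    h, w = x.shape
    return x.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

# The same feature, shifted by one pixel within a 2x2 window...
a = np.zeros((4, 4)); a[2, 2] = 1.0
b = np.zeros((4, 4)); b[2, 3] = 1.0

# ...produces an identical pooled output: local translation invariance
print(np.array_equal(max_pool_1ch(a), max_pool_1ch(b)))  # True
```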

The Feature Hierarchy — Edges to Objects

Here's the insight that made CNNs so powerful: stacking convolutional layers creates a feature hierarchy, where each layer detects increasingly complex patterns.

The receptive field — how much of the original image a single neuron can "see" — grows with each layer. A 5×5 kernel in Layer 1 sees a 5×5 patch of the input. After a 2×2 max pool, each position already summarizes a 2×2 group. Add another 5×5 conv, and each neuron in that second conv layer effectively sees a 14×14 region of the original image.

This is exactly what the ViT post identified as a CNN's inductive bias: hierarchical features emerge naturally from the architecture. But here's the critical limitation — to connect the top-left corner with the bottom-right corner, you need many layers of conv + pool. A Vision Transformer gets global context in layer 1, via self-attention over all patches. CNNs have to earn it the hard way, one receptive field expansion at a time.
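That 14×14 number falls out of the standard receptive-field recurrence: each layer adds (k - 1) × jump to the field, and each stride multiplies the jump. A small sketch to check the arithmetic:

```python
def receptive_field(layers):
    """Receptive field of the last layer, given (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * jump
        jump *= s              # stride compounds across layers
    return rf

# conv 5x5 (stride 1) -> pool 2x2 (stride 2) -> conv 5x5 (stride 1)
print(receptive_field([(5, 1), (2, 2), (5, 1)]))  # 14
```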

Building LeNet: A Complete CNN

Yann LeCun's LeNet (1998) was one of the first CNNs to achieve commercial success — the US Postal Service used it to read handwritten zip codes. The architecture is beautifully simple, and everything we need to build it is already in our toolkit.

Input (1, 28, 28) → Conv 5×5 → ReLU → Pool 2×2 → Conv 5×5 → ReLU → Pool 2×2 → Flatten → FC → ReLU → FC → ReLU → FC → Output (10)

Let's trace the tensor shapes through each layer, starting from a 28×28 grayscale image:

  1. Input: (1, 28, 28) — 1 channel, 28×28 pixels
  2. Conv1(1→6, 5×5): (6, 24, 24) — 6 feature maps detecting basic patterns
  3. ReLU + MaxPool(2×2): (6, 12, 12) — spatial dimensions halved
  4. Conv2(6→16, 5×5): (16, 8, 8) — 16 richer feature maps
  5. ReLU + MaxPool(2×2): (16, 4, 4) — now just 4×4 spatial
  6. Flatten: 256-dimensional vector
  7. FC1(256→120) → FC2(120→84) → FC3(84→10)

Total parameter count: Conv1 has 6 × 1 × 5 × 5 + 6 = 156. Conv2 has 16 × 6 × 5 × 5 + 16 = 2,416. FC1: 256 × 120 + 120 = 30,840. FC2: 120 × 84 + 84 = 10,164. FC3: 84 × 10 + 10 = 850. Grand total: 44,426 parameters. A fully connected network over the same 784 pixels with equivalent hidden layers would need millions.
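The parameter arithmetic is easy to verify in a few lines:

```python
conv1 = 6 * 1 * 5 * 5 + 6        # 156 weights + biases
conv2 = 16 * 6 * 5 * 5 + 16      # 2,416
fc1 = 256 * 120 + 120            # 30,840
fc2 = 120 * 84 + 84              # 10,164
fc3 = 84 * 10 + 10               # 850
total = conv1 + conv2 + fc1 + fc2 + fc3
print(total)  # 44426

# The naive fully connected layer from the intro, for comparison:
print(784 * 128 + 128)  # 100480 -- one hidden layer already dwarfs all of LeNet
```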

def relu(x):
    return np.maximum(0, x)

class LeNet:
    def __init__(self):
        # He initialization (scale sqrt(2/fan_in)) — suited to ReLU layers
        self.conv1_w = np.random.randn(6, 1, 5, 5) * np.sqrt(2.0 / 25)
        self.conv1_b = np.zeros(6)
        self.conv2_w = np.random.randn(16, 6, 5, 5) * np.sqrt(2.0 / 150)
        self.conv2_b = np.zeros(16)
        self.fc1_w = np.random.randn(256, 120) * np.sqrt(2.0 / 256)
        self.fc1_b = np.zeros(120)
        self.fc2_w = np.random.randn(120, 84) * np.sqrt(2.0 / 120)
        self.fc2_b = np.zeros(84)
        self.fc3_w = np.random.randn(84, 10) * np.sqrt(2.0 / 84)
        self.fc3_b = np.zeros(10)

    def forward(self, x):
        """Forward pass. x shape: (1, 28, 28)"""
        # Conv block 1
        x = conv2d_multi(x, self.conv1_w, self.conv1_b)  # (6, 24, 24)
        x = relu(x)
        x = max_pool2d(x)                                 # (6, 12, 12)

        # Conv block 2
        x = conv2d_multi(x, self.conv2_w, self.conv2_b)  # (16, 8, 8)
        x = relu(x)
        x = max_pool2d(x)                                 # (16, 4, 4)

        # Flatten and fully connected layers
        x = x.reshape(-1)                                  # (256,)
        x = relu(x @ self.fc1_w + self.fc1_b)             # (120,)
        x = relu(x @ self.fc2_w + self.fc2_b)             # (84,)
        x = x @ self.fc3_w + self.fc3_b                   # (10,) — raw logits

        return x

net = LeNet()
dummy = np.random.randn(1, 28, 28)
logits = net.forward(dummy)
print("Logits shape:", logits.shape)  # (10,)
print("Prediction:", np.argmax(logits))

That's a complete convolutional neural network in ~40 lines. The convolutional layers extract spatial features while being parameter-efficient. The fully connected layers at the end are just standard feed-forward networks that map the extracted features to class scores.

Training on Synthetic Digits

Let's train our LeNet on a dataset we generate ourselves — simple 28×28 images of geometric patterns representing digits. We'll implement the full training loop with cross-entropy loss and a basic SGD optimizer.

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_loss(logits, target):
    probs = softmax(logits)
    return -np.log(probs[target] + 1e-10)

def make_synthetic_digit(label, size=28):
    """Generate a simple synthetic image for a digit class."""
    img = np.zeros((size, size), dtype=np.float32)
    rng = np.random
    cx, cy = size // 2, size // 2

    if label == 0:     # circle
        for y in range(size):
            for x in range(size):
                if 8 <= ((x-cx)**2 + (y-cy)**2)**0.5 <= 11:
                    img[y, x] = 1.0
    elif label == 1:   # vertical line
        img[4:-4, cx-1:cx+1] = 1.0
    elif label == 2:   # horizontal line
        img[cy-1:cy+1, 4:-4] = 1.0
    elif label == 3:   # diagonal (top-left to bottom-right)
        for k in range(-1, 2):
            np.fill_diagonal(img[max(0,k):, max(0,-k):], 1.0)
    elif label == 4:   # cross (+)
        img[cy-1:cy+1, 4:-4] = 1.0
        img[4:-4, cx-1:cx+1] = 1.0
    elif label == 5:   # X shape
        for k in range(-1, 2):
            np.fill_diagonal(img[max(0,k):, max(0,-k):], 1.0)
            np.fill_diagonal(np.fliplr(img)[max(0,k):, max(0,-k):], 1.0)
    elif label == 6:   # square
        img[5:23, 5:7] = img[5:23, 21:23] = 1.0
        img[5:7, 5:23] = img[21:23, 5:23] = 1.0
    elif label == 7:   # triangle (top)
        for row in range(6, 24):
            half_w = (row - 6) * 10 // 18
            img[row, cx-half_w:cx+half_w+1] = 1.0
    elif label == 8:   # diamond
        for row in range(size):
            dist = abs(row - cy)
            half_w = max(0, 10 - dist)
            if half_w > 0:
                img[row, cx-half_w:cx+half_w+1] = 1.0
    elif label == 9:   # small dot
        img[cy-3:cy+3, cx-3:cx+3] = 1.0

    # Add slight noise for variety
    img += rng.randn(size, size) * 0.05
    return np.clip(img, 0, 1)

# Generate dataset: 100 samples per class
X_train, y_train = [], []
for label in range(10):
    for _ in range(100):
        X_train.append(make_synthetic_digit(label))
        y_train.append(label)

X_train = np.array(X_train).reshape(-1, 1, 28, 28)
y_train = np.array(y_train)

# Shuffle
perm = np.random.permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]
print(f"Dataset: {len(X_train)} images, {len(set(y_train))} classes")

For a full training loop, we'd implement backpropagation through the convolutional layers — which involves convolving with flipped kernels to compute gradients (the gradient of a convolution is itself a convolution). This connects directly to the chain rule foundations we built in micrograd. In practice, frameworks like PyTorch handle this automatically, but the principle is the same engine we've been building throughout this series.
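The "gradient of a convolution is itself a convolution" claim can be checked numerically. For our unpadded cross-correlation, the gradient of the loss with respect to the kernel is conv2d(image, upstream_grad). A self-contained sketch (conv2d redefined here so the snippet runs on its own):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation, same as the version built earlier."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))
k = rng.standard_normal((3, 3))
g = rng.standard_normal((4, 4))   # pretend upstream gradient dL/dOut

# Analytic kernel gradient: dL/dK = conv2d(img, g) -- a convolution again
dk = conv2d(img, g)

# Numerical gradient via central differences
eps = 1e-6
num = np.zeros_like(k)
for m in range(3):
    for n in range(3):
        kp, km = k.copy(), k.copy()
        kp[m, n] += eps
        km[m, n] -= eps
        num[m, n] = (np.sum(conv2d(img, kp) * g) - np.sum(conv2d(img, km) * g)) / (2 * eps)

print(np.allclose(dk, num, atol=1e-5))  # True
```

(The gradient with respect to the input works the same way, but uses a full convolution with the kernel flipped.)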

BatchNorm — The CNN Stabilizer

Batch Normalization (Ioffe & Szegedy, 2015) was transformative for CNN training. The idea: normalize activations across the batch at each layer, then apply learnable scale and shift parameters.

For convolutional layers, we normalize per channel across the batch and spatial dimensions. This means each of the 6 or 16 feature maps gets its own mean and variance, computed over all spatial positions and all images in the batch.

def batch_norm_2d(x_batch, gamma, beta, eps=1e-5):
    """Batch normalization for convolutional layers.
    x_batch: (N, C, H, W) — batch of feature maps
    gamma:   (C,) — learnable scale per channel
    beta:    (C,) — learnable shift per channel
    returns: (N, C, H, W) — normalized feature maps
    """
    n, c, h, w = x_batch.shape

    # Mean and variance per channel, across batch + spatial dims
    mean = x_batch.mean(axis=(0, 2, 3), keepdims=True)   # (1, C, 1, 1)
    var  = x_batch.var(axis=(0, 2, 3), keepdims=True)     # (1, C, 1, 1)

    # Normalize
    x_norm = (x_batch - mean) / np.sqrt(var + eps)

    # Scale and shift (reshape gamma/beta to broadcast)
    gamma = gamma.reshape(1, c, 1, 1)
    beta = beta.reshape(1, c, 1, 1)

    return gamma * x_norm + beta

# Example: batch of 4 images, 6 feature maps each
batch = np.random.randn(4, 6, 12, 12)
gamma = np.ones(6)    # initialized to 1 (identity scaling)
beta = np.zeros(6)    # initialized to 0 (no shift)

normed = batch_norm_2d(batch, gamma, beta)
print("Channel means after BN:", normed.mean(axis=(0,2,3)).round(4))  # ~0
print("Channel stds after BN:", normed.std(axis=(0,2,3)).round(4))   # ~1

BatchNorm smooths the loss landscape, allows higher learning rates, and reduces sensitivity to initialization. It was essential for training deep CNNs like ResNet. But it has a key weakness: it normalizes across the batch dimension, which fails when batch sizes are small or sequences have variable lengths. This is exactly why transformers use LayerNorm instead — normalizing within each sample, independent of the batch.
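For contrast, here is a minimal LayerNorm sketch: statistics are computed per sample rather than per batch, so batch size is irrelevant (learnable gamma/beta omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample over its own channels and spatial positions."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Works even for a batch of one, where BatchNorm's statistics become unreliable
single = np.random.randn(1, 6, 12, 12)
out = layer_norm(single)
print(round(float(out.mean()), 4), round(float(out.std()), 4))  # ~0.0, ~1.0
```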

CNN vs ViT: The Great Tradeoff

Now that we've built a CNN from scratch, we can see exactly what the ViT post meant by "inductive biases." A CNN has four built-in assumptions about images:

  1. Locality — each conv kernel only looks at a small patch (we built this in conv2d)
  2. Weight sharing — the same kernel slides everywhere (the kernel doesn't change between positions)
  3. Translation equivariance — detect a cat at (5,5), same activation as at (15,15) (consequence of weight sharing)
  4. Hierarchical features — stacking conv + pool grows receptive fields from local to global

These biases are a gift on small datasets — the network doesn't need to discover spatial structure from data because it's baked into the architecture. But they're a curse on large datasets, because the network can't learn patterns that violate these assumptions.

Dataset size         | Winner              | Why
Small (< 1M images)  | CNN                 | Inductive biases compensate for limited data
Medium (1–14M)       | CNN (slight edge)   | Priors still help; ViT closing the gap with DeiT
Large (> 14M images) | ViT                 | Flexibility beats assumptions — global attention wins
Any size (hybrid)    | CNN stem + ViT body | Best of both: local features early, global attention late

CNNs aren't dead. They serve as teacher networks for knowledge distillation, power efficient mobile architectures (MobileNet, EfficientNet), and form the stems of hybrid models. And the U-Net architecture, a CNN with skip connections, was the backbone of diffusion models until DiT (Diffusion Transformers) started replacing it.

Try It: Convolutional Neural Networks

Interactive panels from the original post:

Panel 1: Convolution Explorer. Select a kernel and watch it slide across the image; the output feature map builds up one pixel at a time (3×3 kernel, 12×12 input, 10×10 output, 9 parameters).

Panel 2: Feature Map Visualizer. See what each layer of a trained CNN "sees": Layer 1 detects edges (4 filters, 3×3), Layer 2 combines edges into textures (8 filters, 3×3). Click a feature map to highlight its filter.

Panel 3: Receptive Field Race, CNN vs ViT. Step through layers and watch how information propagates: the CNN's receptive field starts at 1×1 and grows slowly, while the ViT's is global from layer 1.

Connections to the Series

CNNs touch almost every concept we've built in the elementary series: the chain rule from micrograd drives backpropagation through every convolutional layer, ReLU and softmax reappear unchanged, and the inductive-bias claims from the Vision Transformers post are the ones we just made concrete.
