← Back to Blog

Semantic Segmentation from Scratch: Classifying Every Pixel in an Image

1. From Boxes to Pixels

Object detection says "there's a cat somewhere in this rectangle." Semantic segmentation says "these exact 12,847 pixels are cat, those 8,291 pixels are couch, and everything else is background." It's the difference between pointing at something and tracing its exact outline — and it unlocks a whole new class of applications.

Visual understanding comes in three levels. Classification answers "what's in this image?" — one label for the whole image. Detection answers "what and where?" — bounding boxes around objects. Segmentation answers "what is every single pixel?" — a label map at full image resolution. The output of semantic segmentation is an H×W grid where each cell contains a class ID: road, car, pedestrian, sky, building, tree.

There are actually three flavors of segmentation. Semantic segmentation labels every pixel by class but doesn't distinguish individual objects — two cars get the same "car" label. Instance segmentation separates individual objects — car_1 vs car_2. Panoptic segmentation does both: "stuff" classes (sky, road) get semantic labels, while "thing" classes (cars, people) get instance IDs. This post focuses on semantic segmentation — the foundation that the others build upon. You need pixel-level class labels before you can separate instances.
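To make the distinction concrete, here's a toy sketch (illustrative hand-written arrays, not real model output) of how semantic and instance labels differ for a scene with two cars:

```python
import numpy as np

# Semantic segmentation: one class ID per pixel (0=background, 1=car).
# Both cars get the same "car" label -- objects are not distinguished.
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
])

# Instance segmentation: each object gets its own ID (0 = no object).
# Same pixels, but car_1 and car_2 are now separable.
instance = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
])

print("semantic classes:", np.unique(semantic))  # [0 1]
print("instance ids:    ", np.unique(instance))  # [0 1 2]
```

Panoptic segmentation would carry both arrays at once: semantic IDs for "stuff," instance IDs for "things."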

2. Fully Convolutional Networks — The Foundation

In 2015, Long, Shelhamer, and Darrell asked a simple question: can we repurpose a classification CNN for pixel-level prediction? A classification network like VGG takes a 224×224 image and produces a single 1000-class probability vector. Five pooling layers, each halving spatial dimensions, shrink the feature map from 224×224 down to 7×7. A fully connected layer then collapses everything to a single prediction.

The key insight of the Fully Convolutional Network (FCN): replace those fully connected layers with 1×1 convolutions. Now the network outputs a 7×7 grid of class predictions instead of a single vector. Each spatial position predicts a class. Upsample that 7×7 prediction back to 224×224 using bilinear interpolation, and you have a segmentation map.

It works — but it's blurry. Five rounds of 2× pooling compress the image by 32×. A 224×224 image becomes a 7×7 prediction, and upsampling by 32× can't recover the fine boundary details that pooling destroyed. The prediction looks like a low-resolution mosaic: roughly correct but with blocky, imprecise edges. FCN-8s improved this by fusing predictions from earlier layers (8× downsampled instead of 32×), but the fundamental problem remained: pooling throws away spatial information, and you can't upsample it back.

import numpy as np

def fcn_segment(feature_map, num_classes, target_h, target_w):
    """Minimal FCN head: 1x1 conv + nearest-neighbor upsampling.
    (The original FCN used bilinear-initialized deconvolution; nearest-neighbor
    keeps this sketch simple and shows the same blockiness.)"""
    feat_h, feat_w, channels = feature_map.shape

    # 1x1 convolution: project each spatial position to class scores
    weights = np.random.randn(channels, num_classes) * 0.01
    # Reshape to (H*W, C) @ (C, K) -> (H*W, K) -> (H, W, K)
    flat = feature_map.reshape(-1, channels)
    scores = flat @ weights  # [feat_h*feat_w, num_classes]
    coarse_map = scores.reshape(feat_h, feat_w, num_classes)
    coarse_labels = coarse_map.argmax(axis=2)  # [feat_h, feat_w]

    # Nearest-neighbor upsampling to original resolution
    upsampled = np.zeros((target_h, target_w), dtype=int)
    for y in range(target_h):
        for x in range(target_w):
            src_y = y * feat_h / target_h
            src_x = x * feat_w / target_w
            upsampled[y, x] = coarse_labels[
                min(int(src_y), feat_h - 1),
                min(int(src_x), feat_w - 1)
            ]
    return coarse_labels, upsampled

# Simulate: 7x7 feature map from a backbone, 4 classes
feat = np.random.randn(7, 7, 512)
coarse, full = fcn_segment(feat, num_classes=4, target_h=224, target_w=224)
print(f"Coarse map: {coarse.shape}")   # (7, 7) — very blocky
print(f"Full map:   {full.shape}")     # (224, 224) — upsampled but blurry
print(f"Each coarse pixel covers {224//7}x{224//7} = {(224//7)**2} output pixels")

Each cell in the 7×7 coarse map covers a 32×32 pixel region in the output — over 1,000 pixels share the same prediction. That's why FCN boundaries are blocky. The fix? Don't throw away spatial information in the first place.
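The FCN-8s fusion described above can be sketched in a few lines. This is a simplified stand-in (nearest-neighbor upsampling and plain addition instead of the paper's learned deconvolutions), assuming each pooling stage has already been projected to per-class scores:

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbor 2x upsampling of an (H, W, K) score map."""
    return np.repeat(np.repeat(scores, 2, axis=0), 2, axis=1)

def fcn8s_fuse(pool3_scores, pool4_scores, pool5_scores):
    """FCN-8s-style multi-scale fusion (simplified sketch).

    pool3_scores: (4H, 4W, K) -- 1/8-resolution class scores
    pool4_scores: (2H, 2W, K) -- 1/16-resolution class scores
    pool5_scores: (H,  W,  K) -- 1/32-resolution class scores
    """
    fused = upsample2x(pool5_scores) + pool4_scores  # now 1/16 resolution
    fused = upsample2x(fused) + pool3_scores         # now 1/8 resolution
    return fused  # one final 8x upsample yields the full-resolution map

K = 4
p5 = np.random.randn(7, 7, K)    # deepest, coarsest scores
p4 = np.random.randn(14, 14, K)
p3 = np.random.randn(28, 28, K)
out = fcn8s_fuse(p3, p4, p5)
print(out.shape)  # (28, 28, 4) -- 8x downsampled instead of 32x
```

The final prediction is only 8× downsampled rather than 32×, which is exactly why FCN-8s boundaries are sharper — but the shallow-feature information is still merged only at the score level, which is the limitation U-Net removes.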

3. U-Net — Skip Connections Save the Day

U-Net (Ronneberger et al. 2015) solved the boundary problem with an elegant architectural insight. It uses an encoder-decoder structure: the encoder (a standard CNN backbone) progressively downsamples to extract semantic features, and the decoder progressively upsamples to recover spatial resolution. The magic is in the skip connections that bridge the two paths.

Why do skip connections matter? Consider what each layer of the encoder knows. Deep layers (near the bottleneck) have large receptive fields — they understand what something is ("this region is a cat") but have lost spatial precision (they operate at 1/16 or 1/32 resolution). Shallow layers (near the input) preserve fine spatial detail — they know where edges and textures are — but they don't know what those edges belong to.

U-Net's skip connections concatenate encoder features with decoder features at each resolution level. The decoder gets the best of both worlds: "what" from the deep features and "where" from the shallow features. The result is sharp, class-correct boundaries — a dramatic improvement over the blurry FCN output.

The architecture forms a U-shape: the encoder path descends (spatial resolution shrinks, channel count grows), the decoder path ascends (spatial resolution grows, channel count shrinks), and horizontal skip connections bridge the two at each level. Originally designed for biomedical image segmentation with very few training images, U-Net's design proved universal — it's now the most cited segmentation paper in history (~80,000 citations) and the backbone of everything from satellite imagery analysis to diffusion models.

import numpy as np

def conv_block(x, out_channels):
    """Simplified conv block: linear projection (simulates conv + ReLU)."""
    in_channels = x.shape[-1]
    W = np.random.randn(in_channels, out_channels) * 0.01
    h, w, c = x.shape
    return np.maximum(0, x.reshape(-1, c) @ W).reshape(h, w, out_channels)

def downsample(x):
    """Stride-2 spatial subsampling (simulates pooling)."""
    h, w, c = x.shape
    return x[:h//2*2:2, :w//2*2:2, :]  # simple stride-2 subsampling

def upsample(x, target_h, target_w):
    """Nearest-neighbor 2x upsampling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)[:target_h, :target_w, :]

def unet_forward(image, num_classes=4):
    """Simplified U-Net: 3-level encoder-decoder with skip connections."""
    # Encoder (downsampling path)
    e1 = conv_block(image, 64)           # Level 1: full resolution
    e2 = conv_block(downsample(e1), 128) # Level 2: 1/2 resolution
    e3 = conv_block(downsample(e2), 256) # Level 3: 1/4 resolution
    bottleneck = conv_block(downsample(e3), 512)  # 1/8 resolution

    # Decoder (upsampling path with skip connections)
    d3_up = upsample(bottleneck, e3.shape[0], e3.shape[1])
    d3 = conv_block(np.concatenate([d3_up, e3], axis=2), 256)  # skip!

    d2_up = upsample(d3, e2.shape[0], e2.shape[1])
    d2 = conv_block(np.concatenate([d2_up, e2], axis=2), 128)  # skip!

    d1_up = upsample(d2, e1.shape[0], e1.shape[1])
    d1 = conv_block(np.concatenate([d1_up, e1], axis=2), 64)   # skip!

    # Final 1x1 conv for pixel classification
    W_out = np.random.randn(64, num_classes) * 0.01
    h, w, c = d1.shape
    logits = d1.reshape(-1, c) @ W_out
    return logits.reshape(h, w, num_classes).argmax(axis=2)

# Example: 64x64 "image" with 3 channels
image = np.random.randn(64, 64, 3)
seg_map = unet_forward(image, num_classes=4)
print(f"Input: {image.shape} -> Segmentation: {seg_map.shape}")
# Input: (64, 64, 3) -> Segmentation: (64, 64) — full resolution preserved!

Notice the key operation: np.concatenate([d3_up, e3], axis=2). The upsampled decoder features are concatenated with the encoder features from the same level. This doubles the channel count but gives the decoder access to the fine-grained spatial details that the encoder preserved. Compare this to ResNet's skip connections, which add element-wise — U-Net concatenates along the channel dimension, a richer fusion that lets the decoder learn which information to use from each source.
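The shape constraints make the difference between the two skip styles tangible — a tiny sketch:

```python
import numpy as np

h, w = 32, 32
decoder_feat = np.random.randn(h, w, 256)
encoder_feat = np.random.randn(h, w, 64)

# ResNet-style additive skip: shapes must match exactly, channels included.
# decoder_feat + encoder_feat  # would raise ValueError: 256 vs 64 channels

# U-Net-style concatenation: channel counts can differ freely, only the
# spatial resolution must match. A following conv learns how to mix them.
fused = np.concatenate([decoder_feat, encoder_feat], axis=2)
print(fused.shape)  # (32, 32, 320)
```

Addition forces the two sources into the same representation space; concatenation keeps them separate and defers the fusion to learned weights.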

Try It: Segmentation Quality — FCN vs U-Net


Toggle between the three methods above. FCN-32s produces blocky 32×32 pixel blocks — like looking through a mosaic. FCN-8s sharpens to 8×8 blocks by fusing multi-scale predictions. U-Net achieves pixel-level boundaries thanks to skip connections preserving spatial detail. Click "New Scene" to see how each method handles different shape arrangements.

4. Dilated Convolutions — Seeing More Without Shrinking

Encoder-decoder architectures downsample and upsample. But what if you could expand the network's receptive field without reducing resolution? Dilated convolutions (also called atrous convolutions) do exactly that.

A standard 3×3 convolution looks at a 3×3 patch of the input. A dilated convolution with rate r inserts (r - 1) zeros between each kernel element, creating an effective kernel that covers a much larger area without adding parameters. At rate 1 (standard), a 3×3 kernel sees 3×3 = 9 pixels. At rate 2, it sees a 5×5 region (but still only 9 non-zero weights). At rate 4, it sees a 9×9 region — all from the same 9-parameter kernel.

DeepLab (Chen et al. 2017) exploited this with Atrous Spatial Pyramid Pooling (ASPP): apply parallel dilated convolutions at rates 6, 12, and 18, plus a 1×1 convolution and global average pooling, then concatenate all branches. Each rate captures context at a different scale, and the concatenation gives the network multi-scale understanding in a single layer. The advantage over encoder-decoder: no pooling means no information loss. The disadvantage: keeping feature maps at full resolution is memory-intensive.

import numpy as np

def dilated_conv2d(image, kernel, dilation_rate=1):
    """2D dilated convolution from scratch."""
    H, W = image.shape
    kh, kw = kernel.shape
    # Effective kernel size with dilation
    eff_kh = kh + (kh - 1) * (dilation_rate - 1)
    eff_kw = kw + (kw - 1) * (dilation_rate - 1)
    pad_h, pad_w = eff_kh // 2, eff_kw // 2

    # Pad input
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode='constant')
    output = np.zeros((H, W))

    for y in range(H):
        for x in range(W):
            val = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    # pad offset cancels: (y + pad_h) + ky*rate - center*rate
                    py = y + ky * dilation_rate
                    px = x + kx * dilation_rate
                    if 0 <= py < padded.shape[0] and 0 <= px < padded.shape[1]:
                        val += padded[py, px] * kernel[ky, kx]
            output[y, x] = val
    return output

# Edge detection kernel
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)

image = np.random.rand(16, 16)
for rate in [1, 2, 4]:
    out = dilated_conv2d(image, kernel, dilation_rate=rate)
    eff_size = 3 + (3 - 1) * (rate - 1)
    print(f"Rate {rate}: effective {eff_size}x{eff_size} receptive field, output {out.shape}")

The same 3×3 kernel captures progressively larger context: rate 1 sees 3×3, rate 2 sees 5×5, rate 4 sees 9×9. ASPP runs all three in parallel and concatenates the results, giving the network multi-scale context without losing a single pixel of resolution. One caveat: regular dilation patterns can create "gridding artifacts" — a checkerboard pattern of pixels that the kernel never touches. Hybrid dilation schedules (mixing rates 1, 2, 3 instead of 2, 4, 8) mitigate this.
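ASPP's parallel-branch structure can be sketched compactly. This is an illustrative single-channel version with random stand-in kernels (a real ASPP uses learned multi-channel convolutions and 1×1 projections), using a vectorized same-padding dilated conv:

```python
import numpy as np

def dilated_conv(x, kernel, rate):
    """Same-padding 3x3 dilated convolution on a single-channel map."""
    H, W = x.shape
    p = np.pad(x, rate, mode='constant')  # pad = rate for a 3x3 kernel
    out = np.zeros((H, W))
    for ky in range(3):
        for kx in range(3):
            out += kernel[ky, kx] * p[ky*rate:ky*rate+H, kx*rate:kx*rate+W]
    return out

def aspp(feature, rates=(6, 12, 18)):
    """Sketch of Atrous Spatial Pyramid Pooling: a 1x1 branch, parallel
    dilated convs at several rates, and a global-average-pooling branch,
    all concatenated along a channel axis."""
    branches = [feature * 0.5]  # stand-in for the 1x1 conv branch
    for r in rates:
        k = np.random.randn(3, 3) * 0.1  # stand-in for a learned kernel
        branches.append(dilated_conv(feature, k, r))
    # Global context branch: pool to a scalar, broadcast back to full size
    branches.append(np.full_like(feature, feature.mean()))
    return np.stack(branches, axis=2)  # (H, W, num_branches)

feat = np.random.randn(32, 32)
out = aspp(feat)
print(out.shape)  # (32, 32, 5) -- 1x1 + three dilated rates + global pool
```

Every branch runs at full input resolution; only the receptive field varies. That's the whole trick — multi-scale context with zero downsampling.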

5. Loss Functions for Pixel-Level Prediction

Segmentation models predict a class for every pixel, so the loss function operates at every pixel too. The simplest approach is pixel-wise cross-entropy: compute cross-entropy at each pixel independently, then average. For a 512×512 image with 20 classes, that's 262,144 individual classification problems per training step.

The catch is class imbalance. In a driving scene, "road" might cover 60% of pixels while "pedestrian" covers 0.5%. Minimizing average cross-entropy, the model can achieve low loss by predicting "road" everywhere and ignoring rare but safety-critical classes like pedestrians. Weighted cross-entropy helps: weight each class inversely proportional to its frequency, so errors on rare classes are penalized more heavily.

Dice loss takes a fundamentally different approach. Instead of treating pixels independently, it measures the overlap between the predicted mask and the ground truth: Dice = 2|A∩B| / (|A| + |B|). For soft predictions, this becomes Dice = 2·Σ(p·g) / (Σp + Σg), which is differentiable. Dice loss directly optimizes the metric we care about — overlap — and naturally handles imbalance because it cares about the ratio of correctly predicted foreground to total foreground, not the absolute number of pixels. It's the default choice for binary medical segmentation where the target (a small tumor) might cover only 1% of the image.

import numpy as np

def pixel_cross_entropy(pred_logits, target, class_weights=None):
    """Pixel-wise cross-entropy loss for segmentation."""
    H, W, C = pred_logits.shape
    # Softmax per pixel
    exp_logits = np.exp(pred_logits - pred_logits.max(axis=2, keepdims=True))
    probs = exp_logits / exp_logits.sum(axis=2, keepdims=True)

    loss = 0.0
    for y in range(H):
        for x in range(W):
            c = target[y, x]
            p = np.clip(probs[y, x, c], 1e-7, 1.0)
            w = class_weights[c] if class_weights is not None else 1.0
            loss -= w * np.log(p)
    return loss / (H * W)

def dice_loss(pred_probs, target_mask):
    """Dice loss for binary segmentation."""
    pred = pred_probs.flatten()
    target = target_mask.flatten().astype(float)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection) / (pred.sum() + target.sum() + 1e-7)
    return 1.0 - dice

def mean_iou(pred_labels, target_labels, num_classes):
    """Mean Intersection-over-Union metric."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred_labels == c)
        target_c = (target_labels == c)
        intersection = (pred_c & target_c).sum()
        union = (pred_c | target_c).sum()
        if union > 0:
            ious.append(intersection / union)
    return np.mean(ious)

# Demo: tiny segmentation with severe imbalance (5% foreground)
H, W = 20, 20
target = np.zeros((H, W), dtype=int)
target[8:12, 8:12] = 1  # small foreground square (16 of 400 = 4%)

# Model A: predicts all background (achieves 96% pixel accuracy!)
pred_all_bg = np.zeros((H, W), dtype=int)
acc_bg = (pred_all_bg == target).mean()
dice_bg = 1.0 - dice_loss(np.zeros((H, W)), target)

# Model B: finds the foreground
pred_good = target.copy()
pred_good[7:13, 7:13] = 1  # slightly oversized prediction
dice_good = 1.0 - dice_loss(pred_good.astype(float), target)

print(f"All-background: accuracy={acc_bg:.0%}, Dice={dice_bg:.3f}")
print(f"Good prediction: accuracy={(pred_good==target).mean():.0%}, Dice={dice_good:.3f}")

The "all background" model gets 96% pixel accuracy — a metric that says it's excellent! But its Dice score is 0.000 — it found zero foreground. Even more telling: the good model's pixel accuracy is lower at 95%, because it "wastes" predictions on the small foreground. Yet its Dice score of 0.615 correctly reflects that it actually found the target. This is why segmentation always uses mIoU or Dice for evaluation, not pixel accuracy. In practice, many models combine CE + Dice: cross-entropy for stable gradients, Dice for imbalance handling.
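That CE + Dice combination is straightforward to write down. A minimal binary-segmentation sketch (a plain 50/50 weighting; real setups tune the mix and use multi-class CE):

```python
import numpy as np

def dice_loss(pred_probs, target_mask, eps=1e-7):
    """Soft Dice loss: 1 - 2|A.B| / (|A| + |B|)."""
    pred = pred_probs.flatten()
    target = target_mask.flatten().astype(float)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter) / (pred.sum() + target.sum() + eps)

def bce_loss(pred_probs, target_mask, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the image."""
    p = np.clip(pred_probs, eps, 1 - eps)
    t = target_mask.astype(float)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

def combined_loss(pred_probs, target_mask, ce_weight=0.5):
    """CE for stable, well-scaled per-pixel gradients; Dice for overlap."""
    return (ce_weight * bce_loss(pred_probs, target_mask)
            + (1 - ce_weight) * dice_loss(pred_probs, target_mask))

# Same imbalanced setup as above: 4% foreground
target = np.zeros((20, 20)); target[8:12, 8:12] = 1
pred = np.full((20, 20), 0.1); pred[8:12, 8:12] = 0.9  # a decent prediction
print(f"combined loss: {combined_loss(pred, target):.3f}")  # -> 0.343
```

The CE term dominates early in training when predictions are near-uniform; the Dice term takes over as the foreground region emerges and overlap becomes the binding constraint.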

Try It: Dice Loss vs Cross-Entropy on Imbalanced Data

Click "Train 50 Steps" a few times and watch the difference. The CE model (left) struggles to find the small foreground circle because its gradients, averaged across mostly-background pixels, push everything toward "predict background." The Dice model (right) converges much faster because Dice loss directly optimizes overlap — even a small foreground region gets strong gradient signal. Try adjusting the foreground size: smaller circles make the imbalance worse and the CE failure more dramatic.

6. Post-Processing and Boundary Refinement

Raw CNN outputs are often "blobby" — they get the overall region right but produce jagged, imprecise boundaries. Conditional Random Fields (CRFs) were the original fix. A CRF models pairwise relationships between pixels: if two nearby pixels have similar colors, they probably share the same label. Dense CRF (Krähenbühl & Koltun 2011) applies this efficiently using Gaussian filtering, refining the entire image in ~0.5 seconds.

DeepLab v1 and v2 used CRF as a fixed post-processing step, and the boundary improvement was dramatic. But by DeepLab v3+, the CNN itself had gotten good enough that CRF barely helped — better architectures and training made post-processing unnecessary. Modern alternatives include boundary-aware loss terms (penalize errors near edges more heavily), learned edge detection heads that predict boundaries explicitly, and test-time augmentation (predict on multiple scales and flips, then average — and remember, for segmentation the mask must be transformed alongside the image). The trend is clear: as architectures improve, post-processing fades away.
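The flip-TTA gotcha — transforming the mask alongside the image — fits in a few lines. A sketch with a hypothetical `predict` callable standing in for a real model:

```python
import numpy as np

def tta_flip(image, predict):
    """Horizontal-flip test-time augmentation. `predict` is any callable
    mapping an (H, W) image to (H, W, K) class probabilities."""
    probs = predict(image)
    flipped = predict(image[:, ::-1])       # predict on the mirrored image
    return (probs + flipped[:, ::-1]) / 2.0  # flip the *prediction* back!

# Toy stand-in "model": P(class 1) = sigmoid of the pixel value
def toy_predict(img):
    p1 = 1 / (1 + np.exp(-img))
    return np.stack([1 - p1, p1], axis=2)

image = np.random.randn(8, 8)
avg = tta_flip(image, toy_predict)
print(avg.shape)  # (8, 8, 2)
```

Forgetting the second `[:, ::-1]` averages a prediction with its mirror image — a silent bug that blurs every boundary. (The toy model here is pointwise, so flipping changes nothing; real convolutional models are not flip-invariant, which is exactly why TTA helps.)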

7. Modern Architectures — Transformers and Foundation Models

The latest chapter in segmentation replaces CNN backbones with vision transformers. SegFormer (Xie et al. 2021) uses a hierarchical Mix Transformer (MiT) encoder that produces multi-scale features, paired with an MLP decoder so lightweight it's almost embarrassing — four linear layers fuse the multi-scale features and produce the final segmentation. No complex decoder needed when the transformer features are rich enough.

Mask2Former (Cheng et al. 2022) unified semantic, instance, and panoptic segmentation into a single architecture using masked attention: each query attends only to pixels within its predicted mask region, focusing computation where it matters.

Then came SAM — the Segment Anything Model (Kirillov et al. 2023). Trained on 11 million images with over 1 billion masks, SAM is a promptable segmentation foundation model: give it a point, a box, or a rough mask, and it segments the corresponding object (the paper also explored text prompts). It generalizes zero-shot to objects it has never seen. SAM represents a philosophical shift in segmentation: from training task-specific models on small labeled datasets to using foundation models that transfer everywhere. The trajectory from FCN to U-Net to DeepLab to SAM mirrors deep learning's broader arc — from hand-crafted components to end-to-end learning at scale.

Conclusion

Semantic segmentation takes computer vision from "there's a cat" to "these exact pixels are cat." We traced the full arc: FCN showed that classification networks can be repurposed for dense prediction, but pooling-based downsampling produces blurry results. U-Net fixed this with skip connections that preserve spatial detail — an idea so powerful it appears in nearly every modern architecture. Dilated convolutions offered an alternative: expand the receptive field without shrinking the feature map. Dice loss solved the class imbalance problem that pixel-wise cross-entropy ignores. And the field continues to evolve — from hand-crafted CRF post-processing to transformer-based architectures to foundation models like SAM that segment anything, zero-shot.

The core lesson from this journey? Preserving spatial information is everything. Every major advance — skip connections, dilated convolutions, multi-scale feature fusion — is fundamentally about retaining the fine-grained "where" that classification networks throw away. If you understand that tension between what (semantic understanding) and where (spatial precision), you understand segmentation.

References & Further Reading