
Vision Transformers from Scratch: Your Transformer Already Understands Images

The Punchline

Over the last twenty posts, we built every piece of the transformer from scratch — tokenization, embeddings, positional encoding, attention, feed-forward networks, normalization — and assembled them into a complete language model. That pipeline processes sequences of tokens. Text tokens. Word tokens.

Here’s the twist: images are sequences too.

In 2020, a team at Google Brain took a standard transformer encoder — the same architecture we built for language — and applied it directly to images. No convolutions. No pooling layers. No custom vision-specific modules. Just a transformer eating image patches instead of word tokens. They called it the Vision Transformer (ViT), and their paper title said it all: “An Image is Worth 16×16 Words.”

The result was startling. With enough training data, this plain transformer didn’t just compete with decades of convolutional neural network engineering — it surpassed it. And the reason ViT matters to us isn’t just performance. It’s that you’ve already built every component it needs. The only new idea is patch embedding: how to turn an image into a sequence of tokens. Everything after that — attention, FFN, normalization, residual connections — is identical to what you know.

Let’s prove it.

From Pixels to Patches — The Core Insight

Why can’t we just feed raw pixels into a transformer? Consider a modest 224×224 RGB image. That’s 224 × 224 = 50,176 pixels. Self-attention computes a score between every pair of tokens, which is O(n²). For 50,176 tokens, that’s about 2.5 billion attention scores per head per layer. Across 12 layers with 12 attention heads each, that’s roughly 360 billion attention scores. Not practical.

The solution is brilliantly simple: don’t use pixels as tokens. Use patches.

Take that 224×224 image and divide it into a grid of non-overlapping 16×16 squares. You get 14×14 = 196 patches. Each patch is a small crop of the image — a 16×16 block of pixels with 3 color channels, giving a vector of 16×16×3 = 768 values. Now self-attention operates on 196 tokens instead of 50,176. That’s a 65,000× reduction in the number of attention pairs.
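The arithmetic is worth checking:

```python
pixels = 224 * 224                       # tokens if every pixel were a token
patches = (224 // 16) ** 2               # tokens with 16x16 patches
reduction = pixels ** 2 // patches ** 2  # ratio of attention-pair counts

print(pixels, patches, reduction)  # 50176 196 65536
```

The reduction is exactly (50,176 / 196)² = 256² = 65,536 — the "65,000×" quoted above.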

Think of it as visual tokenization. In language, BPE tokenization chops text into subword tokens. Patchification chops images into sub-image tokens. Both convert raw input into a manageable sequence length for the transformer.

Here’s the code:

import numpy as np

def patchify(image, patch_size):
    """Split an image into non-overlapping patches.

    Args:
        image: array of shape (H, W, C) or (H, W) for grayscale
        patch_size: side length of each square patch

    Returns:
        patches: array of shape (num_patches, patch_size * patch_size * C)
    """
    if image.ndim == 2:
        image = image[:, :, np.newaxis]
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, \
        f"Image size {H}x{W} not divisible by patch size {patch_size}"

    nH = H // patch_size
    nW = W // patch_size

    # Reshape: (H, W, C) -> (nH, P, nW, P, C) -> (nH, nW, P, P, C) -> (N, P*P*C)
    patches = image.reshape(nH, patch_size, nW, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(nH * nW, patch_size * patch_size * C)
    return patches

# Example: a tiny 8x8 grayscale image -> four 4x4 patches
image = np.random.rand(8, 8)
patches = patchify(image, patch_size=4)
print(f"Image shape: {image.shape}")
print(f"Patches shape: {patches.shape}")
# Image shape: (8, 8)
# Patches shape: (4, 16)   -- 4 patches, each with 4*4*1 = 16 values

The reshape and transpose trick is the heart of patchification. We reshape the image into a grid of (nH, patch_size, nW, patch_size, C), then transpose so the two patch dimensions come together, and finally flatten each patch into a single vector. No loops needed — it’s a pure array reshape.
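A quick sanity check makes this concrete. The snippet below reproduces the reshape/transpose inline (so it stands alone) and confirms that each output row is exactly the corresponding 4×4 crop of the image:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
P = 4
nH, nW = 8 // P, 8 // P

# Same reshape/transpose as patchify, written inline
patches = (image.reshape(nH, P, nW, P, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(nH * nW, P * P * 3))

# Patch k in raster order should equal the crop at grid row k // nW, col k % nW
for k in range(nH * nW):
    r, c = divmod(k, nW)
    crop = image[r*P:(r+1)*P, c*P:(c+1)*P, :].reshape(-1)
    assert np.array_equal(crop, patches[k])
print("all patches match their crops")
```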

Patch Embedding — The Visual Tokenizer

Raw patch vectors aren’t ready for the transformer yet. Each patch is a 768-dimensional vector of pixel values (for 16×16 RGB patches), but the transformer expects embeddings in its own hidden dimension d_model. We need a linear projection to map patches into the transformer’s space.

This is exactly what token embedding does in language models. There, a lookup table maps each discrete token ID to a dense vector. Here, a matrix multiplication maps each continuous patch vector to a dense vector. Different input type, same purpose: convert raw input into the model’s representation space.

class PatchEmbedding:
    """Projects image patches into the transformer's hidden dimension."""

    def __init__(self, patch_size, in_channels, d_model):
        self.patch_size = patch_size
        patch_dim = patch_size * patch_size * in_channels
        # Xavier initialization
        scale = (patch_dim) ** -0.5
        self.W = np.random.randn(patch_dim, d_model) * scale
        self.b = np.zeros(d_model)

    def forward(self, image):
        patches = patchify(image, self.patch_size)  # (N, patch_dim)
        return patches @ self.W + self.b             # (N, d_model)

# Embed 8x8 grayscale image with 4x4 patches into d_model=64
embed = PatchEmbedding(patch_size=4, in_channels=1, d_model=64)
image = np.random.rand(8, 8)
patch_embeddings = embed.forward(image)
print(f"Patch embeddings shape: {patch_embeddings.shape}")
# Patch embeddings shape: (4, 64)   -- 4 patches, each mapped to 64-dim

One implementation detail worth noting: the patch embedding — patchify followed by a linear projection — is mathematically identical to a 2D convolution with kernel_size = patch_size and stride = patch_size. PyTorch implementations typically use nn.Conv2d for this reason. It’s one operation instead of two, but the result is the same: each non-overlapping image patch gets independently projected to a d_model-dimensional vector.
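We can verify that equivalence numerically without any deep-learning framework. The sketch below compares the patchify-then-project path against an explicit stride-P sliding window — a hand-rolled convolution with kernel_size = stride = P, where the kernel is a random stand-in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(1)
H = W_img = 8; P = 4; C = 1; d_model = 5
image = rng.random((H, W_img, C))
kernel = rng.random((P * P * C, d_model))  # flattened conv kernel == projection matrix

# Path 1: patchify then matmul (inline version of patchify above)
nH, nW = H // P, W_img // P
patches = (image.reshape(nH, P, nW, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(nH * nW, P * P * C))
out_patchify = patches @ kernel

# Path 2: explicit convolution with kernel_size = stride = P
out_conv = np.zeros((nH * nW, d_model))
for r in range(nH):
    for c in range(nW):
        window = image[r*P:(r+1)*P, c*P:(c+1)*P, :].reshape(-1)
        out_conv[r * nW + c] = window @ kernel

assert np.allclose(out_patchify, out_conv)
print("patchify + matmul == strided convolution")
```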

The [CLS] Token and Position Embeddings

We now have a sequence of patch embeddings. But a sequence of patches is a bag — the model doesn’t know whether a patch came from the top-left corner or the bottom-right. And we need a way to get a single image-level representation for classification. ViT solves both problems with two additions.

The [CLS] Token

Borrowed from BERT, the [CLS] token is a learnable vector prepended to the patch sequence. It doesn’t correspond to any image region. Instead, through self-attention, it learns to aggregate information from all patches. After the final transformer layer, the [CLS] token’s hidden state becomes the image-level representation that gets fed to the classification head.

Why not just average all patch embeddings? You could — and some variants do (the ViT paper calls this “global average pooling”). But the [CLS] token lets the model learn how to aggregate, potentially weighting important patches more heavily through attention.
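For concreteness, here is what the two pooling choices look like on a dummy output sequence (x stands in for the final-layer activations; shapes only, no trained weights):

```python
import numpy as np

# Dummy final-layer output: [CLS] + 4 patch tokens, d_model = 64
x = np.random.rand(5, 64)

cls_pooled = x[0]                # ViT default: take the [CLS] token's state
gap_pooled = x[1:].mean(axis=0)  # alternative: average the patch tokens

print(cls_pooled.shape, gap_pooled.shape)  # (64,) (64,)
```

Either way, the result is a single d_model-dimensional vector that the classification head consumes.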

Position Embeddings

Just like in language transformers, we add positional information so the model knows where each patch lives. But ViT makes a surprising choice: 1D learnable position embeddings. No sinusoidal encoding. No explicit 2D grid. Just a different learnable vector for each position in the sequence (position 0 for [CLS], positions 1 through N for patches in raster order).

The ViT paper tested 2D-aware positional embeddings and found they performed no better than 1D. The model discovers the 2D spatial structure on its own during training — we’ll see this in our interactive demo, where the learned position embeddings form a clear grid pattern.
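We can't show trained embeddings here, but the grid pattern is easy to mimic. If we construct idealized position embeddings as the sum of a per-row and a per-column vector — an assumption for illustration, not what training literally produces — cosine similarity recovers exactly the row/column structure the paper reports:

```python
import numpy as np

d, grid = 64, 4
rng = np.random.default_rng(2)

# Idealized embeddings: position (r, c) = row vector r + column vector c,
# so positions sharing a row or column end up more similar
row_vec = rng.standard_normal((grid, d))
col_vec = rng.standard_normal((grid, d))
pos = np.array([row_vec[r] + col_vec[c]
                for r in range(grid) for c in range(grid)])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of patch position (1, 1) to every position, shown on the 2D grid
query = 1 * grid + 1
sim = np.array([cosine_sim(pos[query], pos[k]) for k in range(grid * grid)])
print(sim.reshape(grid, grid).round(2))  # row 1 and column 1 stand out
```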

class ViTInput:
    """Prepends [CLS] token and adds positional embeddings."""

    def __init__(self, num_patches, d_model):
        self.cls_token = np.random.randn(1, d_model) * 0.02
        # One position embedding per slot: [CLS] + N patches
        self.pos_embed = np.random.randn(1 + num_patches, d_model) * 0.02

    def forward(self, patch_embeddings):
        # patch_embeddings: (N, d_model)
        # Prepend [CLS] token to the sequence
        x = np.vstack([self.cls_token, patch_embeddings])  # (N+1, d_model)
        # Add positional embeddings
        x = x + self.pos_embed                              # (N+1, d_model)
        return x

# 4 patches + 1 [CLS] token = 5 positions
vit_input = ViTInput(num_patches=4, d_model=64)
x = vit_input.forward(patch_embeddings)
print(f"Sequence shape: {x.shape}")
# Sequence shape: (5, 64)   -- [CLS] + 4 patches, each 64-dim

After this step, we have a sequence of N+1 vectors, each carrying both content (from the patch embedding) and position (from the positional embedding). This sequence is ready for the transformer.

The Transformer Encoder — Everything You Already Know

Here’s where the payoff arrives. The transformer encoder block in ViT is identical to the one we built in the transformer capstone:

  1. Normalization (Pre-Norm placement; the original ViT uses LayerNorm, our code below uses RMSNorm)
  2. Multi-Head Self-Attention
  3. Residual connection
  4. Normalization
  5. Feed-Forward Network (expand → GELU → contract)
  6. Residual connection
There is exactly one difference from our language transformer: no causal mask. In language modeling, token t can only attend to tokens 0…t (you can’t peek at future words). In vision, there is no “future” — all patches exist simultaneously. So ViT uses bidirectional attention where every patch can attend to every other patch from the very first layer. This is actually simpler than the masked attention we built for language.
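To make that deletion concrete, here is the mask a language model would apply and that ViT simply omits, on a toy 4-token example with uniform scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform attention scores for 4 tokens

# Language model: causal mask blocks attention to future positions
future = np.triu(np.ones((4, 4), dtype=bool), k=1)
causal = softmax(np.where(future, -1e9, scores))

# ViT: no mask -- every patch attends to every patch, even at layer 1
bidirectional = softmax(scores)

print(causal.round(2))         # row t spreads weight over positions 0..t only
print(bidirectional.round(2))  # every row: [0.25 0.25 0.25 0.25]
```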

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: normalize by root-mean-square, then scale."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

def gelu(x):
    """GELU activation (Gaussian Error Linear Unit)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        scale = d_model ** -0.5
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def forward(self, x):
        seq_len, d_model = x.shape
        Q = (x @ self.W_q).reshape(seq_len, self.num_heads, self.head_dim)
        K = (x @ self.W_k).reshape(seq_len, self.num_heads, self.head_dim)
        V = (x @ self.W_v).reshape(seq_len, self.num_heads, self.head_dim)
        # (seq_len, num_heads, head_dim) -> per-head attention
        attn_scores = np.zeros((self.num_heads, seq_len, seq_len))
        for h in range(self.num_heads):
            attn_scores[h] = Q[:, h, :] @ K[:, h, :].T / np.sqrt(self.head_dim)
        # No causal mask! Bidirectional attention.
        attn_weights = np.array([softmax(attn_scores[h]) for h in range(self.num_heads)])
        # Compute output per head and concatenate
        head_outputs = []
        for h in range(self.num_heads):
            head_outputs.append(attn_weights[h] @ V[:, h, :])
        concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
        return concat @ self.W_o

class FFN:
    def __init__(self, d_model, d_ff):
        scale = d_model ** -0.5
        self.W1 = np.random.randn(d_model, d_ff) * scale
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * (d_ff ** -0.5)
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.gamma1 = np.ones(d_model)
        self.gamma2 = np.ones(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FFN(d_model, d_ff)

    def forward(self, x):
        # Pre-Norm: Normalize -> Attend -> Residual
        x = x + self.attn.forward(rms_norm(x, self.gamma1))
        # Pre-Norm: Normalize -> FFN -> Residual
        x = x + self.ffn.forward(rms_norm(x, self.gamma2))
        return x

Compare this to the block we built in the transformer capstone. It’s the same code. The MultiHeadAttention computes Q, K, V projections and scaled dot-product attention. The FFN does the expand-GELU-contract pattern. TransformerBlock chains them with Pre-Norm and residual connections. The only thing missing is the causal mask — and that’s a deletion, not an addition.

The Complete Vision Transformer

Now we assemble everything into a single class. The forward pass is a clean pipeline: image → patchify → embed → prepend [CLS] → add positions → L transformer blocks → extract [CLS] → normalize → classify.

class VisionTransformer:
    """Complete ViT: image in, class logits out."""

    def __init__(self, image_size, patch_size, in_channels,
                 d_model, num_heads, num_layers, d_ff, num_classes):
        num_patches = (image_size // patch_size) ** 2

        self.patch_embed = PatchEmbedding(patch_size, in_channels, d_model)
        self.input_stage = ViTInput(num_patches, d_model)
        self.blocks = [TransformerBlock(d_model, num_heads, d_ff)
                       for _ in range(num_layers)]
        self.final_gamma = np.ones(d_model)
        # Classification head: d_model -> num_classes
        self.W_head = np.random.randn(d_model, num_classes) * (d_model ** -0.5)
        self.b_head = np.zeros(num_classes)

    def forward(self, image):
        # 1. Patch embedding
        x = self.patch_embed.forward(image)          # (N, d_model)
        # 2. Add [CLS] token and positions
        x = self.input_stage.forward(x)              # (N+1, d_model)
        # 3. Transformer encoder blocks
        for block in self.blocks:
            x = block.forward(x)                     # (N+1, d_model)
        # 4. Extract [CLS] token (first position)
        cls_output = x[0:1, :]                       # (1, d_model)
        # 5. Final normalization
        cls_output = rms_norm(cls_output, self.final_gamma)
        # 6. Classification head
        logits = cls_output @ self.W_head + self.b_head  # (1, num_classes)
        return logits[0]  # (num_classes,)

# A tiny ViT for 8x8 grayscale images with 4x4 patches
model = VisionTransformer(
    image_size=8, patch_size=4, in_channels=1,
    d_model=64, num_heads=4, num_layers=2, d_ff=256,
    num_classes=4
)

image = np.random.rand(8, 8)
logits = model.forward(image)
probs = softmax(logits)
print(f"Logits: {logits.round(3)}")
print(f"Probs:  {probs.round(3)}")
# Logits: [-0.012  0.034  0.008 -0.021]   (random, untrained)
# Probs:  [0.248  0.259  0.252  0.241]    (nearly uniform -- no training yet)

Let’s count the parameters for this tiny model (image_size=8, patch_size=4, d_model=64, 4 heads, 2 layers, d_ff=256, 4 classes):
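A small helper, mirroring the fields of the VisionTransformer class above, makes the count explicit (note our implementation has no attention biases and uses single-parameter RMSNorm, so counts for the real models differ slightly):

```python
def count_vit_params(image_size, patch_size, in_channels, d_model,
                     num_heads, num_layers, d_ff, num_classes):
    """Count learnable parameters in the VisionTransformer above."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * in_channels

    patch_embed = patch_dim * d_model + d_model          # W + b
    cls_and_pos = d_model + (1 + num_patches) * d_model  # [CLS] + positions
    attn = 4 * d_model * d_model                         # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff + d_ff + d_model            # W1, b1, W2, b2
    block = attn + ffn + 2 * d_model                     # + gamma1, gamma2
    head = d_model * num_classes + num_classes           # W_head + b_head
    return (patch_embed + cls_and_pos + num_layers * block
            + d_model + head)                            # + final_gamma

print(count_vit_params(8, 4, 1, 64, 4, 2, 256, 4))  # 100996 -- about 101K
```

Plugging in ViT-Base/16's hyperparameters (224, 16, 3, 768, 12, 12, 3072, 1000) gives roughly 86.5 million, in line with the 86M figure below.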

For reference, ViT-Base/16 (the standard model used in practice) has 86 million parameters with d_model=768, 12 heads, 12 layers, and 16×16 patches on 224×224 images. Same architecture, larger numbers.

Training on Tiny Images

Let’s verify our ViT actually behaves sensibly. We’ll create a tiny synthetic dataset of 8×8 grayscale images with four classes of simple patterns, then run the forward pass and compute the cross-entropy loss.

For a real training loop with backpropagation, you’d need gradients for every operation — which we built in micrograd and refined in the optimizers and loss functions posts. Here we show the forward pass and loss computation to verify the architecture works. A full training implementation would wrap these components with autograd.

def make_shape_dataset(n_per_class=50, size=8):
    """Generate tiny images of 4 shapes: horizontal bars, vertical bars,
       diagonal, and checker pattern."""
    images, labels = [], []
    for i in range(n_per_class):
        # Class 0: horizontal bars
        img = np.zeros((size, size))
        img[::2, :] = 1.0
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(0)

        # Class 1: vertical bars
        img = np.zeros((size, size))
        img[:, ::2] = 1.0
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(1)

        # Class 2: diagonal gradient
        img = np.fromfunction(lambda r, c: (r + c) / (2 * size), (size, size))
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(2)

        # Class 3: checkerboard
        img = np.fromfunction(lambda r, c: ((r + c) % 2).astype(float), (size, size))
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(3)

    return images, labels

def cross_entropy_loss(logits, label):
    """Cross-entropy loss for a single sample."""
    probs = softmax(logits)
    loss = -np.log(probs[label] + 1e-10)
    return loss, probs

# Create dataset
X, y = make_shape_dataset(n_per_class=50, size=8)
print(f"Dataset: {len(X)} images, {len(set(y))} classes")
# Dataset: 200 images, 4 classes

# Forward pass on one image (before training)
logits = model.forward(X[0])
loss, probs = cross_entropy_loss(logits, y[0])
print(f"Untrained loss: {loss:.3f}")
print(f"Untrained probs: {probs.round(3)}")
# Untrained loss: 1.387   (close to -ln(1/4) = 1.386 -- random guessing)
# Untrained probs: [0.247 0.254 0.251 0.248]

The untrained model outputs near-uniform probabilities across all four classes — exactly what you’d expect. The loss is approximately ln(4) ≈ 1.386, the loss of a uniform prediction over 4 classes. Training would minimize this loss using cross-entropy and Adam, just like we’ve done for every model in this series.

The key insight at this tiny scale: ViT works, but it doesn’t have the same head start as a CNN. Convolutional networks come pre-loaded with inductive biases about images — locality (nearby pixels are related) and translation equivariance (the same pattern matters regardless of where it appears). ViT must learn these properties from data. At 200 training images, that’s a handicap. At 200 million, it’s a strength.

What the Model Sees — Attention Visualization

One of the most compelling features of ViT is that its attention maps have a direct spatial interpretation. In language, attention weights tell you which words attend to which other words. In vision, those same weights tell you which image regions attend to which other regions — and you can literally paint them as a heatmap on top of the image.

Research on trained ViT models has revealed a consistent pattern: in early layers, many heads attend mostly to nearby patches (small mean attention distance), behaving a bit like convolutional filters, while in deep layers nearly all heads attend globally across the entire image.

This progressive shift from local to global attention is fascinating because it happens without any architectural constraint. Unlike CNNs, where the receptive field grows layer by layer because each filter is small, ViT’s attention is global from layer 1. Every patch can attend to every other patch from the very beginning. The model chooses to attend locally in early layers because that’s what’s most useful for building basic features — and it shifts to global attention only when it needs to reason about the whole image.

This is the same scaled dot-product attention mechanism we built from scratch, but now the weights have spatial meaning you can see.
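The local-to-global shift is usually quantified with mean attention distance: how far, on average, a query patch's attention mass lands. A minimal sketch of the metric, tested on two synthetic extremes (trained weights and per-head averaging are left out here):

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """Average spatial distance each query attends over, weighted by attention.

    attn: (N, N) attention weights over N = grid*grid patches (rows sum to 1).
    """
    coords = np.array([(i // grid, i % grid)
                       for i in range(grid * grid)], dtype=float)
    # Pairwise Euclidean distance between patch positions, in patch units
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (attn * dist).sum(axis=-1).mean()

grid = 4
n = grid * grid
local = np.eye(n)                      # each patch attends only to itself
global_uniform = np.ones((n, n)) / n   # uniform attention over all patches

print(mean_attention_distance(local, grid))           # 0.0
print(mean_attention_distance(global_uniform, grid))  # larger: spread over grid
```

In trained ViTs, early-layer heads sit near the local end of this scale and deep-layer heads near the global end.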

ViT vs CNNs — The Data Scale Trade-off

CNNs dominated computer vision for a decade (2012–2020) because they encode strong assumptions about images directly into the architecture: locality (each convolutional filter sees only a small neighborhood, so nearby pixels are treated as related), translation equivariance (the same filter slides across the whole image, so a pattern is detected wherever it appears), and a hierarchy of features built by stacking small filters.

These biases are correct for images, which gives CNNs a massive head start when data is scarce. ViT, with no built-in image priors, must learn spatial relationships from scratch. The data scale determines the winner:

Training Data         Winner             Why
Small (<1M images)    CNN                Inductive biases compensate for limited data
Medium (1–14M)        CNN (slight edge)  Priors still help; ViT closing the gap
Large (>14M images)   ViT                Flexibility beats assumptions; better scaling

A major breakthrough came with DeiT (Data-efficient Image Transformers, Touvron et al. 2021). Instead of needing 300 million images to beat CNNs, DeiT trains a ViT on just ImageNet (1.3 million images) by distilling knowledge from a CNN teacher. The CNN teacher provides the inductive bias that ViT lacks, packaged as soft labels. If you read our knowledge distillation post, you know exactly how this works — the teacher’s soft probability distribution over classes contains “dark knowledge” about inter-class relationships that hard labels destroy.

DeiT achieved 85.2% top-1 accuracy on ImageNet with no external data, trained in under 3 days on a single machine. That result ended the “ViT needs massive data” narrative and opened the door to practical vision transformers.
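A generic soft-target distillation loss looks like the sketch below. This is not DeiT's exact recipe — DeiT adds a dedicated distillation token and a hard-label distillation variant — and the temperature T and mixing weight alpha are illustrative assumptions:

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax for a single logit vector."""
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the teacher's soft targets."""
    hard = -np.log(softmax_T(student_logits)[label] + 1e-10)
    p_t = softmax_T(teacher_logits, T)   # teacher's soft labels: "dark knowledge"
    p_s = softmax_T(student_logits, T)
    soft = np.sum(p_t * (np.log(p_t + 1e-10) - np.log(p_s + 1e-10))) * T * T
    return alpha * hard + (1 - alpha) * soft

student = np.array([2.0, 0.5, 0.1, -1.0])
teacher = np.array([1.8, 0.9, 0.0, -1.2])
print(round(distillation_loss(student, teacher, label=0), 3))
```

When student and teacher agree, the soft term vanishes and only the hard-label loss remains; disagreement with the teacher is penalized even when the hard label is predicted correctly.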

The ViT Family and Beyond

The original ViT paper defined three model sizes that follow the now-familiar transformer scaling convention:

Model         Layers  d_model  Heads  Params  Patch
ViT-Base/16   12      768      12     86M     16×16
ViT-Large/16  24      1024     16     304M    16×16
ViT-Huge/14   32      1280     16     632M    14×14

The “/16” and “/14” suffixes indicate patch size. Smaller patches mean more tokens: a 224×224 image produces 196 tokens with /16 patches but 256 tokens with /14 patches. Finer patches generally improve accuracy, but because attention is O(n²) in the number of tokens, they also raise the computational cost.

The real impact of ViT, though, isn’t classification benchmarks. It’s that the same architecture became the backbone for an entire ecosystem of vision models: CLIP uses a ViT image encoder for vision-language contrastive learning, MAE (Masked Autoencoders) pretrains ViTs by reconstructing masked-out patches, DINO trains them self-supervised, and Segment Anything (SAM) builds segmentation on a ViT backbone.

The pattern is unmistakable: the transformer we built for language processes text tokens, and with one architectural change — swap the tokenizer for a patchifier — it processes image patches just as well.

Try It: Vision Transformer Explorer

Panel 1: Patchify Explorer

Select an image pattern and adjust the patch size to see how images become token sequences.

Panel 2: Attention Heatmap

Click a patch to see its attention pattern. Early layers attend locally; deep layers attend globally.

Panel 3: Position Embedding Similarity

Click a patch position to see its cosine similarity to all others. Toggle to see how training creates 2D spatial structure from 1D positions.


Connections to the Series

ViT is the junction point where the entire transformer pipeline we’ve built meets a new modality. Here’s how every piece connects:

image → patchify → embed → position → normalize → attend → FFN → softmax → classify
The same pipeline. One new step. Everything else you already built.

What Comes Next

ViT opened the door. The transformer we built for language now processes images too. But the real revolution isn’t just classification — it’s what happens when you connect vision and language in the same model.

CLIP trains a ViT image encoder and a text transformer side-by-side with contrastive learning, creating a shared embedding space where images and text live together. It’s the bridge between our embeddings post (where we learned word vectors) and this one (where we learned image vectors). And it’s the conditioning mechanism that tells Stable Diffusion what to generate from a text prompt.

The bigger picture: the same attention mechanism works for text, images, audio, video, protein sequences, and genomic data. The transformer isn’t a language model or an image model. It’s a sequence model — and anything can be turned into a sequence. We just proved it for images with one new idea and zero new architecture.

That’s the power of understanding fundamentals. You didn’t just learn ViT today. You confirmed that the twenty components you already built are the universal building blocks of modern AI.

References & Further Reading

Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (2020), the original ViT paper.
Touvron et al., “Training Data-efficient Image Transformers & Distillation through Attention” (2021), the DeiT paper.