
Vision Transformers from Scratch: Your Transformer Already Understands Images

The Punchline

Over the last twenty posts, we built every piece of the transformer from scratch — tokenization, embeddings, positional encoding, attention, feed-forward networks, normalization — and assembled them into a complete language model. That pipeline processes sequences of tokens. Text tokens. Word tokens.

Here’s the twist: images are sequences too.

In 2020, a team at Google Brain took a standard transformer encoder — the same architecture we built for language — and applied it directly to images. No convolutions. No pooling layers. No custom vision-specific modules. Just a transformer eating image patches instead of word tokens. They called it the Vision Transformer (ViT), and their paper title said it all: “An Image is Worth 16×16 Words.”

The result was startling. With enough training data, this plain transformer didn’t just compete with decades of convolutional neural network engineering — it surpassed it. And the reason ViT matters to us isn’t just performance. It’s that you’ve already built every component it needs. The only new idea is patch embedding: how to turn an image into a sequence of tokens. Everything after that — attention, FFN, normalization, residual connections — is identical to what you know.

Let’s prove it.

From Pixels to Patches — The Core Insight

Why can’t we just feed raw pixels into a transformer? Consider a modest 224×224 RGB image. That’s 224 × 224 = 50,176 pixels. Self-attention computes a score between every pair of tokens, which is O(n²). For 50,176 tokens, that’s about 2.5 billion attention scores per head per layer. Across 12 layers with 12 attention heads each, that’s roughly 360 billion attention scores. Not practical.

The solution is brilliantly simple: don’t use pixels as tokens. Use patches.

Take that 224×224 image and divide it into a grid of non-overlapping 16×16 squares. You get 14×14 = 196 patches. Each patch is a small crop of the image — a 16×16 block of pixels with 3 color channels, giving a vector of 16×16×3 = 768 values. Now self-attention operates on 196 tokens instead of 50,176. That’s a 65,000× reduction in the number of attention pairs.
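The arithmetic is worth checking:

```python
pixels = 224 * 224                       # tokens if every pixel were a token
patches = (224 // 16) ** 2               # tokens with 16x16 patches
reduction = pixels ** 2 // patches ** 2  # ratio of attention-pair counts

print(pixels, patches, reduction)  # 50176 196 65536
```

The reduction is exactly (50,176 / 196)² = 256² = 65,536 — the "65,000×" quoted above.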

Think of it as visual tokenization. In language, BPE tokenization chops text into subword tokens. Patchification chops images into sub-image tokens. Both convert raw input into a manageable sequence length for the transformer.

Here’s the code:

import numpy as np

def patchify(image, patch_size):
    """Split an image into non-overlapping patches.

    Args:
        image: array of shape (H, W, C) or (H, W) for grayscale
        patch_size: side length of each square patch

    Returns:
        patches: array of shape (num_patches, patch_size * patch_size * C)
    """
    if image.ndim == 2:
        image = image[:, :, np.newaxis]
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, \
        f"Image size {H}x{W} not divisible by patch size {patch_size}"

    nH = H // patch_size
    nW = W // patch_size

    # Reshape: (H, W, C) -> (nH, P, nW, P, C) -> (nH, nW, P, P, C) -> (N, P*P*C)
    patches = image.reshape(nH, patch_size, nW, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(nH * nW, patch_size * patch_size * C)
    return patches

# Example: a tiny 8x8 grayscale image -> four 4x4 patches
image = np.random.rand(8, 8)
patches = patchify(image, patch_size=4)
print(f"Image shape: {image.shape}")
print(f"Patches shape: {patches.shape}")
# Image shape: (8, 8)
# Patches shape: (4, 16)   -- 4 patches, each with 4*4*1 = 16 values

The reshape and transpose trick is the heart of patchification. We reshape the image into a grid of (nH, patch_size, nW, patch_size, C), then transpose so the two patch dimensions come together, and finally flatten each patch into a single vector. No loops needed — it’s a pure array reshape.
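A quick sanity check makes this concrete. The snippet below reproduces the reshape/transpose inline (so it stands alone) and confirms that each output row is exactly the corresponding 4×4 crop of the image:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
P = 4
nH, nW = 8 // P, 8 // P

# Same reshape/transpose as patchify, written inline
patches = (image.reshape(nH, P, nW, P, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(nH * nW, P * P * 3))

# Patch k in raster order should equal the crop at grid row k // nW, col k % nW
for k in range(nH * nW):
    r, c = divmod(k, nW)
    crop = image[r*P:(r+1)*P, c*P:(c+1)*P, :].reshape(-1)
    assert np.array_equal(crop, patches[k])
print("all patches match their crops")
```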

Patch Embedding — The Visual Tokenizer

Raw patch vectors aren’t ready for the transformer yet. Each patch is a 768-dimensional vector of pixel values (for 16×16 RGB patches), but the transformer expects embeddings in its own hidden dimension d_model. We need a linear projection to map patches into the transformer’s space.

This is exactly what token embedding does in language models. There, a lookup table maps each discrete token ID to a dense vector. Here, a matrix multiplication maps each continuous patch vector to a dense vector. Different input type, same purpose: convert raw input into the model’s representation space.

class PatchEmbedding:
    """Projects image patches into the transformer's hidden dimension."""

    def __init__(self, patch_size, in_channels, d_model):
        self.patch_size = patch_size
        patch_dim = patch_size * patch_size * in_channels
        # Xavier initialization
        scale = (patch_dim) ** -0.5
        self.W = np.random.randn(patch_dim, d_model) * scale
        self.b = np.zeros(d_model)

    def forward(self, image):
        patches = patchify(image, self.patch_size)  # (N, patch_dim)
        return patches @ self.W + self.b             # (N, d_model)

# Embed 8x8 grayscale image with 4x4 patches into d_model=64
embed = PatchEmbedding(patch_size=4, in_channels=1, d_model=64)
image = np.random.rand(8, 8)
patch_embeddings = embed.forward(image)
print(f"Patch embeddings shape: {patch_embeddings.shape}")
# Patch embeddings shape: (4, 64)   -- 4 patches, each mapped to 64-dim

One implementation detail worth noting: the patch embedding — patchify followed by a linear projection — is mathematically identical to a 2D convolution with kernel_size = patch_size and stride = patch_size. PyTorch implementations typically use nn.Conv2d for this reason. It’s one operation instead of two, but the result is the same: each non-overlapping image patch gets independently projected to a d_model-dimensional vector.
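We can verify that equivalence numerically without any deep-learning framework. The sketch below compares the patchify-then-project path against an explicit stride-P sliding window — a hand-rolled convolution with kernel_size = stride = P, where the kernel is a random stand-in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(1)
H = W_img = 8; P = 4; C = 1; d_model = 5
image = rng.random((H, W_img, C))
kernel = rng.random((P * P * C, d_model))  # flattened conv kernel == projection matrix

# Path 1: patchify then matmul (inline version of patchify above)
nH, nW = H // P, W_img // P
patches = (image.reshape(nH, P, nW, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(nH * nW, P * P * C))
out_patchify = patches @ kernel

# Path 2: explicit convolution with kernel_size = stride = P
out_conv = np.zeros((nH * nW, d_model))
for r in range(nH):
    for c in range(nW):
        window = image[r*P:(r+1)*P, c*P:(c+1)*P, :].reshape(-1)
        out_conv[r * nW + c] = window @ kernel

assert np.allclose(out_patchify, out_conv)
print("patchify + matmul == strided convolution")
```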

The [CLS] Token and Position Embeddings

We now have a sequence of patch embeddings. But a sequence of patches is a bag — the model doesn’t know whether a patch came from the top-left corner or the bottom-right. And we need a way to get a single image-level representation for classification. ViT solves both problems with two additions.

The [CLS] Token

Borrowed from BERT, the [CLS] token is a learnable vector prepended to the patch sequence. It doesn’t correspond to any image region. Instead, through self-attention, it learns to aggregate information from all patches. After the final transformer layer, the [CLS] token’s hidden state becomes the image-level representation that gets fed to the classification head.

Why not just average all patch embeddings? You could — and some variants do (the ViT paper calls this “global average pooling”). But the [CLS] token lets the model learn how to aggregate, potentially weighting important patches more heavily through attention.
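For concreteness, here is what the two pooling choices look like on a dummy output sequence (x stands in for the final-layer activations; shapes only, no trained weights):

```python
import numpy as np

# Dummy final-layer output: [CLS] + 4 patch tokens, d_model = 64
x = np.random.rand(5, 64)

cls_pooled = x[0]                # ViT default: take the [CLS] token's state
gap_pooled = x[1:].mean(axis=0)  # alternative: average the patch tokens

print(cls_pooled.shape, gap_pooled.shape)  # (64,) (64,)
```

Either way, the result is a single d_model-dimensional vector that the classification head consumes.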

Position Embeddings

Just like in language transformers, we add positional information so the model knows where each patch lives. But ViT makes a surprising choice: 1D learnable position embeddings. No sinusoidal encoding. No explicit 2D grid. Just a different learnable vector for each position in the sequence (position 0 for [CLS], positions 1 through N for patches in raster order).

The ViT paper tested 2D-aware positional embeddings and found they performed no better than 1D. The model discovers the 2D spatial structure on its own during training — we’ll see this in our interactive demo, where the learned position embeddings form a clear grid pattern.
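We can't show trained embeddings here, but the grid pattern is easy to mimic. If we construct idealized position embeddings as the sum of a per-row and a per-column vector — an assumption for illustration, not what training literally produces — cosine similarity recovers exactly the row/column structure the paper reports:

```python
import numpy as np

d, grid = 64, 4
rng = np.random.default_rng(2)

# Idealized embeddings: position (r, c) = row vector r + column vector c,
# so positions sharing a row or column end up more similar
row_vec = rng.standard_normal((grid, d))
col_vec = rng.standard_normal((grid, d))
pos = np.array([row_vec[r] + col_vec[c]
                for r in range(grid) for c in range(grid)])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of patch position (1, 1) to every position, shown on the 2D grid
query = 1 * grid + 1
sim = np.array([cosine_sim(pos[query], pos[k]) for k in range(grid * grid)])
print(sim.reshape(grid, grid).round(2))  # row 1 and column 1 stand out
```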

class ViTInput:
    """Prepends [CLS] token and adds positional embeddings."""

    def __init__(self, num_patches, d_model):
        self.cls_token = np.random.randn(1, d_model) * 0.02
        # One position embedding per slot: [CLS] + N patches
        self.pos_embed = np.random.randn(1 + num_patches, d_model) * 0.02

    def forward(self, patch_embeddings):
        # patch_embeddings: (N, d_model)
        # Prepend [CLS] token to the sequence
        x = np.vstack([self.cls_token, patch_embeddings])  # (N+1, d_model)
        # Add positional embeddings
        x = x + self.pos_embed                              # (N+1, d_model)
        return x

# 4 patches + 1 [CLS] token = 5 positions
vit_input = ViTInput(num_patches=4, d_model=64)
x = vit_input.forward(patch_embeddings)
print(f"Sequence shape: {x.shape}")
# Sequence shape: (5, 64)   -- [CLS] + 4 patches, each 64-dim

After this step, we have a sequence of N+1 vectors, each carrying both content (from the patch embedding) and position (from the positional embedding). This sequence is ready for the transformer.

The Transformer Encoder — Everything You Already Know

Here’s where the payoff arrives. The transformer encoder block in ViT is identical to the one we built in the transformer capstone:

  1. Normalization (Pre-Norm placement; the original ViT uses LayerNorm, our code below uses RMSNorm)
  2. Multi-Head Self-Attention
  3. Residual connection
  4. Normalization
  5. Feed-Forward Network (expand → GELU → contract)
  6. Residual connection
There is exactly one difference from our language transformer: no causal mask. In language modeling, token t can only attend to tokens 0…t (you can’t peek at future words). In vision, there is no “future” — all patches exist simultaneously. So ViT uses bidirectional attention where every patch can attend to every other patch from the very first layer. This is actually simpler than the masked attention we built for language.
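To make that deletion concrete, here is the mask a language model would apply and that ViT simply omits, on a toy 4-token example with uniform scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform attention scores for 4 tokens

# Language model: causal mask blocks attention to future positions
future = np.triu(np.ones((4, 4), dtype=bool), k=1)
causal = softmax(np.where(future, -1e9, scores))

# ViT: no mask -- every patch attends to every patch, even at layer 1
bidirectional = softmax(scores)

print(causal.round(2))         # row t spreads weight over positions 0..t only
print(bidirectional.round(2))  # every row: [0.25 0.25 0.25 0.25]
```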

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: normalize by root-mean-square, then scale."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

def gelu(x):
    """GELU activation (Gaussian Error Linear Unit)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        scale = d_model ** -0.5
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def forward(self, x):
        seq_len, d_model = x.shape
        Q = (x @ self.W_q).reshape(seq_len, self.num_heads, self.head_dim)
        K = (x @ self.W_k).reshape(seq_len, self.num_heads, self.head_dim)
        V = (x @ self.W_v).reshape(seq_len, self.num_heads, self.head_dim)
        # (seq_len, num_heads, head_dim) -> per-head attention
        attn_scores = np.zeros((self.num_heads, seq_len, seq_len))
        for h in range(self.num_heads):
            attn_scores[h] = Q[:, h, :] @ K[:, h, :].T / np.sqrt(self.head_dim)
        # No causal mask! Bidirectional attention.
        attn_weights = np.array([softmax(attn_scores[h]) for h in range(self.num_heads)])
        # Compute output per head and concatenate
        head_outputs = []
        for h in range(self.num_heads):
            head_outputs.append(attn_weights[h] @ V[:, h, :])
        concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
        return concat @ self.W_o

class FFN:
    def __init__(self, d_model, d_ff):
        scale = d_model ** -0.5
        self.W1 = np.random.randn(d_model, d_ff) * scale
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * (d_ff ** -0.5)
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.gamma1 = np.ones(d_model)
        self.gamma2 = np.ones(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FFN(d_model, d_ff)

    def forward(self, x):
        # Pre-Norm: Normalize -> Attend -> Residual
        x = x + self.attn.forward(rms_norm(x, self.gamma1))
        # Pre-Norm: Normalize -> FFN -> Residual
        x = x + self.ffn.forward(rms_norm(x, self.gamma2))
        return x

Compare this to the block we built in the transformer capstone. It’s the same code. The MultiHeadAttention computes Q, K, V projections and scaled dot-product attention. The FFN does the expand-GELU-contract pattern. TransformerBlock chains them with Pre-Norm and residual connections. The only thing missing is the causal mask — and that’s a deletion, not an addition.

The Complete Vision Transformer

Now we assemble everything into a single class. The forward pass is a clean pipeline: image → patchify → embed → prepend [CLS] → add positions → L transformer blocks → extract [CLS] → normalize → classify.

class VisionTransformer:
    """Complete ViT: image in, class logits out."""

    def __init__(self, image_size, patch_size, in_channels,
                 d_model, num_heads, num_layers, d_ff, num_classes):
        num_patches = (image_size // patch_size) ** 2

        self.patch_embed = PatchEmbedding(patch_size, in_channels, d_model)
        self.input_stage = ViTInput(num_patches, d_model)
        self.blocks = [TransformerBlock(d_model, num_heads, d_ff)
                       for _ in range(num_layers)]
        self.final_gamma = np.ones(d_model)
        # Classification head: d_model -> num_classes
        self.W_head = np.random.randn(d_model, num_classes) * (d_model ** -0.5)
        self.b_head = np.zeros(num_classes)

    def forward(self, image):
        # 1. Patch embedding
        x = self.patch_embed.forward(image)          # (N, d_model)
        # 2. Add [CLS] token and positions
        x = self.input_stage.forward(x)              # (N+1, d_model)
        # 3. Transformer encoder blocks
        for block in self.blocks:
            x = block.forward(x)                     # (N+1, d_model)
        # 4. Extract [CLS] token (first position)
        cls_output = x[0:1, :]                       # (1, d_model)
        # 5. Final normalization
        cls_output = rms_norm(cls_output, self.final_gamma)
        # 6. Classification head
        logits = cls_output @ self.W_head + self.b_head  # (1, num_classes)
        return logits[0]  # (num_classes,)

# A tiny ViT for 8x8 grayscale images with 4x4 patches
model = VisionTransformer(
    image_size=8, patch_size=4, in_channels=1,
    d_model=64, num_heads=4, num_layers=2, d_ff=256,
    num_classes=4
)

image = np.random.rand(8, 8)
logits = model.forward(image)
probs = softmax(logits)
print(f"Logits: {logits.round(3)}")
print(f"Probs:  {probs.round(3)}")
# Logits: [-0.012  0.034  0.008 -0.021]   (random, untrained)
# Probs:  [0.248  0.259  0.252  0.241]    (nearly uniform -- no training yet)

Let’s count the parameters for this tiny model (image_size=8, patch_size=4, d_model=64, 4 heads, 2 layers, d_ff=256, 4 classes):
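A small helper, mirroring the fields of the VisionTransformer class above, makes the count explicit (note our implementation has no attention biases and uses single-parameter RMSNorm, so counts for the real models differ slightly):

```python
def count_vit_params(image_size, patch_size, in_channels, d_model,
                     num_heads, num_layers, d_ff, num_classes):
    """Count learnable parameters in the VisionTransformer above."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * in_channels

    patch_embed = patch_dim * d_model + d_model          # W + b
    cls_and_pos = d_model + (1 + num_patches) * d_model  # [CLS] + positions
    attn = 4 * d_model * d_model                         # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff + d_ff + d_model            # W1, b1, W2, b2
    block = attn + ffn + 2 * d_model                     # + gamma1, gamma2
    head = d_model * num_classes + num_classes           # W_head + b_head
    return (patch_embed + cls_and_pos + num_layers * block
            + d_model + head)                            # + final_gamma

print(count_vit_params(8, 4, 1, 64, 4, 2, 256, 4))  # 100996 -- about 101K
```

Plugging in ViT-Base/16's hyperparameters (224, 16, 3, 768, 12, 12, 3072, 1000) gives roughly 86.5 million, in line with the 86M figure below.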

For reference, ViT-Base/16 (the standard model used in practice) has 86 million parameters with d_model=768, 12 heads, 12 layers, and 16×16 patches on 224×224 images. Same architecture, larger numbers.

Training on Tiny Images

Let’s verify our ViT actually behaves sensibly. We’ll create a tiny synthetic dataset of 8×8 grayscale images with four classes of simple patterns, then run the forward pass and compute the cross-entropy loss.

For a real training loop with backpropagation, you’d need gradients for every operation — which we built in micrograd and refined in the optimizers and loss functions posts. Here we show the forward pass and loss computation to verify the architecture works. A full training implementation would wrap these components with autograd.

def make_shape_dataset(n_per_class=50, size=8):
    """Generate tiny images of 4 shapes: horizontal bars, vertical bars,
       diagonal, and checker pattern."""
    images, labels = [], []
    for i in range(n_per_class):
        # Class 0: horizontal bars
        img = np.zeros((size, size))
        img[::2, :] = 1.0
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(0)

        # Class 1: vertical bars
        img = np.zeros((size, size))
        img[:, ::2] = 1.0
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(1)

        # Class 2: diagonal gradient
        img = np.fromfunction(lambda r, c: (r + c) / (2 * size), (size, size))
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(2)

        # Class 3: checkerboard
        img = np.fromfunction(lambda r, c: ((r + c) % 2).astype(float), (size, size))
        img += np.random.randn(size, size) * 0.1
        images.append(img); labels.append(3)

    return images, labels

def cross_entropy_loss(logits, label):
    """Cross-entropy loss for a single sample."""
    probs = softmax(logits)
    loss = -np.log(probs[label] + 1e-10)
    return loss, probs

# Create dataset
X, y = make_shape_dataset(n_per_class=50, size=8)
print(f"Dataset: {len(X)} images, {len(set(y))} classes")
# Dataset: 200 images, 4 classes

# Forward pass on one image (before training)
logits = model.forward(X[0])
loss, probs = cross_entropy_loss(logits, y[0])
print(f"Untrained loss: {loss:.3f}")
print(f"Untrained probs: {probs.round(3)}")
# Untrained loss: 1.387   (close to -ln(1/4) = 1.386 -- random guessing)
# Untrained probs: [0.247 0.254 0.251 0.248]

The untrained model outputs near-uniform probabilities across all four classes — exactly what you’d expect. The loss is approximately ln(4) ≈ 1.386, the loss of a uniform prediction over 4 classes. Training would minimize this loss using cross-entropy and Adam, just like we’ve done for every model in this series.

The key insight at this tiny scale: ViT works, but it doesn’t have the same head start as a CNN. Convolutional networks come pre-loaded with inductive biases about images — locality (nearby pixels are related) and translation equivariance (the same pattern matters regardless of where it appears). ViT must learn these properties from data. At 200 training images, that’s a handicap. At 200 million, it’s a strength.

What the Model Sees — Attention Visualization

One of the most compelling features of ViT is that its attention maps have a direct spatial interpretation. In language, attention weights tell you which words attend to which other words. In vision, those same weights tell you which image regions attend to which other regions — and you can literally paint them as a heatmap on top of the image.

Research on trained ViT models has revealed a consistent pattern: in early layers, many heads attend mostly to nearby patches (small mean attention distance), behaving a bit like convolutional filters, while in deep layers nearly all heads attend globally across the entire image.

This progressive shift from local to global attention is fascinating because it happens without any architectural constraint. Unlike CNNs, where the receptive field grows layer by layer because each filter is small, ViT’s attention is global from layer 1. Every patch can attend to every other patch from the very beginning. The model chooses to attend locally in early layers because that’s what’s most useful for building basic features — and it shifts to global attention only when it needs to reason about the whole image.

This is the same scaled dot-product attention mechanism we built from scratch, but now the weights have spatial meaning you can see.
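The local-to-global shift is usually quantified with mean attention distance: how far, on average, a query patch's attention mass lands. A minimal sketch of the metric, tested on two synthetic extremes (trained weights and per-head averaging are left out here):

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """Average spatial distance each query attends over, weighted by attention.

    attn: (N, N) attention weights over N = grid*grid patches (rows sum to 1).
    """
    coords = np.array([(i // grid, i % grid)
                       for i in range(grid * grid)], dtype=float)
    # Pairwise Euclidean distance between patch positions, in patch units
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (attn * dist).sum(axis=-1).mean()

grid = 4
n = grid * grid
local = np.eye(n)                      # each patch attends only to itself
global_uniform = np.ones((n, n)) / n   # uniform attention over all patches

print(mean_attention_distance(local, grid))           # 0.0
print(mean_attention_distance(global_uniform, grid))  # larger: spread over grid
```

In trained ViTs, early-layer heads sit near the local end of this scale and deep-layer heads near the global end.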

ViT vs CNNs — The Data Scale Trade-off

CNNs dominated computer vision for a decade (2012–2020) because they encode strong assumptions about images directly into the architecture: locality (each convolutional filter sees only a small neighborhood, so nearby pixels are treated as related), translation equivariance (the same filter slides across the whole image, so a pattern is detected wherever it appears), and a hierarchy of features built by stacking small filters.

These biases are correct for images, which gives CNNs a massive head start when data is scarce. ViT, with no built-in image priors, must learn spatial relationships from scratch. The data scale determines the winner:

Training Data         Winner             Why
Small (<1M images)    CNN                Inductive biases compensate for limited data
Medium (1–14M)        CNN (slight edge)  Priors still help; ViT closing the gap
Large (>14M images)   ViT                Flexibility beats assumptions; better scaling

A major breakthrough came with DeiT (Data-efficient Image Transformers, Touvron et al. 2021). Instead of needing 300 million images to beat CNNs, DeiT trains a ViT on just ImageNet (1.3 million images) by distilling knowledge from a CNN teacher. The CNN teacher provides the inductive bias that ViT lacks, packaged as soft labels. If you read our knowledge distillation post, you know exactly how this works — the teacher’s soft probability distribution over classes contains “dark knowledge” about inter-class relationships that hard labels destroy.

DeiT achieved 85.2% top-1 accuracy on ImageNet with no external data, trained in under 3 days on a single machine. That result ended the “ViT needs massive data” narrative and opened the door to practical vision transformers.
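A generic soft-target distillation loss looks like the sketch below. This is not DeiT's exact recipe — DeiT adds a dedicated distillation token and a hard-label distillation variant — and the temperature T and mixing weight alpha are illustrative assumptions:

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax for a single logit vector."""
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the teacher's soft targets."""
    hard = -np.log(softmax_T(student_logits)[label] + 1e-10)
    p_t = softmax_T(teacher_logits, T)   # teacher's soft labels: "dark knowledge"
    p_s = softmax_T(student_logits, T)
    soft = np.sum(p_t * (np.log(p_t + 1e-10) - np.log(p_s + 1e-10))) * T * T
    return alpha * hard + (1 - alpha) * soft

student = np.array([2.0, 0.5, 0.1, -1.0])
teacher = np.array([1.8, 0.9, 0.0, -1.2])
print(round(distillation_loss(student, teacher, label=0), 3))
```

When student and teacher agree, the soft term vanishes and only the hard-label loss remains; disagreement with the teacher is penalized even when the hard label is predicted correctly.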

The ViT Family and Beyond

The original ViT paper defined three model sizes that follow the now-familiar transformer scaling convention:

Model         Layers  d_model  Heads  Params  Patch
ViT-Base/16   12      768      12     86M     16×16
ViT-Large/16  24      1024     16     304M    16×16
ViT-Huge/14   32      1280     16     632M    14×14

The “/16” and “/14” suffixes indicate patch size. Smaller patches mean more tokens: a 224×224 image produces 196 tokens with /16 patches but 256 tokens with /14 patches. Finer patches generally improve accuracy, but because attention is O(n²) in the number of tokens, they also raise the computational cost.

The real impact of ViT, though, isn’t classification benchmarks. It’s that the same architecture became the backbone for an entire ecosystem of vision models: CLIP uses a ViT image encoder for vision-language contrastive learning, MAE (Masked Autoencoders) pretrains ViTs by reconstructing masked-out patches, DINO trains them self-supervised, and Segment Anything (SAM) builds segmentation on a ViT backbone.

The pattern is unmistakable: the transformer we built for language processes text tokens, and with one architectural change — swap the tokenizer for a patchifier — it processes image patches just as well.

Try It: Vision Transformer Explorer

Panel 1: Patchify Explorer

Select an image pattern and adjust the patch size to see how images become token sequences.

Panel 2: Attention Heatmap

Click a patch to see its attention pattern. Early layers attend locally; deep layers attend globally.

Panel 3: Position Embedding Similarity

Click a patch position to see its cosine similarity to all others. Toggle to see how training creates 2D spatial structure from 1D positions.


Connections to the Series

ViT is the junction point where the entire transformer pipeline we’ve built meets a new modality. Here’s how every piece connects:

image → patchify → embed → position → normalize → attend → FFN → softmax → classify
The same pipeline. One new step. Everything else you already built.

What Comes Next

ViT opened the door. The transformer we built for language now processes images too. But the real revolution isn’t just classification — it’s what happens when you connect vision and language in the same model.

CLIP trains a ViT image encoder and a text transformer side-by-side with contrastive learning, creating a shared embedding space where images and text live together. It’s the bridge between our embeddings post (where we learned word vectors) and this one (where we learned image vectors). And it’s the conditioning mechanism that tells Stable Diffusion what to generate from a text prompt.

The bigger picture: the same attention mechanism works for text, images, audio, video, protein sequences, and genomic data. The transformer isn’t a language model or an image model. It’s a sequence model — and anything can be turned into a sequence. We just proved it for images with one new idea and zero new architecture.

That’s the power of understanding fundamentals. You didn’t just learn ViT today. You confirmed that the twenty components you already built are the universal building blocks of modern AI.

References & Further Reading

Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (2020), the original ViT paper.
Touvron et al., “Training Data-efficient Image Transformers & Distillation through Attention” (2021), the DeiT paper.