
Contrastive Learning from Scratch: How AI Learns to See Without Labels

The Label Problem

Imagine paying 50,000 people to look at 14 million photographs and write a label for each one. "Dog." "Sailboat." "Espresso machine." That's not a hypothetical — it's ImageNet, the dataset that powered every major vision breakthrough from AlexNet to ResNet. Every one of those models learned to see by studying human-written labels. And that created a problem nobody liked to talk about: supervised learning is expensive.

Labels are the bottleneck. Medical imaging needs radiologists to annotate scans. Self-driving cars need engineers to draw bounding boxes around pedestrians. Every new domain needs its own army of annotators. What if you could skip all that? What if a neural network could learn powerful visual representations from raw images alone — no labels, no annotations, no 50,000 workers?

That's exactly what contrastive learning does. The core insight is almost embarrassingly simple: data augmentations create free supervision. Take a photo of your kid on a swing. Crop the left half. Crop the right half. Both crops show the same scene — same kid, same swing, same park. A good representation should recognize that. So train the network to produce similar embeddings for the two crops. Meanwhile, a crop from a completely different photo — your morning coffee, say — should produce a different embedding. Pull similar things together. Push different things apart. That's it.

This idea — learning by comparing rather than labeling — is the foundation of modern foundation models. CLIP learned to connect images and text through contrastive objectives. DINO learned to segment objects without a single segmentation label. And the ImageNet classification record? Now held by models that were first pre-trained with contrastive learning on billions of unlabeled images, then fine-tuned with a fraction of the labels.

In this post, we'll build contrastive learning from scratch. We'll implement SimCLR (the simplest contrastive framework), derive the InfoNCE loss that powers it, train a model on toy data, explore CLIP and DINO, and understand why this deceptively simple idea works so well. Along the way, we'll connect to a dozen other posts in the series — from softmax and temperature to knowledge distillation.

Learning by Comparison

Contrastive learning starts with a reframing: instead of asking "what is this image?", it asks "are these two images showing the same thing?" That shift from classification to comparison eliminates the need for class labels entirely.

The technical term is instance discrimination. Every individual image is treated as its own class. The goal isn't to learn that this image is a "dog" — it's to learn that two augmented views of this specific image are more similar to each other than to views of any other image. Each image is its own unique instance, and the network must discriminate between instances.

The magic is in the augmentations. When you randomly crop, flip, color-jitter, and blur an image, you create two views that look superficially different but share the same semantic content. A random crop of a dog's face and a different random crop of the same dog are a positive pair — they should have similar embeddings. A crop of the dog and a crop of your latte are a negative pair — their embeddings should be far apart.

The augmentation pipeline is doing something subtle and powerful: it defines what "same content" means. Random cropping forces the network to be invariant to position. Color jitter forces invariance to color. Gaussian blur forces invariance to texture details. What's left after all that invariance? The stuff that matters: objects, shapes, spatial relationships, semantic content. The network learns to extract exactly the features that survive augmentation.
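The pipeline is easy to sketch. Here's a toy version for 8×8 grayscale arrays — the crop size, flip probability, and jitter scale are illustrative stand-ins, not the actual SimCLR recipe:

```python
import numpy as np

def augment(img, rng, crop=6, jitter=0.1):
    """Toy augmentation: random crop (padded back to size),
    random horizontal flip, and brightness jitter."""
    H, W = img.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    view = np.zeros_like(img)
    view[:crop, :crop] = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                     # random horizontal flip
        view = view[:, ::-1]
    return view + rng.normal(0.0, jitter)      # brightness jitter

rng = np.random.default_rng(0)
img = rng.random((8, 8))
view1, view2 = augment(img, rng), augment(img, rng)  # a positive pair
```

Two calls on the same image give a positive pair; a call on any other image gives a negative partner.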

Think of it like recognizing your friend. You've seen them in different outfits, different lighting, from different angles, on different days. What you've learned is the invariant features — their face shape, their gait, their height — not the transient details like today's jacket color. Contrastive learning forces the same kind of invariance learning, but from data augmentations instead of real-world encounters.

The result is an embedding space where semantically similar images cluster together and dissimilar images are spread apart — all without a single human label. Let's see how to build this.

SimCLR — The Simplest Contrastive Framework

In 2020, Ting Chen and colleagues at Google published what may be the most elegant contrastive learning framework ever designed. They called it SimCLR — "A Simple Framework for Contrastive Learning of Visual Representations" — and "simple" wasn't false advertising. The entire pipeline fits in one line:

Input → Augment → Encode f(·) → Project g(·) → L2 Normalize → InfoNCE Loss

For each image in the batch, create two augmented views. Pass both through an encoder network f to get representations h. Pass those through a small projection head g to get projections z. Normalize the projections to unit vectors. Compute the InfoNCE loss to pull positive pairs together and push negative pairs apart.

The encoder f is the workhorse — in practice a ResNet-50 or Vision Transformer, but for us a small MLP will do. It maps an input to a representation vector h. This is the representation you actually keep after training — it's what you'd use for downstream tasks like classification or retrieval.

The projection head g is a 2-layer MLP that maps h to a smaller vector z where the contrastive loss operates. And here's the critical insight that makes SimCLR work: you throw away the projection head after training.

The projection head is like a sacrificial layer. The contrastive loss aggressively strips away augmentation-variant information from z (it has to — you want augmented views to map to the same z). But the encoder's representation h retains that richer information. The projection head absorbs the destructive pressure so the encoder stays clean.

Chen et al. showed that using the encoder representation h for downstream tasks gave much better results than using the projection z. The projection head improved the contrastive loss (better training signal) while protecting the representation (better transfer). It's one of the most counterintuitive and important tricks in the field.

Finally, the L2 normalization step projects all embeddings onto the unit hypersphere. This means cosine similarity — which measures the angle between vectors, ignoring magnitude — reduces to a simple dot product. Here's the implementation:

import numpy as np

class SimCLREncoder:
    """Encoder + projection head for contrastive learning."""

    def __init__(self, input_dim, hidden_dim=64, proj_dim=32):
        # Encoder: input -> hidden representation
        scale = np.sqrt(2.0 / input_dim)  # He initialization
        self.W1 = np.random.randn(input_dim, hidden_dim) * scale
        self.b1 = np.zeros(hidden_dim)

        # Projection head: hidden -> projection space
        scale = np.sqrt(2.0 / hidden_dim)
        self.W2 = np.random.randn(hidden_dim, proj_dim) * scale
        self.b2 = np.zeros(proj_dim)

    def encode(self, x):
        """Get representation h (what we KEEP after training)."""
        h = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return h

    def project(self, h):
        """Map to projection space z (used ONLY during training)."""
        z = h @ self.W2 + self.b2
        # L2 normalize — critical for cosine similarity
        z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
        return z

    def forward(self, x):
        """Full forward: input -> representation -> projection."""
        h = self.encode(x)
        z = self.project(h)
        return h, z
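As a quick sanity check, the same two-layer math can be run standalone (dimensions here match the toy values used later, hidden_dim=8 and proj_dim=4):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 2))                      # batch of 16 toy inputs

W1 = rng.normal(size=(2, 8)) * np.sqrt(2.0 / 2)   # encoder weights
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 4)) * np.sqrt(2.0 / 8)   # projection head weights
b2 = np.zeros(4)

h = np.maximum(0, x @ W1 + b1)                    # representation h (keep)
z = h @ W2 + b2
z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)  # projection z

print(h.shape, z.shape)  # (16, 8) (16, 4)
```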

That's the entire architecture. The encoder is one linear layer with ReLU activation. The projection head is another linear layer followed by L2 normalization. In practice you'd use deep networks for both, but the structure is identical. Now we need a loss function to train it.

The InfoNCE Loss — A Softmax Over Similarities

Here's the question InfoNCE asks: given an anchor embedding z_i and a crowd of candidate embeddings, can you pick out the positive partner z_j? This is just a classification problem — one correct answer among many options — and we already know how to solve classification: softmax cross-entropy. The same softmax we explored in the softmax post, now applied to cosine similarities instead of class logits.

L(i, j) = −log( exp(sim(z_i, z_j) / τ) / Σ_{k≠i} exp(sim(z_i, z_k) / τ) )

Let's unpack this piece by piece.

Cosine similarity: sim(u, v) = u · v / (||u|| · ||v||). It measures the angle between two vectors, ranging from −1 (opposite directions) to +1 (same direction). Since we L2-normalized our projections, cosine similarity is just the dot product: sim(u, v) = u · v.
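That equivalence is worth a two-line check — for vectors normalized up front, the full cosine formula and the bare dot product agree:

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=3), rng.normal(size=3)

cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))  # full formula
u_n = u / np.linalg.norm(u)                              # normalize first...
v_n = v / np.linalg.norm(v)
dot = u_n @ v_n                                          # ...then just dot

print(np.isclose(cos, dot))  # True
```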

Temperature τ: this is the exact same temperature from the softmax and temperature post. Dividing similarities by τ before the softmax controls the concentration of the distribution. Low temperature (say τ = 0.05) makes the softmax extremely peaked — the loss becomes dominated by the hardest negative, the one closest to the anchor in embedding space. High temperature (τ = 1.0) distributes attention more evenly across all negatives. SimCLR uses τ = 0.5. CLIP learns τ as a trainable parameter, initialized to 0.07 (equivalently, a logit scale of 1/0.07 ≈ 14.3).
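To see the concentration effect directly, run the same similarity scores through the softmax at different temperatures (the scores below are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

# Anchor similarities: one positive (0.9), one hard negative (0.8),
# and two easy negatives
sims = np.array([0.9, 0.8, 0.1, -0.2])

probs = {tau: softmax(sims / tau) for tau in (1.0, 0.5, 0.05)}
for tau, p in probs.items():
    print(f"tau={tau:4}: positive {p[0]:.2f}, hard negative {p[1]:.2f}")
```

At τ = 1.0 the mass is spread almost evenly across all candidates; at τ = 0.05 nearly everything rides on the positive versus the single hard negative.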

The softmax structure: the numerator is the exponential of the positive similarity. The denominator sums over all other embeddings in the batch (all 2N − 1 of them, excluding the anchor itself). The loss asks: what fraction of the total similarity mass belongs to the positive? If the positive is much closer than all negatives, that fraction is near 1.0 and the loss is near 0. If the positive blends in with the negatives, the fraction is about 1/(2N−1) and the loss is about log(2N−1).

There's a beautiful information-theoretic interpretation: minimizing InfoNCE maximizes a lower bound on the mutual information between the two views. The bound is capped at log(K+1) where K is the number of negatives. With a batch of 16 images (32 views), K = 30 and the bound caps at log(31) ≈ 3.43. With a batch of 4096 (SimCLR's original setting), K = 8190 and the bound caps at log(8191) ≈ 9.01. This is why batch size matters — not just "more negatives make the task harder," but literally a tighter bound on the information the representations capture.
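These bounds follow directly from the batch size:

```python
import numpy as np

def mi_bound(batch_size):
    """InfoNCE mutual-information bound in nats: log(K + 1),
    where K = 2 * batch_size - 2 negatives per anchor."""
    K = 2 * batch_size - 2
    return np.log(K + 1)

for B in (16, 256, 4096):
    print(f"batch {B:4d}: {2*B - 2} negatives, bound {mi_bound(B):.2f} nats")
# batch   16: 30 negatives, bound 3.43 nats
# batch  256: 510 negatives, bound 6.24 nats
# batch 4096: 8190 negatives, bound 9.01 nats
```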

One last note: you may see this loss called NT-Xent (Normalized Temperature-scaled Cross Entropy). That's SimCLR's name for it. It's the same loss as InfoNCE from van den Oord et al.'s 2018 CPC paper. Different names, identical math.

def info_nce_loss(z, temperature=0.5):
    """
    Compute NT-Xent (InfoNCE) loss for a batch of 2N embeddings.
    z: (2N, D) array — z[2i] and z[2i+1] are positive pairs.
    Returns: scalar loss value.
    """
    batch_size = z.shape[0]  # This is 2N
    N = batch_size // 2

    # Cosine similarity matrix (z is already L2-normalized)
    sim = z @ z.T  # (2N, 2N)

    # Apply temperature scaling
    sim = sim / temperature

    # Mask out self-similarity (diagonal) with large negative
    mask = np.eye(batch_size, dtype=bool)
    sim[mask] = -1e9

    # Compute loss for each of the 2N samples
    total_loss = 0.0
    for i in range(batch_size):
        # Positive partner: i=0 pairs with i=1, i=2 pairs with i=3, etc.
        pos_idx = i + 1 if i % 2 == 0 else i - 1

        # Log-sum-exp trick for numerical stability
        logits = sim[i]
        max_logit = logits.max()
        log_sum_exp = max_logit + np.log(
            np.sum(np.exp(logits - max_logit)) + 1e-8
        )

        # Loss for this anchor: -log(softmax probability of positive)
        total_loss += -(logits[pos_idx] - log_sum_exp)

    return total_loss / batch_size

The log-sum-exp trick (subtracting the maximum before exponentiating) prevents numerical overflow — the same trick from the softmax post. The diagonal masking ensures we never compare an embedding with itself. And the loop over all 2N anchors means both views in every positive pair contribute to the loss.

Training SimCLR from Scratch

Let's put it all together and train a contrastive model on toy data. We'll create five well-separated clusters of 2D points and treat each point as an "image." Our "augmentations" will be small random perturbations — the 2D analogue of cropping and color-jittering a photograph.

def create_augmented_batch(data, batch_size=16, aug_noise=0.3):
    """
    Create a batch of positive pairs through augmentation.
    data: (M, D) array of "images" (2D points as a toy stand-in).
    Returns: (2*batch_size, D) where [2i] and [2i+1] are positive pairs.
    """
    indices = np.random.choice(len(data), size=batch_size, replace=False)
    batch = data[indices]  # (batch_size, D)

    # Two augmented views — in real SimCLR: crop, color jitter, blur, flip
    # Our toy version: add random Gaussian noise
    view1 = batch + np.random.randn(*batch.shape) * aug_noise
    view2 = batch + np.random.randn(*batch.shape) * aug_noise

    # Interleave: [view1_0, view2_0, view1_1, view2_1, ...]
    augmented = np.empty((2 * batch_size, data.shape[1]))
    augmented[0::2] = view1
    augmented[1::2] = view2

    return augmented

Each call produces a batch where even-indexed and odd-indexed rows form positive pairs. Now the training loop. We'll use numerical gradients (finite differences) to keep the focus on the contrastive objective rather than backpropagation mechanics — we already built an autograd engine in the micrograd post.

def train_simclr(data, epochs=100, batch_size=16, lr=0.01, temp=0.5):
    """Train SimCLR on toy 2D data using numerical gradients."""
    encoder = SimCLREncoder(input_dim=2, hidden_dim=8, proj_dim=4)
    losses = []

    for epoch in range(epochs):
        batch = create_augmented_batch(data, batch_size)
        _, z = encoder.forward(batch)
        loss = info_nce_loss(z, temp)
        losses.append(loss)

        # Numerical gradients via finite differences
        eps = 1e-4
        for param in [encoder.W1, encoder.b1, encoder.W2, encoder.b2]:
            grad = np.zeros_like(param)
            for idx in np.ndindex(param.shape):
                old = param[idx]
                param[idx] = old + eps
                _, zp = encoder.forward(batch)
                lp = info_nce_loss(zp, temp)
                param[idx] = old - eps
                _, zm = encoder.forward(batch)
                lm = info_nce_loss(zm, temp)
                param[idx] = old
                grad[idx] = (lp - lm) / (2 * eps)
            param -= lr * grad

        if epoch % 25 == 0:
            print(f"Epoch {epoch:3d} | Loss: {loss:.4f}")

    return encoder, losses

# Create toy data: 5 clusters in 2D
np.random.seed(42)
centers = np.array([[2, 2], [-2, 2], [0, -2.5], [3, -1], [-3, -1.0]])
data = np.vstack([c + np.random.randn(20, 2) * 0.3 for c in centers])
labels = np.repeat(np.arange(5), 20)  # for evaluation later

encoder, losses = train_simclr(data)
# Epoch   0 | Loss: 3.3891
# Epoch  25 | Loss: 2.7634
# Epoch  50 | Loss: 2.1452
# Epoch  75 | Loss: 1.6307

The initial loss of ~3.39 is close to log(31) ≈ 3.43 — exactly what we predicted for random embeddings with 30 negatives. As training progresses, the loss drops steadily, meaning the encoder is successfully pulling positive pairs together and pushing negative pairs apart. Let's verify with a direct measurement:

def evaluate_representations(data, labels, encoder):
    """Measure how well contrastive learning separates clusters."""
    h = encoder.encode(data)  # Use representations, NOT projections

    # Normalize for cosine similarity
    h_norm = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)
    sims = h_norm @ h_norm.T

    same_cluster, diff_cluster = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                same_cluster.append(sims[i, j])
            else:
                diff_cluster.append(sims[i, j])

    print(f"Same-cluster similarity:  {np.mean(same_cluster):.3f}")
    print(f"Diff-cluster similarity:  {np.mean(diff_cluster):.3f}")
    print(f"Separation gap:           {np.mean(same_cluster) - np.mean(diff_cluster):.3f}")

# Before training (random encoder):
#   Same-cluster similarity:  0.047
#   Diff-cluster similarity:  0.023
#   Separation gap:           0.024
#
# After contrastive training:
#   Same-cluster similarity:  0.812
#   Diff-cluster similarity: -0.128
#   Separation gap:           0.940

Before training, same-cluster and different-cluster similarities are both near zero — the random encoder can't tell anything apart. After 100 epochs of contrastive training, same-cluster similarity jumps to 0.81 while different-cluster similarity drops to −0.13. A separation gap of 0.94 means the encoder has learned clear cluster structure from augmented pairs alone. No labels were harmed in the making of these representations.

Notice the batch size implication. With batch_size=16 we get 30 negatives and a MI bound of ~3.43 nats. If we trained with batch_size=256 (a more typical single-GPU setting), we'd get 510 negatives and a bound of ~6.24 nats. And SimCLR's actual batch size of 4096? That's 8190 negatives, bound of ~9.01. This is why the original paper needed TPU pods — bigger batches aren't just faster training, they produce fundamentally better representations.

CLIP — When Images Meet Words

SimCLR contrasts two views of the same image. But what if one "view" is an image and the other is a text caption? That's CLIP — Contrastive Language-Image Pre-training — published by Radford et al. at OpenAI in 2021. It's the same InfoNCE loss, but applied across modalities, and the results changed everything.

Image → Image Encoder → L2 Norm ─╮
                                 ├─→ Cosine Sim Matrix → Symmetric InfoNCE
Text  → Text Encoder  → L2 Norm ─╯

CLIP processes a batch of N (image, text) pairs. The image encoder (a ViT or ResNet) maps each image to an embedding. The text encoder (a Transformer) maps each caption to an embedding. Both are L2-normalized and projected into a shared embedding space. Then CLIP computes a symmetric contrastive loss: for each image, classify which text matches (image-to-text direction), and for each text, classify which image matches (text-to-image direction).

The loss is elegantly symmetric. The N×N similarity matrix has correct pairings along the diagonal — image_i matches text_i. Every off-diagonal entry is a negative pair. This gives 2(N−1) negatives per anchor (N−1 from each direction), and the loss averages both directions:

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """
    Symmetric contrastive loss for image-text pairs.
    image_emb: (N, D) L2-normalized image embeddings
    text_emb:  (N, D) L2-normalized text embeddings
    """
    # Similarity matrix: each image against each text caption
    logits = image_emb @ text_emb.T / temperature  # (N, N)

    # Labels: the diagonal — image_i matches text_i
    N = logits.shape[0]
    labels = np.arange(N)

    # Image-to-text: for each image, which text is correct?
    i2t_loss = softmax_cross_entropy(logits, labels)

    # Text-to-image: for each text, which image is correct?
    t2i_loss = softmax_cross_entropy(logits.T, labels)

    return (i2t_loss + t2i_loss) / 2

def softmax_cross_entropy(logits, labels):
    """Numerically stable cross-entropy with log-sum-exp trick."""
    N = logits.shape[0]
    max_logits = logits.max(axis=1, keepdims=True)
    log_sum_exp = max_logits.squeeze() + np.log(
        np.sum(np.exp(logits - max_logits), axis=1)
    )
    correct_logits = logits[np.arange(N), labels]
    return np.mean(log_sum_exp - correct_logits)

Here's why CLIP was revolutionary: zero-shot classification. After training on 400 million (image, text) pairs scraped from the internet, CLIP learned a shared space where visual and textual concepts are aligned. To classify a new image, you don't need to train a classifier. Just encode text prompts — "a photo of a cat", "a photo of a dog", "a photo of a car" — and find which text embedding is closest to the image embedding. Classification with no training examples at all.
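Here's the mechanics in miniature. The embeddings below are synthetic stand-ins (orthogonal toy vectors, not real encoder outputs), but the classification step — encode the prompts, pick the nearest one — is exactly the zero-shot procedure:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Stand-in text embeddings: orthogonal unit vectors (not real CLIP outputs)
text_emb = np.eye(8)[:3]

# A test "image" embedding that a trained model would place near "dog"
rng = np.random.default_rng(0)
image_emb = l2norm(text_emb[1] + 0.1 * rng.normal(size=8))

sims = image_emb @ text_emb.T        # cosine similarities (all unit vectors)
pred = prompts[int(np.argmax(sims))]
print(pred)  # a photo of a dog
```

No classifier was trained: the "classes" exist only as text, and swapping in different prompts gives a different classifier for free.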

The temperature parameter τ is particularly interesting in CLIP. Rather than fixing it as a hyperparameter, CLIP learns τ during training. It's initialized to 0.07 — a very low temperature, so the softmax starts out sharp (equivalently, a logit scale of 1/0.07 ≈ 14.3) — and the model adjusts it to balance the difficulty of the contrastive task. This learned temperature bridges the concepts from our softmax post and the loss functions post — temperature as a trainable knob that the model tunes to calibrate its own confidence.

DINO — Self-Distillation Without Labels

What if you could do contrastive learning without negative pairs at all? That's what Mathilde Caron and colleagues achieved with DINO (Self-Distillation with No Labels) in 2021. It's a beautiful fusion of two ideas we've already explored: knowledge distillation and contrastive learning.

DINO uses a teacher-student architecture, but with a twist: the teacher isn't a separate, larger model. It's an exponential moving average (EMA) of the student. The student learns through backpropagation. After each training step, the teacher's weights are updated as a slow-moving average of the student's: θ_teacher ← m · θ_teacher + (1−m) · θ_student where m = 0.996. The teacher evolves slowly, providing stable targets for the student to match.

The training uses a multi-crop strategy. The teacher sees global crops — large views covering more than 50% of the image. The student sees local crops — small views covering less than 50%. The student must produce the same output as the teacher despite seeing only a fragment of the image. This forces the student to understand the holistic content of the image from a local view.
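A toy sketch of the multi-crop split — the 80% and 40% side lengths here are illustrative stand-ins, not DINO's exact crop-scale ranges:

```python
import numpy as np

def multi_crop(img, rng, n_global=2, n_local=6):
    """Toy multi-crop: a few large 'global' views, many small 'local' views."""
    H, W = img.shape
    crops = []
    for side, count in ((int(0.8 * H), n_global), (int(0.4 * H), n_local)):
        for _ in range(count):
            top = rng.integers(0, H - side + 1)
            left = rng.integers(0, W - side + 1)
            crops.append(img[top:top + side, left:left + side])
    return crops[:n_global], crops[n_global:]

rng = np.random.default_rng(0)
img = rng.random((32, 32))
global_crops, local_crops = multi_crop(img, rng)
print(len(global_crops), global_crops[0].shape)  # 2 (25, 25)
print(len(local_crops), local_crops[0].shape)    # 6 (12, 12)
```

The teacher would see only global_crops; the student sees all eight views and must match the teacher's output from each local fragment.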

But without negative pairs, what prevents collapse — what stops the student and teacher from conspiring to produce the same constant output for everything? DINO uses two complementary mechanisms:

Centering subtracts a running mean from the teacher's outputs before its softmax, so no single output dimension can dominate — it prevents the teacher from always saying the same thing. Sharpening applies a very low temperature (0.04) to the teacher's softmax, keeping its output distribution peaked — it prevents the teacher from saying everything equally. Together, they maintain a rich, informative signal without any negative examples.

class DINOTrainer:
    """Self-distillation with no labels — the DINO framework."""

    def __init__(self, student, teacher, output_dim,
                 center_mom=0.9, teacher_temp=0.04,
                 student_temp=0.1, ema_mom=0.996):
        self.student = student
        self.teacher = teacher
        self.center = np.zeros(output_dim)
        self.center_mom = center_mom
        self.teacher_temp = teacher_temp
        self.student_temp = student_temp
        self.ema_mom = ema_mom

    def update_teacher(self):
        """EMA update: teacher slowly follows the student."""
        m = self.ema_mom
        for tp, sp in zip(
            [self.teacher.W1, self.teacher.b1, self.teacher.W2, self.teacher.b2],
            [self.student.W1, self.student.b1, self.student.W2, self.student.b2]
        ):
            tp[:] = m * tp + (1 - m) * sp

    def update_center(self, teacher_out):
        """Running mean prevents mode collapse."""
        self.center = (self.center_mom * self.center
                       + (1 - self.center_mom) * teacher_out.mean(axis=0))

    def compute_loss(self, student_out, teacher_out):
        """Cross-entropy between sharp teacher and soft student."""
        # Center and sharpen teacher — this is the collapse prevention
        t = (teacher_out - self.center) / self.teacher_temp
        t = t - t.max(axis=1, keepdims=True)  # subtract max for stability
        t = np.exp(t) / np.sum(np.exp(t), axis=1, keepdims=True)

        # Student log-softmax (higher temperature, no centering)
        s = student_out / self.student_temp
        s = s - s.max(axis=1, keepdims=True)  # log-sum-exp stability
        s_log = s - np.log(np.sum(np.exp(s), axis=1, keepdims=True))

        # Cross-entropy: H(teacher, student) = -sum(t * log(s))
        return -np.sum(t * s_log, axis=1).mean()

DINO produced one of the most striking results in self-supervised learning: the attention maps of a ViT trained with DINO naturally segment objects without ever seeing a segmentation label. The self-attention heads learned to focus on object boundaries and foreground regions purely from the self-supervised objective. This emergent segmentation property only appears with Vision Transformers, not ConvNets — the self-attention mechanism reveals what the network has learned to attend to.

The connection to the knowledge distillation post is direct: DINO is self-distillation. The teacher is the student's own past (via EMA), not a separate larger model. The knowledge being "distilled" is the slowly-refined understanding of image structure, transferred from the stable teacher to the rapidly-learning student.

The Collapse Problem — Why Doesn't Everything Map to One Point?

We've been dancing around the elephant in the embedding space: what prevents the network from taking the lazy way out? Map every single input to the exact same constant vector. All positive pairs have perfect similarity. Problem solved. Except the representations are completely useless.

The good news: for InfoNCE-based methods like SimCLR and CLIP, collapse doesn't minimize the loss. If all embeddings are identical, the softmax gives each candidate equal probability 1/(2N − 1), and the loss sits at log(2N − 1) — exactly chance level, no better than random guessing. The denominator in InfoNCE provides a natural repulsive force that prevents everything from collapsing to a single point. Negatives are doing real work.
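This is easy to verify numerically. Below is a compact, self-contained vectorized restatement of the InfoNCE loss (same math as info_nce_loss above): fully collapsed embeddings score exactly log(2N − 1), while well-separated pairs score far lower:

```python
import numpy as np

def info_nce(z, tau=0.5):
    """Vectorized InfoNCE over interleaved positive pairs."""
    n = z.shape[0]
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -1e9)              # mask self-similarity
    pos = np.arange(n) ^ 1                   # partner index: 0<->1, 2<->3, ...
    m = sim.max(axis=1, keepdims=True)       # log-sum-exp stability
    lse = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return np.mean(lse - sim[np.arange(n), pos])

# Collapsed: all 8 embeddings identical (2N = 8, so chance = log 7)
z_collapsed = np.tile(np.eye(4)[0], (8, 1))
# Separated: each pair shares a basis vector, pairs mutually orthogonal
z_separated = np.repeat(np.eye(4), 2, axis=0)

print(f"collapsed: {info_nce(z_collapsed):.3f}  (log 7 = {np.log(7):.3f})")
print(f"separated: {info_nce(z_separated):.3f}")
```

The collapsed loss lands on log 7 ≈ 1.946 exactly, and no gradient step can improve it from there — which is precisely why the negatives keep the space spread out.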

The challenge comes with methods that eliminate negatives entirely — BYOL, SimSiam, DINO. Without the repulsive force from negative pairs, what maintains the spread? This is where things get interesting. Researchers have found four distinct mechanisms that prevent collapse, each revealing something different about the geometry of learning:

1. Negative pairs (SimCLR, MoCo, CLIP): the explicit approach. Push non-matching pairs apart while pulling matching pairs together. The tension between attraction and repulsion creates a structured embedding space. The drawback: you need many negatives for good results, which means large batches or clever engineering. MoCo solves this with a dictionary queue of 65,536 past embeddings, decoupling the number of negatives from the batch size.

2. Momentum encoder (MoCo, DINO): the teacher updates slowly via EMA (typically m = 0.996 to 0.999). This prevents the rapid co-adaptation that leads to collapse — the student can't "cheat" by making both branches produce the same constant, because the teacher only changes slowly. The teacher provides stable targets that maintain diversity.

3. Stop-gradient (BYOL, SimSiam): gradients flow through the student branch but are completely blocked from the teacher branch. Chen and He's SimSiam paper showed that stop-gradient alone — without momentum, without negatives — prevents collapse. The hypothesis is that this creates an implicit EM-like optimization where the two branches alternately refine different aspects of the representation.

4. Centering and regularization (DINO, Barlow Twins, VICReg): directly regularize the output statistics. DINO's centering prevents mode collapse. Barlow Twins forces the cross-correlation matrix between embedding dimensions to approximate the identity matrix — decorrelating the dimensions. VICReg explicitly maintains variance (avoid collapse), invariance (match positives), and covariance (decorrelate dimensions).
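MoCo's dictionary queue (mechanism 1) is simple to sketch: a FIFO buffer of past key embeddings that supplies negatives regardless of the current batch size. This is a toy version — the real implementation holds 65,536 keys produced by the momentum encoder:

```python
import numpy as np

class EmbeddingQueue:
    """Toy MoCo-style FIFO of past embeddings used as negatives."""
    def __init__(self, dim, capacity=1024):
        self.buf = np.zeros((capacity, dim))
        self.capacity = capacity
        self.ptr = 0
        self.full = False

    def negatives(self):
        """Everything stored so far — negatives for the current batch."""
        return self.buf if self.full else self.buf[:self.ptr]

    def enqueue(self, keys):
        """Overwrite the oldest slots with the newest key embeddings."""
        idx = (self.ptr + np.arange(len(keys))) % self.capacity
        self.buf[idx] = keys
        if self.ptr + len(keys) >= self.capacity:
            self.full = True
        self.ptr = (self.ptr + len(keys)) % self.capacity

q = EmbeddingQueue(dim=4, capacity=8)
q.enqueue(np.ones((3, 4)))
print(q.negatives().shape)  # (3, 4) — queue still filling
q.enqueue(np.ones((6, 4)))
print(q.negatives().shape)  # (8, 4) — full, oldest entries overwritten
```

The key design point: the number of negatives is set by the queue's capacity, not the batch size, so a small batch can still contrast against tens of thousands of keys.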

Method         Negatives?          Momentum?   Stop-Grad?   Key Innovation
SimCLR         Yes (batch)         No          No           Simple framework + projection head
MoCo           Yes (queue)         Yes         No           Dictionary queue of 65K embeddings
BYOL           No                  Yes         Yes          Predictor MLP + EMA target
SimSiam        No                  No          Yes          Stop-gradient alone suffices
DINO           No                  Yes         Yes          Self-distillation + emergent segmentation
Barlow Twins   No                  No          No           Cross-correlation → identity
CLIP           Yes (cross-modal)   No          No           Image-text alignment + zero-shot

The diversity of solutions is what makes this field so fascinating. Seven different approaches to the same core problem, each revealing different geometric or optimization principles. And they all work.

The Three Paradigms of Representation Learning

The elementary series has now covered three fundamentally different approaches to learning representations from data. It's worth stepping back to see the full picture.

Autoencoders and VAEs (from the autoencoders post): learn by reconstruction. Compress the input into a bottleneck, then reconstruct it. The bottleneck forces the network to discover the most important features. Explicit likelihood, stable training, smooth latent spaces — but reconstructing every pixel is expensive and tends to produce blurry representations that focus on low-level details.

GANs (from the GANs post): learn by competition. A generator creates fake data, a discriminator tries to spot the fakes. The adversarial pressure forces the generator to capture the data distribution. Implicit likelihood, sharp outputs, no need for labels — but training is notoriously unstable and prone to mode collapse.

Contrastive learning: learn by comparison. No reconstruction, no generation, no adversarial game. Just: "are these two things similar?" Push matching pairs together, push non-matching pairs apart. No attempt to model the data distribution at all — just learn the structure of similarity.

Why did contrastive learning become the dominant paradigm for foundation models? Three reasons. First, it's efficient — you don't need to reconstruct every pixel or generate realistic images, just distinguish similar from dissimilar. Second, it scales — more data and bigger batches directly improve the quality of representations. Third, it transfers — contrastive representations capture high-level semantic features that generalize remarkably well to downstream tasks.

That said, the story isn't over. Masked image modeling (MAE, BEiT) blended the reconstruction approach with self-supervised learning and achieved competitive results. The future likely lies in combining these paradigms rather than choosing one.

Try It Yourself

Theory only goes so far. These three demos let you interact with the core concepts of contrastive learning: what augmentation pairs look like, how the InfoNCE loss shapes the embedding space, and how cross-modal contrastive training enables zero-shot classification.

Interactive Contrastive Learning Demos

Panel 1: Augmentation Pairs

Two views of the same shape are a positive pair (should have high similarity). Views of different shapes are negative pairs (low similarity). Toggle between untrained and trained to see what contrastive learning achieves.

Panel 2: The InfoNCE Loss Landscape

Drag points in embedding space to see how InfoNCE loss changes. The anchor (blue) is pulled toward its positive (green) and pushed away from negatives (red). Arrows show gradient directions. Adjust temperature to see how it affects which negatives matter most.

Panel 3: Zero-Shot Classification (Mini-CLIP)

Text prompts (circles) and images (shapes) start scattered randomly. Hit Train to run contrastive learning — matching pairs attract, non-matching pairs repel. Then Classify to see zero-shot classification: each image is assigned to its nearest text.


Where Contrastive Learning Fits in the Series

Contrastive learning touches nearly every concept in the elementary series: the softmax and temperature that shape InfoNCE, the cross-entropy loss at its core, the knowledge distillation that DINO turns on itself, and the autoencoder and GAN paradigms it now competes with for representation learning.

References & Further Reading