CLIP from Scratch: Teaching Machines to See and Read at the Same Time
The Model That Changed Everything
Here’s a party trick. Show a model a photograph it has never seen before — say, a capybara wearing a tiny hat — along with a handful of candidate captions in plain English, one of which is "a photo of a capybara wearing a hat". The model has never been trained on capybara hat photos. It has never even seen the word “capybara” paired with an image during training. And yet it picks the right caption. No fine-tuning. No additional training data. Just language.
That’s CLIP — Contrastive Language-Image Pre-training — and when OpenAI published it in 2021, it quietly rewrote the rules of computer vision. Before CLIP, if you wanted a model to recognize dogs, you needed thousands of labeled dog photos. Want it to recognize cats too? More labeled photos. Capybaras in hats? Good luck finding that dataset.
CLIP sidesteps this entirely. Instead of learning fixed categories, it learns to place images and text descriptions in the same vector space. A photo of a dog and the sentence “a photo of a dog” get mapped to nearby vectors. Any new category works at inference time because you just embed the category name and find the nearest images.
The impact has been enormous. Stable Diffusion’s text encoder? That’s a frozen CLIP model. Multimodal search engines that find images from text queries? CLIP embeddings. Zero-shot classification, image-text matching, content moderation at scale — all CLIP.
In this post, we’ll build CLIP from scratch. We’ll construct both encoders, derive the contrastive loss, walk through a concrete training example, and implement zero-shot classification. Along the way, we’ll connect to nearly every post in the elementary series — from embeddings and attention to contrastive learning and vision transformers. This is the post that ties vision and language together.
The Architecture: Two Encoders, One Space
CLIP’s architecture is disarmingly simple: two separate neural networks — one for images, one for text — that project their inputs into a shared embedding space. That’s it. No cross-attention between modalities, no shared weights between encoders. Just two towers that learn to agree on what “similar” means.
The Image Encoder: A Vision Transformer
The image encoder is a Vision Transformer (ViT) — exactly the architecture we built in the ViT post. An input image (224×224) is chopped into patches (e.g., 14×14 pixels each, giving 16×16 = 256 patches), each patch is linearly projected into an embedding vector, a learnable [CLS] token is prepended, positional embeddings are added, and the whole sequence passes through standard transformer layers. The [CLS] token’s output — a single vector summarizing the entire image — is our image representation.
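The patch arithmetic is worth checking once by hand (using the ViT-L/14-style numbers above):

```python
image_size, patch_size = 224, 14   # ViT-L/14-style settings
grid = image_size // patch_size    # patches per side
num_patches = grid ** 2            # total patches
seq_len = num_patches + 1          # +1 for the prepended [CLS] token
print(grid, num_patches, seq_len)  # 16 256 257
```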
The Text Encoder: A Transformer with BPE
The text encoder is a GPT-2-style transformer with causal (masked) self-attention. Text is tokenized using byte-pair encoding (BPE) — the same algorithm we built from scratch — into a sequence of up to 77 tokens. A special [EOS] token is appended, and the transformer’s output at that token position becomes the text representation. Think of [EOS] as the text equivalent of [CLS]: a single vector that summarizes the entire caption.
The Projection: Meeting in the Middle
Here’s the crucial step. Each encoder outputs vectors of different dimensions — the ViT might produce 1024-dimensional image features, the text transformer 512-dimensional text features. We need them in the same space. So each encoder has a learned linear projection that maps its output to a shared dimension (typically 512 or 768). Both projected vectors are then L2-normalized, making cosine similarity equal to a simple dot product.
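The payoff of that final normalization is easy to verify: once both vectors are unit length, their dot product is exactly the cosine similarity of the originals. A quick check with random vectors standing in for projected CLIP features:

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(512), torch.randn(512)            # stand-in feature vectors
a_n, b_n = F.normalize(a, dim=0), F.normalize(b, dim=0)

# After L2 normalization, the dot product equals cosine similarity
dot = a_n @ b_n
cos = F.cosine_similarity(a, b, dim=0)
print(torch.allclose(dot, cos, atol=1e-6))  # True
```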
Let’s build it:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionEncoder(nn.Module):
    """Simplified ViT image encoder for CLIP."""

    def __init__(self, image_size=224, patch_size=16, channels=3,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: project each patch to embed_dim
        self.patch_embed = nn.Conv2d(
            channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
        # Standard transformer encoder
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.ln = nn.LayerNorm(embed_dim)

    def forward(self, images):
        # images: [B, C, H, W]
        x = self.patch_embed(images)       # [B, embed_dim, grid, grid]
        x = x.flatten(2).transpose(1, 2)   # [B, num_patches, embed_dim]
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)     # [B, num_patches + 1, embed_dim]
        x = x + self.pos_embed
        x = self.transformer(x)
        return self.ln(x[:, 0])            # [CLS] token output


class TextEncoder(nn.Module):
    """Simplified text encoder for CLIP."""

    def __init__(self, vocab_size=49408, max_len=77,
                 embed_dim=512, depth=12, num_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.randn(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.ln = nn.LayerNorm(embed_dim)

    def forward(self, token_ids):
        # token_ids: [B, seq_len]
        x = self.token_embed(token_ids) + self.pos_embed[:, :token_ids.size(1)]
        # Causal mask: each token can only attend to previous tokens
        mask = torch.triu(
            torch.ones(token_ids.size(1), token_ids.size(1), device=x.device),
            diagonal=1
        ).bool()
        x = self.transformer(x, mask=mask)
        # Extract the [EOS] token output (CLIP uses the highest-id token,
        # which is [EOS] with id 49407 in the 49,408-token vocabulary)
        eos_indices = token_ids.argmax(dim=-1)
        x = x[torch.arange(x.size(0)), eos_indices]
        return self.ln(x)


class CLIP(nn.Module):
    """Complete CLIP model: two encoders projecting to a shared space."""

    def __init__(self, embed_dim=512, vision_dim=768, text_dim=512):
        super().__init__()
        self.vision_encoder = VisionEncoder(embed_dim=vision_dim)
        self.text_encoder = TextEncoder(embed_dim=text_dim)
        # Linear projections to shared embedding space
        self.vision_proj = nn.Linear(vision_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learned inverse temperature, stored in log-space (exp ≈ 1/0.07 ≈ 14.3 at init)
        self.logit_scale = nn.Parameter(torch.ones([]) * torch.log(
            torch.tensor(1.0 / 0.07)
        ))

    def encode_image(self, images):
        features = self.vision_encoder(images)
        projected = self.vision_proj(features)
        return F.normalize(projected, dim=-1)  # L2 normalize

    def encode_text(self, token_ids):
        features = self.text_encoder(token_ids)
        projected = self.text_proj(features)
        return F.normalize(projected, dim=-1)  # L2 normalize

    def forward(self, images, token_ids):
        image_embeds = self.encode_image(images)   # [B, embed_dim]
        text_embeds = self.encode_text(token_ids)  # [B, embed_dim]
        return image_embeds, text_embeds, self.logit_scale.exp()
```
Notice the logit_scale parameter — a single learned scalar initialized to log(1/0.07). We store it in log-space for numerical stability and exponentiate during the forward pass, giving a logit scale of 1/τ ≈ 14.3 at initialization. This is the temperature idea from our softmax & temperature post, but here τ is learned, not fixed. The model discovers its own optimal sharpness during training.
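To see what that scale does, compare a softmax over raw cosine similarities with the same softmax after dividing by τ = 0.07. The similarity values below are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical cosine similarities between one image and four captions;
# the first caption is the true match
sims = torch.tensor([0.62, 0.41, 0.38, 0.35])

uniform_ish = F.softmax(sims, dim=0)        # nearly uniform: sims span only ~0.27
sharpened = F.softmax(sims / 0.07, dim=0)   # scaled by 1/0.07 ≈ 14.3

print(uniform_ish)  # all probabilities close to 0.25
print(sharpened)    # the true match dominates
```

Without scaling, cosine similarities live in [−1, 1] — too narrow a range for softmax to produce a confident distribution, which is why the learned scale matters.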
The Contrastive Loss: N-Way Classification in Both Directions
Here’s where the magic happens. In our contrastive learning post, we built InfoNCE loss to pull similar items together and push dissimilar items apart. CLIP uses the exact same principle, but applied across modalities: images and text.
Given a batch of N image-text pairs, CLIP computes an N×N similarity matrix. Each row asks: “for this image, which of the N texts is its match?” Each column asks: “for this text, which of the N images is its match?” The correct answer is always the diagonal — image i matches text i.
Walking Through a 4×4 Batch
Let’s make this concrete. Suppose our batch has four pairs:
- Pair 0: (photo of a dog, “a golden retriever playing fetch”)
- Pair 1: (photo of a car, “a red sports car on a highway”)
- Pair 2: (photo of a cake, “a chocolate birthday cake with candles”)
- Pair 3: (photo of a mountain, “snow-capped peaks at sunset”)
After encoding and L2-normalizing, we compute cosine similarity between every image and every text. Multiplying by the learned logit scale (1/τ ≈ 14.3), we get a logit matrix:
| | “retriever…” | “sports car…” | “birthday cake…” | “peaks…” |
|---|---|---|---|---|
| Dog photo | 9.1 | 1.2 | 0.8 | 0.5 |
| Car photo | 0.9 | 8.7 | 0.3 | 1.1 |
| Cake photo | 0.4 | 0.6 | 9.4 | 0.2 |
| Mountain photo | 0.7 | 1.3 | 0.1 | 8.9 |
The diagonal should be high — matching pairs. Everything else should be low. Now we apply cross-entropy loss in both directions:
- Image → Text: Softmax across each row. For row 0, the target is column 0. This is 4-way classification: “which text matches this image?”
- Text → Image: Softmax down each column. For column 0, the target is row 0. Also 4-way classification: “which image matches this text?”
The total loss averages both directions. This symmetry matters — it ensures neither modality dominates the learning signal.
loss = (cross_entropy(logits, labels) + cross_entropy(logitsᵀ, labels)) / 2, where labels = [0, 1, 2, …, N−1]
The label for every sample is just its index — image i should match text i. This transforms contrastive learning into standard classification, which is why we can use plain cross-entropy loss.
Here’s the implementation — just seven lines that power the entire model:
```python
def clip_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric contrastive loss for CLIP.

    Args:
        image_embeds: [N, D] L2-normalized image embeddings
        text_embeds:  [N, D] L2-normalized text embeddings
        logit_scale:  scalar (exp of learned log-temperature)
    """
    # Cosine similarity matrix scaled by temperature
    logits = logit_scale * image_embeds @ text_embeds.T  # [N, N]
    # Labels: image_i matches text_i
    labels = torch.arange(len(image_embeds), device=logits.device)
    # Cross-entropy in both directions
    loss_i2t = F.cross_entropy(logits, labels)    # rows: image -> text
    loss_t2i = F.cross_entropy(logits.T, labels)  # cols: text -> image
    return (loss_i2t + loss_t2i) / 2
```
That’s it. The entire CLIP training objective. The paper’s own pseudocode (their famous Figure 3) looks almost identical — here it is in NumPy notation, annotated:
```python
# ---- CLIP pseudocode (adapted from Radford et al., 2021, Figure 3) ----
# I_f = image_encoder(images)                     # [N, d_image]
# T_f = text_encoder(texts)                       # [N, d_text]
#
# Project to shared space and normalize
# I_e = l2_normalize(I_f @ W_image)               # [N, d_embed]
# T_e = l2_normalize(T_f @ W_text)                # [N, d_embed]
#
# Scaled pairwise cosine similarities
# logits = I_e @ T_e.T * exp(temperature)         # [N, N]
#
# Symmetric cross-entropy
# labels = arange(N)                              # [0, 1, 2, ..., N-1]
# loss_i = cross_entropy(logits, labels, axis=0)  # image-to-text
# loss_t = cross_entropy(logits, labels, axis=1)  # text-to-image
# loss = (loss_i + loss_t) / 2
```
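A quick sanity check of the loss behaves as expected. Here random 16-dimensional unit vectors stand in for real CLIP outputs, and the loss is repeated inline so the snippet runs standalone: perfectly aligned pairs drive it toward zero, while unrelated pairs land near the chance level of ln N.

```python
import math
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, logit_scale):
    # Same symmetric contrastive loss as in the post
    logits = logit_scale * image_embeds @ text_embeds.T
    labels = torch.arange(len(image_embeds))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

torch.manual_seed(0)
aligned = F.normalize(torch.randn(8, 16), dim=-1)    # text_i == image_i
unrelated = F.normalize(torch.randn(8, 16), dim=-1)  # no relationship to `aligned`

print(clip_loss(aligned, aligned, 14.3).item())    # near 0: every diagonal wins
print(clip_loss(aligned, unrelated, 14.3).item())  # high: nothing to match on
print(math.log(8))                                 # chance level for N=8 ≈ 2.08
```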
The symmetry in CLIP’s loss means every batch provides 2N classification tasks — N image-to-text and N text-to-image. With a batch size of 32,768, that’s 65,536 classification problems per gradient step. This density of supervision is why CLIP learns such powerful representations.
Training: Teaching Vision and Language to Agree
With the architecture and loss defined, training CLIP is conceptually straightforward: encode images, encode text, compute loss, backprop. But the details matter.
Why Batch Size Is Everything
CLIP was trained with batches of 32,768 image-text pairs. That number isn’t arbitrary. In contrastive learning, every non-matching pair in the batch is a negative example. With N = 32,768, each image has 32,767 negative texts to compare against. More negatives means a harder classification task, which forces the model to learn finer-grained distinctions.
We saw this same principle in the contrastive learning post with SimCLR — larger batches consistently produced better representations. CLIP takes this to an extreme.
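One way to quantify “harder task”: a model at chance on an N-way classification incurs a loss of ln N, so larger batches raise the starting loss and leave more room for informative gradients:

```python
import math

# Chance-level contrastive loss grows logarithmically with batch size
for n in [256, 4096, 32768]:
    print(f"batch {n:>6}: chance-level loss = ln({n}) ≈ {math.log(n):.2f}")
```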
The Dataset: 400 Million Pairs
OpenAI assembled WebImageText (WIT): 400 million image-text pairs scraped from the internet. The text isn’t carefully written labels — it’s alt text, captions, and surrounding text from web pages. Noisy, messy, and enormous. The open-source community later created LAION-400M and LAION-5B as alternatives.
A Minimal Training Loop
Here’s what training looks like in practice (simplified to fit on a screen):
```python
def train_clip(model, dataloader, epochs=10, lr=3e-4):
    """Minimal CLIP training loop."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    for epoch in range(epochs):
        total_loss = 0.0
        for images, token_ids in dataloader:
            # Forward pass: encode both modalities
            image_embeds, text_embeds, logit_scale = model(images, token_ids)
            # Compute symmetric contrastive loss
            loss = clip_loss(image_embeds, text_embeds, logit_scale)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Clamp logit_scale to prevent instability (after the update)
            with torch.no_grad():
                model.logit_scale.clamp_(max=torch.log(torch.tensor(100.0)))
            total_loss += loss.item()
        avg = total_loss / len(dataloader)
        tau = 1.0 / model.logit_scale.exp().item()
        print(f"Epoch {epoch+1}: loss={avg:.4f}, temperature={tau:.4f}")
```
Note the clamp_ call after each optimizer step — the logit scale is clamped to prevent the temperature from becoming too extreme (capping the scale at 100, equivalent to τ = 0.01). Without this, training can become unstable as the model tries to make the softmax infinitely sharp.
| Training Detail | CLIP (Full) | Our Toy Version |
|---|---|---|
| Dataset size | 400M image-text pairs | Hundreds to thousands |
| Batch size | 32,768 | 32–256 |
| Image encoder | ViT-L/14 (428M params) | ViT-Tiny (a few million) |
| Training hardware | 256 V100 GPUs, 12 days | One GPU, minutes |
| Augmentation | Random resized crop only | Same (surprisingly minimal) |
The data augmentation is surprisingly restrained — just random resized cropping. Unlike SimCLR, which relied on aggressive augmentations (color jitter, Gaussian blur, cropping), CLIP gets its diversity from the sheer variety of internet image-text pairs. When your dataset has 400 million examples, you don’t need to artificially manufacture variety.
Zero-Shot Classification: The Killer Application
Here’s where CLIP earns its place in history. Traditional image classifiers learn a fixed set of categories during training — they can only recognize what they were explicitly taught. CLIP can classify images into categories it has never seen, using nothing but their names in plain English.
The trick is beautifully simple:
- Write a text description for each possible class: "a photo of a dog", "a photo of a cat", etc.
- Encode each description through CLIP’s text encoder to get class embeddings.
- Encode the input image through CLIP’s image encoder.
- Pick the class whose text embedding has the highest cosine similarity to the image embedding.
```python
def zero_shot_classify(model, image, class_names, prompt="a photo of a {}"):
    """Classify an image into one of the given classes, zero-shot.

    No training on these classes required — just their names.
    """
    # Create text prompts for each class
    prompts = [prompt.format(name) for name in class_names]
    # Encode everything (tokenize() wraps your BPE tokenizer — see our tokenization post)
    with torch.no_grad():
        image_embed = model.encode_image(image.unsqueeze(0))  # [1, D]
        text_embeds = model.encode_text(tokenize(prompts))    # [C, D]
    # Cosine similarity (both are L2-normalized)
    similarities = (image_embed @ text_embeds.T).squeeze(0)   # [C]
    # Return the class with highest similarity
    best_idx = similarities.argmax().item()
    return class_names[best_idx], similarities
```
That’s it. Ten lines of meaningful code. No training, no labeled data, no fine-tuning. Just embed and compare.
The results are remarkable. CLIP’s ViT-L/14@336px achieves 76.2% top-1 accuracy on ImageNet — without seeing a single one of ImageNet’s 1.28 million labeled training images. That matches the original supervised ResNet-50, which was specifically trained on those 1.28 million labeled images. Zero-shot matching supervised performance was unthinkable before CLIP.
| Model | ImageNet Training Data Used | Top-1 Accuracy |
|---|---|---|
| ResNet-50 (supervised) | 1.28M labeled images | 76.1% |
| CLIP ViT-B/32 (zero-shot) | 0 (zero) | 63.2% |
| CLIP ViT-B/16 (zero-shot) | 0 (zero) | 68.3% |
| CLIP ViT-L/14 (zero-shot) | 0 (zero) | 75.3% |
| CLIP ViT-L/14@336 (zero-shot) | 0 (zero) | 76.2% |
Prompt Engineering for Vision
Here’s something surprising: the exact text you use for class descriptions matters a lot. This connects directly to our embeddings post — small changes in text produce different embedding vectors, and those differences compound into classification accuracy.
Using just the bare class name (“dog”) performs poorly. Why? Because CLIP was trained on web captions, not single-word labels. The text encoder has learned to expect sentence-like input. "a photo of a dog" immediately boosts accuracy because it matches the distribution of web captions.
The CLIP authors took this further. They created 80 different prompt templates for ImageNet classification:
```python
# A sampling of CLIP's 80 prompt templates for ImageNet
CLIP_TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a black and white photo of the {}.",
    "a low contrast photo of the {}.",
    "a bright photo of the {}.",
    "a cropped photo of a {}.",
    "a close-up photo of a {}.",
    "a photo of a large {}.",
    "a photo of a small {}.",
    "a drawing of a {}.",
    "a painting of a {}.",
    "a sculpture of a {}.",
    "a rendering of a {}.",
    "a cartoon {}.",
    "art of a {}.",
    "a pixelated photo of a {}.",
    "a photo of the dirty {}.",
    "a photo of the clean {}.",
    "a tattoo of a {}.",
    "the origami {}.",
    "a plushie {}.",
    "a toy {}.",
    "itap of a {}.",  # "I took a picture of a..."
    "a {} in a video game.",
    "graffiti of a {}.",
]
```
The trick is called prompt ensembling: for each class, you generate embeddings from all 80 templates and average them. This smooths out the noise from any single phrasing:
```python
def build_ensemble_classifier(model, class_names, templates):
    """Build zero-shot classifier with prompt ensembling.

    Average embeddings across multiple prompt templates per class.
    Gains ~3.5% accuracy on ImageNet vs. a single template.
    """
    ensemble_weights = []
    with torch.no_grad():
        for class_name in class_names:
            # Embed all templates for this class
            prompts = [t.format(class_name) for t in templates]
            embeddings = model.encode_text(tokenize(prompts))  # [T, D]
            # Average and re-normalize
            class_embedding = embeddings.mean(dim=0)
            class_embedding = F.normalize(class_embedding, dim=0)
            ensemble_weights.append(class_embedding)
    # Stack into a [num_classes, D] matrix for fast classification
    return torch.stack(ensemble_weights)


def classify_with_ensemble(model, image, ensemble_weights, class_names):
    """Classify using pre-computed ensemble weights."""
    with torch.no_grad():
        image_embed = model.encode_image(image.unsqueeze(0))          # [1, D]
        similarities = (image_embed @ ensemble_weights.T).squeeze(0)  # [C]
    best_idx = similarities.argmax().item()
    return class_names[best_idx], similarities[best_idx].item()
```
This ensemble approach boosted ImageNet accuracy by about 3.5 percentage points. The reason is intuitive: averaging across “a photo of a dog”, “a painting of a dog”, and “a plushie dog” produces a more robust “dog-ness” vector than any single description.
Prompt engineering isn’t just for chatbots. In CLIP, the right prompt template can swing classification accuracy by 5+ percentage points. The words you choose shape the embedding space you search through.
Beyond Classification: Search, Similarity, and Stable Diffusion
Zero-shot classification is just the beginning. Because CLIP maps images and text into the same vector space, any operation that works on embeddings now works across modalities.
Text-to-Image Search
Embed a text query. Find the nearest images by cosine similarity. That’s it — you’ve built Google Image Search:
```python
def text_to_image_search(model, query, image_embeds, images, top_k=5):
    """Search images using a text query."""
    query_embed = model.encode_text(tokenize([query]))  # [1, D]
    scores = (query_embed @ image_embeds.T).squeeze(0)  # [num_images]
    top_indices = scores.topk(top_k).indices
    return [(images[i], scores[i].item()) for i in top_indices]

# Usage: text_to_image_search(model, "sunset over the ocean", db_embeds, db_images)
```
Image-to-Image Search
Embed an image. Find similar images via the shared space — even though no text was involved, the images cluster by semantic meaning:
```python
def image_to_image_search(model, query_image, image_embeds, images, top_k=5):
    """Find visually similar images using the shared embedding space."""
    query_embed = model.encode_image(query_image.unsqueeze(0))
    scores = (query_embed @ image_embeds.T).squeeze(0)
    top_indices = scores.topk(top_k).indices
    return [(images[i], scores[i].item()) for i in top_indices]
```
This ties directly to our vector search benchmarks post — searching CLIP embeddings with FAISS or pgvector is exactly how production image search works. And our RAG post’s retrieval pipeline? Replace text embeddings with CLIP embeddings and you get multimodal RAG that retrieves images alongside documents.
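Under the hood, a flat inner-product index does nothing more exotic than the brute-force search below; FAISS’s IndexFlatIP computes the same thing with optimized kernels, and its other index types add quantization and approximate search on top. The embeddings here are random stand-ins for precomputed CLIP vectors:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
db = F.normalize(torch.randn(10_000, 512), dim=-1)  # stand-in image embeddings
query = F.normalize(torch.randn(1, 512), dim=-1)    # stand-in text-query embedding

# Inner product on unit vectors == cosine similarity; take the top 5
scores = (query @ db.T).squeeze(0)
top = scores.topk(5)
print(top.indices.tolist())
print([round(v, 3) for v in top.values.tolist()])
```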
How Stable Diffusion Uses CLIP
When you type “a cyberpunk cityscape at night” into Stable Diffusion, the first thing that happens is your prompt passes through a frozen CLIP text encoder (specifically ViT-L/14’s text transformer). The resulting embedding sequence — not the pooled vector, but the full 77-token sequence of 768-dimensional vectors — is fed into the diffusion U-Net via cross-attention layers. CLIP gives Stable Diffusion its understanding of language.
This is why prompt engineering works for image generation: different wordings produce different CLIP embeddings, which steer the diffusion process differently. “A dog” and “a professional photograph of a golden retriever, detailed, 8K” land in very different regions of CLIP’s text embedding space.
Try It: CLIP Embedding Explorer
The Shared Embedding Space
This visualization shows how CLIP maps images and text into the same 2D space: matching image-text pairs cluster together, and a typed description lands near the images it describes.
Similarity Heatmap
The 8×8 similarity matrix between sample images and texts. A bright diagonal means matching pairs have high similarity.
Where CLIP Fits in the Series
CLIP is a convergence point — it ties together nearly every concept we’ve built in the elementary series. Here’s how the ideas connect:
- Contrastive Learning from Scratch — CLIP’s training objective is InfoNCE applied across modalities. We built the same loss for images alone; CLIP extends it to image-text pairs.
- Vision Transformers from Scratch — CLIP’s image encoder is literally a ViT. Same patches, same [CLS] token, same transformer layers.
- Embeddings from Scratch — We went from word embeddings (words as vectors) to multimodal embeddings (images AND text as vectors in the same space). CLIP is the natural conclusion.
- Tokenization from Scratch — CLIP’s text encoder uses BPE tokenization, the same algorithm we built byte by byte.
- Softmax & Temperature from Scratch — The learned temperature parameter in CLIP’s loss controls how sharp the softmax distribution is. Same concept, but now the model discovers its own optimal τ.
- Loss Functions from Scratch — CLIP uses symmetric cross-entropy — the same loss we derived from maximum likelihood, applied to the N×N similarity matrix.
- Attention from Scratch — Both encoders are transformers. Self-attention in the image encoder lets patches attend to each other; masked self-attention in the text encoder processes tokens sequentially.
The Bridge Between Seeing and Reading
CLIP demonstrated something profound: with enough data and the right objective, you can teach two completely different neural networks to agree on what the world looks like. An image of a sunset and the words “a beautiful sunset over the ocean” converge to the same point in a 512-dimensional space — not because anyone labeled that pair, but because the model learned the correspondence from 400 million examples of how humans describe what they see.
The shared embedding space idea has become foundational. SigLIP simplified the loss. OpenCLIP reproduced the results openly. ALIGN scaled to a billion pairs. LLaVA plugged CLIP’s image encoder into a language model to build multimodal AI assistants. Every one of these builds on the core insight: contrastive pre-training on naturally occurring image-text pairs produces representations that transfer to virtually any visual task.
We’ve now connected vision and language — the two most important modalities in AI. From tokenizing text to patchifying images, from word vectors to contrastive objectives, the entire elementary series has been building toward this moment: a single model that reads and sees in the same breath.
References & Further Reading
- Radford et al. — “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021) — The original CLIP paper from OpenAI. Clean writing, excellent pseudocode.
- OpenAI — CLIP Blog Post — Accessible overview with interactive demos and key results.
- OpenAI — CLIP GitHub Repository — Official code including the model, tokenizer, and evaluation scripts.
- OpenCLIP — Open-Source CLIP Reproduction — Community-built open-source implementation trained on LAION datasets.
- Dosovitskiy et al. — “An Image is Worth 16x16 Words” (ViT, 2020) — The Vision Transformer that powers CLIP’s image encoder.
- Lilian Weng — “Contrastive Representation Learning” (2021) — Comprehensive survey connecting SimCLR, MoCo, CLIP, and more.
- LAION — LAION-400M Open Dataset — The open-source alternative to CLIP’s proprietary training data.
- Zhai et al. — “Sigmoid Loss for Language Image Pre-Training” (SigLIP, 2023) — Simplifies CLIP’s loss by replacing softmax with sigmoid, removing the need for large batches.