Self-Supervised Learning from Scratch: Creating Labels from Nothing
The Label Bottleneck — Why Self-Supervision Matters
ImageNet took 25,000 workers two and a half years to label 14 million images. GPT-4 was trained on trillions of tokens that nobody labeled at all. The revolution in modern AI isn’t about bigger models or faster GPUs — it’s the discovery that you can create your own training signal from raw data, and those self-generated labels teach richer representations than human annotations ever could.
This idea has a name: self-supervised learning. And it has quietly become the dominant training paradigm in AI.
Consider three ways to learn from data. Supervised learning requires expensive, hand-crafted labels — a radiologist marking tumors, a linguist annotating parts of speech. Unsupervised learning finds patterns without guidance, but the signal is weak and unfocused. Self-supervised learning splits the difference: it generates pseudo-labels automatically from the structure of the data itself, creating a rich supervisory signal for free.
The core idea is beautifully simple: hide part of the input, then train the model to predict what’s missing. You can’t predict a missing word without understanding grammar and semantics. You can’t reconstruct a missing image patch without understanding objects and scenes. The task of prediction becomes a proxy for understanding.
The timeline of this revolution is remarkably recent: word2vec (2013) showed that predicting context words learns useful word embeddings; BERT (2018) demonstrated that masked language modeling creates powerful text representations; SimCLR (2020) proved that comparing augmented views works for images; and MAE (2022) showed that masking 75% of image patches forces even deeper visual understanding.
Let’s build each of these approaches from scratch.
Masked Language Modeling: The Idea Behind BERT
The simplest self-supervised task is masked language modeling. Take a sentence, randomly mask 15% of the tokens, and train the model to predict them. BERT uses a clever 80/10/10 strategy: 80% of masked tokens become a [MASK] token, 10% become a random word, and 10% stay unchanged. This prevents the model from only learning to look for [MASK] markers — it must stay vigilant about every position.
import numpy as np
# Build vocabulary from a small corpus
corpus = [
"the cat sat on the mat",
"the dog played in the park",
"a bird flew over the tree",
"the fish swam in the pond"
]
words = sorted(set(w for s in corpus for w in s.split()))
word2idx = {w: i + 3 for i, w in enumerate(words)} # 0=PAD, 1=MASK, 2=UNK
idx2word = {i: w for w, i in word2idx.items()}
idx2word[1] = "[MASK]"
vocab_size = len(word2idx) + 3
# Tokenize
sentences = [[word2idx[w] for w in s.split()] for s in corpus]
# BERT-style masking: 15% of tokens, with 80/10/10 strategy
def mask_tokens(tokens, mask_prob=0.15):
    masked = tokens.copy()
    labels = [-1] * len(tokens)  # -1 means "ignore in loss"
    for i in range(len(tokens)):
        if np.random.random() < mask_prob:
            labels[i] = tokens[i]  # remember true token
            r = np.random.random()
            if r < 0.8:
                masked[i] = 1  # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = np.random.randint(3, vocab_size)  # 10%: random
            # remaining 10%: keep original (no change)
    return masked, labels
# Simple encoder: embedding lookup + linear prediction head
embed_dim = 16
np.random.seed(42)
W_embed = np.random.randn(vocab_size, embed_dim) * 0.1
W_predict = np.random.randn(embed_dim, vocab_size) * 0.1
# Forward pass on "the cat sat on the mat"
sentence = sentences[0]
masked, labels = mask_tokens(sentence, mask_prob=0.5) # high rate for demo
embeddings = W_embed[masked] # (seq_len, embed_dim)
logits = embeddings @ W_predict # (seq_len, vocab_size)
# Cross-entropy loss ONLY on masked positions
loss = 0.0
n_masked = 0
for i, label in enumerate(labels):
    if label != -1:
        exp_logits = np.exp(logits[i] - logits[i].max())
        probs = exp_logits / exp_logits.sum()
        loss -= np.log(probs[label] + 1e-10)
        n_masked += 1
        predicted = idx2word.get(np.argmax(logits[i]), "?")
        actual = idx2word.get(label, "?")
        print(f"Position {i}: true='{actual}', predicted='{predicted}'")
if n_masked:
    print(f"\nAvg loss over {n_masked} masked tokens: {loss / n_masked:.3f}")
print(f"Random baseline loss: {np.log(vocab_size):.3f}")
With an untrained model, predictions are essentially random — the loss starts high (around log(vocab_size)). As training progresses, the model learns that “the cat sat ___ the mat” probably fills in “on” — not from a human-provided label, but from the statistical structure of language itself. This is the essence of self-supervised learning: the data labels itself.
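The forward pass above never updates the weights. To close the loop, here is a minimal training sketch (my own toy, not BERT: a context-free model with just an embedding table and a linear head, so it can only learn which words tend to appear at masked positions). Even that crude signal pulls the loss below the log(vocab_size) baseline:

```python
import numpy as np

np.random.seed(0)
corpus = ["the cat sat on the mat", "the dog played in the park",
          "a bird flew over the tree", "the fish swam in the pond"]
words = sorted(set(w for s in corpus for w in s.split()))
word2idx = {w: i + 3 for i, w in enumerate(words)}  # 0=PAD, 1=MASK, 2=UNK
vocab_size = len(word2idx) + 3
sentences = [[word2idx[w] for w in s.split()] for s in corpus]

embed_dim = 16
W_embed = np.random.randn(vocab_size, embed_dim) * 0.1
W_predict = np.random.randn(embed_dim, vocab_size) * 0.1
lr = 0.5

losses = []
for epoch in range(600):
    total = 0.0
    for sent in sentences:
        i = np.random.randint(len(sent))   # mask one random position
        target = sent[i]
        masked = list(sent)
        masked[i] = 1                      # [MASK] token id
        x = W_embed[masked]                # (seq_len, embed_dim)
        logits = x[i] @ W_predict          # (vocab_size,) prediction at masked slot
        z = logits - logits.max()
        probs = np.exp(z) / np.exp(z).sum()
        total -= np.log(probs[target] + 1e-10)
        # Manual cross-entropy gradients at the masked position
        g = probs.copy()
        g[target] -= 1.0                   # dL/dlogits
        W_embed[1] -= lr * (W_predict @ g)      # update the [MASK] embedding
        W_predict -= lr * np.outer(x[i], g)     # update the prediction head
    losses.append(total / len(sentences))

print(f"baseline log(vocab): {np.log(vocab_size):.3f}")
print(f"first-epoch loss:    {losses[0]:.3f}")
print(f"final loss (avg):    {np.mean(losses[-50:]):.3f}")
```

With no attention or context mixing, the best this toy can do is match the corpus word frequencies at masked positions; plugging in a real encoder is what lets masked prediction exploit context.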
Masked Image Modeling — BERT for Pixels
What if we applied the same idea to images? That’s exactly what Masked Autoencoders (MAE), introduced by He et al. in 2022, accomplish. Divide an image into a grid of patches, mask out a large fraction, and train the model to reconstruct the missing patches from the visible ones.
But here’s a critical insight: images are far more redundant than text. If you mask only 15% of image patches (like BERT does with words), the model can cheat by interpolating from neighbors — it never needs to understand what’s actually in the image. MAE solves this by masking a staggering 75% of patches. At this level, trivial interpolation fails. The model must develop genuine understanding of objects, textures, and spatial relationships.
Another elegant design choice: the encoder only processes the visible 25% of patches, cutting compute by 4x. A smaller decoder takes the encoded visible patches plus learnable placeholder tokens and reconstructs the masked patches. The loss is computed only on masked positions — the model is judged solely on what it couldn’t see.
import numpy as np
# Simulate a 4x4 "image" as 16 patches (each patch = 8-dim feature vector)
np.random.seed(42)
patches = np.random.randn(16, 8)
# Add spatial structure: top patches = "sky", bottom patches = "ground"
for i in range(8):  # rows 0-1 (top half)
    patches[i] += 2.0
for i in range(8, 16):  # rows 2-3 (bottom half)
    patches[i] -= 2.0
def masked_autoencoder(patches, mask_ratio=0.75):
    n = len(patches)
    n_mask = int(n * mask_ratio)
    # Random shuffle to split into masked and visible
    perm = np.random.permutation(n)
    mask_idx = perm[:n_mask]
    vis_idx = perm[n_mask:]
    # KEY INSIGHT 1: Encoder only processes VISIBLE patches (saves compute!)
    visible = patches[vis_idx]
    W_enc = np.random.randn(8, 16) * 0.1
    encoded = np.tanh(visible @ W_enc)  # (n_visible, 16)
    # Decoder: reconstruct ALL patches from encoded visible ones
    mask_token = np.zeros(16)  # learnable placeholder for masked positions
    full_seq = np.zeros((n, 16))
    full_seq[vis_idx] = encoded
    full_seq[mask_idx] = mask_token
    W_dec = np.random.randn(16, 8) * 0.1
    reconstructed = full_seq @ W_dec  # (n, 8)
    # KEY INSIGHT 2: Loss only on MASKED patches
    mse = np.mean((reconstructed[mask_idx] - patches[mask_idx]) ** 2)
    return mse, len(vis_idx), len(mask_idx)
# Compare reconstruction difficulty at different masking ratios
for ratio in [0.25, 0.50, 0.75, 0.90]:
    results = [masked_autoencoder(patches, ratio) for _ in range(10)]
    avg_mse = np.mean([r[0] for r in results])
    n_vis, n_mask = results[0][1], results[0][2]
    print(f"Mask {ratio:.0%}: see {n_vis} patches, reconstruct {n_mask} -> MSE={avg_mse:.2f}")
A caveat about this demo: because the model is untrained and masked positions are reconstructed from a zero placeholder, the printed MSE mostly reflects the variance of the hidden patches, not task difficulty. Its job is to show the mechanics: the encoder sees only the visible subset, and the loss touches only the masked one. In a trained MAE, reconstruction error genuinely climbs with the masking ratio. At 25% masking, reconstruction is trivially easy, with plenty of neighbors to copy from. At 75% (MAE's default), the task is genuinely hard. At 90%, it's nearly impossible even for a trained model. The sweet spot of 75% forces the model to learn semantic features: understanding that sky continues as sky, that object boundaries are coherent, that textures have patterns. This connection between task difficulty and representation quality is a recurring theme in self-supervised learning.
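The interpolation argument can be checked directly with a toy baseline (my own construction, not part of MAE): reconstruct every masked patch by copying its spatially nearest visible patch. When few patches are masked, a same-region neighbor is almost always available; as the ratio grows, the copy increasingly comes from the wrong region and the error climbs:

```python
import numpy as np

np.random.seed(0)
# Toy layout as before: a 4x4 grid of patches, bright top half, dark bottom half
patches = np.random.randn(16, 8) * 0.3
patches[:8] += 2.0
patches[8:] -= 2.0
coords = np.array([(i // 4, i % 4) for i in range(16)])  # (row, col) per patch

def nn_interpolation_mse(mask_ratio, trials=200):
    """Reconstruct each masked patch by copying its nearest visible patch."""
    errs = []
    for _ in range(trials):
        perm = np.random.permutation(16)
        n_mask = int(16 * mask_ratio)
        mask_idx, vis_idx = perm[:n_mask], perm[n_mask:]
        per_patch = []
        for m in mask_idx:
            d = np.abs(coords[vis_idx] - coords[m]).sum(axis=1)  # Manhattan distance
            nearest = vis_idx[np.argmin(d)]
            per_patch.append(np.mean((patches[nearest] - patches[m]) ** 2))
        errs.append(np.mean(per_patch))
    return float(np.mean(errs))

errors = {r: nn_interpolation_mse(r) for r in [0.25, 0.50, 0.75, 0.90]}
for r, e in errors.items():
    print(f"mask {r:.0%}: nearest-neighbor copy MSE = {e:.2f}")
```

At low ratios the copied neighbor shares the masked patch's region, so the error stays near the noise floor; at high ratios the only visible patches are often in the wrong half of the image, and copying fails badly.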
Try It: Masking Strategy Explorer
Drag the slider to change the masking ratio. Masked patches turn gray. Hit “Reconstruct” to fill them in using nearest-neighbor interpolation from visible patches. Watch how quality drops as masking increases.
Contrastive Self-Supervised Learning — Learning by Comparison
Masked modeling learns by predicting missing content. Contrastive learning takes a completely different approach: it learns by comparing. Take a single image, create two augmented views (different crops, color jitter, blurs), and train the model to recognize that these two views came from the same source — while pushing apart views from different images.
SimCLR, introduced by Chen et al. in 2020, formalized this with the NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss. For each image in a batch, create two augmented views — forming a “positive pair.” Every other image’s views serve as “negative pairs.” The loss pulls positive pairs together in embedding space and pushes negative pairs apart.
Temperature is the key hyperparameter. Low temperature (τ = 0.1) makes the model hypersensitive to similar-but-different examples. High temperature (τ = 2.0) treats everything more uniformly. SimCLR uses τ = 0.5 — a balance between discrimination and stability. For a deeper dive into SimCLR, CLIP, and the full contrastive story, see our contrastive learning from scratch post. Here, we’ll focus on the loss function itself.
import numpy as np
def nt_xent_loss(z, temperature=0.5):
    """NT-Xent: Normalized Temperature-scaled Cross-Entropy Loss.
    z: (2N, dim) where z[2k] and z[2k+1] are a positive pair."""
    N = len(z) // 2
    # L2 normalize embeddings onto the unit hypersphere
    z_norm = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    # Full cosine similarity matrix, scaled by temperature
    sim = z_norm @ z_norm.T / temperature  # (2N, 2N)
    # Numerical stability
    sim -= np.max(sim, axis=1, keepdims=True)
    exp_sim = np.exp(sim)
    # Mask out self-similarity (diagonal)
    mask = ~np.eye(2 * N, dtype=bool)
    loss = 0.0
    for i in range(2 * N):
        # Positive partner: (2k, 2k+1) are paired
        j = i + 1 if i % 2 == 0 else i - 1
        numerator = exp_sim[i, j]
        denominator = (exp_sim[i] * mask[i]).sum()
        loss -= np.log(numerator / denominator + 1e-10)
    return loss / (2 * N)
# Create a batch: 4 images, each with 2 augmented views
np.random.seed(42)
dim = 32
batch = []
for _ in range(4):
    anchor = np.random.randn(dim)
    positive = anchor + np.random.randn(dim) * 0.1  # similar view
    batch.extend([anchor, positive])
z = np.array(batch)
# How does temperature affect the loss?
for temp in [0.1, 0.5, 1.0, 2.0]:
    loss = nt_xent_loss(z, temperature=temp)
    print(f"Temperature {temp:.1f}: loss = {loss:.4f}")
In this demo the two views of each image are nearly identical, so positives tower over the random negatives: at low temperature the sharpened softmax separates them almost perfectly and the loss falls toward zero, while at high temperature the similarity gap is flattened and the loss rises toward the uniform baseline of log(2N − 1), about 1.95 for this batch of 8 views. Real training is subtler: augmentations are aggressive, hard negatives are common, and a low temperature concentrates the penalty on the most confusable examples, which sharpens the learned features but can destabilize optimization. The art of contrastive learning is choosing a temperature that makes the task hard enough to learn useful features but not so hard that training becomes unstable.
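To see where that sensitivity comes from, here is a hand-built single-anchor example (the similarity values are made up for illustration): one positive, one hard negative, and five easy negatives. Watch how much softmax mass, and hence gradient, the hard negative receives as temperature drops:

```python
import numpy as np

# Illustrative cosine similarities for one anchor: the positive at index 0,
# one hard negative (0.8), and five easy negatives (0.0)
sims = np.array([0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])

results = {}
for temp in [0.1, 0.5, 2.0]:
    e = np.exp(sims / temp)
    loss = -np.log(e[0] / e.sum())
    hard_weight = e[1] / e[2]  # softmax weight of the hard vs an easy negative
    results[temp] = (loss, hard_weight)
    print(f"temp {temp}: loss = {loss:.3f}, "
          f"hard negative weighted {hard_weight:.1f}x an easy one")
```

At temperature 0.1 the hard negative receives thousands of times the weight of an easy one, so nearly all of the gradient goes into separating the anchor from its closest confuser; at temperature 2.0 the negatives are treated almost interchangeably.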
Non-Contrastive Methods — BYOL and the Collapse Problem
Contrastive methods have a practical problem: they need negative pairs. Lots of them. SimCLR requires batch sizes of 4,096+ to work well because each batch provides the negative examples that prevent trivial solutions. What if you could learn without any negatives at all?
BYOL (Bootstrap Your Own Latent), introduced by Grill et al. in 2020, does exactly that. It uses two networks: an online network and a target network. The online network processes one augmented view and tries to predict the target network’s output on a different view. The target network is an exponential moving average (EMA) of the online network — it slowly and smoothly tracks the online weights.
The architecture has a deliberate asymmetry: the online network has an extra predictor MLP that the target network lacks. Combined with stop-gradient (no backpropagation through the target) and EMA updates, this asymmetry prevents representational collapse.
Why would everything collapse without safeguards? The model is trained to make two outputs similar, but there are no negatives pushing things apart. The trivial solution: output the same constant vector for every input. Loss drops to zero, but the model has learned nothing. Three mechanisms prevent this:
- Predictor MLP — The extra layer means the two networks compute different functions. A constant output can’t optimally satisfy both asymmetric paths.
- Stop-gradient — The target doesn’t receive gradients, so it can’t chase the online network toward collapse.
- EMA update — The target changes slowly, providing a stable reference that the online network must genuinely match.
Remove any single component and collapse occurs rapidly.
import numpy as np
class BYOL:
    """Bootstrap Your Own Latent -- learns without negative pairs."""

    def __init__(self, in_dim=8, hid=16, proj=8):
        # Online network: encoder -> projector -> predictor
        self.W_enc = np.random.randn(in_dim, hid) * 0.3
        self.W_proj = np.random.randn(hid, proj) * 0.3
        self.W_pred = np.random.randn(proj, proj) * 0.3  # only online has this!
        # Target network: encoder -> projector (NO predictor)
        self.W_enc_t = self.W_enc.copy()
        self.W_proj_t = self.W_proj.copy()

    def online(self, x):
        """Full online path: encode -> project -> predict."""
        h = np.tanh(x @ self.W_enc)
        z = np.tanh(h @ self.W_proj)
        return np.tanh(z @ self.W_pred)

    def target(self, x):
        """Target path: encode -> project (no predictor, no gradients)."""
        h_t = np.tanh(x @ self.W_enc_t)
        return np.tanh(h_t @ self.W_proj_t)

    def compute_loss(self, view1, view2):
        """Cosine distance between online(view1) and target(view2)."""
        p = self.online(view1)
        z = self.target(view2)  # stop-gradient: target not updated by loss
        p_n = p / (np.linalg.norm(p, axis=-1, keepdims=True) + 1e-8)
        z_n = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
        return 2 - 2 * np.mean(np.sum(p_n * z_n, axis=-1))

    def ema_update(self, tau=0.996):
        """Target slowly tracks online weights."""
        self.W_enc_t = tau * self.W_enc_t + (1 - tau) * self.W_enc
        self.W_proj_t = tau * self.W_proj_t + (1 - tau) * self.W_proj
# Demonstration
np.random.seed(42)
data = np.random.randn(50, 8)
model = BYOL()
# Compute BYOL loss on augmented views
view1 = data[:16] + np.random.randn(16, 8) * 0.1
view2 = data[:16] + np.random.randn(16, 8) * 0.1
loss = model.compute_loss(view1, view2)
model.ema_update(tau=0.996)
# Check representation diversity (healthy model = varied outputs)
reps = model.online(data)
print(f"BYOL loss: {loss:.4f}")
print(f"Output diversity (std): {np.std(reps):.4f}")
print(f"Output range: [{np.min(reps):.3f}, {np.max(reps):.3f}]")
# What does COLLAPSE look like?
collapsed = np.full_like(reps, 0.5) # every input -> same vector
print(f"\nCollapsed diversity: {np.std(collapsed):.4f} (all identical!)")
print("\nThree things prevent collapse:")
print(" 1. Predictor MLP adds asymmetry between networks")
print(" 2. Stop-gradient keeps target from chasing online")
print(" 3. EMA updates keep target slowly drifting, staying informative")
print("Remove ANY one and representations collapse to a constant.")
The diversity metric is the simplest diagnostic. A healthy model produces varied outputs — different inputs yield different representations with high standard deviation. A collapsed model maps everything to the same point: diversity drops to zero. Monitoring representation diversity during training is the fastest way to detect that something has gone wrong.
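Collapse is easy to reproduce once the safeguards are gone. The sketch below (a deliberately broken toy, not BYOL's actual training loop) uses one shared linear map for both views, with no predictor, no stop-gradient, and no EMA target, and simply minimizes the squared distance between the two views' outputs; the diversity metric decays toward zero:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(200, 8)  # "dataset"

# Shared weights for both views -- every anti-collapse mechanism removed
W = np.random.randn(8, 4) * 0.5
lr = 1.0
diversity = []
for step in range(500):
    v1 = X + np.random.randn(*X.shape) * 0.1  # two augmented views
    v2 = X + np.random.randn(*X.shape) * 0.1
    diff = (v1 - v2) @ W                         # loss = mean(diff ** 2)
    grad = (v1 - v2).T @ diff * (2 / diff.size)  # dLoss/dW
    W -= lr * grad                               # gradients flow into BOTH paths
    diversity.append(np.std(X @ W))

print(f"output std at step 1:   {diversity[0]:.4f}")
print(f"output std at step 500: {diversity[-1]:.6f}")
```

Nothing in this objective cares what the representations are, only that the two views agree, so gradient descent shrinks the weights until every input maps to (nearly) the same point. The predictor, stop-gradient, and EMA exist precisely to break this failure mode.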
Vision Transformers Meet Self-Supervision — DINO
DINO (Distillation with NO labels), introduced by Caron et al. in 2021, combines BYOL-style self-distillation with Vision Transformers to produce something remarkable. Like BYOL, it uses a student-teacher framework with EMA updates. But DINO adds two powerful innovations.
First, multi-crop strategy. Instead of just two views, DINO creates multiple crops at different scales: two “global” crops covering over half the image, and several small “local” crops covering less than half. The teacher only sees global crops. The student sees all crops — and must learn to predict the teacher’s global-view representation from its limited local view. This forces the student to understand global context from local information.
Second, centering. To prevent collapse without negative pairs, DINO subtracts a running mean of the teacher’s outputs (the “center”). This prevents any single output dimension from dominating. Combined with sharpening (applying a very low temperature to the teacher’s softmax), this keeps the output distribution balanced and diverse.
The remarkable payoff: DINO’s attention maps spontaneously segment objects without ever seeing a single pixel-level annotation. The model learns “this is an object” purely from the self-supervised objective — one of the most striking demonstrations that SSL features capture genuine visual understanding.
import numpy as np
def softmax(x, temp=1.0):
    """Temperature-scaled softmax."""
    e = np.exp((x - np.max(x, axis=-1, keepdims=True)) / temp)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, temp_s=0.1, temp_t=0.04):
    """DINO loss: cross-entropy between sharpened teacher and student."""
    # Teacher: sharpen via low temperature + subtract center
    t_probs = softmax(teacher_out - center, temp=temp_t)
    # Student: higher temperature (softer distribution)
    s_log_probs = np.log(softmax(student_out, temp=temp_s) + 1e-10)
    # Cross-entropy: teacher distribution is the "label"
    return -np.mean(np.sum(t_probs * s_log_probs, axis=-1))
# Setup: 4 images, each gets multiple crops at different scales
np.random.seed(42)
dim = 16
batch_size = 4
n_global = 2 # teacher processes global crops (large image regions)
n_local = 4 # student processes all crops including small local ones
# Simulate encoder outputs (normally these come from a ViT)
student_global = np.random.randn(batch_size, n_global, dim) * 0.5
student_local = np.random.randn(batch_size, n_local, dim) * 0.5
teacher_global = student_global + np.random.randn(batch_size, n_global, dim) * 0.02
# Center: running mean of teacher outputs (prevents mode collapse)
center = np.zeros(dim)
center_ema = 0.9
# DINO objective: every student crop predicts every teacher global crop
total_loss = 0.0
n_pairs = 0
for img in range(batch_size):
    for t in range(n_global):
        teacher_out = teacher_global[img, t]
        # Student global crops (skip matching same crop index)
        for s in range(n_global):
            if s == t:
                continue
            total_loss += dino_loss(student_global[img, s], teacher_out, center)
            n_pairs += 1
        # Student local crops (always included)
        for s in range(n_local):
            total_loss += dino_loss(student_local[img, s], teacher_out, center)
            n_pairs += 1
# Update center with EMA
batch_center = teacher_global.reshape(-1, dim).mean(axis=0)
center = center_ema * center + (1 - center_ema) * batch_center
avg_loss = total_loss / n_pairs
t_sharp = softmax(teacher_global[0, 0], temp=0.04)
t_entropy = -np.sum(t_sharp * np.log(t_sharp + 1e-10))
print(f"DINO loss: {avg_loss:.4f}")
print(f"Loss pairs per image: {n_pairs // batch_size}")
print(f" (= {n_global}x{n_global-1} global-global + {n_global}x{n_local} global-local)")
print(f"Teacher sharpness (entropy): {t_entropy:.3f} (lower = sharper)")
print(f"Center norm: {np.linalg.norm(center):.4f}")
Notice the multi-crop loss creates dense supervision from each image. With 2 global and 4 local crops, each image generates 2×1 + 2×4 = 10 loss pairs (each teacher global crop is matched against every other student crop). This rich within-image signal is one reason DINO trains efficiently despite using no negative pairs across images. The low teacher temperature (0.04 vs student’s 0.1) creates sharp pseudo-labels, and the centering keeps the teacher from collapsing into a single dominant mode.
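Centering is worth isolating. The toy below (hypothetical numbers, not DINO's real running statistics) builds a teacher output where one logit dimension has started to dominate, then applies the sharpened softmax with and without subtracting a center that has absorbed that bias; only the centered version keeps the pseudo-labels from degenerating into a near-one-hot distribution:

```python
import numpy as np

def softmax(x, temp):
    e = np.exp((x - x.max()) / temp)
    return e / e.sum()

np.random.seed(0)
# Teacher logits where dimension 3 has begun to dominate (pre-collapse)
logits = np.random.randn(16) * 0.1
logits[3] += 3.0
# Pretend the EMA center has caught up with the dominant dimension
center = np.zeros(16)
center[3] = 3.2

entropies = {}
for name, c in [("no centering", np.zeros(16)), ("with centering", center)]:
    p = softmax(logits - c, temp=0.04)  # DINO's sharp teacher temperature
    entropies[name] = -np.sum(p * np.log(p + 1e-10))
    print(f"{name:>15}: max prob = {p.max():.3f}, entropy = {entropies[name]:.3f}")
```

Without centering, the sharp temperature turns the biased logits into a one-hot target with near-zero entropy, exactly the mode collapse DINO guards against; with the bias subtracted, the sharpened distribution stays spread over many dimensions.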
Try It: Contrastive vs Non-Contrastive Embedding Space
Watch how different SSL objectives organize an embedding space. SimCLR uses attraction + repulsion. BYOL uses only attraction. Toggle “Disable EMA” to see collapse.
The Unified View — Why All Roads Lead to Representations
Step back and look at the big picture. Every self-supervised method we’ve explored follows the same recipe:
- Design a pretext task — mask tokens, mask patches, compare views, predict representations
- Train the model — using the data’s own structure as the supervisory signal
- Transfer the learned features — the encoder has learned representations useful far beyond the pretext task
Masked modeling learns by predicting missing content, developing contextual understanding. Contrastive methods learn by comparing views, developing invariance to irrelevant transformations. Non-contrastive methods learn by self-distillation, developing stable representations without requiring negative pairs. Different paths, same destination: features that capture the semantic structure of data.
What makes a good pretext task? It must be hard enough to require semantic understanding (this is why MAE uses 75% masking, not 15%) but tractable enough for gradient-based optimization. It must be general — learning to reconstruct generic images forces general visual features, not task-specific shortcuts.
The quality of learned features is measured by two standard evaluation protocols. Linear probing freezes the encoder and trains only a linear classifier on top — if a linear layer can separate classes, the features must already be well-organized. k-NN evaluation is even simpler: classify each test point by majority vote of its nearest neighbors in feature space, no training needed at all. Good SSL features naturally cluster by semantic class, making k-NN surprisingly effective.
import numpy as np
# Synthetic data: 3 classes with clear cluster structure in 2D
np.random.seed(42)
n_per_class = 50
centers = np.array([[2, 2], [-2, 2], [0, -2.5]])
X = np.vstack([c + np.random.randn(n_per_class, 2) * 0.6 for c in centers])
y = np.repeat([0, 1, 2], n_per_class)
# Simulate a self-supervised encoder: 2D input -> 8D representations
W_enc = np.random.randn(2, 8) * 0.5
b_enc = np.random.randn(8) * 0.1
features = np.tanh(X @ W_enc + b_enc)
# Train/test split
perm = np.random.permutation(150)
X_tr, X_te = features[perm[:100]], features[perm[100:]]
y_tr, y_te = y[perm[:100]], y[perm[100:]]
# METHOD 1: Linear Probing -- freeze encoder, train linear head
def linear_probe(X_tr, y_tr, X_te, y_te, n_cls=3, lr=0.1, epochs=200):
    W = np.zeros((X_tr.shape[1], n_cls))
    for _ in range(epochs):
        logits = X_tr @ W
        exp_l = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = exp_l / exp_l.sum(axis=1, keepdims=True)
        grad = X_tr.T @ (probs - np.eye(n_cls)[y_tr]) / len(y_tr)
        W -= lr * grad
    return np.mean(np.argmax(X_te @ W, axis=1) == y_te)
# METHOD 2: k-NN -- classify by majority vote of nearest neighbors
def knn_eval(X_tr, y_tr, X_te, y_te, k=5):
    correct = 0
    for i in range(len(X_te)):
        dists = np.linalg.norm(X_tr - X_te[i], axis=1)
        nearest_labels = y_tr[np.argsort(dists)[:k]]
        pred = np.argmax(np.bincount(nearest_labels, minlength=3))
        correct += (pred == y_te[i])
    return correct / len(y_te)
lp_acc = linear_probe(X_tr, y_tr, X_te, y_te)
knn_acc = knn_eval(X_tr, y_tr, X_te, y_te)
print(f"Linear probe accuracy: {lp_acc:.1%}")
print(f"k-NN (k=5) accuracy: {knn_acc:.1%}")
print("\nThe encoder never saw labels during pre-training.")
print("High accuracy = SSL features captured the true class structure!")
Both evaluation methods achieve strong accuracy despite the encoder never seeing any class labels during training. This is the ultimate proof of concept: the pretext task forced the encoder to learn features that capture the true underlying structure of the data.
Self-supervised learning has transformed the economics of AI. Instead of labeling millions of examples, we harvest the supervisory signal that was always there in the data — in the statistical relationships between words, in the spatial coherence of images, in the similarity between augmented views. From BERT’s masked tokens to MAE’s masked patches to DINO’s multi-crop distillation, the message is clear: the richest training signal isn’t one that humans provide — it’s one the data provides about itself.
References & Further Reading
- Devlin et al. — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019) — the masked language model that launched a revolution
- He et al. — Masked Autoencoders Are Scalable Vision Learners (2022) — masking 75% of patches for powerful visual pre-training
- Chen et al. — A Simple Framework for Contrastive Learning of Visual Representations (SimCLR, 2020) — elegant contrastive framework with NT-Xent loss
- Grill et al. — Bootstrap Your Own Latent (BYOL, 2020) — no negatives needed, just asymmetry and EMA
- Caron et al. — Emerging Properties in Self-Supervised Vision Transformers (DINO, 2021) — self-distillation with multi-crop and emergent object segmentation
- Zbontar et al. — Barlow Twins: Self-Supervised Learning via Redundancy Reduction (2021) — an alternative approach using cross-correlation
- Balestriero et al. — A Cookbook of Self-Supervised Learning (2023) — comprehensive survey of the SSL landscape