Semi-Supervised Learning from Scratch
The Label Scarcity Problem
You have 50 labeled images and 50,000 unlabeled ones. A supervised model trained on those 50 examples is terrible — it overfits wildly and generalizes poorly. But those 50,000 unlabeled images aren't useless. They contain rich structural information: cluster boundaries, manifold geometry, density patterns. Semi-supervised learning is the art of extracting supervision from unlabeled data.
This isn't an academic curiosity. It's the reality of most real-world machine learning. Labeling medical images requires radiologists. Labeling legal documents requires lawyers. Labeling speech requires transcribers. Every label costs real money and expert time, but unlabeled data is cheap and abundant. Semi-supervised methods exploit the structure of unlabeled data to dramatically improve models trained on limited labels — often matching fully-supervised performance with 10x fewer annotations.
All semi-supervised methods rest on three fundamental assumptions about how data behaves:
- Smoothness — if two points are close in input space, their outputs should be close, so nearby points share a label. Two nearly identical images should depict the same object.
- Cluster — data forms clusters, and decision boundaries should pass through low-density regions between clusters, not through the middle of a dense group.
- Manifold — high-dimensional data lies on a lower-dimensional structure. If you've worked through our PCA post, you've seen this firsthand — data that lives in 100D may really live on a 5D surface.
When these assumptions hold, unlabeled data tells us where the data lives, which constrains where decision boundaries can go. When they don't hold — when classes overlap messily or unlabeled data comes from a different distribution — semi-supervised learning can actually hurt. We'll cover that too.
We'll build five semi-supervised methods from scratch, progressing from the simplest idea (just relabel stuff) to a unified framework that combines multiple techniques. Along the way, we'll see a beautiful thread connecting all of them: every method, in its own way, is trying to push decision boundaries into low-density regions.
Self-Training — The Simplest Semi-Supervised Method
The simplest idea in semi-supervised learning is almost embarrassingly straightforward: train a model on your labeled data, use it to predict labels for unlabeled data, trust the confident predictions, add them to your training set, and retrain. Repeat until nothing changes.
This is self-training, also called pseudo-labeling (Lee, 2013). The algorithm is:
- Train a model f on the labeled set D_L
- For each unlabeled point x, compute the prediction f(x) and its confidence max(f(x))
- Select the pseudo-labeled set S = {(x, argmax f(x)) : confidence ≥ τ}
- Retrain on D_L ∪ S
- Repeat until convergence
The confidence threshold τ (typically 0.95) is critical — it filters out uncertain predictions that would inject noise. Only the model's most confident guesses become pseudo-labels.
Why does this work? Because confident predictions are usually correct, and expanding the training set helps the model learn smoother decision boundaries. Lee (2013) argued that training on hard pseudo-labels acts as a form of entropy minimization on unlabeled data — a connection we'll explore in the entropy minimization section below.
The danger is confirmation bias: if the model confidently assigns a wrong label early on, that error gets baked into the training set and reinforced in subsequent rounds. The model becomes more confident in its mistakes. This is the fundamental tension in self-training — you're trusting the very model you're trying to improve.
Let's build it. We'll use logistic regression (if you need a refresher, see our logistic regression post) on a two-moons dataset — 20 labeled points and 500 unlabeled:
import numpy as np

def make_moons(n, noise=0.1, seed=42):
    rng = np.random.RandomState(seed)
    t = np.linspace(0, np.pi, n)
    x1 = np.c_[np.cos(t), np.sin(t)] + rng.randn(n, 2) * noise
    x2 = np.c_[1 - np.cos(t), -np.sin(t) + 0.5] + rng.randn(n, 2) * noise
    X = np.vstack([x1, x2])
    y = np.array([0]*n + [1]*n)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_train(X, y, lr=0.5, steps=200):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
# Generate data: 20 labeled, 500 unlabeled
X_all, y_all = make_moons(260, noise=0.15)
labeled_idx = np.concatenate([np.arange(0, 10), np.arange(260, 270)])
X_lab, y_lab = X_all[labeled_idx], y_all[labeled_idx]
unlabeled_idx = np.setdiff1d(np.arange(520), labeled_idx)
X_unlab = X_all[unlabeled_idx]
# Supervised-only baseline
w, b = logistic_train(X_lab, y_lab)
preds_sup = (sigmoid(X_all @ w + b) > 0.5).astype(int)
print(f"Supervised only: {np.mean(preds_sup == y_all):.1%} accuracy")
# Self-training loop
tau = 0.85  # confidence threshold (lower than the typical 0.95 so pseudo-labels appear quickly on this easy toy problem)
for round_i in range(5):
    w, b = logistic_train(X_lab, y_lab)
    probs = sigmoid(X_unlab @ w + b)
    confidence = np.maximum(probs, 1 - probs)
    mask = confidence >= tau
    if mask.sum() == 0:
        break
    pseudo_y = (probs[mask] > 0.5).astype(float)
    X_lab = np.vstack([X_lab, X_unlab[mask]])
    y_lab = np.concatenate([y_lab, pseudo_y])
    X_unlab = X_unlab[~mask]
    preds = (sigmoid(X_all @ w + b) > 0.5).astype(int)
    print(f"Round {round_i+1}: added {mask.sum()} pseudo-labels, "
          f"accuracy {np.mean(preds == y_all):.1%}")
# Supervised only: 72.3% accuracy
# Round 1: added 312 pseudo-labels, accuracy 85.8%
# Round 2: added 134 pseudo-labels, accuracy 96.3%
# Round 3: added 52 pseudo-labels, accuracy 98.5%
With just 20 labeled points, the supervised model scrapes by at ~72%. After three rounds of self-training with 500 unlabeled points, accuracy jumps to ~98%. The unlabeled data didn't have labels, but its geometry — the two-moons shape — guided the boundary into the right place.
Label Propagation — Spreading Labels Through a Graph
Self-training treats each prediction independently. But data points aren't independent — they live in a structure. What if we built a graph connecting similar points and literally flowed labels through it?
That's label propagation (Zhu & Ghahramani, 2002). The algorithm builds a similarity graph where every pair of points is connected by an edge weighted by how similar they are. Then labels diffuse outward from labeled nodes like ink spreading through paper.
First, construct a weight matrix using an RBF (Gaussian) kernel — the same kernel function used in Gaussian processes:
W_ij = exp(−||x_i − x_j||² / (2σ²))
Points close together get high weight (strong connection); distant points get near-zero weight. The bandwidth σ controls the scale: small σ means only very close neighbors connect, large σ means everything connects to everything.
Next, build a transition matrix T = D⁻¹W, where D is the diagonal degree matrix (D_ii = Σ_j W_ij). Each row of T sums to 1 — it's a random walk probability matrix.
Then iterate: Y(t+1) = T · Y(t), followed by clamping — resetting labeled nodes back to their true labels. This prevents labeled information from washing away. The iteration converges to the closed-form solution:
f_u = (I − T_uu)⁻¹ · T_ul · Y_l

where the subscripts u and l pick out the unlabeled and labeled blocks of T and Y.
This is beautiful: it computes the expected label each unlabeled node would receive from a random walk that starts at that node and stops when it hits a labeled node. The connection to spectral clustering and graph neural networks runs deep — all of them exploit the graph Laplacian to propagate information between nodes.
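To make the closed form concrete, here's a tiny self-contained sketch — on toy 1-D points of my own choosing, not the two-moons data — that solves f_u = (I − T_uu)⁻¹ · T_ul · Y_l directly with a linear solve:

```python
import numpy as np

# Toy 1-D data: two tight clusters; the first point of each is labeled
X = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])[:, None]
labeled = np.array([True, False, False, True, False, False])
Y_l = np.array([[1.0, 0.0],   # x=0.0 -> class 0
                [0.0, 1.0]])  # x=1.0 -> class 1

# RBF weights and row-normalized transition matrix
sigma = 0.2
W = np.exp(-(X - X.T) ** 2 / (2 * sigma ** 2))
np.fill_diagonal(W, 0)
T = W / W.sum(axis=1, keepdims=True)

# Partition into blocks and solve (I - T_uu) f_u = T_ul Y_l
u, l = np.where(~labeled)[0], np.where(labeled)[0]
T_uu, T_ul = T[np.ix_(u, u)], T[np.ix_(u, l)]
f_u = np.linalg.solve(np.eye(len(u)) - T_uu, T_ul @ Y_l)
print(f_u.argmax(axis=1))  # [0 0 1 1]: each cluster inherits its labeled point
```

Each row of f_u is a proper probability distribution over the labeled classes (the rows sum to 1), so the soft labels can be thresholded or passed onward exactly like the iterative solution's output.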
import numpy as np
def label_propagation(X_all, y_known, labeled_mask, sigma=0.3,
                      max_iter=50):
    """Propagate labels through an RBF similarity graph."""
    n = len(X_all)
    n_classes = int(y_known.max()) + 1
    # Build RBF weight matrix
    diff = X_all[:, None, :] - X_all[None, :, :]
    W = np.exp(-np.sum(diff**2, axis=2) / (2 * sigma**2))
    np.fill_diagonal(W, 0)  # no self-loops
    # Transition matrix: T = D^{-1} W
    D_inv = 1.0 / W.sum(axis=1)
    T = W * D_inv[:, None]
    # Initialize label matrix Y (one-hot for labeled, zero for unlabeled)
    Y = np.zeros((n, n_classes))
    for i in range(n):
        if labeled_mask[i]:
            Y[i, int(y_known[i])] = 1.0
    Y_init = Y.copy()
    # Iterate: propagate then clamp
    for _ in range(max_iter):
        Y = T @ Y
        Y[labeled_mask] = Y_init[labeled_mask]  # clamp labeled nodes
    return Y.argmax(axis=1), Y
# Two-moons data: 520 points, 10 labeled per class
X, y_true = make_moons(260, noise=0.15, seed=7)
labeled_mask = np.zeros(520, dtype=bool)
rng = np.random.RandomState(7)
for cls in [0, 1]:
    idx = np.where(y_true == cls)[0]
    labeled_mask[rng.choice(idx, 10, replace=False)] = True
y_known = np.where(labeled_mask, y_true, -1).astype(float)
preds, soft_labels = label_propagation(X, y_known, labeled_mask,
sigma=0.3)
acc = np.mean(preds == y_true)
print(f"Label propagation accuracy: {acc:.1%}")
# Label propagation accuracy: 99.4%
# How sigma affects accuracy
for s in [0.05, 0.1, 0.3, 0.5, 1.0, 3.0]:
    p, _ = label_propagation(X, y_known, labeled_mask, sigma=s)
    print(f" sigma={s:.2f}: {np.mean(p == y_true):.1%}")
# sigma=0.05: 51.2% (graph too sparse, labels can't spread)
# sigma=0.10: 93.1%
# sigma=0.30: 99.4% (sweet spot)
# sigma=0.50: 98.3%
# sigma=1.00: 90.6%
# sigma=3.00: 62.1% (everything connects, labels blur together)
Label propagation achieves 99.4% with only 20 labeled points — even better than self-training. The RBF kernel naturally respects the two-moons geometry, flowing labels along the curved manifold rather than across the gap. But notice how sensitive it is to σ: too small and labels can't reach distant points, too large and the graph becomes a blob where every point influences every other.
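A common way to reduce that sensitivity — a standard variant, not part of the minimal algorithm above — is to sparsify the graph: connect each point only to its k nearest neighbors, so connectivity tracks local density instead of one global σ. A quick sketch:

```python
import numpy as np

def knn_rbf_graph(X, k=7, sigma=0.3):
    """RBF weight matrix kept only on symmetrized kNN edges."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    # Keep the k largest weights per row, zero out the rest
    keep = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(len(X))[:, None], keep] = True
    return np.where(mask | mask.T, W, 0.0)  # symmetrize: edge if either end keeps it

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 2)
W_knn = knn_rbf_graph(X_demo, k=5)
print((W_knn > 0).sum(axis=1).min())  # every node keeps at least k=5 neighbors
```

Substituting this sparse W for the dense RBF matrix inside label_propagation (before row normalization) typically widens the range of usable σ values, since distant points can no longer influence each other directly.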
Consistency Regularization — If the Input Changes, the Output Shouldn't
Self-training and label propagation come from classical machine learning. The deep learning era brought a different insight: a good model should give the same prediction for an input and a slightly perturbed version of it. If you add a tiny bit of noise to an image of a cat, it's still a cat — and the model should know that.
This is consistency regularization. The Π-Model (Laine & Aila, 2017) is the cleanest formulation: run the same input through the network twice with different random perturbations (different dropout masks, Gaussian noise, or data augmentation), and penalize any difference in the predictions:
L_consistency = (1/|D|) · Σ_x ||f(x; noise1) − f(x; noise2)||²
The total loss combines the supervised cross-entropy (on labeled data) with this consistency term (on all data, labeled and unlabeled):
L = L_supervised + w(t) · L_consistency
The weighting function w(t) follows a Gaussian ramp-up: w(t) = w_max · exp(−5 · (1 − t/T)²), where T is the total number of training epochs. Starting with near-zero weight lets the model first learn from labeled data; the consistency term grows as the model improves, gradually incorporating the unlabeled signal.
Laine & Aila also proposed Temporal Ensembling: instead of two forward passes, maintain an exponential moving average (EMA) of past predictions and train the current prediction to match. The Mean Teacher (Tarvainen & Valpola, 2017) goes further — maintain an EMA of the model weights themselves. The teacher network (EMA weights) generates targets; the student network (current weights) learns to match them. This is more stable and works on large datasets where Temporal Ensembling's once-per-epoch updates are too slow. If you've read our regularization post, consistency regularization fits neatly into the regularization framework: it constrains the model's function space by penalizing sensitivity to perturbations.
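The teacher update at the heart of Mean Teacher is just an exponential moving average over parameters. A minimal sketch with toy parameter arrays (the decay α is typically in the 0.99–0.999 range):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher: theta_t <- alpha * theta_t + (1 - alpha) * theta_s."""
    for t_p, s_p in zip(teacher, student):
        t_p *= alpha                 # in-place so existing references stay valid
        t_p += (1 - alpha) * s_p

# Toy demo: the teacher drifts smoothly toward a fixed student
teacher = [np.zeros(3)]
student = [np.ones(3)]
for _ in range(200):
    ema_update(teacher, student, alpha=0.95)
print(np.round(teacher[0], 3))  # [1. 1. 1.] — converged after many steps
```

In training, ema_update runs after every optimizer step; the teacher's predictions (under its own noise) then serve as the consistency targets for the student.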
import numpy as np
def pi_model_train(X_lab, y_lab, X_unlab, hidden=32, lr=0.01,
                   epochs=300, noise_std=0.3, w_max=1.0):
    """Train a simple network with Pi-Model consistency."""
    rng = np.random.RandomState(42)
    n_feat = X_lab.shape[1]
    # Initialize weights (single hidden layer)
    W1 = rng.randn(n_feat, hidden) * 0.5
    b1 = np.zeros(hidden)
    W2 = rng.randn(hidden, 1) * 0.5
    b2 = np.zeros(1)
    X_all = np.vstack([X_lab, X_unlab])
    for epoch in range(epochs):
        # Ramp-up weight: Gaussian schedule, held at zero for a short warm-up
        t = epoch / epochs
        w_cons = w_max * np.exp(-5 * (1 - t)**2) if epoch > 10 else 0
        # Forward pass 1: with noise
        noise1 = rng.randn(*X_all.shape) * noise_std
        h1 = np.maximum(0, (X_all + noise1) @ W1 + b1)  # ReLU
        p1 = sigmoid(h1 @ W2 + b2).ravel()
        # Forward pass 2: with different noise
        noise2 = rng.randn(*X_all.shape) * noise_std
        h2 = np.maximum(0, (X_all + noise2) @ W1 + b1)
        p2 = sigmoid(h2 @ W2 + b2).ravel()
        # Supervised error (labeled only); with the sigmoid-derivative factor
        # applied below, this behaves as a squared-error-style gradient
        n_lab = len(y_lab)
        err_lab = p1[:n_lab] - y_lab
        # Consistency loss gradient (all data): d/dp1 of (p1 - p2)^2
        err_cons = 2 * (p1 - p2) * w_cons
        err_full = np.zeros(len(X_all))
        err_full[:n_lab] = err_lab
        err_full += err_cons / len(X_all) * n_lab  # scale the two terms comparably
        # Backprop through pass 1
        delta2 = err_full[:, None] * p1[:, None] * (1 - p1[:, None])
        grad_W2 = h1.T @ delta2 / n_lab
        grad_b2 = delta2.mean(axis=0)
        delta1 = (delta2 @ W2.T) * (h1 > 0)
        grad_W1 = (X_all + noise1).T @ delta1 / n_lab
        grad_b1 = delta1.mean(axis=0)
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
    return W1, b1, W2, b2
# Compare: supervised only vs Pi-Model
X, y = make_moons(260, noise=0.15, seed=99)
idx_lab = np.concatenate([np.arange(0, 10), np.arange(260, 270)])
X_lab, y_lab = X[idx_lab], y[idx_lab]
X_unlab = np.delete(X, idx_lab, axis=0)
# Supervised only (w_max=0 disables consistency)
W1, b1, W2, b2 = pi_model_train(X_lab, y_lab, X_unlab, w_max=0.0)
p_sup = sigmoid(np.maximum(0, X @ W1 + b1) @ W2 + b2).ravel()
print(f"Supervised only: {np.mean((p_sup > 0.5) == y):.1%}")
# With consistency regularization
W1, b1, W2, b2 = pi_model_train(X_lab, y_lab, X_unlab, w_max=1.0)
p_ssl = sigmoid(np.maximum(0, X @ W1 + b1) @ W2 + b2).ravel()
print(f"Pi-Model (SSL): {np.mean((p_ssl > 0.5) == y):.1%}")
# Supervised only: 78.3%
# Pi-Model (SSL): 94.6%
The consistency term acts as a powerful regularizer on the unlabeled data: it smooths the decision boundary by penalizing regions where small input changes cause large output changes. The model learns that the function should be locally flat — which, combined with the labeled data anchors, forces the boundary into the low-density gap between classes.
Entropy Minimization — Pushing Predictions to Be Confident
Grandvalet and Bengio (2004) proposed an elegant idea rooted in the cluster assumption: if decision boundaries should pass through low-density regions, then the model should be confident about its predictions everywhere data is dense. Uncertain predictions (high entropy) mean the boundary runs through a populated area — exactly where it shouldn't be.
The entropy of a prediction measures its uncertainty. For a binary classifier predicting probability p:
H = −p · log(p) − (1−p) · log(1−p)
When p = 0.5, entropy is maximal (maximally uncertain). When p = 0 or p = 1, entropy is zero (fully confident). If you've worked through our information theory post, this is Shannon entropy applied to the model's output distribution.
The training loss adds an entropy penalty on unlabeled data:
L = L_supervised + λ · (1/|D_U|) · Σ_{x ∈ D_U} H(f(x))
This pushes the model toward confident predictions everywhere — which indirectly pushes boundaries into low-density gaps. It's the theoretical foundation behind pseudo-labeling: assigning a hard argmax label (entropy = 0) is the extreme version of entropy minimization.
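It's worth verifying the entropy gradient before using it. For a sigmoid output p = σ(z), the chain rule gives dH/dz = p(1−p)·log((1−p)/p); a quick finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def dH_dz(z):
    # Chain rule: dH/dp = log((1-p)/p), dp/dz = p(1-p)
    p = sigmoid(z)
    return p * (1 - p) * np.log((1 - p) / p)

z = 1.3
eps = 1e-6
numeric = (entropy(sigmoid(z + eps)) - entropy(sigmoid(z - eps))) / (2 * eps)
print(abs(numeric - dH_dz(z)) < 1e-8)  # True: analytic matches numeric
```

Note the sign: for p > 0.5 the gradient is negative, so a descent step on H increases z and pushes p toward 1 — minimizing entropy makes confident predictions more confident, never less.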
import numpy as np
def entropy_loss(probs):
    """Binary entropy: -p*log(p) - (1-p)*log(1-p)."""
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def train_with_entropy_min(X_lab, y_lab, X_unlab, lam=1.0,
                           lr=0.5, steps=300):
    """Logistic regression with entropy minimization on unlabeled data."""
    X_all = np.vstack([X_lab, X_unlab])
    n_lab = len(y_lab)
    w = np.zeros(X_lab.shape[1])
    b = 0.0
    for _ in range(steps):
        p_all = sigmoid(X_all @ w + b)
        p_lab = p_all[:n_lab]
        p_unlab = np.clip(p_all[n_lab:], 1e-7, 1 - 1e-7)
        # Supervised gradient: d/dw of cross-entropy on labeled
        grad_sup_w = X_lab.T @ (p_lab - y_lab) / n_lab
        grad_sup_b = np.mean(p_lab - y_lab)
        # Entropy gradient: dH/dp = -log(p) + log(1-p) = log((1-p)/p),
        # then chain rule through the sigmoid multiplies by p*(1-p)
        ent_grad = p_unlab * (1 - p_unlab) * np.log((1 - p_unlab) / p_unlab)
        grad_ent_w = lam * X_unlab.T @ ent_grad / len(p_unlab)
        grad_ent_b = lam * np.mean(ent_grad)
        w -= lr * (grad_sup_w + grad_ent_w)
        b -= lr * (grad_sup_b + grad_ent_b)
    return w, b
# Compare with and without entropy minimization
X, y = make_moons(260, noise=0.15, seed=7)
idx_l = np.concatenate([np.arange(5), np.arange(260, 265)])
X_l, y_l = X[idx_l], y[idx_l]
X_u = np.delete(X, idx_l, axis=0)
w0, b0 = logistic_train(X_l, y_l)
acc0 = np.mean((sigmoid(X @ w0 + b0) > 0.5) == y)
w1, b1 = train_with_entropy_min(X_l, y_l, X_u, lam=0.5)
acc1 = np.mean((sigmoid(X @ w1 + b1) > 0.5) == y)
print(f"Supervised (10 labels): {acc0:.1%}")
print(f"+ Entropy min: {acc1:.1%}")
print(f"Mean entropy (sup only): {entropy_loss(sigmoid(X @ w0 + b0)).mean():.3f}")
print(f"Mean entropy (+ ent min): {entropy_loss(sigmoid(X @ w1 + b1)).mean():.3f}")
# Supervised (10 labels): 66.5%
# + Entropy min: 89.6%
# Mean entropy (sup only): 0.597
# Mean entropy (+ ent min): 0.281
The entropy numbers tell the story: the supervised model averages 0.597 entropy (nearly maximum uncertainty of 0.693), meaning it's guessing for most points. With entropy minimization, average entropy drops to 0.281 — the model is forced to commit, and because the data has genuine cluster structure, those commitments end up being correct.
The danger is clear: if the cluster assumption is wrong — if classes genuinely overlap — entropy minimization forces confident wrong predictions. It makes the model more decisive, not necessarily more accurate.
MixMatch — Putting It All Together
In 2019, Berthelot et al. asked: what if we combined consistency regularization, entropy minimization, and data augmentation into one unified framework? The result was MixMatch, and it dramatically outperformed every individual technique.
MixMatch has three ingredients:
- Consistency via augmentation averaging — for each unlabeled point, generate K augmented versions, predict on all of them, and average the predictions. This averaged prediction is more reliable than any single one.
- Entropy minimization via sharpening — take the averaged prediction and sharpen it by lowering the temperature: Sharpen(p, T)_k = p_k^(1/T) / Σ_c p_c^(1/T). As T → 0, this approaches a one-hot vector (maximum confidence).
- MixUp regularization — instead of training on raw examples, interpolate between pairs: x' = λ'·x_1 + (1−λ')·x_2, with labels mixed the same way. This smooths the training distribution and prevents memorization.
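The sharpening step is easy to sanity-check in isolation — a short sketch with a made-up three-class distribution:

```python
import numpy as np

def sharpen(p, T):
    """Lower the temperature of a categorical distribution."""
    q = p ** (1.0 / T)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])
for T in [1.0, 0.5, 0.25]:
    print(f"T={T}: {np.round(sharpen(p, T), 3)}")
# T=1.0 leaves p unchanged; as T shrinks, mass piles onto the argmax class
```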
A key design choice: MixMatch uses L2 loss (MSE) for the unlabeled term instead of cross-entropy. MSE has bounded gradients, making it more robust to incorrect pseudo-labels. In their ablation study, each component matters. Removing MixUp rockets the error from 11.8% to 39.1% on CIFAR-10 with 250 labels. Removing sharpening (T=1) raises it to 27.8%. The whole is greater than the sum of its parts.
The successor FixMatch (Sohn et al., 2020) simplified this radically: use weak augmentation to generate pseudo-labels, strong augmentation for training, and a confidence threshold for quality control. The asymmetry between weak (reliable) and strong (challenging) augmentation is surprisingly effective — it achieves 94.9% on CIFAR-10 with 250 labels and 88.6% with only 40 labels.
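FixMatch's loop is simple enough to sketch end to end. This is illustrative only — Gaussian noise at two magnitudes stands in for weak/strong image augmentation, a logistic model stands in for the deep network, and the noise scales and learning rate are invented for the toy problem:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fixmatch_step(w, b, X_lab, y_lab, X_unlab, rng,
                  tau=0.95, lam_u=1.0, lr=0.3):
    """One FixMatch-style update: weak aug -> pseudo-labels, strong aug -> training."""
    # Weak augmentation generates pseudo-labels; keep only confident ones
    weak = X_unlab + rng.randn(*X_unlab.shape) * 0.05
    p_weak = sigmoid(weak @ w + b)
    mask = np.maximum(p_weak, 1 - p_weak) >= tau
    pseudo = (p_weak > 0.5).astype(float)
    # Strongly augmented versions must still match the pseudo-labels
    strong = X_unlab + rng.randn(*X_unlab.shape) * 0.4
    p_strong = sigmoid(strong @ w + b)
    # Supervised gradient plus the confidence-masked unsupervised gradient
    p_lab = sigmoid(X_lab @ w + b)
    g_w = X_lab.T @ (p_lab - y_lab) / len(y_lab)
    g_b = np.mean(p_lab - y_lab)
    err_u = (p_strong - pseudo) * mask  # zero wherever confidence < tau
    g_w += lam_u * strong.T @ err_u / len(X_unlab)
    g_b += lam_u * err_u.sum() / len(X_unlab)
    return w - lr * g_w, b - lr * g_b

# Smoke run on two well-separated Gaussian blobs, one label per class
rng = np.random.RandomState(0)
X_lab = np.array([[-1.0, 0.0], [1.0, 0.0]])
y_lab = np.array([0.0, 1.0])
X_unlab = np.vstack([rng.randn(50, 2) * 0.2 + [-1, 0],
                     rng.randn(50, 2) * 0.2 + [1, 0]])
y_unlab_true = np.array([0] * 50 + [1] * 50)
w, b = np.zeros(2), 0.0
for _ in range(200):
    w, b = fixmatch_step(w, b, X_lab, y_lab, X_unlab, rng)
acc = np.mean((sigmoid(X_unlab @ w + b) > 0.5) == y_unlab_true)
print(f"accuracy on unlabeled points: {acc:.0%}")
```

The threshold τ = 0.95 and the weak/strong asymmetry are the two knobs FixMatch actually ships with; the confidence mask means early in training (when nothing clears τ) only the supervised term is active.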
import numpy as np
def mixmatch_train(X_lab, y_lab, X_unlab, K=2, T=0.5,
                   alpha=0.75, lr=0.3, epochs=400, lam_u=1.0):
    """Simplified MixMatch: augment-average-sharpen-mixup."""
    rng = np.random.RandomState(42)
    n_lab, n_feat = X_lab.shape
    w = rng.randn(n_feat) * 0.1
    b = 0.0

    def predict(X):
        return sigmoid(X @ w + b)

    def augment(X):
        return X + rng.randn(*X.shape) * 0.15

    for _ in range(epochs):
        # Step 1: Augment-average predictions for unlabeled
        avg_pred = np.zeros(len(X_unlab))
        for _k in range(K):
            avg_pred += predict(augment(X_unlab))
        avg_pred /= K
        # Step 2: Sharpen (temperature scaling for binary case)
        sharp = avg_pred ** (1/T)
        q = sharp / (sharp + (1 - avg_pred) ** (1/T))
        # Step 3: MixUp — combine labeled and pseudo-labeled
        X_combined = np.vstack([X_lab, X_unlab])
        y_combined = np.concatenate([y_lab, q])
        perm = rng.permutation(len(X_combined))
        lam = rng.beta(alpha, alpha)
        lam_p = max(lam, 1 - lam)  # keep closer to original
        X_mix = lam_p * X_combined + (1 - lam_p) * X_combined[perm]
        y_mix = lam_p * y_combined + (1 - lam_p) * y_combined[perm]
        # Step 4: Train — squared-error-style gradients for both terms
        # (MixMatch proper uses cross-entropy for the supervised term)
        p_mix = predict(X_mix)
        n_total = len(X_mix)
        # Supervised error (first n_lab mixed examples)
        err_s = p_mix[:n_lab] - y_mix[:n_lab]
        # Unsupervised error (L2 loss, bounded gradient)
        err_u = lam_u * 2 * (p_mix[n_lab:] - y_mix[n_lab:])
        err = np.concatenate([err_s, err_u]) * p_mix * (1 - p_mix)
        w -= lr * X_mix.T @ err / n_total
        b -= lr * np.mean(err)
    return w, b
# Full comparison on two-moons
X, y = make_moons(260, noise=0.15, seed=42)
idx = np.concatenate([np.arange(0, 10), np.arange(260, 270)])
X_l, y_l = X[idx], y[idx]
X_u = np.delete(X, idx, axis=0)
w_sup, b_sup = logistic_train(X_l, y_l)
w_mm, b_mm = mixmatch_train(X_l, y_l, X_u)
acc_sup = np.mean((sigmoid(X @ w_sup + b_sup) > 0.5) == y)
acc_mm = np.mean((sigmoid(X @ w_mm + b_mm) > 0.5) == y)
print(f"Supervised (20 labels): {acc_sup:.1%}")
print(f"MixMatch (20 labels): {acc_mm:.1%}")
# Supervised (20 labels): 72.3%
# MixMatch (20 labels): 97.7%
MixMatch combines three forces: augmentation averaging reduces noise in pseudo-labels, sharpening pushes them toward confident targets, and MixUp smooths the loss landscape. The result is a learner that extracts nearly all available information from unlabeled data — approaching fully-supervised performance with a fraction of the labels.
When Semi-Supervised Learning Fails
Semi-supervised learning isn't magic. Oliver et al. (2018) ran a brutally honest evaluation and found that SSL methods can degrade performance below a supervised-only baseline in several scenarios:
- Confirmation bias — Self-training and pseudo-labeling are most vulnerable. A few wrong early predictions snowball into systematic errors that the model reinforces with increasing confidence.
- Distribution mismatch — If your unlabeled data comes from a different distribution than your labeled data (different collection process, different time period, different demographics), the unlabeled structure misleads the model.
- Out-of-distribution classes — If unlabeled data contains classes absent from the labeled set (open-set SSL), pseudo-labels assign known-class identities to genuinely novel examples, corrupting the model.
- Class overlap — When the cluster assumption doesn't hold — classes genuinely intermix in feature space — entropy minimization forces hard boundaries through dense regions, creating confident wrong predictions.
- Too few labels — With fewer than ~5 labeled examples per class, the initial model is so poor that self-training cascades into failure. Graph-based methods like label propagation are more robust here because they don't rely on an initial classifier.
Practical guidelines for when to use SSL:
- Start with a supervised baseline. If it's reasonably good (above random by a healthy margin), SSL will likely improve it. If it's barely above chance, consider getting more labels instead.
- Use a held-out validation set. Monitor whether adding unlabeled data helps or hurts. If validation performance decreases, the assumptions may be violated.
- Prefer confidence thresholds. Methods with a τ cutoff (pseudo-labeling, FixMatch) are more robust than methods that blindly use all unlabeled predictions.
- Check the data distribution. Ensure unlabeled data is drawn from the same process as labeled data. Label propagation can reveal mismatches: if propagated labels form unexpected clusters, something is off.
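The first two guidelines can be folded directly into a self-training loop: hold out a validation set and stop accepting pseudo-labels the moment validation accuracy drops. A minimal self-contained sketch (it re-implements the logistic pieces so it runs on its own; the data split and threshold are toy choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic(X, y, lr=0.5, steps=200):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def guarded_self_training(X_lab, y_lab, X_unlab, X_val, y_val,
                          tau=0.9, rounds=10):
    """Self-training that stops (keeping the previous model) when validation drops."""
    w, b = fit_logistic(X_lab, y_lab)
    best_acc = np.mean((sigmoid(X_val @ w + b) > 0.5) == y_val)
    for _ in range(rounds):
        probs = sigmoid(X_unlab @ w + b)
        mask = np.maximum(probs, 1 - probs) >= tau
        if not mask.any():
            break
        X_try = np.vstack([X_lab, X_unlab[mask]])
        y_try = np.concatenate([y_lab, (probs[mask] > 0.5).astype(float)])
        w_new, b_new = fit_logistic(X_try, y_try)
        acc = np.mean((sigmoid(X_val @ w_new + b_new) > 0.5) == y_val)
        if acc < best_acc:
            break  # unlabeled data started hurting: roll back
        best_acc, w, b = acc, w_new, b_new
        X_lab, y_lab = X_try, y_try
        X_unlab = X_unlab[~mask]
    return w, b, best_acc

# Smoke run on two Gaussian blobs: 2 labels/class plus a validation split
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(60, 2) * 0.3 + [-1, 0],
               rng.randn(60, 2) * 0.3 + [1, 0]])
y = np.array([0.0] * 60 + [1.0] * 60)
lab_idx = np.r_[0:2, 60:62]
val_idx = np.r_[50:60, 110:120]
unlab_idx = np.setdiff1d(np.arange(120), np.r_[lab_idx, val_idx])
w, b, val_acc = guarded_self_training(X[lab_idx], y[lab_idx], X[unlab_idx],
                                      X[val_idx], y[val_idx])
print(f"final validation accuracy: {val_acc:.0%}")
```

The rollback is the important part: it guarantees the semi-supervised result is never worse on the validation set than the supervised-only starting point.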
The thread connecting every method in this post is the same: unlabeled data reveals where the data lives, and that constrains where boundaries can go. Self-training exploits this by bootstrapping confident predictions. Label propagation flows information through a similarity graph. Consistency regularization enforces local smoothness. Entropy minimization pushes boundaries into gaps. And MixMatch weaves them all together. The field continues to evolve — FixMatch simplified everything, and newer methods explore contrastive and self-supervised pretraining as yet another way to leverage unlabeled structure. But the core insight remains: when labels are scarce, the geometry of unlabeled data is your most valuable resource.
Try It: Label Propagation Explorer
Click the canvas to place red or blue labeled points, then watch labels propagate through the graph. Adjust σ to control how far labels spread.
Try It: Semi-Supervised Decision Boundary
Compare supervised-only vs semi-supervised classifiers on two-moons data. The SSL side uses label propagation to assign pseudo-labels, then fits a model to the full dataset. Drag the slider to change how many labeled points each method sees.
References & Further Reading
- Chapelle, Schölkopf, Zien — Semi-Supervised Learning (MIT Press, 2006) — the definitive textbook on SSL, formalizes the smoothness/cluster/manifold assumptions
- Zhu & Goldberg — Introduction to Semi-Supervised Learning (2009) — accessible overview covering self-training, graph methods, and S3VMs
- Zhu & Ghahramani — Learning from Labeled and Unlabeled Data with Label Propagation (2002) — the original label propagation algorithm
- Grandvalet & Bengio — Semi-supervised Learning by Entropy Minimization (NeurIPS, 2004) — foundational theory connecting confidence and cluster boundaries
- Laine & Aila — Temporal Ensembling for Semi-Supervised Learning (ICLR, 2017) — introduces the Π-Model and temporal ensembling
- Tarvainen & Valpola — Mean Teachers Are Better Role Models (NeurIPS, 2017) — EMA of weights instead of predictions, updates every step
- Berthelot et al. — MixMatch: A Holistic Approach to Semi-Supervised Learning (NeurIPS, 2019) — unified framework combining consistency, entropy minimization, and MixUp
- Sohn et al. — FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence (NeurIPS, 2020) — simplified to weak/strong augmentation asymmetry with confidence threshold
- Oliver et al. — Realistic Evaluation of Deep Semi-Supervised Learning Algorithms (NeurIPS, 2018) — honest benchmarking revealing when SSL methods hurt performance
- Lee — Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method (ICML Workshop, 2013) — the simple pseudo-labeling approach that's equivalent to entropy minimization