Active Learning from Scratch: Teaching Your Model to Ask the Right Questions
The Labeling Bottleneck
You have 100,000 unlabeled images and a budget to label just 500. Which 500 should you pick?
Random selection wastes your budget on easy examples the model already understands. A photo of a golden retriever on a lawn? The model nailed that after seeing ten dogs. Meanwhile, it's completely baffled by that blurry photo of a cat perched on a TV that looks vaguely like a bird — but nobody showed it that example because the random number generator didn't pick it.
Active learning flips the script: instead of randomly feeding data to a passive model, we let the model choose which examples to learn from. "Show me the ones I'm most confused about." This simple idea, letting a model ask questions instead of passively receiving answers, is one of the most practical techniques in machine learning, and it's increasingly relevant: expert annotation for LLM fine-tuning can easily run $1-5 per labeled example, and a well-chosen query strategy can cut that budget by up to an order of magnitude at equal or better model quality.
There are three settings for active learning:
- Pool-based: You have a big pool of unlabeled data. Score everything, query the best. This is by far the most common setting.
- Stream-based: Examples arrive one at a time. For each, decide: label this one or skip it?
- Membership query synthesis: The learner generates its own queries. Powerful in theory, often produces unnatural examples in practice.
We'll focus on pool-based active learning. The loop is elegant:
1. Train a model on the currently labeled set
2. Score every unlabeled example with an acquisition function
3. Query the top-scoring example(s) and get a human to label them
4. Add the newly labeled examples to the training set
5. Retrain and repeat
The entire game is in step 2: what makes a good acquisition function? That's what the rest of this post is about.
Uncertainty Sampling — The Simplest Strategy
The most intuitive acquisition function is also the simplest: query the example the model is most uncertain about. If the model assigns a 51% probability to "cat" and 49% to "dog," it desperately needs a label for that example. If it assigns 99.7% to "cat," labeling that example teaches it almost nothing.
There are three common ways to measure uncertainty for a classifier with predicted probabilities P(y|x):
- Least confidence: 1 - max_y P(y|x). How far is the top prediction from certainty?
- Margin sampling: P(y1|x) - P(y2|x) for the top-2 classes. Query when the margin is smallest — the model can't decide between its top two guesses.
- Entropy: -Σ P(y|x) log P(y|x). The full information-theoretic uncertainty over all classes.
For binary classification, all three are equivalent — they all query the point closest to the decision boundary. For multiclass problems, they differ: entropy considers the entire probability distribution, while margin only looks at the top two classes. A prediction of [0.34, 0.33, 0.33] has higher entropy than [0.50, 0.49, 0.01], even though the margin is similar.
Here's uncertainty sampling in action:
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(probs):
    return 1 - np.max(probs, axis=1)

def margin_sampling(probs):
    sorted_probs = np.sort(probs, axis=1)
    return 1 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_sampling(probs):
    return -np.sum(probs * np.log(probs + 1e-10), axis=1)

def active_learning_loop(X_pool, y_pool, X_seed, y_seed,
                         n_queries=50, acquisition_fn=entropy_sampling):
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []
    for _ in range(n_queries):
        model = LogisticRegression().fit(X_train, y_train)
        accuracies.append(model.score(X_pool, y_pool))
        # Score the remaining pool
        probs = model.predict_proba(X_pool[pool_mask])
        scores = acquisition_fn(probs)
        query_idx = np.where(pool_mask)[0][np.argmax(scores)]
        # "Label" the queried point and add to training set
        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False
    return accuracies
The key insight: acquisition_fn is a plug-in. We score every unlabeled point, pick the highest-scoring one, label it, retrain, and repeat. The acquisition function is the only thing that changes between strategies. Note the cold start problem: when the model has only seen a few seed examples, its uncertainty estimates are unreliable. The first few queries are essentially random — you need a reasonable seed set to bootstrap the process.
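As a quick sanity check of the entropy-versus-margin claim, here are the three measures applied to the two example distributions from earlier (the functions are redefined so this snippet runs standalone):

```python
import numpy as np

def least_confidence(probs):
    return 1 - np.max(probs, axis=1)

def margin_sampling(probs):
    sorted_probs = np.sort(probs, axis=1)
    return 1 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_sampling(probs):
    return -np.sum(probs * np.log(probs + 1e-10), axis=1)

# The two distributions from the text: nearly identical margins,
# very different entropy
probs = np.array([[0.34, 0.33, 0.33],
                  [0.50, 0.49, 0.01]])
print(entropy_sampling(probs))  # first row is higher (near-uniform)
print(margin_sampling(probs))   # both rows are nearly equal
```

Entropy would query the first point; margin sampling is indifferent between the two. Which behavior you want depends on whether the third class matters to you.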
Query-by-Committee — When Models Disagree
Uncertainty sampling relies on a single model's uncertainty estimates. But what if that model is miscalibrated? A model that's confidently wrong will never query the examples it needs most.
Query-by-Committee (QBC) sidesteps this by training a committee of models and querying examples where they disagree. The idea dates back to Seung, Opper & Sompolinsky (1992) and has a beautiful information-theoretic justification: each query that maximizes disagreement approximately halves the version space — the set of hypotheses consistent with the labeled data.
Building a committee is straightforward: train multiple models on different bootstrap samples of the labeled set. Each model sees a slightly different training set, so they'll disagree in regions where the data hasn't pinned down the right answer. Two natural disagreement measures:
- Vote entropy: Each committee member votes for a class. Compute the entropy of the vote distribution. Maximum disagreement = maximum entropy.
- Average KL divergence: Measure how much each member's prediction diverges from the committee consensus. Average over all members.
from sklearn.utils import resample

def query_by_committee(X_pool, y_pool, X_train, y_train,
                       n_committee=5, n_queries=50):
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []
    for _ in range(n_queries):
        # Train committee on bootstrap samples; stratify so a small
        # bootstrap sample can't end up with only one class
        committee = []
        for _ in range(n_committee):
            X_boot, y_boot = resample(X_train, y_train, stratify=y_train)
            model = LogisticRegression().fit(X_boot, y_boot)
            committee.append(model)
        # Evaluate with full committee
        full_model = LogisticRegression().fit(X_train, y_train)
        accuracies.append(full_model.score(X_pool, y_pool))
        # Measure disagreement via vote entropy
        X_unlabeled = X_pool[pool_mask]
        votes = np.array([m.predict(X_unlabeled) for m in committee])
        n_classes = len(np.unique(y_pool))
        vote_entropy = np.zeros(len(X_unlabeled))
        for i in range(len(X_unlabeled)):
            counts = np.bincount(votes[:, i], minlength=n_classes)
            freqs = counts / n_committee
            vote_entropy[i] = -np.sum(freqs * np.log(freqs + 1e-10))
        query_idx = np.where(pool_mask)[0][np.argmax(vote_entropy)]
        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False
    return accuracies
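The code above uses vote entropy. Average KL divergence works on the committee's soft probabilities instead of hard votes; here is a minimal sketch, where the (members, points, classes) array layout is my own convention:

```python
import numpy as np

def average_kl_disagreement(committee_probs):
    """Score each point by the mean KL divergence between each member's
    predictive distribution and the committee consensus.

    committee_probs: array of shape (n_members, n_points, n_classes).
    """
    consensus = committee_probs.mean(axis=0)  # (n_points, n_classes)
    eps = 1e-10
    kl = np.sum(committee_probs * np.log((committee_probs + eps) /
                                         (consensus + eps)), axis=2)
    return kl.mean(axis=0)  # (n_points,)

# Toy check: two members agree on point 0, split on point 1
p = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.9, 0.1], [0.1, 0.9]]])
scores = average_kl_disagreement(p)  # point 1 scores higher
```

Unlike vote entropy, this measure distinguishes a committee that is split 50/50 with high confidence from one that is split 50/50 with members hovering near 0.5 themselves.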
QBC is more robust than single-model uncertainty sampling because disagreement between different models captures epistemic uncertainty — the kind that shrinks with more data. A single model might be overconfident, but a committee of models trained on different bootstrap samples will naturally disagree in the regions where data is scarce. This connects directly to the epistemic/aleatoric uncertainty decomposition from uncertainty quantification.
Expected Model Change & Information-Theoretic Approaches
Uncertainty sampling and QBC ask: "Where is the model confused?" A more ambitious question is: "Which example would change the model the most if labeled?"
Expected Gradient Length (EGL) operationalizes this directly: for each unlabeled example, compute the expected gradient magnitude if we were to label it. Points that would cause large gradient updates are points the model would learn the most from.
import torch
import torch.nn as nn

def expected_gradient_length(model, X_unlabeled, n_classes):
    """Score each point by expected gradient norm."""
    scores = np.zeros(len(X_unlabeled))
    loss_fn = nn.CrossEntropyLoss()
    for i, x in enumerate(X_unlabeled):
        x_tensor = torch.FloatTensor(x).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(model(x_tensor), dim=1)
        total_grad_norm = 0.0
        for y in range(n_classes):
            # Simulate labeling x with class y
            model.zero_grad()
            output = model(x_tensor)
            loss = loss_fn(output, torch.LongTensor([y]))
            loss.backward()
            grad_norm = sum(p.grad.norm().item() ** 2
                            for p in model.parameters()) ** 0.5
            # Weight by probability of this label
            total_grad_norm += probs[0, y].item() * grad_norm
        scores[i] = total_grad_norm
    return scores
EGL is computationally heavier than uncertainty sampling — it requires a backward pass for each (example, label) pair — but it captures something subtler. A point can have high uncertainty (near the decision boundary) but low gradient norm (in a flat region of the loss landscape). EGL combines "what's uncertain" with "what would actually move the model."
At the theoretical extreme sits Expected Error Reduction (EER): for each candidate query, simulate labeling it with each possible label, retrain the model, and measure the expected reduction in error on the entire unlabeled pool. This is the gold-standard acquisition function — it directly optimizes what we care about. But it's O(|pool| × |labels| × retrain cost), which is prohibitively expensive for anything beyond tiny datasets.
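To make that cost concrete, here is a hedged EER sketch using scikit-learn's LogisticRegression. The candidate subsampling and the soft-error proxy for pool error are simplifications of mine, and the label indexing assumes classes are 0..n_classes-1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_error_reduction(X_train, y_train, X_pool, candidates, n_classes):
    """Score each candidate by expected pool error after hypothetically
    labeling it. Cost: one full retrain per (candidate, label) pair,
    which is why EER rarely scales past tiny datasets.
    """
    base = LogisticRegression().fit(X_train, y_train)
    cand_probs = base.predict_proba(X_pool[candidates])
    scores = np.zeros(len(candidates))
    for i, idx in enumerate(candidates):
        expected_error = 0.0
        for y in range(n_classes):
            # Pretend the oracle labeled candidate idx with class y
            X_aug = np.vstack([X_train, X_pool[idx:idx + 1]])
            y_aug = np.append(y_train, y)
            m = LogisticRegression().fit(X_aug, y_aug)
            probs = m.predict_proba(X_pool)
            # Soft proxy for 0/1 error over the whole pool
            err = np.mean(1 - probs.max(axis=1))
            expected_error += cand_probs[i, y] * err
        scores[i] = -expected_error  # higher score = lower expected error
    return scores
```

Even this toy version retrains `len(candidates) * n_classes` models per query round, which is exactly the O(|pool| × |labels| × retrain cost) blow-up described above.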
A more tractable information-theoretic approach is BALD (Bayesian Active Learning by Disagreement): measure the mutual information between the prediction and the model parameters, I(y; θ | x, D). Points with high mutual information are ones where knowing the label would tell us the most about the model parameters — exactly the information-rich examples we want. BALD can be approximated efficiently with MC dropout, connecting it to the Bayesian inference framework.
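A sketch of the MC-dropout approximation to BALD, assuming a PyTorch classifier that contains dropout layers; the sample count and epsilon are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

def bald_scores(model, X_unlabeled, n_samples=20):
    """Approximate I(y; theta | x, D) with MC dropout.

    BALD = H[mean prediction] - mean[H[prediction]]: total predictive
    entropy minus the expected entropy under the (approximate)
    parameter posterior. High score = epistemic uncertainty.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(X_unlabeled), dim=1)
                             for _ in range(n_samples)])  # (S, N, C)
    mean_probs = probs.mean(dim=0)
    eps = 1e-10
    entropy_of_mean = -(mean_probs * (mean_probs + eps).log()).sum(dim=1)
    mean_of_entropy = -(probs * (probs + eps).log()).sum(dim=2).mean(dim=0)
    return entropy_of_mean - mean_of_entropy  # mutual information per point
```

The decomposition is the point: a point where every dropout sample says "50/50" has high entropy but zero BALD score (aleatoric noise), while a point where samples confidently disagree scores high (epistemic uncertainty that more data can fix).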
Batch Active Learning — Selecting Multiple Points at Once
In the real world, you don't query one point at a time. You send a batch of 100 images to your annotation team and wait for all the labels to come back. Naive approach: take the top-k highest-scoring points from your acquisition function.
The problem? The top-k are almost always redundant. They're all clustered in the same uncertain region near the decision boundary. You've asked 100 nearly identical questions instead of 100 different ones.
The solution is to diversify the batch. Think of it as the explore-exploit tradeoff from bandit algorithms: high acquisition score is exploitation (query where you know the model is struggling), while diversity is exploration (query in different parts of the input space).
Three strategies for batch diversity:
- Top-K (greedy, no diversity): Just take the k highest acquisition scores. Fast but redundant.
- K-Centers (pure diversity, greedy coreset): Iteratively select the point farthest from all previously selected points. Great coverage, but ignores uncertainty entirely.
- Hybrid (uncertainty × diversity): Score each candidate by its acquisition score multiplied by its minimum distance to already-selected points. Balances informativeness with coverage.
from scipy.spatial.distance import cdist

def select_batch_topk(scores, k):
    """Naive: take top-k highest scores."""
    return np.argsort(scores)[-k:]

def select_batch_kcenter(X_unlabeled, X_labeled, k):
    """Greedy k-centers: maximize coverage."""
    selected = []
    for _ in range(k):
        if len(selected) == 0:
            ref_points = X_labeled
        else:
            ref_points = np.vstack([X_labeled, X_unlabeled[selected]])
        dists = cdist(X_unlabeled, ref_points).min(axis=1)
        dists[selected] = -1  # exclude already selected
        selected.append(np.argmax(dists))
    return np.array(selected)

def select_batch_hybrid(X_unlabeled, X_labeled, scores, k, alpha=0.5):
    """Uncertainty x diversity: best of both worlds."""
    norm_scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-10)
    selected = []
    for _ in range(k):
        if len(selected) == 0:
            ref_points = X_labeled
        else:
            ref_points = np.vstack([X_labeled, X_unlabeled[selected]])
        dists = cdist(X_unlabeled, ref_points).min(axis=1)
        norm_dists = (dists - dists.min()) / (dists.max() - dists.min() + 1e-10)
        combined = alpha * norm_scores + (1 - alpha) * norm_dists
        combined[selected] = -1
        selected.append(np.argmax(combined))
    return np.array(selected)
The visual difference is striking (try the Batch Diversity Explorer demo below): top-k creates a tight cluster of queries, k-centers spreads points across the entire space including regions the model already understands, and the hybrid places queries near decision boundaries but spreads them across different boundary regions. In practice, batch diversity often improves data efficiency substantially over naive top-k selection; reported gains are commonly in the 20-40% range, though the exact number depends heavily on the dataset and model.
When Active Learning Fails — And How to Fix It
Active learning isn't magic. It has well-known failure modes, and understanding them is crucial for deploying it in practice.
Failure 1: Miscalibration. If the model is overconfident on its wrong predictions, uncertainty sampling will never query those examples — it thinks it already knows the answer. This is the most insidious failure mode because it's invisible: the model confidently ignores an entire region of the input space. Fix: calibrate the model before computing acquisition scores. Temperature scaling (from uncertainty quantification) is cheap and effective.
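Here is a minimal temperature-scaling sketch; fitting T by bounded scalar minimization of validation NLL is one of several reasonable implementations:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits_val, y_val):
    """Find T > 0 minimizing negative log-likelihood on a held-out set."""
    def nll(T):
        z = logits_val / T
        z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return -np.mean(np.log(probs[np.arange(len(y_val)), y_val] + 1e-10))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method='bounded').x

def calibrated_probs(logits, T):
    """Soften (T > 1) or sharpen (T < 1) predictions before scoring."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
```

Feed `calibrated_probs(logits, T)` into the acquisition functions instead of the raw softmax output. T > 1 softens overconfident predictions, which directly changes which points look uncertain.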
Failure 2: Sampling bias. By exclusively querying near the decision boundary, you build a labeled set that's unrepresentative of the true data distribution. Entire clusters of data might never get labeled. Fix: use an ε-greedy strategy — with probability ε, query a random point instead of the highest-scoring one. Even 10-20% random exploration prevents catastrophic blind spots.
Failure 3: Outlier attraction. Outliers and noisy examples have the highest uncertainty — they're inherently hard to classify because they don't belong to any clear class. Querying them wastes your budget on uninformative examples. Fix: multiply the acquisition score by a density estimate. A point must be both uncertain and in a dense region to be queried.
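One way to sketch this fix, loosely following the information-density idea from Settles' survey; the RBF similarity and the median-bandwidth heuristic are my own choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_weighted_scores(X_pool, uncertainty, beta=1.0):
    """Information-density style weighting: score = uncertainty * density^beta.

    Density is mean RBF similarity to the rest of the pool. Outliers sit
    far from everything, so their similarity (and final score) is low.
    """
    sq_dists = cdist(X_pool, X_pool, 'sqeuclidean')
    bandwidth = np.median(sq_dists) + 1e-10  # a common heuristic choice
    similarity = np.exp(-sq_dists / bandwidth)
    density = similarity.mean(axis=1)
    return uncertainty * density ** beta
```

The exponent beta trades off the two terms: beta = 0 recovers plain uncertainty sampling, large beta queries only from dense regions.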
Failure 4: Cold start. When the initial model has seen only a handful of examples, its uncertainty estimates are meaningless. The first 10-20 queries are essentially random. Fix: initialize with a diverse seed set, selected by clustering the unlabeled pool and picking one example per cluster.
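The clustering-based seed selection can be sketched with k-means, taking the pool point nearest each centroid as that cluster's representative (k-means is just one reasonable choice of clusterer):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_seed_indices(X_pool, n_seed, random_state=0):
    """Pick one representative per k-means cluster as the initial seed set."""
    km = KMeans(n_clusters=n_seed, n_init=10,
                random_state=random_state).fit(X_pool)
    seed_idx = []
    for c in range(n_seed):
        # The point closest to each centroid represents that cluster
        dists = np.linalg.norm(X_pool - km.cluster_centers_[c], axis=1)
        seed_idx.append(int(np.argmin(dists)))
    return np.array(seed_idx)
```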
def hybrid_active_learning(X_pool, y_pool, X_seed, y_seed,
                           n_queries=50, epsilon=0.15):
    """Active learning with random exploration to avoid sampling bias."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []
    rng = np.random.RandomState(42)
    for _ in range(n_queries):
        model = LogisticRegression().fit(X_train, y_train)
        accuracies.append(model.score(X_pool, y_pool))
        pool_indices = np.where(pool_mask)[0]
        if rng.random() < epsilon:
            # Explore: random query
            query_idx = rng.choice(pool_indices)
        else:
            # Exploit: uncertainty query
            probs = model.predict_proba(X_pool[pool_indices])
            uncertainty = entropy_sampling(probs)
            query_idx = pool_indices[np.argmax(uncertainty)]
        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False
    return accuracies
The ε-greedy fix is simple but remarkably effective. It echoes the same principle from bandit algorithms: pure exploitation (always querying the most uncertain point) can miss important information, just as always pulling the best-known arm misses better options. A small amount of random exploration prevents the learner from developing a distorted view of the world.
Try It: Active Learning Arena
Two learners race to classify overlapping blob data. The active learner picks the most uncertain points near the decision boundary; the random learner picks blindly. Watch how strategic querying finds the boundary faster with fewer labels.
Active Learning for Modern ML
The strategies above were developed for classical ML, but they're more relevant than ever in the age of foundation models.
LLM fine-tuning data selection. When curating instruction-tuning datasets (as in instruction tuning), which examples should you include? Active learning over the embedding space can identify the most informative fine-tuning examples — ones where the base model's uncertainty is highest or where different model checkpoints disagree.
Foundation model adaptation. With LoRA and other parameter-efficient methods, fine-tuning is cheap — but labeling data for the target domain isn't. Active learning selects which examples to annotate, using the foundation model's embeddings as features. You don't need to retrain from scratch; just compute embedding-space uncertainty.
LLM-as-annotator + active learning. An emerging pattern: use a strong LLM (like GPT-4) to label examples where a smaller model is confident, and route only the uncertain examples to human annotators. This creates a two-tier labeling system that's both cheap and accurate — active learning decides which tier handles each example.
Human-in-the-loop tools. Modern data labeling platforms like Prodigy, Label Studio, and Argilla integrate active learning directly into their annotation workflows. The annotator sees the most informative examples first, accelerating the creation of high-quality datasets.
Try It: Batch Diversity Explorer
Three batch selection strategies pick from the same unlabeled pool. Top-K greedily takes the most uncertain points (redundant!). K-Centers maximizes coverage (but ignores uncertainty). Hybrid balances both. Adjust batch size to see the effect.
Conclusion
Active learning answers a deceptively simple question: "If I can only label N examples, which N should they be?" The answer — let the model choose — leads to a rich family of strategies that all share a common theme: information is not uniformly distributed across your data.
Some examples are worth a hundred labels. Others teach the model nothing it didn't already know. Active learning finds the valuable ones by measuring what the model doesn't know — through uncertainty, disagreement, expected model change, or information gain — and directing human effort there.
The practical takeaway: if you're building any ML system where labeling is expensive (and when isn't it?), active learning should be your first optimization. Before you try fancier models, bigger architectures, or cleverer features — first, label smarter data.
References & Further Reading
- Burr Settles — Active Learning Literature Survey (2009) — The definitive survey; 40 pages covering every major AL strategy
- Lewis & Gale — A Sequential Algorithm for Training Text Classifiers (1994) — Introduced uncertainty sampling for text classification
- Seung, Opper & Sompolinsky — Query by Committee (1992) — The original QBC paper with information-theoretic analysis
- Gal, Islam & Ghahramani — Deep Bayesian Active Learning with Image Data (2017) — MC dropout for deep active learning; BALD acquisition function
- Ash et al. — Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds (BADGE, 2020) — Batch AL using gradient embeddings with k-means++ initialization
- Kirsch, van Amersfoort & Gal — BatchBALD (2019) — Information-theoretic batch AL that avoids redundant queries
- Ren et al. — A Survey of Deep Active Learning (2021) — Comprehensive modern survey covering deep learning-specific challenges