
Active Learning from Scratch: Teaching Your Model to Ask the Right Questions

The Labeling Bottleneck

You have 100,000 unlabeled images and a budget to label just 500. Which 500 should you pick?

Random selection wastes your budget on easy examples the model already understands. A photo of a golden retriever on a lawn? The model nailed that after seeing ten dogs. Meanwhile, it's completely baffled by that blurry photo of a cat perched on a TV that looks vaguely like a bird — but nobody showed it that example because the random number generator didn't pick it.

Active learning flips the script: instead of randomly feeding data to a passive model, we let the model choose which examples to learn from. "Show me the ones I'm most confused about." This simple idea — letting a model ask questions instead of passively receiving answers — is one of the most practical techniques in machine learning, and it's increasingly critical. Fine-tuning an LLM costs $1-5 per labeled example. Active learning can cut that budget by 10x while improving model quality.

There are three settings for active learning:

  1. Membership query synthesis: the learner constructs brand-new inputs and asks an oracle to label them.
  2. Stream-based selective sampling: unlabeled examples arrive one at a time, and the learner decides on the spot whether each is worth querying.
  3. Pool-based sampling: the learner has access to a large pool of unlabeled data and ranks the entire pool before choosing what to query.

We'll focus on pool-based active learning. The loop is elegant:

  1. Train a model on the currently labeled set
  2. Score every unlabeled example with an acquisition function
  3. Query the top-scoring example(s) — get a human to label them
  4. Add the newly labeled examples to the training set
  5. Retrain and repeat

The entire game is in step 2: what makes a good acquisition function? That's what the rest of this post is about.

Uncertainty Sampling — The Simplest Strategy

The most intuitive acquisition function is also the simplest: query the example the model is most uncertain about. If the model assigns a 51% probability to "cat" and 49% to "dog," it desperately needs a label for that example. If it assigns 99.7% to "cat," labeling that example teaches it almost nothing.

There are three common ways to measure uncertainty for a classifier with predicted probabilities P(y|x):

  1. Least confidence: query the point whose most likely class has the lowest probability, scoring by 1 − max_y P(y|x).
  2. Margin: query the point with the smallest gap between the top two class probabilities.
  3. Entropy: query the point whose predictive distribution has the highest entropy, −Σ_y P(y|x) log P(y|x).

For binary classification, all three are equivalent — they all query the point closest to the decision boundary. For multiclass problems, they differ: entropy considers the entire probability distribution, while margin only looks at the top two classes. A prediction of [0.34, 0.33, 0.33] has higher entropy than [0.50, 0.49, 0.01], even though both have exactly the same 0.01 margin.
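You can check that claim numerically. The helper functions below are throwaway illustrations, not from any library:

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a probability vector
    return -np.sum(p * np.log(p))

def margin(p):
    # Gap between the top two probabilities
    top2 = np.sort(p)[-2:]
    return top2[1] - top2[0]

a = np.array([0.34, 0.33, 0.33])
b = np.array([0.50, 0.49, 0.01])
print(f"entropies: {entropy(a):.2f} vs {entropy(b):.2f}")  # 1.10 vs 0.74
print(f"margins:   {margin(a):.2f} vs {margin(b):.2f}")    # 0.01 vs 0.01
```

Margin sampling treats these two predictions as equally uncertain; entropy correctly distinguishes "confused among all three classes" from "confused between two."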

Here's uncertainty sampling in action:

import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(probs):
    # Higher score = lower confidence in the top class
    return 1 - np.max(probs, axis=1)

def margin_sampling(probs):
    # Higher score = smaller gap between the top two classes
    sorted_probs = np.sort(probs, axis=1)
    return 1 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_sampling(probs):
    # Higher score = more spread-out predictive distribution
    return -np.sum(probs * np.log(probs + 1e-10), axis=1)

def active_learning_loop(X_pool, y_pool, X_seed, y_seed,
                         n_queries=50, acquisition_fn=entropy_sampling):
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []

    for _ in range(n_queries):
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        # Track accuracy on the full pool (a held-out test set would be cleaner)
        accuracies.append(model.score(X_pool, y_pool))

        # Score the remaining pool
        probs = model.predict_proba(X_pool[pool_mask])
        scores = acquisition_fn(probs)
        query_idx = np.where(pool_mask)[0][np.argmax(scores)]

        # "Label" the queried point and add to training set
        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False

    return accuracies

The key insight: acquisition_fn is a plug-in. We score every unlabeled point, pick the highest-scoring one, label it, retrain, and repeat. The acquisition function is the only thing that changes between strategies. Note the cold start problem: when the model has only seen a few seed examples, its uncertainty estimates are unreliable. The first few queries are essentially random — you need a reasonable seed set to bootstrap the process.
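To see the loop pay off, here is a compact, self-contained variant that pits entropy-based querying against random selection. The synthetic dataset and the minimal `run` helper are illustrative choices of mine, not from the post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def entropy_sampling(probs):
    return -np.sum(probs * np.log(probs + 1e-10), axis=1)

def run(X, y, seed_idx, n_queries=40, random_query=False, rng=None):
    labeled = list(seed_idx)
    pool = [i for i in range(len(X)) if i not in labeled]
    for _ in range(n_queries):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        if random_query:
            q = rng.choice(len(pool))            # pick blindly
        else:
            scores = entropy_sampling(model.predict_proba(X[pool]))
            q = int(np.argmax(scores))           # pick the most uncertain point
        labeled.append(pool.pop(q))
    final = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    return final.score(X, y)

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
# Stratified seed: five labeled examples per class
seed = np.concatenate([np.where(y == c)[0][:5] for c in range(3)])
acc_active = run(X, y, seed)
acc_random = run(X, y, seed, random_query=True, rng=np.random.RandomState(0))
print(f"active: {acc_active:.3f}  random: {acc_random:.3f}")
```

How large the gap is varies with the random seed; the structural point is that swapping strategies is a one-branch change.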

Query-by-Committee — When Models Disagree

Uncertainty sampling relies on a single model's uncertainty estimates. But what if that model is miscalibrated? A model that's confidently wrong will never query the examples it needs most.

Query-by-Committee (QBC) sidesteps this by training a committee of models and querying examples where they disagree. The idea dates back to Seung, Opper & Sompolinsky (1992) and has a beautiful information-theoretic justification: each query that maximizes disagreement approximately halves the version space — the set of hypotheses consistent with the labeled data.

Building a committee is straightforward: train multiple models on different bootstrap samples of the labeled set. Each model sees a slightly different training set, so they'll disagree in regions where the data hasn't pinned down the right answer. Two natural disagreement measures:

  1. Vote entropy: take each member's hard prediction and compute the entropy of the resulting vote distribution.
  2. Consensus KL divergence: average the members' predicted distributions into a consensus, then take the mean KL divergence of each member from that consensus.

from sklearn.utils import resample

def query_by_committee(X_pool, y_pool, X_train, y_train,
                       n_committee=5, n_queries=50):
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []

    for _ in range(n_queries):
        # Train committee on bootstrap samples
        committee = []
        for _ in range(n_committee):
            X_boot, y_boot = resample(X_train, y_train)
            model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
            committee.append(model)

        # Evaluate with a single model trained on all labeled data
        full_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        accuracies.append(full_model.score(X_pool, y_pool))

        # Measure disagreement via vote entropy
        X_unlabeled = X_pool[pool_mask]
        votes = np.array([m.predict(X_unlabeled) for m in committee])
        n_classes = len(np.unique(y_pool))
        vote_entropy = np.zeros(len(X_unlabeled))
        for i in range(len(X_unlabeled)):
            counts = np.bincount(votes[:, i], minlength=n_classes)
            freqs = counts / n_committee
            vote_entropy[i] = -np.sum(freqs * np.log(freqs + 1e-10))

        query_idx = np.where(pool_mask)[0][np.argmax(vote_entropy)]
        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False

    return accuracies

QBC is more robust than single-model uncertainty sampling because disagreement between different models captures epistemic uncertainty — the kind that shrinks with more data. A single model might be overconfident, but a committee of models trained on different bootstrap samples will naturally disagree in the regions where data is scarce. This connects directly to the epistemic/aleatoric uncertainty decomposition from uncertainty quantification.
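The committee code above uses hard votes. The soft alternative, mean KL divergence of each member from the consensus distribution, can be sketched as follows (the function name and array shapes are my choices):

```python
import numpy as np

def kl_disagreement(committee_probs):
    """Mean KL divergence of each member's predictive distribution from the
    committee consensus. committee_probs: (n_members, n_points, n_classes)."""
    consensus = committee_probs.mean(axis=0)                  # (n_points, n_classes)
    kl = np.sum(committee_probs * np.log((committee_probs + 1e-10) /
                                         (consensus + 1e-10)), axis=2)
    return kl.mean(axis=0)                                    # (n_points,)

# Sanity check: a unanimous committee has zero disagreement
p_agree = np.tile([[0.9, 0.1]], (3, 1, 1))                    # 3 members, 1 point
p_split = np.array([[[0.9, 0.1]], [[0.1, 0.9]], [[0.5, 0.5]]])
print(kl_disagreement(p_agree))   # [0.]
print(kl_disagreement(p_split))   # positive
```

Unlike vote entropy, this measure distinguishes a committee that is split 3-2 with high confidence from one that is split 3-2 but barely leaning either way.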

Expected Model Change & Information-Theoretic Approaches

Uncertainty sampling and QBC ask: "Where is the model confused?" A more ambitious question is: "Which example would change the model the most if labeled?"

Expected Gradient Length (EGL) operationalizes this directly: for each unlabeled example, compute the expected gradient magnitude if we were to label it. Points that would cause large gradient updates are points the model would learn the most from.

import torch
import torch.nn as nn

def expected_gradient_length(model, X_unlabeled, n_classes):
    """Score each point by expected gradient norm."""
    scores = np.zeros(len(X_unlabeled))
    loss_fn = nn.CrossEntropyLoss()

    for i, x in enumerate(X_unlabeled):
        x_tensor = torch.FloatTensor(x).unsqueeze(0)
        probs = torch.softmax(model(x_tensor), dim=1).detach()

        total_grad_norm = 0.0
        for y in range(n_classes):
            # Simulate labeling x with class y
            model.zero_grad()
            output = model(x_tensor)
            loss = loss_fn(output, torch.LongTensor([y]))
            loss.backward()

            grad_norm = sum(p.grad.norm().item() ** 2
                           for p in model.parameters()) ** 0.5
            # Weight by probability of this label
            total_grad_norm += probs[0, y].item() * grad_norm

        scores[i] = total_grad_norm
    return scores

EGL is computationally heavier than uncertainty sampling — it requires a backward pass for each (example, label) pair — but it captures something subtler. A point can have high uncertainty (near the decision boundary) but low gradient norm (in a flat region of the loss landscape). EGL combines "what's uncertain" with "what would actually move the model."

At the theoretical extreme sits Expected Error Reduction (EER): for each candidate query, simulate labeling it with each possible label, retrain the model, and measure the expected reduction in error on the entire unlabeled pool. This is the gold-standard acquisition function — it directly optimizes what we care about. But it's O(|pool| × |labels| × retrain cost), which is prohibitively expensive for anything beyond tiny datasets.

A more tractable information-theoretic approach is BALD (Bayesian Active Learning by Disagreement): measure the mutual information between the prediction and the model parameters, I(y; θ | x, D). Points with high mutual information are ones where knowing the label would tell us the most about the model parameters — exactly the information-rich examples we want. BALD can be approximated efficiently with MC dropout, connecting it to the Bayesian inference framework.
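A minimal sketch of that MC-dropout approximation, assuming a PyTorch classifier that contains dropout layers (the tiny demo architecture is arbitrary). BALD is the entropy of the averaged prediction minus the average entropy of the individual stochastic passes:

```python
import numpy as np
import torch
import torch.nn as nn

def bald_scores(model, X, n_samples=20):
    """BALD via MC dropout: I(y; θ | x) ≈ H[mean_t p_t] − mean_t H[p_t]."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        stacked = torch.stack([torch.softmax(model(X), dim=1)
                               for _ in range(n_samples)])    # (T, N, C)
    mean_p = stacked.mean(dim=0)
    pred_entropy = -(mean_p * torch.log(mean_p + 1e-10)).sum(dim=1)
    exp_entropy = -(stacked * torch.log(stacked + 1e-10)).sum(dim=2).mean(dim=0)
    return (pred_entropy - exp_entropy).numpy()

# Toy model and inputs, purely for illustration
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(32, 3))
X = torch.randn(10, 4)
scores = bald_scores(model, X)
print(scores.shape)  # (10,)
```

The difference of entropies is exactly what separates epistemic from aleatoric uncertainty: a point where every pass is confident but the passes disagree scores high, while a point that is intrinsically ambiguous on every pass scores low.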

Batch Active Learning — Selecting Multiple Points at Once

In the real world, you don't query one point at a time. You send a batch of 100 images to your annotation team and wait for all the labels to come back. Naive approach: take the top-k highest-scoring points from your acquisition function.

The problem? The top-k are almost always redundant. They're all clustered in the same uncertain region near the decision boundary. You've asked 100 nearly identical questions instead of 100 different ones.

The solution is to diversify the batch. Think of it as the explore-exploit tradeoff from bandit algorithms: high acquisition score is exploitation (query where you know the model is struggling), while diversity is exploration (query in different parts of the input space).

Three strategies for batch diversity:

from scipy.spatial.distance import cdist

def select_batch_topk(scores, k):
    """Naive: take top-k highest scores."""
    return np.argsort(scores)[-k:]

def select_batch_kcenter(X_unlabeled, X_labeled, k):
    """Greedy k-centers: maximize coverage."""
    selected = []
    for _ in range(k):
        if len(selected) == 0:
            ref_points = X_labeled
        else:
            ref_points = np.vstack([X_labeled, X_unlabeled[selected]])
        dists = cdist(X_unlabeled, ref_points).min(axis=1)
        dists[selected] = -1  # exclude already selected
        selected.append(np.argmax(dists))
    return np.array(selected)

def select_batch_hybrid(X_unlabeled, X_labeled, scores, k, alpha=0.5):
    """Uncertainty x diversity: best of both worlds."""
    norm_scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-10)
    selected = []
    for _ in range(k):
        if len(selected) == 0:
            ref_points = X_labeled
        else:
            ref_points = np.vstack([X_labeled, X_unlabeled[selected]])
        dists = cdist(X_unlabeled, ref_points).min(axis=1)
        norm_dists = (dists - dists.min()) / (dists.max() - dists.min() + 1e-10)
        combined = alpha * norm_scores + (1 - alpha) * norm_dists
        combined[selected] = -1
        selected.append(np.argmax(combined))
    return np.array(selected)

The visual difference is striking (try the Batch Diversity Explorer demo below): top-k creates a tight cluster of queries, k-centers spreads points across the entire space including regions the model already understands, and the hybrid places queries near decision boundaries but spreads them across different boundary regions. In practice, batch diversity can improve data efficiency by 20-40% over naive top-k selection.
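The redundancy problem is easy to reproduce. In this toy setup (the data and acquisition scores are synthetic, invented for illustration), uncertainty is concentrated near a vertical boundary, and a greedy k-centers pick spreads much wider than naive top-k:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
scores = -np.abs(X[:, 0])          # hypothetical: most uncertain near x0 = 0

topk = np.argsort(scores)[-20:]    # naive top-k batch

# Greedy k-centers over the same pool, seeded at the most uncertain point
selected = [int(np.argmax(scores))]
for _ in range(19):
    d = cdist(X, X[selected]).min(axis=1)
    d[selected] = -1               # exclude already selected
    selected.append(int(np.argmax(d)))

print(f"top-k mean pairwise distance:     {pdist(X[topk]).mean():.2f}")
print(f"k-centers mean pairwise distance: {pdist(X[selected]).mean():.2f}")
```

The top-k batch hugs the x0 = 0 strip, so its points are close together; k-centers pushes every subsequent pick toward the least-covered region.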

When Active Learning Fails — And How to Fix It

Active learning isn't magic. It has well-known failure modes, and understanding them is crucial for deploying it in practice.

Failure 1: Miscalibration. If the model is overconfident on its wrong predictions, uncertainty sampling will never query those examples — it thinks it already knows the answer. This is the most insidious failure mode because it's invisible: the model confidently ignores an entire region of the input space. Fix: calibrate the model before computing acquisition scores. Temperature scaling (from uncertainty quantification) is cheap and effective.

Failure 2: Sampling bias. By exclusively querying near the decision boundary, you build a labeled set that's unrepresentative of the true data distribution. Entire clusters of data might never get labeled. Fix: use an ε-greedy strategy — with probability ε, query a random point instead of the highest-scoring one. Even 10-20% random exploration prevents catastrophic blind spots.

Failure 3: Outlier attraction. Outliers and noisy examples have the highest uncertainty — they're inherently hard to classify because they don't belong to any clear class. Querying them wastes your budget on uninformative examples. Fix: multiply the acquisition score by a density estimate. A point must be both uncertain and in a dense region to be queried.

Failure 4: Cold start. When the initial model has seen only a handful of examples, its uncertainty estimates are meaningless. The first 10-20 queries are essentially random. Fix: initialize with a diverse seed set, selected by clustering the unlabeled pool and picking one example per cluster.
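The cold-start fix can be sketched with k-means: cluster the unlabeled pool and seed with the point nearest each centroid. The `diverse_seed` helper is my name for this, not a library function:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def diverse_seed(X_pool, n_seed, random_state=0):
    """Pick one representative per cluster: the point nearest each centroid."""
    km = KMeans(n_clusters=n_seed, n_init=10,
                random_state=random_state).fit(X_pool)
    return np.array([int(np.argmin(cdist(X_pool, km.cluster_centers_[[c]])))
                     for c in range(n_seed)])

rng = np.random.RandomState(0)
X_pool = rng.randn(200, 5)
seed_idx = diverse_seed(X_pool, n_seed=10)
print(seed_idx.shape)  # (10,)
```

Because the seed covers every major mode of the data, the first trained model's uncertainty estimates are already meaningful, and the acquisition function has something real to work with from query one.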

def hybrid_active_learning(X_pool, y_pool, X_seed, y_seed,
                           n_queries=50, epsilon=0.15):
    """Active learning with random exploration to avoid sampling bias."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_mask = np.ones(len(X_pool), dtype=bool)
    accuracies = []
    rng = np.random.RandomState(42)

    for _ in range(n_queries):
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        accuracies.append(model.score(X_pool, y_pool))

        pool_indices = np.where(pool_mask)[0]
        if rng.random() < epsilon:
            # Explore: random query
            query_idx = rng.choice(pool_indices)
        else:
            # Exploit: uncertainty query
            probs = model.predict_proba(X_pool[pool_indices])
            uncertainty = entropy_sampling(probs)
            query_idx = pool_indices[np.argmax(uncertainty)]

        X_train = np.vstack([X_train, X_pool[query_idx:query_idx+1]])
        y_train = np.append(y_train, y_pool[query_idx])
        pool_mask[query_idx] = False

    return accuracies

The ε-greedy fix is simple but remarkably effective. It echoes the same principle from bandit algorithms: pure exploitation (always querying the most uncertain point) can miss important information, just as always pulling the best-known arm misses better options. A small amount of random exploration prevents the learner from developing a distorted view of the world.
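The density-weighting fix for Failure 3 (outlier attraction) can be sketched as follows; the RBF-style similarity and the beta exponent are my choices, loosely following the information-density idea:

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_weighted(scores, X_unlabeled, beta=1.0):
    """Down-weight outliers: multiply each acquisition score by the point's
    average similarity to the rest of the pool, raised to beta."""
    sim = np.exp(-cdist(X_unlabeled, X_unlabeled))   # similarity in [0, 1]
    density = sim.mean(axis=1)
    return scores * density ** beta

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), [[8.0, 8.0]]])      # 50 inliers + 1 far outlier
scores = np.ones(51)                                 # equal raw uncertainty
weighted = density_weighted(scores, X)
print(weighted[-1] < weighted[:50].min())            # True: outlier down-weighted
```

With equal raw uncertainty everywhere, the isolated point drops to the bottom of the ranking, which is exactly the behavior you want: a query must be both uncertain and representative.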

Try It: Active Learning Arena

Two learners race to classify overlapping blob data. The active learner picks the most uncertain points near the decision boundary; the random learner picks blindly. Watch how strategic querying finds the boundary faster with fewer labels.


Active Learning for Modern ML

The strategies above were developed for classical ML, but they're more relevant than ever in the age of foundation models.

LLM fine-tuning data selection. When curating instruction-tuning datasets (as in instruction tuning), which examples should you include? Active learning over the embedding space can identify the most informative fine-tuning examples — ones where the base model's uncertainty is highest or where different model checkpoints disagree.

Foundation model adaptation. With LoRA and other parameter-efficient methods, fine-tuning is cheap — but labeling data for the target domain isn't. Active learning selects which examples to annotate, using the foundation model's embeddings as features. You don't need to retrain from scratch; just compute embedding-space uncertainty.

LLM-as-annotator + active learning. An emerging pattern: use a strong LLM (like GPT-4) to label examples where a smaller model is confident, and route only the uncertain examples to human annotators. This creates a two-tier labeling system that's both cheap and accurate — active learning decides which tier handles each example.
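The routing decision at the heart of that pattern can be sketched with a plain confidence threshold on the small model's probabilities (the threshold value and tier names are illustrative):

```python
import numpy as np

def route(probs, threshold=0.85):
    """Send examples where the small model is confident to the LLM tier,
    and the uncertain rest to human annotators."""
    conf = probs.max(axis=1)
    return np.where(conf >= threshold, "llm", "human")

probs = np.array([[0.99, 0.01], [0.60, 0.40]])
print(route(probs))  # ['llm' 'human']
```

In practice the threshold should be set against the small model's calibration curve, since an overconfident model will route examples to the cheap tier that actually needed a human.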

Human-in-the-loop tools. Modern data labeling platforms like Prodigy, Label Studio, and Argilla integrate active learning directly into their annotation workflows. The annotator sees the most informative examples first, accelerating the creation of high-quality datasets.

Try It: Batch Diversity Explorer

Three batch selection strategies pick from the same unlabeled pool. Top-K greedily takes the most uncertain points (redundant!). K-Centers maximizes coverage (but ignores uncertainty). Hybrid balances both. Adjust batch size to see the effect.


Conclusion

Active learning answers a deceptively simple question: "If I can only label N examples, which N should they be?" The answer — let the model choose — leads to a rich family of strategies that all share a common theme: information is not uniformly distributed across your data.

Some examples are worth a hundred labels. Others teach the model nothing it didn't already know. Active learning finds the valuable ones by measuring what the model doesn't know — through uncertainty, disagreement, expected model change, or information gain — and directing human effort there.

The practical takeaway: if you're building any ML system where labeling is expensive (and when isn't it?), active learning should be your first optimization. Before you try fancier models, bigger architectures, or cleverer features — first, label smarter data.

References & Further Reading