
Curriculum Learning from Scratch

1. Why Training Order Matters

You learned to add before you learned calculus. You crawled before you walked. Every good teacher sequences lessons from simple to complex, building on what the student already knows. But when we train neural networks, we shuffle the dataset and present examples in completely random order every epoch. What if that's leaving performance on the table?

In 2009, Yoshua Bengio and colleagues formalized this intuition in a paper called simply Curriculum Learning. Their key idea: a curriculum is a sequence of training distributions where the entropy increases over time. In plain English, you start with a small pool of easy examples and gradually expand to include harder ones. The support of each distribution is a subset of the next — nothing is removed, only added.
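The nested-support idea is easy to make concrete. Here's a minimal sketch (curriculum_stages is a hypothetical helper, not from the paper): sort examples by a difficulty score, then build stage pools that only ever grow.

```python
import numpy as np

def curriculum_stages(difficulty, fractions=(0.25, 0.5, 1.0)):
    """Build stage pools with nested support: each stage contains
    everything from the previous stage plus the next-harder examples.
    (A hypothetical helper illustrating Bengio's definition.)"""
    order = np.argsort(difficulty)  # easiest first
    return [order[:max(1, int(f * len(difficulty)))] for f in fractions]

diff = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4])
stages = curriculum_stages(diff)
for earlier, later in zip(stages, stages[1:]):
    assert set(earlier) <= set(later)  # support only ever grows
```

Each stage's pool is a superset of the previous one, so nothing is ever removed as the curriculum advances.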

Their experiments confirmed the intuition. On a shape recognition task, training only on "easy" shapes (regular squares, equilateral triangles, perfect circles) before introducing distorted versions led to lower test error than training on everything at once. On language modeling with Wikipedia text, starting with sentences containing only common words and gradually expanding the vocabulary produced better convergence.

Let's build a dataset that makes the concept concrete. We'll create a 2D classification problem with clearly easy and hard regions:

import numpy as np

def make_curriculum_dataset(n=200, seed=42):
    rng = np.random.RandomState(seed)

    # Class 0: cluster centered at (-2, 0)
    x0 = rng.randn(n // 2, 2) * 0.8 + np.array([-2, 0])
    # Class 1: cluster centered at (+2, 0)
    x1 = rng.randn(n // 2, 2) * 0.8 + np.array([2, 0])

    X = np.vstack([x0, x1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))

    # Difficulty = inverse distance to decision boundary (x=0 line)
    difficulty = 1.0 / (np.abs(X[:, 0]) + 0.1)

    # Normalize to [0, 1] range
    difficulty = (difficulty - difficulty.min()) / (difficulty.max() - difficulty.min())

    return X, y, difficulty

X, y, difficulty = make_curriculum_dataset()
easy_mask = difficulty < 0.3   # examples well away from the boundary
hard_mask = difficulty >= 0.3  # examples near the boundary

print(f"Easy examples: {easy_mask.sum()}, Hard examples: {hard_mask.sum()}")
# Easy examples: 155, Hard examples: 45

The difficulty score is based on proximity to the decision boundary at x=0. Points far from the boundary are easy — the model barely needs to think. Points near the boundary are hard — they require a precise decision surface. This simple scoring already captures the key insight: not all training examples contribute equally to learning.

2. Continuation Methods: The Theory Behind Easy-First

Why would presenting easy examples first help? The answer lies in a beautiful connection to continuation methods from optimization theory. A continuation method (also called a homotopy method) solves a hard optimization problem by first solving an easy version, then gradually morphing it into the hard one:

Start with a smoothed, convex objective F_easy. Find its minimum. Then gradually deform it into the true, non-convex objective F_hard, using each solution to initialize the next step.

Formally, we construct a family of objectives parameterized by λ ∈ [0, 1]:

F_λ(θ) = (1 − λ) · F_easy(θ) + λ · F_hard(θ)

When λ = 0, we have the easy problem (smooth landscape, few local minima). When λ = 1, we have the original hard problem. The magic is that by tracking the solution as λ increases, we stay near a good basin of attraction rather than getting trapped in a bad local minimum from a random start.
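A toy sketch makes this tangible. The double-well objective, the convex surrogate, and the schedule below are all illustrative assumptions, not from any paper: direct gradient descent from a bad start gets trapped, while tracking the minimum as λ grows does not.

```python
import numpy as np

def f_hard(x):
    """Tilted double well: local min near x = +2, global min near x = -2."""
    return x**4 - 8 * x**2 + x

def grad_hard(x):
    return 4 * x**3 - 16 * x + 1

def grad_easy(x):
    """Gradient of the smooth convex surrogate F_easy(x) = x^2."""
    return 2 * x

def descend(grad_fn, x, lr=0.01, steps=300):
    for _ in range(steps):
        x -= lr * grad_fn(x)
    return x

x0 = 2.5  # start inside the basin of the *bad* local minimum

# Direct gradient descent on F_hard gets trapped near x = +2
x_direct = descend(grad_hard, x0)

# Continuation: minimize F_lambda = (1 - lam) * F_easy + lam * F_hard,
# warm-starting each subproblem from the previous solution
x = x0
for lam in np.linspace(0.0, 1.0, 11):
    x = descend(lambda z: (1 - lam) * grad_easy(z) + lam * grad_hard(z), x)
x_cont = x

print(f"direct:       x = {x_direct:+.2f}, F = {f_hard(x_direct):.2f}")
print(f"continuation: x = {x_cont:+.2f}, F = {f_hard(x_cont):.2f}")
```

Direct descent stalls in the shallow basin near x = +2 (F ≈ −14), while the continuation path ends in the deeper basin near x = −2 (F ≈ −18): same final objective, different start, better minimum.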

How does this map to curriculum learning? When a model trains only on easy examples, the effective loss landscape is smoother. Easy examples have less noise, require simpler decision boundaries, and produce fewer conflicting gradients. The loss surface has fewer local minima and broader basins. As harder examples are gradually introduced, the landscape becomes more rugged — but the model is already in a good neighborhood, guided there by the easy examples.

Think of it like hiking in fog. If you start at the bottom of a smooth valley (easy problem), you naturally walk toward the lowest point. As the terrain gets rockier (hard examples added), you're already close to where you want to be. But if you started on the rocky terrain from the beginning, you might wander into a side canyon and never find the true valley floor.

Weinshall et al. (2018) proved this formally for convex losses: the convergence rate of SGD decreases monotonically with example difficulty, so easier examples contribute most to early convergence. For non-convex neural networks, the theory is less complete, but the empirical evidence is strong — and the continuation method intuition explains why.

3. Defining "Easy": Difficulty Metrics

The central question in curriculum learning is: who decides what's easy? There are surprisingly many ways to measure example difficulty, and they tend to agree with each other.

Loss-based scoring is the simplest: an example's difficulty is its current loss. High loss = hard, low loss = easy. This is used directly in self-paced learning (Section 4). The downside: difficulty depends on the model's current state, so it changes during training.

Transfer teacher scoring (Hacohen & Weinshall, 2019) trains a separate "teacher" model and uses its confidence on each example as the difficulty score. Low confidence = hard. This gives a fixed difficulty ranking independent of the learner's current state.
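A sketch of transfer teacher scoring with a logistic teacher (teacher_difficulty and its interface are hypothetical, not the paper's code): train the teacher separately, then score each example by one minus the teacher's probability for its true label.

```python
import numpy as np

def teacher_difficulty(X, y, n_epochs=50, lr=0.1, seed=0):
    """Transfer-teacher scoring (a sketch): train a separate logistic
    'teacher', then rate each example by the teacher's confidence in
    its true label. Low confidence = hard, and the ranking is fixed."""
    rng = np.random.RandomState(seed)
    w, b = rng.randn(X.shape[1]) * 0.01, 0.0
    for _ in range(n_epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)   # full-batch gradient step
        b -= lr * g.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    conf = np.where(y == 1, p, 1.0 - p)  # teacher's prob of the true class
    return 1.0 - conf

# Points near x = 0 should score as harder than points far from it
X_demo = np.array([[-3.0, 0.0], [-0.2, 0.0], [0.2, 0.0], [3.0, 0.0]])
y_demo = np.array([0, 0, 1, 1])
d = teacher_difficulty(X_demo, y_demo)
assert d[1] > d[0] and d[2] > d[3]
```

Because the teacher is trained once, the resulting ranking stays fixed no matter how the learner evolves.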

Forgetting events (Toneva et al., 2019) track how many times an example flips from correctly classified to incorrectly classified during training. An example that's learned once and never forgotten is "easy." An example that's repeatedly forgotten is "hard." On CIFAR-10, a striking 31% of examples are never forgotten once learned.

EL2N scores (Paul et al., 2021) measure the L2 norm of the error vector, ||f(x; θ) − y_onehot||_2, averaged over a few early training epochs. This cheap computation identifies informative examples remarkably well.
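A minimal EL2N computation, assuming you have stored per-epoch softmax outputs (the array shapes and toy numbers below are illustrative, not from the paper):

```python
import numpy as np

def el2n_scores(probs_over_epochs, y_onehot):
    """EL2N sketch: L2 norm of (predicted probs - one-hot label),
    averaged over a few early epochs or seeds.
    probs_over_epochs: shape (n_epochs, n_examples, n_classes)"""
    errors = probs_over_epochs - y_onehot[None, :, :]
    return np.linalg.norm(errors, axis=2).mean(axis=0)

# Toy data: 3 examples, 2 classes, predictions from 2 early epochs
probs = np.array([[[0.90, 0.10], [0.60, 0.40], [0.20, 0.80]],
                  [[0.95, 0.05], [0.55, 0.45], [0.30, 0.70]]])
y_onehot = np.eye(2)[[0, 0, 0]]  # true class is 0 for all three

scores = el2n_scores(probs, y_onehot)
# The confidently-wrong example gets the highest (hardest) score
assert scores[2] > scores[1] > scores[0]
```

Averaging over a few epochs (or random seeds) stabilizes the ranking without training to completion.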

Let's compute multiple difficulty metrics and see how they correlate:

import numpy as np

def compute_difficulty_metrics(X, y, n_epochs=10, seed=42):
    rng = np.random.RandomState(seed)
    n = len(X)

    # Simple linear model: w^T x + b
    w = rng.randn(2) * 0.01
    b = 0.0
    lr = 0.05

    losses_over_time = np.zeros((n_epochs, n))
    forgetting_counts = np.zeros(n)
    prev_correct = np.zeros(n, dtype=bool)

    for epoch in range(n_epochs):
        for i in range(n):
            logit = X[i] @ w + b
            prob = 1.0 / (1.0 + np.exp(-logit))
            loss = -y[i] * np.log(prob + 1e-8) - (1 - y[i]) * np.log(1 - prob + 1e-8)
            losses_over_time[epoch, i] = loss

            # Track forgetting events
            correct = (prob > 0.5) == y[i]
            if epoch > 0 and prev_correct[i] and not correct:
                forgetting_counts[i] += 1
            prev_correct[i] = correct

            # SGD update
            grad = prob - y[i]
            w -= lr * grad * X[i]
            b -= lr * grad

    avg_loss = losses_over_time.mean(axis=0)        # Loss-based difficulty
    margin_dist = np.abs(X[:, 0])                    # Margin-based difficulty
    return avg_loss, 1.0 / (margin_dist + 0.1), forgetting_counts

avg_loss, margin_diff, forgetting = compute_difficulty_metrics(X, y)
# All three metrics correlate: examples near the boundary
# have high avg_loss, high margin_diff, and more forgetting events

The reassuring finding: these metrics largely agree. Examples that are hard by one measure tend to be hard by others. "Easy" is a robust concept — it's not an artifact of any particular scoring method.
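To put a number on "largely agree", a rank correlation is the natural check. Here's a minimal Spearman implementation (a sketch that ignores tie handling):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (No tie handling; fine for a sketch with continuous scores.)"""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

# Perfectly monotone relationships score +1 / -1
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(spearman(x, x), spearman(x, -x))  # 1.0 -1.0
```

Running, say, spearman(avg_loss, margin_diff) on the arrays computed above gives a single agreement number per metric pair.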

4. Self-Paced Learning: Let the Model Decide

Manually defining a curriculum requires domain knowledge — someone has to decide what's "easy." Self-paced learning (SPL), introduced by Kumar et al. in 2010, flips this: the model decides its own curriculum by selecting examples based on their current loss.

The SPL objective adds binary selection weights v_i ∈ {0, 1} to each example:

min_{w, v} ∑_i v_i · L(y_i, f(x_i; w)) − λ ∑_i v_i

The first term says: minimize the loss, but only on selected examples. The second term, −λ ∑_i v_i, rewards including more examples (without it, the trivial solution is v_i = 0 for all i). The pace parameter λ controls the difficulty threshold.

The optimization alternates between three steps:

  1. Fix w, update v: For each example, set v_i = 1 if L(y_i, f(x_i; w)) < λ, else v_i = 0. Include examples the model finds easy.
  2. Fix v, update w: Train the model using only selected examples (where v_i = 1).
  3. Anneal λ: Increase λ by a factor μ > 1. This raises the difficulty threshold, admitting harder examples.

As λ grows, the model progressively trains on harder examples — exactly the easy-to-hard curriculum, but determined automatically by the model's own competence. Here's a complete implementation:

import numpy as np

def sigmoid(z):
    # Clip logits so np.exp never overflows (exp(60) is already ~1e26)
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

def self_paced_learning(X, y, n_epochs=30, lambda_init=0.5, mu=1.3, lr=0.1, seed=42):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = rng.randn(d) * 0.01
    b = 0.0
    lam = lambda_init

    history = []
    for epoch in range(n_epochs):
        # Step 1: Compute losses and select examples (fix w, update v)
        logits = X @ w + b
        probs = sigmoid(logits)
        losses = -y * np.log(probs + 1e-8) - (1 - y) * np.log(1 - probs + 1e-8)
        v = (losses < lam).astype(float)  # Select easy examples

        n_selected = int(v.sum())
        acc = ((probs > 0.5) == y).mean()
        history.append({'epoch': epoch, 'lambda': lam,
                        'n_selected': n_selected, 'accuracy': acc})

        # Step 2: Train on selected examples only (fix v, update w)
        if n_selected > 0:
            selected = v > 0
            for i in np.where(selected)[0]:
                grad = sigmoid(X[i] @ w + b) - y[i]
                w -= lr * grad * X[i]
                b -= lr * grad

        # Step 3: Anneal lambda (raise difficulty threshold)
        lam *= mu

    return w, b, history

w, b, history = self_paced_learning(X, y)
for h in history[::5]:
    print(f"Epoch {h['epoch']:2d}: lambda={h['lambda']:.2f}, "
          f"selected={h['n_selected']:3d}/{len(X)}, acc={h['accuracy']:.3f}")
# Epoch  0: lambda=0.50, selected= 82/200, acc=0.510
# Epoch  5: lambda=1.86, selected=173/200, acc=0.855
# Epoch 10: lambda=6.88, selected=200/200, acc=0.960
# ...the model grows its own curriculum from easy to hard

Watch the progression: at epoch 0, only 82 of 200 examples pass the threshold — the easiest ones far from the boundary. By epoch 5, the model has learned enough that 173 examples are "easy" enough to include. By epoch 10, all examples are selected and the model has 96% accuracy. The model literally taught itself an easy-to-hard curriculum.

Platanios et al. (2019) extended this idea with competence-based curricula for neural machine translation. They defined a competence function c(t) that determines what fraction of data to use at step t, using a square-root schedule: c(t) = min(1, √(c_0² + (1 − c_0²) · t/T)). Their "Baby Step" method sorts examples by difficulty and includes only the easiest c(t) fraction at each step — achieving up to 70% faster training and a 2.2 BLEU improvement on translation tasks.
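The square-root schedule is a one-liner. The sketch below implements it along with a hypothetical batch sampler over difficulty-sorted data (competence_batch and its interface are illustrative assumptions):

```python
import numpy as np

def competence(t, T, c0=0.01):
    """Square-root competence schedule: fraction of the
    difficulty-sorted data available at training step t."""
    return min(1.0, np.sqrt(c0**2 + (1 - c0**2) * t / T))

def competence_batch(X_sorted, t, T, batch_size, rng):
    """Sample a batch uniformly from the easiest c(t) fraction (a sketch)."""
    n_avail = max(1, int(competence(t, T) * len(X_sorted)))
    idx = rng.randint(0, n_avail, size=batch_size)
    return X_sorted[idx]

T = 1000
for t in [0, 250, 500, 1000]:
    print(f"t={t}: c={competence(t, T):.3f}")
# t=0: c=0.010
# t=250: c=0.500
# t=500: c=0.707
# t=1000: c=1.000
```

Note the shape: the pool grows fastest early on, when each new example changes the model most, and levels off as the full dataset comes into play.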

Try It: Curriculum vs Random Training

Watch two neural networks train on the same dataset. The left network sees easy examples first (sorted by distance from boundary). The right network sees examples in random order. Click Train to compare.


5. Anti-Curriculum: When Hard-First Wins

Before you reorganize every training pipeline to sort by difficulty, a reality check: easy-first doesn't always win. Sometimes the hard examples are exactly what the model needs most.

Focal loss (Lin et al., 2017) is an implicit anti-curriculum. Designed for object detection where most candidate boxes contain background (easy negatives), it downweights easy examples and focuses on hard ones:

import numpy as np

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss: upweights hard examples, downweights easy ones.

    p_true: model's predicted probability for the correct class
    gamma: focusing parameter (0 = standard cross-entropy)
    alpha: class-balancing weight
    """
    return -alpha * (1 - p_true) ** gamma * np.log(p_true + 1e-8)

# Compare focal loss for easy vs hard examples
p_values = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # model confidence
for gamma in [0, 1, 2, 5]:
    losses = focal_loss(p_values, gamma=gamma)
    weights = (1 - p_values) ** gamma  # the modulating factor alone
    print(f"gamma={gamma}: weights = {np.round(weights, 4)}")
# gamma=0: weights = [1.     1.     1.     1.     1.    ]  (standard CE)
# gamma=1: weights = [0.9    0.7    0.5    0.3    0.1   ]  (mild focusing)
# gamma=2: weights = [0.81   0.49   0.25   0.09   0.01  ]  (strong focusing)
# gamma=5: weights = [0.5905 0.1681 0.0312 0.0024 0.0   ]  (extreme)

With γ = 2, an easy example (p = 0.9) contributes only 1% of its original loss, while a hard example (p = 0.1) retains 81%. The model spends almost all of its learning budget on the hardest cases.

Wu et al. (2021) ran over 540 training runs to systematically test when curricula help. Their findings were nuanced: with a standard training budget and clean labels, curriculum, anti-curriculum, and random ordering all performed about equally well; curricula showed clear benefits mainly when the training budget was limited or the labels were noisy.

The takeaway: there's no universal winner. The right strategy depends on your data quality, compute budget, and class distribution. But understanding why each approach works lets you make informed choices.

6. Data Pruning: Modern Curriculum at Scale

Curriculum learning's modern incarnation isn't about ordering — it's about permanently removing uninformative examples. If easy examples are "learned and never forgotten," do you even need them?

Toneva et al. (2019) introduced forgetting events — counting how many times each example flips from correctly to incorrectly classified during training. Their striking finding: about 31% of CIFAR-10 examples are never forgotten once learned, and those unforgettable examples can be removed from the training set with almost no impact on test accuracy.

Think about that: you can throw away nearly a third of CIFAR-10 and the model doesn't notice. Those examples were never providing useful gradient signal after the first few epochs.

import numpy as np

def compute_forgetting_events(X, y, n_epochs=20, lr=0.1, seed=42):
    """Track forgetting events for each example across training."""
    rng = np.random.RandomState(seed)
    n = len(X)
    w = rng.randn(X.shape[1]) * 0.01
    b = 0.0

    forgetting_counts = np.zeros(n)
    prev_correct = np.zeros(n, dtype=bool)

    for epoch in range(n_epochs):
        # Evaluate all examples
        logits = X @ w + b
        preds = logits > 0
        correct = preds == y

        # Count forgetting: was correct, now wrong
        if epoch > 0:
            forgot = prev_correct & ~correct
            forgetting_counts += forgot

        prev_correct = correct.copy()

        # SGD update (random order)
        order = rng.permutation(n)
        for i in order:
            prob = 1.0 / (1.0 + np.exp(-X[i] @ w - b))
            grad = prob - y[i]
            w -= lr * grad * X[i]
            b -= lr * grad

    # Classify examples
    unforgettable = forgetting_counts == 0
    print(f"Unforgettable: {unforgettable.sum()}/{n} ({unforgettable.mean():.1%})")
    print(f"Forgotten 1+ times: {(forgetting_counts > 0).sum()}/{n}")
    return forgetting_counts

forgetting = compute_forgetting_events(X, y)
# Unforgettable: 168/200 (84.0%)
# Forgotten 1+ times: 32/200

Paul et al. (2021) pushed this further with EL2N scores — measuring the error vector's L2 norm after just a few training epochs. The key insight: you don't need to train to completion to identify which examples matter. A few epochs of training with multiple random seeds gives you a stable importance ranking. They pruned 50% of CIFAR-10 while maintaining test accuracy by keeping the highest-scoring (hardest) examples.
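A minimal score-based pruning helper might look like this (prune_by_score and its interface are a sketch, not the paper's code): sort by difficulty score and keep only the hardest fraction.

```python
import numpy as np

def prune_by_score(X, y, scores, keep_frac=0.5):
    """EL2N-style pruning sketch: keep the hardest keep_frac of
    examples, where higher score = harder / more informative."""
    n_keep = max(1, int(keep_frac * len(y)))
    keep = np.argsort(scores)[-n_keep:]  # indices of the highest scores
    return X[keep], y[keep]

X = np.arange(10).reshape(5, 2).astype(float)
y = np.array([0, 1, 0, 1, 0])
scores = np.array([0.1, 0.9, 0.5, 0.8, 0.2])
X_kept, y_kept = prune_by_score(X, y, scores, keep_frac=0.4)
assert len(y_kept) == 2  # only the two highest-scoring examples survive
```

In practice the scores would come from a few epochs of training averaged over several seeds, as Paul et al. describe.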

Here's the paradox that ties everything together: curriculum learning says train on easy examples first. Data pruning says keep the hard examples permanently. These aren't contradictory — they address different phases of learning. Easy examples are useful early (for finding a good basin), but become redundant later. Hard examples are useless early (the model can't learn from them yet), but become essential later. The optimal strategy combines both: an easy-to-hard curriculum during training, and aggressive pruning of the easy examples once the model has matured.
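That combined strategy can be sketched as a hypothetical training loop (curriculum_then_prune, train_step, and the warmup/prune split are all illustrative assumptions, not an established recipe):

```python
import numpy as np

def curriculum_then_prune(X, y, difficulty, train_step,
                          n_epochs=20, warmup=10, keep_frac=0.5):
    """Sketch combining both phases: an expanding easy-first pool
    during warmup, then only the hardest keep_frac once the model
    has matured. train_step(X, y) is assumed to run one epoch."""
    order = np.argsort(difficulty)  # easiest first
    n = len(y)
    for epoch in range(n_epochs):
        if epoch < warmup:
            # curriculum phase: grow the pool from easy toward hard
            frac = (epoch + 1) / warmup
            pool = order[:max(1, int(frac * n))]
        else:
            # pruning phase: drop the easy examples for good
            pool = order[-max(1, int(keep_frac * n)):]
        train_step(X[pool], y[pool])
```

The pool expands easy-to-hard for the first half of training, then permanently shrinks to the hard examples — each phase using the examples when they are most informative.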

Try It: Self-Paced Learner

Watch which examples the model considers easy (green) vs hard (red). Large circles are selected for training; small circles are excluded. As training progresses and λ increases, the model "reaches" from easy toward hard examples.


7. Conclusion

Curriculum learning reveals a simple but profound truth: training data isn't created equal. The order in which examples are presented, and which examples are presented at all, affects how well a model learns.

The theoretical foundation is elegant — curriculum learning is a continuation method that smooths the loss landscape, guiding gradient descent toward better minima. Self-paced learning automates the curriculum by letting the model decide what it's ready to learn. Data pruning takes this to its logical conclusion: if an example is never forgotten, it was never needed.

And the anti-curriculum perspective keeps us honest: sometimes the hard examples are exactly what matters most. The right strategy depends on your data, your compute budget, and your noise level. But the deeper insight is this: understanding which examples are easy and which are hard tells you as much about your model as it does about your data. It's a diagnostic tool as much as a training technique.

Next time you shuffle(dataset) before training, ask yourself: is random really the best you can do?

References & Further Reading