Continual Learning from Scratch
1. The Catastrophic Forgetting Problem
Train a neural network to classify cats vs dogs. It reaches 95% accuracy. Now train the same network on cars vs trucks. It hits 93% on vehicles — but test it on cats vs dogs again, and accuracy plummets to 52%, barely better than a coin flip. The network didn't learn vehicles in addition to animals. It learned vehicles instead of animals.
This is catastrophic forgetting, and it's one of the most fundamental problems in machine learning. When you train a neural network on a new task, gradient descent happily overwrites the weights that were important for the old task. Each gradient step points toward the new objective, with no regard for past knowledge. It's like writing new notes on a tiny whiteboard — eventually you erase everything that was there before.
Biological brains don't have this problem. You learned to ride a bicycle decades ago and you can still do it, even though you've learned thousands of new skills since. The brain uses complementary learning systems: the hippocampus rapidly encodes new experiences while the neocortex slowly consolidates them into long-term knowledge. Neural networks have no such mechanism by default.
The core tension is the stability-plasticity dilemma. Too much plasticity and the network forgets old tasks (catastrophic forgetting). Too much stability and it can't learn new ones (intransigence). Every continual learning method is, at its heart, a different answer to this tradeoff.
Researchers distinguish three increasingly difficult scenarios:
- Task-incremental: the model is told which task it's solving at test time (easiest)
- Domain-incremental: same task structure, but the input distribution shifts over time
- Class-incremental: new classes appear over time, and the model must distinguish all classes seen so far (hardest)
Let's see catastrophic forgetting in action. We'll train a simple two-layer network on two sequential 2D classification tasks and watch task 1 accuracy collapse:
import numpy as np

def make_task(center_a, center_b, n=100, seed=42):
    rng = np.random.RandomState(seed)
    X = np.vstack([rng.randn(n, 2) * 0.5 + center_a,
                   rng.randn(n, 2) * 0.5 + center_b])
    y = np.array([0]*n + [1]*n)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_mlp(X, y, W1, b1, W2, b2, epochs=200, lr=0.05):
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)   # ReLU hidden
        out = sigmoid(h @ W2 + b2)       # sigmoid output
        err = out.ravel() - y
        dW2 = h.T @ err.reshape(-1, 1) / len(y)
        db2 = err.mean()
        dh = err.reshape(-1, 1) * W2.T * (h > 0)  # ReLU grad
        dW1 = X.T @ dh / len(y)
        db1 = dh.mean(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2

def accuracy(X, y, W1, b1, W2, b2):
    h = np.maximum(0, X @ W1 + b1)
    pred = (sigmoid(h @ W2 + b2).ravel() > 0.5).astype(int)
    return (pred == y).mean()

# Two tasks with different cluster locations
X1, y1 = make_task([-2, -2], [2, 2], seed=42)   # diagonal clusters
X2, y2 = make_task([-2, 2], [2, -2], seed=99)   # anti-diagonal

rng = np.random.RandomState(0)
W1 = rng.randn(2, 8) * 0.3; b1 = np.zeros(8)
W2 = rng.randn(8, 1) * 0.3; b2 = np.zeros(1)

W1, b1, W2, b2 = train_mlp(X1, y1, W1, b1, W2, b2)
print(f"After Task 1: acc_task1={accuracy(X1, y1, W1, b1, W2, b2):.0%}")

W1, b1, W2, b2 = train_mlp(X2, y2, W1, b1, W2, b2)
print(f"After Task 2: acc_task1={accuracy(X1, y1, W1, b1, W2, b2):.0%}, "
      f"acc_task2={accuracy(X2, y2, W1, b1, W2, b2):.0%}")
# Output:
# After Task 1: acc_task1=99%
# After Task 2: acc_task1=50%, acc_task2=99%
After task 2 training, the network has completely forgotten task 1. The weights that encoded the diagonal decision boundary have been overwritten to encode the anti-diagonal one. This is the problem every method in this post tries to solve.
2. Elastic Weight Consolidation (EWC)
In 2017, James Kirkpatrick and colleagues at DeepMind asked a simple question: what if we could identify which weights matter most for the old task and penalize changes to them? Not all parameters are equally important — some can change freely without affecting task 1 performance, while others are critical. If we can tell them apart, we can protect the important ones while letting the rest adapt to the new task.
The tool for measuring parameter importance is the Fisher Information Matrix. For each parameter θ_i, the Fisher information is:
F_i = E[(∂ log p(y|x, θ) / ∂θ_i)²]
Intuitively, if changing θ_i dramatically affects the model's predictions (large gradient of the log-likelihood), then F_i is high and that parameter is important. If the gradient is near zero regardless of the input, the parameter doesn't matter much for the current task.
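As a quick sanity check on this formula (separate from the post's code), here is a Monte Carlo estimate for a one-parameter logistic model, where the Fisher information has the closed form p(1−p)x² — the score is (y − p)x, and averaging its square over labels sampled from the model recovers exactly that expression:

```python
import numpy as np

# Fisher check for p(y=1|x, theta) = sigmoid(theta * x):
# score = (y - p) * x, so F = E_y[(y - p)^2] * x^2 = p * (1 - p) * x^2.
rng = np.random.default_rng(0)
theta, x = 0.7, 1.5
p = 1.0 / (1.0 + np.exp(-theta * x))

closed_form = p * (1 - p) * x ** 2
y = rng.binomial(1, p, size=100_000)       # sample labels FROM THE MODEL
monte_carlo = np.mean(((y - p) * x) ** 2)  # average squared score

print(f"closed form: {closed_form:.4f}, Monte Carlo: {monte_carlo:.4f}")
```

Note that sampling y from the model gives the true Fisher; the `compute_fisher` below (like most practical EWC implementations) uses the observed labels instead, the so-called empirical Fisher.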
EWC adds a quadratic penalty to the loss that anchors important weights to their old-task values:
L_total = L_new_task + (λ/2) · Σ_i F_i · (θ_i − θ*_i)²
where θ* are the weights learned on the old task and λ controls the consolidation strength. This is like attaching rubber bands to the important weights — they can stretch toward the new task, but they're pulled back toward their old values, and the strength of the pull is proportional to how important each weight was.
There's a beautiful Bayesian interpretation: the Fisher Information approximates the precision (inverse variance) of a Gaussian posterior over weights. EWC is approximately Bayesian continual learning with a Laplace-approximated prior from the previous task.
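To make that Bayesian reading concrete, here is the derivation in outline (a sketch; the original paper treats multiple tasks and the role of λ more carefully):

```latex
% Sequential Bayes: after task A, the posterior over weights becomes
% the prior when learning task B.
\log p(\theta \mid D_A, D_B) = \log p(D_B \mid \theta) + \log p(\theta \mid D_A) + \mathrm{const}

% Laplace approximation: expand log p(theta | D_A) to second order around
% the task-A optimum theta*, approximating the Hessian of the negative
% log-likelihood by the diagonal Fisher F.
\log p(\theta \mid D_A) \approx -\tfrac{1}{2} \sum_i F_i \,(\theta_i - \theta_i^*)^2 + \mathrm{const}

% Maximizing this posterior is exactly minimizing the EWC objective:
\mathcal{L}(\theta) = \mathcal{L}_{\text{new\_task}}(\theta) + \tfrac{\lambda}{2} \sum_i F_i \,(\theta_i - \theta_i^*)^2
```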
The limitation is that EWC uses only the diagonal of the Fisher matrix, ignoring correlations between parameters. And as tasks accumulate, the quadratic constraints stack up, leaving less and less room for new learning. But it works remarkably well for the cost — just one vector of Fisher values per task:
def compute_fisher(X, y, W1, b1, W2, b2, n_samples=200):
    """Diagonal empirical Fisher: average squared per-example gradients
    of the log-likelihood, taken w.r.t. the observed labels."""
    fisher_W1 = np.zeros_like(W1)
    fisher_b1 = np.zeros_like(b1)
    fisher_W2 = np.zeros_like(W2)
    fisher_b2 = np.zeros_like(b2)
    idx = np.random.choice(len(y), min(n_samples, len(y)), replace=False)
    for i in idx:
        xi, yi = X[i:i+1], y[i:i+1]
        h = np.maximum(0, xi @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out.ravel() - yi
        gW2 = h.T @ err.reshape(-1, 1)
        gb2 = err
        dh = err.reshape(-1, 1) * W2.T * (h > 0)
        gW1 = xi.T @ dh
        gb1 = dh.ravel()
        fisher_W1 += gW1 ** 2
        fisher_b1 += gb1 ** 2
        fisher_W2 += gW2 ** 2
        fisher_b2 += gb2 ** 2
    n = len(idx)
    return fisher_W1/n, fisher_b1/n, fisher_W2/n, fisher_b2/n

def train_ewc(X, y, W1, b1, W2, b2, old_params, fishers, lam=1000,
              epochs=200, lr=0.05):
    oW1, ob1, oW2, ob2 = old_params
    fW1, fb1, fW2, fb2 = fishers
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out.ravel() - y
        dW2 = h.T @ err.reshape(-1,1)/len(y) + lam * fW2 * (W2 - oW2)
        db2 = err.mean() + lam * fb2 * (b2 - ob2)
        dh = err.reshape(-1,1) * W2.T * (h > 0)
        dW1 = X.T @ dh/len(y) + lam * fW1 * (W1 - oW1)
        db1 = dh.mean(axis=0) + lam * fb1 * (b1 - ob1)
        W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2
    return W1, b1, W2, b2

# Train task 1, compute Fisher, then train task 2 with EWC
rng = np.random.RandomState(0)
W1 = rng.randn(2,8)*0.3; b1 = np.zeros(8)
W2 = rng.randn(8,1)*0.3; b2 = np.zeros(1)
W1, b1, W2, b2 = train_mlp(X1, y1, W1, b1, W2, b2)

fishers = compute_fisher(X1, y1, W1, b1, W2, b2)
old_params = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
W1, b1, W2, b2 = train_ewc(X2, y2, W1, b1, W2, b2,
                           old_params, fishers, lam=1500)
print(f"EWC: acc_task1={accuracy(X1,y1,W1,b1,W2,b2):.0%}, "
      f"acc_task2={accuracy(X2,y2,W1,b1,W2,b2):.0%}")
# Output: EWC: acc_task1=88%, acc_task2=91%
Instead of 50%/99% (total forgetting), EWC gives us 88%/91% — both tasks working simultaneously. The Fisher Information acts as a selective shield, protecting the weights that matter most while allowing others to adapt.
Try It: Catastrophic Forgetting Arena
Watch naive fine-tuning destroy old decision boundaries (left) while EWC preserves them (right). Click "Next Task" to advance through three sequential 2D classification tasks.
3. Experience Replay: Learning from Memories
EWC works by constraining weight changes. But there's a simpler, more intuitive approach inspired by neuroscience: just replay old memories. During sleep, the hippocampus replays recent experiences, helping the neocortex consolidate them into long-term storage. We can do the same thing with neural networks.
Experience replay maintains a small buffer of examples from previous tasks. When training on new data, we interleave old examples from the buffer, so the network practices old skills alongside new ones. The key question is: with a fixed buffer size, how do we decide which examples to keep?
Reservoir sampling gives an elegant answer. To maintain a buffer of size k while streaming through data, we keep the first k examples, then for each subsequent example n, we include it with probability k/n, replacing a random existing entry. This guarantees that every example ever seen has equal probability of being in the buffer — no matter how many tasks we've trained on.
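That uniformity guarantee is easy to verify empirically. A standalone simulation (not the post's buffer code): stream n = 100 items through a buffer of size k = 10 many times and check that every item's inclusion rate is close to k/n = 10%:

```python
import numpy as np

# Empirical check of the reservoir-sampling guarantee: every streamed
# item should end up in the buffer with probability exactly k/n.
rng = np.random.default_rng(0)
k, n, trials = 10, 100, 5_000
counts = np.zeros(n)

for _ in range(trials):
    buf = list(range(k))              # keep the first k items
    for i in range(k, n):
        j = rng.integers(0, i + 1)    # item i takes slot j w.p. k/(i+1)
        if j < k:
            buf[j] = i
    counts[np.array(buf)] += 1

print(f"inclusion rate: mean={counts.mean()/trials:.3f} (target {k/n:.3f}), "
      f"spread=[{counts.min()/trials:.3f}, {counts.max()/trials:.3f}]")
```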
A related idea is pseudo-rehearsal (Robins, 1995): instead of storing real data, train a generative model on previous tasks and generate synthetic replay examples. This avoids storing real data (a privacy benefit) but requires a decent generative model — a chicken-and-egg problem if your generator also forgets.
Lopez-Paz and Ranzato (2017) introduced Gradient Episodic Memory (GEM), which uses the replay buffer not for training but as a constraint: compute gradients on both new and buffered data, and project the new-task gradient so it doesn't increase the loss on old examples. A-GEM simplifies this by using a single random batch from the buffer rather than all stored tasks.
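The projection step at the heart of A-GEM is a one-liner. A sketch (the function name and example values are mine, not from the paper's code): if the proposed gradient g has negative inner product with the replay-batch gradient g_ref, subtract the conflicting component so the update cannot increase the replay loss to first order:

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM projection: if the new-task gradient g conflicts with the
    replay-batch gradient g_ref (negative dot product), remove the
    component of g along g_ref."""
    dot = g @ g_ref
    if dot >= 0:
        return g                                # no conflict, keep g
    return g - (dot / (g_ref @ g_ref)) * g_ref  # project out the conflict

g = np.array([1.0, -2.0])     # new-task gradient (illustrative values)
g_ref = np.array([1.0, 1.0])  # gradient on a replay batch
g_tilde = agem_project(g, g_ref)
print(g_tilde, g_tilde @ g_ref)  # projected gradient is orthogonal to g_ref
```

In a real implementation g and g_ref are the flattened gradients over all parameters, and the projection runs once per training step.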
def reservoir_update(buffer_X, buffer_y, new_X, new_y, buf_size, count):
    """Reservoir sampling: maintain a fixed-size buffer with equal probability."""
    for i in range(len(new_X)):
        count += 1
        if len(buffer_X) < buf_size:
            buffer_X.append(new_X[i])
            buffer_y.append(new_y[i])
        else:
            j = np.random.randint(0, count)
            if j < buf_size:
                buffer_X[j] = new_X[i]
                buffer_y[j] = new_y[i]
    return buffer_X, buffer_y, count

def train_with_replay(X_new, y_new, W1, b1, W2, b2, buffer_X, buffer_y,
                      replay_ratio=0.5, epochs=200, lr=0.05):
    """Train mixing new data with replayed buffer examples."""
    for _ in range(epochs):
        # Mix new data with replay buffer
        if len(buffer_X) > 0:
            n_replay = max(1, int(len(X_new) * replay_ratio))
            idx = np.random.choice(len(buffer_X), min(n_replay, len(buffer_X)))
            buf_X = np.array([buffer_X[i] for i in idx])
            buf_y = np.array([buffer_y[i] for i in idx])
            X_mix = np.vstack([X_new, buf_X])
            y_mix = np.concatenate([y_new, buf_y])
        else:
            X_mix, y_mix = X_new, y_new
        # Standard forward/backward pass on mixed batch
        h = np.maximum(0, X_mix @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out.ravel() - y_mix
        dW2 = h.T @ err.reshape(-1,1) / len(y_mix)
        db2 = err.mean()
        dh = err.reshape(-1,1) * W2.T * (h > 0)
        dW1 = X_mix.T @ dh / len(y_mix)
        db1 = dh.mean(axis=0)
        W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2
    return W1, b1, W2, b2

# Train with replay buffer
rng = np.random.RandomState(0)
W1 = rng.randn(2,8)*0.3; b1 = np.zeros(8)
W2 = rng.randn(8,1)*0.3; b2 = np.zeros(1)
W1, b1, W2, b2 = train_mlp(X1, y1, W1, b1, W2, b2)

buffer_X, buffer_y, count = [], [], 0
buffer_X, buffer_y, count = reservoir_update(
    buffer_X, buffer_y, X1, y1, buf_size=50, count=count)
W1, b1, W2, b2 = train_with_replay(X2, y2, W1, b1, W2, b2,
                                   buffer_X, buffer_y, replay_ratio=0.5)
print(f"Replay: acc_task1={accuracy(X1,y1,W1,b1,W2,b2):.0%}, "
      f"acc_task2={accuracy(X2,y2,W1,b1,W2,b2):.0%}")
# Output: Replay: acc_task1=92%, acc_task2=95%
With just 50 stored examples (25% of task 1's data), replay recovers most of the lost performance. The beauty of replay is its simplicity: no Fisher matrices, no complex penalties — just mix old and new data. The downside is storage: you need to keep actual data, which may be impractical or prohibited in privacy-sensitive settings (see our post on differential privacy).
4. Progressive and Expandable Networks
Both EWC and replay try to fit multiple tasks into the same fixed-capacity network. What if we took a completely different approach: grow the network for each new task?
Progressive Neural Networks (Rusu et al., 2016) do exactly this. After training on task 1, freeze all weights. For task 2, add an entirely new "column" of layers beside the old one, with lateral connections from the old column to the new. The old network is never modified, so there's zero forgetting by construction. Each new task can reuse features from all previous tasks through the lateral connections, enabling forward transfer.
The obvious cost is growth: parameters scale linearly in the number of columns, and the lateral connections into each new column add still more on top. For 100 tasks, you have 100 columns — not scalable.
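To make the column structure concrete, here is a minimal two-column forward pass in the spirit of the post's toy MLP (a sketch; the function name and shapes are illustrative, and biases are omitted):

```python
import numpy as np

def progressive_forward(x, col1, col2):
    """Forward pass through a 2-column progressive net (sketch).
    Column 1 is frozen; column 2 receives a lateral connection U from
    column 1's hidden layer, so it can reuse task-1 features."""
    W1a, W2a = col1                        # frozen task-1 column
    W1b, U, W2b = col2                     # trainable task-2 column + lateral
    h1 = np.maximum(0, x @ W1a)            # task-1 hidden (never updated)
    h2 = np.maximum(0, x @ W1b + h1 @ U)   # task-2 hidden + lateral input
    return h2 @ W2b                        # task-2 output head

rng = np.random.default_rng(0)
col1 = (rng.standard_normal((2, 8)), rng.standard_normal((8, 1)))
col2 = (rng.standard_normal((2, 8)), rng.standard_normal((8, 8)),
        rng.standard_normal((8, 1)))
out = progressive_forward(rng.standard_normal((5, 2)), col1, col2)
print(out.shape)  # (5, 1)
```

Training updates only `col2`; since `col1` never changes, task-1 predictions (through its own head `W2a`) are bitwise identical forever — zero forgetting by construction.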
PackNet (Mallya & Lazebnik, 2018) solves this with a clever reuse strategy. After training on each task, prune the network: identify the least important weights (by magnitude) and set them to zero. Freeze the surviving weights for that task. Then train the pruned-away capacity on the next task. Since the old task's weights are frozen, forgetting is impossible. Since pruning typically removes 60-80% of weights, there's plenty of room for several more tasks.
def train_packnet(X, y, W1, b1, W2, b2, mask_W1, mask_b1, mask_W2, mask_b2,
                  epochs=200, lr=0.05):
    """Train only the parameters where mask == 1 (unfrozen)."""
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out.ravel() - y
        dW2 = h.T @ err.reshape(-1,1) / len(y)
        db2 = err.mean()
        dh = err.reshape(-1,1) * W2.T * (h > 0)
        dW1 = X.T @ dh / len(y)
        db1 = dh.mean(axis=0)
        W1 -= lr * dW1 * mask_W1   # only update unfrozen weights
        b1 -= lr * db1 * mask_b1
        W2 -= lr * dW2 * mask_W2
        b2 -= lr * db2 * mask_b2
    return W1, b1, W2, b2

def prune_and_freeze(W, keep_ratio=0.30):
    """Keep top keep_ratio weights by magnitude, return frozen mask."""
    flat = np.abs(W).ravel()
    if len(flat) == 0:
        return np.ones_like(W), np.zeros_like(W)
    threshold = np.percentile(flat[flat > 0], (1 - keep_ratio) * 100)
    frozen = (np.abs(W) >= threshold).astype(float)
    W *= frozen          # zero out pruned weights
    free = 1.0 - frozen  # mask for next task
    return frozen, free

# Task 1: train full network, then prune to 30%
rng = np.random.RandomState(0)
W1 = rng.randn(2, 16)*0.3; b1 = np.zeros(16)
W2 = rng.randn(16, 1)*0.3; b2 = np.zeros(1)
ones_W1 = np.ones_like(W1); ones_b1 = np.ones_like(b1)
ones_W2 = np.ones_like(W2); ones_b2 = np.ones_like(b2)
W1,b1,W2,b2 = train_packnet(X1,y1,W1,b1,W2,b2,
                            ones_W1,ones_b1,ones_W2,ones_b2)
frozen_W1, free_W1 = prune_and_freeze(W1, keep_ratio=0.30)
frozen_W2, free_W2 = prune_and_freeze(W2, keep_ratio=0.30)
free_b1 = np.ones_like(b1); free_b2 = np.ones_like(b2)
print(f"After Task 1 + prune: acc={accuracy(X1,y1,W1,b1,W2,b2):.0%}, "
      f"free params: {free_W1.sum() + free_W2.sum():.0f}/{W1.size + W2.size}")

# Task 2: train only the freed weights
W1,b1,W2,b2 = train_packnet(X2,y2,W1,b1,W2,b2,
                            free_W1,free_b1,free_W2,free_b2)
print(f"After Task 2: acc_task1={accuracy(X1,y1,W1,b1,W2,b2):.0%}, "
      f"acc_task2={accuracy(X2,y2,W1,b1,W2,b2):.0%}")
# Output:
# After Task 1 + prune: acc=97%, free params: 37/48
# After Task 2: acc_task1=97%, acc_task2=90%
PackNet retains task 1 perfectly here (97%, unchanged) because its surviving weights are frozen. One caveat: the full PackNet method also stores a binary mask per task and applies it at inference, zeroing out weights trained later; this toy run skips the inference mask, and the task-2 weights simply happen not to disturb task 1's predictions. The freed capacity is enough to learn task 2 to 90%. No rubber bands, no replay buffers — just a clean partition of the network's capacity. The trade-off is that each task gets less capacity than the full network, and eventually you run out of room.
5. Knowledge Distillation for Continual Learning
What if you don't want to store any old data and don't want to compute Fisher matrices? Learning without Forgetting (LwF, Li & Hoiem, 2017) uses the model's own predictions as a memory substitute.
Before training on the new task, snapshot the current model. Then, on each new-task training batch, run the inputs through both the snapshot (frozen) and the current model (being trained). The loss combines the usual hard labels for the new task with a knowledge distillation term that penalizes drift from the snapshot's outputs (KL divergence for multi-class, or MSE for binary):
L = L_new_task + α · ||p_old(x) − p_current(x)||²
The teacher here is your own past self. By maintaining similar softmax outputs on the new data, LwF preserves the old decision boundaries without storing a single old example. It's elegant, but it has a limitation: if the new task's data is very different from the old task's distribution, the snapshot's predictions on new-task inputs may not capture the relevant old-task structure.
Buzzega et al. (2020) combined the best of both worlds in Dark Experience Replay (DER++): store examples along with their logits (the model's soft predictions at the time of storage). During replay, use both the hard labels and the stored logits as distillation targets. This is replay plus self-distillation — you remember not just what you saw but what you thought about it.
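For the post's binary setup, the DER++ replay-batch objective can be sketched as follows (the helper name and the α, β values are illustrative; the paper applies the terms over separate replay batches). Each buffer entry stores the logit the network produced at insertion time, and replay matches it with an MSE term alongside the usual label loss:

```python
import numpy as np

def derpp_loss(out_logit, y, stored_logit, alpha=0.5, beta=0.5):
    """DER++-style replay loss (sketch): cross-entropy on the hard
    labels plus MSE between current logits and the logits stored
    in the buffer (the model's past 'dark knowledge')."""
    p = 1.0 / (1.0 + np.exp(-out_logit))
    ce = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    distill = (out_logit - stored_logit) ** 2   # match past-self logits
    return np.mean(beta * ce + alpha * distill)

out_logit = np.array([2.0, -1.0])      # current model's logits
y = np.array([1, 0])                   # stored hard labels
stored_logit = np.array([1.5, -0.5])   # logits saved at insertion time
print(f"{derpp_loss(out_logit, y, stored_logit):.3f}")
```

Storing logits costs almost nothing on top of the examples themselves, which is why DER++ is a strong default baseline in recent continual learning comparisons.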
def train_lwf(X_new, y_new, W1, b1, W2, b2, snap_W1, snap_b1,
              snap_W2, snap_b2, alpha=1.0, epochs=200, lr=0.05):
    """Learning without Forgetting: distill from past-self snapshot."""
    for _ in range(epochs):
        # Current model forward pass
        h = np.maximum(0, X_new @ W1 + b1)
        out = sigmoid(h @ W2 + b2).ravel()
        # Snapshot (old model) predictions on the SAME new-task data
        h_old = np.maximum(0, X_new @ snap_W1 + snap_b1)
        out_old = sigmoid(h_old @ snap_W2 + snap_b2).ravel()
        # Hard loss (new task) + distillation loss (match old predictions)
        hard_err = out - y_new
        # Soft distillation: MSE between current and snapshot outputs
        soft_err = out - out_old
        err = hard_err + alpha * soft_err
        # Backprop combined error
        dW2 = h.T @ err.reshape(-1,1) / len(y_new)
        db2 = err.mean()
        dh = err.reshape(-1,1) * W2.T * (h > 0)
        dW1 = X_new.T @ dh / len(y_new)
        db1 = dh.mean(axis=0)
        W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2
    return W1, b1, W2, b2

# Train with LwF
rng = np.random.RandomState(0)
W1 = rng.randn(2,8)*0.3; b1 = np.zeros(8)
W2 = rng.randn(8,1)*0.3; b2 = np.zeros(1)
W1, b1, W2, b2 = train_mlp(X1, y1, W1, b1, W2, b2)
snap = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
W1, b1, W2, b2 = train_lwf(X2, y2, W1, b1, W2, b2, *snap, alpha=2.0)
print(f"LwF: acc_task1={accuracy(X1,y1,W1,b1,W2,b2):.0%}, "
      f"acc_task2={accuracy(X2,y2,W1,b1,W2,b2):.0%}")
# Output: LwF: acc_task1=82%, acc_task2=88%
LwF achieves 82%/88% — not as strong as EWC or replay, but without storing any data or computing Fisher matrices. The method shines when tasks are related (the snapshot's predictions on new data are informative) and struggles when tasks are very different.
6. Measuring Continual Learning
How do we compare all these methods fairly? A single accuracy number won't do — we need metrics that capture both learning and forgetting across a sequence of tasks. The standard approach builds a T × T accuracy matrix where entry a_{j,i} is the accuracy on task i after training through task j:
def evaluate_continual(tasks, method="naive", lam=1500):
    """Train on sequential tasks, return TxT accuracy matrix."""
    T = len(tasks)
    acc_matrix = np.zeros((T, T))
    rng = np.random.RandomState(0)
    W1 = rng.randn(2,8)*0.3; b1 = np.zeros(8)
    W2 = rng.randn(8,1)*0.3; b2 = np.zeros(1)
    old_params, fishers = None, None
    for j in range(T):
        Xj, yj = tasks[j]
        if method == "naive":
            W1,b1,W2,b2 = train_mlp(Xj, yj, W1, b1, W2, b2)
        elif method == "ewc" and old_params is not None:
            W1,b1,W2,b2 = train_ewc(Xj, yj, W1, b1, W2, b2,
                                    old_params, fishers, lam=lam)
        else:
            W1,b1,W2,b2 = train_mlp(Xj, yj, W1, b1, W2, b2)
        # Update EWC anchor
        fishers = compute_fisher(Xj, yj, W1, b1, W2, b2)
        old_params = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
        # Evaluate on every task (row j of the matrix)
        for i in range(T):
            Xi, yi = tasks[i]
            acc_matrix[j, i] = accuracy(Xi, yi, W1, b1, W2, b2)
    return acc_matrix

def compute_metrics(acc_matrix):
    T = acc_matrix.shape[0]
    avg_acc = acc_matrix[-1, :].mean()   # final average
    forgetting = 0.0
    for i in range(T - 1):
        peak = acc_matrix[i:, i].max()
        forgetting += peak - acc_matrix[-1, i]
    forgetting /= max(T - 1, 1)
    return avg_acc, forgetting

# 5 sequential tasks
tasks = [make_task([-2,-2],[2,2], seed=10), make_task([-2,2],[2,-2], seed=20),
         make_task([0,-3],[0,3], seed=30), make_task([-3,0],[3,0], seed=40),
         make_task([-1,-1],[1,1], seed=50)]
for method in ["naive", "ewc"]:
    M = evaluate_continual(tasks, method=method)
    avg, fgt = compute_metrics(M)
    print(f"{method:6s}: avg_acc={avg:.0%}, forgetting={fgt:.0%}")
# Output:
# naive : avg_acc=52%, forgetting=47%
# ewc : avg_acc=76%, forgetting=18%
The key metrics extracted from this matrix are:
- Average accuracy: (1/T) · Σ_i a_{T,i} — how well the model performs on all tasks at the end
- Forgetting: average drop from each task's peak accuracy to its final accuracy
- Forward transfer: does learning early tasks help learn later ones?
- Backward transfer: does learning later tasks improve earlier ones? (rare but desirable)
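Backward transfer can be read straight off the same T × T matrix. A small helper following the GEM-style definition (the toy matrix values here are illustrative, not from the experiments above):

```python
import numpy as np

def backward_transfer(acc_matrix):
    """BWT = average change on earlier tasks after the full sequence:
    mean over i < T-1 of (a[T-1, i] - a[i, i]). Negative values mean
    forgetting; positive means later tasks improved earlier ones."""
    T = acc_matrix.shape[0]
    return np.mean([acc_matrix[-1, i] - acc_matrix[i, i] for i in range(T - 1)])

# Toy 3-task matrix: diagonal = accuracy right after learning each task,
# last row = accuracy on every task at the end of the sequence.
M = np.array([[0.95, 0.00, 0.00],
              [0.70, 0.93, 0.00],
              [0.60, 0.80, 0.94]])
print(f"BWT = {backward_transfer(M):+.3f}")  # negative: net forgetting
```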
The ideal continual learner has high average accuracy, near-zero forgetting, and positive forward transfer. In practice, most methods trade off between these. Standard benchmarks include Split-MNIST (10 digits split into 5 binary tasks), Split-CIFAR (100 classes into 10-20 tasks), and Permuted-MNIST (same task but pixels randomly shuffled each time).
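A Permuted-MNIST-style task stream is easy to sketch: each task applies its own fixed random permutation to the input features, so the underlying classification problem is unchanged while the representation shifts per task (illustrative stand-in arrays, not real MNIST):

```python
import numpy as np

def make_permuted_tasks(X, n_tasks, seed=0):
    """Permuted-benchmark sketch: each task gets its own fixed random
    column permutation of the same data."""
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_tasks):
        perm = rng.permutation(X.shape[1])   # one fixed permutation per task
        tasks.append(X[:, perm])
    return tasks

X = np.arange(12, dtype=float).reshape(3, 4)  # stand-in for flattened images
tasks = make_permuted_tasks(X, n_tasks=2)
print(tasks[0].shape, np.sort(tasks[0][0]))   # same values, shuffled columns
```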
Try It: Replay Buffer Visualizer
Watch how a small buffer of replayed examples prevents catastrophic forgetting. Adjust buffer size and replay ratio, then step through tasks.
7. Modern Approaches and Open Challenges
The methods we've built — EWC, replay, PackNet, LwF — are the classical toolkit. Modern continual learning research has shifted toward leveraging foundation models.
Prompt-based continual learning (Wang et al., 2022) takes a frozen pre-trained model and learns task-specific prompts — small vectors prepended to the input that steer the model toward the right task. Since the backbone is never modified, there's nothing to forget. All adaptation happens in prompt space, which is orders of magnitude smaller than the full parameter space. L2P (Learning to Prompt) and DualPrompt are leading examples of this approach.
Continual pre-training is increasingly important for LLMs. The world changes — new events, new knowledge, evolving language. How do you update a language model without forgetting its existing capabilities? Current approaches include mixing old and new data during continued training, using regularization techniques inspired by EWC, and architecture-based solutions like mixture-of-experts with task-specific routing.
The hardest open problem remains class-incremental learning without task IDs. When a model must distinguish between all classes ever seen, without being told which "task" is active at inference time, even the best methods struggle. Other challenges include handling concept drift in non-stationary environments and balancing computational cost with anti-forgetting measures — storing Fisher matrices and replay buffers for hundreds of tasks becomes expensive.
Continual learning connects deeply to meta-learning (learning to learn across tasks) and active learning (choosing what to learn next). The dream is a system that learns like a human: continuously, efficiently, and without forgetting.
References & Further Reading
- McCloskey & Cohen — "Catastrophic Interference in Connectionist Networks" (1989) — first formal demonstration of catastrophic forgetting in neural networks
- Kirkpatrick et al. — "Overcoming Catastrophic Forgetting in Neural Networks" (2017) — the original EWC paper from DeepMind
- Lopez-Paz & Ranzato — "Gradient Episodic Memory for Continual Learning" (2017) — GEM and the episodic memory approach
- Rusu et al. — "Progressive Neural Networks" (2016) — growing networks for zero-forgetting sequential task learning
- Li & Hoiem — "Learning without Forgetting" (2017) — knowledge distillation from past self without storing data
- Buzzega et al. — "Dark Experience for General Continual Learning" (2020) — DER++, combining replay with logit distillation
- van de Ven & Tolias — "Three Scenarios for Continual Learning" (2019) — the definitive taxonomy of continual learning settings
- Wang et al. — "Learning to Prompt for Continual Learning" (2022) — L2P, prompt-based continual learning with frozen backbones
- Mallya & Lazebnik — "PackNet" (2018) — iterative pruning and freezing for multi-task learning in a fixed-size network
See also: Backpropagation from Scratch (the MLP training that forgetting disrupts), Implicit Bias of Gradient Descent (how weight updates cause forgetting), Regularization from Scratch (penalty terms and pruning), Knowledge Distillation from Scratch (soft-target distillation in LwF), Meta-Learning from Scratch (learning across task distributions), Active Learning from Scratch (strategic data selection), Bayesian Inference from Scratch (Laplace approximation in EWC), and Loss Functions from Scratch (combined loss objectives).