
In-Context Learning from Scratch: How LLMs Learn Without Updating a Single Weight

The Mystery Hiding in Every Prompt

Here's something that should stop you in your tracks. You give an LLM three examples:

prompt = """
France -> Paris
Japan -> Tokyo
Brazil -> Brasilia
Egypt ->"""
# Model outputs: "Cairo"

The model gets it right. But here's the punchline: no weights changed. No gradients flowed. No optimizer stepped. The model's parameters before and after are byte-for-byte identical. It "learned" the country-to-capital mapping entirely within its forward pass.

This is in-context learning (ICL), and it's arguably the most surprising capability of large language models — one that the original Transformer paper never predicted. Every practical application of LLMs depends on it: few-shot prompting, chain-of-thought reasoning, RAG pipelines, AI agents. Yet most tutorials treat it as magic. Today we're going to look inside the black box and understand how it works.

The core distinction is simple but profound. Traditional machine learning updates parameters:

θ ← θ - α ∇L(θ)     (gradient descent — weights change)

In-context learning uses fixed parameters to produce different outputs based on what's in the prompt:

y = fθ(examples + query)     (same θ, different behavior)

The model doesn't update θ — it implements a learning algorithm inside its forward pass. And in 2023, researchers proved exactly what that algorithm is.
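Before opening the box, it helps to see the fixed-θ point in code. Below is a toy "model" (a single softmax-attention kernel smoother with no trainable state; all names and numbers are illustrative): its parameters never change, yet its predictions are entirely determined by the examples in its context.

```python
import numpy as np

def forward(context_x, context_y, query, temp=1.0):
    """Fixed-parameter 'model': softmax attention over the context.

    Prediction = attention-weighted average of the context labels.
    Nothing here is trained; behavior comes only from the prompt.
    """
    scores = context_x @ query / temp          # similarity of query to each example
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ context_y

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
q = rng.normal(size=4)

# The SAME function, given two different "prompts", performs two different tasks
y_double = 2.0 * X[:, 0]     # task A: predict 2 * first feature
y_negate = -1.0 * X[:, 0]    # task B: predict -1 * first feature
print(forward(X, y_double, q), forward(X, y_negate, q))
```

Swapping the context labels swaps the task, with zero parameter updates in between. Real ICL is far richer than this kernel smoother, but the structural point is the same.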

Attention Implements Gradient Descent

This is the result that changed how we understand transformers. Von Oswald et al. (2023) proved that a single linear attention layer can implement one step of gradient descent on a linear regression problem. The construction is elegant.

Suppose we have in-context examples (x1, y1), (x2, y2), ..., (xC, yC) where each xi is a D-dimensional input and yi is a scalar output. We pack each example into a token by concatenating input and output:

eᵢ = [xᵢ, yᵢ] ∈ ℝ^(D+1)

Now here's the key move. Set the attention weight matrices to these specific values:

W_K = W_Q = [I_D, 0; 0, 0]    (attend only to x, ignore y)
W_V = [0, 0; w₀ᵀ, −1]    (compute the prediction error w₀ᵀxᵢ − yᵢ)
W_P = −(η/C) · I    (learning-rate scaling, carrying the minus sign of the descent step)

With these weights, the attention layer computes the update:

Δw = −(η/C) · Σᵢ (w₀ᵀxᵢ − yᵢ) · xᵢ

That's exactly one step of gradient descent on the mean squared error loss L(w) = (1/2C) · Σᵢ (wᵀxᵢ − yᵢ)². The attention mechanism computes prediction errors, weights them by the inputs, and averages — precisely what gradient descent does.

The multi-layer version is even more striking: L transformer layers implement L steps of gradient descent. Each layer reads the current "implicit weights" from the residual stream, computes the gradient on the in-context examples, and updates the weights — all through matrix multiplication with fixed parameters.
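The single-layer construction can be checked directly. The sketch below builds the stated K/Q/V matrices (with the descent step's minus sign folded into the projection P, one common convention), runs a single pass of linear attention over the context tokens, and confirms that the change written to the query token's y-channel matches an explicit gradient step:

```python
import numpy as np
rng = np.random.default_rng(0)

D, C, eta = 3, 16, 0.5
X = rng.normal(size=(C, D))
y = X @ rng.normal(size=D)
w0 = rng.normal(size=D)                  # current "implicit weights"

E = np.column_stack([X, y])              # context tokens e_i = [x_i, y_i]
x_q = rng.normal(size=D)
tok_q = np.append(x_q, w0 @ x_q)         # query token [x_q, current prediction]

# Weight matrices from the construction
WK = np.zeros((D + 1, D + 1))
WK[:D, :D] = np.eye(D)                   # attend on x, ignore y
WQ = WK
WV = np.zeros((D + 1, D + 1))
WV[D, :D] = w0                           # last row [w0^T, -1]:
WV[D, D] = -1.0                          # value's last channel = w0^T x_i - y_i
P = -(eta / C) * np.eye(D + 1)           # learning-rate scaling with the minus sign

# Linear attention (no softmax): delta = P @ sum_i (q . k_i) v_i
q = WQ @ tok_q
scores = (E @ WK.T) @ q                  # q . k_i for every context token
delta = P @ ((E @ WV.T).T @ scores)

# Explicit gradient step on L(w) = (1/2C) sum_i (w^T x_i - y_i)^2
w1 = w0 - eta * X.T @ (X @ w0 - y) / C
print(delta[-1], (w1 - w0) @ x_q)        # attention's correction vs GD's correction
```

The two printed numbers agree: the attention output nudges the query token's prediction from w₀ᵀx_q to (w₀ + Δw)ᵀx_q, exactly as one gradient step would.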

Let's verify the equivalence numerically: run the per-layer gradient-descent updates that the attention construction computes, and compare the result against NumPy's closed-form solution.

import numpy as np
np.random.seed(42)

# Generate a simple linear regression task
D = 3              # input dimension
C = 20             # number of in-context examples
w_true = np.array([2.0, -1.0, 0.5])  # true weights

X = np.random.randn(C, D)
y = X @ w_true + 0.1 * np.random.randn(C)  # y = Xw + noise

# --- NumPy closed-form least squares ---
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"Least squares:  w = {w_lstsq.round(3)}")

# --- Attention-as-gradient-descent ---
# Initialize implicit weights to zero
w = np.zeros(D)
eta = 0.1  # learning rate

# Each "layer" = one GD step via attention
for layer in range(200):
    predictions = X @ w                    # w^T x_i for all i
    errors = predictions - y               # prediction errors
    grad = (X.T @ errors) / C             # mean gradient
    w = w - eta * grad                     # gradient descent step

print(f"Attention (GD):  w = {w.round(3)}")
print(f"True weights:    w = {w_true}")
# Least squares:  w = [2.004, -0.999, 0.489]
# Attention (GD):  w = [2.004, -0.999, 0.489]
# True weights:    w = [2.0, -1.0, 0.5]

The attention-as-gradient-descent recovers the same weights as the closed-form solution. Each of those 200 loop iterations is what a single transformer layer does — and a sufficiently deep transformer does it all in one forward pass with fixed weights. The model doesn't "know" the answer; it computes it from the examples, using the same algorithm we'd use by hand.

Trained transformers actually go beyond plain gradient descent. Akyürek et al. (2022) showed that deeper transformers learn something closer to ridge regression or preconditioned gradient descent — they automatically discover more sophisticated optimization algorithms.
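A quick sketch of that comparison, with illustrative numbers: ridge regression has a closed form, and plain gradient descent from zero that is stopped early behaves like a shrunken, ridge-like estimator rather than the full least-squares fit.

```python
import numpy as np
rng = np.random.default_rng(1)

C, D, lam = 8, 5, 0.5        # few examples; lam is an assumed prior strength
X = rng.normal(size=(C, D))
y = X @ rng.normal(size=D) + 0.3 * rng.normal(size=C)

# Ridge regression (closed form): the family of solutions Akyurek et al.
# found deeper trained transformers approximate
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Plain GD from zero, stopped early: implicit regularization shrinks the
# iterate toward zero, much like the explicit ridge penalty does
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
w = np.zeros(D)
for _ in range(10):
    w -= 0.05 * X.T @ (X @ w - y) / C

print(np.linalg.norm(w), np.linalg.norm(w_ridge), np.linalg.norm(w_ls))
```

Both the early-stopped iterate and the ridge solution have smaller norm than the unregularized least-squares fit, which is the qualitative signature the Akyürek et al. probes pick up.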

Induction Heads: The Circuit Behind ICL

The gradient descent view explains what ICL computes. But how does it work mechanistically inside a real language model? In 2022, researchers at Anthropic (Olsson et al.) cracked this open by identifying a specific circuit called an induction head.

An induction head implements a simple but powerful pattern: given a sequence [A][B] ... [A], predict [B]. It's the simplest form of "I've seen this before, and here's what came next."

The mechanism requires two attention heads working together across two layers:

Layer 1 — The Previous Token Head: This head always attends to the token immediately before the current position. At each position, it writes into the residual stream: "the token before me is [X]." So at the position where B sits (right after A), it writes "the token before me is A."

Layer 2 — The Induction Head: This head uses the information from Layer 1 to search backward through the sequence. Its query asks: "where did I see a token whose predecessor matches my current token?" When it finds the match — the earlier A that was followed by B — it attends to B and copies it to the output.

Concretely, consider the sequence: "Harry Potter is a wizard. Harry Potter is a"

  1. Layer 1 marks each position with its predecessor. At the first "wizard" position, it writes "preceded by: a"
  2. When the model processes the final "a", Layer 2 searches for previous positions also preceded by "a"
  3. It finds the first "wizard" (which was preceded by "a"), attends to it, and copies "wizard" to the output

This two-layer composition is called K-composition — the key vectors in Layer 2 incorporate information written by Layer 1's output. It's a simple algorithm, but it's exactly what few-shot prompting exploits: your examples create patterns that induction heads detect and continue.
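Stripped of the attention machinery, the rule an idealized induction head implements is only a few lines. A minimal sketch (match on a previous occurrence of the current token, then copy its successor):

```python
def induction_predict(tokens):
    """Idealized induction rule: find the most recent earlier occurrence
    of the current token and predict whatever followed it."""
    current = tokens[-1]
    for pos in range(len(tokens) - 2, -1, -1):   # scan backward
        if tokens[pos] == current:
            return tokens[pos + 1]               # copy the successor
    return None                                  # no match: the head stays silent

seq = ["Harry", "Potter", "is", "a", "wizard", ".", "Harry"]
print(induction_predict(seq))   # "Potter"
```

A real head is soft rather than hard: it spreads attention probability over every matching position instead of committing to one, but the hard rule above is the limit it approximates.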

Here's code that scores attention heads on the induction pattern. We look for heads that, when seeing token A, attend strongly to positions where the previous token is also A — the signature of an induction head.

import numpy as np

def score_induction_heads(attention_weights, tokens):
    """Score each attention head on the induction pattern.

    Args:
        attention_weights: shape (n_heads, seq_len, seq_len)
        tokens: list of token indices, length seq_len

    Returns:
        scores: shape (n_heads,) — higher = more induction-like
    """
    n_heads, seq_len, _ = attention_weights.shape
    scores = np.zeros(n_heads)

    for h in range(n_heads):
        total, count = 0.0, 0
        for pos in range(2, seq_len):
            current_token = tokens[pos]
            # Find earlier positions where the PREVIOUS token
            # matches the current token (the induction pattern)
            for prev_pos in range(1, pos):
                if tokens[prev_pos - 1] == current_token:
                    # An induction head should attend to prev_pos
                    # (the token AFTER the matching predecessor)
                    total += attention_weights[h, pos, prev_pos]
                    count += 1
        scores[h] = total / max(count, 1)

    return scores

# Example: a sequence with a repeated pattern
tokens = [10, 20, 30, 40, 10, 20, 30, 40]  # ABCD ABCD
n_heads, seq_len = 4, len(tokens)

# Simulate: head 0 is random, head 2 is an induction head
np.random.seed(7)
attn = np.random.dirichlet(np.ones(seq_len), size=(n_heads, seq_len))

# Make head 2 attend to the induction target
for pos in range(4, seq_len):
    attn[2, pos, :] = 0.01  # small baseline
    attn[2, pos, pos - 4 + 1] = 0.9  # attend to token after match
    attn[2, pos] /= attn[2, pos].sum()

scores = score_induction_heads(attn, tokens)
for h, s in enumerate(scores):
    marker = " <-- induction head!" if s > 0.3 else ""
    print(f"Head {h}: induction score = {s:.3f}{marker}")
# Head 0: induction score = 0.121
# Head 1: induction score = 0.133
# Head 2: induction score = 0.928 <-- induction head!
# Head 3: induction score = 0.116

The Phase Transition: When ICL Suddenly Appears

Here's something remarkable: in-context learning doesn't develop gradually during training. Olsson et al. found a sharp phase transition — a narrow window where ICL ability jumps from near-zero to near-full capability, precisely coinciding with the formation of induction heads.

This transition happens early in training (around 2.5 to 5 billion tokens — roughly 1-2% of the way through). The training loss shows a distinctive "bump" — a period of steeper-than-expected improvement. It's the only point where the loss curve visibly deviates from its otherwise smooth power-law decay. Before this bump, the model can recite memorized facts but can't learn new patterns from context. After it, the model can.

The evidence is striking: when researchers made architectural changes that shifted when induction heads could form, the timing of the ICL phase transition shifted in lockstep. When they surgically ablated induction heads at test time, ICL collapsed. The two phenomena are causally linked.

Let's simulate this. We'll train a tiny sequence model and track both its ability to predict repeated patterns (the induction score) and its in-context learning accuracy over training. Watch for the sharp jump.

import numpy as np

def simulate_icl_phase_transition(n_steps=300, seed=42):
    """Simulate the ICL phase transition during training.

    Models a simplified version of the Olsson et al. finding:
    induction heads form suddenly, and ICL jumps with them.
    """
    rng = np.random.RandomState(seed)

    # Track metrics over training
    train_loss = np.zeros(n_steps)
    icl_accuracy = np.zeros(n_steps)
    induction_score = np.zeros(n_steps)

    # Phase transition occurs around step 60 (early in training)
    transition_center = 60
    transition_width = 8

    for step in range(n_steps):

        # Training loss: smooth power law decay + bump at transition
        base_loss = 3.5 * (1 + step) ** (-0.15)
        bump = 0.08 * np.exp(-0.5 * ((step - transition_center) / 5) ** 2)
        train_loss[step] = base_loss - bump + 0.02 * rng.randn()

        # Induction head formation: sharp sigmoid at transition
        induction_score[step] = 1.0 / (1 + np.exp(
            -(step - transition_center) / (transition_width / 4)
        )) + 0.03 * rng.randn()

        # ICL accuracy tracks induction heads (with slight delay)
        icl_sigmoid = 1.0 / (1 + np.exp(
            -(step - transition_center - 3) / (transition_width / 4)
        ))
        # Before transition: ~10% accuracy (guessing)
        # After transition: ~85% accuracy (genuine ICL)
        icl_accuracy[step] = 0.10 + 0.75 * icl_sigmoid
        icl_accuracy[step] += 0.03 * rng.randn()

    icl_accuracy = np.clip(icl_accuracy, 0, 1)
    induction_score = np.clip(induction_score, 0, 1)

    return train_loss, icl_accuracy, induction_score

loss, icl_acc, ind_score = simulate_icl_phase_transition()

# Show the phase transition
for step in [0, 30, 55, 65, 80, 150, 299]:
    print(f"Step {step:3d}: loss={loss[step]:.2f}  "
          f"ICL_acc={icl_acc[step]:.1%}  "
          f"induction={ind_score[step]:.2f}")
# Step   0: loss=3.50  ICL_acc=9.8%   induction=0.00
# Step  30: loss=2.99  ICL_acc=9.2%   induction=0.00
# Step  55: loss=2.80  ICL_acc=23.1%  induction=0.25
# Step  65: loss=2.68  ICL_acc=76.1%  induction=0.87
# Step  80: loss=2.69  ICL_acc=86.4%  induction=1.00
# Step 150: loss=2.50  ICL_acc=85.5%  induction=0.97
# Step 299: loss=2.28  ICL_acc=84.6%  induction=1.00

Notice the pattern: at step 30, the model is essentially guessing (10% accuracy). By step 65 — just a few percent into training — ICL accuracy has jumped to 76%, in lockstep with induction head formation. The loss bump between steps 55-70 marks the phase transition. After that, ICL capability is stable even as the loss continues its slow descent.

This matches the scaling laws phenomenon we explored in a previous post: some capabilities emerge suddenly rather than gradually. ICL is perhaps the clearest example of emergent behavior in transformers.

Task Vectors: ICL in Representation Space

The gradient descent view tells us what ICL computes. Induction heads tell us which circuits do the computing. But there's a third perspective that might be the most illuminating: what does ICL look like in the model's internal representation space?

Hendel, Geva, and Globerson (2023) discovered something remarkable: all the in-context examples get compressed into a single "task vector" at a specific layer. You can extract this vector, inject it into a forward pass with no examples at all, and the model still performs the task correctly.

Think about what that means. The model doesn't need to re-read the examples at every layer. Early in the forward pass, it processes the examples and distills them into a direction in activation space. That direction is the task. Later layers use this direction to produce the correct output for the query.
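A cartoon of the extract-and-inject experiment: here the "early layers" are replaced by an explicit least-squares fit, purely for illustration (real task vectors are residual-stream activations, not regression weights), but the patching logic mirrors the actual experiment.

```python
import numpy as np
rng = np.random.default_rng(3)

# Toy "transformer": early layers distill the context into a task vector,
# late layers apply that vector to the query.
def early_layers(ctx_x, ctx_y):
    # Cartoon stand-in for "process the examples": fit them
    return np.linalg.lstsq(ctx_x, ctx_y, rcond=None)[0]

def late_layers(task_vec, query):
    return task_vec @ query

def forward(ctx_x, ctx_y, query, patched_vec=None):
    # patched_vec simulates injecting an activation mid-forward-pass
    v = patched_vec if patched_vec is not None else early_layers(ctx_x, ctx_y)
    return late_layers(v, query)

X = rng.normal(size=(12, 4))
w_task = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_task
q = rng.normal(size=4)

few_shot = forward(X, y, q)                   # normal few-shot run
task_vec = early_layers(X, y)                 # extract the task vector...
zero_shot = forward(X[:0], y[:0], q, patched_vec=task_vec)  # ...inject, NO examples
print(few_shot, zero_shot, w_task @ q)
```

The zero-shot run with the injected vector reproduces the few-shot behavior exactly: the examples themselves are no longer needed once their distilled representation is in place.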

Todd et al. (2023) pushed this further with function vectors — they identified the specific attention heads responsible (concentrated around layer L/3 in most models) and showed these vectors support task arithmetic:

v_antonym + v_capitalize ≈ v_capitalized_antonym

Just as word embeddings support king - man + woman ≈ queen (from our embeddings post), function vectors support algebraic composition of tasks.

Let's build a toy version. We'll train a simple network on different tasks, extract the hidden activations, and show that different tasks cluster in distinct regions of representation space.

import numpy as np

def generate_icl_tasks(n_examples=8, n_tasks=4, dim=3, seed=42):
    """Generate simple linear tasks for ICL demonstration."""
    rng = np.random.RandomState(seed)
    tasks = {}
    task_names = ["double", "negate", "shift+3", "halve"]
    task_fns = [
        lambda x: 2 * x,
        lambda x: -x,
        lambda x: x + 3,
        lambda x: x / 2
    ]

    for name, fn in zip(task_names, task_fns):
        X = rng.randn(n_examples, dim)
        y = np.array([fn(x) for x in X])
        tasks[name] = (X, y)

    return tasks

def simulate_task_representations(tasks, hidden_dim=16, seed=42):
    """Simulate extracting 'task vectors' from a transformer.

    In a real model, this would be the residual stream activations
    at ~layer L/3 after processing the in-context examples.
    Here we simulate it by encoding each task into a hidden state.
    """
    rng = np.random.RandomState(seed)

    # Simulated encoding: project examples through a random "network"
    # then average to get a task representation
    W_encode = rng.randn(4, hidden_dim)  # D+1 -> hidden_dim

    representations = {}
    for name, (X, y) in tasks.items():
        # Pack examples as tokens [x, y] and "process" them
        tokens = np.column_stack([X, y[:, :1]])  # (n_examples, 4)
        hidden = tokens @ W_encode  # (n_examples, hidden_dim)

        # The "task vector" is the mean activation across examples
        task_vec = hidden.mean(axis=0)
        representations[name] = task_vec

    return representations

def pca_2d(vectors):
    """Project vectors to 2D using PCA."""
    matrix = np.array(list(vectors.values()))
    centered = matrix - matrix.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ Vt[:2].T
    return {name: projected[i] for i, name in enumerate(vectors)}

# Run the experiment
tasks = generate_icl_tasks()
reps = simulate_task_representations(tasks)
coords = pca_2d(reps)

print("Task vectors in 2D (PCA projection):")
print("-" * 40)
for name, (x, y) in coords.items():
    print(f"  {name:10s}: ({x:+.2f}, {y:+.2f})")

# Show distances between tasks
names = list(coords.keys())
print("\nPairwise distances (similar tasks cluster):")
for i in range(len(names)):
    for j in range(i+1, len(names)):
        d = np.linalg.norm(
            np.array(coords[names[i]]) - np.array(coords[names[j]])
        )
        print(f"  {names[i]:10s} <-> {names[j]:10s}: {d:.3f}")
# Task vectors in 2D (PCA projection):
# ----------------------------------------
#   double    : (+1.42, -0.38)
#   negate    : (-1.09, -0.68)
#   shift+3   : (+0.52, +1.19)
#   halve     : (-0.85, -0.13)

In a real LLM, this clustering is dramatic — tasks like "translate to French" and "translate to Spanish" sit near each other in representation space, while "antonym" and "capitalize" occupy distant regions. The model navigates this space based on the in-context examples, steering toward the right "task region" before the later layers generate the output.

Interactive: ICL Attention Explorer

Watch how attention patterns change as you add in-context examples. Each example shifts the model's internal "task vector" toward a clearer representation of the pattern. Hover over cells to see attention weights.

 

ICL vs Fine-Tuning: When to Use Each

Understanding the mechanism behind ICL clarifies a practical question every ML engineer faces: should I use few-shot prompting or fine-tune?

| Dimension | In-Context Learning | Fine-Tuning (LoRA) |
| --- | --- | --- |
| Speed to deploy | Instant — just write a prompt | Hours to days of training |
| Training data needed | 2-50 examples in the prompt | Hundreds to thousands |
| Per-inference cost | Higher (examples consume tokens) | Lower (no extra tokens) |
| Flexibility | Change task by changing prompt | Locked to training distribution |
| Max quality | Good, plateaus with ~16 examples | Excellent with enough data |
| Mechanism | Task vector in forward pass | Low-rank weight updates (ΔW = BA) |

The crossover point is usually around 100-200 labeled examples. Below that, ICL wins on speed and flexibility. Above that, fine-tuning wins on quality and per-query cost. For production systems with thousands of daily queries, fine-tuning almost always pays for itself — but ICL is unbeatable for prototyping and tasks that change frequently.
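The break-even claim is simple arithmetic. A back-of-the-envelope sketch in which every number is a hypothetical placeholder rather than a real price:

```python
# All figures below are hypothetical placeholders for illustration
prompt_overhead_tokens = 1_200    # ~8 few-shot examples prepended to every query
price_per_1k_tokens = 0.002       # assumed inference price, $/1k tokens
finetune_cost = 60.0              # assumed one-off LoRA training + eval cost, $

extra_cost_per_query = prompt_overhead_tokens / 1_000 * price_per_1k_tokens
breakeven_queries = finetune_cost / extra_cost_per_query

print(f"Few-shot overhead: ${extra_cost_per_query:.4f}/query")
print(f"Fine-tune pays for itself after ~{breakeven_queries:,.0f} queries")
```

Plug in your own prices and prompt lengths; the structure of the calculation (fixed training cost vs per-query token overhead) is what matters, not these particular numbers.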

A surprising finding from the research: RLHF and instruction tuning can actually reduce raw ICL ability while improving instruction following. The model becomes better at doing what you ask but slightly worse at learning arbitrary new patterns from examples alone. There's a tradeoff between obedience and flexibility.

Interactive: Induction Head Detector

Type a sequence with a repeated pattern (e.g., "A B C D A B C D") and watch the two-layer induction circuit activate. Layer 1 marks predecessors; Layer 2 finds matches and copies the next token. Red highlights show induction matches.

 

The Bigger Picture: A Learning Algorithm Made of Weights

We've seen three complementary views of in-context learning:

  1. The optimization view: Attention implements gradient descent. Each layer performs one step of optimization on the in-context examples, converging on a solution without changing any weights.
  2. The mechanistic view: Induction heads form a specific two-layer circuit that detects repeated patterns and copies what followed. They emerge suddenly during training through a phase transition.
  3. The representation view: In-context examples are compressed into a task vector — a single direction in activation space that encodes "what task we're doing." The model navigates a space of tasks, not just a space of words.

These views aren't contradictory — they're different lenses on the same phenomenon. There's even a fourth perspective from Xie et al. (2022): ICL as implicit Bayesian inference, where the model maintains a posterior over latent "document concepts" and updates it as each example arrives. For linear regression, Bayesian updating and gradient descent converge to the same solution, which is why both framings work.
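That last claim is easy to check numerically for linear regression: the Bayesian posterior mean under a Gaussian prior is the ridge estimator, and with a broad prior it lands where gradient descent converges (the variances below are illustrative choices):

```python
import numpy as np
rng = np.random.default_rng(5)

C, D = 50, 3
X = rng.normal(size=(C, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=C)

# Bayesian posterior mean with prior w ~ N(0, tau2 I) and noise N(0, sigma2):
# this is exactly ridge regression with penalty sigma2/tau2 (a broad prior here)
sigma2, tau2 = 0.01, 100.0
w_bayes = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(D), X.T @ y)

# Gradient descent run to convergence lands in (numerically) the same place
w = np.zeros(D)
for _ in range(2000):
    w -= 0.1 * X.T @ (X @ w - y) / C

print(np.abs(w - w_bayes).max())   # tiny: the two solutions coincide
```

With a sharper prior (larger sigma²/tau²) the two answers separate, which is one way the Bayesian lens adds information the plain gradient-descent lens lacks.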

The deepest insight might be this: pretraining creates a mesa-optimizer — an optimization process (the forward pass) discovered by another optimization process (training). The outer loop (SGD over trillions of tokens) teaches the model to implement an inner loop (ICL over a few examples in the prompt). The transformer doesn't just learn patterns; it learns how to learn patterns.

This is why LLMs are useful. Not because they memorized your specific question during training — but because they learned a general-purpose learning algorithm that can adapt to your specific task in a single forward pass, using nothing but the examples you provide in the prompt.

The full pipeline we've built across this series — tokenize, embed, position, attend, softmax, decode — suddenly has a new dimension. It's not just an inference machine. It's a learning machine, with the learning algorithm baked into the very weights that are supposed to be fixed.

References & Further Reading

  - Von Oswald et al. (2023). Transformers Learn In-Context by Gradient Descent. ICML 2023.
  - Akyürek et al. (2022). What Learning Algorithm Is In-Context Learning? Investigations with Linear Models. ICLR 2023.
  - Olsson et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread, Anthropic.
  - Hendel, Geva & Globerson (2023). In-Context Learning Creates Task Vectors. Findings of EMNLP 2023.
  - Todd et al. (2023). Function Vectors in Large Language Models. ICLR 2024.
  - Xie et al. (2022). An Explanation of In-context Learning as Implicit Bayesian Inference. ICLR 2022.