Mechanistic Interpretability from Scratch
Why Look Inside?
This series has spent 48 posts teaching you how to build neural networks — from individual neurons to full transformers, from training loops to scaling laws. We’ve assembled the engine piece by piece. But we’ve never popped the hood on a trained network and asked: what did it actually learn in there?
That question haunts modern AI. We train a language model on terabytes of text, it develops the ability to write poetry, solve math problems, and argue about philosophy — and we have essentially no idea how. The weights are just a billion floating-point numbers. Where is the “poetry module”? Which neurons encode factual knowledge? What algorithm does attention head 7 in layer 15 actually implement?
Mechanistic interpretability is the science of answering these questions. Not by treating the model as a black box and measuring inputs versus outputs, but by reverse-engineering the internal computations — identifying the features, tracing the circuits, and reading the algorithms that networks learn during training.
The field coalesced around three bold claims from Chris Olah and collaborators in their landmark 2020 paper “Zoom In: An Introduction to Circuits”:
- Features are the fundamental units — meaningful directions in activation space, not individual neurons
- Circuits are how features connect — subgraphs of weights implementing specific, interpretable algorithms
- Universality — analogous features and circuits form across different models trained on different data
In this post we’ll build the core tools of the interpretability researcher’s toolkit from scratch: superposition models, probing classifiers, activation patching, the logit lens, attention pattern analysis, and sparse autoencoders for feature extraction. By the end, you’ll be able to open up any neural network and start reading its internals.
The Superposition Hypothesis
Here’s the naïve hope: each neuron learns one concept. Neuron 42 fires for “dogs,” neuron 137 fires for “the Eiffel Tower,” neuron 891 fires for “sarcasm.” If that were true, interpretability would be easy — just read the labels.
Reality is messier. Individual neurons are polysemantic: a single neuron might fire for dogs, cars, AND the color blue. Why? Because neural networks need to represent far more concepts than they have neurons. A language model encounters millions of distinct concepts during training, but even a large model might have only a few thousand neurons per layer.
The solution networks discover is superposition: packing multiple features into overlapping directions in activation space. Think of it like an apartment building — if you have 100 tenants but only 50 parking spots, you can make it work as long as not everyone drives at the same time. Similarly, if two features are rarely active simultaneously (they’re sparse), they can share the same neural dimensions with minimal interference.
Anthropic’s 2022 paper “Toy Models of Superposition” made this precise. They built a minimal autoencoder: encode n sparse input features through a bottleneck of m < n dimensions, then decode back. The key finding: when features are sparse enough, the network learns to represent ALL n features in just m dimensions — it discovers superposition spontaneously.
Let’s build that toy model ourselves and watch superposition emerge:
import numpy as np

def train_superposition_model(n_features=8, bottleneck=2, sparsity=0.05,
                              steps=2000, lr=0.01):
    """Toy model of superposition: compress n sparse features into a bottleneck."""
    # Importance weights — first features matter more
    importance = np.array([0.7 ** i for i in range(n_features)])
    # Initialize encoder (n_features -> bottleneck) and decoder (bottleneck -> n_features)
    W_enc = np.random.randn(bottleneck, n_features) * 0.5
    W_dec = np.random.randn(n_features, bottleneck) * 0.5
    b_dec = np.zeros(n_features)
    for step in range(steps):
        # Generate sparse input: each feature active with probability = sparsity
        x = np.random.uniform(0, 1, (64, n_features))
        mask = (np.random.rand(64, n_features) < sparsity).astype(float)
        x = x * mask  # sparse activations
        # Forward: encode -> decode -> ReLU
        h = x @ W_enc.T  # (64, bottleneck)
        x_hat = np.maximum(0, h @ W_dec.T + b_dec)  # (64, n_features)
        # Importance-weighted MSE loss
        diff = (x_hat - x) * importance
        loss = np.mean(diff ** 2)
        # Backward (manual gradients)
        grad_out = 2 * diff * importance / x.shape[0]
        grad_out = grad_out * (x_hat > 0)  # ReLU derivative
        W_dec -= lr * (grad_out.T @ h)
        b_dec -= lr * grad_out.sum(axis=0)
        W_enc -= lr * ((grad_out @ W_dec).T @ x)
    # Extract learned feature directions (columns of W_enc)
    feature_dirs = W_enc.T  # (n_features, bottleneck)
    norms = np.linalg.norm(feature_dirs, axis=1, keepdims=True) + 1e-8
    feature_dirs_normed = feature_dirs / norms
    # Compute interference matrix: |cos similarity| between all feature pairs
    cos_sim = np.abs(feature_dirs_normed @ feature_dirs_normed.T)
    np.fill_diagonal(cos_sim, 0)
    print(f"Features: {n_features}, Bottleneck: {bottleneck}, Sparsity: {sparsity}")
    print(f"Final loss: {loss:.4f}")
    print(f"Mean interference between features: {cos_sim.mean():.3f}")
    print(f"Max interference: {cos_sim.max():.3f}")
    return W_enc, cos_sim

# Dense features: only top-2 features survive (PCA-like)
print("=== Dense features (sparsity=0.9) ===")
_, sim_dense = train_superposition_model(sparsity=0.9)

# Sparse features: ALL 8 features packed into 2D (superposition!)
print("\n=== Sparse features (sparsity=0.05) ===")
_, sim_sparse = train_superposition_model(sparsity=0.05)
With dense features (sparsity=0.9), the bottleneck can only faithfully represent 2 features — the most important ones, exactly like PCA. The remaining 6 features are simply dropped. But with sparse features (sparsity=0.05), something remarkable happens: the model crams all 8 features into just 2 dimensions. The feature directions spread out into a regular geometric pattern, like the spokes of a wheel, maximizing the angular separation between them. This is superposition in action — and it’s why individual neurons are polysemantic.
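The "spokes of a wheel" geometry can be checked directly. The sketch below (independent of the training code above, with hypothetical directions rather than learned ones) places eight unit vectors evenly around the circle and measures their pairwise interference. One detail worth noticing: opposite spokes have cosine similarity −1, but because the toy model's output passes through a ReLU, that negative interference gets clipped away — which is why antipodal feature pairs are cheap for the network.

```python
import numpy as np

# Eight unit vectors evenly spaced around the circle: the arrangement the
# sparse toy model tends to converge to in a 2-D bottleneck.
n = 8
angles = 2 * np.pi * np.arange(n) / n
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (8, 2)

cos = dirs @ dirs.T  # signed cosine similarity between all pairs

# Adjacent spokes are 45 degrees apart: cos(45°) ≈ 0.707 interference.
print(f"Adjacent pair cos:  {cos[0, 1]:.3f}")
# Opposite spokes are antipodal: cos = -1, but a ReLU on the output clips
# the negative contribution, so antipodal features barely interfere.
print(f"Antipodal pair cos: {cos[0, 4]:.3f}")
```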
Interactive: Superposition Explorer
Watch features pack into a 2D bottleneck. With few features, they align with the axes. Add more features to see superposition emerge — arrows spread into geometric patterns. Low activation probability means sparser features, which allows more superposition.
Probing Classifiers
Superposition tells us that features are directions, not individual neurons. But which directions encode which concepts? The simplest tool to answer this is the probing classifier: train a tiny linear model on a network’s internal activations to predict some property of the input.
The logic is straightforward. If a linear probe trained on layer 3’s activations can predict whether the input sentence is positive or negative with 95% accuracy, then sentiment information is linearly encoded at layer 3. If the same probe at layer 1 only gets 55%, then that layer hasn’t yet extracted sentiment — it’s still processing lower-level features.
Why insist on linear probes? Because a powerful nonlinear probe (an MLP with many layers) could learn the concept itself from raw activations, rather than detecting that the network already represents it. A linear probe can only succeed if the information is already present in a linearly accessible form. This is a feature, not a limitation.
Let’s probe a small network layer by layer and watch information build up:
import numpy as np

def build_and_probe_network():
    """Train a 3-layer MLP, then probe each layer for the target concept."""
    np.random.seed(42)
    # Synthetic task: classify 2D points as positive (y > sin(x)) or negative
    n = 500
    X = np.random.randn(n, 2) * 2
    y = (X[:, 1] > np.sin(X[:, 0])).astype(float)
    # MLP with 3 hidden layers: 2 -> 16 -> 16 -> 16 -> 1
    dims = [2, 16, 16, 16, 1]
    weights, biases = [], []
    for i in range(len(dims) - 1):
        w = np.random.randn(dims[i], dims[i+1]) * np.sqrt(2 / dims[i])
        b = np.zeros(dims[i+1])
        weights.append(w); biases.append(b)
    # Train the MLP (simple SGD)
    for epoch in range(300):
        # Forward pass — save activations at each layer
        activations = [X]
        h = X
        for i in range(len(weights) - 1):
            h = np.maximum(0, h @ weights[i] + biases[i])  # ReLU
            activations.append(h)
        logits = h @ weights[-1] + biases[-1]
        pred = 1 / (1 + np.exp(-logits.squeeze()))
        # Backward pass and update (simplified)
        grad = (pred - y).reshape(-1, 1) / n
        for i in range(len(weights) - 1, -1, -1):
            gw = activations[i].T @ grad
            weights[i] -= 0.5 * gw
            biases[i] -= 0.5 * grad.sum(axis=0)
            if i > 0:
                grad = (grad @ weights[i].T) * (activations[i] > 0)
    # Now PROBE each hidden layer for the target label
    for layer_idx in range(1, len(activations)):
        A = activations[layer_idx]  # (n, 16)
        # Linear probe: least-squares linear classifier (closed-form via lstsq)
        A_bias = np.column_stack([A, np.ones(n)])
        w_probe = np.linalg.lstsq(A_bias, y, rcond=None)[0]
        probe_pred = (A_bias @ w_probe > 0.5).astype(float)
        accuracy = np.mean(probe_pred == y)
        print(f"Layer {layer_idx} probe accuracy: {accuracy:.1%}")

build_and_probe_network()
# Layer 1 probe accuracy: ~72% (raw features, partial info)
# Layer 2 probe accuracy: ~88% (building abstract representations)
# Layer 3 probe accuracy: ~96% (task-relevant encoding)
The probe accuracy increases layer by layer — a direct window into how the network progressively transforms raw input features into task-relevant representations. Early layers detect simple boundaries, middle layers combine them, and late layers encode the exact decision the network needs to make. This is the information flow through the network, made visible.
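A standard sanity check before trusting any probe result: rerun the probe with shuffled labels. If accuracy stays high, the probe is memorizing rather than detecting encoded information — the "control task" idea of Hewitt and Liang. A minimal sketch with synthetic stand-in activations (not the network above):

```python
import numpy as np

# Control-task check: a probe only counts as evidence if it FAILS on
# shuffled labels. A (synthetic activations) and y are hypothetical stand-ins.
rng = np.random.default_rng(0)
n, d = 500, 16
A = rng.standard_normal((n, d))          # stand-in for layer activations
w_true = rng.standard_normal(d)
y = (A @ w_true > 0).astype(float)       # label linearly encoded in A

def probe_accuracy(A, y):
    A_bias = np.column_stack([A, np.ones(len(A))])
    w = np.linalg.lstsq(A_bias, y, rcond=None)[0]
    return np.mean(((A_bias @ w) > 0.5) == y)

real = probe_accuracy(A, y)                    # high: info is present
control = probe_accuracy(A, rng.permutation(y))  # ~chance: probe isn't memorizing
print(f"Probe on real labels:     {real:.1%}")
print(f"Probe on shuffled labels: {control:.1%}")
```

The gap between the two numbers, not the raw accuracy, is the evidence that the information is actually encoded.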
Activation Patching
Probes tell us what information exists at each layer. But existence doesn’t imply usage. A network might encode the color of an object in layer 2 but never actually use it for classification. To establish causal importance, we need a stronger tool: activation patching.
The idea comes from neuroscience: to test whether a brain region is necessary for a behavior, lesion it and see what happens. For neural networks, we do something more surgical. Run the model on a “clean” input that produces the correct answer. Run it again on a “corrupted” input that produces the wrong answer. Then, one component at a time, patch the clean activation into the corrupted run and measure how much the output recovers.
If restoring a particular layer at a particular position brings back the correct answer, that component is causally responsible for the computation. This is exactly the method Kevin Meng and colleagues used in their 2022 paper to discover that factual knowledge lives in the MLP layers at the last subject token position.
import numpy as np

def activation_patching_demo():
    """Demonstrate activation patching on a tiny 3-layer network."""
    np.random.seed(7)
    n_tokens, d_model, n_layers = 4, 8, 3
    # Simulated model: each layer transforms the residual stream
    layer_weights = [np.random.randn(d_model, d_model) * 0.3 for _ in range(n_layers)]
    unembed = np.random.randn(d_model, 10) * 0.3  # project to 10-class vocab

    def forward(x, patch_layer=None, patch_pos=None, patch_val=None):
        """Forward pass with optional activation patching."""
        residual = x.copy()  # (n_tokens, d_model)
        for layer in range(n_layers):
            residual = residual + np.tanh(residual @ layer_weights[layer])
            if patch_layer == layer and patch_val is not None:
                residual[patch_pos] = patch_val[patch_pos]
        logits = residual @ unembed  # (n_tokens, vocab)
        return residual, logits

    # Clean input: a specific pattern
    x_clean = np.random.randn(n_tokens, d_model)
    clean_residuals = []
    residual = x_clean.copy()
    for layer in range(n_layers):
        residual = residual + np.tanh(residual @ layer_weights[layer])
        clean_residuals.append(residual.copy())
    _, clean_logits = forward(x_clean)
    target_class = clean_logits[-1].argmax()  # correct answer at last position
    # Corrupted input: add noise to first two token positions
    x_corrupt = x_clean.copy()
    x_corrupt[:2] += np.random.randn(2, d_model) * 2.0
    _, corrupt_logits = forward(x_corrupt)
    corrupt_prob = np.exp(corrupt_logits[-1]) / np.exp(corrupt_logits[-1]).sum()
    # Patch each (layer, position) and measure recovery
    clean_prob = np.exp(clean_logits[-1]) / np.exp(clean_logits[-1]).sum()
    base_correct = corrupt_prob[target_class]
    print(f"Target class: {target_class}")
    print(f"Clean P(correct): {clean_prob[target_class]:.3f}")
    print(f"Corrupt P(correct): {base_correct:.3f}\n")
    print("Recovery after patching (layer x position):")
    print(f"{'':<10}", end="")
    for pos in range(n_tokens):
        print(f"Pos {pos:<6}", end="")
    print()
    for layer in range(n_layers):
        print(f"Layer {layer}: ", end="")
        for pos in range(n_tokens):
            _, patched_logits = forward(
                x_corrupt, patch_layer=layer,
                patch_pos=pos, patch_val=clean_residuals[layer]
            )
            patched_prob = np.exp(patched_logits[-1]) / np.exp(patched_logits[-1]).sum()
            recovery = (patched_prob[target_class] - base_correct) / \
                       (clean_prob[target_class] - base_correct + 1e-8)
            print(f"{recovery:<6.2f} ", end="")
        print()

activation_patching_demo()
The output is a grid showing how much each (layer, position) pair contributes to the correct answer. High recovery values reveal the critical path — the components the network actually relies on. This is how researchers trace information flow through transformers, identifying which attention heads store factual knowledge and which MLP layers perform the final computation.
Probes ask “is this information here?” Activation patching asks “does this information matter?” The distinction is the difference between correlation and causation.
The Logit Lens
Here’s one of the most elegant ideas in interpretability. In a transformer, the residual stream is a shared communication channel — every layer reads from it and writes to it (as we built in the transformer from scratch post). The final layer projects this stream into vocabulary space to make a prediction. But what happens if we apply that same projection at intermediate layers?
That’s the logit lens, discovered by the pseudonymous researcher nostalgebraist in 2020. At every layer, take the current residual stream, multiply by the unembedding matrix, apply softmax, and read off the model’s “current best guess.” The result is a window into the model’s evolving thought process — watch a vague prediction crystallize into a confident answer as information flows through the layers.
import numpy as np

def logit_lens_demo():
    """Apply the logit lens to a tiny transformer — read predictions at each layer."""
    np.random.seed(21)
    vocab = ["cat", "dog", "fish", "bird", "tree", "rock", "sky", "sun"]
    n_vocab, d_model, n_layers = len(vocab), 16, 6
    # Random embeddings and layer transformations
    embed = np.random.randn(n_vocab, d_model) * 0.5
    unembed = np.random.randn(d_model, n_vocab) * 0.3
    layers = [np.random.randn(d_model, d_model) * 0.2 for _ in range(n_layers)]
    # Input: token index 0 ("cat")
    x = embed[0]  # (d_model,)
    residual = x.copy()
    print("Logit Lens — predictions at each layer:\n")
    print(f"{'Layer':<8} {'Top-1':<8} {'P(top-1)':<10} {'Top-3 predictions'}")
    print("-" * 55)
    for layer_idx in range(n_layers):
        # Layer transform (simplified: tanh nonlinearity + residual)
        residual = residual + np.tanh(residual @ layers[layer_idx]) * 0.5
        # Logit lens: project to vocab space
        logits = residual @ unembed
        probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
        # Top-3 predictions
        top3 = np.argsort(probs)[::-1][:3]
        top1_word = vocab[top3[0]]
        top1_prob = probs[top3[0]]
        top3_str = ", ".join(f"{vocab[i]} ({probs[i]:.2f})" for i in top3)
        print(f"L{layer_idx + 1:<6} {top1_word:<8} {top1_prob:<10.3f} {top3_str}")
    print("\nWatch the prediction sharpen — early layers are uncertain,")
    print("later layers converge as the residual stream accumulates information.")

logit_lens_demo()
In a real language model, the logit lens reveals something beautiful: early layers predict broad semantic categories (“something about animals”), middle layers narrow down (“probably a pet”), and the final layers commit to the exact token. Each layer refines the prediction, writing corrections into the residual stream. The model literally changes its mind as computation proceeds.
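This convergence can be quantified. A common measure, used since nostalgebraist's original post, is the KL divergence between each layer's lens distribution and the final layer's distribution. The per-layer logits below are simulated stand-ins (noise shrinking with depth), not real model activations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between two probability vectors."""
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

rng = np.random.default_rng(3)
n_layers, n_vocab = 6, 8
target = rng.standard_normal(n_vocab) * 3  # stand-in for the final logits
noise = [rng.standard_normal(n_vocab) for _ in range(n_layers)]
# Simulated refinement: noise shrinks to zero as depth increases
layer_logits = [target + noise[i] * (1 - i / (n_layers - 1)) * 2
                for i in range(n_layers)]

final = softmax(layer_logits[-1])
for i, z in enumerate(layer_logits):
    print(f"Layer {i + 1}: KL(final || lens) = {kl(final, softmax(z)):.4f}")
```

In a real model the KL trends toward zero with depth, though not always monotonically — plateaus and sudden drops often mark the layers where the decisive computation happens.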
Interactive: Logit Lens Viewer
See how a model’s predictions evolve across layers. Each row is a layer, each cell shows a predicted token with size proportional to its probability. The gold-bordered cell shows where the final answer first appears.
A limitation worth noting: intermediate layers may use internal representations that don’t perfectly align with the unembedding matrix. This “representational drift” can make the logit lens noisy for some models. The Tuned Lens (Belrose et al. 2023) addresses this by learning a small affine transformation per layer, but the basic logit lens remains a remarkably effective first tool.
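The Tuned Lens idea can be sketched in a few lines. Belrose et al. train each layer's affine map to minimize KL against the final distribution; as a cheap stand-in, the sketch below fits the map with least squares against the final-layer residual. All data here is synthetic, with `drift` playing the role of representational drift:

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, d_model = 200, 16
h_mid = rng.standard_normal((n_samples, d_model))      # intermediate residuals
drift = rng.standard_normal((d_model, d_model)) * 0.3  # representational drift
h_final = h_mid @ (np.eye(d_model) + drift)            # final-layer residuals

# Fit the per-layer affine translator: h_final ≈ h_mid @ A + b
H = np.column_stack([h_mid, np.ones(n_samples)])
sol = np.linalg.lstsq(H, h_final, rcond=None)[0]
A, b = sol[:-1], sol[-1]

raw_err = np.mean((h_mid - h_final) ** 2)          # plain logit-lens mismatch
tuned_err = np.mean((h_mid @ A + b - h_final) ** 2)  # after the learned map
print(f"Raw logit-lens residual mismatch: {raw_err:.4f}")
print(f"Tuned-lens residual mismatch:     {tuned_err:.4f}")
```

After fitting, the unembedding is applied to `h_mid @ A + b` instead of `h_mid`, correcting for drift before reading out predictions.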
Attention Pattern Analysis
Attention heads are the most naturally interpretable component of a transformer. Unlike MLP layers (which apply a nonlinear transformation that’s hard to decompose), each attention head produces an explicit matrix of weights showing how much each token attends to every other token. We can literally read what the head is looking at.
Research has revealed that attention heads self-organize into distinct functional types during training:
- Previous-token heads — always attend to position i-1. They write “what came before me” into the residual stream, providing local context
- Induction heads — the crown jewel. They implement the pattern [A][B]...[A] → [B]: find a previous occurrence of the current token, then predict what followed it last time. This is the core mechanism of in-context learning
- Duplicate-token heads — attend to previous occurrences of the same token, regardless of what follows
- Positional heads — attend to fixed positions (first token, last token) across all inputs
The induction circuit is the most celebrated discovery in mechanistic interpretability. It requires two heads working together across layers: a previous-token head in an earlier layer writes “token B came after token A” into the residual stream, then an induction head in a later layer reads this signal to predict B whenever A appears again. This two-component circuit is sufficient to explain a large fraction of in-context learning behavior.
import numpy as np

def classify_attention_heads():
    """Generate and classify synthetic attention patterns by head type."""
    seq_len = 8
    tokens = ["The", "cat", "sat", "on", "the", "cat", "saw", "the"]

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Generate 4 attention pattern types
    heads = {}
    # 1. Previous-token head: strong diagonal offset by -1
    score = np.full((seq_len, seq_len), -10.0)
    for i in range(1, seq_len):
        score[i, i - 1] = 5.0
    score[0, 0] = 5.0  # first token attends to itself
    heads["Previous-token"] = softmax(score)
    # 2. Induction head: attends to token AFTER previous occurrence of current token
    score = np.full((seq_len, seq_len), -10.0)
    for i in range(seq_len):
        for j in range(i):
            if tokens[j] == tokens[i] and j + 1 < seq_len:
                score[i, j + 1] = 5.0  # attend to what followed the match
    # Fill rows with no match with uniform causal attention
    for i in range(seq_len):
        if score[i].max() < -5:
            score[i, :i + 1] = 0.0
    heads["Induction"] = softmax(score)
    # 3. Duplicate-token head: attends to previous occurrences of same token
    score = np.full((seq_len, seq_len), -10.0)
    for i in range(seq_len):
        for j in range(i):
            if tokens[j] == tokens[i]:
                score[i, j] = 5.0
        if score[i].max() < -5:
            score[i, :i + 1] = 0.0
    heads["Duplicate-token"] = softmax(score)
    # 4. Positional head: always attends to position 0
    score = np.full((seq_len, seq_len), -10.0)
    score[:, 0] = 5.0
    heads["Positional (BOS)"] = softmax(score)

    # Classify each head using diagnostic scores
    print(f"Tokens: {tokens}\n")
    for name, attn in heads.items():
        # Previous-token score: mean attention on position i-1
        prev_score = np.mean([attn[i, i - 1] for i in range(1, seq_len)])
        # Positional score: attention entropy (low = positional)
        entropy = -np.sum(attn * np.log(attn + 1e-10), axis=-1).mean()
        # Duplicate score: attention on matching earlier tokens
        dup_score = np.mean([max(attn[i, j] for j in range(i)
                                 if tokens[j] == tokens[i])
                             for i in range(seq_len)
                             if any(tokens[j] == tokens[i] for j in range(i))])
        print(f"{name:<20} prev={prev_score:.2f} entropy={entropy:.2f} dup={dup_score:.2f}")

classify_attention_heads()
Each head type has a distinctive statistical signature. Previous-token heads show near-1.0 attention on the diagonal offset. Positional heads have near-zero entropy (they always attend to the same position). Induction heads show high attention scores specifically at positions following repeated patterns. By computing these diagnostics across all heads in a trained model, we can automatically build a taxonomy of what each head does.
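The two-head induction circuit described earlier can be reduced to an even smaller sketch that operates on token strings directly — an illustrative abstraction of the algorithm, not a trained model. The function name and token sequences below are hypothetical:

```python
# Minimal sketch of the two-head induction circuit, on raw token lists.
def induction_predict(tokens):
    """Predict the next token at the last position via the induction circuit."""
    # Head 1 (previous-token head): record each position's predecessor
    prev = [None] + tokens[:-1]
    # Head 2 (induction head): scan back for a position whose PREDECESSOR
    # matches the current token, i.e. a position right after an earlier
    # occurrence of the current token, and copy what sits there.
    current = tokens[-1]
    for j in range(len(tokens) - 1, -1, -1):
        if prev[j] == current:
            return tokens[j]  # [A][B]...[A] -> predict [B]
    return None  # no earlier occurrence: the circuit stays silent

seq = ["A", "B", "C", "A"]
print(induction_predict(seq))  # "B": last time "A" appeared, "B" followed
```

Everything a real induction circuit does — the previous-token head's write, the induction head's match-and-copy — is compressed into these two steps; the hard part in a real transformer is that both are implemented as linear reads and writes on the residual stream.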
Sparse Autoencoders for Feature Extraction
We’ve now come full circle. The superposition section explained why neurons are polysemantic: superposition packs many features into few dimensions. The solution is to unpack them — and that’s exactly what sparse autoencoders do.
The architecture is simple (and we built it in detail in the sparse autoencoders post): take a layer’s activations, project them through an encoder into a much larger space (e.g., 512 neurons → 16,000 latent features), apply ReLU to enforce sparsity, then project back through a decoder. The loss is reconstruction error plus an L1 penalty on the latent codes:
Loss = ||x - decode(encode(x))||² + λ · ||encode(x)||₁
The L1 penalty is the key: it forces the autoencoder to represent each input using only a handful of active features. Each feature direction in the larger space ends up corresponding to a single interpretable concept — a monosemantic feature.
Anthropic demonstrated this at breathtaking scale. In 2023, Bricken et al. applied SAEs to a one-layer transformer and found that ~70% of extracted features were cleanly monosemantic. In 2024, Templeton et al. scaled up to Claude 3 Sonnet with 34 million latent features — finding abstract, multilingual features for concepts like “Golden Gate Bridge,” “code bugs,” and even “deceptive behavior.”
import numpy as np

def sae_interpretability_demo():
    """Train a sparse autoencoder to disentangle polysemantic activations."""
    np.random.seed(99)
    n_concepts = 5  # ground-truth concepts (sparse)
    d_model = 3     # model's hidden dim (bottleneck — forces superposition)
    d_sae = 12      # SAE latent dim (expanded — room for monosemantic features)
    n_samples = 2000
    # Ground-truth concept embeddings (randomly oriented in 3D)
    concept_dirs = np.random.randn(n_concepts, d_model)
    concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)
    concept_names = ["code", "math", "music", "sports", "cooking"]
    # Generate polysemantic activations: sparse mixtures of concepts in 3D
    X = np.zeros((n_samples, d_model))
    labels = np.zeros((n_samples, n_concepts))
    for i in range(n_samples):
        active = np.random.rand(n_concepts) < 0.15  # each concept 15% active
        strengths = np.random.uniform(0.5, 2.0, n_concepts) * active
        X[i] = strengths @ concept_dirs
        labels[i] = active
    # Train sparse autoencoder
    W_enc = np.random.randn(d_model, d_sae) * 0.3
    W_dec = np.random.randn(d_sae, d_model) * 0.3
    b_enc = np.zeros(d_sae)
    lam = 0.05  # L1 sparsity penalty
    for step in range(1500):
        idx = np.random.choice(n_samples, 128)
        x = X[idx]
        # Forward
        h = np.maximum(0, x @ W_enc + b_enc)  # (128, d_sae) — sparse codes
        x_hat = h @ W_dec                     # (128, d_model) — reconstruction
        # Loss = MSE + L1 (tracked for reference)
        recon_loss = np.mean((x_hat - x) ** 2)
        sparse_loss = lam * np.mean(np.abs(h))
        # Backward
        grad_xhat = 2 * (x_hat - x) / x.shape[0]
        grad_h = grad_xhat @ W_dec.T + lam * np.sign(h) / x.shape[0]
        grad_h = grad_h * (h > 0)  # ReLU derivative
        W_dec -= 0.02 * (h.T @ grad_xhat)
        W_enc -= 0.02 * (x.T @ grad_h)
        b_enc -= 0.02 * grad_h.sum(axis=0)
    # Interpret: for each SAE feature, find which concept it best matches
    print("SAE Feature Analysis:")
    print(f"{'Feature':<10} {'Best concept':<12} {'Correlation':<14} {'Avg L0'}")
    print("-" * 50)
    h_all = np.maximum(0, X @ W_enc + b_enc)
    # Avg L0: mean number of active SAE features per sample (dataset-wide)
    avg_l0 = (h_all > 0).sum(axis=1).mean()
    active_features = np.where(h_all.mean(axis=0) > 0.01)[0]
    for feat_idx in active_features[:8]:  # show top 8 active features
        activations = h_all[:, feat_idx]
        # Correlate with each ground-truth concept
        best_corr, best_concept = 0, 0
        for c in range(n_concepts):
            corr = np.corrcoef(activations, labels[:, c])[0, 1]
            if abs(corr) > abs(best_corr):
                best_corr, best_concept = corr, c
        print(f"f{feat_idx:<8} {concept_names[best_concept]:<12} {best_corr:<14.3f} {avg_l0:.1f}")

sae_interpretability_demo()
The SAE successfully disentangles the 5 concepts that were crammed into 3 polysemantic dimensions. Each SAE feature now corresponds to a single concept — feature 3 fires for “code,” feature 7 fires for “music,” and so on. This is exactly what happens at scale in real language models: the SAE finds thousands of monosemantic features hiding inside polysemantic neurons.
The Interpretability Toolkit
We’ve now built five complementary tools, each answering a different question about what a neural network has learned:
| Technique | Question Answered | Strength |
|---|---|---|
| Probing | Does the network encode concept X? | Simple, fast, works at any layer |
| Activation Patching | Does component Y causally affect the output? | Causal (not just correlational) |
| Logit Lens | What does the model predict at each layer? | Intuitive, visual, no training needed |
| Attention Analysis | What roles do attention heads play? | Directly readable from weights |
| Sparse Autoencoders | What monosemantic features exist? | Scales to large production models |
Together, these tools form a pipeline for circuit analysis — the ultimate goal of mechanistic interpretability. A circuit is a complete subgraph of a neural network that implements a specific behavior. To find one, you use activation patching to identify the important components, attention analysis to trace information flow, probes and the logit lens to understand intermediate representations, and SAEs to decompose those representations into individual features.
The field has moved at remarkable speed. In 2020, Chris Olah’s team was manually analyzing individual neurons in image classifiers. By 2024, Anthropic was extracting 34 million interpretable features from a production language model. The core question — can we truly understand what neural networks learn? — is still open, but the tools we built today are the foundation of every serious attempt to answer it.
The black box is no longer completely black. We’ve built the microscope. Now the real work begins: reading the circuits that intelligence is made of.
References & Further Reading
- Chris Olah, Nick Cammarata et al. — “Zoom In: An Introduction to Circuits” — the foundational paper establishing the circuits framework for neural network interpretability (Distill, 2020)
- nostalgebraist — “interpreting GPT: the logit lens” — the original post introducing the logit lens technique for reading transformer predictions at each layer (LessWrong, 2020)
- Nelson Elhage, Tristan Hume et al. — “Toy Models of Superposition” — Anthropic’s definitive study of how neural networks represent more features than they have dimensions (2022)
- Catherine Olsson, Nelson Elhage, Neel Nanda et al. — “In-context Learning and Induction Heads” — the discovery that a two-head attention circuit implements in-context learning (Anthropic, 2022)
- Kevin Meng, David Bau et al. — “Locating and Editing Factual Associations in GPT” — the activation patching (causal tracing) method for identifying where factual knowledge is stored (NeurIPS, 2022)
- Trenton Bricken et al. — “Towards Monosemanticity” — first demonstration of sparse autoencoders extracting monosemantic features from a language model (Anthropic, 2023)
- Nora Belrose et al. — “Eliciting Latent Predictions from Transformers with the Tuned Lens” — an improved version of the logit lens using learned affine probes per layer (2023)
- Adly Templeton, Tom Conerly et al. — “Scaling Monosemanticity” — scaling sparse autoencoders to Claude 3 Sonnet, extracting 34 million interpretable features (Anthropic, 2024)
- Neel Nanda — “A Comprehensive Mechanistic Interpretability Explainer” — excellent practical guide for getting started with mech interp research
- Anthropic — Transformer Circuits Thread — the ongoing research thread publishing foundational interpretability results