Sparse Autoencoders from Scratch: The Microscope for Neural Networks
What Are Neural Networks Actually Learning?
The autoencoders post showed how to squeeze data through a tight bottleneck and reconstruct it on the other side. That’s the compression direction—fewer dimensions, denser representations. But what if the real mystery isn’t how networks compress information—it’s what they’re secretly representing inside?
Crack open a trained neural network and look at an individual neuron. You might expect neuron 42 to represent “cats” or neuron 117 to represent “code.” But that’s not what happens. Neuron 42 fires for cats, blue objects, and the letter Q simultaneously. Neuron 117 responds to Python code, cooking recipes, and discussions about the weather. Individual neurons are a mess—they mean many things at once.
This is the black box problem, and sparse autoencoders are the crowbar that pries it open. Instead of compressing into fewer dimensions like a standard autoencoder, a sparse autoencoder expands into more dimensions while forcing most of them to stay zero. The result is extraordinary: each active dimension represents one clean, interpretable concept. One feature for cats. One feature for Python code. One feature for the Golden Gate Bridge.
This is exactly how Anthropic peered inside Claude and discovered that individual features correspond to specific concepts—and then built “Golden Gate Claude” by cranking one feature to maximum. Let’s build the whole thing from scratch.
The Superposition Problem: Too Many Features, Too Few Neurons
Why are neurons such a mess? The answer is superposition—and it’s actually a clever trick the network learned on its own.
Imagine a network needs to track 5 binary features (say, “is round,” “is blue,” “has legs,” “is edible,” “makes sound”), but its hidden layer only has 2 neurons. With 2 dimensions, you can only represent 2 independent directions. But if those features are sparse—most inputs only activate 1 or 2 features at a time—the network discovers it can pack all 5 features into just 2 dimensions by using nearly orthogonal directions.
Think of it geometrically: 5 directions arranged like the vertices of a regular pentagon in a 2D plane. No two directions are orthogonal (adjacent ones overlap by cos 72° ≈ 0.31), but the interference is tolerable when only 1–2 features are active at once. The network trades some accuracy for much more representational capacity.
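A quick standalone check of that picture, computing the pairwise overlaps of five pentagon directions:

```python
import numpy as np

# Five unit vectors at the vertices of a regular pentagon in 2D
angles = 2 * np.pi * np.arange(5) / 5
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (5, 2)

# Pairwise overlaps (dot products) between distinct directions
overlaps = dirs @ dirs.T
off_diag = overlaps[~np.eye(5, dtype=bool)]
print(f"largest positive overlap: {off_diag.max():.3f}")  # cos 72  = +0.309
print(f"largest negative overlap: {off_diag.min():.3f}")  # cos 144 = -0.809
```

The positive overlaps are the small +0.31 interference; the larger overlaps are negative, and, as the toy-models paper notes, negative interference can be filtered out by a downstream ReLU.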
This was formalized by Elhage et al. at Anthropic in their landmark 2022 paper “Toy Models of Superposition.” They showed a clean phase diagram: dense, important features get dedicated neurons. Sparse, less important features get packed into superposition. The sparser the features, the more aggressively the network compresses.
Here’s superposition in action:
import numpy as np
def demonstrate_superposition():
    """5 sparse features packed into 2 dimensions."""
    np.random.seed(42)
    d_features, d_hidden = 5, 2
    # Ground-truth features: 5 one-hot vectors in 5D
    # Each data point activates 1-2 features (sparse!)
    n_samples = 500
    feature_probs = 0.15  # each feature active 15% of the time
    X = np.random.binomial(1, feature_probs, (n_samples, d_features)).astype(float)
    X *= np.random.uniform(0.5, 2.0, X.shape)  # random magnitudes
    # Importance weights: feature 0 most important, feature 4 least
    importance = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
    # Train a linear model to reconstruct X through a 2D bottleneck
    # Tied weights: X_hat = X @ W @ W.T, minimize importance-weighted MSE
    W = np.random.randn(d_features, d_hidden) * 0.5
    lr = 0.001
    for step in range(2000):
        H = X @ W                     # encode: (n, 2)
        X_hat = H @ W.T               # decode: (n, 5)
        R = importance * (X_hat - X)  # weighted residual: (n, 5)
        # Gradient via both paths through W (encoding and decoding)
        grad = X.T @ (R @ W) / n_samples  # path 1: through encoder
        grad += (R.T @ H) / n_samples     # path 2: through decoder
        W -= lr * 2 * grad
    # The learned directions for each feature
    print("Feature directions in 2D space (rows of W):")
    for i in range(d_features):
        direction = W[i] / (np.linalg.norm(W[i]) + 1e-8)
        print(f"  Feature {i} (importance={importance[i]:.1f}): "
              f"[{direction[0]:+.3f}, {direction[1]:+.3f}]")
    # Check polysemanticity: each hidden neuron responds to multiple features
    print("\nHidden neuron 0 responds to features:", end=" ")
    print([i for i in range(d_features) if abs(W[i, 0]) > 0.2])
    print("Hidden neuron 1 responds to features:", end=" ")
    print([i for i in range(d_features) if abs(W[i, 1]) > 0.2])

demonstrate_superposition()
The output reveals the problem: each hidden neuron participates in representing multiple features. Neuron 0 responds to features 0, 2, and 4. Neuron 1 responds to features 1, 3, and 4. The features are tangled together—polysemantic neurons. If you only look at individual neurons, you can’t figure out what the network learned.
Superposition is the default behavior of neural networks. And it makes individual neurons fundamentally uninterpretable.
The Sparse Autoencoder Architecture: Flipping the Bottleneck
Standard autoencoders compress: 784 dimensions → 32 dimensions → 784 dimensions. The bottleneck forces information loss, creating a compressed representation. Sparse autoencoders flip this entirely: 16 dimensions → 64 dimensions → 16 dimensions. The hidden layer is bigger than the input.
How can a bigger hidden layer be useful? Isn’t that just a trivial identity mapping? The trick is the sparsity constraint: for any given input, only a handful of those 64 dimensions are allowed to be nonzero. The overcomplete dictionary gives each concept room for its own dedicated dimension. The sparsity constraint ensures the network actually uses that room cleanly instead of distributing information across all dimensions.
One important detail from Bricken et al.’s original paper: before encoding, we subtract the decoder bias b_dec from the input. This pre-centering trick lets the decoder bias capture the mean activation, so the sparse features only need to represent deviations from the mean—a cleaner decomposition.
class SparseAutoencoder:
    """Overcomplete autoencoder with sparsity constraint."""

    def __init__(self, d_model, d_sae):
        self.d_model = d_model
        self.d_sae = d_sae
        # He initialization for ReLU
        scale_enc = np.sqrt(2.0 / d_model)
        scale_dec = np.sqrt(2.0 / d_sae)
        self.W_enc = np.random.randn(d_sae, d_model) * scale_enc
        self.b_enc = np.zeros(d_sae)
        self.W_dec = np.random.randn(d_model, d_sae) * scale_dec
        self.b_dec = np.zeros(d_model)
        # Normalize decoder columns to unit norm
        self.normalize_decoder()

    def normalize_decoder(self):
        """Constrain each decoder column to unit L2 norm."""
        norms = np.linalg.norm(self.W_dec, axis=0, keepdims=True)
        self.W_dec /= (norms + 1e-8)

    def encode(self, x):
        """Encode with pre-centering and ReLU activation."""
        x_centered = x - self.b_dec  # pre-center
        z_pre = x_centered @ self.W_enc.T + self.b_enc
        z = np.maximum(0, z_pre)     # ReLU: creates true zeros
        return z, z_pre

    def decode(self, z):
        """Reconstruct from sparse features."""
        return z @ self.W_dec.T + self.b_dec

    def forward(self, x):
        """Full forward pass: encode then decode."""
        z, z_pre = self.encode(x)
        x_hat = self.decode(z)
        return x_hat, z, z_pre
Notice the expansion: if d_model=16 and d_sae=64, that’s a 4× overcomplete dictionary. In practice, researchers use 4× to 64× for small experiments. Anthropic pushed to 2,800× when they trained a 34-million-feature SAE on Claude 3 Sonnet. Each column of W_dec is a “feature direction”—a vector in activation space that represents one learned concept.
The Loss Function: Reconstruction vs. Sparsity
The SAE loss function is a tug-of-war between two forces:
The reconstruction term wants perfect reproduction—pushing activations to be large and dense so nothing is lost. The L1 sparsity penalty wants most activations to be exactly zero—pushing toward minimal, clean representations. The balance between these two forces is what creates interpretable features.
But why L1 and not L2? This is the key geometric insight.
Picture the L1 constraint region: it’s a diamond in 2D (an octahedron in 3D, a cross-polytope in general) with sharp corners that sit exactly on the axes. The L2 region is a circle (or hypersphere)—smooth everywhere, no corners. When the loss minimum falls near the constraint boundary, L1’s diamond shape means the solution lands at a corner—where some coordinates are exactly zero. L2’s smooth circle gives solutions where coordinates are small but never quite zero. L1 creates true sparsity. L2 creates “approximately sparse.”
At the implementation level, L1’s gradient is the sign function: +1 for positive activations, −1 for negative, undefined at zero. This constant-magnitude gradient keeps pulling activations toward zero with the same force regardless of how small they already are—until they cross zero and get clamped by ReLU.
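You can watch that constant-force pull in one dimension. A minimal standalone sketch: gradient descent on λ|z| versus λz², starting from the same activation value:

```python
import numpy as np

z_l1, z_l2 = 0.5, 0.5  # same starting activation for both penalties
lam, lr = 1.0, 0.1
for _ in range(100):
    # L1: constant-magnitude pull toward zero, clamped at zero (like ReLU)
    z_l1 = max(0.0, z_l1 - lr * lam * np.sign(z_l1))
    # L2: pull proportional to the value -- shrinks but never reaches zero
    z_l2 = z_l2 - lr * 2 * lam * z_l2

print(z_l1)  # 0.0 -- exactly zero after 5 steps, and it stays there
print(z_l2)  # tiny but never exactly zero
```

The L1 trajectory hits exact zero and stays; the L2 trajectory only decays geometrically.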
def sae_loss(x, x_hat, z, lambda_l1):
    """
    Compute SAE loss: reconstruction + L1 sparsity.
    Returns: total_loss, recon_loss, l1_loss, l0
    """
    # Reconstruction: mean squared error
    recon_loss = np.mean((x - x_hat) ** 2)
    # Sparsity: L1 norm of hidden activations
    l1_loss = np.mean(np.sum(np.abs(z), axis=1))
    # Total loss
    total_loss = recon_loss + lambda_l1 * l1_loss
    # L0: average number of nonzero features per input
    l0 = np.mean(np.sum(z > 0, axis=1))
    return total_loss, recon_loss, l1_loss, l0

def l1_gradient(z):
    """Subgradient of L1 norm: sign(z), with 0 at z=0."""
    return np.sign(z)
The key metrics to track during training:
- L0: average number of nonzero features per input. Lower = sparser. Target depends on the application—typically 10–100 for research SAEs.
- Reconstruction MSE: how faithfully we reproduce the original activations. Must stay low enough that the SAE doesn’t destroy important information.
- Dead features: fraction of SAE neurons that never activate on any input. These waste capacity—a real practical challenge at scale.
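L0 and the dead-feature fraction both fall out of a single batch of hidden activations; a toy standalone sketch:

```python
import numpy as np

z = np.array([[0.0, 1.2, 0.0, 0.0],
              [0.0, 0.7, 0.3, 0.0],
              [0.9, 0.0, 0.0, 0.0]])      # (3 inputs, 4 SAE features)

l0 = (z > 0).sum(axis=1).mean()          # avg nonzero features per input
dead_frac = (z.max(axis=0) == 0).mean()  # features that never fire

print(f"L0 = {l0:.2f}")            # 1.33
print(f"dead = {dead_frac:.0%}")   # 25% (feature 3 never activates)
```

Reconstruction MSE needs the reconstructions as well; it is the usual mean squared error against the original activations.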
Training the SAE: Watching Features Emerge
Let’s train on synthetic data where we know the ground truth. We’ll create 8 true features living in a 4-dimensional space (2× overcomplete—exactly the superposition scenario). Each data point is a sparse combination of 1–3 features. Then we’ll train a 32-feature SAE (8× overcomplete) and verify it recovers the true features.
def train_sae_synthetic():
    """Train SAE on synthetic data with known ground-truth features."""
    np.random.seed(0)
    d_model, d_sae = 4, 32
    n_true_features = 8

    # Ground-truth features: 8 random unit vectors in 4D
    true_features = np.random.randn(n_true_features, d_model)
    true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

    # Generate sparse data: each sample uses 1-3 features
    n_samples = 5000
    X = np.zeros((n_samples, d_model))
    for i in range(n_samples):
        k = np.random.randint(1, 4)  # 1-3 active features
        active = np.random.choice(n_true_features, k, replace=False)
        coeffs = np.random.uniform(0.5, 2.0, k)
        for j, a in enumerate(active):
            X[i] += coeffs[j] * true_features[a]

    # Initialize SAE
    sae = SparseAutoencoder(d_model, d_sae)
    lambda_l1 = 0.04
    lr = 0.003

    # Adam optimizer state
    param_names = ['W_enc', 'b_enc', 'W_dec', 'b_dec']
    m = {k: np.zeros_like(getattr(sae, k)) for k in param_names}
    v = {k: np.zeros_like(getattr(sae, k)) for k in param_names}

    for epoch in range(200):
        # Forward pass
        x_hat, z, z_pre = sae.forward(X)
        total, recon, l1, l0 = sae_loss(X, x_hat, z, lambda_l1)

        # Backward pass: gradients through decoder and encoder
        d_xhat = 2 * (x_hat - X) / X.shape[0]  # d(recon)/d(x_hat)
        # Gradient through decoder: x_hat = z @ W_dec.T + b_dec
        d_z = d_xhat @ sae.W_dec + lambda_l1 * l1_gradient(z) / X.shape[0]
        d_z_pre = d_z * (z_pre > 0)           # ReLU mask
        d_Wdec = d_xhat.T @ z                 # (d_model, d_sae)
        # Gradient through encoder: z_pre = (X - b_dec) @ W_enc.T + b_enc
        d_Wenc = d_z_pre.T @ (X - sae.b_dec)  # (d_sae, d_model)
        d_benc = np.sum(d_z_pre, axis=0)      # (d_sae,)
        # b_dec appears in both encoder (pre-centering) and decoder
        d_bdec = d_xhat.sum(axis=0) - (d_z_pre @ sae.W_enc).sum(axis=0)

        # Adam updates (simplified: beta1=0.9, beta2=0.999, eps=1e-8)
        grads = {'W_enc': d_Wenc, 'b_enc': d_benc,
                 'W_dec': d_Wdec, 'b_dec': d_bdec}
        t = epoch + 1
        for key in grads:
            m[key] = 0.9 * m[key] + 0.1 * grads[key]
            v[key] = 0.999 * v[key] + 0.001 * grads[key] ** 2
            m_hat = m[key] / (1 - 0.9 ** t)
            v_hat = v[key] / (1 - 0.999 ** t)
            param = getattr(sae, key)
            param -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        sae.normalize_decoder()  # unit norm constraint on decoder columns

        if epoch % 50 == 0:
            dead = np.mean(np.all(z == 0, axis=0))
            print(f"Epoch {epoch:3d} | Recon: {recon:.4f} | "
                  f"L0: {l0:.1f} | Dead: {dead:.0%}")

    # Verify feature recovery: match SAE features to ground truth
    alive_mask = np.any(z > 0, axis=0)
    alive_cols = sae.W_dec[:, alive_mask].T  # (n_alive, d_model)
    alive_cols /= np.linalg.norm(alive_cols, axis=1, keepdims=True)
    similarity = np.abs(alive_cols @ true_features.T)  # cosine similarity
    best_match = np.max(similarity, axis=0)
    print("\nFeature recovery (cosine similarity to ground truth):")
    for i in range(n_true_features):
        print(f"  True feature {i}: best match = {best_match[i]:.3f}")

train_sae_synthetic()
The training dynamics are revealing. Early on, L0 is high—many features fire for each input (dense, entangled). As the L1 penalty takes effect, features specialize. L0 drops. By the end, each input activates only 2–4 SAE features, and those features align closely with the true ground-truth directions. The SAE has reversed the superposition.
The decoder norm constraint deserves emphasis: after every gradient step, we normalize each column of W_dec to unit length. Without this, the network “cheats”—it can shrink decoder weights while inflating encoder weights, making L1 look small without actually learning sparse features. The norm constraint forces honest sparsity.
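The cheat is easy to demonstrate in isolation: rescaling activations down and decoder weights up by the same factor leaves the reconstruction untouched while slashing the L1 penalty. A standalone sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=(10, 32)))   # nonnegative hidden activations
W_dec = rng.normal(size=(4, 32))        # decoder: (d_model, d_sae)

x_hat = z @ W_dec.T
x_hat_cheat = (z / 10) @ (W_dec * 10).T  # shrink z 10x, inflate W_dec 10x

same = np.allclose(x_hat, x_hat_cheat)
l1_ratio = np.abs(z / 10).sum() / np.abs(z).sum()
print(same)      # True: reconstruction is identical
print(l1_ratio)  # 0.1: L1 penalty cut by 10x without any real sparsity
```

Pinning each decoder column to unit norm removes exactly this degree of freedom.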
From Polysemantic to Monosemantic: The Key Experiment
Synthetic data is nice, but the real magic happens when we apply SAEs to an actual neural network’s internals. Let’s train a small classifier, then use an SAE to decompose its hidden activations into interpretable features.
def monosemanticity_experiment():
    """Train a classifier, then decompose its hidden layer with an SAE."""
    np.random.seed(7)
    n_classes, d_input, d_hidden = 8, 20, 16

    # Generate clustered data: 8 classes with distinct centers
    centers = np.random.randn(n_classes, d_input) * 3
    X_all, Y_all = [], []
    for c in range(n_classes):
        pts = centers[c] + np.random.randn(200, d_input) * 0.8
        X_all.append(pts)
        Y_all.extend([c] * 200)
    X_all = np.vstack(X_all)
    Y_all = np.array(Y_all)

    # Train a simple 2-layer MLP classifier
    W1 = np.random.randn(d_input, d_hidden) * np.sqrt(2.0 / d_input)
    b1 = np.zeros(d_hidden)
    W2 = np.random.randn(d_hidden, n_classes) * np.sqrt(2.0 / d_hidden)
    b2 = np.zeros(n_classes)
    lr = 0.01
    for step in range(500):
        h = np.maximum(0, X_all @ W1 + b1)  # hidden activations
        logits = h @ W2 + b2
        # Softmax + cross-entropy (stable)
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        one_hot = np.eye(n_classes)[Y_all]
        d_logits = (probs - one_hot) / len(Y_all)
        # Backprop
        d_W2 = h.T @ d_logits
        d_b2 = d_logits.sum(axis=0)
        d_h = d_logits @ W2.T * (h > 0)
        d_W1 = X_all.T @ d_h
        d_b1 = d_h.sum(axis=0)
        W1 -= lr * d_W1; b1 -= lr * d_b1
        W2 -= lr * d_W2; b2 -= lr * d_b2

    # Extract hidden activations
    H = np.maximum(0, X_all @ W1 + b1)

    # Measure polysemanticity of original neurons
    print("=== Original Hidden Neurons (Polysemantic) ===")
    for neuron in [0, 3, 7]:
        active_classes = []
        for c in range(n_classes):
            mask = Y_all == c
            mean_act = H[mask, neuron].mean()
            if mean_act > 0.5:
                active_classes.append(c)
        print(f"  Neuron {neuron} responds to classes: {active_classes}")

    # Train SAE on hidden activations
    d_sae = 64  # 4x overcomplete
    sae = SparseAutoencoder(d_hidden, d_sae)
    # (Training loop similar to above, abbreviated for clarity)
    lambda_l1 = 0.05
    lr = 0.005
    param_names = ['W_enc', 'b_enc', 'W_dec', 'b_dec']
    m_s = {k: np.zeros_like(getattr(sae, k)) for k in param_names}
    v_s = {k: np.zeros_like(getattr(sae, k)) for k in param_names}
    for epoch in range(300):
        x_hat, z, z_pre = sae.forward(H)
        d_xhat = 2 * (x_hat - H) / H.shape[0]
        d_z = d_xhat @ sae.W_dec + lambda_l1 * np.sign(z) / H.shape[0]
        d_z_pre = d_z * (z_pre > 0)
        grads = {
            'W_dec': d_xhat.T @ z,
            'W_enc': d_z_pre.T @ (H - sae.b_dec),
            'b_enc': d_z_pre.sum(axis=0),
            'b_dec': d_xhat.sum(axis=0) - (d_z_pre @ sae.W_enc).sum(axis=0)
        }
        t = epoch + 1
        for key in grads:
            m_s[key] = 0.9 * m_s[key] + 0.1 * grads[key]
            v_s[key] = 0.999 * v_s[key] + 0.001 * grads[key] ** 2
            param = getattr(sae, key)
            param -= lr * (m_s[key] / (1 - 0.9 ** t)) / (
                np.sqrt(v_s[key] / (1 - 0.999 ** t)) + 1e-8)
        sae.normalize_decoder()

    # Analyze SAE features: are they monosemantic?
    _, z_final, _ = sae.forward(H)
    print("\n=== SAE Features (Monosemantic) ===")
    alive_features = np.where(np.any(z_final > 0.1, axis=0))[0]
    for feat in alive_features[:8]:
        activations_by_class = []
        for c in range(n_classes):
            mask = Y_all == c
            activations_by_class.append(z_final[mask, feat].mean())
        dominant = np.argmax(activations_by_class)
        purity = max(activations_by_class) / (sum(activations_by_class) + 1e-8)
        print(f"  Feature {feat:2d}: dominant class = {dominant}, "
              f"purity = {purity:.1%}")

monosemanticity_experiment()
The contrast is striking. Original neurons respond to 2–4 classes each—polysemantic. SAE features respond to one class each with high purity—monosemantic. The SAE has untangled the superposed representations into clean, interpretable components.
This is exactly what Anthropic did at scale. They trained SAEs on Claude 3 Sonnet’s activations (with 34 million features at their largest) and found features for “Golden Gate Bridge,” “code errors,” “DNA sequences,” “legal language,” and thousands of other interpretable concepts. When they artificially amplified the Golden Gate Bridge feature during inference, Claude started self-identifying as the bridge—a vivid demonstration that these features are real and causally meaningful.
Beyond L1: The TopK Alternative
The L1 penalty has a subtle problem: activation shrinkage. Because L1 applies a constant-magnitude gradient pushing all activations toward zero, even the features that should be active have their magnitudes systematically reduced. The network underestimates how strongly each feature fires.
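The shrinkage has a precise size. For a single activation, minimizing ½(z − a)² + λ|z| gives the soft-threshold solution z* = sign(a)·max(|a| − λ, 0): every feature that stays active is reported exactly λ too small. A brute-force standalone check of that claim:

```python
import numpy as np

a, lam = 1.5, 0.4  # "true" activation and L1 coefficient
zs = np.linspace(-3, 3, 60001)
obj = 0.5 * (zs - a) ** 2 + lam * np.abs(zs)
z_star = zs[np.argmin(obj)]

print(z_star)  # ~1.1 -- the true 1.5, systematically shrunk by lam = 0.4
```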
Gao et al. at OpenAI proposed an elegant fix in 2024: TopK activation. Instead of using L1 to encourage sparsity, just keep the top k activations and hard-zero the rest. Sparsity is enforced by selection, not penalization, so the active features pass through with their full, unshrunk magnitudes.
def topk_encode(x, W_enc, b_enc, b_dec, k):
    """TopK SAE encoder: keep k largest activations, zero the rest."""
    x_centered = x - b_dec
    z_pre = x_centered @ W_enc.T + b_enc
    z_pre = np.maximum(0, z_pre)  # ReLU first
    # Keep only top-k activations per sample
    z = np.zeros_like(z_pre)
    for i in range(len(z_pre)):
        if np.sum(z_pre[i] > 0) <= k:
            z[i] = z_pre[i]  # fewer than k active: keep all
        else:
            topk_idx = np.argpartition(z_pre[i], -k)[-k:]
            z[i, topk_idx] = z_pre[i, topk_idx]
    return z

# With TopK, the loss is pure reconstruction — no L1 term!
#   L_topk = ||x - x_hat||^2
# Sparsity is exact: always <= k active features per input.
# No shrinkage: active feature magnitudes are preserved exactly.
# Compare activation distributions:
#   L1-SAE:   many small values clustered near zero (shrinkage)
#   TopK-SAE: clean bimodal distribution (zero or full magnitude)
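To make the hard-k guarantee concrete, here is the selection step on a single toy vector (standalone, using the same argpartition trick):

```python
import numpy as np

z_pre = np.array([0.0, 2.0, 0.5, 3.0, 0.1, 1.2])  # post-ReLU pre-activations
k = 2
z = np.zeros_like(z_pre)
top_idx = np.argpartition(z_pre, -k)[-k:]  # indices of the k largest values
z[top_idx] = z_pre[top_idx]

print(z)  # [0. 2. 0. 3. 0. 0.] -- exactly k active, magnitudes unshrunk
```

Exactly two features survive, and their values pass through at full magnitude, with no λ-sized shrinkage.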
TopK simplifies hyperparameter tuning dramatically: instead of sweeping λ and hoping for the right L0, you directly choose k (how many features can be active). The tradeoff between sparsity and reconstruction becomes a single, interpretable knob.
The field has moved fast since the original L1 approach. DeepMind introduced Gated SAEs (separating the “should this feature fire?” decision from “how strongly?”) and JumpReLU SAEs (learnable per-feature thresholds), and Bussmann proposed BatchTopK (variable sparsity across tokens in a batch). All of these improvements target the same core problem: getting cleaner sparsity without sacrificing reconstruction quality.
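For a flavor of the JumpReLU idea, here is its forward pass in miniature: a per-feature threshold θ, below which activations are zeroed and above which they pass through unshrunk. (This is only the inference-time sketch; training θ requires straight-through gradient estimates, omitted here.)

```python
import numpy as np

def jumprelu(z_pre, theta):
    """JumpReLU forward pass (sketch): zero below a per-feature threshold."""
    return z_pre * (z_pre > theta)

z_pre = np.array([0.05, 0.30, 1.70, -0.20])
theta = np.array([0.10, 0.50, 0.50, 0.10])  # per-feature thresholds
print(jumprelu(z_pre, theta))  # [0.  0.  1.7 0. ]
```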
Why This Matters: Opening the Black Box
Sparse autoencoders aren’t just a curiosity—they’re the most promising tool we have for understanding what large language models actually know.
Anthropic’s “Scaling Monosemanticity” paper (Templeton, Conerly et al., 2024) trained SAEs with up to 34 million features on Claude 3 Sonnet’s residual stream. What they found was remarkable:
- Abstract, multilingual features: A “Golden Gate Bridge” feature that responds to text mentions, photographs, and related concepts across languages.
- Geographic clustering: Near the Golden Gate Bridge feature, they found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, and the 1906 San Francisco earthquake.
- Safety-relevant features: Features corresponding to deception, bias, sycophancy, and dangerous content—exactly the kinds of internal representations we need to understand for AI safety.
- Feature steering: Artificially clamping a feature’s activation changes model behavior in interpretable, predictable ways—“Golden Gate Claude” was the public demonstration.
OpenAI published parallel work (Gao et al., 2024), training TopK SAEs with up to 16 million features on GPT-4’s activations. DeepMind released “Gemma Scope,” a public collection of JumpReLU SAEs trained on Gemma 2 across all layers.
The implications are profound. If we can reliably identify features inside models, we can audit what they know, control their behavior through targeted feature manipulation, detect and mitigate harmful capabilities, and build genuine trust in AI systems. SAEs are how we turn black boxes into glass boxes.
But humility is warranted. Even at 34 million features, Anthropic observed that 65% of features “died” during training (never activated). We don’t know how many features we’re missing. We’re still building the microscope. But for the first time, we can see individual features inside production language models—and what we’re finding is fascinating.
Try It: Feature Disentangler
Watch polysemantic neurons become monosemantic features as you increase sparsity. Left: original hidden neurons (tangled). Right: SAE features (disentangled).
Try It: Sparsity Explorer
Activation heatmap of SAE features across inputs. Dark = zero (inactive), bright = active. Click on any bright cell to see what class that feature represents.
Connections to the Series
Sparse autoencoders sit at the intersection of compression, representation learning, and interpretability—connecting to nearly every post in the elementary series.
- Autoencoders & VAEs: SAEs invert the bottleneck—overcomplete instead of undercomplete, but the same encoder-decoder structure and reconstruction loss. VAEs regularize with KL divergence toward a Gaussian; SAEs regularize with L1 toward sparsity.
- Loss Functions: The L1 vs L2 geometry is fundamental here. L1’s diamond-shaped constraint region creates exact zeros (true sparsity), while L2’s sphere only shrinks values. This is one of the most important geometric distinctions in all of machine learning.
- Activation Functions: ReLU is essential for SAEs—its hard zero at negative inputs creates true sparsity. Replace ReLU with GELU or SiLU and you get soft, never-quite-zero activations that defeat the purpose.
- Embeddings: Each decoder column is an interpretable embedding—a direction in activation space representing one concept. SAE features are to neural activations what word embeddings are to vocabulary: a meaningful coordinate system.
- Attention: SAEs trained on attention layer outputs reveal what attention heads compute—each head’s output can be decomposed into sparse, interpretable features.
- Transformers: SAEs can probe any transformer layer—residual stream, MLP output, attention output. Different sites reveal different types of features, building a complete picture of what the transformer learns.
- Knowledge Distillation: Both techniques make network knowledge more explicit—distillation compresses knowledge into a simpler model, SAEs decompose it into a sparse dictionary of interpretable features.
- Contrastive Learning: Both learn structured representations—contrastive learning organizes the embedding space so similar items are close; SAEs organize the feature space so each dimension means one thing.
- Normalization: The decoder column norm constraint in SAEs echoes weight normalization—constraining representational capacity to force better learning rather than gaming the objective.
References & Further Reading
- Elhage et al. — “Toy Models of Superposition” (2022) — The theoretical foundation explaining why neural networks pack more features than they have dimensions.
- Bricken et al. — “Towards Monosemanticity” (2023) — Anthropic’s foundational paper training SAEs on a one-layer transformer and finding interpretable features.
- Cunningham, Ewart, Riggs, Huben, Sharkey — “Sparse Autoencoders Find Highly Interpretable Features” (2023) — Independent confirmation of SAE effectiveness on GPT-2 and Pythia models.
- Templeton, Conerly et al. — “Scaling Monosemanticity” (2024) — SAEs at production scale on Claude 3 Sonnet, including the Golden Gate Bridge feature and feature steering.
- Gao, Dupre la Tour et al. — “Scaling and Evaluating Sparse Autoencoders” (2024) — OpenAI’s TopK SAE variant with scaling laws, trained on GPT-4.
- Rajamanoharan et al. — “Improving Dictionary Learning with Gated Sparse Autoencoders” (2024) — DeepMind’s Gated SAE, separating the gating decision from magnitude estimation.
- Rajamanoharan et al. — “Jumping Ahead: JumpReLU Sparse Autoencoders” (2024) — Learnable per-feature thresholds for cleaner sparsity.
- Bussmann — “BatchTopK Sparse Autoencoders” (2024) — Variable per-token sparsity using batch-level top-k selection.
- Dunefsky, Chlenski, Nanda — “Transcoders Find Interpretable LLM Feature Circuits” (2024) — Modeling component functions instead of reconstructing activations.
- Olah — “Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases” (2022) — Foundational essay on why interpretable features matter for AI safety.