
Sparse Autoencoders from Scratch: The Microscope for Neural Networks

What Are Neural Networks Actually Learning?

The autoencoders post showed how to squeeze data through a tight bottleneck and reconstruct it on the other side. That’s the compression direction—fewer dimensions, denser representations. But what if the real mystery isn’t how networks compress information—it’s what they’re secretly representing inside?

Crack open a trained neural network and look at an individual neuron. You might expect neuron 42 to represent “cats” or neuron 117 to represent “code.” But that’s not what happens. Neuron 42 fires for cats, blue objects, and the letter Q simultaneously. Neuron 117 responds to Python code, cooking recipes, and discussions about the weather. Individual neurons are a mess—they mean many things at once.

This is the black box problem, and sparse autoencoders are the crowbar that pries it open. Instead of compressing into fewer dimensions like a standard autoencoder, a sparse autoencoder expands into more dimensions while forcing most of them to stay zero. The result is extraordinary: each active dimension represents one clean, interpretable concept. One feature for cats. One feature for Python code. One feature for the Golden Gate Bridge.

This is exactly how Anthropic peered inside Claude and discovered that individual features correspond to specific concepts—and then built “Golden Gate Claude” by cranking one feature to maximum. Let’s build the whole thing from scratch.

The Superposition Problem: Too Many Features, Too Few Neurons

Why are neurons such a mess? The answer is superposition—and it’s actually a clever trick the network learned on its own.

Imagine a network needs to track 5 binary features (say, “is round,” “is blue,” “has legs,” “is edible,” “makes sound”), but its hidden layer only has 2 neurons. With 2 dimensions, you can only represent 2 independent directions. But if those features are sparse—most inputs only activate 1 or 2 features at a time—the network discovers it can pack all 5 features into just 2 dimensions by using nearly orthogonal directions.

Think of it geometrically: 5 directions arranged like the vertices of a pentagon in a 2D plane. No two directions are perfectly orthogonal, but their dot products are small enough that the interference is tolerable when only 1–2 features are active at once. The network trades some accuracy for much more representational capacity.
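To make the pentagon picture concrete, here's a small sketch (mine, not from the paper) computing those five directions and their pairwise overlaps:

```python
import numpy as np

# Five unit vectors at 72-degree spacing in 2D: the pentagon geometry
angles = 2 * np.pi * np.arange(5) / 5
V = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (5, 2)

# Pairwise dot products: adjacent pairs overlap by cos(72°) ≈ +0.309,
# second-neighbour pairs by cos(144°) ≈ −0.809. In the toy-model setup,
# an output ReLU clips the negative interference away, so the effective
# crosstalk when only 1–2 features are active stays small.
gram = V @ V.T
print(np.round(gram, 3))
```

Running this shows a Gram matrix with 1s on the diagonal and only moderate off-diagonal overlap: five features sharing two dimensions with tolerable interference.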

This was formalized by Elhage et al. at Anthropic in their landmark 2022 paper “Toy Models of Superposition.” They showed a clean phase diagram: dense, important features get dedicated neurons. Sparse, less important features get packed into superposition. The sparser the features, the more aggressively the network compresses.

Here’s superposition in action:

import numpy as np

def demonstrate_superposition():
    """5 sparse features packed into 2 dimensions."""
    np.random.seed(42)
    d_features, d_hidden = 5, 2

    # Ground-truth features: 5 one-hot vectors in 5D
    # Each data point activates 1-2 features (sparse!)
    n_samples = 500
    feature_probs = 0.15  # each feature active 15% of the time
    X = np.random.binomial(1, feature_probs, (n_samples, d_features)).astype(float)
    X *= np.random.uniform(0.5, 2.0, X.shape)  # random magnitudes

    # Importance weights: feature 0 most important, feature 4 least
    importance = np.array([1.0, 0.8, 0.6, 0.4, 0.2])

    # Train a linear model to reconstruct X through a 2D bottleneck
    # Tied weights: X_hat = X @ W @ W.T, minimize importance-weighted MSE
    W = np.random.randn(d_features, d_hidden) * 0.5
    lr = 0.001
    for step in range(2000):
        H = X @ W                      # encode: (n, 2)
        X_hat = H @ W.T               # decode: (n, 5)
        R = importance * (X_hat - X)   # weighted residual: (n, 5)
        # Gradient via both paths through W (encoding and decoding)
        grad = X.T @ (R @ W) / n_samples      # path 1: through encoder
        grad += (R.T @ H) / n_samples         # path 2: through decoder
        W -= lr * 2 * grad

    # The learned directions for each feature
    print("Feature directions in 2D space (rows of W):")
    for i in range(d_features):
        direction = W[i] / (np.linalg.norm(W[i]) + 1e-8)
        print(f"  Feature {i} (importance={importance[i]:.1f}): "
              f"[{direction[0]:+.3f}, {direction[1]:+.3f}]")

    # Check polysemanticity: each hidden neuron responds to multiple features
    print("\nHidden neuron 0 responds to features:", end=" ")
    print([i for i in range(d_features) if abs(W[i, 0]) > 0.2])
    print("Hidden neuron 1 responds to features:", end=" ")
    print([i for i in range(d_features) if abs(W[i, 1]) > 0.2])

demonstrate_superposition()

The output reveals the problem: each hidden neuron participates in representing multiple features. Neuron 0 responds to features 0, 2, and 4. Neuron 1 responds to features 1, 3, and 4. The features are tangled together—polysemantic neurons. If you only look at individual neurons, you can’t figure out what the network learned.

Superposition is the default behavior of neural networks. And it makes individual neurons fundamentally uninterpretable.

The Sparse Autoencoder Architecture: Flipping the Bottleneck

Standard autoencoders compress: 784 dimensions → 32 dimensions → 784 dimensions. The bottleneck forces information loss, creating a compressed representation. Sparse autoencoders flip this entirely: 16 dimensions → 64 dimensions → 16 dimensions. The hidden layer is bigger than the input.

How can a bigger hidden layer be useful? Isn’t that just a trivial identity mapping? The trick is the sparsity constraint: for any given input, only a handful of those 64 dimensions are allowed to be nonzero. The overcomplete dictionary gives each concept room for its own dedicated dimension. The sparsity constraint ensures the network actually uses that room cleanly instead of distributing information across all dimensions.

One important detail from Bricken et al.’s original paper: before encoding, we subtract the decoder bias b_dec from the input. This pre-centering trick lets the decoder bias capture the mean activation, so the sparse features only need to represent deviations from the mean—a cleaner decomposition.

class SparseAutoencoder:
    """Overcomplete autoencoder with sparsity constraint."""

    def __init__(self, d_model, d_sae):
        self.d_model = d_model
        self.d_sae = d_sae
        # He initialization for ReLU
        scale_enc = np.sqrt(2.0 / d_model)
        scale_dec = np.sqrt(2.0 / d_sae)
        self.W_enc = np.random.randn(d_sae, d_model) * scale_enc
        self.b_enc = np.zeros(d_sae)
        self.W_dec = np.random.randn(d_model, d_sae) * scale_dec
        self.b_dec = np.zeros(d_model)
        # Normalize decoder columns to unit norm
        self.normalize_decoder()

    def normalize_decoder(self):
        """Constrain each decoder column to unit L2 norm."""
        norms = np.linalg.norm(self.W_dec, axis=0, keepdims=True)
        self.W_dec /= (norms + 1e-8)

    def encode(self, x):
        """Encode with pre-centering and ReLU activation."""
        x_centered = x - self.b_dec               # pre-center
        z_pre = x_centered @ self.W_enc.T + self.b_enc
        z = np.maximum(0, z_pre)                   # ReLU: creates true zeros
        return z, z_pre

    def decode(self, z):
        """Reconstruct from sparse features."""
        return z @ self.W_dec.T + self.b_dec

    def forward(self, x):
        """Full forward pass: encode then decode."""
        z, z_pre = self.encode(x)
        x_hat = self.decode(z)
        return x_hat, z, z_pre

Notice the expansion: if d_model=16 and d_sae=64, that’s a 4× overcomplete dictionary. In practice, researchers use 4× to 64× for small experiments. Anthropic pushed to 2,800× when they trained a 34-million-feature SAE on Claude 3 Sonnet. Each column of W_dec is a “feature direction”—a vector in activation space that represents one learned concept.

The Loss Function: Reconstruction vs. Sparsity

The SAE loss function is a tug-of-war between two forces:

L = ‖x − x̂‖² + λ · ‖z‖₁

The reconstruction term wants perfect reproduction—pushing activations to be large and dense so nothing is lost. The L1 sparsity penalty wants most activations to be exactly zero—pushing toward minimal, clean representations. The balance between these two forces is what creates interpretable features.

But why L1 and not L2? This is the key geometric insight.

Picture the L1 constraint region: it’s a diamond (a cross-polytope in higher dimensions) with sharp corners that sit exactly on the axes. The L2 region is a circle (or hypersphere)—smooth everywhere, no corners. When the loss minimum falls near the constraint boundary, L1’s diamond shape means the solution lands at a corner—where some coordinates are exactly zero. L2’s smooth circle gives solutions where coordinates are small but never quite zero. L1 creates true sparsity. L2 creates “approximately sparse.”

At the implementation level, L1’s gradient is the sign function: +1 for positive activations, −1 for negative, undefined at zero. This constant-magnitude gradient keeps pulling activations toward zero with the same force regardless of how small they already are—until they cross zero and get clamped by ReLU.
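The zeroing-versus-shrinking contrast can be checked in one dimension, where both penalized objectives have closed-form minimizers (the helper names below are mine): minimizing (z − a)² + λ|z| soft-thresholds the activation to exactly zero when a is small, while (z − a)² + λz² only rescales it.

```python
import numpy as np

def l1_min(a, lam):
    """argmin_z (z - a)^2 + lam * |z|: the soft-threshold operator."""
    return np.sign(a) * np.maximum(0.0, np.abs(a) - lam / 2)

def l2_min(a, lam):
    """argmin_z (z - a)^2 + lam * z^2: uniform multiplicative shrink."""
    return a / (1 + lam)

a = np.array([0.1, 0.3, 1.0, 2.0])  # target activations
lam = 0.5
print("L1:", l1_min(a, lam))  # [0.   0.05 0.75 1.75] — true zeros appear
print("L2:", l2_min(a, lam))  # all entries small but never exactly zero
```

Note the L1 solutions: small activations are killed outright, and the surviving ones are all reduced by λ/2. That second effect is the shrinkage problem revisited later in the TopK section.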

def sae_loss(x, x_hat, z, lambda_l1):
    """
    Compute SAE loss: reconstruction + L1 sparsity.

    Returns: total_loss, recon_loss, l1_loss, l0
    """
    # Reconstruction: mean over samples of the squared error
    # (matches the training gradient d_xhat = 2 * (x_hat - x) / n)
    recon_loss = np.mean(np.sum((x - x_hat) ** 2, axis=1))

    # Sparsity: L1 norm of hidden activations
    l1_loss = np.mean(np.sum(np.abs(z), axis=1))

    # Total loss
    total_loss = recon_loss + lambda_l1 * l1_loss

    # L0: average number of nonzero features per input
    l0 = np.mean(np.sum(z > 0, axis=1))

    return total_loss, recon_loss, l1_loss, l0

def l1_gradient(z):
    """Subgradient of L1 norm: sign(z), with 0 at z=0."""
    return np.sign(z)

The key metrics to track during training: reconstruction MSE (how faithfully x̂ matches x), the L1 penalty (what’s actually being optimized), L0 (the average number of nonzero features per input—the most interpretable measure of sparsity), and the fraction of dead features (features that never activate on any input).

Training the SAE: Watching Features Emerge

Let’s train on synthetic data where we know the ground truth. We’ll create 8 true features living in a 4-dimensional space (2× overcomplete—exactly the superposition scenario). Each data point is a sparse combination of 1–3 features. Then we’ll train a 32-feature SAE (8× overcomplete) and verify it recovers the true features.

def train_sae_synthetic():
    """Train SAE on synthetic data with known ground-truth features."""
    np.random.seed(0)
    d_model, d_sae = 4, 32
    n_true_features = 8

    # Ground-truth features: 8 random unit vectors in 4D
    true_features = np.random.randn(n_true_features, d_model)
    true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

    # Generate sparse data: each sample uses 1-3 features
    n_samples = 5000
    X = np.zeros((n_samples, d_model))
    for i in range(n_samples):
        k = np.random.randint(1, 4)  # 1-3 active features
        active = np.random.choice(n_true_features, k, replace=False)
        coeffs = np.random.uniform(0.5, 2.0, k)
        for j, a in enumerate(active):
            X[i] += coeffs[j] * true_features[a]

    # Initialize SAE
    sae = SparseAutoencoder(d_model, d_sae)
    lambda_l1 = 0.04
    lr = 0.003

    # Adam optimizer state: first and second moments per parameter
    param_names = ['W_enc', 'b_enc', 'W_dec', 'b_dec']
    m = {k: np.zeros_like(getattr(sae, k)) for k in param_names}
    v = {k: np.zeros_like(getattr(sae, k)) for k in param_names}

    for epoch in range(200):
        # Forward pass
        x_hat, z, z_pre = sae.forward(X)
        total, recon, l1, l0 = sae_loss(X, x_hat, z, lambda_l1)

        # Backward pass: gradients through decoder and encoder
        d_xhat = 2 * (x_hat - X) / X.shape[0]   # d(recon)/d(x_hat)
        # Gradient through decoder: x_hat = z @ W_dec.T + b_dec
        d_z = d_xhat @ sae.W_dec + lambda_l1 * l1_gradient(z) / X.shape[0]
        d_z_pre = d_z * (z_pre > 0)  # ReLU mask
        d_Wdec = d_xhat.T @ z                     # (d_model, d_sae)
        # Gradient through encoder: z_pre = (X - b_dec) @ W_enc.T + b_enc
        d_Wenc = d_z_pre.T @ (X - sae.b_dec)      # (d_sae, d_model)
        d_benc = np.sum(d_z_pre, axis=0)           # (d_sae,)
        # b_dec appears in both encoder (pre-centering) and decoder
        d_bdec = d_xhat.sum(axis=0) - (d_z_pre @ sae.W_enc).sum(axis=0)

        # Adam updates (simplified: beta1=0.9, beta2=0.999, eps=1e-8)
        grads = {'W_enc': d_Wenc, 'b_enc': d_benc,
                 'W_dec': d_Wdec, 'b_dec': d_bdec}
        t = epoch + 1
        for key in grads:
            m[key] = 0.9 * m[key] + 0.1 * grads[key]
            v[key] = 0.999 * v[key] + 0.001 * grads[key] ** 2
            m_hat = m[key] / (1 - 0.9 ** t)
            v_hat = v[key] / (1 - 0.999 ** t)
            param = getattr(sae, key)
            param -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)

        sae.normalize_decoder()  # unit norm constraint on decoder columns

        if epoch % 50 == 0:
            dead = np.mean(np.all(z == 0, axis=0))
            print(f"Epoch {epoch:3d} | Recon: {recon:.4f} | "
                  f"L0: {l0:.1f} | Dead: {dead:.0%}")

    # Verify feature recovery: match SAE features to ground truth
    alive_mask = np.any(z > 0, axis=0)
    alive_cols = sae.W_dec[:, alive_mask].T  # (n_alive, d_model)
    alive_cols /= np.linalg.norm(alive_cols, axis=1, keepdims=True)
    similarity = np.abs(alive_cols @ true_features.T)  # cosine similarity
    best_match = np.max(similarity, axis=0)
    print(f"\nFeature recovery (cosine similarity to ground truth):")
    for i in range(n_true_features):
        print(f"  True feature {i}: best match = {best_match[i]:.3f}")

train_sae_synthetic()

The training dynamics are revealing. Early on, L0 is high—many features fire for each input (dense, entangled). As the L1 penalty takes effect, features specialize. L0 drops. By the end, each input activates only 2–4 SAE features, and those features align closely with the true ground-truth directions. The SAE has reversed the superposition.

The decoder norm constraint deserves emphasis: after every gradient step, we normalize each column of W_dec to unit length. Without this, the network “cheats”—it can shrink decoder weights while inflating encoder weights, making L1 look small without actually learning sparse features. The norm constraint forces honest sparsity.
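A quick sketch (with arbitrary shapes) of the rescaling loophole the constraint closes: scaling the activations down by c and the decoder columns up by c leaves every reconstruction unchanged while shrinking the L1 penalty by the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=(10, 32)))   # nonnegative activations
W_dec = rng.normal(size=(4, 32))        # decoder: (d_model, d_sae)
x_hat = z @ W_dec.T

c = 10.0  # the "cheat": shrink activations, inflate decoder columns
z_cheat, W_cheat = z / c, W_dec * c
x_hat_cheat = z_cheat @ W_cheat.T

print(np.allclose(x_hat, x_hat_cheat))          # True: same reconstruction
print(np.abs(z).sum() / np.abs(z_cheat).sum())  # ≈ 10.0: L1 shrank 10x
```

With unit-norm decoder columns this degeneracy disappears, so a small L1 can only mean genuinely sparse features.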

From Polysemantic to Monosemantic: The Key Experiment

Synthetic data is nice, but the real magic happens when we apply SAEs to an actual neural network’s internals. Let’s train a small classifier, then use an SAE to decompose its hidden activations into interpretable features.

def monosemanticity_experiment():
    """Train a classifier, then decompose its hidden layer with an SAE."""
    np.random.seed(7)
    n_classes, d_input, d_hidden = 8, 20, 16

    # Generate clustered data: 8 classes with distinct centers
    centers = np.random.randn(n_classes, d_input) * 3
    X_all, Y_all = [], []
    for c in range(n_classes):
        pts = centers[c] + np.random.randn(200, d_input) * 0.8
        X_all.append(pts)
        Y_all.extend([c] * 200)
    X_all = np.vstack(X_all)
    Y_all = np.array(Y_all)

    # Train a simple 2-layer MLP classifier
    W1 = np.random.randn(d_input, d_hidden) * np.sqrt(2.0 / d_input)
    b1 = np.zeros(d_hidden)
    W2 = np.random.randn(d_hidden, n_classes) * np.sqrt(2.0 / d_hidden)
    b2 = np.zeros(n_classes)

    for step in range(500):
        h = np.maximum(0, X_all @ W1 + b1)  # hidden activations
        logits = h @ W2 + b2
        # Softmax + cross-entropy (stable)
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        one_hot = np.eye(n_classes)[Y_all]
        d_logits = (probs - one_hot) / len(Y_all)
        # Backprop
        d_W2 = h.T @ d_logits
        d_b2 = d_logits.sum(axis=0)
        d_h = d_logits @ W2.T * (h > 0)
        d_W1 = X_all.T @ d_h
        d_b1 = d_h.sum(axis=0)
        lr = 0.01
        W1 -= lr * d_W1; b1 -= lr * d_b1
        W2 -= lr * d_W2; b2 -= lr * d_b2

    # Extract hidden activations
    H = np.maximum(0, X_all @ W1 + b1)

    # Measure polysemanticity of original neurons
    print("=== Original Hidden Neurons (Polysemantic) ===")
    for neuron in [0, 3, 7]:
        active_classes = []
        for c in range(n_classes):
            mask = Y_all == c
            mean_act = H[mask, neuron].mean()
            if mean_act > 0.5:
                active_classes.append(c)
        print(f"  Neuron {neuron} responds to classes: {active_classes}")

    # Train SAE on hidden activations
    d_sae = 64  # 4x overcomplete
    sae = SparseAutoencoder(d_hidden, d_sae)
    # (Training loop similar to above, abbreviated for clarity)
    lambda_l1 = 0.05
    lr = 0.005
    m_s = {k: np.zeros_like(getattr(sae, k))
           for k in ['W_enc', 'b_enc', 'W_dec', 'b_dec']}
    v_s = {k: np.zeros_like(getattr(sae, k))
           for k in ['W_enc', 'b_enc', 'W_dec', 'b_dec']}
    for epoch in range(300):
        x_hat, z, z_pre = sae.forward(H)
        d_xhat = 2 * (x_hat - H) / H.shape[0]
        d_z = d_xhat @ sae.W_dec + lambda_l1 * np.sign(z) / H.shape[0]
        d_z_pre = d_z * (z_pre > 0)
        grads = {
            'W_dec': d_xhat.T @ z,
            'W_enc': d_z_pre.T @ (H - sae.b_dec),
            'b_enc': d_z_pre.sum(axis=0),
            'b_dec': d_xhat.sum(axis=0) - (d_z_pre @ sae.W_enc).sum(axis=0)
        }
        t = epoch + 1
        for key in grads:
            m_s[key] = 0.9 * m_s[key] + 0.1 * grads[key]
            v_s[key] = 0.999 * v_s[key] + 0.001 * grads[key] ** 2
            param = getattr(sae, key)
            param -= lr * (m_s[key] / (1 - 0.9**t)) / (
                np.sqrt(v_s[key] / (1 - 0.999**t)) + 1e-8)
        sae.normalize_decoder()

    # Analyze SAE features: are they monosemantic?
    _, z_final, _ = sae.forward(H)
    print("\n=== SAE Features (Monosemantic) ===")
    alive_features = np.where(np.any(z_final > 0.1, axis=0))[0]
    for feat in alive_features[:8]:
        activations_by_class = []
        for c in range(n_classes):
            mask = Y_all == c
            activations_by_class.append(z_final[mask, feat].mean())
        dominant = np.argmax(activations_by_class)
        purity = max(activations_by_class) / (sum(activations_by_class) + 1e-8)
        print(f"  Feature {feat:2d}: dominant class = {dominant}, "
              f"purity = {purity:.1%}")

monosemanticity_experiment()

The contrast is striking. Original neurons respond to 2–4 classes each—polysemantic. SAE features respond to one class each with high purity—monosemantic. The SAE has untangled the superposed representations into clean, interpretable components.

This is exactly what Anthropic did at scale. They trained SAEs on Claude 3 Sonnet’s activations (with 34 million features at their largest) and found features for “Golden Gate Bridge,” “code errors,” “DNA sequences,” “legal language,” and thousands of other interpretable concepts. When they artificially amplified the Golden Gate Bridge feature during inference, Claude started self-identifying as the bridge—a vivid demonstration that these features are real and causally meaningful.
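A minimal sketch of that kind of feature steering, given a trained SAE’s decoder column for a feature (the function name and the additive scheme here are illustrative, not Anthropic’s exact procedure): push the activation vector along the feature’s decoder direction before passing it back into the model.

```python
import numpy as np

def steer_activation(x, w_dec_col, alpha):
    """Move an activation along one learned feature direction.

    alpha > 0 amplifies the concept; alpha < 0 suppresses it.
    The decoder column is normalized so alpha is in activation units.
    """
    direction = w_dec_col / (np.linalg.norm(w_dec_col) + 1e-8)
    return x + alpha * direction

# Toy usage with random stand-ins for a real activation and feature
rng = np.random.default_rng(1)
x = rng.normal(size=16)            # a model activation
feature_dir = rng.normal(size=16)  # a decoder column W_dec[:, j]
x_steered = steer_activation(x, feature_dir, alpha=5.0)
print(np.linalg.norm(x_steered - x))  # ≈ 5.0: moved alpha along the feature
```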

Beyond L1: The TopK Alternative

The L1 penalty has a subtle problem: activation shrinkage. Because L1 applies a constant-magnitude gradient pushing all activations toward zero, even the features that should be active have their magnitudes systematically reduced. The network underestimates how strongly each feature fires.

Gao et al. at OpenAI proposed an elegant fix in 2024: TopK activation. Instead of using L1 to encourage sparsity, just keep the top k activations and hard-zero the rest. Sparsity is enforced by selection, not penalization, so the active features pass through with their full, unshrunk magnitudes.

def topk_encode(x, W_enc, b_enc, b_dec, k):
    """TopK SAE encoder: keep k largest activations, zero the rest."""
    x_centered = x - b_dec
    z_pre = x_centered @ W_enc.T + b_enc
    z_pre = np.maximum(0, z_pre)  # ReLU first

    # Keep only top-k activations per sample
    z = np.zeros_like(z_pre)
    for i in range(len(z_pre)):
        if np.sum(z_pre[i] > 0) <= k:
            z[i] = z_pre[i]  # fewer than k active: keep all
        else:
            topk_idx = np.argpartition(z_pre[i], -k)[-k:]
            z[i, topk_idx] = z_pre[i, topk_idx]
    return z

# With TopK, the loss is pure reconstruction — no L1 term!
# L_topk = ||x - x_hat||^2
# Sparsity is exact: always <= k active features per input.
# No shrinkage: active feature magnitudes are preserved exactly.

# Compare activation distributions:
# L1-SAE:  many small values clustered near zero (shrinkage)
# TopK-SAE: clean bimodal distribution (zero or full magnitude)

TopK simplifies hyperparameter tuning dramatically: instead of sweeping λ and hoping for the right L0, you directly choose k (how many features can be active). The tradeoff between sparsity and reconstruction becomes a single, interpretable knob.
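As a usage sketch, the per-row loop in topk_encode can also be vectorized over the whole batch with np.argpartition (the shapes below are made up):

```python
import numpy as np

def topk_encode_vec(x, W_enc, b_enc, b_dec, k):
    """Vectorized TopK encoder: keep the k largest ReLU activations per row."""
    z_pre = np.maximum(0, (x - b_dec) @ W_enc.T + b_enc)
    # Column indices of the k largest entries in each row
    idx = np.argpartition(z_pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(z_pre)
    np.put_along_axis(z, idx, np.take_along_axis(z_pre, idx, axis=1), axis=1)
    return z

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_sae, d_model)) * 0.25
z = topk_encode_vec(x, W_enc, np.zeros(d_sae), np.zeros(d_model), k)
print((z > 0).sum(axis=1))  # at most k active features per input
```

Rows can still have fewer than k active features when the ReLU zeroes some of the selected entries, matching the behavior of the loop version.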

The field has moved fast since the original L1 approach. DeepMind introduced Gated SAEs (separating the “should this feature fire?” decision from “how strongly?”) and JumpReLU SAEs (learnable per-feature thresholds), and BatchTopK (variable sparsity across tokens in a batch) extends the TopK idea. All of these improvements target the same core problem: getting cleaner sparsity without sacrificing reconstruction quality.
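A rough sketch of the JumpReLU idea (the threshold values below are illustrative; the real version learns θ per feature, trained through a straight-through estimator):

```python
import numpy as np

def jumprelu(z_pre, theta):
    """JumpReLU: pass activations through only above a per-feature threshold.

    Unlike ReLU, small positive pre-activations below theta are zeroed,
    and values above theta pass through at full magnitude (no shrinkage).
    """
    return z_pre * (z_pre > theta)

z_pre = np.array([-0.5, 0.05, 0.2, 1.3])
theta = np.array([0.1, 0.1, 0.1, 0.1])  # illustrative thresholds
print(jumprelu(z_pre, theta))  # [0.  0.  0.2 1.3]
```

The discontinuous jump at θ is what separates the fire/don’t-fire decision from the magnitude, the same separation Gated SAEs achieve with an explicit gate.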

Why This Matters: Opening the Black Box

Sparse autoencoders aren’t just a curiosity—they’re the most promising tool we have for understanding what large language models actually know.

Anthropic’s “Scaling Monosemanticity” paper (Templeton, Conerly et al., 2024) trained SAEs with up to 34 million features on Claude 3 Sonnet’s residual stream. What they found was remarkable: features for concrete entities (like the Golden Gate Bridge) alongside abstract ones (deception, sycophancy, security vulnerabilities in code); features that fire for the same concept across different languages and even across text and images; and features that causally steer the model’s behavior when amplified or suppressed.

OpenAI published parallel work (Gao et al., 2024), training TopK SAEs with up to 16 million features on GPT-4’s activations. DeepMind released “Gemma Scope,” a public collection of JumpReLU SAEs trained on Gemma 2 across all layers.

The implications are profound. If we can reliably identify features inside models, we can audit what they know, control their behavior through targeted feature manipulation, detect and mitigate harmful capabilities, and build genuine trust in AI systems. SAEs are how we turn black boxes into glass boxes.

But humility is warranted. Even at 34 million features, Anthropic observed that 65% of features “died” during training (never activated). We don’t know how many features we’re missing. We’re still building the microscope. But for the first time, we can see individual features inside production language models—and what we’re finding is fascinating.

Try It: Feature Disentangler

Watch polysemantic neurons become monosemantic features as you increase sparsity. Left: original hidden neurons (tangled). Right: SAE features (disentangled).


Try It: Sparsity Explorer

Activation heatmap of SAE features across inputs. Dark = zero (inactive), bright = active. Click on any bright cell to see what class that feature represents.


Connections to the Series

Sparse autoencoders sit at the intersection of compression, representation learning, and interpretability—connecting to nearly every post in the elementary series.

References & Further Reading