
Mixture of Experts from Scratch: How One Transformer Becomes Eight

The Scaling Problem

In our FFN from Scratch post, we built the feed-forward network that lives inside every transformer block. We discovered something surprising: the FFN holds roughly two-thirds of a transformer's parameters. In LLaMA 3 8B, for example, each FFN layer has three weight matrices totaling over 150 million parameters — and there are 32 of these layers.

Here's the uncomfortable truth about dense FFNs: every single token pays the full computational cost of the entire network. A comma, a period, the word "the" — each one flows through exactly the same 150 million parameters as a complex technical term or a rare multilingual token. Does a comma really need the same neural pathway as "mitochondria"?

The answer, it turns out, is no. And that insight unlocks one of the most important ideas in modern AI: Mixture of Experts (MoE).

The idea is elegant: instead of one massive FFN, use N smaller FFNs (the "experts") and a tiny learned router that decides which experts each token should visit. Each token passes through only 2 of, say, 8 experts, so you get 8× the total parameters at only ~2× the compute per token. More knowledge, less cost.
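The bookkeeping behind that claim fits in a few lines. A quick sketch with round illustrative numbers (not any particular model's):

```python
# Sparse vs. dense accounting (illustrative numbers, not a specific model)
dense_params = 150_000_000          # one big dense FFN
num_experts, top_k = 8, 2

expert_params = dense_params        # each expert the same size as the dense FFN
total_params = num_experts * expert_params   # what you must store in memory
active_params = top_k * expert_params        # what each token actually touches

print(f"Total:  {total_params / 1e9:.1f}B params ({num_experts}x the dense FFN)")
print(f"Active: {active_params / 1e6:.0f}M per token ({top_k}x the dense compute)")
```

Memory scales with the expert count; per-token compute scales only with k.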

MoE isn't new — Jacobs, Jordan, Nowlan, and Hinton proposed it in 1991. But the transformer era made it dominant. Google's Switch Transformer (2021) scaled to a trillion parameters. Mistral's Mixtral 8x7B (2023) made it practical. And DeepSeek-V3 (2024) pushed it to 671 billion parameters with just 37 billion active per token, training it for a fraction of the cost of comparable dense models.

Today we're building it all from scratch in NumPy: the router, top-k selection, the load balancing trick that prevents collapse, and a complete MoE layer that drops right into the transformer we already built.

The Gating Network: Learning to Route

The router — also called the gating network — is the brain of MoE. It looks at each token's hidden state and decides which experts should process it. And it's shockingly simple: a single linear projection followed by softmax.

For a token with hidden state x of dimension d_model, and N experts:

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

class Router:
    """Learned gating network that routes tokens to experts."""

    def __init__(self, d_model, num_experts):
        # One linear layer: d_model -> num_experts
        scale = np.sqrt(2.0 / d_model)
        self.W_gate = np.random.randn(d_model, num_experts) * scale

    def __call__(self, x):
        """
        x: (seq_len, d_model) — token hidden states
        Returns: (seq_len, num_experts) — probability over experts per token
        """
        logits = x @ self.W_gate          # (seq_len, num_experts)
        return softmax(logits, axis=-1)   # (seq_len, num_experts)

That's the entire router. A single matrix multiply turns each token's d_model-dimensional hidden state into a probability distribution over N experts. The router starts random and learns during training — through backpropagation, it discovers which experts should handle which kinds of tokens.

Let's see what its output looks like:

np.random.seed(42)
router = Router(d_model=64, num_experts=8)
x = np.random.randn(4, 64)  # 4 tokens

probs = router(x)
print("Router output — probability per expert for each token:")
for i in range(4):
    top2 = np.argsort(probs[i])[-2:][::-1]
    print(f"  Token {i}: ", end="")
    for j in range(8):
        marker = " *" if j in top2 else "  "
        print(f"E{j}={probs[i,j]:.3f}{marker}", end=" ")
    print()
# Each row sums to 1.0 — it's a valid probability distribution

Each token gets a different distribution. Some tokens might strongly prefer expert 3, while others spread their probability more evenly. The starred entries are the top-2 experts — the ones that will actually process the token.

Top-K Routing: Pick Your Experts

The router gives us probabilities over all N experts, but using all of them would defeat the purpose. We'd be computing N FFN forward passes per token — more expensive than a single large FFN. The whole point of MoE is sparse activation: each token visits only k experts.

Top-k routing works in three steps:

  1. Select the k experts with the highest gating scores
  2. Renormalize the selected weights so they sum to 1
  3. Combine the experts' outputs with the renormalized weights

The output for each token is a weighted sum of its selected experts:

y = Σ_{i ∈ top-k} w_i · Expert_i(x)

def top_k_route(router_probs, k=2):
    """
    Select top-k experts per token and renormalize weights.

    router_probs: (seq_len, num_experts) — probability distribution
    k: number of experts to select per token

    Returns:
        top_k_indices: (seq_len, k) — which experts are selected
        top_k_weights: (seq_len, k) — renormalized weights
    """
    seq_len, num_experts = router_probs.shape

    # Find the k largest probabilities per token (argpartition is O(n) but unsorted)
    top_k_indices = np.argpartition(-router_probs, k, axis=-1)[:, :k]  # (seq_len, k)

    # Sort within the k so that column 0 is the single highest-probability expert
    # (the load-balancing loss below relies on column 0 being the top-1 choice)
    order = np.argsort(-np.take_along_axis(router_probs, top_k_indices, axis=-1), axis=-1)
    top_k_indices = np.take_along_axis(top_k_indices, order, axis=-1)

    # Gather the corresponding probabilities
    top_k_probs = np.take_along_axis(router_probs, top_k_indices, axis=-1)  # (seq_len, k)

    # Renormalize so the k weights sum to 1
    top_k_weights = top_k_probs / top_k_probs.sum(axis=-1, keepdims=True)  # (seq_len, k)

    return top_k_indices, top_k_weights

Let's trace through a concrete example. Say we have 4 tokens, 4 experts, and top-2 routing:

np.random.seed(7)
router = Router(d_model=8, num_experts=4)
x = np.random.randn(4, 8)

probs = router(x)
indices, weights = top_k_route(probs, k=2)

print("Token | Router Probs (all 4 experts)       | Selected → Weights")
print("-" * 75)
for i in range(4):
    prob_str = " ".join(f"E{j}={probs[i,j]:.3f}" for j in range(4))
    sel_str = " + ".join(f"E{indices[i,j]}×{weights[i,j]:.3f}" for j in range(2))
    print(f"  {i}   | {prob_str} | {sel_str}")

# Output: each token picks its top 2 experts, weights sum to 1.0

Common choices for k across real models: top-1 in Google's Switch Transformer (simplest, but loses the weighted combination), top-2 in Mixtral 8x7B (the sweet spot for quality vs. compute), and top-8 of 256 in DeepSeek-V3 (fine-grained routing with hundreds of small experts).
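Top-1 is just the degenerate case of the same machinery: with k=1, renormalization always yields a weight of exactly 1.0, so each token gets a single expert's raw output. A compact restatement of top_k_route (with the top-k sorted by probability) so this snippet runs standalone:

```python
import numpy as np

def top_k_route(router_probs, k=2):
    # select the k largest probabilities, sorted so column 0 is the top-1 expert
    idx = np.argpartition(-router_probs, k, axis=-1)[:, :k]
    order = np.argsort(-np.take_along_axis(router_probs, idx, axis=-1), axis=-1)
    idx = np.take_along_axis(idx, order, axis=-1)
    w = np.take_along_axis(router_probs, idx, axis=-1)
    return idx, w / w.sum(axis=-1, keepdims=True)   # renormalized weights

probs = np.array([[0.10, 0.60, 0.25, 0.05],
                  [0.40, 0.35, 0.15, 0.10]])

idx1, w1 = top_k_route(probs, k=1)
print(idx1.ravel(), w1.ravel())   # experts [1 0], weights exactly [1. 1.]
idx2, w2 = top_k_route(probs, k=2)
print(idx2[0], w2[0])             # [1 2], weights 0.60/0.85 and 0.25/0.85
```

With k=2 the two renormalized weights still sum to 1, but now both selected experts contribute to the output.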

The Load Balancing Problem

There's a catastrophic failure mode lurking in MoE, and if you train without addressing it, your model will break.

The problem is router collapse. Without explicit encouragement to spread tokens across experts, the router learns to funnel almost all tokens into one or two "favorite" experts. Here's why: if expert 3 happens to get slightly more tokens early in training, it improves faster (more gradient signal), which makes the router send even more tokens to it, which makes it even better — a vicious positive feedback loop. The other experts starve, and you've spent the memory on 8 experts but only 2 are doing useful work.

The fix is an auxiliary load-balancing loss that nudges the router toward balanced assignment. The Switch Transformer introduced this elegant formulation:

L_balance = α · N · Σ_{i=1}^{N} f_i · P_i

Where N is the number of experts, f_i is the fraction of tokens whose top-1 choice is expert i (a hard, non-differentiable count), P_i is the mean router probability assigned to expert i across all tokens (soft and differentiable), and α is a small weight on the auxiliary loss (0.01 in the Switch Transformer).

The beauty of this formulation is the product f_i · P_i. The hard assignment f_i isn't differentiable (it uses argmax), but P_i is differentiable through the softmax. By multiplying them, the gradient flows through P_i to push the router toward balance. When routing is perfectly balanced, both f_i and P_i equal 1/N for all experts, and the loss is minimized.

def load_balance_loss(router_probs, top_k_indices, num_experts, alpha=0.01):
    """
    Auxiliary loss that encourages balanced expert assignment.

    router_probs:   (seq_len, num_experts) — full router probability distribution
    top_k_indices:  (seq_len, k) — which experts were selected
    num_experts:    int — total number of experts
    alpha:          float — loss weight (0.01 in Switch Transformer)

    Returns: scalar loss value
    """
    seq_len = router_probs.shape[0]

    # f_i: fraction of tokens where expert i is the TOP-1 choice
    primary_expert = top_k_indices[:, 0]                 # (seq_len,)
    f = np.zeros(num_experts)
    for i in range(num_experts):
        f[i] = np.sum(primary_expert == i) / seq_len     # (num_experts,)

    # P_i: mean router probability for expert i across all tokens
    P = router_probs.mean(axis=0)                        # (num_experts,)

    # Auxiliary loss: penalizes imbalance
    loss = alpha * num_experts * np.sum(f * P)            # scalar
    return loss

# Example: balanced vs. collapsed routing
np.random.seed(42)
seq_len, num_experts = 100, 8

# Balanced: each expert gets ~12.5% of tokens
balanced_probs = softmax(np.random.randn(seq_len, num_experts) * 0.5)
balanced_idx, _ = top_k_route(balanced_probs, k=2)

# Collapsed: expert 0 dominates
collapsed_logits = np.random.randn(seq_len, num_experts) * 0.5
collapsed_logits[:, 0] += 5.0  # heavily bias toward expert 0
collapsed_probs = softmax(collapsed_logits)
collapsed_idx, _ = top_k_route(collapsed_probs, k=2)

print(f"Balanced loss:  {load_balance_loss(balanced_probs, balanced_idx, num_experts):.4f}")
print(f"Collapsed loss: {load_balance_loss(collapsed_probs, collapsed_idx, num_experts):.4f}")
# Collapsed loss is much higher — the auxiliary loss detects the imbalance

There's one more practical detail: the capacity factor. Since we can't predict exactly how many tokens will route to each expert, we pre-allocate a buffer for each expert:

expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)
# capacity_factor = 1.0:  exact fair share (any imbalance drops tokens)
# capacity_factor = 1.25: 25% slack (the typical choice)

When an expert's buffer overflows, excess tokens are dropped — they skip the MoE layer entirely and pass through via the residual connection only. It's a graceful degradation: the token misses one layer's processing but doesn't break the model.
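Here's a minimal sketch of how capacity enforcement might look — apply_capacity is a hypothetical helper for illustration, not code from any framework. Tokens are admitted in sequence order, and anything past an expert's buffer is dropped:

```python
import numpy as np

def apply_capacity(expert_indices, num_experts, capacity_factor=1.25):
    """Return a keep-mask; tokens past an expert's buffer are dropped."""
    num_tokens = len(expert_indices)
    capacity = int((num_tokens / num_experts) * capacity_factor)

    counts = np.zeros(num_experts, dtype=int)
    keep = np.zeros(num_tokens, dtype=bool)
    for t, e in enumerate(expert_indices):    # arrival order = sequence order
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True                    # fits in expert e's buffer
        # else: dropped — the token rides the residual connection instead
    return keep

# 16 tokens, 4 experts, capacity = int(16/4 * 1.25) = 5; expert 0 is overloaded
assignments = np.array([0] * 10 + [1, 1, 2, 2, 3, 3])
keep = apply_capacity(assignments, num_experts=4)
print(f"kept {keep.sum()}/16 tokens (expert 0 dropped 5 of its 10)")
```

Real systems do this with fixed-size dispatch buffers rather than a Python loop, but the policy — first come, first served, up to capacity — is the same.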

Building the Complete MoE Layer

Now we assemble everything into a single class. Each expert is a SwiGLU FFN — exactly the same architecture we built before, just with smaller dimensions. The router picks top-k experts per token, the experts process their assigned tokens, and the outputs get weighted and combined.

def silu(x):
    """SiLU/Swish activation: x * sigmoid(x)"""
    return x * (1.0 / (1.0 + np.exp(-x)))

class SwiGLUExpert:
    """A single SwiGLU FFN expert (from the FFN post)."""

    def __init__(self, d_model, d_ff):
        scale = np.sqrt(2.0 / d_model)
        self.w_gate = np.random.randn(d_model, d_ff) * scale
        self.w_up   = np.random.randn(d_model, d_ff) * scale
        self.w_down = np.random.randn(d_ff, d_model) * scale

    def __call__(self, x):
        gate = silu(x @ self.w_gate)  # (tokens, d_ff)
        up   = x @ self.w_up          # (tokens, d_ff)
        return (gate * up) @ self.w_down  # (tokens, d_model)

    def param_count(self):
        return sum(w.size for w in [self.w_gate, self.w_up, self.w_down])

Now the main event — the complete MoE layer:

class MoELayer:
    """Mixture of Experts layer that replaces a dense FFN."""

    def __init__(self, d_model, d_ff, num_experts=8, top_k=2, alpha=0.01):
        self.num_experts = num_experts
        self.top_k = top_k
        self.alpha = alpha
        self.aux_loss = 0.0  # stored after each forward pass

        # N independent SwiGLU experts
        self.experts = [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)]

        # Learned router: d_model -> num_experts
        scale = np.sqrt(2.0 / d_model)
        self.W_gate = np.random.randn(d_model, num_experts) * scale

    def __call__(self, x):
        """
        x: (seq_len, d_model) — input from attention + residual + norm
        Returns: (seq_len, d_model) — MoE output
        """
        seq_len, d_model = x.shape

        # Step 1: Router produces probability distribution over experts
        logits = x @ self.W_gate                        # (seq_len, num_experts)
        router_probs = softmax(logits, axis=-1)         # (seq_len, num_experts)

        # Step 2: Top-k expert selection per token (sorted so column 0 = top-1)
        top_k_idx = np.argpartition(
            -router_probs, self.top_k, axis=-1
        )[:, :self.top_k]                               # (seq_len, top_k)
        order = np.argsort(
            -np.take_along_axis(router_probs, top_k_idx, axis=-1), axis=-1
        )
        top_k_idx = np.take_along_axis(top_k_idx, order, axis=-1)
        top_k_probs = np.take_along_axis(
            router_probs, top_k_idx, axis=-1
        )                                               # (seq_len, top_k)
        top_k_weights = top_k_probs / top_k_probs.sum(
            axis=-1, keepdims=True
        )                                               # (seq_len, top_k) — renormalized

        # Step 3: Dispatch tokens to experts and combine outputs
        output = np.zeros_like(x)                       # (seq_len, d_model)
        for k in range(self.top_k):
            expert_indices = top_k_idx[:, k]            # (seq_len,) — which expert
            weights = top_k_weights[:, k]               # (seq_len,) — how much weight

            for e in range(self.num_experts):
                # Find tokens assigned to this expert at rank k
                mask = (expert_indices == e)
                if not np.any(mask):
                    continue

                tokens_for_expert = x[mask]             # (n_tokens, d_model)
                expert_out = self.experts[e](tokens_for_expert)  # (n_tokens, d_model)
                output[mask] += weights[mask, None] * expert_out  # weighted contribution

        # Step 4: Compute auxiliary load-balancing loss
        primary = top_k_idx[:, 0]
        f = np.array([np.sum(primary == e) / seq_len
                       for e in range(self.num_experts)])
        P = router_probs.mean(axis=0)
        self.aux_loss = self.alpha * self.num_experts * np.sum(f * P)

        return output                                   # (seq_len, d_model)

    def param_count(self):
        expert_params = sum(e.param_count() for e in self.experts)
        router_params = self.W_gate.size
        return expert_params + router_params

Let's verify it works and compare parameter counts:

np.random.seed(42)
d_model = 64
d_ff = 172  # same as our transformer capstone

# Dense FFN: one big SwiGLU
dense_ffn = SwiGLUExpert(d_model, d_ff)

# MoE: 8 experts, each with the SAME d_ff, top-2 routing
moe = MoELayer(d_model, d_ff, num_experts=8, top_k=2)

x = np.random.randn(10, d_model)  # 10 tokens

dense_out = dense_ffn(x)
moe_out = moe(x)

print(f"Dense FFN:  {dense_out.shape}  |  params: {dense_ffn.param_count():>8,}")
print(f"MoE Layer:  {moe_out.shape}  |  params: {moe.param_count():>8,}")
print(f"Aux loss:   {moe.aux_loss:.4f}")
print(f"Parameter ratio: {moe.param_count() / dense_ffn.param_count():.1f}x")

Here's the key tradeoff visualized:

Configuration         | Total Params | Active Params/Token | Compute/Token
Dense FFN (d_ff=172)  |       33,024 |       33,024 (100%) |          1.0×
MoE 8 experts, top-2  |      264,704 |        66,048 (25%) |         ~2.0×
MoE 8 experts, top-1  |      264,704 |      33,024 (12.5%) |         ~1.0×

With top-2 routing, MoE uses 8× the total parameters but only 2× the compute per token (2 experts active out of 8). With top-1, you match the dense model's compute exactly while having 8× the capacity. This is why Mixtral 8x7B has 46.7 billion total parameters but inference compute comparable to a ~13B dense model.

The MoE bargain: Trade memory (you must store all expert weights) for capacity (more knowledge in the same compute budget). It's perfect when GPU memory is cheaper than GPU compute — which, increasingly, it is.

Training: MoE vs Dense

Theory is nice, but does MoE actually learn better than a dense FFN? Let's set both up on a small task and compare. We'll use character-level next-token prediction, the same setup from our transformer capstone — forward passes only here, since hand-writing backpropagation through the router would be a post of its own.

# Simplified comparison: MoE vs Dense FFN
# (Single-layer, character-level, forward passes only)

np.random.seed(42)
d_model, d_ff, vocab_size = 32, 86, 26  # small enough to run quickly on CPU
seq_len = 16

# Simple embedding + output head (shared between both models)
embed = np.random.randn(vocab_size, d_model) * 0.02

# Two models: dense FFN vs MoE (4 experts, top-2)
dense = SwiGLUExpert(d_model, d_ff)
moe = MoELayer(d_model, d_ff, num_experts=4, top_k=2, alpha=0.01)

# Toy training data: repeating character patterns
text = "abcdefghijklmnopqrstuvwxyz" * 20  # 520 characters
data = np.array([ord(c) - ord('a') for c in text])  # 0-25

def get_batch(data, seq_len, batch_start):
    x = data[batch_start:batch_start + seq_len]
    y = data[batch_start + 1:batch_start + seq_len + 1]
    return x, y

def forward_and_loss(model, embed, x_ids, y_ids):
    """Forward pass through embedding -> model -> softmax -> loss."""
    x = embed[x_ids]                         # (seq_len, d_model)
    h = model(x)                             # (seq_len, d_model)
    logits = h @ embed.T                     # (seq_len, vocab_size)

    # Cross-entropy loss
    logits_max = logits.max(axis=-1, keepdims=True)
    log_probs = logits - logits_max - np.log(
        np.exp(logits - logits_max).sum(axis=-1, keepdims=True)
    )
    loss = -log_probs[np.arange(len(y_ids)), y_ids].mean()
    return loss

# Compare losses over 50 steps (forward passes only — no weight updates)
dense_losses, moe_losses = [], []
for step in range(50):
    batch_start = (step * seq_len) % (len(data) - seq_len - 1)
    x_ids, y_ids = get_batch(data, seq_len, batch_start)

    d_loss = forward_and_loss(dense, embed, x_ids, y_ids)
    m_loss = forward_and_loss(moe, embed, x_ids, y_ids)

    dense_losses.append(float(d_loss))
    moe_losses.append(float(m_loss))

    if step % 10 == 0:
        aux = moe.aux_loss
        print(f"Step {step:3d} | Dense loss: {d_loss:.3f} | "
              f"MoE loss: {m_loss:.3f} (aux: {aux:.4f})")

In a full training setup with backpropagation and proper gradient updates, MoE consistently reaches lower loss than a dense model using the same per-token compute budget. The extra capacity from inactive experts provides a richer feature space — even though each token only sees 2 experts, the variety of available pathways matters.

Do Experts Actually Specialize?

One of the most intriguing findings from MoE research is that experts do develop specializations, though perhaps not the way you'd expect. Analysis of Mixtral 8x7B suggests routing correlates more with syntax and token position than with topic: there is little sign of a dedicated "biology expert" or "law expert," but consecutive tokens are often sent to the same experts, and certain token types (code indentation, punctuation) show consistent routing preferences.

The load-balancing loss actually works against sharp specialization — it pushes routing toward uniformity. This is a fundamental tension in MoE design: too much balance wastes the potential of specialization, too little leads to collapse. DeepSeek-V3's auxiliary-loss-free approach (which we'll explore next) was partly motivated by this tension.

Modern Innovations: Beyond Vanilla MoE

The vanilla MoE we just built is the foundation, but production models have added several clever refinements. Let's look at the three biggest ideas.

Shared Experts (DeepSeek)

DeepSeek introduced a simple but powerful idea with its DeepSeekMoE architecture (carried into V2 and V3): not all experts should be optional. Some knowledge is so fundamental that every token needs it. So they added one or more shared experts that are always active, alongside the routed experts:

class SharedExpertMoE:
    """MoE with always-active shared expert(s) + routed experts."""

    def __init__(self, d_model, d_ff, num_shared=1, num_routed=8, top_k=2):
        # Shared experts: always active for every token
        self.shared = [SwiGLUExpert(d_model, d_ff) for _ in range(num_shared)]

        # Routed experts: selected by the router
        self.routed_moe = MoELayer(d_model, d_ff, num_experts=num_routed, top_k=top_k)

    def __call__(self, x):
        # Shared experts process ALL tokens
        shared_out = sum(expert(x) for expert in self.shared)

        # Routed experts process tokens selectively
        routed_out = self.routed_moe(x)

        # Combine both contributions
        return shared_out + routed_out

The shared expert captures universal knowledge (syntax, common patterns), while routed experts handle specialized processing. DeepSeek-V3 uses 1 shared expert plus 256 routed experts with top-8 selection — a total of 671 billion parameters with just 37 billion active per token.

Fine-Grained Routing

Mixtral uses 8 large experts. DeepSeek uses 256 small ones. Why? More experts means more possible combinations. With 8 experts and top-2, you get C(8,2) = 28 unique expert pairs. With 256 experts and top-8, you get a staggering C(256,8) ≈ 4 × 10^14 combinations — far more expressiveness for matching tokens to specialized processing.

The trick is that each fine-grained expert is proportionally smaller, so total compute stays constant. Instead of 8 experts each with d_ff=14336, you might have 256 experts each with d_ff=448. Same total parameters, radically different routing flexibility.
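Both numbers are easy to verify:

```python
import math

# Routing combinations: coarse vs. fine-grained
print(math.comb(8, 2))               # 28 unique expert pairs
print(f"{math.comb(256, 8):.2e}")    # ≈ 4.1 × 10^14 combinations

# Fine-grained experts keep total size constant: 8 coarse vs. 256 small
assert 8 * 14336 == 256 * 448        # identical total d_ff, hence identical params
```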

Auxiliary-Loss-Free Balancing

The load-balancing loss we implemented has a fundamental problem: it competes with the primary training objective. Every gradient step is a compromise between "predict the next token better" and "spread tokens more evenly." DeepSeek-V3 found a way to eliminate this tension entirely:

# DeepSeek-V3: Bias-based load balancing (no auxiliary loss)

class AuxFreeMoERouter:
    """Router with bias-based balancing — no auxiliary loss needed."""

    def __init__(self, d_model, num_experts, gamma=0.001):
        scale = np.sqrt(2.0 / d_model)
        self.W_gate = np.random.randn(d_model, num_experts) * scale
        self.bias = np.zeros(num_experts)   # balancing bias (NOT learned by gradient)
        self.gamma = gamma                  # bias update rate

    def route(self, x, k=2):
        logits = x @ self.W_gate            # (seq_len, num_experts)

        # Routing decision uses bias (determines WHICH experts)
        routing_scores = softmax(logits + self.bias, axis=-1)
        top_k_idx = np.argpartition(-routing_scores, k, axis=-1)[:, :k]

        # Gating weights do NOT use bias (determines HOW MUCH weight)
        gating_weights = softmax(logits, axis=-1)
        top_k_probs = np.take_along_axis(gating_weights, top_k_idx, axis=-1)
        top_k_weights = top_k_probs / top_k_probs.sum(axis=-1, keepdims=True)

        # Update bias from the observed load (a counter, not a gradient!)
        tokens_per_expert = np.zeros(len(self.bias))
        for idx in top_k_idx.ravel():        # count every selected slot, all k of them
            tokens_per_expert[idx] += 1
        avg_load = tokens_per_expert.mean()
        for e in range(len(self.bias)):
            if tokens_per_expert[e] > avg_load:
                self.bias[e] -= self.gamma   # overloaded → discourage
            elif tokens_per_expert[e] < avg_load:
                self.bias[e] += self.gamma   # underloaded → encourage

        return top_k_idx, top_k_weights

The key insight is the separation: the bias affects which experts get selected but not how much they contribute to the output. This means the gating weights remain pure functions of the input, while the bias acts as a gentle balancing thumb on the scale. No auxiliary loss means no gradient conflict — every gradient step is fully dedicated to the primary training objective.
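We can watch the bias do its work. The standalone sketch below (a deliberately minimal top-1 router, redefined here so it runs on its own) rigs the gate weights so expert 0 starts as the runaway favorite, then lets the sign-based bias update run; the load drifts back toward uniform even though no loss term was added:

```python
import numpy as np

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - x_max)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
num_experts, d_model, gamma = 4, 16, 0.05
W = np.random.randn(d_model, num_experts) * 0.1
W[:, 0] += 0.5                      # rig expert 0 as the runaway favorite
bias = np.zeros(num_experts)

def route_top1(x):
    # bias shifts WHICH expert wins; gating weights would still use plain logits
    return softmax(x @ W + bias, axis=-1).argmax(axis=-1)

for step in range(300):
    x = np.random.randn(32, d_model)
    load = np.bincount(route_top1(x), minlength=num_experts)
    bias -= gamma * np.sign(load - load.mean())   # overloaded down, underloaded up

final_load = np.bincount(route_top1(np.random.randn(1000, d_model)),
                         minlength=num_experts)
print("load per expert after balancing:", final_load)   # roughly uniform
```

Without the bias updates, expert 0 captures far more than its fair share; with them, the loads settle near 25% each — and the router's gradient path never saw a balancing term.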

Expert Parallelism

At scale, each expert lives on a different GPU. When a token needs expert 5, its hidden state must be sent to GPU 5, processed, and the result sent back. This is the all-to-all communication pattern — every GPU potentially sends data to every other GPU in two rounds (dispatch tokens, then collect results).

This communication cost is the main engineering challenge of MoE at scale. DeepSeek-V3 addresses it by grouping 256 experts into 8 groups of 32, each group on one node, and limiting each token to visit at most 4 nodes. Less communication, more practical at 671B parameters.
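A toy single-process simulation (standing in for multiple GPUs, top-1 routing for simplicity) makes the dispatch volume concrete: with tokens sharded evenly across devices and routing roughly uniform, about (num_devices − 1)/num_devices of all dispatches cross a device boundary — and the combine step mirrors every one of them back.

```python
import numpy as np

np.random.seed(1)
num_tokens, num_experts, num_devices = 64, 8, 4
experts_per_device = num_experts // num_devices   # 2 experts live on each "GPU"

# Tokens start evenly sharded; the router assigns each one a top-1 expert
token_device = np.repeat(np.arange(num_devices), num_tokens // num_devices)
assigned_expert = np.random.randint(0, num_experts, size=num_tokens)
expert_device = assigned_expert // experts_per_device

# All-to-all dispatch volume: tokens whose expert lives on another device
crossing = int(np.sum(token_device != expert_device))
print(f"{crossing}/{num_tokens} tokens cross a device boundary "
      f"(~{(num_devices - 1) / num_devices:.0%} expected under uniform routing)")
```

Top-2 routing roughly doubles the traffic, which is exactly why DeepSeek-V3's node-limited routing caps how many nodes a token may visit.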

Try It: MoE Routing Visualizer

Each token gets routed to its top-2 experts. Arc thickness = gating weight. Toggle auxiliary loss OFF to see router collapse.

The Practical Guide

When should you use MoE? The rough decision framework: reach for it when you're compute-constrained but memory-rich — you want more capacity without more FLOPs per token, and you can afford to keep every expert's weights resident. Stay dense when memory is the bottleneck (edge deployment, single-GPU serving), when the model is small enough that routing overhead dominates, or when you want the simplest possible training dynamics.

How MoE interacts with the other techniques we've built: it only replaces the FFN. Attention, normalization, residual connections, and the embeddings are untouched, and the MoE output feeds the residual stream exactly as the dense FFN did.

The Efficiency Revolution

MoE is how modern LLMs get smarter without proportionally more compute. The core insight is beautiful in its simplicity: not every token needs every parameter.

We built the whole thing from scratch:

  1. Router — a single linear layer + softmax that produces expert selection probabilities
  2. Top-k selection — pick the best k experts, renormalize weights, combine outputs
  3. Load balancing — the auxiliary loss that prevents router collapse
  4. Expert dispatch — send each token to its chosen experts, gather weighted results

And we saw how modern systems push beyond vanilla MoE: shared experts for universal knowledge, fine-grained routing with hundreds of tiny experts, and auxiliary-loss-free balancing that eliminates the training conflict entirely.

The architecture picture keeps growing. Where MoE fits:

tokenize → embed → position → normalize → attend → MoE FFN → softmax → loss → optimize → decode

We replaced the dense FFN with a sparse one. 8× the parameters, 2× the compute, and a tiny router that learns where to send each token. That's the trick that makes models like Mixtral and DeepSeek-V3 possible — and it's built from nothing more than the components we already understand.

References & Further Reading