Mixture of Experts from Scratch: How One Transformer Becomes Eight
The Scaling Problem
In our FFN from Scratch post, we built the feed-forward network that lives inside every transformer block. We discovered something surprising: the FFN holds roughly two-thirds of a transformer's parameters. In LLaMA 3 8B, for example, each FFN layer has three weight matrices totaling over 150 million parameters — and there are 32 of these layers.
Here's the uncomfortable truth about dense FFNs: every single token pays the full computational cost of the entire network. A comma, a period, the word "the" — each one flows through exactly the same 150 million parameters as a complex technical term or a rare multilingual token. Does a comma really need the same neural pathway as "mitochondria"?
The answer, it turns out, is no. And that insight unlocks one of the most important ideas in modern AI: Mixture of Experts (MoE).
The idea is elegant: instead of one massive FFN, use N smaller FFNs (the "experts") and a tiny learned router that decides which experts each token should visit. Each token passes through only 2 of the 8 experts, so you get 8× the total parameters at only ~2× the compute per token. More knowledge, less cost.
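To make that arithmetic concrete, here's a back-of-envelope sketch. The dimensions are LLaMA-3-8B-like and chosen purely for illustration:

```python
# Back-of-envelope MoE arithmetic (illustrative dimensions, not a real config)
d_model, d_ff = 4096, 14336          # LLaMA-3-8B-like FFN sizes
dense_params = 3 * d_model * d_ff    # SwiGLU FFN: gate, up, down matrices

num_experts, top_k = 8, 2
total_params = num_experts * dense_params    # what you must store
active_params = top_k * dense_params         # what each token actually touches

print(total_params // dense_params)    # 8  (8x the parameters)
print(active_params // dense_params)   # 2  (~2x the compute per token)
```

Memory scales with `total_params`, per-token compute with `active_params` — that gap is the entire MoE value proposition.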
MoE isn't new — Jacobs, Jordan, Nowlan, and Hinton proposed it in 1991. But the transformer era made it dominant. Google's Switch Transformer (2022) scaled to a trillion parameters. Mistral's Mixtral 8x7B (2024) made it practical. And DeepSeek-V3 (2024) pushed it to 671 billion parameters with just 37 billion active per token, training it for a fraction of the cost of comparable dense models.
Today we're building it all from scratch in NumPy: the router, top-k selection, the load balancing trick that prevents collapse, and a complete MoE layer that drops right into the transformer we already built.
The Gating Network: Learning to Route
The router — also called the gating network — is the brain of MoE. It looks at each token's hidden state and decides which experts should process it. And it's shockingly simple: a single linear projection followed by softmax.
For a token with hidden state x of dimension d_model, and N experts:
import numpy as np
def softmax(x, axis=-1):
"""Numerically stable softmax."""
x_max = np.max(x, axis=axis, keepdims=True)
exp_x = np.exp(x - x_max)
return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
class Router:
"""Learned gating network that routes tokens to experts."""
def __init__(self, d_model, num_experts):
# One linear layer: d_model -> num_experts
scale = np.sqrt(2.0 / d_model)
self.W_gate = np.random.randn(d_model, num_experts) * scale
def __call__(self, x):
"""
x: (seq_len, d_model) — token hidden states
Returns: (seq_len, num_experts) — probability over experts per token
"""
logits = x @ self.W_gate # (seq_len, num_experts)
return softmax(logits, axis=-1) # (seq_len, num_experts)
That's the entire router. A single matrix multiply turns each token's d_model-dimensional hidden state into a probability distribution over N experts. The router starts random and learns during training — through backpropagation, it discovers which experts should handle which kinds of tokens.
Let's see what its output looks like:
np.random.seed(42)
router = Router(d_model=64, num_experts=8)
x = np.random.randn(4, 64) # 4 tokens
probs = router(x)
print("Router output — probability per expert for each token:")
for i in range(4):
top2 = np.argsort(probs[i])[-2:][::-1]
print(f" Token {i}: ", end="")
for j in range(8):
marker = " *" if j in top2 else " "
print(f"E{j}={probs[i,j]:.3f}{marker}", end=" ")
print()
# Each row sums to 1.0 — it's a valid probability distribution
Each token gets a different distribution. Some tokens might strongly prefer expert 3, while others spread their probability more evenly. The starred entries are the top-2 experts — the ones that will actually process the token.
Top-K Routing: Pick Your Experts
The router gives us probabilities over all N experts, but using all of them would defeat the purpose. We'd be computing N FFN forward passes per token — more expensive than a single large FFN. The whole point of MoE is sparse activation: each token visits only k experts.
Top-k routing works in three steps:
- Select the k experts with the highest gating scores
- Renormalize the selected weights so they sum to 1
- Combine the experts' outputs with the renormalized weights
The output for each token is a weighted sum of its selected experts:
y = Σ_{i ∈ top-k} w_i · Expert_i(x)
def top_k_route(router_probs, k=2):
"""
Select top-k experts per token and renormalize weights.
router_probs: (seq_len, num_experts) — probability distribution
k: number of experts to select per token
Returns:
top_k_indices: (seq_len, k) — which experts are selected
top_k_weights: (seq_len, k) — renormalized weights
"""
seq_len, num_experts = router_probs.shape
# Find the k largest probabilities per token
top_k_indices = np.argpartition(-router_probs, k, axis=-1)[:, :k] # (seq_len, k)
# Gather the corresponding probabilities
top_k_probs = np.take_along_axis(router_probs, top_k_indices, axis=-1) # (seq_len, k)
# Renormalize so the k weights sum to 1
top_k_weights = top_k_probs / top_k_probs.sum(axis=-1, keepdims=True) # (seq_len, k)
return top_k_indices, top_k_weights
Let's trace through a concrete example. Say we have 4 tokens, 4 experts, and top-2 routing:
np.random.seed(7)
router = Router(d_model=8, num_experts=4)
x = np.random.randn(4, 8)
probs = router(x)
indices, weights = top_k_route(probs, k=2)
print("Token | Router Probs (all 4 experts) | Selected → Weights")
print("-" * 75)
for i in range(4):
prob_str = " ".join(f"E{j}={probs[i,j]:.3f}" for j in range(4))
sel_str = " + ".join(f"E{indices[i,j]}×{weights[i,j]:.3f}" for j in range(2))
print(f" {i} | {prob_str} | {sel_str}")
# Output: each token picks its top 2 experts, weights sum to 1.0
Common choices for k across real models: top-1 in Google's Switch Transformer (simplest, but loses the weighted combination), top-2 in Mixtral 8x7B (the sweet spot for quality vs. compute), and top-8 of 256 in DeepSeek-V3 (fine-grained routing with hundreds of small experts).
The Load Balancing Problem
There's a catastrophic failure mode lurking in MoE, and if you train without addressing it, your model will break.
The problem is router collapse. Without explicit encouragement to spread tokens across experts, the router learns to funnel almost all tokens into one or two "favorite" experts. Here's why: if expert 3 happens to get slightly more tokens early in training, it improves faster (more gradient signal), which makes the router send even more tokens to it, which makes it even better — a vicious positive feedback loop. The other experts starve, and you've spent the memory on 8 experts but only 2 are doing useful work.
The fix is an auxiliary load-balancing loss that nudges the router toward balanced assignment. The Switch Transformer introduced this elegant formulation:
L_balance = α · N · Σ_{i=1}^{N} f_i · P_i
Where:
- f_i = fraction of tokens actually dispatched to expert i (hard assignment via argmax)
- P_i = mean router probability allocated to expert i (soft, differentiable through softmax)
- α = 0.01 (small enough not to overwhelm the primary training loss)
- N = number of experts (makes the loss scale-invariant)
The beauty of this formulation is the product f_i · P_i. The hard assignment f_i isn't differentiable (it uses argmax), but P_i is differentiable through the softmax. By multiplying them, the gradient flows through P_i to push the router toward balance. When routing is perfectly balanced, both f_i and P_i equal 1/N for all experts, and the loss is minimized.
def load_balance_loss(router_probs, top_k_indices, num_experts, alpha=0.01):
"""
Auxiliary loss that encourages balanced expert assignment.
router_probs: (seq_len, num_experts) — full router probability distribution
top_k_indices: (seq_len, k) — which experts were selected
num_experts: int — total number of experts
alpha: float — loss weight (0.01 in Switch Transformer)
Returns: scalar loss value
"""
seq_len = router_probs.shape[0]
# f_i: fraction of tokens where expert i is the TOP-1 choice
primary_expert = top_k_indices[:, 0] # (seq_len,)
f = np.zeros(num_experts)
for i in range(num_experts):
f[i] = np.sum(primary_expert == i) / seq_len # (num_experts,)
# P_i: mean router probability for expert i across all tokens
P = router_probs.mean(axis=0) # (num_experts,)
# Auxiliary loss: penalizes imbalance
loss = alpha * num_experts * np.sum(f * P) # scalar
return loss
# Example: balanced vs. collapsed routing
np.random.seed(42)
seq_len, num_experts = 100, 8
# Balanced: each expert gets ~12.5% of tokens
balanced_probs = softmax(np.random.randn(seq_len, num_experts) * 0.5)
balanced_idx, _ = top_k_route(balanced_probs, k=2)
# Collapsed: expert 0 dominates
collapsed_logits = np.random.randn(seq_len, num_experts) * 0.5
collapsed_logits[:, 0] += 5.0 # heavily bias toward expert 0
collapsed_probs = softmax(collapsed_logits)
collapsed_idx, _ = top_k_route(collapsed_probs, k=2)
print(f"Balanced loss: {load_balance_loss(balanced_probs, balanced_idx, num_experts):.4f}")
print(f"Collapsed loss: {load_balance_loss(collapsed_probs, collapsed_idx, num_experts):.4f}")
# Collapsed loss is much higher — the auxiliary loss detects the imbalance
There's one more practical detail: the capacity factor. Since we can't predict exactly how many tokens will route to each expert, we pre-allocate a buffer for each expert:
expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)
# capacity_factor = 1.0: exact fair share (any imbalance drops tokens)
# capacity_factor = 1.25: 25% slack (the typical choice)
When an expert's buffer overflows, excess tokens are dropped — they skip the MoE layer entirely and pass through via the residual connection only. It's a graceful degradation: the token misses one layer's processing but doesn't break the model.
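Here's a minimal sketch of that dropping behavior. It assumes first-come-first-served dispatch in token order (one common policy), and `apply_capacity` is a hypothetical helper name, not from any library:

```python
import numpy as np

def apply_capacity(assignments, num_experts, capacity_factor=1.25):
    """Mark tokens that overflow an expert's buffer as dropped.

    assignments: (num_tokens,) top-1 expert index per token
    Returns a boolean mask: True = processed, False = dropped
    (a dropped token passes through via the residual connection only).
    """
    num_tokens = len(assignments)
    capacity = int(num_tokens / num_experts * capacity_factor)
    counts = np.zeros(num_experts, dtype=int)
    keep = np.zeros(num_tokens, dtype=bool)
    for t, e in enumerate(assignments):   # first-come, first-served
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True
    return keep

# Worst case: 16 tokens, 4 experts, every token wants expert 0
assignments = np.zeros(16, dtype=int)
keep = apply_capacity(assignments, num_experts=4, capacity_factor=1.25)
print(int(keep.sum()), "of 16 tokens kept")  # capacity = int(16/4 * 1.25) = 5
```

With balanced routing nothing is dropped; the mask only bites when the router collapses onto a few experts.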
Building the Complete MoE Layer
Now we assemble everything into a single class. Each expert is a SwiGLU FFN — exactly the same architecture we built before, just with smaller dimensions. The router picks top-k experts per token, the experts process their assigned tokens, and the outputs get weighted and combined.
def silu(x):
"""SiLU/Swish activation: x * sigmoid(x)"""
return x * (1.0 / (1.0 + np.exp(-x)))
class SwiGLUExpert:
"""A single SwiGLU FFN expert (from the FFN post)."""
def __init__(self, d_model, d_ff):
scale = np.sqrt(2.0 / d_model)
self.w_gate = np.random.randn(d_model, d_ff) * scale
self.w_up = np.random.randn(d_model, d_ff) * scale
self.w_down = np.random.randn(d_ff, d_model) * scale
def __call__(self, x):
gate = silu(x @ self.w_gate) # (tokens, d_ff)
up = x @ self.w_up # (tokens, d_ff)
return (gate * up) @ self.w_down # (tokens, d_model)
def param_count(self):
return sum(w.size for w in [self.w_gate, self.w_up, self.w_down])
Now the main event — the complete MoE layer:
class MoELayer:
"""Mixture of Experts layer that replaces a dense FFN."""
def __init__(self, d_model, d_ff, num_experts=8, top_k=2, alpha=0.01):
self.num_experts = num_experts
self.top_k = top_k
self.alpha = alpha
self.aux_loss = 0.0 # stored after each forward pass
# N independent SwiGLU experts
self.experts = [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)]
# Learned router: d_model -> num_experts
scale = np.sqrt(2.0 / d_model)
self.W_gate = np.random.randn(d_model, num_experts) * scale
def __call__(self, x):
"""
x: (seq_len, d_model) — input from attention + residual + norm
Returns: (seq_len, d_model) — MoE output
"""
seq_len, d_model = x.shape
# Step 1: Router produces probability distribution over experts
logits = x @ self.W_gate # (seq_len, num_experts)
router_probs = softmax(logits, axis=-1) # (seq_len, num_experts)
# Step 2: Top-k expert selection per token
top_k_idx = np.argpartition(
-router_probs, self.top_k, axis=-1
)[:, :self.top_k] # (seq_len, top_k)
top_k_probs = np.take_along_axis(
router_probs, top_k_idx, axis=-1
) # (seq_len, top_k)
top_k_weights = top_k_probs / top_k_probs.sum(
axis=-1, keepdims=True
) # (seq_len, top_k) — renormalized
# Step 3: Dispatch tokens to experts and combine outputs
output = np.zeros_like(x) # (seq_len, d_model)
for k in range(self.top_k):
expert_indices = top_k_idx[:, k] # (seq_len,) — which expert
weights = top_k_weights[:, k] # (seq_len,) — how much weight
for e in range(self.num_experts):
# Find tokens assigned to this expert at rank k
mask = (expert_indices == e)
if not np.any(mask):
continue
tokens_for_expert = x[mask] # (n_tokens, d_model)
expert_out = self.experts[e](tokens_for_expert) # (n_tokens, d_model)
output[mask] += weights[mask, None] * expert_out # weighted contribution
# Step 4: Compute auxiliary load-balancing loss
primary = top_k_idx[:, 0]
f = np.array([np.sum(primary == e) / seq_len
for e in range(self.num_experts)])
P = router_probs.mean(axis=0)
self.aux_loss = self.alpha * self.num_experts * np.sum(f * P)
return output # (seq_len, d_model)
def param_count(self):
expert_params = sum(e.param_count() for e in self.experts)
router_params = self.W_gate.size
return expert_params + router_params
Let's verify it works and compare parameter counts:
np.random.seed(42)
d_model = 64
d_ff = 172 # same as our transformer capstone
# Dense FFN: one big SwiGLU
dense_ffn = SwiGLUExpert(d_model, d_ff)
# MoE: 8 experts, each with the SAME d_ff, top-2 routing
moe = MoELayer(d_model, d_ff, num_experts=8, top_k=2)
x = np.random.randn(10, d_model) # 10 tokens
dense_out = dense_ffn(x)
moe_out = moe(x)
print(f"Dense FFN: {dense_out.shape} | params: {dense_ffn.param_count():>8,}")
print(f"MoE Layer: {moe_out.shape} | params: {moe.param_count():>8,}")
print(f"Aux loss: {moe.aux_loss:.4f}")
print(f"Parameter ratio: {moe.param_count() / dense_ffn.param_count():.1f}x")
Here's the key tradeoff visualized:
| Configuration | Total Params | Active Params/Token | Compute/Token |
|---|---|---|---|
| Dense FFN (d_ff=172) | 33,024 | 33,024 (100%) | 1.0× |
| MoE 8 experts, top-2 | 264,704 | 66,048 (25%) | ~2.0× |
| MoE 8 experts, top-1 | 264,704 | 33,024 (12.5%) | ~1.0× |
With top-2 routing, MoE uses 8× the total parameters but only 2× the compute per token (2 experts active out of 8). With top-1, you match the dense model's compute exactly while having 8× the capacity. This is why Mixtral 8x7B has 46.7 billion total parameters but inference compute comparable to a ~13B dense model.
The MoE bargain: Trade memory (you must store all expert weights) for capacity (more knowledge in the same compute budget). It's perfect when GPU memory is cheaper than GPU compute — which, increasingly, it is.
Training: MoE vs Dense
Theory is nice, but does MoE actually learn better than a dense FFN? Let's set both up on a small task and compare their losses. We'll use character-level next-token prediction, the same setup from our transformer capstone. (To keep the code readable, we only run forward passes and track the loss — full backpropagation through the router is beyond the scope of this post.)
# Simplified training comparison: MoE vs Dense FFN
# (Single-layer, character-level, forward passes only for clarity)
np.random.seed(42)
d_model, d_ff, vocab_size = 32, 86, 26 # small for CPU training
seq_len, lr = 16, 0.01
# Simple embedding + output head (shared between both models)
embed = np.random.randn(vocab_size, d_model) * 0.02
# Two models: dense FFN vs MoE (4 experts, top-2)
dense = SwiGLUExpert(d_model, d_ff)
moe = MoELayer(d_model, d_ff, num_experts=4, top_k=2, alpha=0.01)
# Toy training data: repeating character patterns
text = "abcdefghijklmnopqrstuvwxyz" * 20 # 520 characters
data = np.array([ord(c) - ord('a') for c in text]) # 0-25
def get_batch(data, seq_len, batch_start):
x = data[batch_start:batch_start + seq_len]
y = data[batch_start + 1:batch_start + seq_len + 1]
return x, y
def forward_and_loss(model, embed, x_ids, y_ids):
"""Forward pass through embedding -> model -> softmax -> loss."""
x = embed[x_ids] # (seq_len, d_model)
h = model(x) # (seq_len, d_model)
logits = h @ embed.T # (seq_len, vocab_size)
# Cross-entropy loss
logits_max = logits.max(axis=-1, keepdims=True)
log_probs = logits - logits_max - np.log(
np.exp(logits - logits_max).sum(axis=-1, keepdims=True)
)
loss = -log_probs[np.arange(len(y_ids)), y_ids].mean()
return loss
# Forward-pass loop — track loss over 50 steps
dense_losses, moe_losses = [], []
for step in range(50):
batch_start = (step * seq_len) % (len(data) - seq_len - 1)
x_ids, y_ids = get_batch(data, seq_len, batch_start)
d_loss = forward_and_loss(dense, embed, x_ids, y_ids)
m_loss = forward_and_loss(moe, embed, x_ids, y_ids)
dense_losses.append(float(d_loss))
moe_losses.append(float(m_loss))
if step % 10 == 0:
aux = moe.aux_loss
print(f"Step {step:3d} | Dense loss: {d_loss:.3f} | "
f"MoE loss: {m_loss:.3f} (aux: {aux:.4f})")
In a full training setup with backpropagation and proper gradient updates, MoE consistently reaches lower loss than a dense model using the same per-token compute budget. The extra capacity from inactive experts provides a richer feature space — even though each token only sees 2 experts, the variety of available pathways matters.
Do Experts Actually Specialize?
One of the most intriguing findings from MoE research is that experts do develop specializations, though perhaps not the way you'd expect. Analysis of Mixtral 8x7B reveals:
- Token-level patterns are strong: Punctuation tokens consistently route to specific experts. Whitespace tokens cluster together. This is the clearest specialization signal.
- Syntactic patterns emerge: Some experts prefer verbs, others handle proper nouns. The router learns grammatical structure without being told it exists.
- Domain specialization is weak: Despite hopes, experts don't neatly split into "the biology expert" and "the math expert." All experts see all domains, just in different proportions.
The load-balancing loss actually works against sharp specialization — it pushes routing toward uniformity. This is a fundamental tension in MoE design: too much balance wastes the potential of specialization, too little leads to collapse. DeepSeek-V3's auxiliary-loss-free approach (which we'll explore next) was partly motivated by this tension.
Modern Innovations: Beyond Vanilla MoE
The vanilla MoE we just built is the foundation, but production models have added several clever refinements. Let's look at the three biggest ideas.
Shared Experts (DeepSeek)
DeepSeek-V2 introduced a simple but powerful idea: not all experts should be optional. Some knowledge is so fundamental that every token needs it. So they added one or more shared experts that are always active, alongside the routed experts:
class SharedExpertMoE:
"""MoE with always-active shared expert(s) + routed experts."""
def __init__(self, d_model, d_ff, num_shared=1, num_routed=8, top_k=2):
# Shared experts: always active for every token
self.shared = [SwiGLUExpert(d_model, d_ff) for _ in range(num_shared)]
# Routed experts: selected by the router
self.routed_moe = MoELayer(d_model, d_ff, num_experts=num_routed, top_k=top_k)
def __call__(self, x):
# Shared experts process ALL tokens
shared_out = sum(expert(x) for expert in self.shared)
# Routed experts process tokens selectively
routed_out = self.routed_moe(x)
# Combine both contributions
return shared_out + routed_out
The shared expert captures universal knowledge (syntax, common patterns), while routed experts handle specialized processing. DeepSeek-V3 uses 1 shared expert plus 256 routed experts with top-8 selection — a total of 671 billion parameters with just 37 billion active per token.
Fine-Grained Routing
Mixtral uses 8 large experts. DeepSeek uses 256 small ones. Why? More experts means more possible combinations. With 8 experts and top-2, you get C(8,2) = 28 unique expert pairs. With 256 experts and top-8, you get a staggering C(256,8) ≈ 4 × 10^14 combinations — far more expressiveness for matching tokens to specialized processing.
The trick is that each fine-grained expert is proportionally smaller, so total compute stays constant. Instead of 8 experts each with d_ff=14336, you might have 256 experts each with d_ff=448. Same total parameters, radically different routing flexibility.
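Both claims are easy to check numerically (the d_ff figures are the ones quoted above):

```python
import math

# Routing combinations: coarse vs fine-grained experts
coarse = math.comb(8, 2)     # Mixtral-style: 8 experts, top-2
fine = math.comb(256, 8)     # DeepSeek-style: 256 experts, top-8
print(coarse)                # 28
print(f"{fine:.1e}")         # ~4.1e+14 possible expert subsets

# Total FFN width stays constant if each expert shrinks proportionally
assert 8 * 14336 == 256 * 448
```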
Auxiliary-Loss-Free Balancing
The load-balancing loss we implemented has a fundamental problem: it competes with the primary training objective. Every gradient step is a compromise between "predict the next token better" and "spread tokens more evenly." DeepSeek-V3 found a way to eliminate this tension entirely:
# DeepSeek-V3: Bias-based load balancing (no auxiliary loss)
class AuxFreeMoERouter:
"""Router with bias-based balancing — no auxiliary loss needed."""
def __init__(self, d_model, num_experts, gamma=0.001):
scale = np.sqrt(2.0 / d_model)
self.W_gate = np.random.randn(d_model, num_experts) * scale
self.bias = np.zeros(num_experts) # balancing bias (NOT learned by gradient)
self.gamma = gamma # bias update rate
def route(self, x, k=2):
logits = x @ self.W_gate # (seq_len, num_experts)
# Routing decision uses bias (determines WHICH experts)
routing_scores = softmax(logits + self.bias, axis=-1)
top_k_idx = np.argpartition(-routing_scores, k, axis=-1)[:, :k]
# Gating weights do NOT use bias (determines HOW MUCH weight)
gating_weights = softmax(logits, axis=-1)
top_k_probs = np.take_along_axis(gating_weights, top_k_idx, axis=-1)
top_k_weights = top_k_probs / top_k_probs.sum(axis=-1, keepdims=True)
# Update bias based on load (not gradient-based!)
        # Count every routed assignment (all k ranks), not just the top-1 choice
        tokens_per_expert = np.bincount(top_k_idx.flatten(), minlength=len(self.bias))
avg_load = tokens_per_expert.mean()
for e in range(len(self.bias)):
if tokens_per_expert[e] > avg_load:
self.bias[e] -= self.gamma # overloaded → discourage
elif tokens_per_expert[e] < avg_load:
self.bias[e] += self.gamma # underloaded → encourage
return top_k_idx, top_k_weights
The key insight is the separation: the bias affects which experts get selected but not how much they contribute to the output. This means the gating weights remain pure functions of the input, while the bias acts as a gentle balancing thumb on the scale. No auxiliary loss means no gradient conflict — every gradient step is fully dedicated to the primary training objective.
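To watch the mechanism work, here's a standalone toy simulation — fixed logits and top-1 selection are simplifications of this sketch, not DeepSeek's exact recipe. Expert 0 starts with a huge advantage, and the bias updates alone restore balance:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, gamma = 4, 0.01
bias = np.zeros(num_experts)

# Fixed router logits that heavily favor expert 0
logits = rng.normal(size=(64, num_experts)) * 0.5
logits[:, 0] += 3.0

for _ in range(2000):
    choice = np.argmax(logits + bias, axis=-1)        # selection uses the bias
    load = np.bincount(choice, minlength=num_experts)
    bias -= gamma * (load > load.mean())              # overloaded -> discourage
    bias += gamma * (load < load.mean())              # underloaded -> encourage

print(np.bincount(choice, minlength=num_experts))     # roughly 16 tokens each
```

On the first step expert 0 takes all 64 tokens; after a few hundred bias updates the load settles close to 16 per expert, with no gradient signal involved at any point.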
Expert Parallelism
At scale, each expert lives on a different GPU. When a token needs expert 5, its hidden state must be sent to GPU 5, processed, and the result sent back. This is the all-to-all communication pattern — every GPU potentially sends data to every other GPU in two rounds (dispatch tokens, then collect results).
This communication cost is the main engineering challenge of MoE at scale. DeepSeek-V3 addresses it by grouping 256 experts into 8 groups of 32, each group on one node, and limiting each token to visit at most 4 nodes. Less communication, more practical at 671B parameters.
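A rough sketch of that node-limited routing idea, for one token. This simplification ranks groups by their single best expert score (DeepSeek-V3's actual criterion differs in detail), and `group_limited_topk` is a hypothetical helper name:

```python
import numpy as np

def group_limited_topk(scores, num_groups=8, groups_to_pick=4, k=8):
    """Restrict a token's top-k experts to its best few groups (nodes).

    scores: (num_experts,) router scores for ONE token
    """
    num_experts = scores.shape[0]
    per_group = num_experts // num_groups
    grouped = scores.reshape(num_groups, per_group)
    # Rank groups by their best expert's score; keep the top groups
    best_groups = np.argsort(-grouped.max(axis=-1))[:groups_to_pick]
    allowed = np.zeros(num_experts, dtype=bool)
    for g in best_groups:
        allowed[g * per_group:(g + 1) * per_group] = True
    # Mask out disallowed experts, then take top-k among the rest
    masked = np.where(allowed, scores, -np.inf)
    return np.argsort(-masked)[:k]

rng = np.random.default_rng(0)
scores = rng.normal(size=256)
experts = group_limited_topk(scores)
print(sorted({int(e) // 32 for e in experts}))  # experts span at most 4 groups
```

The token still gets its 8 experts, but they are guaranteed to live on at most 4 nodes, which caps the all-to-all traffic.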
The Practical Guide
When should you use MoE? Here's the decision framework:
- Use MoE when: You want more model capacity without proportionally more compute per token. This is the sweet spot for large-scale pretraining where you have plenty of GPU memory but need efficiency.
- Skip MoE when: Your model is small (under ~1B parameters), or you're memory-constrained. All experts must be resident in memory even though only k are active — Mixtral 8x7B has 46.7B total parameters stored in memory despite compute matching a ~13B model.
How MoE interacts with the other techniques we've built:
- Quantization: MoE models have a unique advantage — inactive experts can be quantized more aggressively than active ones. Hot experts stay at higher precision; cold experts drop to 4-bit or lower, saving significant memory.
- LoRA: Fine-tuning MoE models is tricky. The safest approach is to apply LoRA to the attention layers (shared across all experts) and the router. Targeting expert FFNs directly can cause load balance instability on small datasets. Always keep the auxiliary loss active during fine-tuning.
- KV Cache: MoE doesn't affect the KV cache at all — attention is shared, only the FFN is replaced. This means MoE's memory overhead is purely in the expert weights, not the inference cache.
The Efficiency Revolution
MoE is how modern LLMs get smarter without proportionally more compute. The core insight is beautiful in its simplicity: not every token needs every parameter.
We built the whole thing from scratch:
- Router — a single linear layer + softmax that produces expert selection probabilities
- Top-k selection — pick the best k experts, renormalize weights, combine outputs
- Load balancing — the auxiliary loss that prevents router collapse
- Expert dispatch — send each token to its chosen experts, gather weighted results
And we saw how modern systems push beyond vanilla MoE: shared experts for universal knowledge, fine-grained routing with hundreds of tiny experts, and auxiliary-loss-free balancing that eliminates the training conflict entirely.
The architecture picture keeps growing. Where MoE fits:
tokenize → embed → position → normalize → attend → MoE FFN → softmax → loss → optimize → decode
We replaced the dense FFN with a sparse one. 8× the parameters, 2× the compute, and a tiny router that learns where to send each token. That's the trick that makes models like Mixtral and DeepSeek-V3 possible — and it's built from nothing more than the components we already understand.
References & Further Reading
- Jacobs, Jordan, Nowlan, Hinton — Adaptive Mixtures of Local Experts (1991) — the paper that started it all, proposing learned gating for expert selection
- Shazeer et al. — Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer (2017) — scaled MoE to thousands of experts with noisy top-k gating
- Fedus, Zoph, Shazeer — Switch Transformers (2022) — simplified MoE to top-1 routing, introduced the auxiliary load-balancing loss and capacity factor
- Jiang et al. — Mixtral of Experts (2024) — made MoE practical: 46.7B params, 12.9B active, matching or beating LLaMA 2 70B
- DeepSeek-AI — DeepSeek-V3 Technical Report (2024) — 671B params, 37B active, introduced auxiliary-loss-free load balancing and fine-grained routing with 256 experts
- Hugging Face — Mixture of Experts Explained — accessible overview of MoE concepts and implementations