
Feed-Forward Networks from Scratch: The Other Half of Every Transformer Block

The Unsung Majority

Quick: what's the most important component in a transformer? If you said "attention," you're in good company. Entire papers, blog posts, and Twitter threads orbit around the self-attention mechanism. We even named the landmark paper after it.

But here's the twist: two-thirds of a transformer's parameters don't live in attention at all. They live in the feed-forward network — the "boring" layer that sits between attention blocks and quietly does most of the heavy lifting.

In our attention post, we built the mechanism that lets tokens talk to each other. But attention only mixes information — it doesn't transform it. The feed-forward network (FFN) is where the transformer actually processes what it has learned from context, storing knowledge and making per-token predictions. Think of attention as the meeting where everyone shares their ideas, and the FFN as the individual work that happens afterward.

In this post, we'll build every major FFN variant from pure math and NumPy:

- the classic expand-activate-compress FFN from the original transformer paper
- the activation functions that power it: ReLU, GELU, and SiLU/Swish
- the gated family: GLU, ReGLU, GEGLU, and SwiGLU
- the SwiGLU FFN used in LLaMA and Mistral, sized for a fair parameter budget
- the key-value memory interpretation of what FFN layers store
- a complete Pre-Norm transformer block

By the end, you'll understand the component that holds most of a language model's knowledge. Let's build it.

The Classic FFN: Expand, Activate, Compress

The original transformer paper defines the position-wise feed-forward network with a beautifully simple formula:

FFN(x) = ReLU(x W₁ + b₁) W₂ + b₂

That's it. Two matrix multiplications with an activation function sandwiched between them. But there's a critical detail in the word "position-wise" — this same network is applied independently to every token in the sequence. Token at position 1 gets the same FFN as token at position 50. While attention lets tokens communicate across positions, the FFN processes each token's representation in isolation.
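The position-wise property is easy to verify numerically. Here's a toy sketch with made-up dimensions and random weights (not the paper's sizes): running the FFN over the whole sequence at once gives exactly the same result as running it on each token separately.

```python
import numpy as np

np.random.seed(0)
d_model, d_ff, seq_len = 4, 16, 3  # toy dimensions for illustration
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)
x = np.random.randn(seq_len, d_model)

# Whole sequence at once vs. one token at a time
batched = np.maximum(0, x @ W1) @ W2
per_token = np.stack([np.maximum(0, t @ W1) @ W2 for t in x])
print(np.allclose(batched, per_token))  # True: the FFN never mixes positions
```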

The architecture follows an expand-then-compress pattern. The first projection maps from the model dimension d_model up to a larger hidden dimension d_ff = 4 × d_model. The activation function applies element-wise non-linearity. Then the second projection maps back down to d_model. In the original paper, that's 512 → 2048 → 512.

Why the 4× expansion? The network needs a higher-dimensional space to learn complex transformations. Projecting up gives the non-linearity more dimensions to work with — more "room to think" — before compressing back to the model dimension. Let's build it:

import numpy as np
np.random.seed(42)

def classic_ffn(x, W1, b1, W2, b2):
    """Original transformer FFN: ReLU(xW1 + b1)W2 + b2"""
    hidden = x @ W1 + b1          # (seq, d_model) @ (d_model, d_ff) = (seq, d_ff)
    activated = np.maximum(0, hidden)  # ReLU: zero out negatives
    output = activated @ W2 + b2  # (seq, d_ff) @ (d_ff, d_model) = (seq, d_model)
    return output

# Dimensions from the original transformer paper
d_model, d_ff = 8, 32  # using small dims for demonstration (real: 512, 2048)
seq_len = 3

# Initialize weights (Kaiming/He initialization for ReLU)
W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model)
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff)
b2 = np.zeros(d_model)

# Three tokens of input
x = np.random.randn(seq_len, d_model)  # (3, 8)

output = classic_ffn(x, W1, b1, W2, b2)
print(f"Input shape:  {x.shape}")       # (3, 8)
print(f"Output shape: {output.shape}")   # (3, 8)
print(f"Parameters:   {d_model * d_ff + d_ff + d_ff * d_model + d_model}")  # 8*32 + 32 + 32*8 + 8 = 552
print(f"First token in:  {x[0, :4].round(3)}")
print(f"First token out: {output[0, :4].round(3)}")

Why 2/3 of all parameters? Attention uses four projections (Q, K, V, output), each of size d_model × d_model, totaling 4d² parameters. The FFN uses two projections of size d_model × 4d_model, totaling 8d². That's 8d² out of 12d² per layer — exactly two-thirds (66.7%).
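A quick sanity check of that arithmetic, using the original paper's d_model = 512 (biases and norm parameters ignored, since they're negligible):

```python
d = 512                          # d_model from the original transformer
attn_params = 4 * d * d          # Q, K, V, and output projections
ffn_params = 2 * d * (4 * d)     # W1 (d -> 4d) plus W2 (4d -> d)
share = ffn_params / (attn_params + ffn_params)
print(f"{share:.4f}")            # 0.6667 -> two-thirds of the per-layer weights
```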

Activation Functions: From Hard Gates to Smooth Curves

The activation function between the two linear projections is what makes the FFN non-linear. Without it, stacking two matrix multiplications is just one bigger matrix multiplication — the network couldn't learn anything interesting. The choice of activation has evolved dramatically since 2017.

ReLU: The Hard Gate

ReLU(x) = max(0, x) was the original choice. It's brutally simple: positive values pass through unchanged, negative values are zeroed. This creates a hard binary gate where roughly half the neurons are "on" and half are "off" for any given input.

The problem is the dying neuron phenomenon. When a neuron's pre-activation consistently falls below zero — maybe due to an unlucky weight update — its gradient becomes permanently zero. No gradient flows, no learning happens, and the neuron is dead forever. In deep networks, this cascading death can silently reduce your effective model capacity.
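We can watch a neuron die in miniature. The setup below is deliberately contrived (a single hypothetical ReLU unit whose bias has drifted far negative), but it shows the mechanism: once the pre-activation is negative for every plausible input, the gradient gate is zero everywhere, so no update can revive it.

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(8)            # weights of one hidden neuron
b = -30.0                         # bias pushed far negative by bad updates
x = np.random.randn(10000, 8)     # a large batch of typical inputs

pre = x @ w + b                       # pre-activations
grad_gate = (pre > 0).astype(float)   # ReLU passes gradient only where pre > 0
print(grad_gate.mean())               # 0.0 -> the neuron never fires, never learns
```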

GELU: The Gaussian Gate

GELU (Hendrycks & Gimpel, 2016) takes an elegant probabilistic approach. Instead of a hard zero/one gate, imagine randomly masking each input with a probability that depends on the input's value: large positive values almost always pass through, large negative values are almost always zeroed, and values near zero get a coin flip.

Mathematically, multiply each input by a Bernoulli random variable where the probability of passing is Φ(x), the CDF of the standard normal distribution. Take the expectation:

GELU(x) = x · Φ(x) = x · ½[1 + erf(x / √2)]

In practice, implementations use a fast tanh approximation:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
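If SciPy is available, it's easy to check how tight the approximation is over a reasonable input range:

```python
import numpy as np
from scipy.special import erf

x = np.linspace(-5, 5, 1001)
exact = x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
approx = 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
print(np.abs(exact - approx).max())  # well under 1e-3 everywhere
```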

GELU's smooth curve means it has a non-zero gradient everywhere — even for negative inputs. Dead neurons can recover. This is why GELU became the default for BERT, GPT-2, and GPT-3.

SiLU/Swish: The Self-Gate

Swish (Ramachandran, Zoph & Le, 2017) was discovered through an automated search over activation functions — Google Brain literally tried thousands of combinations and this one won:

SiLU(x) = x · σ(x) = x · (1 / (1 + e⁻ˣ))

The input gates itself through its own sigmoid. For large positive x, σ(x) ≈ 1, so SiLU(x) ≈ x. For large negative x, σ(x) ≈ 0, so SiLU(x) ≈ 0. Near zero, the transition is smooth. One subtle property: SiLU is non-monotonic — it dips slightly below zero around x ≈ −1.28, which appears to help optimization.
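A quick numerical search confirms the dip:

```python
import numpy as np

x = np.linspace(-5.0, 0.0, 200001)
y = x / (1.0 + np.exp(-x))      # SiLU(x) = x * sigmoid(x)
i = y.argmin()
print(x[i], y[i])               # approx -1.278 and -0.278: a shallow dip below zero
```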

Let's build all three and compare their behavior:

import numpy as np
from scipy.special import erf  # for exact GELU

def relu(x):
    return np.maximum(0, x)

def gelu_exact(x):
    """GELU: x * Phi(x) where Phi is the standard normal CDF"""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_approx(x):
    """Fast tanh approximation used in BERT/GPT-2"""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    """SiLU/Swish: x * sigmoid(x)"""
    return x * (1.0 / (1.0 + np.exp(-x)))

# Compare at key points
test_points = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

print("     x   |  ReLU  |  GELU  |  SiLU")
print("---------+--------+--------+-------")
for xi in test_points:
    print(f"  {xi:5.1f}  | {relu(xi):6.3f} | {gelu_exact(xi):6.3f} | {silu(xi):6.3f}")

# Output:
#      x   |  ReLU  |  GELU  |  SiLU
# ---------+--------+--------+-------
#   -2.0  |  0.000 | -0.046 | -0.238
#   -1.0  |  0.000 | -0.159 | -0.269
#   -0.5  |  0.000 | -0.154 | -0.189
#    0.0  |  0.000 |  0.000 |  0.000
#    0.5  |  0.500 |  0.346 |  0.311
#    1.0  |  1.000 |  0.841 |  0.731
#    2.0  |  2.000 |  1.954 |  1.762

Notice the key difference: at x = −1, ReLU outputs exactly 0, but GELU outputs −0.159 and SiLU outputs −0.269. Those small non-zero outputs mean gradients can still flow backward through negative regions — the foundation for why smooth activations train better in deep networks.

Why Gradients Matter

The gradient of each activation tells us how error signals flow backward during training:

def relu_grad(x):
    """Gradient of ReLU: 1 if x > 0, else 0"""
    return 1.0 if x > 0 else 0.0

def gelu_grad(x):
    """Gradient of GELU: Phi(x) + x * phi(x)"""
    phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal PDF
    Phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))      # standard normal CDF
    return Phi + x * phi

def silu_grad(x):
    """Gradient of SiLU: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))"""
    sig = 1.0 / (1.0 + np.exp(-x))
    return sig + x * sig * (1.0 - sig)

print("     x   | ReLU' | GELU'  | SiLU'")
print("---------+-------+--------+------")
for xi in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"  {xi:5.1f}  | {relu_grad(xi):5.3f} | {gelu_grad(xi):6.3f} | {silu_grad(xi):6.3f}")

# Output:
#      x   | ReLU' | GELU'  | SiLU'
# ---------+-------+--------+------
#   -2.0  | 0.000 | -0.085 | -0.091
#   -1.0  | 0.000 | -0.083 |  0.072
#   -0.5  | 0.000 |  0.133 |  0.260
#    0.0  | 0.000 |  0.500 |  0.500
#    0.5  | 1.000 |  0.867 |  0.740
#    1.0  | 1.000 |  1.083 |  0.928
#    2.0  | 1.000 |  1.085 |  1.091

At x = −1, ReLU's gradient is exactly zero — the neuron is dead and can never recover. GELU's gradient is −0.083 and SiLU's is 0.072 — both non-zero. A non-zero gradient, regardless of sign, means the optimizer can still update the weights feeding into this neuron. Notice that GELU and SiLU gradients can even exceed 1.0 for positive inputs (1.083 and 1.091 at x = 2) — they're not bounded like ReLU. Multiply this across billions of parameters and dozens of layers, and the difference between "dead" and "alive" compounds enormously.
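One way to see the scale of the difference: feed a batch of standard-normal pre-activations through each gradient and count how many are exactly zero. (A toy measurement on random values, not a training run.)

```python
import numpy as np
from scipy.special import erf

np.random.seed(3)
x = np.random.randn(100_000)     # typical pre-activations

relu_zero = np.mean(x <= 0)      # ReLU gradient is exactly 0 for x <= 0
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
gelu_grad = 0.5 * (1 + erf(x / np.sqrt(2))) + x * phi
gelu_zero = np.mean(gelu_grad == 0.0)

print(relu_zero)   # ~0.5: half the units pass no gradient at all
print(gelu_zero)   # 0.0: every unit passes some gradient
```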

Gated Linear Units: The Multiplicative Trick

Every activation function we've seen so far applies element-wise to a single linear projection. In 2017, Dauphin et al. proposed a different idea: what if the non-linearity came from multiplying two parallel projections together?

GLU(x) = σ(x W_gate) ⊗ (x W_up)

One branch (the up projection) computes a "content" representation. The other branch (the gate projection) passes through a sigmoid that controls how much of each feature survives. The element-wise product of the two creates a richer non-linearity than any single activation function can provide.

In 2020, Noam Shazeer published a short but influential paper showing that you could swap the sigmoid for any activation function in the gating position. This spawned a whole family of variants:

- GLU: sigmoid gate (the original)
- ReGLU: ReLU gate
- GEGLU: GELU gate
- SwiGLU: SiLU/Swish gate

Shazeer tested all of them. GEGLU and SwiGLU came out on top, with SwiGLU becoming the de facto standard for modern LLMs. Let's build the whole family:

def glu_ffn(x, W_gate, W_up, W_down, activation='swiglu'):
    """Gated Linear Unit FFN with selectable activation."""
    gate_proj = x @ W_gate        # (seq, d_model) @ (d_model, d_ff) = (seq, d_ff)
    up_proj   = x @ W_up          # (seq, d_model) @ (d_model, d_ff) = (seq, d_ff)

    # Apply activation to the gate branch
    if activation == 'glu':
        gate = 1 / (1 + np.exp(-gate_proj))                # sigmoid (gate only)
    elif activation == 'reglu':
        gate = np.maximum(0, gate_proj)                     # ReLU
    elif activation == 'geglu':
        gate = gelu_exact(gate_proj)                        # GELU
    elif activation == 'swiglu':
        gate = silu(gate_proj)                              # SiLU/Swish
    else:
        raise ValueError(f"Unknown activation: {activation}")

    hidden = gate * up_proj       # element-wise gating: (seq, d_ff)
    output = hidden @ W_down      # (seq, d_ff) @ (d_ff, d_model) = (seq, d_model)
    return output

# Build it with small dimensions
d_model, d_ff = 8, 22  # ~8/3 * 8 = ~21.3, rounded to 22
W_gate = np.random.randn(d_model, d_ff) * 0.1
W_up   = np.random.randn(d_model, d_ff) * 0.1
W_down = np.random.randn(d_ff, d_model) * 0.1

x = np.random.randn(3, d_model)

for variant in ['glu', 'reglu', 'geglu', 'swiglu']:
    out = glu_ffn(x, W_gate, W_up, W_down, activation=variant)
    print(f"{variant:6s} | output[0,:4] = {out[0, :4].round(4)}")

# Each variant produces slightly different outputs from the
# same weights, because the gate activation shapes the signal differently

The key insight is that GLU variants provide two gradient pathways during backpropagation. Gradients flow through both the gate branch and the content branch, making training more stable than a single activation. The multiplicative interaction also lets the network learn more complex feature combinations than addition alone.
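We can make the "two pathways" claim concrete with finite differences on a single gated unit h = SiLU(a) · b, where a is the gate pre-activation and b is the content pre-activation (scalars here for clarity):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

h = lambda a, b: silu(a) * b    # one gated unit: gate times content

a, b, eps = 0.7, -1.3, 1e-6
dh_da = (h(a + eps, b) - h(a - eps, b)) / (2 * eps)  # through the gate branch
dh_db = (h(a, b + eps) - h(a, b - eps)) / (2 * eps)  # through the content branch
print(dh_da, dh_db)  # both non-zero: the error signal has two routes back
```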

SwiGLU: The Modern Standard

SwiGLU is the FFN variant used in LLaMA, LLaMA 2, LLaMA 3, Mistral, and most modern open-weight LLMs. Its formula is deceptively simple:

FFN_SwiGLU(x) = (SiLU(x W_gate) ⊗ x W_up) W_down

But there's a catch. The classic FFN has two weight matrices (W₁ and W₂). SwiGLU has three (W_gate, W_up, W_down). If we keep the hidden dimension at 4d, we'd have 50% more parameters per layer — an unfair comparison and a waste of compute.

The Parameter Budget

To keep the total parameter count the same, we need to shrink the hidden dimension. Here's the math:

# Classic FFN: 2 matrices
# W1: (d, 4d) + W2: (4d, d) = 2 * d * 4d = 8d^2 parameters

# SwiGLU: 3 matrices
# W_gate: (d, h) + W_up: (d, h) + W_down: (h, d) = 3 * d * h parameters

# Set equal for parameter parity:
# 3 * d * h = 8 * d^2
# h = 8d / 3 ≈ 2.667d

d_model = 4096  # LLaMA 7B model dimension

# Theoretical hidden dim
h_theory = 8 * d_model / 3
print(f"Theoretical: {h_theory:.1f}")  # 10922.7

# Real LLaMA 7B: round to nearest multiple of 256
h_llama7b = 11008  # closest multiple of 256 to 10922.7
print(f"LLaMA 7B:   {h_llama7b}")

# Parameter comparison
classic_params = 2 * d_model * (4 * d_model)
swiglu_params  = 3 * d_model * h_llama7b
print(f"\nClassic FFN params per layer: {classic_params:,}")     # 134,217,728
print(f"SwiGLU FFN params per layer:  {swiglu_params:,}")        # 135,266,304
print(f"Difference: {(swiglu_params/classic_params - 1)*100:.1f}%")  # ~0.8%

Real implementations round the hidden dimension to a multiple of 256 (or 128) for GPU memory alignment. LLaMA 7B uses 11008, which is almost exactly the theoretical 8d/3 = 10922.67.
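The rounding recipe fits in a small helper. This is a sketch following the round-up-to-a-multiple pattern used in LLaMA-style reference code (take 2/3 of 4 × d_model, then round up to the alignment multiple); the function name `swiglu_hidden_dim` is ours, not from any library:

```python
def swiglu_hidden_dim(d_model: int, multiple_of: int = 256) -> int:
    """Parameter-parity hidden dim, rounded up for GPU memory alignment."""
    hidden = int(2 * (4 * d_model) / 3)                   # 8/3 * d_model
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(swiglu_hidden_dim(4096))   # 11008 -> matches LLaMA 7B
```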

Interestingly, LLaMA 3 8B and Mistral 7B both use a hidden dimension of 14336 (ratio 3.5×) — deliberately larger than parameter parity would require. They chose to allocate more parameters to the FFN, betting that the extra FFN capacity is worth more than perfect balance with attention.

Now let's build the complete SwiGLU FFN exactly as LLaMA implements it:

class SwiGLUFFN:
    """SwiGLU FFN matching LLaMA's implementation."""

    def __init__(self, d_model, d_ff):
        # Three weight matrices, no bias (modern LLMs drop biases)
        scale = np.sqrt(2.0 / d_model)
        self.w_gate = np.random.randn(d_model, d_ff) * scale  # gate projection
        self.w_up   = np.random.randn(d_model, d_ff) * scale  # up projection
        self.w_down = np.random.randn(d_ff, d_model) * scale  # down projection

    def __call__(self, x):
        # This is the entire LLaMA FFN forward pass:
        # return self.w2(F.silu(self.w1(x)) * self.w3(x))
        gate = silu(x @ self.w_gate)     # (seq, d_ff) — gated activation
        up   = x @ self.w_up             # (seq, d_ff) — linear projection
        hidden = gate * up               # (seq, d_ff) — element-wise gating
        return hidden @ self.w_down      # (seq, d_model) — project back down

    def param_count(self):
        return sum(w.size for w in [self.w_gate, self.w_up, self.w_down])

# Build a small version
d_model, d_ff = 64, 171  # 8/3 * 64 ≈ 170.67, rounded up
ffn = SwiGLUFFN(d_model, d_ff)

x = np.random.randn(5, d_model)  # 5 tokens
out = ffn(x)

print(f"Input:      {x.shape}")           # (5, 64)
print(f"Output:     {out.shape}")          # (5, 64)
print(f"Parameters: {ffn.param_count():,}")  # 32,832

The one-liner that powers modern LLMs: return self.w2(F.silu(self.w1(x)) * self.w3(x)) — two parallel projections, one activation, one element-wise multiply, one output projection. That's the entire feed-forward computation in LLaMA, Mistral, and most modern open-weight models.

FFN as Key-Value Memory

In 2021, Geva, Schuster, Berant, and Levy published a fascinating reinterpretation of what FFN layers actually do. They showed that feed-forward layers operate as key-value memories, structurally identical to an unnormalized attention mechanism over a fixed memory bank.

Here's the insight. Take the classic FFN and rewrite it as a sum over individual neurons:

FFN(x) = Σᵢ ReLU(x · kᵢ) · vᵢ

Where kᵢ is the i-th column of W₁ (a "key") and vᵢ is the i-th row of W₂ (a "value"). Each neuron acts as a memory slot: the key pattern-matches against the input, the activation function decides whether this memory should fire, and the value determines what to output.

This is structurally identical to attention with a fixed memory bank — no query-key-value computation, just a direct lookup. Lower-layer keys detect shallow patterns (syntax, word co-occurrence). Upper-layer keys detect semantic patterns (topic, sentiment). And the values? They encode predictions about what should come next.

def ffn_as_memory(x, W1, W2, top_k=5):
    """Decompose FFN into individual neuron (memory slot) contributions."""
    d_ff = W1.shape[1]

    # Each neuron is a key-value pair
    # key_i = W1[:, i] (column of W1 = row of W1.T)
    # val_i = W2[i, :] (row of W2)

    # Compute match scores: how well does input match each key?
    scores = x @ W1  # (d_model,) @ (d_model, d_ff) = (d_ff,)
    activations = np.maximum(0, scores)  # ReLU gate

    # Find the top-k most activated memory slots
    top_indices = np.argsort(activations)[-top_k:][::-1]

    print("Top activated memory slots:")
    print(f"{'Slot':>6} | {'Score':>8} | {'Activated':>9}")
    print("-" * 35)
    for idx in top_indices:
        print(f"  {idx:4d} | {scores[idx]:8.3f} | {activations[idx]:9.3f}")

    # The full output is the sum of activated value vectors
    output = activations @ W2  # equivalent to sum_i(act_i * v_i)

    # Show that individual contributions add up to the full FFN output
    manual_sum = np.zeros_like(x)
    for i in range(d_ff):
        manual_sum += activations[i] * W2[i, :]

    print(f"\nDirect FFN output:       {output[:4].round(4)}")
    print(f"Sum of memory values:    {manual_sum[:4].round(4)}")
    print(f"Match: {np.allclose(output, manual_sum)}")

    return output, top_indices

# Demo with a small FFN
d_model, d_ff = 8, 32
W1 = np.random.randn(d_model, d_ff) * 0.5
W2 = np.random.randn(d_ff, d_model) * 0.5
x = np.random.randn(d_model)

output, top_slots = ffn_as_memory(x, W1, W2, top_k=5)

This "memory" interpretation isn't just a metaphor — it has practical consequences. It's the theoretical foundation for knowledge editing, where researchers modify specific facts stored in the model by surgically updating individual rows of FFN weight matrices. When a model "knows" that the Eiffel Tower is in Paris, that fact is stored as specific key-value associations in FFN layers.
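As a toy illustration of why the memory view enables editing (random weights here, no real "facts" involved): because the output is a sum of per-slot contributions, zeroing one value row removes exactly that slot's contribution and nothing else.

```python
import numpy as np

np.random.seed(7)
d_model, d_ff = 8, 32
W1 = np.random.randn(d_model, d_ff) * 0.5   # keys (columns of W1)
W2 = np.random.randn(d_ff, d_model) * 0.5   # values (rows of W2)
x = np.random.randn(d_model)

acts = np.maximum(0, x @ W1)
slot = int(acts.argmax())                   # the most strongly matched key

before = acts @ W2
W2_edited = W2.copy()
W2_edited[slot, :] = 0.0                    # "erase" one memory slot's value
after = acts @ W2_edited

# The output changes by exactly that slot's contribution
print(np.allclose(before - acts[slot] * W2[slot, :], after))  # True
```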

The Full Transformer Block

Now we can see where the FFN sits in the complete architecture. A modern transformer block (Pre-Norm style, as in LLaMA and Mistral) looks like this:

x = x + Attention(RMSNorm(x))
x = x + FFN(RMSNorm(x))

First, RMSNorm stabilizes the input, then attention mixes information across tokens. That result is added back via a residual connection. Then another RMSNorm, followed by our FFN processing each token independently. Another residual connection, and we're done.

Attention handles the "what's relevant in context?" question. The FFN handles "now that I've seen the relevant context, what should I predict?" Let's build the complete block:

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: normalize by root-mean-square, then scale."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def simplified_attention(x, Wq, Wk, Wv, Wo):
    """Single-head attention (simplified for clarity)."""
    Q = x @ Wq  # (seq, d)
    K = x @ Wk  # (seq, d)
    V = x @ Wv  # (seq, d)
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)    # (seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax
    return (weights @ V) @ Wo             # (seq, d)

class TransformerBlock:
    """One complete transformer block: attention + FFN."""

    def __init__(self, d_model, d_ff):
        s = np.sqrt(2.0 / d_model)
        # Attention weights (single-head for simplicity)
        self.Wq = np.random.randn(d_model, d_model) * s
        self.Wk = np.random.randn(d_model, d_model) * s
        self.Wv = np.random.randn(d_model, d_model) * s
        self.Wo = np.random.randn(d_model, d_model) * s
        # FFN weights (SwiGLU)
        self.ffn = SwiGLUFFN(d_model, d_ff)
        # Norm weights (all ones = identity scaling)
        self.norm1 = np.ones(d_model)
        self.norm2 = np.ones(d_model)

    def __call__(self, x):
        # Pre-Norm: normalize BEFORE each sublayer
        # Sublayer 1: attention (mixes tokens)
        x = x + simplified_attention(
            rms_norm(x, self.norm1),
            self.Wq, self.Wk, self.Wv, self.Wo
        )
        # Sublayer 2: FFN (processes each token independently)
        x = x + self.ffn(rms_norm(x, self.norm2))
        return x

    def param_count(self):
        attn_params = 4 * self.Wq.size           # Q, K, V, O projections
        ffn_params  = self.ffn.param_count()       # gate, up, down projections
        norm_params = self.norm1.size + self.norm2.size
        return attn_params, ffn_params, norm_params

# Build a block
d_model, d_ff = 64, 171
block = TransformerBlock(d_model, d_ff)

x = np.random.randn(10, d_model)  # 10 tokens
out = block(x)

attn_p, ffn_p, norm_p = block.param_count()
total = attn_p + ffn_p + norm_p
print(f"Attention params: {attn_p:>8,}  ({attn_p/total*100:.1f}%)")
print(f"FFN params:       {ffn_p:>8,}  ({ffn_p/total*100:.1f}%)")
print(f"Norm params:      {norm_p:>8,}  ({norm_p/total*100:.1f}%)")
print(f"Total:            {total:>8,}")
# Attention params:   16,384  (33.2%)
# FFN params:         32,832  (66.5%)  <-- two-thirds!
# Norm params:           128  ( 0.3%)

There it is: about two-thirds of the transformer block is FFN. The "other half" of every transformer block is actually the two-thirds majority.

Try It: Activation Function Explorer

The original post includes an interactive explorer here: it plots the activation functions and their gradients side by side, lets you toggle each one on and off, read exact values at any input, and adjust the Swish β parameter to watch the curve morph from linear → SiLU → ReLU as β grows.

The Practical Guide

Here's how FFN design has evolved from the original transformer to today's LLMs:

Aspect          | Original (2017)      | Modern LLMs
----------------|----------------------|------------------------------------------
FFN type        | 2-layer MLP          | 3-matrix GLU variant
Activation      | ReLU                 | SwiGLU (LLaMA/Mistral) or GeGLU (Gemma)
Hidden dim      | 4 × d_model          | ⅔ × 4d ≈ 2.67d (or 3.5d)
Normalization   | Post-Norm LayerNorm  | Pre-Norm RMSNorm
Bias terms      | Yes                  | No
Weight matrices | 2 (W₁, W₂)           | 3 (W_gate, W_up, W_down)
FFN param share | ~66.7%               | ~66.7% or more

When choosing an activation function for your own models:

- Reproducing an existing architecture? Match its original choice: ReLU for the 2017 transformer, GELU for BERT/GPT-2-class models.
- Training a new decoder-only LLM? SwiGLU is the modern default; shrink the hidden dimension toward 8d/3 if you want parameter parity with a classic 4d FFN.
- Severely compute-constrained? Plain ReLU is still the cheapest option, at the risk of dead neurons.

The FFN connects to many topics we've covered. Normalization is applied before the FFN in Pre-Norm architectures. LoRA adapters often target FFN layers (the gate and up projections are popular choices). Quantization has its biggest impact on FFN weights, since they dominate parameter count. And the optimizer must handle the interaction between smooth activation functions and gradient flow.

The Silent Powerhouse

We started with a simple two-layer network from 2017 and ended with the exact architecture running inside today's most powerful language models. Along the way, we saw activation functions evolve from hard binary gates (ReLU) to smooth probabilistic curves (GELU) to self-gated functions (SiLU), and finally to the multiplicative gating of SwiGLU.

But the most important insight might be the simplest: the feed-forward network is where transformers store and retrieve knowledge. Attention decides what context matters. The FFN decides what to do with it. Two-thirds of the model, working silently between attention layers, turning context into predictions.

We've now covered every major component inside the transformer block: attention, normalization, positional encoding, and now feed-forward networks. The full pipeline continues: tokenize → embed → position → normalize → attend → FFN → softmax → loss → optimize → decode → fine-tune → quantize.

References & Further Reading