
The Complete Transformer from Scratch: Assembling Every Piece We've Built

The Assembly Line

Over fourteen posts, we've built every individual piece of the transformer architecture from scratch. We started with automatic differentiation, then learned how text becomes tokens, how tokens become vectors, and how position gets encoded into those vectors. We built the attention mechanism that lets tokens communicate, the normalization layers that keep activations stable, and the feed-forward networks that transform each token independently. We learned how softmax produces probabilities, how loss functions measure mistakes, how optimizers fix them, and how decoding strategies choose words. We even covered what happens after training: LoRA for fine-tuning, quantization for compression, and the KV cache for fast inference.

Individually, each component makes sense. But nobody ever shows you how they all snap together.

Today we build the whole machine.

We'll assemble a minimal but fully functional GPT-style language model in pure Python and NumPy. It'll be small enough to train on a CPU — about 200K parameters — but architecturally identical to the models powering ChatGPT and LLaMA. Same embedding layer, same transformer blocks, same output head. The only difference is scale.

By the end, you'll type a prompt and watch your transformer generate text, one token at a time. And at every stage, you'll know exactly what's happening — because you built every piece yourself.

The Architecture at 30,000 Feet

A decoder-only transformer has three stages. The input stage converts text into vectors. A stack of identical transformer blocks processes those vectors, each block applying attention then a feed-forward network. The output stage converts the final vectors back into a probability distribution over the vocabulary.

Input Text
     │
     ▼
┌─────────────────────────┐
│ Token Embedding         │   vocab_size × d_model
│ + Position Embedding    │   max_seq_len × d_model
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ Transformer Block × N   │ ◄── repeated N times
│  ┌───────────────────┐  │
│  │ RMSNorm           │  │
│  │ Causal Attention  │  │
│  │ + Residual        │  │
│  ├───────────────────┤  │
│  │ RMSNorm           │  │
│  │ SwiGLU FFN        │  │
│  │ + Residual        │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ Final RMSNorm           │
│ Output Head (logits)    │   d_model × vocab_size
└────────────┬────────────┘
             │
             ▼
softmax → probabilities → sample → next token

That's it. The entire architecture fits in a single diagram. The trick that makes transformers powerful isn't complexity — it's repetition. The same block structure, applied dozens or hundreds of times, with each layer learning to extract different features from the residual stream.

Think of the residual stream as a highway. Data flows through it unchanged unless a sub-layer (attention or FFN) reads from it, transforms the signal, and adds the result back. This is why it's called a "residual" connection — each layer computes a residual update to the main stream, not a complete replacement.

Let's define our model's configuration. We'll keep everything small enough for CPU training, but the structure is identical to GPT/LLaMA.

import numpy as np

class TransformerConfig:
    vocab_size: int = 256       # byte-level: every possible byte is a token
    d_model: int = 64           # embedding dimension
    n_heads: int = 4            # attention heads
    n_layers: int = 4           # transformer blocks
    d_ff: int = 172             # FFN hidden dim (≈ 8/3 × d_model for SwiGLU)
    max_seq_len: int = 128      # context window
    head_dim: int = 16          # d_model // n_heads

config = TransformerConfig()

Byte-level tokenization means vocab_size=256 — each byte in the input is its own token. This avoids needing a BPE tokenizer (which we built from scratch previously), keeping our model self-contained.
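To make the byte-level scheme concrete, here's a quick round trip in plain Python (no model required):

```python
# Byte-level "tokenization": every UTF-8 byte is already a token id in [0, 255].
text = "The cat"
token_ids = list(text.encode("utf-8"))
print(token_ids)                          # [84, 104, 101, 32, 99, 97, 116]

decoded = bytes(token_ids).decode("utf-8")
print(decoded)                            # The cat
```

No vocabulary file, no merge rules: encoding and decoding are just `encode`/`decode` calls.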

The Input Stage — Tokens to Vectors

The input stage does two things: look up a dense vector for each token, and add positional information so the model knows word order. Both live in the same d_model-dimensional vector space, so adding them together is meaningful — the model learns to place token identity and position into different subspaces of the same vector, as we explored in the embeddings and positional encoding posts.

class Embedding:
    def __init__(self, config):
        scale = 0.02
        self.token_embed = np.random.randn(config.vocab_size, config.d_model) * scale
        self.pos_embed = np.random.randn(config.max_seq_len, config.d_model) * scale

    def forward(self, token_ids):
        """token_ids: (seq_len,) array of ints"""
        seq_len = len(token_ids)
        tok = self.token_embed[token_ids]       # (seq_len, d_model)
        pos = self.pos_embed[:seq_len]           # (seq_len, d_model)
        return tok + pos                         # (seq_len, d_model)

The token embedding is a lookup table: each of our 256 possible bytes gets a 64-dimensional vector. The positional embedding is another lookup table indexed by position (0, 1, 2, ...). We use learned positional embeddings here (the GPT approach) rather than RoPE to keep the code simple. Modern models like LLaMA use RoPE for better length generalization, but the learned approach works fine for our 128-token context window.

The 0.02 initialization scale isn't arbitrary — it prevents the initial embeddings from being too large, which would cause the softmax in attention to saturate before training even begins.
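You can see the saturation effect directly with toy logits (an illustrative example, not values from the model):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0, 4.0])

print(softmax(logits * 0.02))   # near-uniform: every token still receives gradient
print(softmax(logits * 10.0))   # nearly one-hot: saturated, gradients vanish
```

At small scale the distribution stays soft and trainable; blow the scale up and softmax commits almost all its mass to one entry before the model has learned anything.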

The Transformer Block — Where the Magic Happens

Every transformer block follows the same Pre-Norm pattern:

x = x + Attention(RMSNorm(x))
x = x + FFN(RMSNorm(x))

The Pre-Norm ordering (normalize before the sub-layer, not after) is what modern models use. It keeps gradients flowing cleanly through the residual connections, making deep networks trainable. Let's build each piece.

RMSNorm

RMSNorm normalizes by the root-mean-square of activations — simpler and faster than LayerNorm, with no mean subtraction needed.

class RMSNorm:
    def __init__(self, dim, eps=1e-6):
        self.weight = np.ones(dim)   # learnable scale (gamma)
        self.eps = eps

    def forward(self, x):
        """x: (seq_len, dim)"""
        rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + self.eps)
        return (x / rms) * self.weight
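As a sanity check (a standalone recomputation of the same formula, with the learnable scale at its initial value of all ones), every output row should have unit root-mean-square:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64)) * 3.0          # activations at an arbitrary scale

# Same computation as RMSNorm.forward with weight = ones
rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + 1e-6)
y = x / rms

row_rms = np.sqrt(np.mean(y ** 2, axis=-1))
print(np.allclose(row_rms, 1.0, atol=1e-3))      # True
```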

Causal Self-Attention

This is the heart of the transformer — the attention mechanism that lets every token look at all previous tokens (but not future ones, thanks to the causal mask). We project the input into queries, keys, and values, split into multiple heads, compute scaled dot-product attention with masking, then project back.

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

class CausalSelfAttention:
    def __init__(self, config):
        d = config.d_model
        scale = 0.02
        # Q, K, V projections
        self.W_q = np.random.randn(d, d) * scale
        self.W_k = np.random.randn(d, d) * scale
        self.W_v = np.random.randn(d, d) * scale
        # Output projection
        self.W_o = np.random.randn(d, d) * scale

        self.n_heads = config.n_heads
        self.head_dim = config.head_dim

    def forward(self, x):
        """x: (seq_len, d_model) -> (seq_len, d_model)"""
        seq_len, d_model = x.shape
        n_h = self.n_heads
        h_d = self.head_dim

        # Project to Q, K, V
        Q = x @ self.W_q    # (seq_len, d_model)
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape to (n_heads, seq_len, head_dim)
        Q = Q.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)
        K = K.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)
        V = V.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)

        # Scaled dot-product attention
        scale = np.sqrt(h_d)
        scores = (Q @ K.transpose(0, 2, 1)) / scale  # (n_heads, seq_len, seq_len)

        # Causal mask: prevent attending to future positions
        mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
        scores = scores + mask

        attn_weights = softmax(scores, axis=-1)   # (n_heads, seq_len, seq_len)
        attn_out = attn_weights @ V               # (n_heads, seq_len, head_dim)

        # Reshape back: (seq_len, d_model)
        attn_out = attn_out.transpose(1, 0, 2).reshape(seq_len, d_model)

        # Output projection
        return attn_out @ self.W_o

A few things worth noting. The causal mask is an upper-triangular matrix filled with -1e9 — adding it to attention scores before softmax pushes future positions to zero probability. The 1/sqrt(head_dim) scaling prevents dot products from growing too large as dimensionality increases, which would cause softmax to saturate.

Each attention head operates on a 16-dimensional slice of the 64-dimensional embedding. Four heads attend in parallel, each potentially learning different patterns — one might track syntactic relationships, another positional proximity, another semantic similarity.
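The effect of the causal mask is easy to verify in isolation. Here's a toy single-head example using the same masking trick as the class above:

```python
import numpy as np

def softmax(x, axis=-1):
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))   # stand-in attention scores

# Same mask as in CausalSelfAttention: -1e9 above the diagonal
mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
weights = softmax(scores + mask, axis=-1)

print(np.allclose(np.triu(weights, k=1), 0.0))   # True: no weight on future tokens
print(np.allclose(weights.sum(axis=-1), 1.0))    # True: each row still sums to 1
```

The -1e9 entries underflow to zero after the exponential, so each position's probability mass is redistributed entirely over itself and earlier positions.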

SwiGLU Feed-Forward Network

The feed-forward network processes each token independently. We use SwiGLU, the gated variant found in LLaMA and Mistral, which multiplies a gated pathway with an ungated one.

def silu(x):
    """SiLU/Swish activation: x * sigmoid(x)"""
    return x * (1.0 / (1.0 + np.exp(-x)))

class SwiGLU_FFN:
    def __init__(self, config):
        d = config.d_model
        d_ff = config.d_ff
        scale = 0.02
        # Three weight matrices for SwiGLU
        self.W_gate = np.random.randn(d, d_ff) * scale
        self.W_up   = np.random.randn(d, d_ff) * scale
        self.W_down = np.random.randn(d_ff, d) * scale

    def forward(self, x):
        """x: (seq_len, d_model) -> (seq_len, d_model)"""
        gate = silu(x @ self.W_gate)     # (seq_len, d_ff)
        up   = x @ self.W_up             # (seq_len, d_ff)
        return (gate * up) @ self.W_down  # (seq_len, d_model)

SwiGLU uses three weight matrices instead of two, so we set d_ff = 172 (roughly 8/3 × d_model) to keep total parameter count comparable to a standard FFN with 4 × d_model. The gating mechanism — element-wise multiplication of the SiLU-activated gate with the ungated up projection — lets the network learn which features to amplify and which to suppress.
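A quick arithmetic check of that parameter-count claim, using the numbers from our config:

```python
d_model = 64
d_ff_swiglu = 172        # our config: roughly (8/3) * d_model
d_ff_standard = 256      # 4 * d_model, the classic two-matrix FFN width

swiglu_params = 2 * d_model * d_ff_swiglu + d_ff_swiglu * d_model   # gate + up + down
standard_params = 2 * d_model * d_ff_standard                       # up + down

print(swiglu_params)     # 33024
print(standard_params)   # 32768
```

Three skinnier matrices land within one percent of two fat ones, so the comparison to a standard FFN stays fair.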

Assembling the Block

Now we combine everything with residual connections:

class TransformerBlock:
    def __init__(self, config):
        self.norm1 = RMSNorm(config.d_model)
        self.attn  = CausalSelfAttention(config)
        self.norm2 = RMSNorm(config.d_model)
        self.ffn   = SwiGLU_FFN(config)

    def forward(self, x):
        """x: (seq_len, d_model) -> (seq_len, d_model)"""
        # Attention sub-layer with residual
        x = x + self.attn.forward(self.norm1.forward(x))
        # FFN sub-layer with residual
        x = x + self.ffn.forward(self.norm2.forward(x))
        return x

That's the entire transformer block in four lines of computation. Normalize, attend, add back. Normalize, feed-forward, add back. The residual connections (x = x + ...) are what make deep transformers trainable — gradients can flow directly through the addition, bypassing any poorly-behaved sub-layers.

Stacking Blocks — The Full Model

The complete transformer is an embedding layer, a stack of identical blocks, a final normalization, and an output head that maps back to vocabulary-sized logits.

class Transformer:
    def __init__(self, config):
        self.config = config
        self.embedding = Embedding(config)
        self.blocks = [TransformerBlock(config) for _ in range(config.n_layers)]
        self.final_norm = RMSNorm(config.d_model)
        # Weight tying: output head shares the token embedding matrix

    def forward(self, token_ids):
        """token_ids: (seq_len,) -> logits: (seq_len, vocab_size)"""
        x = self.embedding.forward(token_ids)

        for block in self.blocks:
            x = block.forward(x)

        x = self.final_norm.forward(x)

        # Project to vocabulary: reuse token embeddings (weight tying)
        logits = x @ self.embedding.token_embed.T   # (seq_len, vocab_size)
        return logits

Weight tying is an important trick: the output head reuses the token embedding matrix (transposed) instead of having its own separate weight matrix. This saves vocab_size × d_model parameters and actually improves performance — the model learns that the embedding space and the output prediction space should be aligned.

Let's count every parameter in our model:

Component                                    Shape                        Parameters
Token embedding                              256 × 64                     16,384
Position embedding                           128 × 64                     8,192
Per-block: Attention (W_q, W_k, W_v, W_o)    4 × (64 × 64)                16,384
Per-block: FFN (W_gate, W_up, W_down)        2 × (64 × 172) + 172 × 64    33,024
Per-block: 2 × RMSNorm                       2 × 64                       128
Final RMSNorm                                64                           64
Output head                                  tied with token embedding    0
Total                                        4 × 49,536 + 24,640          222,784

About 222K parameters. For reference, GPT-2 Small has 124 million (~557× larger), and LLaMA 7B has 6.7 billion (~30,000× larger). But the architecture is identical — the only differences are wider dimensions, more layers, and vastly more training data.
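You can reproduce the table's total with a few lines of arithmetic (numbers taken from our config):

```python
vocab_size, d_model, n_layers, d_ff, max_seq_len = 256, 64, 4, 172, 128

embeddings = vocab_size * d_model + max_seq_len * d_model   # token + position
attn_per_block = 4 * d_model * d_model                      # W_q, W_k, W_v, W_o
ffn_per_block = 2 * d_model * d_ff + d_ff * d_model         # W_gate, W_up, W_down
norms_per_block = 2 * d_model                               # two RMSNorm scales
per_block = attn_per_block + ffn_per_block + norms_per_block

total = embeddings + n_layers * per_block + d_model         # + final norm; head is tied
print(per_block)   # 49536
print(total)       # 222784
```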

Training — Teaching the Model to Predict

The training objective is next-token prediction: at every position in the sequence, the model tries to predict what comes next. We measure the mistake using cross-entropy loss, which punishes confident wrong predictions severely (thanks to the logarithm) and rewards confident correct predictions.

def cross_entropy_loss(logits, targets):
    """
    logits:  (seq_len, vocab_size) - raw model output
    targets: (seq_len,) - true next tokens
    """
    # Stable softmax
    probs = softmax(logits, axis=-1)             # (seq_len, vocab_size)
    seq_len = len(targets)
    # Gather the probability assigned to each true target
    correct_probs = probs[np.arange(seq_len), targets]
    # Cross-entropy: -log(probability of correct token)
    loss = -np.mean(np.log(correct_probs + 1e-9))
    return loss, probs

For training, the input is a sequence of tokens [t0, t1, ..., tN] and the targets are the same sequence shifted by one: [t1, t2, ..., tN+1]. The model sees "The cat sat" and must predict "cat sat on". Every position provides a training signal simultaneously — this is what makes transformer training so data-efficient.
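Here's the input/target shift on an actual byte sequence:

```python
# Next-token prediction: targets are the inputs shifted left by one position.
tokens = list("The cat".encode("utf-8"))
inputs, targets = tokens[:-1], tokens[1:]

for i, t in zip(inputs, targets):
    print(f"{chr(i)!r} -> {chr(t)!r}")
# 'T' -> 'h', 'h' -> 'e', 'e' -> ' ', ' ' -> 'c', 'c' -> 'a', 'a' -> 't'
```

One sequence of length N yields N-1 prediction problems at once, all computed in a single forward pass.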

We'll use numerical gradients (finite differences) for simplicity. A real implementation would use autograd (which we built from scratch in the first post), but for a tiny model on short sequences, numerical gradients are just barely workable, and they let us focus on the architecture rather than backpropagation plumbing.

def get_all_params(model):
    """Collect every parameter array in the model into a flat list."""
    params = [model.embedding.token_embed, model.embedding.pos_embed]
    for block in model.blocks:
        params.extend([
            block.norm1.weight,
            block.attn.W_q, block.attn.W_k, block.attn.W_v, block.attn.W_o,
            block.norm2.weight,
            block.ffn.W_gate, block.ffn.W_up, block.ffn.W_down,
        ])
    params.append(model.final_norm.weight)
    return params

def compute_loss(model, token_ids):
    """Forward pass + cross-entropy loss for next-token prediction."""
    inputs  = token_ids[:-1]
    targets = token_ids[1:]
    logits  = model.forward(inputs)
    loss, _ = cross_entropy_loss(logits, targets)
    return loss

def numerical_gradients(model, token_ids, eps=1e-5):
    """Compute gradients via finite differences."""
    params = get_all_params(model)
    grads = []
    base_loss = compute_loss(model, token_ids)

    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + eps
            loss_plus = compute_loss(model, token_ids)
            param[idx] = old_val
            grad[idx] = (loss_plus - base_loss) / eps
            it.iternext()
        grads.append(grad)
    return grads, base_loss

Numerical gradients are extremely slow — computing one gradient requires a full forward pass per parameter. For our 222K parameter model, that's 222K forward passes per training step. This is fine for demonstration on tiny batches. In practice, you'd use automatic differentiation, which computes all gradients in a single backward pass.

For the optimizer, we'll use Adam — the workhorse that combines momentum with adaptive learning rates:

class Adam:
    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0
        self.m = [np.zeros_like(p) for p in params]  # first moment
        self.v = [np.zeros_like(p) for p in params]  # second moment

    def step(self, grads):
        self.t += 1
        for i, (param, grad) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad ** 2
            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

Here's the training loop. Because numerical gradients are so slow, we train on very short sequences from a tiny corpus:

# Training data: encode text as bytes
text = "The cat sat on the mat. The dog sat on the log. "
text = text * 20  # repeat for more data
data = np.array([b for b in text.encode('utf-8')], dtype=np.int32)

model = Transformer(config)
optimizer = Adam(get_all_params(model), lr=3e-4)

# Random baseline: -ln(1/256) ≈ 5.55
print(f"Random baseline loss: {np.log(config.vocab_size):.2f}")

for step in range(50):
    # Random slice of training data
    start = np.random.randint(0, len(data) - config.max_seq_len - 1)
    chunk = data[start : start + config.max_seq_len + 1]

    grads, loss = numerical_gradients(model, chunk[:33])  # short seqs for speed
    optimizer.step(grads)

    if step % 10 == 0:
        print(f"Step {step:3d} | Loss: {loss:.3f}")

You should see the loss drop from around 5.5 (random guessing across 256 tokens) to something significantly lower as the model learns character patterns. Even with such a tiny dataset, the model picks up on common sequences like "the", "sat on", and the space patterns between words.

Generation — Making the Model Speak

Once trained, generation is autoregressive: feed in a prompt, predict the next token, append it, and repeat. We apply temperature to control randomness and top-p (nucleus) sampling to cut off the long tail of unlikely tokens.

def generate(model, prompt_bytes, max_new_tokens=100, temperature=0.8, top_p=0.9):
    """Generate text autoregressively from a byte-level prompt."""
    token_ids = list(prompt_bytes)

    for _ in range(max_new_tokens):
        # Truncate to context window
        context = np.array(token_ids[-model.config.max_seq_len:], dtype=np.int32)
        logits = model.forward(context)
        next_logits = logits[-1] / temperature    # last position only

        # Softmax to get probabilities
        probs = softmax(next_logits)

        # Top-p sampling: keep smallest set of tokens with cumulative prob >= top_p
        sorted_idx = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_idx]
        cumsum = np.cumsum(sorted_probs)
        cutoff = np.searchsorted(cumsum, top_p) + 1
        top_idx = sorted_idx[:cutoff]
        top_probs = probs[top_idx]
        top_probs = top_probs / top_probs.sum()   # renormalize

        # Sample from filtered distribution
        next_token = np.random.choice(top_idx, p=top_probs)
        token_ids.append(int(next_token))

    return bytes(token_ids).decode('utf-8', errors='replace')

# Generate!
prompt = "The cat"
prompt_bytes = list(prompt.encode('utf-8'))
print(generate(model, prompt_bytes, max_new_tokens=60))

On our tiny training corpus, the model will produce repetitive but recognizable patterns: "The cat sat on the mat. The dog sat on the log." — it's learned the rhythms of its training data. With more data, more parameters, and more training, this exact architecture scales to produce the fluent text you see from frontier language models.

Note that we skip the KV cache here — we recompute the full attention at each generation step. For our tiny model this is fine, but production models cache the key and value projections from previous tokens to avoid redundant computation, cutting generation from O(n²) to O(n) per step.
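For the curious, here's a minimal single-head sketch of the caching idea. This is our own illustrative code with made-up weights, not part of the model above: each step projects only the new token and appends its key and value to the cache, instead of reprocessing the whole sequence.

```python
import numpy as np

d = 8                                        # tiny head dimension for illustration
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d, d)) * 0.02
W_k = rng.standard_normal((d, d)) * 0.02
W_v = rng.standard_normal((d, d)) * 0.02

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: (d,) embedding of the newest token -> (d,) attention output.
    Only the new token is projected; past keys/values come from the cache."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)                    # (t, d): grows by one row per step
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)              # (t,): just the last query's row
    w = np.exp(scores - scores.max())        # stable softmax over cached positions
    w /= w.sum()
    return w @ V                             # (d,)

for x in rng.standard_normal((5, d)):        # five generation steps
    out = decode_step(x)

print(len(k_cache), out.shape)               # 5 (8,)
```

Note that no causal mask is needed here: the only query is the newest token, which can legitimately attend to everything in the cache.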

Interactive: Transformer X-Ray


Watch data flow through each stage of a transformer. Select a preset prompt or type your own, then step through the pipeline: tokenization → embedding → attention → FFN → output probabilities. The visualization shows what happens inside a single transformer block.


The Complete Picture — Where Every Post Fits

Here's the full pipeline, one last time, showing where every post fits:

Micrograd is the foundation beneath it all — automatic differentiation is what makes training possible, turning "how wrong are we?" into "how should we adjust each weight?"

The scaling insight is worth repeating: our 222K parameter model is architecturally identical to frontier models. GPT-2 uses the same transformer blocks with d_model=768 and 12 layers. LLaMA 7B uses d_model=4096 and 32 layers. The recent models rumored to exceed a trillion parameters use the same attention, the same FFN, the same normalization. They just stack more layers, widen the dimensions, and pour in more data.

What we didn't cover: multi-GPU training and data parallelism, mixed-precision training (fp16/bf16), Flash Attention (fusing the attention computation into a single GPU kernel for memory efficiency), speculative decoding (using a small draft model to speed up generation from a large model), and the massive infrastructure needed to train at scale. These are engineering challenges, not architectural ones — the core design is exactly what we built here.

You Built a Transformer

If you've followed all fifteen posts in this series, you've built every component of the most important architecture in modern AI — from the ground up, in pure Python and NumPy. You know what every parameter does, where every gradient flows, and why every design choice was made.

The transformer's power isn't in any single brilliant component. It's in the combination: residual connections that let gradients flow, attention that lets tokens communicate, normalization that keeps signals stable, feed-forward networks that transform each token, and the simple but profound idea that stacking these blocks and training with next-token prediction produces something that looks like understanding.

The architecture is simple. The breakthrough was realizing that attention + FFN + residuals + scale = intelligence. Or at least, something close enough that we're still figuring out the difference.

References & Further Reading