The Complete Transformer from Scratch: Assembling Every Piece We've Built
The Assembly Line
Over fourteen posts, we've built every individual piece of the transformer architecture from scratch. We started with automatic differentiation, then learned how text becomes tokens, how tokens become vectors, and how position gets encoded into those vectors. We built the attention mechanism that lets tokens communicate, the normalization layers that keep activations stable, and the feed-forward networks that transform each token independently. We learned how softmax produces probabilities, how loss functions measure mistakes, how optimizers fix them, and how decoding strategies choose words. We even covered what happens after training: LoRA for fine-tuning, quantization for compression, and the KV cache for fast inference.
Individually, each component makes sense. But nobody ever shows you how they all snap together.
Today we build the whole machine.
We'll assemble a minimal but fully functional GPT-style language model in pure Python and NumPy. It'll be small enough to train on a CPU — about 220K parameters — but architecturally identical to the models powering ChatGPT and LLaMA. Same embedding layer, same transformer blocks, same output head. The only difference is scale.
By the end, you'll type a prompt and watch your transformer generate text, one token at a time. And at every stage, you'll know exactly what's happening — because you built every piece yourself.
The Architecture at 30,000 Feet
A decoder-only transformer has three stages. The input stage converts text into vectors. A stack of identical transformer blocks processes those vectors, each block applying attention then a feed-forward network. The output stage converts the final vectors back into a probability distribution over the vocabulary.
That's it. The entire architecture fits in a single diagram. The trick that makes transformers powerful isn't complexity — it's repetition. The same block structure, applied dozens or hundreds of times, with each layer learning to extract different features from the residual stream.
Think of the residual stream as a highway. Data flows through it unchanged unless a sub-layer (attention or FFN) reads from it, transforms the signal, and adds the result back. This is why it's called a "residual" connection — each layer computes a residual update to the main stream, not a complete replacement.
Let's define our model's configuration. We'll keep everything small enough for CPU training, but the structure is identical to GPT/LLaMA.
import numpy as np
class TransformerConfig:
vocab_size: int = 256 # byte-level: every possible byte is a token
d_model: int = 64 # embedding dimension
n_heads: int = 4 # attention heads
n_layers: int = 4 # transformer blocks
d_ff: int = 172 # FFN hidden dim (≈ 8/3 × d_model for SwiGLU)
max_seq_len: int = 128 # context window
head_dim: int = 16 # d_model // n_heads
config = TransformerConfig()
Byte-level tokenization means vocab_size=256 — each byte in the input is its own token. This avoids needing a BPE tokenizer (which we built from scratch previously), keeping our model self-contained.
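To make that concrete, byte-level tokenization is just UTF-8 encoding: encoding a string gives the token IDs, and decoding the bytes reverses it.

text = "The cat"
token_ids = list(text.encode('utf-8'))
print(token_ids)                         # [84, 104, 101, 32, 99, 97, 116]
print(bytes(token_ids).decode('utf-8'))  # "The cat" — decoding is the exact inverse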
The Input Stage — Tokens to Vectors
The input stage does two things: look up a dense vector for each token, and add positional information so the model knows word order. Both live in the same d_model-dimensional vector space, so adding them together is meaningful — the model learns to place token identity and position into different subspaces of the same vector, as we explored in the embeddings and positional encoding posts.
class Embedding:
def __init__(self, config):
scale = 0.02
self.token_embed = np.random.randn(config.vocab_size, config.d_model) * scale
self.pos_embed = np.random.randn(config.max_seq_len, config.d_model) * scale
def forward(self, token_ids):
"""token_ids: (seq_len,) array of ints"""
seq_len = len(token_ids)
tok = self.token_embed[token_ids] # (seq_len, d_model)
pos = self.pos_embed[:seq_len] # (seq_len, d_model)
return tok + pos # (seq_len, d_model)
The token embedding is a lookup table: each of our 256 possible bytes gets a 64-dimensional vector. The positional embedding is another lookup table indexed by position (0, 1, 2, ...). We use learned positional embeddings here (the GPT approach) rather than RoPE to keep the code simple. Modern models like LLaMA use RoPE for better length generalization, but the learned approach works fine for our 128-token context window.
The 0.02 initialization scale isn't arbitrary — it prevents the initial embeddings from being too large, which would cause the softmax in attention to saturate before training even begins.
The Transformer Block — Where the Magic Happens
Every transformer block follows the same Pre-Norm pattern:
x = x + Attention(RMSNorm(x))
x = x + FFN(RMSNorm(x))
The Pre-Norm ordering (normalize before the sub-layer, not after) is what modern models use; the original 2017 transformer used Post-Norm, normalizing after the residual addition instead, which makes very deep stacks harder to train. Pre-Norm keeps gradients flowing cleanly through the residual connections, making deep networks trainable. Let's build each piece.
RMSNorm
RMSNorm normalizes by the root-mean-square of activations — simpler and faster than LayerNorm, with no mean subtraction needed.
class RMSNorm:
def __init__(self, dim, eps=1e-6):
self.weight = np.ones(dim) # learnable scale (gamma)
self.eps = eps
def forward(self, x):
"""x: (seq_len, dim)"""
rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + self.eps)
return (x / rms) * self.weight
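A quick sanity check, using the config defined above: whatever the scale of the input, each row comes out with roughly unit RMS (the learnable weight starts at all ones, so it doesn't change anything yet).

norm = RMSNorm(config.d_model)
x = np.random.randn(10, config.d_model) * 5.0   # deliberately large activations
y = norm.forward(x)
print(np.sqrt(np.mean(y ** 2, axis=-1))[:3])    # ≈ [1. 1. 1.]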
Causal Self-Attention
This is the heart of the transformer — the attention mechanism that lets every token look at all previous tokens (but not future ones, thanks to the causal mask). We project the input into queries, keys, and values, split into multiple heads, compute scaled dot-product attention with masking, then project back.
def softmax(x, axis=-1):
"""Numerically stable softmax."""
x_max = np.max(x, axis=axis, keepdims=True)
e_x = np.exp(x - x_max)
return e_x / np.sum(e_x, axis=axis, keepdims=True)
class CausalSelfAttention:
def __init__(self, config):
d = config.d_model
scale = 0.02
# Q, K, V projections
self.W_q = np.random.randn(d, d) * scale
self.W_k = np.random.randn(d, d) * scale
self.W_v = np.random.randn(d, d) * scale
# Output projection
self.W_o = np.random.randn(d, d) * scale
self.n_heads = config.n_heads
self.head_dim = config.head_dim
def forward(self, x):
"""x: (seq_len, d_model) -> (seq_len, d_model)"""
seq_len, d_model = x.shape
n_h = self.n_heads
h_d = self.head_dim
# Project to Q, K, V
Q = x @ self.W_q # (seq_len, d_model)
K = x @ self.W_k
V = x @ self.W_v
# Reshape to (n_heads, seq_len, head_dim)
Q = Q.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)
K = K.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)
V = V.reshape(seq_len, n_h, h_d).transpose(1, 0, 2)
# Scaled dot-product attention
scale = np.sqrt(h_d)
scores = (Q @ K.transpose(0, 2, 1)) / scale # (n_heads, seq_len, seq_len)
# Causal mask: prevent attending to future positions
mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
scores = scores + mask
attn_weights = softmax(scores, axis=-1) # (n_heads, seq_len, seq_len)
attn_out = attn_weights @ V # (n_heads, seq_len, head_dim)
# Reshape back: (seq_len, d_model)
attn_out = attn_out.transpose(1, 0, 2).reshape(seq_len, d_model)
# Output projection
return attn_out @ self.W_o
A few things worth noting. The causal mask is a strictly upper-triangular matrix (everything above the diagonal) filled with -1e9 — adding it to the attention scores before softmax pushes future positions to effectively zero probability. The 1/sqrt(head_dim) scaling prevents dot products from growing too large as dimensionality increases, which would cause softmax to saturate.
Each attention head operates on a 16-dimensional slice of the 64-dimensional embedding. Four heads attend in parallel, each potentially learning different patterns — one might track syntactic relationships, another positional proximity, another semantic similarity.
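To see what the mask actually does, here it is for a four-token sequence: rows are queries, columns are keys, and only the lower triangle (the past and the current position) survives the softmax.

seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
print(mask)   # 0 on and below the diagonal, -1e9 everywhere above it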
SwiGLU Feed-Forward Network
The feed-forward network processes each token independently. We use SwiGLU, the gated variant found in LLaMA and Mistral, which multiplies a gated pathway with an ungated one.
def silu(x):
"""SiLU/Swish activation: x * sigmoid(x)"""
return x * (1.0 / (1.0 + np.exp(-x)))
class SwiGLU_FFN:
def __init__(self, config):
d = config.d_model
d_ff = config.d_ff
scale = 0.02
# Three weight matrices for SwiGLU
self.W_gate = np.random.randn(d, d_ff) * scale
self.W_up = np.random.randn(d, d_ff) * scale
self.W_down = np.random.randn(d_ff, d) * scale
def forward(self, x):
"""x: (seq_len, d_model) -> (seq_len, d_model)"""
gate = silu(x @ self.W_gate) # (seq_len, d_ff)
up = x @ self.W_up # (seq_len, d_ff)
return (gate * up) @ self.W_down # (seq_len, d_model)
SwiGLU uses three weight matrices instead of two, so we set d_ff = 172 (roughly 8/3 × d_model) to keep total parameter count comparable to a standard FFN with 4 × d_model. The gating mechanism — element-wise multiplication of the SiLU-activated gate with the ungated up projection — lets the network learn which features to amplify and which to suppress.
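Here's the arithmetic behind that choice with our d_model = 64: a standard FFN with a 4 × d_model hidden layer and the SwiGLU version with d_ff = 172 end up nearly identical in size.

d = 64
standard_ffn = 2 * d * (4 * d)   # W_in + W_out with a 4×d_model hidden layer: 32,768
swiglu_ffn = 3 * d * 172         # W_gate, W_up, W_down with d_ff = 172: 33,024
print(standard_ffn, swiglu_ffn)  # 32768 33024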
Assembling the Block
Now we combine everything with residual connections:
class TransformerBlock:
def __init__(self, config):
self.norm1 = RMSNorm(config.d_model)
self.attn = CausalSelfAttention(config)
self.norm2 = RMSNorm(config.d_model)
self.ffn = SwiGLU_FFN(config)
def forward(self, x):
"""x: (seq_len, d_model) -> (seq_len, d_model)"""
# Attention sub-layer with residual
x = x + self.attn.forward(self.norm1.forward(x))
# FFN sub-layer with residual
x = x + self.ffn.forward(self.norm2.forward(x))
return x
That's the entire transformer block in two lines of computation. Normalize, attend, add back. Normalize, feed-forward, add back. The residual connections (x = x + ...) are what make deep transformers trainable — gradients can flow directly through the addition, bypassing any poorly-behaved sub-layers.
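A quick shape check: each block maps (seq_len, d_model) to (seq_len, d_model), which is exactly what lets us stack as many of them as we like.

block = TransformerBlock(config)
x = np.random.randn(10, config.d_model)
print(block.forward(x).shape)   # (10, 64)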
Stacking Blocks — The Full Model
The complete transformer is an embedding layer, a stack of identical blocks, a final normalization, and an output head that maps back to vocabulary-sized logits.
class Transformer:
def __init__(self, config):
self.config = config
self.embedding = Embedding(config)
self.blocks = [TransformerBlock(config) for _ in range(config.n_layers)]
self.final_norm = RMSNorm(config.d_model)
# Weight tying: output head shares the token embedding matrix
def forward(self, token_ids):
"""token_ids: (seq_len,) -> logits: (seq_len, vocab_size)"""
x = self.embedding.forward(token_ids)
for block in self.blocks:
x = block.forward(x)
x = self.final_norm.forward(x)
# Project to vocabulary: reuse token embeddings (weight tying)
logits = x @ self.embedding.token_embed.T # (seq_len, vocab_size)
return logits
Weight tying is an important trick: the output head reuses the token embedding matrix (transposed) instead of having its own separate weight matrix. This saves vocab_size × d_model parameters and actually improves performance — the model learns that the embedding space and the output prediction space should be aligned.
Let's count every parameter in our model:
| Component | Shape | Parameters |
|---|---|---|
| Token embedding | 256 × 64 | 16,384 |
| Position embedding | 128 × 64 | 8,192 |
| Per-block: Attention (W_q, W_k, W_v, W_o) | 4 × (64 × 64) | 16,384 |
| Per-block: FFN (W_gate, W_up, W_down) | 2×(64×172) + 172×64 | 33,024 |
| Per-block: 2 × RMSNorm | 2 × 64 | 128 |
| Final RMSNorm | 64 | 64 |
| Output head | (tied with token embedding) | 0 |
| Total | 4 blocks × 49,536 + 24,640 | 222,784 |
About 222K parameters. For reference, GPT-2 Small has 124 million (~557× larger), and LLaMA 7B has 6.7 billion (~30,000× larger). But the architecture is identical — the only differences are wider dimensions, more layers, and vastly more training data.
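You can verify the table with a few lines of arithmetic:

per_block = 4 * 64 * 64 + (2 * 64 * 172 + 172 * 64) + 2 * 64   # attention + FFN + two RMSNorms
total = 256 * 64 + 128 * 64 + 4 * per_block + 64                # embeddings + 4 blocks + final norm
print(per_block, total)   # 49536 222784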
Training — Teaching the Model to Predict
The training objective is next-token prediction: at every position in the sequence, the model tries to predict what comes next. We measure the mistake using cross-entropy loss, which punishes confident wrong predictions severely (thanks to the logarithm) and rewards confident correct predictions.
def cross_entropy_loss(logits, targets):
"""
logits: (seq_len, vocab_size) - raw model output
targets: (seq_len,) - true next tokens
"""
# Stable softmax
probs = softmax(logits, axis=-1) # (seq_len, vocab_size)
seq_len = len(targets)
# Gather the probability assigned to each true target
correct_probs = probs[np.arange(seq_len), targets]
# Cross-entropy: -log(probability of correct token)
loss = -np.mean(np.log(correct_probs + 1e-9))
return loss, probs
For training, the input is a sequence of tokens [t0, t1, ..., tN-1] and the targets are the same sequence shifted one position to the left: [t1, t2, ..., tN]. The model sees "The cat sat" and must predict "cat sat on". Every position provides a training signal simultaneously — this parallel supervision is part of what makes transformer training so efficient.
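Concretely, with a byte-level sequence the shift looks like this; every position contributes one (input, target) pair.

tokens = np.array(list("The cat".encode('utf-8')))
inputs, targets = tokens[:-1], tokens[1:]
for i, t in zip(inputs, targets):
    print(f"{chr(i)!r} -> {chr(t)!r}")   # 'T' -> 'h', 'h' -> 'e', 'e' -> ' ', ...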
We'll use numerical gradients (finite differences) for simplicity. A real implementation would use autograd (which we built from scratch in the first post), but for a 222K parameter model on a CPU, numerical gradients are slow yet workable, and they let us focus on architecture rather than backpropagation plumbing.
def get_all_params(model):
"""Collect every parameter array in the model into a flat list."""
params = [model.embedding.token_embed, model.embedding.pos_embed]
for block in model.blocks:
params.extend([
block.norm1.weight,
block.attn.W_q, block.attn.W_k, block.attn.W_v, block.attn.W_o,
block.norm2.weight,
block.ffn.W_gate, block.ffn.W_up, block.ffn.W_down,
])
params.append(model.final_norm.weight)
return params
def compute_loss(model, token_ids):
"""Forward pass + cross-entropy loss for next-token prediction."""
inputs = token_ids[:-1]
targets = token_ids[1:]
logits = model.forward(inputs)
loss, _ = cross_entropy_loss(logits, targets)
return loss
def numerical_gradients(model, token_ids, eps=1e-5):
"""Compute gradients via finite differences."""
params = get_all_params(model)
grads = []
base_loss = compute_loss(model, token_ids)
for param in params:
grad = np.zeros_like(param)
it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
old_val = param[idx]
param[idx] = old_val + eps
loss_plus = compute_loss(model, token_ids)
param[idx] = old_val
grad[idx] = (loss_plus - base_loss) / eps
it.iternext()
grads.append(grad)
return grads, base_loss
Numerical gradients are extremely slow — computing the full gradient requires one forward pass per parameter. For our 222K parameter model, that's 222K forward passes per training step. This is fine for a demonstration on tiny batches; in practice, you'd use automatic differentiation, which computes all gradients in a single backward pass.
For the optimizer, we'll use Adam — the workhorse that combines momentum with adaptive learning rates:
class Adam:
def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
self.params = params
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.t = 0
self.m = [np.zeros_like(p) for p in params] # first moment
self.v = [np.zeros_like(p) for p in params] # second moment
def step(self, grads):
self.t += 1
for i, (param, grad) in enumerate(zip(self.params, grads)):
self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad ** 2
# Bias correction
m_hat = self.m[i] / (1 - self.beta1 ** self.t)
v_hat = self.v[i] / (1 - self.beta2 ** self.t)
param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
Here's the training loop. Because numerical gradients are so slow, we train on very short sequences from a tiny corpus:
# Training data: encode text as bytes
text = "The cat sat on the mat. The dog sat on the log. "
text = text * 20 # repeat for more data
data = np.array([b for b in text.encode('utf-8')], dtype=np.int32)
model = Transformer(config)
optimizer = Adam(get_all_params(model), lr=3e-4)
# Random baseline: -ln(1/256) ≈ 5.55
print(f"Random baseline loss: {np.log(config.vocab_size):.2f}")
for step in range(50):
# Random slice of training data
start = np.random.randint(0, len(data) - config.max_seq_len - 1)
chunk = data[start : start + config.max_seq_len + 1]
grads, loss = numerical_gradients(model, chunk[:33]) # short seqs for speed
optimizer.step(grads)
if step % 10 == 0:
print(f"Step {step:3d} | Loss: {loss:.3f}")
You should see the loss drop from around 5.5 (random guessing across 256 tokens) to something significantly lower as the model learns character patterns. Even with such a tiny dataset, the model picks up on common sequences like "the", "sat on", and the space patterns between words.
Generation — Making the Model Speak
Once trained, generation is autoregressive: feed in a prompt, predict the next token, append it, and repeat. We apply temperature to control randomness and top-p (nucleus) sampling to cut off the long tail of unlikely tokens.
def generate(model, prompt_bytes, max_new_tokens=100, temperature=0.8, top_p=0.9):
"""Generate text autoregressively from a byte-level prompt."""
token_ids = list(prompt_bytes)
for _ in range(max_new_tokens):
# Truncate to context window
context = np.array(token_ids[-model.config.max_seq_len:], dtype=np.int32)
logits = model.forward(context)
next_logits = logits[-1] / temperature # last position only
# Softmax to get probabilities
probs = softmax(next_logits)
# Top-p sampling: keep smallest set of tokens with cumulative prob >= top_p
sorted_idx = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_idx]
cumsum = np.cumsum(sorted_probs)
cutoff = np.searchsorted(cumsum, top_p) + 1
top_idx = sorted_idx[:cutoff]
top_probs = probs[top_idx]
top_probs = top_probs / top_probs.sum() # renormalize
# Sample from filtered distribution
next_token = np.random.choice(top_idx, p=top_probs)
token_ids.append(int(next_token))
return bytes(token_ids).decode('utf-8', errors='replace')
# Generate!
prompt = "The cat"
prompt_bytes = list(prompt.encode('utf-8'))
print(generate(model, prompt_bytes, max_new_tokens=60))
On our tiny training corpus, the model will produce repetitive but recognizable patterns: "The cat sat on the mat. The dog sat on the log." — it's learned the rhythms of its training data. With more data, more parameters, and more training, this exact architecture scales to produce the fluent text you see from frontier language models.
Note that we skip the KV cache here — we recompute the full attention at each generation step. For our tiny model this is fine, but production models cache the key and value projections from previous tokens to avoid redundant computation, cutting generation from O(n²) to O(n) per step.
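For the curious, here is a minimal sketch of what attention-level caching could look like for our model. attend_one_token is a hypothetical helper, not something the classes above implement, and a full version would also need one cache per layer threaded through the Transformer and its positional embeddings.

def attend_one_token(attn, x_new, k_cache, v_cache):
    """One decoding step with a KV cache.
    attn: a CausalSelfAttention instance
    x_new: (1, d_model) hidden state of the newest token (already normalized by the block)
    k_cache, v_cache: (n_heads, cached_len, head_dim) from previous steps, or None"""
    n_h, h_d = attn.n_heads, attn.head_dim
    q = (x_new @ attn.W_q).reshape(1, n_h, h_d).transpose(1, 0, 2)       # (n_heads, 1, head_dim)
    k_new = (x_new @ attn.W_k).reshape(1, n_h, h_d).transpose(1, 0, 2)
    v_new = (x_new @ attn.W_v).reshape(1, n_h, h_d).transpose(1, 0, 2)
    # Append this token's keys/values to the cache instead of recomputing them all
    k_cache = k_new if k_cache is None else np.concatenate([k_cache, k_new], axis=1)
    v_cache = v_new if v_cache is None else np.concatenate([v_cache, v_new], axis=1)
    scores = (q @ k_cache.transpose(0, 2, 1)) / np.sqrt(h_d)             # (n_heads, 1, cached_len)
    weights = softmax(scores, axis=-1)   # no causal mask needed: the cache holds only the past
    out = (weights @ v_cache).transpose(1, 0, 2).reshape(1, -1)          # (1, d_model)
    return out @ attn.W_o, k_cache, v_cache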
Try It: Transformer X-Ray
Watch data flow through each stage of a transformer. Select a preset prompt or type your own, then step through the pipeline: tokenization → embedding → attention → FFN → output probabilities. The visualization shows what happens inside a single transformer block.
The Complete Picture — Where Every Post Fits
Here's the full pipeline, one last time, with every post linked:
- Tokenize — text becomes token IDs (byte-level BPE)
- Embed — token IDs become dense vectors
- Position — add positional information (learned / RoPE)
- Normalize — RMSNorm stabilizes activations before each sub-layer
- Attend — tokens communicate via scaled dot-product attention
- KV Cache — cache keys/values to avoid recomputation during generation
- FFN — SwiGLU transforms each token independently
- Softmax — convert logits to a probability distribution
- Loss — cross-entropy measures prediction quality
- Optimize — Adam updates weights using gradients
- Backprop — automatic differentiation computes all gradients
- Decode — temperature + top-p sampling chooses the next token
- Fine-tune — LoRA adapts the model to new tasks cheaply
- Quantize — shrink the model from float32 to int4 for deployment
Micrograd is the foundation beneath it all — automatic differentiation is what makes training possible, turning "how wrong are we?" into "how should we adjust each weight?"
The scaling insight is worth repeating: our 222K parameter model is architecturally identical to frontier models. GPT-2 uses the same transformer blocks with d_model=768 and 12 layers. LLaMA 7B uses d_model=4096 and 32 layers. The recent models rumored to exceed a trillion parameters use the same attention, the same FFN, the same normalization. They just stack more layers, widen the dimensions, and pour in more data.
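A rough back-of-the-envelope check makes the point. Assuming the common ≈ 12 · n_layers · d_model² estimate for GPT-style blocks with a 4 × d_model FFN, plus the embedding matrix (LLaMA's SwiGLU FFN and other details shift the exact figures slightly):

def approx_params(n_layers, d_model, vocab_size):
    # 12·d_model² per block ≈ 4·d² for attention + 8·d² for a 4×d_model FFN
    return 12 * n_layers * d_model ** 2 + vocab_size * d_model

print(f"{approx_params(12, 768, 50257):,}")    # ≈ 124M — GPT-2 Small
print(f"{approx_params(32, 4096, 32000):,}")   # ≈ 6.6B — in the ballpark of LLaMA 7B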
What we didn't cover: multi-GPU training and data parallelism, mixed-precision training (fp16/bf16), Flash Attention (fusing the attention computation into a single GPU kernel for memory efficiency), speculative decoding (using a small draft model to speed up generation from a large model), and the massive infrastructure needed to train at scale. These are engineering challenges, not architectural ones — the core design is exactly what we built here.
You Built a Transformer
If you've followed all fifteen posts in this series, you've built every component of the most important architecture in modern AI — from the ground up, in pure Python and NumPy. You know what every parameter does, where every gradient flows, and why every design choice was made.
The transformer's power isn't in any single brilliant component. It's in the combination: residual connections that let gradients flow, attention that lets tokens communicate, normalization that keeps signals stable, feed-forward networks that transform each token, and the simple but profound idea that stacking these blocks and training with next-token prediction produces something that looks like understanding.
The architecture is simple. The breakthrough was realizing that attention + FFN + residuals + scale = intelligence. Or at least, something close enough that we're still figuring out the difference.
References & Further Reading
- Vaswani et al. — "Attention Is All You Need" (2017) — the original transformer paper that started it all
- Radford et al. — "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019) — showed that decoder-only transformers scale to impressive generation quality
- Touvron et al. — "LLaMA: Open and Efficient Foundation Language Models" (2023) — the open model that codified RoPE + SwiGLU + RMSNorm as the modern standard
- Elhage et al. — "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021) — deep analysis of the residual stream and how transformer components compose
- Andrej Karpathy — "Let's build GPT: from scratch, in code, spelled out" (2023) — the definitive video walkthrough of building a GPT from scratch in PyTorch