Attention Is All You Need (To Implement)
Why Attention Changed Everything
Before 2017, neural networks processed sequences like reading through a keyhole — one token at a time, desperately trying to compress everything they'd seen into a single fixed-size vector. Recurrent networks (RNNs, LSTMs) passed a hidden state forward step by step, but by the time they reached the end of a long sentence, the beginning had faded into a smear of averaged-out information.
Then Vaswani et al. published "Attention Is All You Need" and blew the doors off. The core idea was almost embarrassingly simple: instead of processing tokens sequentially, let every token look at every other token directly. No recurrence. No convolution. Just attention.
This mechanism — scaled dot-product attention — is the beating heart of every large language model you've ever used. GPT, Claude, Llama, Gemini — they're all built on the same ~20 lines of math. And today we're going to implement it from scratch in pure Python and NumPy.
By the end of this post, you'll have written roughly 120 lines of code that implement the core attention mechanism, multi-head attention, and positional encoding. No frameworks, no magic — just matrix multiplication and a clear understanding of what every line does.
The Intuition: Queries, Keys, and Values
Before we touch any math, let's build the right mental model. Attention works like a fuzzy database lookup.
Imagine you're in a library. You walk up to the catalog with a question in mind — that's your Query. Every book on the shelf has a label describing what it contains — those are the Keys. You compare your question against every label to figure out which books are most relevant. Then you pull the actual content from those books — those are the Values.
The critical insight: the label on a book (Key) isn't the same as its content (Value). A book labeled "French cooking" might contain a recipe you need for your Italian dish (cross-domain relevance). Similarly, in a sentence, the word "it" might have a Key that says "pronoun looking for an antecedent" and a Value that carries the semantic content of whatever "it" refers to.
In attention, every token plays all three roles simultaneously. Each token generates:
- A Query: "What am I looking for?"
- A Key: "What do I contain?"
- A Value: "Here's my actual information."
These are just different linear projections of the same embedding — three different "views" of the same token, each serving a different purpose. The dot product between a Query and a Key measures how relevant that key is to the query. Let's see this in code:
import numpy as np
# Two word embeddings (dimension 4 for simplicity)
cat = np.array([0.9, 0.1, 0.8, 0.2]) # "cat" — animal-like
sat = np.array([0.1, 0.9, 0.3, 0.7]) # "sat" — action-like
mat = np.array([0.8, 0.2, 0.7, 0.3]) # "mat" — object, similar to cat
# Dot product measures similarity
print(f"cat · sat = {np.dot(cat, sat):.3f}") # 0.560 — moderate
print(f"cat · mat = {np.dot(cat, mat):.3f}") # 1.360 — high!
print(f"sat · mat = {np.dot(sat, mat):.3f}") # 0.680 — moderate
The dot product is higher for vectors that point in similar directions. "Cat" and "mat" are similar (both are noun-like, object-like), so their dot product is highest. This is the engine that drives attention: tokens with high dot products will attend to each other.
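As a preview of where this goes, we can push those raw similarity scores through a softmax and already get something attention-weight-like. Here "cat" plays the query against all three tokens. This is only a toy sketch: there are no learned projections yet, so query and key are the same vector.

```python
import numpy as np

cat = np.array([0.9, 0.1, 0.8, 0.2])
sat = np.array([0.1, 0.9, 0.3, 0.7])
mat = np.array([0.8, 0.2, 0.7, 0.3])

# "cat" as the query, every token as a key
scores = np.array([np.dot(cat, cat), np.dot(cat, sat), np.dot(cat, mat)])
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
for tok, w in zip(["cat", "sat", "mat"], weights):
    print(f"{tok}: {w:.3f}")
```

The weights sum to 1, and "cat" attends most to itself, then to "mat", then to "sat", mirroring the raw dot products.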
Scaled Dot-Product Attention: The Core
Now for the real thing. The attention formula from the Transformer paper is one equation:
Attention(Q, K, V) = softmax(Q K^T / √d_k) · V
Let's break it down step by step. We have a sequence of n tokens, each with embedding dimension d_k. The matrices Q, K, V all have shape (n, d_k) — one row per token.
Step 1: Compute raw attention scores. Multiply Q by K transposed to get an (n, n) matrix where entry (i, j) is the dot product of token i's query with token j's key:
# Q shape: (n, d_k)
# K shape: (n, d_k)
# K^T shape: (d_k, n)
# Q @ K^T shape: (n, n) — every token scored against every other token
scores = Q @ K.T
Step 2: Scale by √d_k. This is the step most tutorials hand-wave past. Here's why it matters:
If the entries of Q and K are roughly standard normal (mean 0, variance 1), then their dot product has variance d_k. With d_k = 512, dot products can easily reach ±30 or higher. When you feed numbers that large into softmax, you get outputs that are essentially 0 or 1 — the softmax has saturated, and its gradients vanish. Training grinds to a halt.
Dividing by √d_k normalizes the variance back to 1, keeping the softmax in its useful, gradient-friendly region:
d_k = Q.shape[-1]
scores = scores / np.sqrt(d_k)
Why √d_k? If Q and K entries are independent with variance 1, their dot product (a sum of d_k products) has variance d_k. Dividing by √d_k gives variance 1. It's not a magic number — it's statistics.
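The variance claim is easy to check empirically. This is a quick standalone experiment, not part of the attention code itself: draw many standard-normal query/key pairs and measure the variance of their dot products before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# 10,000 independent query/key pairs with standard-normal entries
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

dots = np.einsum('ij,ij->i', q, k)  # one dot product per pair
print(f"raw variance:    {dots.var():.1f}")                   # close to d_k = 512
print(f"scaled variance: {(dots / np.sqrt(d_k)).var():.2f}")  # close to 1
```

With d_k = 512, unscaled dot products routinely land tens of units from zero, exactly the regime where softmax saturates.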
Step 3: Softmax to get weights. Convert scores to probabilities. Each row sums to 1 — it's a probability distribution over which tokens to attend to:
def softmax(x):
    # Subtract max for numerical stability (prevents overflow in exp)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

weights = softmax(scores)  # shape: (n, n), each row sums to 1
Step 4: Weighted sum of values. Each token's output is a weighted combination of all values, where the weights are the attention probabilities:
output = weights @ V # shape: (n, d_k) — same shape as input
Let's put it all together in a single clean function:
def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: arrays of shape (n, d_k)
    Returns: output (n, d_k), attention weights (n, n)
    """
    d_k = Q.shape[-1]
    # Steps 1-2: compute scaled scores
    scores = (Q @ K.T) / np.sqrt(d_k)  # (n, n)
    # Step 3: softmax over keys (last axis)
    e = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = e / np.sum(e, axis=-1, keepdims=True)  # (n, n)
    # Step 4: weighted sum of values
    output = weights @ V  # (n, d_k)
    return output, weights
That's it. A dozen or so lines of code, and you've implemented the core mechanism behind every modern LLM. Let's run it.
Seeing Attention in Action
Let's feed a short sentence through our attention function and see what the weights look like. We'll use random projections to create Q, K, V from a set of toy embeddings:
np.random.seed(42)
# Five tokens with 8-dimensional embeddings
tokens = ["The", "cat", "sat", "on", "mat"]
n, d_model, d_k = 5, 8, 8
# Random token embeddings (in a real model, these are learned)
embeddings = np.random.randn(n, d_model)
# Random projection matrices (in a real model, these are learned too)
W_q = np.random.randn(d_model, d_k) * 0.3
W_k = np.random.randn(d_model, d_k) * 0.3
W_v = np.random.randn(d_model, d_k) * 0.3
# Project embeddings into Q, K, V spaces
Q = embeddings @ W_q # (5, 8)
K = embeddings @ W_k # (5, 8)
V = embeddings @ W_v # (5, 8)
# Run attention
output, weights = scaled_dot_product_attention(Q, K, V)
# Print the attention weight matrix
print("Attention weights (rows=queries, cols=keys):\n")
print(" ", " ".join(f"{t:>5s}" for t in tokens))
for i, row in enumerate(weights):
    print(f"{tokens[i]:>5s}:", " ".join(f"{w:.3f}" for w in row))
This prints a 5×5 matrix where each row shows how much a query token attends to each key token. With random weights the patterns aren't semantically meaningful yet — but in a trained model, you'd see "sat" attending to "cat" (its subject) and "mat" attending to "on" (its preposition). The interactive demo below lets you see these patterns in real time.
Try It: Attention Weight Visualizer
Interactive: Attention Heatmap
Type a sentence and watch the attention weights. Each cell shows how strongly a query token (row) attends to a key token (column). Brighter = more attention. Adjust d_k to see how the scaling factor changes the weight distribution.
Try a few experiments: drop d_k to 2 and notice the weights become nearly uniform — with only two dimensions, the dot products are tiny and softmax can barely distinguish between tokens. Increase d_k to 64 and watch the weights sharpen as more dimensions accumulate stronger signal in the dot products. In a real Transformer, d_k is typically 64 per head, and the √d_k scaling is what keeps softmax in its useful gradient region as dimensions grow.
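If you don't have the demo handy, the same sharpening effect can be reproduced by scaling a fixed set of scores before the softmax. This is a standalone sketch; the larger multipliers stand in for the larger unscaled dot products that come with higher d_k.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([0.2, -0.1, 0.3, 0.0])
for temp in (1, 10, 100):  # scale up the scores, as larger unscaled d_k would
    w = softmax(scores * temp)
    print(f"x{temp:>3}: {np.round(w, 3)}")
```

At multiplier 1 the distribution is nearly uniform; at 100 it is essentially one-hot on the largest score, which is the saturation regime the √d_k scaling protects against.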
Multi-Head Attention: Why One Head Isn't Enough
A single attention head computes one set of weights — one "view" of which tokens matter to which. But language is complex. In the sentence "The animal didn't cross the street because it was too tired," the word "it" needs to attend to "animal" for coreference, but also to "cross" for semantic role, and to "tired" for predication. A single attention pattern can't capture all of these simultaneously.
The solution: multi-head attention. Run attention h times in parallel, each with different learned projections, then combine the results. Each head can specialize in a different type of relationship.
The key implementation insight is that you don't create h separate Q, K, V matrices. Instead, you project into a full-size space and then reshape to split into heads:
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0
        self.d_model = d_model           # e.g., 64
        self.num_heads = num_heads       # e.g., 8
        self.d_k = d_model // num_heads  # e.g., 8
        # Projection matrices (in a real model, these are learned)
        scale = 0.3
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def forward(self, X):
        """
        X: (n, d_model) — input embeddings
        Returns: (n, d_model) — attended output
        """
        n = X.shape[0]
        # Project into Q, K, V — still full d_model width
        Q = X @ self.W_q  # (n, d_model)
        K = X @ self.W_k  # (n, d_model)
        V = X @ self.W_v  # (n, d_model)
        # Reshape to split heads: (n, d_model) → (n, h, d_k) → (h, n, d_k)
        Q = Q.reshape(n, self.num_heads, self.d_k).transpose(1, 0, 2)
        K = K.reshape(n, self.num_heads, self.d_k).transpose(1, 0, 2)
        V = V.reshape(n, self.num_heads, self.d_k).transpose(1, 0, 2)
        # Now Q, K, V are each (h, n, d_k) — h independent attention problems
        # Scaled dot-product attention for all heads at once
        scores = (Q @ K.transpose(0, 2, 1)) / np.sqrt(self.d_k)  # (h, n, n)
        # Softmax per head (numerically stable)
        e = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = e / np.sum(e, axis=-1, keepdims=True)  # (h, n, n)
        # Weighted sum of values
        attended = weights @ V  # (h, n, d_k)
        # Concatenate heads: (h, n, d_k) → (n, h, d_k) → (n, d_model)
        attended = attended.transpose(1, 0, 2).reshape(n, self.d_model)
        # Final linear projection
        output = attended @ self.W_o  # (n, d_model)
        return output, weights
Let's trace the shapes through the entire pipeline with concrete numbers. Suppose we have 6 tokens, d_model = 64, and 8 heads:
- Input: (6, 64) — six 64-dimensional token embeddings
- After projection: Q, K, V are each (6, 64)
- After reshape + transpose: (8, 6, 8) — eight heads, each with 6 tokens of dimension 8
- Attention scores: (8, 6, 6) — eight separate 6×6 attention matrices
- After attention: (8, 6, 8) — eight attended value sets
- After concatenation: (6, 64) — heads merged back to original dimension
- After output projection: (6, 64) — final output, same shape as input
The representation comes in as (n, d_model) and leaves as (n, d_model). Attention is a refinement — it doesn't change the shape of the representation, it changes the content. Each token's output is now informed by every other token, weighted by relevance.
The reshape trick is the key insight: instead of running h separate attention computations with h separate weight matrices, we project once into the full d_model space, reshape to split into heads, and use NumPy's batch matrix multiplication to handle all heads simultaneously. One matrix multiply does the work of eight.
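To convince yourself the reshape is equivalent to running heads separately, compare the batched computation against an explicit per-head loop. This is a standalone check with random matrices in place of learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d_k = 6, 8, 8
Q = rng.standard_normal((n, h * d_k))
K = rng.standard_normal((n, h * d_k))
V = rng.standard_normal((n, h * d_k))

def attend(q, k, v):
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

# Batched: reshape into (h, n, d_k) and let NumPy broadcast over heads
Qh = Q.reshape(n, h, d_k).transpose(1, 0, 2)
Kh = K.reshape(n, h, d_k).transpose(1, 0, 2)
Vh = V.reshape(n, h, d_k).transpose(1, 0, 2)
s = (Qh @ Kh.transpose(0, 2, 1)) / np.sqrt(d_k)
e = np.exp(s - s.max(axis=-1, keepdims=True))
batched = (e / e.sum(axis=-1, keepdims=True)) @ Vh  # (h, n, d_k)

# Looped: slice out each head's columns and attend to them separately
looped = np.stack([attend(Q[:, i*d_k:(i+1)*d_k],
                          K[:, i*d_k:(i+1)*d_k],
                          V[:, i*d_k:(i+1)*d_k]) for i in range(h)])

print(np.allclose(batched, looped))  # True
```

The two paths agree to floating-point precision; the batched version is just all eight slices computed by one broadcast matmul.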
Let's verify it works:
np.random.seed(42)
n, d_model, num_heads = 6, 64, 8
X = np.random.randn(n, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, all_weights = mha.forward(X)
print(f"Input shape: {X.shape}") # (6, 64)
print(f"Output shape: {output.shape}") # (6, 64)
print(f"Weight shape: {all_weights.shape}") # (8, 6, 6) — 8 attention matrices
# Each head learns different patterns
for h in range(min(3, num_heads)):
    print(f"\nHead {h} — where does token 0 attend?")
    w = all_weights[h, 0]
    for i, weight in enumerate(w):
        bar = "█" * int(weight * 40)
        print(f"  token {i}: {weight:.3f} {bar}")
Each head produces a different attention pattern. In a trained model, one head might focus on syntactic dependencies (subject-verb), another on positional proximity, and another on semantic similarity. That's the power of multi-head attention — multiple relationship types captured in parallel.
Positional Encoding: Where Am I?
We've built something powerful, but it has a fatal flaw: attention has no notion of word order — it is permutation-equivariant. Shuffle the tokens in any order and the output is just the same shuffle of the original output. "The cat sat on the mat" and "mat the on sat cat the" produce identical attention patterns (just reordered). That's a problem — word order matters.
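You can check this property numerically: permute the input rows of a self-attention call and the output rows come back permuted the same way. A standalone sketch with a minimal attention function:

```python
import numpy as np

def attention(Q, K, V):
    s = (Q @ K.T) / np.sqrt(Q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 tokens, no positional information
perm = rng.permutation(5)

out = attention(X, X, X)
out_shuffled = attention(X[perm], X[perm], X[perm])

# Shuffling inputs just shuffles outputs the same way; order is invisible
print(np.allclose(out[perm], out_shuffled))  # True
```

Nothing in the computation distinguishes token positions, which is exactly why we need to inject position information.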
The fix: inject position information into the embeddings before they enter the attention mechanism. The original Transformer uses sinusoidal positional encodings:
def positional_encoding(seq_len, d_model):
    """
    Generates sinusoidal positional encodings.
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe[:, 0::2] = np.sin(position / div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # odd dimensions
    return pe
Why sine and cosine? It's not arbitrary. Each dimension oscillates at a different frequency — low dimensions change rapidly (capturing fine position), high dimensions change slowly (capturing coarse position). Together, they create a unique "fingerprint" for each position. And because sin(a + b) can be expressed as a linear combination of sin(a) and cos(a), the encoding of position pos + k is a linear function of the encoding at position pos. This lets the model learn relative positions, not just absolute ones.
pe = positional_encoding(20, 64)
print(f"Shape: {pe.shape}") # (20, 64)
print(f"PE[0][:6]: {pe[0, :6]}") # Position 0
print(f"PE[1][:6]: {pe[1, :6]}") # Position 1 — different!
print(f"PE[0] · PE[1] = {np.dot(pe[0], pe[1]):.3f}") # High similarity (nearby)
print(f"PE[0] · PE[19] = {np.dot(pe[0], pe[19]):.3f}") # Lower (far apart)
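The relative-position claim is also checkable: for a fixed offset, each (sin, cos) pair at position pos + offset is exactly a rotation of the pair at pos. A standalone sketch that restates the encoder so it runs on its own:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

d_model, offset, pos = 64, 3, 10
pe = positional_encoding(50, d_model)
freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)

# For each frequency w: sin((p+k)w), cos((p+k)w) are linear in sin(pw), cos(pw)
predicted = np.empty(d_model)
for i, w in enumerate(freqs):
    c, s = np.cos(offset * w), np.sin(offset * w)
    sin_p, cos_p = pe[pos, 2*i], pe[pos, 2*i + 1]
    predicted[2*i]     = sin_p * c + cos_p * s  # sin(pw + kw)
    predicted[2*i + 1] = cos_p * c - sin_p * s  # cos(pw + kw)

print(np.allclose(predicted, pe[pos + offset]))  # True
```

The rotation coefficients depend only on the offset, not on the absolute position, which is what makes relative positions learnable from this encoding.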
The encoding is added to the embeddings, not concatenated. This preserves the embedding dimension and lets the model learn to combine positional and semantic information:
# In a full transformer:
# X = token_embeddings + positional_encoding(seq_len, d_model)
# output = multi_head_attention(X)
Modern models have moved on to learned positional embeddings or Rotary Position Embeddings (RoPE), but sinusoidal encoding remains the clearest way to understand why position injection is needed and how it works.
Putting It All Together
Let's wire up the complete pipeline — token embeddings, positional encoding, and multi-head attention — and run it end to end:
np.random.seed(42)
# Configuration
sentence = ["The", "cat", "sat", "on", "the", "mat"]
n = len(sentence)
d_model = 64
num_heads = 8
# Simulate token embeddings (random, since we don't have a vocabulary)
token_embeddings = np.random.randn(n, d_model) * 0.5
# Add positional encoding
pe = positional_encoding(n, d_model)
X = token_embeddings + pe
print("Before attention:")
print(f" Embedding norms: {[f'{np.linalg.norm(X[i]):.2f}' for i in range(n)]}")
# Run multi-head attention
mha = MultiHeadAttention(d_model, num_heads)
output, weights = mha.forward(X)
print("\nAfter attention:")
print(f" Output norms: {[f'{np.linalg.norm(output[i]):.2f}' for i in range(n)]}")
print(f"\nEvery token now carries information from every other token.")
print(f"Shape in: {X.shape} → Shape out: {output.shape}")
In roughly 120 lines of NumPy, we've built the core mechanism behind GPT, Claude, and every modern LLM. The real Transformer stacks multiple layers of this (6 to 96+), adds feed-forward networks between attention layers, uses layer normalization, and trains on massive datasets. But the attention mechanism you've just implemented is the same.
If you want to see how this becomes a full Transformer, PyTorch wraps everything we built into a single module:
import torch
import torch.nn as nn
# Everything we built from scratch — in one line
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
X = torch.randn(1, 6, 64) # (batch, seq_len, d_model)
output, weights = attn(X, X, X) # self-attention: Q=K=V=X
Behind nn.MultiheadAttention lives the same projection → reshape → scale → softmax → weighted sum → concatenate → project pipeline we wrote by hand. The difference is CUDA kernels instead of NumPy, but the math is identical.
What We Didn't Cover
A full Transformer has more moving parts that are worth exploring next:
- Masking: In decoder-only models (like GPT), a causal mask prevents tokens from attending to future positions. This is done by setting upper-triangle scores to −∞ before softmax.
- Feed-forward networks: After each attention layer, every token passes through a small 2-layer MLP. This adds the nonlinear capacity that attention (which is purely linear in V) lacks.
- Layer normalization: Normalizes activations between layers to stabilize training. Usually applied before attention ("pre-norm") in modern architectures.
- Residual connections: The output of each sub-layer is added to its input: output = X + attention(X). This lets gradients flow directly through deep stacks.
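As a taste of the first item, causal masking is only a few lines on top of the attention function we already wrote. A minimal standalone sketch, using the upper-triangle-to-−∞ approach described above:

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Causal mask: token i may only attend to tokens 0..i
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # masked entries → weight 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
_, w = causal_attention(X, X, X)
print(np.round(w, 3))  # lower-triangular: zeros above the diagonal
```

Each row still sums to 1, but all the probability mass sits on current and past tokens, which is what lets a decoder-only model be trained to predict the next token.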
Each of these is simpler than the attention mechanism itself. If you've understood what we built today, the rest of the Transformer is bookkeeping.
The remarkable thing about attention is how much insight is packed into such a small amount of code. Twenty lines of matrix multiplication, and you have a mechanism that can learn grammar, coreference, reasoning, and even code generation — all from the same simple principle: let every token decide what to pay attention to.
References & Further Reading
- Vaswani et al. — "Attention Is All You Need" (2017) — The original Transformer paper. Section 3.2 covers scaled dot-product and multi-head attention.
- Jay Alammar — "The Illustrated Transformer" — The gold standard visual explanation of the Transformer architecture, with step-by-step diagrams.
- Andrej Karpathy — "Let's build GPT: from scratch, in code, spelled out" — A 2-hour video building a GPT from zero, including the attention mechanism.
- Dive into Deep Learning — Attention Mechanisms — Comprehensive textbook treatment with interactive code, covering scoring functions, masking, and multi-head attention.
- Eli Bendersky — "Notes on implementing attention" — Clean, well-annotated implementation notes covering the details that other tutorials skip.