Positional Encoding from Scratch: How Transformers Know Word Order
The Shuffle Test
Take these two sentences: “dog bites man” and “man bites dog.” Opposite meanings. One is a slow news day; the other makes the front page. Now watch what happens when self-attention processes them.
Let’s assign random embeddings to each word and compute attention weights on both orderings:
import numpy as np
np.random.seed(42)
# Random embeddings for three words
embeddings = {
    'dog': np.random.randn(8),
    'bites': np.random.randn(8),
    'man': np.random.randn(8),
}

def attention_weights(tokens, embeds):
    """Compute attention weight matrix for a token sequence."""
    X = np.stack([embeds[t] for t in tokens])  # (3, 8)
    scores = X @ X.T / np.sqrt(8)              # (3, 3)
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)
w1 = attention_weights(['dog', 'bites', 'man'], embeddings)
w2 = attention_weights(['man', 'bites', 'dog'], embeddings)
print("'dog bites man' weights:")
print(np.round(w1, 3))
# [[0.954 0.025 0.021]
# [0.012 0.833 0.155]
# [0.009 0.145 0.845]]
print("\n'man bites dog' weights:")
print(np.round(w2, 3))
# [[0.845 0.145 0.009]
# [0.155 0.833 0.012]
# [0.021 0.025 0.954]]
Look carefully. The weight matrix for “man bites dog” is the same matrix as “dog bites man” — just with rows and columns swapped. Token dog still pays the same attention to bites regardless of whether “dog” is the subject or the object. The model has no idea which word came first.
This isn’t a bug. It’s a mathematical certainty. Attention computes pairwise similarities between embeddings with QKᵀ. Shuffle the tokens and you shuffle the matrix — the relationships between tokens don’t change. Attention is permutation-invariant.
This is the fundamental problem. Language is deeply ordered — “the cat sat on the mat” is not “mat the on sat cat the.” Without some way to encode where each token sits in the sequence, a transformer treats every sentence as a bag of words.
In our attention post, we glossed over “add positional encoding to the input.” Today we crack that open. We’ll build three different solutions, each fixing a flaw in the previous one: sinusoidal encoding (2017), learned embeddings (2018), and Rotary Position Embeddings (2021). By the end, you’ll understand the approach used by every major open-source LLM today.
Sinusoidal Encoding — The Fourier Solution
The original Transformer paper by Vaswani et al. introduced a clever mathematical trick: encode each position as a unique pattern of sine and cosine waves at different frequencies.
The formula is deceptively simple. For position pos and dimension i:
- Even dimensions: PE(pos, 2i) = sin(pos / 10000^(2i/d))
- Odd dimensions: PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Let’s build it:
def sinusoidal_encoding(seq_len, d_model):
    """Generate sinusoidal positional encodings."""
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    # Divisors: 10000^(2i/d), giving frequencies 1, 1/10000^(2/d), ...
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe[:, 0::2] = np.sin(position / div_term)  # Even dims get sin
    pe[:, 1::2] = np.cos(position / div_term)  # Odd dims get cos
    return pe
pe = sinusoidal_encoding(10, 8)
print("Position 0:", np.round(pe[0], 3))
# [ 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000]
print("Position 1:", np.round(pe[1], 3))
# [ 0.841 0.540 0.100 0.995 0.010 1.000 0.001 1.000]
print("Position 9:", np.round(pe[9], 3))
# [ 0.412 -0.911 0.783 0.622 0.09 0.996 0.009 1. ]
Think of it like a set of clocks running at different speeds. Dimension 0 is a second hand — it completes a full sin/cos cycle every ~6 positions (wavelength 2π). Dimension 6 is more like an hour hand — it barely moves between adjacent positions, completing a cycle only after thousands of positions. Together, these multi-speed oscillations create a unique fingerprint for every position, just like hours + minutes + seconds uniquely identify any time of day.
Why use both sine and cosine? Sine alone creates ambiguity: sin(θ) = sin(π − θ), so positions at θ and π − θ would get identical encodings. The (sin, cos) pair gives each position a unique 2D coordinate at every frequency — no collisions.
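A two-line check makes that ambiguity concrete (a small illustrative sketch, not part of the original code):

```python
import numpy as np

theta = 0.7  # an arbitrary angle
a, b = theta, np.pi - theta  # two distinct "positions"
print(np.isclose(np.sin(a), np.sin(b)))  # True — sine alone collides
print(np.isclose(np.cos(a), np.cos(b)))  # False — cosine tells them apart
# The (sin, cos) pair places each angle at a unique point on the unit circle.
```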
The Key Insight: Linear Transformability
Here’s the property that makes sinusoidal encoding special. For any fixed offset k, the encoding at position pos + k can be written as a linear transformation of the encoding at position pos. The model doesn’t need to memorize “position 7 relates to position 5” — it can learn the general pattern “2 positions apart.”
The proof uses the angle addition identities. For a single frequency ω:
- sin((pos+k)ω) = sin(pos·ω)cos(k·ω) + cos(pos·ω)sin(k·ω)
- cos((pos+k)ω) = cos(pos·ω)cos(k·ω) − sin(pos·ω)sin(k·ω)
In matrix form, that’s a rotation:
PE(pos+k) = M_k · PE(pos), where M_k = [[cos(kω), sin(kω)], [−sin(kω), cos(kω)]]
The matrix M_k depends only on the offset k, not on the absolute position. Let’s verify:
# Verify: PE(pos+k) = rotation_matrix(k) @ PE(pos)
pe = sinusoidal_encoding(20, 8)
omega = 1.0 / 10000 ** (0 / 8) # Frequency for dims 0,1 (omega = 1.0)
k = 3 # Offset of 3 positions
# Rotation matrix for offset k at this frequency
M_k = np.array([
    [ np.cos(k * omega), np.sin(k * omega)],
    [-np.sin(k * omega), np.cos(k * omega)]
])
# Take PE at position 4, dims [0,1]
pe_4 = pe[4, :2] # [sin(4), cos(4)]
pe_7_actual = pe[7, :2] # [sin(7), cos(7)]
pe_7_computed = M_k @ pe_4 # Should equal pe[7, :2]
print(f"PE(7) actual: {np.round(pe_7_actual, 6)}")
print(f"M_3 @ PE(4):  {np.round(pe_7_computed, 6)}")
print(f"Match: {np.allclose(pe_7_actual, pe_7_computed)}")
# PE(7) actual: [ 0.656987 0.753902]
# M_3 @ PE(4):  [ 0.656987 0.753902]
# Match: True
Why does this matter? A neural network layer is a linear transformation followed by a nonlinearity. If relative position is a linear function of absolute position, the model can learn to detect “3 tokens apart” with a single weight matrix — no matter where in the sequence those tokens appear.
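The verification above covered only the first dimension pair. Assembling one 2×2 rotation per frequency into a block-diagonal matrix checks the property across all dimensions at once (a sketch; sinusoidal_encoding is restated here so the snippet runs on its own):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

d, k = 8, 3
pe = sinusoidal_encoding(20, d)

# One 2x2 rotation block per frequency, placed on the diagonal
M = np.zeros((d, d))
for i, omega in enumerate(1.0 / 10000 ** (np.arange(0, d, 2) / d)):
    M[2*i:2*i+2, 2*i:2*i+2] = [[ np.cos(k*omega), np.sin(k*omega)],
                               [-np.sin(k*omega), np.cos(k*omega)]]

# A single matrix maps PE(pos) to PE(pos+k) at EVERY position
print(all(np.allclose(M @ pe[p], pe[p + k]) for p in range(17)))  # True
```

The same d×d matrix works at every position, which is exactly what lets one weight matrix detect "k tokens apart" anywhere in the sequence.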
Learned Position Embeddings — The GPT Way
The engineers behind GPT-2 and BERT looked at all that trigonometry and asked: what if we just let the model figure it out?
Learned position embeddings are exactly what they sound like — a lookup table of trainable vectors, one per position. Position 0 gets a learned vector, position 1 gets another, and so on up to some maximum sequence length. During training, gradient descent shapes these vectors into whatever patterns the model finds useful.
# Learned positional embeddings (GPT-2 / BERT style)
max_seq_len = 1024
d_model = 64
# Random initialization — training will shape these
position_embeddings = np.random.randn(max_seq_len, d_model) * 0.02
# Look up positions and add to token embeddings
def add_learned_positions(token_embeds, pos_embeds):
    """Add learned position vectors to token embeddings."""
    seq_len = token_embeds.shape[0]
    return token_embeds + pos_embeds[:seq_len]
# Example: 5 tokens, each with a 64-dim embedding
tokens = np.random.randn(5, d_model)
positioned = add_learned_positions(tokens, position_embeddings)
print(f"Token shape: {tokens.shape}") # (5, 64)
print(f"Positioned shape: {positioned.shape}") # (5, 64) — same shape, position baked in
That’s the entire implementation. No sine waves, no rotation matrices. It’s the same idea as word embeddings: just as each word gets a learned vector, each position gets a learned vector. During training, the model discovers that position 1’s vector should be somewhat similar to position 2’s (because adjacent tokens often relate) and very different from position 500’s.
Surprisingly, this works about as well as sinusoidal encoding. Researchers found that trained position embeddings often rediscover sinusoidal-like patterns on their own — gradient descent arrives at similar solutions to the hand-crafted formula.
But there’s a catch. GPT-2 was trained with a maximum sequence length of 1024. What happens when you feed it 2048 tokens? Position 1025 doesn’t exist in the lookup table. You’re out of bounds. Sinusoidal encoding has no such limit — it’s a formula, so you can compute it for any position. Learned embeddings are stuck at whatever max_seq_len was set during training.
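A quick sketch of that failure mode, reusing the names from the snippet above (the lookup simply runs out of rows, while the sinusoidal formula evaluates at any position):

```python
import numpy as np

max_seq_len, d_model = 1024, 64
position_embeddings = np.random.randn(max_seq_len, d_model) * 0.02

# A 2048-token input needs positions 0..2047; the table stops at 1023
try:
    position_embeddings[np.arange(2048)]
except IndexError as e:
    print("Learned table fails:", e)

# Sinusoidal encoding is just a formula — position 2047 is no problem
pos = 2047
div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
pe_2047 = np.empty(d_model)
pe_2047[0::2] = np.sin(pos / div_term)
pe_2047[1::2] = np.cos(pos / div_term)
print("Sinusoidal PE(2047) shape:", pe_2047.shape)  # (64,)
```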
There’s a deeper issue too. Both sinusoidal and learned embeddings encode absolute position — “I am at position 5.” But language mostly cares about relative position — “the adjective is 2 tokens before the noun.” The sentence “the big dog” should work the same way whether it starts at position 0 or position 500. Sinusoidal encoding at least has the linear transformability trick. Learned embeddings have no such guarantee.
We need something that speaks relative position natively. Enter the approach that conquered modern AI.
Rotary Position Embeddings (RoPE) — The Modern Standard
In 2021, Su et al. published a paper titled “RoFormer” that introduced a beautiful idea: instead of adding position information to embeddings, rotate the query and key vectors by an angle that depends on their position.
RoPE is now the positional encoding used by LLaMA, Mistral, Qwen, DeepSeek, and essentially every major open-source LLM. If you’re using a modern language model, you’re almost certainly using RoPE.
Here’s the core idea. Take a d-dimensional embedding and split it into d/2 pairs of adjacent dimensions. Each pair is a 2D vector. Now rotate that 2D vector by an angle proportional to the token’s position:
- Token at position 0: no rotation
- Token at position 1: rotate by angle θ_i for pair i
- Token at position m: rotate by m · θ_i for pair i
The rotation angle for dimension pair i uses the same base frequency as sinusoidal encoding:
θ_i = 1 / 10000^(2i/d)
Pair 0 rotates fast (high frequency), pair d/2−1 rotates slow (low frequency). Same multi-scale structure as sinusoidal encoding, but applied as a rotation instead of an addition.
The 2D rotation matrix for pair i at position m is:
R(m, i) = [[cos(mθ_i), −sin(mθ_i)], [sin(mθ_i), cos(mθ_i)]]
Let’s build it:
def apply_rope(x, positions):
    """
    Apply Rotary Position Embeddings to a (seq_len, d) matrix.
    x: shape (seq_len, d) — the vectors to rotate (Q or K)
    positions: shape (seq_len,) — position index for each token
    """
    seq_len, d = x.shape
    assert d % 2 == 0, "Dimension must be even"
    # Frequencies: theta_i = 1 / 10000^(2i/d)
    i = np.arange(d // 2)
    theta = 1.0 / (10000 ** (2 * i / d))  # (d/2,)
    # Angles: position * frequency
    angles = positions[:, np.newaxis] * theta[np.newaxis, :]  # (seq_len, d/2)
    cos_a = np.cos(angles)  # (seq_len, d/2)
    sin_a = np.sin(angles)  # (seq_len, d/2)
    # Split x into pairs: even dims and odd dims
    x_even = x[:, 0::2]  # (seq_len, d/2) — first element of each pair
    x_odd = x[:, 1::2]   # (seq_len, d/2) — second element of each pair
    # Apply 2D rotation to each pair
    x_rot_even = x_even * cos_a - x_odd * sin_a
    x_rot_odd = x_even * sin_a + x_odd * cos_a
    # Interleave back
    x_rot = np.empty_like(x)
    x_rot[:, 0::2] = x_rot_even
    x_rot[:, 1::2] = x_rot_odd
    return x_rot
# Example: rotate an 8-dim vector at positions 0, 1, 2
x = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]])
positions = np.array([0, 1, 2])
x_rot = apply_rope(x, positions)
print("Original (all identical):")
print(np.round(x[0], 3))
print("\nAfter RoPE (each row rotated by its position):")
for p in range(3):
    print(f"  Position {p}: {np.round(x_rot[p], 3)}")
# Position 0: [1. 0. 1. 0. 1. 0. 1. 0. ] — no rotation
# Position 1: [0.54 0.841 0.995 0.1 1. 0.01 1. 0.001] — small rotation
# Position 2: [-0.416 0.909 0.98 0.199 1. 0.02 1. 0.002] — more rotation
The same input vector gets rotated differently at each position. Position 0 is untouched. Position 1 gets a small nudge. Position 2 rotates further. And crucially, pair 0 (dimensions 0–1) rotates much faster than pair 3 (dimensions 6–7) — the multi-scale structure at work. You can see it clearly: pair 0 is already at [−0.416, 0.909] by position 2, while pair 3 has barely moved from [1, 0] to [1, 0.002].
One key difference from sinusoidal and learned embeddings: RoPE is applied to Q and K only, not to V. Why? Because position should affect which tokens attend to each other (the Q · K dot product that becomes attention weights), not what information gets passed through (the V values that get weighted and summed). The values carry semantic content; rotating them would distort meaning for no benefit.
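Where RoPE sits in the attention computation can be sketched end to end (a minimal single-head example; the random W_q, W_k, W_v projections are illustrative stand-ins for trained weights, and apply_rope is a compact restatement of the function above):

```python
import numpy as np

def apply_rope(x, positions):
    """Rotate each adjacent dimension pair by position * frequency."""
    seq_len, d = x.shape
    theta = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
    angles = positions[:, None] * theta[None, :]
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos_a - x[:, 1::2] * sin_a
    out[:, 1::2] = x[:, 0::2] * sin_a + x[:, 1::2] * cos_a
    return out

np.random.seed(0)
seq_len, d = 4, 8
tokens = np.random.randn(seq_len, d)                       # token embeddings
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))  # illustrative
positions = np.arange(seq_len)

# RoPE rotates Q and K — the vectors that set the attention pattern...
q = apply_rope(tokens @ W_q, positions)
k = apply_rope(tokens @ W_k, positions)
# ...but leaves V untouched — it carries the content being mixed
v = tokens @ W_v

scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
output = weights @ v
print(output.shape)  # (4, 8)
```

Note that position 0 is left unrotated (all angles are zero), so q[0] equals the raw projection (tokens @ W_q)[0].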
The Magic: Relative Position for Free
Here’s why RoPE won. When you compute the attention score qm · kn between a query at position m and a key at position n, the rotation angles subtract. The dot product depends only on the relative offset (m − n), not on the absolute positions.
Let’s prove it for a single 2D pair. After RoPE, the rotated query and key are:
- q′ = [q1cos(mθ) − q2sin(mθ), q1sin(mθ) + q2cos(mθ)]
- k′ = [k1cos(nθ) − k2sin(nθ), k1sin(nθ) + k2cos(nθ)]
The dot product q′ · k′ expands (using the cosine subtraction identity) to:
q′ · k′ = (q1k1 + q2k2)cos((m−n)θ) + (q1k2 − q2k1)sin((m−n)θ)
The result depends on (m − n) — the relative position — not on m or n individually. Shift both positions by any constant c: cos((m+c−n−c)θ) = cos((m−n)θ). The dot product is invariant to absolute position.
Let’s verify this with code:
# Prove: RoPE dot product depends on RELATIVE position only
np.random.seed(7)
d = 8
q = np.random.randn(1, d) # A query vector
k = np.random.randn(1, d) # A key vector
# Case 1: query at position 2, key at position 5 (offset = 3)
q_rot_2 = apply_rope(q, np.array([2]))
k_rot_5 = apply_rope(k, np.array([5]))
dot_case1 = (q_rot_2 @ k_rot_5.T)[0, 0]
# Case 2: query at position 10, key at position 13 (offset = 3)
q_rot_10 = apply_rope(q, np.array([10]))
k_rot_13 = apply_rope(k, np.array([13]))
dot_case2 = (q_rot_10 @ k_rot_13.T)[0, 0]
# Case 3: query at position 100, key at position 103 (offset = 3)
q_rot_100 = apply_rope(q, np.array([100]))
k_rot_103 = apply_rope(k, np.array([103]))
dot_case3 = (q_rot_100 @ k_rot_103.T)[0, 0]
print(f"Positions (2, 5), offset 3: dot = {dot_case1:.6f}")
print(f"Positions (10, 13), offset 3: dot = {dot_case2:.6f}")
print(f"Positions (100,103),offset 3: dot = {dot_case3:.6f}")
print(f"All equal: {np.allclose(dot_case1, dot_case2) and np.allclose(dot_case2, dot_case3)}")
# Positions (2, 5), offset 3: dot = 0.349969
# Positions (10, 13), offset 3: dot = 0.349969
# Positions (100,103),offset 3: dot = 0.349969
# All equal: True
Same relative offset, same dot product. It doesn’t matter whether those tokens appear at the start, middle, or end of a 100,000-token document. The model learns “3 positions apart” once and it works everywhere.
Why is this better than sinusoidal? Sinusoidal encoding can capture relative position through its linear transformability, but the model has to learn to do so — it needs to learn the rotation matrix M_k. With RoPE, relative position falls directly out of the geometry of the dot product. It’s baked into the math, not something the model must discover during training.
Side-by-Side Comparison
| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position type | Absolute | Absolute | Relative |
| Learned? | No (deterministic) | Yes (fully trained) | No (deterministic) |
| Max sequence length | Unlimited | Fixed at training time | Unlimited |
| Applied to | Input embeddings (add) | Input embeddings (add) | Q and K only (rotate) |
| Relative position | Indirectly (learnable) | No guarantee | Natively geometric |
| Used by | Original Transformer (2017) | GPT-2, BERT (2018–2019) | LLaMA, Mistral (2021+) |
Each scheme solves a real problem. Sinusoidal encoding gave transformers a way to see position without any training. Learned embeddings let the model discover its own patterns. RoPE combined the best of both worlds: deterministic computation (no extra parameters), unlimited sequence length, and native relative position awareness.
There are other approaches we won’t cover in detail — ALiBi (adding a linear bias to attention scores based on distance) and T5’s learned relative position bias. But for autoregressive language models, RoPE is the clear winner and the one worth understanding deeply.
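For a flavor of ALiBi, its bias can be sketched in a few lines (an illustrative single-head version; the slope of 0.5 is arbitrary here, whereas the real method assigns each head a slope from a geometric series):

```python
import numpy as np

np.random.seed(1)
seq_len, slope = 5, 0.5                      # slope is per-head in ALiBi
scores = np.random.randn(seq_len, seq_len)   # stand-in attention scores

# ALiBi: subtract slope * distance from each query-key score
i = np.arange(seq_len)
distance = i[:, None] - i[None, :]           # how far back each key sits
bias = -slope * np.maximum(distance, 0)      # future positions get masked anyway
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
biased = np.where(causal_mask, -np.inf, scores + bias)

# Each row penalizes distant keys linearly, building in a recency preference
print(np.round(bias, 2))
```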
Try It: Position Encoding Explorer
Interactive Demo: Encoding Patterns & Position Similarity
(Left panel: the encoding values for each position and dimension. Right panel: dot-product similarity between every pair of positions.)
The left heatmap shows the actual encoding values. For sinusoidal, you can see the characteristic wave pattern — low dimensions oscillate rapidly (the “second hand”), high dimensions change slowly (the “hour hand”). For RoPE, the pattern shows the rotation angles applied to each dimension pair.
The right panel is more revealing. It shows the dot-product similarity between every pair of positions. Notice the diagonal band structure — nearby positions have higher similarity than distant ones. This is the locality bias that position encoding gives the model: a natural tendency for tokens to attend more strongly to their neighbors.
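That diagonal band can be reproduced numerically (a sketch using the sinusoidal encoding built earlier, restated so the snippet runs on its own):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

pe = sinusoidal_encoding(16, 64)
sim = pe @ pe.T  # dot-product similarity between all pairs of positions

# Each position is most similar to itself (the bright diagonal)...
print(np.all(sim.argmax(axis=1) == np.arange(16)))  # True
# ...and nearby positions score higher than distant ones (the band)
print(sim[0, 1] > sim[0, 8])  # True
```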
The Pipeline So Far
We’ve now built every major piece of the transformer inference pipeline from scratch across this series:
tokenize → embed → position → attend → softmax → loss → optimize → decode
This post filled the critical gap. Without positional encoding, “dog bites man” and “man bites dog” are the same sentence to a transformer. Three lines of math — a sine, a cosine, and a rotation — separate a bag of words from a language model.
The remaining pieces of a full transformer (feed-forward layers, layer normalization, residual connections) are important but mechanically simpler. The hard conceptual work — turning text into numbers, giving those numbers position, letting them attend to each other, normalizing scores into probabilities, measuring error, optimizing weights, and finally selecting output tokens — is the journey we’ve taken from the first post to this one.
Next time you adjust a “temperature” slider or wonder why an LLM handles a 100K-token document, you’ll know exactly which piece of math is responsible.
References & Further Reading
- Vaswani et al. — “Attention Is All You Need” (2017) — The original Transformer paper that introduced sinusoidal positional encoding
- Su et al. — “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) — The paper that introduced RoPE
- Brown et al. — “Language Models are Few-Shot Learners” (GPT-3, 2020) — Used learned positional embeddings at scale
- Devlin et al. — “BERT: Pre-training of Deep Bidirectional Transformers” (2018) — Learned embeddings for bidirectional models
- EleutherAI — “Rotary Embeddings: A Relative Revolution” — Excellent deep dive into the math behind RoPE
- Press et al. — “Train Short, Test Long: Attention with Linear Biases” (ALiBi, 2021) — An alternative approach using linear distance biases