
Rotary Position Embeddings from Scratch: How Modern LLMs Know Where Words Are

Why Rotations Won

Open up the config file for LLaMA, Mistral, Phi, Gemma, Qwen, or DeepSeek. Every single one of them uses the same positional encoding: Rotary Position Embeddings, or RoPE. In the span of three years, RoPE went from an obscure paper to the undisputed standard for how large language models understand word order.

In our positional encoding post, we built the original sinusoidal encodings from "Attention Is All You Need" and explored learned position embeddings. Both approaches add position information to token embeddings. RoPE does something different: it rotates the query and key vectors inside attention. That single change gives us relative position information for free, no learned parameters, and graceful extension to context lengths far beyond what the model saw during training.

In this post, we'll build RoPE from scratch in NumPy. We'll start with 2D rotations, extend to full embedding dimensions, prove why relative positions emerge from dot products, and explore the scaling tricks that push context windows to 128K+ tokens. By the end, you'll understand the elegant geometry hiding inside every modern language model.

The Core Insight: Position as Rotation

Forget about embeddings for a moment. Think about two numbers on a plane: the pair (x, y). If you want to mark that this pair came from position m in a sequence, you could add a position vector (that's what sinusoidal encoding does), or you could rotate the point by an angle proportional to m.

The 2D rotation matrix does exactly this:

R(m, θ) = [ cos(mθ), −sin(mθ) ]
          [ sin(mθ),  cos(mθ) ]

Applied to a vector [x, y], this rotates it by angle m·θ. Position 0 gets no rotation (identity matrix). Position 1 gets a small rotation. Position 100 gets a large one. The magic happens when you ask: what does the dot product between two rotated vectors look like?

Take a query vector q at position m and a key vector k at position n. After rotating both:

q' · k' = (q₁k₁ + q₂k₂) cos((m−n)θ) + (q₁k₂ − q₂k₁) sin((m−n)θ)

Look at that result. The absolute positions m and n have vanished. Only the relative offset (m − n) remains. The model doesn't need to learn "position 5 attending to position 8" and "position 105 attending to position 108" as separate patterns — it learns "3 positions apart" once, and that knowledge works everywhere in the sequence.

Let's verify this with code:

import numpy as np

def rotate_2d(x, position, theta):
    """Rotate a 2D vector by angle position * theta."""
    cos_val = np.cos(position * theta)
    sin_val = np.sin(position * theta)
    return np.array([
        x[0] * cos_val - x[1] * sin_val,
        x[0] * sin_val + x[1] * cos_val
    ])

# Two content vectors (these stay fixed)
q = np.array([0.5, 0.8])
k = np.array([0.3, 0.6])
theta = 0.1

# Same relative offset (3), wildly different absolute positions
for m, n in [(2, 5), (10, 13), (100, 103), (9999, 10002)]:
    q_rot = rotate_2d(q, m, theta)
    k_rot = rotate_2d(k, n, theta)
    dot = np.dot(q_rot, k_rot)
    print(f"positions ({m:5d}, {n:5d}) -> dot product = {dot:.6f}")

# Output:
# positions (    2,     5) -> dot product = 0.584131
# positions (   10,    13) -> dot product = 0.584131
# positions (  100,   103) -> dot product = 0.584131
# positions ( 9999, 10002) -> dot product = 0.584131

All four dot products are identical — because the relative offset is always 3, regardless of where in the sequence these tokens sit. That's the core magic of RoPE. Rotation encodes position in a way that the dot product (the heart of attention) naturally extracts relative distance.

From 2D to d Dimensions

A real embedding has many dimensions, not just two. RoPE handles this by pairing up dimensions: (d₀, d₁), (d₂, d₃), ..., through the last pair. Each pair gets rotated independently, but at a different frequency:

θᵢ = 10000^(−2i/d)     for i = 0, 1, ..., d/2 − 1

This is the same frequency scheme as the original sinusoidal positional encoding from "Attention Is All You Need" — but instead of adding sines and cosines to the embedding, we use them to rotate pairs of dimensions.

The different frequencies create a multi-scale position encoding. The first pair (i=0) rotates quickly — θ₀ = 1.0, completing a full rotation every ~6.28 positions. This captures fine-grained local position. The last pair rotates extremely slowly — its cycle spans tens of thousands of positions, approaching 2π · 10,000 ≈ 62,800 as d grows. This captures coarse-grained global position. Together, the pairs form a unique positional signature at every position, like the hands of a multi-armed clock.

import numpy as np

def rope_frequencies(d_model, base=10000.0):
    """Compute rotation frequencies for each dimension pair."""
    i = np.arange(0, d_model, 2, dtype=np.float64)
    return 1.0 / (base ** (i / d_model))

def apply_rope(x, position, freqs):
    """Apply RoPE to a d-dimensional vector at a given position.

    Uses the half-split convention: dimension i is paired with dimension
    i + d/2, and each pair is rotated by position * freq_i. This is
    equivalent to rotating consecutive pairs, up to a fixed permutation
    of the dimensions.
    """
    d = x.shape[-1]
    angles = position * freqs               # (d/2,)
    cos_a = np.cos(angles)
    sin_a = np.sin(angles)
    x1, x2 = x[..., :d//2], x[..., d//2:]  # x1[i] pairs with x2[i]
    return np.concatenate([
        x1 * cos_a - x2 * sin_a,
        x2 * cos_a + x1 * sin_a
    ], axis=-1)

# Verify: relative position property holds in d dimensions
d_model = 64
freqs = rope_frequencies(d_model)
q = np.random.randn(d_model)
k = np.random.randn(d_model)

for m, n in [(2, 5), (50, 53), (1000, 1003)]:
    q_rot = apply_rope(q, m, freqs)
    k_rot = apply_rope(k, n, freqs)
    print(f"positions ({m:4d}, {n:4d}) -> dot = {np.dot(q_rot, k_rot):.6f}")

# All three outputs are identical (relative offset = 3)

The implementation is remarkably efficient. We split the vector in half, apply element-wise cosines and sines, and recombine. No matrix multiplication needed — the block-diagonal structure of the rotation means each pair is independent. The computational overhead is around 1-3% on top of standard attention.

The Complex Number Perspective

There's an even more elegant way to see RoPE. Treat each pair of dimensions as a single complex number: z = x₁ + ix₂. Rotating by angle θ in the complex plane is just multiplication by e^(iθ):

RoPE(z, m) = z · e^(imθ)

That's it. The entire RoPE operation is element-wise complex multiplication. And the relative position proof becomes a one-liner: the complex inner product conjugates the key, so the position factors combine as e^(imθ) · e^(−inθ) = e^(i(m−n)θ). Only the relative offset survives.

This is actually how LLaMA implements RoPE in practice: reshape Q and K into complex tensors, precompute the rotation factors e^(imθ) for all positions, multiply, and reshape back to real.

def apply_rope_complex(x, position, freqs):
    """Apply RoPE using complex arithmetic — equivalent to rotation matrix."""
    d = x.shape[-1]
    # Half-split pairing: x[k] + 1j * x[k + d//2], matching apply_rope above
    x_complex = x[..., :d//2] + 1j * x[..., d//2:]
    # Rotation = multiplication by e^{i * position * freq}
    angles = position * freqs
    rotation = np.exp(1j * angles)
    x_rotated = x_complex * rotation
    # Unpack back to real
    return np.concatenate([x_rotated.real, x_rotated.imag], axis=-1)

# Verify: complex version gives identical results to rotation matrix version
d_model = 64
freqs = rope_frequencies(d_model)
x = np.random.randn(d_model)
pos = 42

result_matrix = apply_rope(x, pos, freqs)
result_complex = apply_rope_complex(x, pos, freqs)
print(f"Max difference: {np.max(np.abs(result_matrix - result_complex)):.2e}")
# Output: Max difference: 0.00e+00

Both implementations produce bit-identical results. The complex version is conceptually cleaner; the rotation matrix version avoids complex number support in frameworks that don't have it. Pick whichever makes more sense to you — the math is the same.

RoPE Inside Attention

Where exactly does RoPE fit in a transformer layer? It rotates the query and key vectors after the linear projections but before the dot product. Crucially, it does not touch the value vectors — position should affect which tokens attend to each other (Q·K), not what information flows through (V).

def attention_with_rope(X, W_q, W_k, W_v, freqs):
    """Scaled dot-product attention with RoPE.

    X: (seq_len, d_model) — input embeddings (no positional encoding added!)
    """
    seq_len = X.shape[0]
    Q = X @ W_q                      # (seq_len, d_head)
    K = X @ W_k
    V = X @ W_v

    # Apply RoPE to Q and K (not V!)
    for pos in range(seq_len):
        Q[pos] = apply_rope(Q[pos], pos, freqs)
        K[pos] = apply_rope(K[pos], pos, freqs)

    # Standard scaled dot-product attention
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    weights = softmax(scores)         # softmax along last axis
    return weights @ V                # (seq_len, d_head)

def softmax(x):
    """Numerically stable softmax along last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Compare this to standard attention: the only difference is those two apply_rope lines. No positional embedding is added to the input — position information enters entirely through the rotation of Q and K. This also plays nicely with the KV cache: since RoPE is applied to K before it enters the cache, position is "baked in" permanently. During autoregressive decoding, cached keys never need re-rotation.
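Here's a minimal sketch of that interaction (a toy single-head decode loop with made-up helper names; the rotation reuses the same half-split scheme as apply_rope above). Each key is rotated once, at insert time, and is never touched again:

```python
import numpy as np

def rope(x, pos, freqs):
    """Half-split RoPE rotation, as in apply_rope above."""
    d = x.shape[-1]
    a = pos * freqs
    x1, x2 = x[:d//2], x[d//2:]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x2 * np.cos(a) + x1 * np.sin(a)])

d = 8
freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
k_cache = []  # keys are rotated ONCE, at their own position, then cached

def decode_step(q_new, k_new, pos):
    k_cache.append(rope(k_new, pos, freqs))  # position baked in at insert time
    q_rot = rope(q_new, pos, freqs)          # only the new query is rotated
    K = np.stack(k_cache)
    return (K @ q_rot) / np.sqrt(d)          # logits against all cached keys

rng = np.random.default_rng(0)
for pos in range(4):
    logits = decode_step(rng.standard_normal(d), rng.standard_normal(d), pos)
print(logits.shape)  # (4,): one score per cached key, no re-rotation needed
```

Because the dot product only cares about the relative offset, the rotation baked into each cached key stays valid no matter how far decoding proceeds.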

The rotation also creates a natural locality bias. For tokens with similar content vectors, the attention score decays as the distance between them grows — because the rotation angle difference increases, pushing their dot product toward zero. Nearby tokens attend to each other more strongly by default, which matches the intuition that local context usually matters most.
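We can watch this decay directly: give the query and key identical content (so content similarity is maximal) and sweep the offset. A quick sketch:

```python
import numpy as np

d = 64
freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))

def rope(x, pos):
    """Half-split RoPE rotation, as in apply_rope above."""
    a = pos * freqs
    x1, x2 = x[:d//2], x[d//2:]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x2 * np.cos(a) + x1 * np.sin(a)])

q = np.ones(d)  # identical content for query and key
offsets = [0, 1, 2, 4, 8, 16]
scores = [np.dot(rope(q, 0), rope(q, off)) for off in offsets]
for off, s in zip(offsets, scores):
    print(f"offset {off:2d} -> score {s:8.3f}")
```

At offset 0 the score is simply ||q||² = 64; as the offset grows, the high-frequency pairs dephase and the score falls away from that maximum.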

Scaling Beyond Training Length

RoPE has a weakness: at positions beyond what the model saw during training, the rotation angles become unfamiliar. A model trained on 4,096 tokens has never encountered the rotation corresponding to position 8,000. The high-frequency dimension pairs are especially vulnerable — they complete full rotations and "wrap around," creating ambiguous position signals.

Three major techniques have emerged to address this:

Position Interpolation (PI)

The simplest approach: squeeze all positions into the original training range. If the model was trained on length L and you want to use length L', scale each position by L/L'. Position 8,000 in an 8K context becomes position 4,000 in the model's reference frame.

def rope_frequencies_pi(d_model, base=10000.0, scale=1.0):
    """Position Interpolation: scale down positions to fit training range.

    scale = original_length / target_length (e.g., 4096/8192 = 0.5)
    """
    i = np.arange(0, d_model, 2, dtype=np.float64)
    freqs = 1.0 / (base ** (i / d_model))
    return freqs * scale  # uniformly reduce all frequencies

It works, but compresses all frequencies equally — including the high-frequency pairs that distinguish nearby tokens. Short-range precision suffers.

NTK-Aware Scaling

Inspired by Neural Tangent Kernel theory, this approach modifies the base frequency instead of the positions. The key insight: high-frequency components (which encode local position) should be preserved, while low-frequency components (which encode global position) should be stretched to accommodate the longer context.

def rope_frequencies_ntk(d_model, base=10000.0, scale_factor=2.0):
    """NTK-aware scaling: modify the base to extend context.

    scale_factor = target_length / original_length
    """
    base_scaled = base * (scale_factor ** (d_model / (d_model - 2)))
    i = np.arange(0, d_model, 2, dtype=np.float64)
    return 1.0 / (base_scaled ** (i / d_model))

By scaling the base rather than the positions, NTK scaling naturally preserves high frequencies (small dimensions) while stretching low frequencies (large dimensions). This can work without any fine-tuning at inference time.
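To make the contrast concrete, here's a quick side-by-side of the two frequency schedules for a 2x extension (same formulas as the two functions above): PI halves every frequency uniformly, while the NTK-scaled base leaves the fastest pair untouched and compresses the slowest one by the full factor.

```python
import numpy as np

d = 64
i = np.arange(0, d, 2, dtype=np.float64)
base = 10000.0
scale = 2.0  # doubling the context

orig = 1.0 / (base ** (i / d))
pi   = orig / scale                                        # uniform compression
ntk  = 1.0 / ((base * scale ** (d / (d - 2))) ** (i / d))  # scaled base

print(f"fastest pair: orig={orig[0]:.4f}   PI={pi[0]:.4f}   NTK={ntk[0]:.4f}")
print(f"slowest pair: orig={orig[-1]:.2e}  PI={pi[-1]:.2e}  NTK={ntk[-1]:.2e}")
```

The algebra works out neatly: at i = 0 the exponent is zero, so the scaled base has no effect, and at i = d − 2 the exponents cancel to exactly 1/scale.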

YaRN (Yet Another RoPE Extension)

YaRN combines the best of both worlds with a piecewise strategy: leave high frequencies completely alone, fully interpolate low frequencies, and smoothly blend in between. It also adds an attention temperature correction to compensate for the changed frequency distribution.

def rope_frequencies_yarn(d_model, base=10000.0, scale_factor=2.0,
                           original_max_len=4096, alpha=1, beta=32):
    """YaRN: piecewise frequency scaling with temperature correction.

    - High-freq dimensions (local position): unchanged
    - Low-freq dimensions (global position): fully interpolated
    - Mid-freq dimensions: smooth blend between the two
    """
    i = np.arange(0, d_model, 2, dtype=np.float64)
    freqs = 1.0 / (base ** (i / d_model))

    # For each frequency, compute how many rotations it completes
    # in the original training context
    wavelengths = 2 * np.pi / freqs
    ratios = original_max_len / wavelengths

    # Piecewise ramp function
    gamma = np.clip((ratios - alpha) / (beta - alpha), 0.0, 1.0)

    # Blend: gamma=0 -> fully interpolated, gamma=1 -> unchanged
    freqs_scaled = freqs / scale_factor         # PI frequencies
    freqs_yarn = gamma * freqs + (1 - gamma) * freqs_scaled

    # Attention temperature correction
    temperature = 0.1 * np.log(scale_factor) + 1.0

    return freqs_yarn, temperature

YaRN needs roughly 10x fewer fine-tuning tokens than Position Interpolation to adapt, and it extends context to 128K+ tokens with minimal quality loss. It's what Qwen, DeepSeek, and many other production models use today.
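To see the piecewise split in numbers, we can evaluate the ramp on its own (same formula as in rope_frequencies_yarn; the exact counts depend on d, the base, and the ramp bounds alpha and beta):

```python
import numpy as np

d, base, original_max_len, alpha, beta = 128, 10000.0, 4096, 1, 32
i = np.arange(0, d, 2, dtype=np.float64)
freqs = 1.0 / (base ** (i / d))
wavelengths = 2 * np.pi / freqs
ratios = original_max_len / wavelengths   # rotations completed in training
gamma = np.clip((ratios - alpha) / (beta - alpha), 0.0, 1.0)

print(f"untouched pairs (gamma=1):    {np.sum(gamma == 1.0)}")
print(f"interpolated pairs (gamma=0): {np.sum(gamma == 0.0)}")
print(f"blended pairs (0<gamma<1):    {np.sum((gamma > 0) & (gamma < 1))}")
```

The fastest pairs complete hundreds of rotations within the training context, so they land at gamma = 1 and are left alone; the slowest pairs never finish a single rotation and are fully interpolated; everything in between gets a smooth blend.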

The RoPE Landscape: Who Uses What

RoPE has become so dominant that the interesting question isn't "who uses RoPE" but "what base frequency and scaling strategy." Here's the landscape as of early 2026:

Model          Base θ       Context    Scaling
LLaMA 2        10,000       4,096      Default
LLaMA 3        500,000      8,192      Increased base
LLaMA 3.1      500,000      131,072    Custom scaling
Code LLaMA     1,000,000    16,384+    Increased base
Mistral 7B     10,000       32,768     Sliding window
Gemma 2        10,000       8,192      Default
Qwen 2.5       1,000,000    131,072    YaRN
DeepSeek-V3    Varies       128,000+   YaRN
Notice the trend: newer models use vastly larger base frequencies (500K or 1M vs the original 10K). A larger base slows down the rotation speeds across all dimension pairs, which directly extends the range of positions the model can distinguish without ambiguity.
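A quick sketch makes the effect concrete: the slowest pair's wavelength, i.e. the largest offset it can represent before wrapping, grows almost linearly with the base.

```python
import numpy as np

d = 128
wavelengths = []
for base in [10_000, 500_000, 1_000_000]:
    slowest_freq = 1.0 / (base ** ((d - 2) / d))  # last dimension pair
    wavelength = 2 * np.pi / slowest_freq         # positions per full rotation
    wavelengths.append(wavelength)
    print(f"base {base:>9,}: slowest wavelength ~ {wavelength:>12,.0f} positions")
```

Going from base 10K to 1M stretches the slowest cycle from tens of thousands of positions into the millions, which is why raising rope_theta is the first lever models pull for longer contexts.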

The Road Not Taken: ALiBi

Not everyone chose rotation. ALiBi (Attention with Linear Biases) takes a fundamentally different approach: instead of encoding position in the Q/K vectors, it adds a linear penalty −m·|i−j| directly to the attention scores after the dot product. Each attention head gets a different slope m, creating multi-scale distance awareness. No learned parameters, excellent extrapolation to unseen lengths.
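For comparison, here's a minimal sketch of the ALiBi bias itself (the slopes follow the paper's geometric schedule for a power-of-two head count; the sign convention here assumes causal attention, where j ≤ i):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """ALiBi: per-head linear distance penalty added to attention scores."""
    # Geometric slope schedule: 2^(-8/n), 2^(-16/n), ... for n heads
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    # distance[i, j] = i - j, i.e. how far back key j sits from query i
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]
    return -slopes[:, None, None] * distance[None, :, :]  # (heads, seq, seq)

bias = alibi_bias(seq_len=5, num_heads=8)
print(bias.shape)     # (8, 5, 5)
print(bias[0, 4, :])  # penalty grows linearly with distance from token 4
```

Note there is nothing to learn and nothing applied to Q or K: the bias matrix is added to the score grid after the dot product, which is exactly why its position signal never reaches the value pathway.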

ALiBi is used in BLOOM and some MPT models, and it extrapolates beautifully out of the box. But RoPE won the adoption war for a key reason: it encodes position within the embedding space, giving the model richer positional representations that interact with content in more expressive ways. ALiBi's position signal only exists in the attention scores — by the time values are aggregated, position information has been reduced to a scalar weight per token pair.

Beyond 1D: RoPE for Vision

RoPE isn't limited to sequences. For vision transformers, image patches have 2D positions (row, column). Axial RoPE splits the head dimensions in half: the first half encodes the x-position, the second half encodes the y-position. Qwen2-VL takes this further with M-RoPE (Multimodal RoPE), decomposing into three components — temporal, height, width — to handle text, images, and video in a unified framework.
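Here's a minimal sketch of the axial idea (a toy helper of my own, not Qwen2-VL's implementation): half the head dimensions rotate by the row index and the other half by the column index, so patches sharing a row get identical rotations in the row half.

```python
import numpy as np

def rope_half(x, pos, freqs):
    """Half-split RoPE rotation, as in apply_rope above."""
    d = x.shape[-1]
    a = pos * freqs
    x1, x2 = x[:d//2], x[d//2:]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x2 * np.cos(a) + x1 * np.sin(a)])

def axial_rope_2d(x, row, col, base=10000.0):
    """Axial 2D RoPE: first half of dims encodes row, second half encodes col."""
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (base ** (np.arange(0, half, 2) / half))
    return np.concatenate([rope_half(x[:half], row, freqs),
                           rope_half(x[half:], col, freqs)])

# Patches on the same row differ only in their column rotation
d = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
a = axial_rope_2d(x, row=3, col=7)
b = axial_rope_2d(x, row=3, col=9)
print(np.allclose(a[:d//2], b[:d//2]))  # True: row halves match exactly
```

The same relative-position property then holds independently along each axis, which is what makes the decomposition into temporal, height, and width components in M-RoPE possible.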

Try It: Rotation Explorer

Drag the position sliders to rotate the query (blue) and key (red) vectors. Watch how the dot product (attention score) depends only on relative position — not absolute positions.


Try It: Frequency Band Visualizer

Each dimension pair rotates at a different frequency. High-frequency pairs (red) encode local position; low-frequency pairs (blue) encode global position. Toggle scaling strategies to see how they modify frequencies for longer contexts.

The Geometry of Knowing Where You Are

Rotary Position Embeddings are one of those ideas that make you appreciate mathematical elegance in engineering. The core mechanism is simple — rotate pairs of dimensions by an angle proportional to position. From that single operation, three powerful properties emerge for free: relative position information through the dot product, no learned parameters, and natural extension to longer sequences through frequency scaling.

RoPE is now the default positional encoding in every major open-source LLM. When you run a model locally and wonder why it handles 8K tokens but struggles at 32K, or why increasing rope_theta helps — now you know. It's all about how fast the vectors rotate, and whether the model has seen those rotation angles before.

The next time you look at a transformer architecture diagram, remember: every query and key vector is quietly spinning in high-dimensional space, and the angle between them is how the model knows that "the" came three words before "cat."
