Positional Encoding: Teaching Transformers to Count
Part 2 of 4: Making Order Matter
TL;DR: Attention is permutation equivariant — shuffle the tokens and you get the same output (shuffled). Without position information, “the cat sat on the mat” and “mat the on sat cat the” are identical to the model. This article derives the main approaches to fixing this: sinusoidal encoding (fixed, infinite extrapolation in theory), learned embeddings (flexible, bounded), and Rotary Position Embeddings (RoPE — the modern winner, encoding relative position directly into Q and K via rotation matrices), with ALiBi covered as a score-bias alternative. Full tested implementation at rlvr-from-scratch.
Prerequisites: Part 1: Attention Is All You Need to Implement covers scaled dot-product attention and multi-head attention.
The Position Problem
In Part 1 we built attention from scratch. It works — but it has a fundamental gap.
Consider two inputs:
Input A: ["The", "cat", "sat", "on", "the", "mat"]
Input B: ["mat", "the", "on", "sat", "cat", "The"]
Feed both through our multi-head attention module. The attention weights will differ (different tokens in different positions means different Q, K, V vectors). But here’s the problem: if you permute both the input and the output in the same way, you get the same result. Attention treats its input as a set, not a sequence.
Formally, for any permutation $\pi$ of the token positions:

$$\text{Attention}(\pi X) = \pi \, \text{Attention}(X)$$
This is called permutation equivariance. It means attention has no concept of “first”, “second”, “last”. Token 0 and token 99 are processed identically — there’s nothing in the computation that distinguishes position.
For language, this is catastrophic. “The dog bit the man” and “The man bit the dog” have the same tokens. Without position, attention can’t tell them apart.
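A quick numerical check makes the equivariance concrete. This is an illustrative sketch — a toy single-head self-attention with Q = K = V = x, not the full Part 1 module:

```python
import torch

def toy_attention(x: torch.Tensor) -> torch.Tensor:
    # Minimal self-attention with Q = K = V = x and no position signal
    scores = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 8)        # 6 tokens, d_model = 8
perm = torch.randperm(6)     # an arbitrary shuffle of the tokens

out_shuffled = toy_attention(x)[perm]   # attend first, shuffle after
shuffled_out = toy_attention(x[perm])   # shuffle first, attend after

# Permutation equivariance: the two orders agree
same = torch.allclose(out_shuffled, shuffled_out, atol=1e-6)
```

Shuffling before or after attention produces identical results: the computation carries no notion of order.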
We need to inject position information. The question is how.
Sinusoidal Positional Encoding
The Original Approach
Vaswani et al. (2017) proposed adding a fixed signal to each token’s embedding. The signal varies with position and dimension, using sine and cosine functions at different frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position in the sequence and $i$ indexes the dimension pairs.
Why sin and cos? Why 10000? Why this specific formula?
The Frequency Intuition
Think of each dimension pair as a clock running at a different speed.
- Dimensions 0, 1: a fast clock — cycles every few positions
- Dimensions $d-2$, $d-1$: a slow clock — cycles over thousands of positions
The wavelength of dimension pair $i$ is:

$$\lambda_i = 2\pi \cdot 10000^{2i/d_{\text{model}}}$$

For $d_{\text{model}} = 256$:
| Dimension pair | Wavelength | What it captures |
|---|---|---|
| 0, 1 | $2\pi \approx 6.3$ tokens | Very local position (every ~6 tokens) |
| 128, 129 | $\approx 628$ tokens | Paragraph-level position |
| 254, 255 | $\approx 58{,}000$ tokens | Document-level position |
The model gets a multi-scale position signal. Low dimensions distinguish nearby tokens. High dimensions distinguish distant tokens.
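The wavelengths in the table follow directly from the formula. A quick sanity check, assuming $d_{\text{model}} = 256$ as above:

```python
import math

d_model = 256

def wavelength(pair: int) -> float:
    # Dimension pair `pair` covers dims (2*pair, 2*pair + 1);
    # its wavelength in positions is 2π · 10000^(2i / d_model)
    return 2 * math.pi * 10000 ** (2 * pair / d_model)

local = wavelength(0)      # dims 0, 1: about 6.3 positions
mid = wavelength(64)       # dims 128, 129: about 628 positions
global_ = wavelength(127)  # dims 254, 255: tens of thousands of positions
```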
Why Sin/Cos Pairs?
The key property: for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear transformation of $PE_{pos}$.

$$\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$$

where $\omega_i = 1 / 10000^{2i/d_{\text{model}}}$.

This is a rotation matrix. Moving from position $pos$ to position $pos+k$ is a rotation by angle $\omega_i k$ — the same rotation regardless of $pos$. This means the model can learn to attend to relative positions: “the token 3 positions back” is always the same transformation.
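The rotation property is easy to verify numerically for one frequency. A small sketch (the frequency, position, and offset values are arbitrary):

```python
import math
import torch

omega = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of an arbitrary dimension pair
pos, k = 17.0, 5.0                    # any position and any offset

def pe(p: float) -> torch.Tensor:
    # The (sin, cos) pair for this frequency at position p
    return torch.tensor([math.sin(omega * p), math.cos(omega * p)])

# Rotation matrix determined by the offset k alone, independent of pos
M = torch.tensor([
    [math.cos(omega * k),  math.sin(omega * k)],
    [-math.sin(omega * k), math.cos(omega * k)],
])

matches = torch.allclose(M @ pe(pos), pe(pos + k), atol=1e-6)
```

The same matrix `M` maps any position’s encoding to the encoding $k$ steps later.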
The Implementation
```python
import math

import torch


class SinusoidalPositionalEncoding(torch.nn.Module):
    """
    Fixed sinusoidal positional encoding from "Attention Is All You Need".
    No learnable parameters — the encoding is deterministic.

    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length to precompute.
    """

    def __init__(self, d_model: int, max_len: int = 8192):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairs"

        # =========================================
        # Precompute encoding matrix: (max_len, d_model)
        # =========================================
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)

        # Frequency for each dimension pair
        # div_term = 1 / 10000^(2i / d_model) = exp(-2i * log(10000) / d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model / 2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        # Register as buffer (not a parameter, but saved with model)
        # Shape: (1, max_len, d_model) for broadcasting over batch
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, T, d_model)
        Returns:
            x + positional encoding: (B, T, d_model)
        """
        # =========================================
        # Add position signal to input embeddings
        # =========================================
        # self.pe: (1, max_len, d_model) -> slice to (1, T, d_model)
        return x + self.pe[:, :x.size(1), :]
```
Key implementation detail: We compute the division term in log-space (`exp(-2i * log(10000) / d_model)`) instead of raising 10000 to a fractional power directly. The two forms are mathematically identical; the log-space version is the numerically stable convention used by reference implementations.
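As a sanity check, the log-space term matches the direct power computation. A small sketch with an arbitrary `d_model`:

```python
import math
import torch

d_model = 64
two_i = torch.arange(0, d_model, 2).float()   # the even indices 2i

# Log-space form used in the module above
log_space = torch.exp(two_i * (-math.log(10000.0) / d_model))
# Direct form: 1 / 10000^(2i / d_model)
direct = 1.0 / (10000.0 ** (two_i / d_model))

agree = torch.allclose(log_space, direct, atol=1e-7)
```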
Strengths and Limitations
Strengths:
- No parameters — nothing to learn, nothing to overfit
- Extrapolation — in theory, can handle any sequence length (the frequencies are defined for all positions)
- Relative position encoded — the linear transformation property
Limitations:
- Extrapolation doesn’t actually work well — while the math defines the encoding at every position, models trained on sequences of length $L$ degrade at lengths beyond $L$ in practice
- Fixed — can’t adapt to the task
- Additive injection — position and content share the same representation space, which can create interference
Key Insight: Sinusoidal encoding turns position into a multi-frequency signal. Each dimension pair is a clock at a different speed. The sin/cos pairing enables relative position through rotation — a property that RoPE later exploits much more directly.
Learned Positional Embeddings
The Simplest Approach
Why compute a fixed formula when you can just learn the position representations?
An embedding table. Position 0 gets one learned vector, position 1 gets another, and so on. This is what GPT-2 and BERT use.
class LearnedPositionalEmbedding(torch.nn.Module):
"""
Learned positional embedding — a lookup table.
Args:
max_len: Maximum sequence length.
d_model: Model dimension.
"""
def __init__(self, max_len: int, d_model: int):
super().__init__()
self.embedding = torch.nn.Embedding(max_len, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: (B, T, d_model)
Returns:
x + position embeddings: (B, T, d_model)
"""
T = x.size(1)
# =========================================
# Create position indices and look up embeddings
# =========================================
positions = torch.arange(T, device=x.device) # (T,)
pos_emb = self.embedding(positions) # (T, d_model)
return x + pos_emb # broadcast over batch
That’s it: an `nn.Embedding` lookup table with max_len × d_model learnable parameters.
When Learned Beats Sinusoidal
In practice, learned embeddings often perform slightly better than sinusoidal for fixed-length tasks — the model can learn position representations optimized for the actual data distribution.
But there’s a hard ceiling: the model has no embedding for positions beyond max_len. If you train with max_len = 512 and try to process a 1,024-token sequence, you crash. You’d need to either truncate or retrain.
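The failure mode is easy to reproduce. A sketch with hypothetical sizes (max_len = 512, d_model = 64):

```python
import torch

emb = torch.nn.Embedding(512, 64)   # rows exist for positions 0..511 only
positions = torch.arange(1024)      # a 1,024-token sequence

crashed = False
try:
    emb(positions)                  # looks up rows 512..1023, which don't exist
except IndexError:
    crashed = True
```

The lookup raises an `IndexError` — there is no graceful degradation, only a hard boundary.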
Comparison
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 | max_len × d_model |
| Extrapolation | Theoretically yes, practically weak | No — hard crash beyond max_len |
| Adaptability | None | Task-specific |
| Relative position | Encoded via rotation property | Not explicitly encoded |
| Used by | Original Transformer | GPT-2, BERT |
Both approaches share a fundamental limitation: they add position to the token embedding, mixing content and position in the same vector. What if we could encode position without this interference?
Key Insight: Learned embeddings trade generality for expressiveness. They work well within their trained range but cannot extrapolate. For modern LLMs that need long-context generalization, this is a dealbreaker.
Rotary Position Embeddings (RoPE)
The Modern Winner
RoPE (Su et al., 2021) is used by LLaMA, Qwen, Mistral, and most modern open-weight LLMs. Instead of adding position to the embedding, RoPE rotates the query and key vectors by a position-dependent angle. The attention score between two tokens then naturally depends on their relative position.
This is the key shift: position goes into Q and K, not into the embedding itself.
The Core Idea
For a 2D vector $x$ at position $m$, RoPE applies a rotation by angle $m\theta$:

$$f(x, m) = R(m\theta)\, x, \qquad R(m\theta) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

When we compute the dot product of a rotated query at position $m$ with a rotated key at position $n$:

$$\langle R(m\theta)\, q,\; R(n\theta)\, k \rangle = q^\top R(m\theta)^\top R(n\theta)\, k = q^\top R((n-m)\theta)\, k$$

The rotation matrices compose: $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$. The dot product depends only on the relative position $n - m$, not on the absolute positions.
This is why RoPE works: relative position emerges naturally from the algebra of rotations.
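The composition identity is easy to verify numerically. A minimal 2D sketch with arbitrary vectors, angle, and positions:

```python
import math
import torch

def R(theta: float) -> torch.Tensor:
    # 2D rotation matrix by angle theta
    return torch.tensor([
        [math.cos(theta), -math.sin(theta)],
        [math.sin(theta),  math.cos(theta)],
    ])

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.7,  0.4])
theta, m, n = 0.05, 9, 4

# Score with absolute rotations at positions m and n ...
abs_score = (R(m * theta) @ q) @ (R(n * theta) @ k)
# ... equals the score with only the relative offset applied to k
rel_score = q @ (R((n - m) * theta) @ k)

equal = torch.allclose(abs_score, rel_score, atol=1e-6)
```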
Extending to Higher Dimensions
For $d > 2$, RoPE applies $d/2$ independent rotations to consecutive pairs of dimensions, each at a different frequency:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

The full rotation for position $m$ is block-diagonal:

$$R_m = \begin{pmatrix} R(m\theta_0) & & \\ & R(m\theta_1) & \\ & & \ddots \end{pmatrix}$$

where each $R(m\theta_i)$ is a 2×2 rotation matrix with angle $m\theta_i$.

Notice the frequency formula $\theta_i = 10000^{-2i/d}$ — it’s the same as sinusoidal encoding. RoPE inherits the multi-scale property: low-frequency pairs capture long-range position, high-frequency pairs capture local position.
Efficient Implementation
We don’t actually construct the block-diagonal rotation matrix. Instead, we use the identity:

$$R_m x = x \odot \cos(m\theta) + \text{rotate\_half}(x) \odot \sin(m\theta)$$

where $\cos(m\theta)$ and $\sin(m\theta)$ are vectors of per-dimension angles and $\text{rotate\_half}$ swaps and negates the paired halves. This is just element-wise multiply and a swap — no matrix construction needed.
```python
class RotaryPositionalEmbedding(torch.nn.Module):
    """
    Rotary Position Embedding (RoPE).

    Applied to Q and K tensors, not to the input embedding.
    Position information enters through rotation, encoding
    relative position in the attention score.

    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length.
        base: Base for frequency computation (default 10000).
    """

    def __init__(self, d_model: int, max_len: int = 8192, base: float = 10000.0):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for RoPE pairs"

        # =========================================
        # Precompute frequencies: θ_i = base^(-2i/d)
        # =========================================
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer("inv_freq", inv_freq)  # (d_model / 2,)

        # Precompute cos/sin for all positions
        self._build_cache(max_len)

    def _build_cache(self, max_len: int):
        """Precompute cos and sin values for positions 0..max_len-1."""
        positions = torch.arange(max_len).float()  # (max_len,)
        # Outer product: (max_len,) x (d/2,) -> (max_len, d/2)
        freqs = torch.outer(positions, self.inv_freq)
        # Duplicate for pairs: (max_len, d)
        freqs = torch.cat([freqs, freqs], dim=-1)
        # Shape: (1, 1, max_len, d_model) — broadcastable over B and H
        self.register_buffer("cos_cached", freqs.cos().unsqueeze(0).unsqueeze(0))
        self.register_buffer("sin_cached", freqs.sin().unsqueeze(0).unsqueeze(0))

    def forward(
        self, q: torch.Tensor, k: torch.Tensor, offset: int = 0
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Apply RoPE to query and key tensors.

        Args:
            q: (B, H, T, d_k)
            k: (B, H, T, d_k)
            offset: Position offset for KV-cache (default 0)
        Returns:
            q_rotated: (B, H, T, d_k)
            k_rotated: (B, H, T, d_k)
        """
        T = q.size(2)
        # =========================================
        # Slice precomputed cos/sin for current positions
        # =========================================
        cos = self.cos_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)
        sin = self.sin_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)

        # =========================================
        # Apply rotation to Q and K
        # =========================================
        q_rotated = (q * cos) + (self._rotate_half(q) * sin)
        k_rotated = (k * cos) + (self._rotate_half(k) * sin)
        return q_rotated, k_rotated

    @staticmethod
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
        """
        Split the last dimension in half, swap the halves, negate the first:
        [x_0, ..., x_{d/2-1}, x_{d/2}, ..., x_{d-1}]
            -> [-x_{d/2}, ..., -x_{d-1}, x_0, ..., x_{d/2-1}]

        This implements the "swap and negate" part of the rotation.
        """
        d_half = x.shape[-1] // 2
        x1 = x[..., :d_half]
        x2 = x[..., d_half:]
        return torch.cat([-x2, x1], dim=-1)
```
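Two properties of this rotation are worth checking: it preserves vector norms, and attention scores depend only on the position offset. A self-contained functional sketch of the same half-split rotation (so it runs without the class above):

```python
import torch

def rope(x: torch.Tensor, pos: float, base: float = 10000.0) -> torch.Tensor:
    # Functional RoPE on a single vector, same half-split layout as the class
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    freqs = torch.cat([pos * inv_freq, pos * inv_freq])
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    rotate_half = torch.cat([-x2, x1], dim=-1)
    return x * freqs.cos() + rotate_half * freqs.sin()

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# 1) Rotation preserves the vector norm
norm_ok = torch.allclose(rope(q, 13.0).norm(), q.norm(), atol=1e-5)

# 2) Scores depend only on relative position: offset 3 at two different spots
s_near = rope(q, 10.0) @ rope(k, 7.0)      # positions (10, 7)
s_far = rope(q, 103.0) @ rope(k, 100.0)    # positions (103, 100)
relative_ok = torch.allclose(s_near, s_far, atol=1e-4)
```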
How RoPE Integrates with Multi-Head Attention
RoPE modifies MultiHeadAttention.forward() — after splitting heads and before computing attention:
```python
# Inside MultiHeadAttention.forward():
Q = self._split_heads(self.W_Q(query))  # (B, H, T_q, d_k)
K = self._split_heads(self.W_K(key))    # (B, H, T_k, d_k)
V = self._split_heads(self.W_V(value))  # (B, H, T_k, d_k)

# =========================================
# Apply RoPE to Q and K (not V!)
# =========================================
Q, K = self.rope(Q, K, offset=cache_len)

# Then proceed with attention as before
attn_output, weights = scaled_dot_product_attention(Q, K, V, mask)
```
Note: RoPE is applied to Q and K only — not to V. Position should affect which tokens attend to each other (the attention weights), but not what information they provide (the values).
RoPE and KV-Cache
RoPE integrates naturally with KV-cache. The offset parameter tells RoPE which absolute position the current tokens start at. During incremental decoding:
- First pass (full sequence): `offset=0`, rotates all positions
- Decoding step t (single token): `offset=t`, rotates by the correct absolute position for the new token
- Cached K values are already rotated — no re-rotation needed
This is another advantage over additive position encoding, where you’d need to carefully track which positions have already been encoded.
Key Insight: RoPE encodes relative position as a mathematical property of dot products — not as an additive signal that competes with content. The attention score naturally depends on $n - m$ through the rotation algebra. This is why it generalizes better than additive approaches.
ALiBi: Attention with Linear Biases
An Even Simpler Alternative
ALiBi (Press et al., 2022) takes a radically different approach: don’t encode position in the embeddings at all. Instead, add a linear bias directly to the attention scores based on distance.
$$\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h \cdot |i - j|$$

where $m_h$ is a head-specific slope. Closer tokens get a smaller penalty; distant tokens get a larger one.

Each attention head gets a different slope, geometrically spaced:

$$m_h = \frac{1}{2^{8h/H}}, \qquad h = 1, \ldots, H$$

With 8 heads: $m = \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots, \tfrac{1}{256}$.
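The slope formula is a one-liner to check (H = 8 assumed):

```python
n_heads = 8

# m_h = 2^(-8h/H) for h = 1..H — a geometric sequence with ratio 1/2
slopes = [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]
```

The first head gets slope 1/2, the last 1/256, and each successive head halves the previous slope.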
Implementation
```python
class ALiBi(torch.nn.Module):
    """
    Attention with Linear Biases.

    No position encoding in embeddings — position enters
    as a bias on attention scores.

    Args:
        n_heads: Number of attention heads.
        max_len: Maximum sequence length.
    """

    def __init__(self, n_heads: int, max_len: int = 8192):
        super().__init__()
        # =========================================
        # Compute head-specific slopes
        # =========================================
        slopes = torch.tensor([
            1.0 / (2 ** (8 * h / n_heads))
            for h in range(1, n_heads + 1)
        ])  # (H,)

        # =========================================
        # Precompute distance matrix
        # =========================================
        positions = torch.arange(max_len)
        # |i - j| for all position pairs: (max_len, max_len)
        distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()

        # Bias: (H, max_len, max_len) — negative, penalizes distance
        bias = -slopes.view(-1, 1, 1) * distance.unsqueeze(0)
        self.register_buffer("bias", bias.unsqueeze(0))  # (1, H, max_len, max_len)

    def forward(self, T: int) -> torch.Tensor:
        """
        Returns ALiBi bias for sequence length T.
        Add this to attention scores before softmax.

        Returns:
            bias: (1, H, T, T)
        """
        return self.bias[:, :, :T, :T]
```
Usage is simple — add the bias to scores in the attention function:
```python
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores + alibi.forward(T)  # add position bias
scores = scores + causal_mask       # add causal mask
weights = torch.softmax(scores, dim=-1)
```
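Two quick checks on the bias matrix itself, using a sketch of the same construction with small hypothetical sizes:

```python
import torch

n_heads, T = 4, 6
slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])

pos = torch.arange(T)
distance = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs().float()   # (T, T)
bias = -slopes.view(-1, 1, 1) * distance                          # (H, T, T)

# A token attending to itself pays no penalty ...
diag_zero = bool(torch.all(bias[:, range(T), range(T)] == 0))
# ... and the penalty grows strictly with distance
monotone = bool(bias[0, 0, 5] < bias[0, 0, 1] < 0)
```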
Why ALiBi Works
1. Recency bias is a strong prior. In language, nearby tokens are usually more relevant. ALiBi encodes this directly.
2. Different heads, different horizons. Steep slopes (e.g. $m = \tfrac{1}{2}$) create heads that focus on very local context. Gentle slopes (e.g. $m = \tfrac{1}{256}$) create heads that can attend far back.
3. Extrapolation. Because the bias is a simple linear function of distance, it extends naturally to unseen lengths. This is ALiBi’s strongest selling point — it extrapolates better than sinusoidal or learned embeddings.
Limitations
- No relative position in content — position only affects attention weights, not representations
- Assumes recency — tasks where distant tokens are equally relevant (e.g., retrieval) may suffer
- Not used in most modern LLMs — RoPE has largely won for open-weight models
Key Insight: ALiBi is the minimalist approach — position as a bias on attention scores, nothing more. It extrapolates well but sacrifices the richness of position information in representations. Think of it as a strong baseline that RoPE improves upon.
Comparison: Which to Use When
| Property | Sinusoidal | Learned | RoPE | ALiBi |
|---|---|---|---|---|
| Parameters | 0 | max_len × d_model | 0 | 0 |
| Where applied | Add to embeddings | Add to embeddings | Rotate Q, K | Bias on scores |
| Relative position | Via rotation property | Not explicit | Direct (dot product) | Direct (distance) |
| Extrapolation | Weak in practice | None | Good (with NTK-aware scaling) | Best out-of-box |
| Used by | Original Transformer | GPT-2, BERT | LLaMA, Qwen, Mistral | BLOOM, MPT |
| Content-position coupling | Additive (coupled) | Additive (coupled) | Multiplicative (decoupled) | Score-level only |
The Decision
For rlvr-from-scratch, we implement all three (sinusoidal, learned, RoPE) and use RoPE as the default — matching modern LLM practice.
Implementation
The full tested implementation is at src/rlvr_from_scratch/model/positional.py.
Module Summary
| Component | Type | Parameters | How it works |
|---|---|---|---|
| `SinusoidalPositionalEncoding` | Additive | 0 | Pre-computed sin/cos buffer added to embeddings |
| `LearnedPositionalEmbedding` | Additive | max_len × d_model | `nn.Embedding` lookup added to embeddings |
| `RotaryPositionalEmbedding` | Multiplicative | 0 | Rotation applied to Q, K after head split |
Test Coverage
Correctness:
- Sinusoidal: different positions produce different encodings
- Learned: output shape matches input, gradients flow
- RoPE: relative position property — the Q·K dot product depends only on $n - m$
- RoPE: rotation preserves vector norm
Extrapolation:
- Sinusoidal: produces valid (non-NaN) output beyond trained length
- Learned: raises error beyond max_len
- RoPE: produces valid output at arbitrary positions
Integration:
- Each variant integrates with `MultiHeadAttention` without shape errors
- KV-cache works correctly with RoPE `offset` parameter
Key Takeaways
The Core Problem
Attention is permutation equivariant. Without position information, “the dog bit the man” and “the man bit the dog” are indistinguishable.
Three Approaches
- Sinusoidal: Fixed multi-frequency signal added to embeddings. Educational, historically important, but superseded.
- Learned: An embedding table. Simple, effective within range, can’t extrapolate.
- RoPE: Rotation applied to Q and K. Relative position emerges from dot-product algebra. The modern standard.
Why RoPE Won
- Decoupled from content — position enters through rotation, not addition
- Relative by construction — the score depends on $n - m$, not on the absolute positions $m$ and $n$
- Compatible with KV-cache — rotated keys are cached as-is
- Extrapolation — extends well with NTK-aware frequency scaling
What’s Next
We now have attention (Part 1) and position encoding (Part 2). In Part 3: Building a Transformer, I assemble the full transformer block: multi-head attention + feed-forward network + layer normalization + residual connections. The architecture goes from components to a complete, trainable model.
Further Reading
Original Papers:
- Attention Is All You Need (Vaswani et al., 2017) — sinusoidal encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021) — RoPE
- Train Short, Test Long: Attention with Linear Biases (Press et al., 2022) — ALiBi
Extensions:
- YaRN: Efficient Context Window Extension (Peng et al., 2023) — NTK-aware RoPE scaling
- Extending Context Window of Large Language Models via Positional Interpolation (Chen et al., 2023)
Implementation:
- rlvr-from-scratch — sinusoidal, learned, and RoPE from scratch
Cite this reference
Sousa, V. (2026). Positional Encoding: Teaching Transformers to Count. vitorsousa.com (Foundation Reference). https://www.vitorsousa.com/foundations//
```bibtex
@article{sousa2026,
  title={Positional Encoding: Teaching Transformers to Count},
  author={Sousa, Vitor},
  year={2026},
  note={Foundation Reference},
  url={https://www.vitorsousa.com/foundations//}
}
```