Positional Encoding: Teaching Transformers to Count
Part 2 of 4: Making Order Matter
TL;DR: Attention is permutation equivariant — shuffle the tokens and you get the same output (shuffled). Without position information, “the cat sat on the mat” and “mat the on sat cat the” are identical to the model. This article derives the main approaches to fixing this: sinusoidal encoding (fixed, infinite extrapolation in theory), learned embeddings (flexible, bounded), and Rotary Position Embeddings (RoPE — the modern winner, encoding relative position directly into Q and K via rotation matrices), with ALiBi covered as a score-bias alternative. Full tested implementation at rlvr-from-scratch.
Prerequisites: Part 1: Attention Is All You Need to Implement covers scaled dot-product attention and multi-head attention.
The Position Problem
In Part 1 we built attention from scratch. It works — but it has a fundamental gap.
Consider two inputs:
Input A: ["The", "cat", "sat", "on", "the", "mat"]
Input B: ["mat", "the", "on", "sat", "cat", "The"]
Feed both through our multi-head attention module. The attention weights will differ (different tokens in different positions means different Q, K, V vectors). But here’s the problem: if you permute both the input and the output in the same way, you get the same result. Attention treats its input as a set, not a sequence.
Formally, for any permutation $\pi$ of the token positions:

$$\text{Attention}(\pi X) = \pi \, \text{Attention}(X)$$
This is called permutation equivariance. It means attention has no concept of “first”, “second”, “last”. Token 0 and token 99 are processed identically — there’s nothing in the computation that distinguishes position.
For language, this is catastrophic. “The dog bit the man” and “The man bit the dog” have the same tokens. Without position, attention can’t tell them apart.
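A quick numerical check makes the equivariance concrete. This is an illustrative sketch — a toy single-head self-attention with Q = K = V = x, not the full Part 1 module:

```python
import torch

def toy_attention(x: torch.Tensor) -> torch.Tensor:
    # Minimal self-attention with Q = K = V = x and no position signal
    scores = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 8)        # 6 tokens, d_model = 8
perm = torch.randperm(6)     # an arbitrary shuffle of the tokens

out_shuffled = toy_attention(x)[perm]   # attend first, shuffle after
shuffled_out = toy_attention(x[perm])   # shuffle first, attend after

# Permutation equivariance: the two orders agree
same = torch.allclose(out_shuffled, shuffled_out, atol=1e-6)
```

Shuffling before or after attention produces identical results: the computation carries no notion of order.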
We need to inject position information. The question is how.
Sinusoidal Positional Encoding
The Original Approach
Vaswani et al. (2017) proposed adding a fixed signal to each token’s embedding. The signal varies with position and dimension, using sine and cosine functions at different frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position in the sequence and $i$ indexes the dimension pairs.
Why sin and cos? Why 10000? Why this specific formula?
The Frequency Intuition
Think of each dimension pair as a clock running at a different speed.
- Dimensions 0, 1: a fast clock — cycles every few positions
- Dimensions $d-2$, $d-1$: a slow clock — cycles over thousands of positions
The wavelength of dimension pair $i$ is:

$$\lambda_i = 2\pi \cdot 10000^{2i/d_{\text{model}}}$$

For $d_{\text{model}} = 256$:
| Dimension pair | Wavelength | What it captures |
|---|---|---|
| 0, 1 | $2\pi \approx 6.3$ tokens | Very local position (every ~6 tokens) |
| 128, 129 | $\approx 628$ tokens | Paragraph-level position |
| 254, 255 | $\approx 58{,}000$ tokens | Document-level position |
The model gets a multi-scale position signal. Low dimensions distinguish nearby tokens. High dimensions distinguish distant tokens.
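The wavelengths in the table follow directly from the formula. A quick sanity check, assuming $d_{\text{model}} = 256$ as above:

```python
import math

d_model = 256

def wavelength(pair: int) -> float:
    # Dimension pair `pair` covers dims (2*pair, 2*pair + 1);
    # its wavelength in positions is 2π · 10000^(2i / d_model)
    return 2 * math.pi * 10000 ** (2 * pair / d_model)

local = wavelength(0)      # dims 0, 1: about 6.3 positions
mid = wavelength(64)       # dims 128, 129: about 628 positions
global_ = wavelength(127)  # dims 254, 255: tens of thousands of positions
```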
Why Sin/Cos Pairs?
The key property: for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear transformation of $PE_{pos}$.

$$\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$$

where $\omega_i = 1 / 10000^{2i/d_{\text{model}}}$.

This is a rotation matrix. Moving from position $pos$ to position $pos+k$ is a rotation by angle $\omega_i k$ — the same rotation regardless of $pos$. This means the model can learn to attend to relative positions: “the token 3 positions back” is always the same transformation.
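The rotation property is easy to verify numerically for one frequency. A small sketch (the frequency, position, and offset values are arbitrary):

```python
import math
import torch

omega = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of an arbitrary dimension pair
pos, k = 17.0, 5.0                    # any position and any offset

def pe(p: float) -> torch.Tensor:
    # The (sin, cos) pair for this frequency at position p
    return torch.tensor([math.sin(omega * p), math.cos(omega * p)])

# Rotation matrix determined by the offset k alone, independent of pos
M = torch.tensor([
    [math.cos(omega * k),  math.sin(omega * k)],
    [-math.sin(omega * k), math.cos(omega * k)],
])

matches = torch.allclose(M @ pe(pos), pe(pos + k), atol=1e-6)
```

The same matrix `M` maps any position’s encoding to the encoding $k$ steps later.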
The Implementation
```python
import math

import torch


class SinusoidalPositionalEncoding(torch.nn.Module):
    """
    Fixed sinusoidal positional encoding from "Attention Is All You Need".
    No learnable parameters — the encoding is deterministic.

    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length to precompute.
    """

    def __init__(self, d_model: int, max_len: int = 8192):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairs"

        # =========================================
        # Precompute encoding matrix: (max_len, d_model)
        # =========================================
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)

        # Frequency for each dimension pair
        # div_term = 1 / 10000^(2i / d_model) = exp(-2i * log(10000) / d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model / 2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        # Register as buffer (not a parameter, but saved with model)
        # Shape: (1, max_len, d_model) for broadcasting over batch
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, T, d_model)
        Returns:
            x + positional encoding: (B, T, d_model)
        """
        # =========================================
        # Add position signal to input embeddings
        # =========================================
        # self.pe: (1, max_len, d_model) -> slice to (1, T, d_model)
        return x + self.pe[:, :x.size(1), :]
```
Key implementation detail: We compute the division term in log-space (`exp(-2i * log(10000) / d_model)`) instead of raising 10000 to a fractional power directly. The two forms are mathematically identical; the log-space version is the numerically stable convention used by reference implementations.
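As a sanity check, the log-space term matches the direct power computation. A small sketch with an arbitrary `d_model`:

```python
import math
import torch

d_model = 64
two_i = torch.arange(0, d_model, 2).float()   # the even indices 2i

# Log-space form used in the module above
log_space = torch.exp(two_i * (-math.log(10000.0) / d_model))
# Direct form: 1 / 10000^(2i / d_model)
direct = 1.0 / (10000.0 ** (two_i / d_model))

agree = torch.allclose(log_space, direct, atol=1e-7)
```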
Strengths and Limitations
Strengths:
- No parameters — nothing to learn, nothing to overfit
- Extrapolation — in theory, can handle any sequence length (the frequencies are defined for all positions)
- Relative position encoded — the linear transformation property
Limitations:
- Extrapolation doesn’t actually work well — while the math defines the encoding at every position, models trained on sequences of length $L$ degrade at lengths beyond $L$ in practice
- Fixed — can’t adapt to the task
- Additive injection — position and content share the same representation space, which can create interference
Key Insight: Sinusoidal encoding turns position into a multi-frequency signal. Each dimension pair is a clock at a different speed. The sin/cos pairing enables relative position through rotation — a property that RoPE later exploits much more directly.
Learned Positional Embeddings
The Simplest Approach
Why compute a fixed formula when you can just learn the position representations?
An embedding table. Position 0 gets one learned vector, position 1 gets another, and so on. This is what GPT-2 and BERT use.
class LearnedPositionalEmbedding(torch.nn.Module):
"""
Learned positional embedding — a lookup table.
Args:
max_len: Maximum sequence length.
d_model: Model dimension.
"""
def __init__(self, max_len: int, d_model: int):
super().__init__()
self.embedding = torch.nn.Embedding(max_len, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: (B, T, d_model)
Returns:
x + position embeddings: (B, T, d_model)
"""
T = x.size(1)
# =========================================
# Create position indices and look up embeddings
# =========================================
positions = torch.arange(T, device=x.device) # (T,)
pos_emb = self.embedding(positions) # (T, d_model)
return x + pos_emb # broadcast over batch
That’s it: an `nn.Embedding` lookup table with max_len × d_model learnable parameters.
When Learned Beats Sinusoidal
In practice, learned embeddings often perform slightly better than sinusoidal for fixed-length tasks — the model can learn position representations optimized for the actual data distribution.
But there’s a hard ceiling: the model has no embedding for positions beyond max_len. If you train with max_len = 512 and try to process a 1,024-token sequence, you crash. You’d need to either truncate or retrain.
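The failure mode is easy to reproduce. A sketch with hypothetical sizes (max_len = 512, d_model = 64):

```python
import torch

emb = torch.nn.Embedding(512, 64)   # rows exist for positions 0..511 only
positions = torch.arange(1024)      # a 1,024-token sequence

crashed = False
try:
    emb(positions)                  # looks up rows 512..1023, which don't exist
except IndexError:
    crashed = True
```

The lookup raises an `IndexError` — there is no graceful degradation, only a hard boundary.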
Comparison
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 | max_len × d_model |
| Extrapolation | Theoretically yes, practically weak | No — hard crash beyond max_len |
| Adaptability | None | Task-specific |
| Relative position | Encoded via rotation property | Not explicitly encoded |
| Used by | Original Transformer | GPT-2, BERT |
Both approaches share a fundamental limitation: they add position to the token embedding, mixing content and position in the same vector. What if we could encode position without this interference?
Key Insight: Learned embeddings trade generality for expressiveness. They work well within their trained range but cannot extrapolate. For modern LLMs that need long-context generalization, this is a dealbreaker.
Rotary Position Embeddings (RoPE)
The Modern Winner
RoPE (Su et al., 2021) is used by LLaMA, Qwen, Mistral, and most modern open-weight LLMs. Instead of adding position to the embedding, RoPE rotates the query and key vectors by a position-dependent angle. The attention score between two tokens then naturally depends on their relative position.
This is the key shift: position goes into Q and K, not into the embedding itself.
The Core Idea
For a 2D vector $x$ at position $m$, RoPE applies a rotation by angle $m\theta$:

$$f(x, m) = R(m\theta)\, x, \qquad R(m\theta) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

When we compute the dot product of a rotated query at position $m$ with a rotated key at position $n$:

$$\langle R(m\theta)\, q,\; R(n\theta)\, k \rangle = q^\top R(m\theta)^\top R(n\theta)\, k = q^\top R((n-m)\theta)\, k$$

The rotation matrices compose: $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$. The dot product depends only on the relative position $n - m$, not on the absolute positions.
This is why RoPE works: relative position emerges naturally from the algebra of rotations.
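The composition identity is easy to verify numerically. A minimal 2D sketch with arbitrary vectors, angle, and positions:

```python
import math
import torch

def R(theta: float) -> torch.Tensor:
    # 2D rotation matrix by angle theta
    return torch.tensor([
        [math.cos(theta), -math.sin(theta)],
        [math.sin(theta),  math.cos(theta)],
    ])

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.7,  0.4])
theta, m, n = 0.05, 9, 4

# Score with absolute rotations at positions m and n ...
abs_score = (R(m * theta) @ q) @ (R(n * theta) @ k)
# ... equals the score with only the relative offset applied to k
rel_score = q @ (R((n - m) * theta) @ k)

equal = torch.allclose(abs_score, rel_score, atol=1e-6)
```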
Extending to Higher Dimensions
For $d > 2$, RoPE applies $d/2$ independent rotations to consecutive pairs of dimensions, each at a different frequency:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

The full rotation for position $m$ is block-diagonal:

$$R_m = \begin{pmatrix} R(m\theta_0) & & \\ & R(m\theta_1) & \\ & & \ddots \end{pmatrix}$$

where each $R(m\theta_i)$ is a 2×2 rotation matrix with angle $m\theta_i$.

Notice the frequency formula $\theta_i = 10000^{-2i/d}$ — it’s the same as sinusoidal encoding. RoPE inherits the multi-scale property: low-frequency pairs capture long-range position, high-frequency pairs capture local position.
Efficient Implementation
We don’t actually construct the block-diagonal rotation matrix. Instead, we use the identity:

$$R_m x = x \odot \cos(m\theta) + \text{rotate\_half}(x) \odot \sin(m\theta)$$

where $\cos(m\theta)$ and $\sin(m\theta)$ are vectors of per-dimension angles and $\text{rotate\_half}$ swaps and negates the paired halves. This is just element-wise multiply and a swap — no matrix construction needed.
```python
class RotaryPositionalEmbedding(torch.nn.Module):
    """
    Rotary Position Embedding (RoPE).

    Applied to Q and K tensors, not to the input embedding.
    Position information enters through rotation, encoding
    relative position in the attention score.

    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length.
        base: Base for frequency computation (default 10000).
    """

    def __init__(self, d_model: int, max_len: int = 8192, base: float = 10000.0):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for RoPE pairs"

        # =========================================
        # Precompute frequencies: θ_i = base^(-2i/d)
        # =========================================
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer("inv_freq", inv_freq)  # (d_model / 2,)

        # Precompute cos/sin for all positions
        self._build_cache(max_len)

    def _build_cache(self, max_len: int):
        """Precompute cos and sin values for positions 0..max_len-1."""
        positions = torch.arange(max_len).float()  # (max_len,)
        # Outer product: (max_len,) x (d/2,) -> (max_len, d/2)
        freqs = torch.outer(positions, self.inv_freq)
        # Duplicate for pairs: (max_len, d)
        freqs = torch.cat([freqs, freqs], dim=-1)
        # Shape: (1, 1, max_len, d_model) — broadcastable over B and H
        self.register_buffer("cos_cached", freqs.cos().unsqueeze(0).unsqueeze(0))
        self.register_buffer("sin_cached", freqs.sin().unsqueeze(0).unsqueeze(0))

    def forward(
        self, q: torch.Tensor, k: torch.Tensor, offset: int = 0
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Apply RoPE to query and key tensors.

        Args:
            q: (B, H, T, d_k)
            k: (B, H, T, d_k)
            offset: Position offset for KV-cache (default 0)
        Returns:
            q_rotated: (B, H, T, d_k)
            k_rotated: (B, H, T, d_k)
        """
        T = q.size(2)
        # =========================================
        # Slice precomputed cos/sin for current positions
        # =========================================
        cos = self.cos_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)
        sin = self.sin_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)

        # =========================================
        # Apply rotation to Q and K
        # =========================================
        q_rotated = (q * cos) + (self._rotate_half(q) * sin)
        k_rotated = (k * cos) + (self._rotate_half(k) * sin)
        return q_rotated, k_rotated

    @staticmethod
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
        """
        Split the last dimension in half, swap the halves, negate the first:
        [x_0, ..., x_{d/2-1}, x_{d/2}, ..., x_{d-1}]
            -> [-x_{d/2}, ..., -x_{d-1}, x_0, ..., x_{d/2-1}]

        This implements the "swap and negate" part of the rotation.
        """
        d_half = x.shape[-1] // 2
        x1 = x[..., :d_half]
        x2 = x[..., d_half:]
        return torch.cat([-x2, x1], dim=-1)
```
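Two properties of this rotation are worth checking: it preserves vector norms, and attention scores depend only on the position offset. A self-contained functional sketch of the same half-split rotation (so it runs without the class above):

```python
import torch

def rope(x: torch.Tensor, pos: float, base: float = 10000.0) -> torch.Tensor:
    # Functional RoPE on a single vector, same half-split layout as the class
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    freqs = torch.cat([pos * inv_freq, pos * inv_freq])
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    rotate_half = torch.cat([-x2, x1], dim=-1)
    return x * freqs.cos() + rotate_half * freqs.sin()

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# 1) Rotation preserves the vector norm
norm_ok = torch.allclose(rope(q, 13.0).norm(), q.norm(), atol=1e-5)

# 2) Scores depend only on relative position: offset 3 at two different spots
s_near = rope(q, 10.0) @ rope(k, 7.0)      # positions (10, 7)
s_far = rope(q, 103.0) @ rope(k, 100.0)    # positions (103, 100)
relative_ok = torch.allclose(s_near, s_far, atol=1e-4)
```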
How RoPE Integrates with Multi-Head Attention
RoPE modifies MultiHeadAttention.forward() — after splitting heads and before computing attention:
```python
# Inside MultiHeadAttention.forward():
Q = self._split_heads(self.W_Q(query))  # (B, H, T_q, d_k)
K = self._split_heads(self.W_K(key))    # (B, H, T_k, d_k)
V = self._split_heads(self.W_V(value))  # (B, H, T_k, d_k)

# =========================================
# Apply RoPE to Q and K (not V!)
# =========================================
Q, K = self.rope(Q, K, offset=cache_len)

# Then proceed with attention as before
attn_output, weights = scaled_dot_product_attention(Q, K, V, mask)
```
Note: RoPE is applied to Q and K only — not to V. Position should affect which tokens attend to each other (the attention weights), but not what information they provide (the values).
RoPE and KV-Cache
RoPE integrates naturally with KV-cache. The offset parameter tells RoPE which absolute position the current tokens start at. During incremental decoding:
- First pass (full sequence): `offset=0`, rotates all positions
- Decoding step t (single token): `offset=t`, rotates by the correct absolute position for the new token
- Cached K values are already rotated — no re-rotation needed
This is another advantage over additive position encoding, where you’d need to carefully track which positions have already been encoded.
Key Insight: RoPE encodes relative position as a mathematical property of dot products — not as an additive signal that competes with content. The attention score naturally depends on $n - m$ through the rotation algebra. This is why it generalizes better than additive approaches.
ALiBi: Attention with Linear Biases
An Even Simpler Alternative
ALiBi (Press et al., 2022) takes a radically different approach: don’t encode position in the embeddings at all. Instead, add a linear bias directly to the attention scores based on distance.
$$\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h \cdot |i - j|$$

where $m_h$ is a head-specific slope. Closer tokens get a smaller penalty; distant tokens get a larger one.

Each attention head gets a different slope, geometrically spaced:

$$m_h = \frac{1}{2^{8h/H}}, \qquad h = 1, \ldots, H$$

With 8 heads: $m = \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots, \tfrac{1}{256}$.
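The slope formula is a one-liner to check (H = 8 assumed):

```python
n_heads = 8

# m_h = 2^(-8h/H) for h = 1..H — a geometric sequence with ratio 1/2
slopes = [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]
```

The first head gets slope 1/2, the last 1/256, and each successive head halves the previous slope.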
Implementation
```python
class ALiBi(torch.nn.Module):
    """
    Attention with Linear Biases.

    No position encoding in embeddings — position enters
    as a bias on attention scores.

    Args:
        n_heads: Number of attention heads.
        max_len: Maximum sequence length.
    """

    def __init__(self, n_heads: int, max_len: int = 8192):
        super().__init__()
        # =========================================
        # Compute head-specific slopes
        # =========================================
        slopes = torch.tensor([
            1.0 / (2 ** (8 * h / n_heads))
            for h in range(1, n_heads + 1)
        ])  # (H,)

        # =========================================
        # Precompute distance matrix
        # =========================================
        positions = torch.arange(max_len)
        # |i - j| for all position pairs: (max_len, max_len)
        distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()

        # Bias: (H, max_len, max_len) — negative, penalizes distance
        bias = -slopes.view(-1, 1, 1) * distance.unsqueeze(0)
        self.register_buffer("bias", bias.unsqueeze(0))  # (1, H, max_len, max_len)

    def forward(self, T: int) -> torch.Tensor:
        """
        Returns ALiBi bias for sequence length T.
        Add this to attention scores before softmax.

        Returns:
            bias: (1, H, T, T)
        """
        return self.bias[:, :, :T, :T]
```
Usage is simple — add the bias to scores in the attention function:
```python
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores + alibi.forward(T)  # add position bias
scores = scores + causal_mask       # add causal mask
weights = torch.softmax(scores, dim=-1)
```
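Two quick checks on the bias matrix itself, using a sketch of the same construction with small hypothetical sizes:

```python
import torch

n_heads, T = 4, 6
slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])

pos = torch.arange(T)
distance = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs().float()   # (T, T)
bias = -slopes.view(-1, 1, 1) * distance                          # (H, T, T)

# A token attending to itself pays no penalty ...
diag_zero = bool(torch.all(bias[:, range(T), range(T)] == 0))
# ... and the penalty grows strictly with distance
monotone = bool(bias[0, 0, 5] < bias[0, 0, 1] < 0)
```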
Why ALiBi Works
1. Recency bias is a strong prior. In language, nearby tokens are usually more relevant. ALiBi encodes this directly.
2. Different heads, different horizons. Steep slopes (e.g. $m = \tfrac{1}{2}$) create heads that focus on very local context. Gentle slopes (e.g. $m = \tfrac{1}{256}$) create heads that can attend far back.
3. Extrapolation. Because the bias is a simple linear function of distance, it extends naturally to unseen lengths. This is ALiBi’s strongest selling point — it extrapolates better than sinusoidal or learned embeddings.
Limitations
- No relative position in content — position only affects attention weights, not representations
- Assumes recency — tasks where distant tokens are equally relevant (e.g., retrieval) may suffer
- Not used in most modern LLMs — RoPE has largely won for open-weight models
Key Insight: ALiBi is the minimalist approach — position as a bias on attention scores, nothing more. It extrapolates well but sacrifices the richness of position information in representations. Think of it as a strong baseline that RoPE improves upon.
Comparison: Which to Use When
| Property | Sinusoidal | Learned | RoPE | ALiBi |
|---|---|---|---|---|
| Parameters | 0 | max_len × d_model | 0 | 0 |
| Where applied | Add to embeddings | Add to embeddings | Rotate Q, K | Bias on scores |
| Relative position | Via rotation property | Not explicit | Direct (dot product) | Direct (distance) |
| Extrapolation | Weak in practice | None | Good (with NTK-aware scaling) | Best out-of-box |
| Used by | Original Transformer | GPT-2, BERT | LLaMA, Qwen, Mistral | BLOOM, MPT |
| Content-position coupling | Additive (coupled) | Additive (coupled) | Multiplicative (decoupled) | Score-level only |
The Decision
For rlvr-from-scratch, we implement all three (sinusoidal, learned, RoPE) and use RoPE as the default — matching modern LLM practice.
Implementation
The full tested implementation is at src/rlvr_from_scratch/model/positional.py.
Module Summary
| Component | Type | Parameters | How it works |
|---|---|---|---|
| `SinusoidalPositionalEncoding` | Additive | 0 | Pre-computed sin/cos buffer added to embeddings |
| `LearnedPositionalEmbedding` | Additive | max_len × d_model | `nn.Embedding` lookup added to embeddings |
| `RotaryPositionalEmbedding` | Multiplicative | 0 | Rotation applied to Q, K after head split |
Test Coverage
Correctness:
- Sinusoidal: different positions produce different encodings
- Learned: output shape matches input, gradients flow
- RoPE: relative position property — the Q·K dot product depends only on $n - m$
- RoPE: rotation preserves vector norm
Extrapolation:
- Sinusoidal: produces valid (non-NaN) output beyond trained length
- Learned: raises error beyond max_len
- RoPE: produces valid output at arbitrary positions
Integration:
- Each variant integrates with `MultiHeadAttention` without shape errors
- KV-cache works correctly with RoPE `offset` parameter
Key Takeaways
The Core Problem
Attention is permutation equivariant. Without position information, “the dog bit the man” and “the man bit the dog” are indistinguishable.
Three Approaches
- Sinusoidal: Fixed multi-frequency signal added to embeddings. Educational, historically important, but superseded.
- Learned: An embedding table. Simple, effective within range, can’t extrapolate.
- RoPE: Rotation applied to Q and K. Relative position emerges from dot-product algebra. The modern standard.
Why RoPE Won
- Decoupled from content — position enters through rotation, not addition
- Relative by construction — the score depends on $n - m$, not on the absolute positions $m$ and $n$
- Compatible with KV-cache — rotated keys are cached as-is
- Extrapolation — extends well with NTK-aware frequency scaling
What’s Next
We now have attention (Part 1) and position encoding (Part 2). In Part 3: Building a Transformer, I assemble the full transformer block: multi-head attention + feed-forward network + layer normalization + residual connections. The architecture goes from components to a complete, trainable model.
Further Reading
Original Papers:
- Attention Is All You Need (Vaswani et al., 2017) — sinusoidal encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021) — RoPE
- Train Short, Test Long: Attention with Linear Biases (Press et al., 2022) — ALiBi
Extensions:
- YaRN: Efficient Context Window Extension (Peng et al., 2023) — NTK-aware RoPE scaling
- Extending Context Window of Large Language Models via Positional Interpolation (Chen et al., 2023)
Implementation:
- rlvr-from-scratch — sinusoidal, learned, and RoPE from scratch
Cite this reference
Sousa, V. (2026). Positional Encoding: Teaching Transformers to Count. vitorsousa.com (Foundation Reference). https://www.vitorsousa.com/foundations//
```bibtex
@article{sousa2026,
  title={Positional Encoding: Teaching Transformers to Count},
  author={Sousa, Vitor},
  year={2026},
  note={Foundation Reference},
  url={https://www.vitorsousa.com/foundations//}
}
```