Bits | Vitor Sousa — Senior Data Scientist & AI Engineer

GELU / SwiGLU from Scratch

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

Why it matters: The gate is a content-dependent multiplier on each hidden unit — the FFN learns which features to let through, not just how to mix them. Trap: vanilla FFN uses d_ff = 4 * d_model (2 matrices); SwiGLU has 3, so to match parameter count use d_ff ≈ (8/3) * d_model — Llama, Qwen and Mistral all do this. Copying 4x into a SwiGLU silently inflates the block by ~50%.
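The 8/3 rule is plain parameter arithmetic, and a quick count makes it concrete. This is a minimal sketch that counts weights only (no biases, matching the bias=False layers above); the helper names ffn_params and swiglu_params and the choice d_model = 768 are illustrative assumptions.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Vanilla FFN: up (d_model x d_ff) + down (d_ff x d_model)
    return 2 * d_model * d_ff

def swiglu_params(d_model: int, d_ff: int) -> int:
    # SwiGLU: gate + up + down, each with d_model * d_ff weights
    return 3 * d_model * d_ff

d_model = 768
baseline = ffn_params(d_model, 4 * d_model)
naive = swiglu_params(d_model, 4 * d_model)            # 4x copied over
matched = swiglu_params(d_model, (8 * d_model) // 3)   # 8/3 rule

print(naive / baseline)    # 1.5: the block is 50% larger
print(matched / baseline)  # 1.0: same budget as the vanilla FFN
```

In practice d_ff is usually rounded to a hardware-friendly multiple, so the match is approximate rather than exact.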

Full implementation in rlvr-from-scratch/model/ffn.py.

Pre-Norm vs Post-Norm: Why It Matters

# Post-norm (Vaswani 2017): norm after the add
def post_norm(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

# Pre-norm (modern default): norm before the sublayer
def pre_norm(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x

Why it matters: In pre-norm the residual stream itself is never normalized inside a block, so every block has a clean identity gradient path back to the input. Deep stacks (≥12 layers) can train without learning-rate warmup and tolerate worse init. Post-norm, by contrast, needs careful warmup schedules to avoid divergence (Xiong et al. 2020, On Layer Normalization in the Transformer Architecture). Most modern LLMs are pre-norm for this reason.

RMSNorm vs LayerNorm — When and Why

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    mu  = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    ms = x.pow(2).mean(-1, keepdim=True)   # mean square; its sqrt is the RMS
    return x * torch.rsqrt(ms + eps) * gamma

Why it matters: RMSNorm drops the mean-centering step and the bias, normalizing by the root-mean-square instead of the standard deviation: same gain term, fewer ops, and a reported 7–64% reduction in running time depending on the model (Zhang & Sennrich 2019). The empirical finding that justified the simplification: re-centering isn’t actually needed for transformer training to converge, which is why Llama, Qwen, Mistral and Gemma all use RMSNorm by default.
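One way to see what dropping the mean buys and costs: on input that is already zero-mean, the two norms coincide. The NumPy sketch below checks that, with gamma = 1 and beta = 0 assumed so only the normalization itself is compared; the function names are illustrative.

```python
import numpy as np

def layer_norm_np(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm_np(x, eps=1e-5):
    ms = (x ** 2).mean(-1, keepdims=True)   # mean square, no centering
    return x / np.sqrt(ms + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
x_centered = x - x.mean(-1, keepdims=True)

# Zero-mean input: variance equals mean square, so the outputs agree
same_when_centered = np.allclose(layer_norm_np(x_centered), rms_norm_np(x_centered))
# Shifted input: LayerNorm removes the offset, RMSNorm keeps it
differs_with_offset = not np.allclose(layer_norm_np(x + 3.0), rms_norm_np(x + 3.0))
```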

RoPE: Rotary Position Embedding in 15 Lines

import torch

def precompute_freqs_cis(d_k, end, base=10000.0):
    freqs = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    t = torch.arange(end)
    angles = torch.outer(t, freqs)                       # (T, d_k/2)
    return torch.polar(torch.ones_like(angles), angles)  # complex

def apply_rope(x, freqs_cis):
    # x: (B, H, T, d_k) — RoPE runs per-head, so d_k not d_model
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rot = x_complex * freqs_cis                        # broadcasts over B, H
    return torch.view_as_real(x_rot).flatten(-2).type_as(x)

Why it matters: RoPE rotates Q and K by position-dependent angles, so the dot product Q_iᵀ K_j ends up depending only on the relative offset i−j — relative position is baked into the attention score itself, no embedding to add to the input. That’s why it extrapolates to longer contexts more gracefully than sinusoidal or learned absolute embeddings, and why every modern LLM (Llama, Qwen, Mistral, Gemma) uses it.
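The relative-offset claim can be checked numerically. The sketch below is NumPy rather than PyTorch to keep it tiny, treats each 2D pair as one complex number, and uses made-up helper names (rope_rotate, score); it assumes the same frequency schedule as precompute_freqs_cis above.

```python
import numpy as np

def rope_rotate(v, pos, base=10000.0):
    # v: complex vector of length d_k/2 (each entry is one 2D pair)
    d_half = v.shape[0]
    freqs = 1.0 / (base ** (np.arange(d_half) * 2.0 / (2 * d_half)))
    return v * np.exp(1j * pos * freqs)

def score(q, k, m, n):
    # Real dot product of the rotated vectors, written in complex form
    return np.real(np.sum(rope_rotate(q, m) * np.conj(rope_rotate(k, n))))

rng = np.random.default_rng(0)
q = rng.normal(size=8) + 1j * rng.normal(size=8)
k = rng.normal(size=8) + 1j * rng.normal(size=8)

s_a = score(q, k, m=3, n=1)     # offset 2
s_b = score(q, k, m=50, n=48)   # offset 2, both positions shifted by 47
s_c = score(q, k, m=3, n=2)     # offset 1
```

s_a and s_b agree because the rotations combine into exp(i(m−n)θ); s_c differs because the offset changed.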

Scaled Dot-Product Attention in 20 Lines

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V

Why it matters: Without the √d_k divisor, dot products of unit-variance queries and keys have variance d_k, so they grow with dimension and push softmax into its saturated tail, where gradients vanish and training stalls. The causal mask sets future positions to −inf before softmax, so their weights come out as exactly zero and each token’s representation depends only on its past. That is the property that makes autoregressive generation well-defined.
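The growth with dimension is easy to measure. A small NumPy sketch, assuming queries and keys with independent unit-variance entries (an idealization of what initialization aims for):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = (q * k).sum(-1)          # unscaled dot products
scaled = raw / np.sqrt(d_k)    # what attention feeds to softmax

print(raw.var())     # close to d_k
print(scaled.var())  # close to 1
```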

Pre-Norm vs Post-Norm: Why It Matters

The two placement strategies, gradient flow implications.

Post-Norm (original transformer):

x → Attention → Add(x) → Norm → FFN → Add → Norm

Pre-Norm (modern default):

x → Norm → Attention → Add(x) → Norm → FFN → Add

Pre-Norm enables stable training at depth because the residual path is unobstructed — gradients flow directly through the skip connection without passing through normalization.

Draft — expanded in Building a Transformer.

RoPE: Rotary Position Embedding in 15 Lines

RoPE implementation — rotation matrix applied to Q and K.

import torch

def apply_rope(x, freqs_cis):
    """
    x: (batch, heads, seq_len, d_k)
    freqs_cis: (seq_len, d_k // 2) complex frequencies
    """
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)

Draft — full derivation coming in the Positional Encoding foundation post.

Scaled Dot-Product Attention in 20 Lines

The complete attention function in 20 lines of PyTorch, annotated.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (B, H, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)              # (B, H, T, T)
    return weights @ V                                # (B, H, T, d_k)

Draft — full annotation coming in the Attention from Scratch foundation post.

Bias-Variance Decomposition

$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$

Bias: model too simple to capture f
Variance: model too sensitive to training sample
Noise: irreducible — the floor

Why it matters: Every regularization choice (dropout, weight decay, early stopping) accepts some extra bias in exchange for lower variance. Knowing which term dominates your error tells you which knob to turn.
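The decomposition can be estimated by refitting a model on many fresh training samples and looking at its predictions at one point. The setup below is entirely illustrative: sin as the assumed ground truth, a straight-line fit as the too-simple model, and arbitrary values for x0 and sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                 # assumed ground-truth function
x0, sigma = 1.0, 0.3       # query point and noise std (illustrative)

preds = []
for _ in range(500):
    # Fresh training sample each round, then fit a straight line
    x = rng.uniform(0, np.pi, size=20)
    y = f(x) + rng.normal(scale=sigma, size=20)
    slope, intercept = np.polyfit(x, y, deg=1)
    preds.append(slope * x0 + intercept)
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
mse_vs_truth = np.mean((preds - f(x0)) ** 2)   # error against noiseless f(x0)
expected_test_error = bias_sq + variance + sigma ** 2
```

mse_vs_truth equals bias_sq + variance exactly; the noise term only appears once the target is a noisy y rather than f(x0).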

Binary Search (Safe Midpoint)

def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2   # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target:  lo = mid + 1
        else:                  hi = mid - 1
    return -1

Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
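The bisect_left variant of the template, as a sketch: the interval becomes half-open and the loop returns an insertion point instead of -1. The name lower_bound is borrowed from C++ for illustration; the standard library's bisect_left is the reference it is checked against.

```python
from bisect import bisect_left

def lower_bound(arr, target):
    # First index i with arr[i] >= target; len(arr) if there is none
    lo, hi = 0, len(arr)           # half-open interval [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2  # same safe midpoint
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

arr = [1, 3, 3, 3, 7, 9]
matches_stdlib = all(lower_bound(arr, t) == bisect_left(arr, t) for t in range(0, 11))
```

Changing the comparison to arr[mid] <= target gives the bisect_right behavior.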

Causal Mask as an Additive Tensor

import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)

Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
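A sketch of the composition, assuming a made-up batch of one sequence whose last position is padding. The tensor names are illustrative.

```python
import torch

T = 4
# Causal part: -inf strictly above the diagonal, 0 elsewhere
causal = torch.full((T, T), float("-inf")).triu(diagonal=1)

# Padding part: last key position is padding in this made-up batch of 1
is_pad = torch.tensor([[False, False, False, True]])            # (B, T)
padding = torch.zeros(1, 1, 1, T).masked_fill(
    is_pad[:, None, None, :], float("-inf")
)

mask = causal + padding            # broadcasts to (B, 1, T, T)
scores = torch.zeros(1, 1, T, T)   # dummy attention scores
weights = torch.softmax(scores + mask, dim=-1)
```

One caveat: a query row whose keys are all masked is -inf everywhere, and softmax returns NaN for it, so fully padded rows need separate handling.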

Chain Rule for Vectors

$y = f(g(x)), \quad \frac{\partial y}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$

Shapes: (k × m) · (m × n) = (k × n).

Why it matters: Backprop is this product evaluated from the output side first: the upstream gradient is multiplied into one Jacobian at a time, so full Jacobians are never materialized. Autograd only needs each op’s vector-Jacobian product.
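A small NumPy sketch of the difference, assuming g(x) = Ax and an elementwise tanh for f (both chosen only for illustration). The explicit route builds the Jacobians; the backprop route never does.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))      # g(x) = A @ x, Jacobian is A (5 x 3)
x = rng.normal(size=3)
u = A @ x
# f(u) = tanh(u) elementwise, Jacobian is diag(1 - tanh(u)^2) (5 x 5)
df_du = 1.0 - np.tanh(u) ** 2

v = rng.normal(size=5)           # upstream gradient dL/dy

# Explicit route: build both Jacobians, multiply, then apply v
J_full = np.diag(df_du) @ A      # (5 x 5) @ (5 x 3) = (5 x 3)
grad_explicit = v @ J_full

# Backprop route: two vector-Jacobian products, no 5 x 5 matrix built
grad_vjp = (v * df_du) @ A
```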

Why `.contiguous()` After `transpose()`

# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)

# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)

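Why it matters: transpose returns a view with permuted strides; no data moves. view only reinterprets existing memory, so it needs strides compatible with the new shape, which the transposed tensor lacks here. .contiguous() makes the copy (and .reshape() makes the same copy only when needed). You hit this exactly once, then never forget.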
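The strides make the failure visible. A minimal sketch with arbitrary small shapes (B, H, T, d_k are illustrative values):

```python
import torch

B, H, T, d_k = 2, 4, 3, 8
x = torch.arange(B * H * T * d_k, dtype=torch.float32).reshape(B, H, T, d_k)

y = x.transpose(1, 2)                 # (B, T, H, d_k), same storage as x
print(x.stride(), y.stride())         # strides are permuted, data is not moved
print(y.is_contiguous())              # False

try:
    y.view(B, T, H * d_k)             # cannot merge H and d_k without a copy
    view_failed = False
except RuntimeError:
    view_failed = True

merged = y.contiguous().view(B, T, H * d_k)            # copy, then view
same = torch.equal(merged, y.reshape(B, T, H * d_k))   # reshape copies if needed
```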

Eigendecomposition: What It Buys You

$A = Q \Lambda Q^{-1} \quad\Rightarrow\quad A^t = Q \Lambda^t Q^{-1}$

For $x_{t+1} = Ax_t$, iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.

Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
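A numerical check of both claims, using a small symmetric matrix picked for illustration (symmetric so the eigenvalues are real and Q⁻¹ = Qᵀ):

```python
import numpy as np

A = np.array([[0.5, 0.2],
              [0.2, 0.8]])   # symmetric, eigenvalues 0.4 and 0.9
t = 20

eigvals, Q = np.linalg.eigh(A)                    # A = Q diag(eigvals) Q^T
A_t_eig = Q @ np.diag(eigvals ** t) @ Q.T         # power only the diagonal
A_t_loop = np.linalg.matrix_power(A, t)           # repeated multiplication

spectral_radius = np.max(np.abs(eigvals))
x_0 = np.array([1.0, 1.0])
x_t = A_t_eig @ x_0                               # state after t steps
```

With spectral radius below 1, the state x_t shrinks toward zero as t grows.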

Einops for Tensor Reshaping

from einops import rearrange

# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)

# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")

Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.

He Initialization

# For ReLU / GELU activations
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, std^2) with std = sqrt(2 / fan_in)

Why it matters: Variance-preserving across ReLU layers (ReLU zeroes about half the activations, so the weight variance is doubled to compensate). Xavier (variance 2 / (fan_in + fan_out)) is derived for tanh/sigmoid; in deep ReLU nets it lets activations and gradients shrink layer by layer.
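A toy simulation of the factor of two, in NumPy. It pushes random inputs through a stack of square ReLU layers and compares weight variance 2/fan_in against 1/fan_in (what Xavier reduces to when fan_in equals fan_out). Width, depth, and batch size are arbitrary choices for illustration.

```python
import numpy as np

def final_variance(weight_std_fn, width=512, depth=10, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(256, width))          # unit-variance pre-activations
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        z = np.maximum(z, 0.0) @ W             # ReLU, then linear layer
    return z.var()

he_var = final_variance(lambda fan_in: np.sqrt(2.0 / fan_in))
small_var = final_variance(lambda fan_in: np.sqrt(1.0 / fan_in))

print(he_var)      # stays on the order of 1
print(small_var)   # roughly halves per layer, about 2**-10 of he_var
```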

Jensen's Inequality

For convex f: $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$. For concave f (e.g., log): $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$.

Why it matters: The whole derivation of the ELBO in variational inference starts here. It is also why $\log \mathbb{E}[\cdot]$ and $\mathbb{E}[\log \cdot]$ aren’t interchangeable: in the ELBO, the gap between them is exactly the KL divergence from the variational posterior to the true posterior.
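A quick numerical look at the gap for the concave case. Log-normal samples are an arbitrary choice here; they are convenient because the gap has the closed form sigma²/2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive samples

log_of_mean = np.log(x.mean())     # log E[X]
mean_of_log = np.log(x).mean()     # E[log X]
gap = log_of_mean - mean_of_log    # Jensen gap, non-negative for concave log

print(gap)   # close to sigma^2 / 2 = 0.5 for this distribution
```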

KV-Cache in Five Lines

if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)

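Why it matters: Without the cache, every decoding step recomputes K and V for the whole prefix and reruns full attention, O(T²) work per step. With it, each step costs O(1) projections for the new token plus O(T) attention over the cached keys. Single biggest inference optimization.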
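That the cache changes cost but not results can be checked on a toy case. The sketch below is single-head, unbatched NumPy with random weights, all assumed for illustration; it compares full causal attention against step-by-step decoding with an appended cache.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))                     # token representations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Reference: full causal attention over the whole sequence at once
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf       # mask future positions
full = softmax(scores) @ V

# Incremental decoding: project one token per step, append to the cache
K_cache, V_cache, steps = np.empty((0, d)), np.empty((0, d)), []
for t in range(T):
    q, k, v = x[t] @ Wq, x[t] @ Wk, x[t] @ Wv   # projections for one token
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    w = softmax(K_cache @ q / np.sqrt(d))       # attention over t + 1 cached keys
    steps.append(w @ V_cache)
incremental = np.stack(steps)
```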

Maximum Likelihood = Minimum Cross-Entropy

$\arg\max_\theta \sum_i \log p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \;-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_i)$

The RHS is cross-entropy between the empirical distribution and the model.

Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.
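The identity in code, for a categorical model: the mean negative log-likelihood of the labels and the cross-entropy against one-hot targets are the same number. Logits and labels below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))                 # 5 examples, 3 classes
y = np.array([0, 2, 1, 1, 0])                    # integer labels

log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))

# Negative mean log-likelihood of the observed labels
nll = -log_probs[np.arange(5), y].mean()

# Cross-entropy H(p, q) = -sum p log q, with p the one-hot empirical distribution
one_hot = np.eye(3)[y]
cross_entropy = -(one_hot * log_probs).sum(-1).mean()
```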

RoPE Applies at `d_k`, Not `d_model`

# WRONG — rotate before head split
x = apply_rope(x, freqs)            # (B, T, d_model)
Q, K, V = split_heads(project(x))   # breaks relative-position property

# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x))   # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)

Why it matters: RoPE encodes relative position through 2D rotations in the per-head subspace. Apply at d_model and heads mix rotations, losing the ⟨q_m, k_n⟩ = f(m−n) property. Every modern LLM rotates at d_k.

Seed Everything for Reproducibility

import random, numpy as np, torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Why it matters: Same seed + same code → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.

Sinusoidal vs Learned vs Rotary

Property	Sinusoidal	Learned	RoPE
Type	Fixed	Parameter	Fixed
Applied at	Input embedding	Input embedding	Q and K inside attention
Extrapolates past max_len	Yes	No	Yes
Encodes relative position	Weakly	No	Explicitly

Why it matters: Llama, Qwen and Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n, so relative position becomes a first-class operation, not a learned approximation.

Softmax + Temperature

$\text{softmax}(x_i; \tau) = \frac{e^{x_i / \tau}}{\sum_j e^{x_j / \tau}}$

τ → 0: argmax (sharp)
τ = 1: standard softmax
τ → ∞: uniform

Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
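The three regimes, as a small NumPy sketch. The function name softmax_t and the example logits are illustrative.

```python
import numpy as np

def softmax_t(x, tau=1.0):
    z = np.asarray(x, dtype=float) / tau
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
sharp = softmax_t(logits, tau=0.01)    # nearly one-hot on the argmax
plain = softmax_t(logits, tau=1.0)     # standard softmax
flat = softmax_t(logits, tau=100.0)    # nearly uniform
```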

Standardize vs Normalize

import numpy as np

# Standardize: mean 0, std 1 — assumes Gaussian-ish
x = (x - x.mean()) / x.std()

# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())

# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))

Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.

Stratified Split

from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.

SVD in One Line

$A = U \Sigma V^T$

U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
Columns of V: input directions
Columns of U: output directions
σᵢ: stretch factor along each direction

Why it matters: Every matrix is a rotation (or reflection), a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm (Eckart–Young). That result is the core of PCA, the motivation for low-rank methods like LoRA, and half of numerical linear algebra.
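Truncation in NumPy, on a random matrix chosen for illustration. The Frobenius error of the rank-k approximation equals the root of the summed squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
k = 2

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # S is sorted descending
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]        # keep the top-k directions

err = np.linalg.norm(A - A_k, ord="fro")
predicted = np.sqrt(np.sum(S[k:] ** 2))            # energy in the dropped directions
```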

Two Pointers — The Canonical Pattern

def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target:  l += 1
        else:           r -= 1
    return []

Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.

uv — Fast Python Package Management

# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync  # reproducible install from uv.lock

Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.

Calculating Average Precision (AP) without Sklearn

import numpy as np

def calculate_ap(recalls, precisions):
    # Make precision monotonically non-increasing in recall (all-point interpolation)
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))

    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])

    # Area under the step curve: sum of (recall increment) x (interpolated precision)
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap

Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.

Bessel's Correction in Variance Calculation

import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Population Variance (N)
pop_var = np.var(data)

# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)

Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.
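The bias shows up clearly in simulation. This sketch draws many small samples from a normal distribution with known variance (n = 5 and variance 4 are arbitrary choices) and averages both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0
samples = rng.normal(scale=np.sqrt(true_var), size=(200_000, n))

biased = samples.var(axis=1, ddof=0).mean()     # expect about (n-1)/n * 4 = 3.2
unbiased = samples.var(axis=1, ddof=1).mean()   # expect about 4.0
```

The two estimators differ by exactly the factor n / (n − 1), which is why the gap only matters for small n.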

Representative Centroid Selection for Long-Context RAG

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    embeddings = np.asarray(embeddings)
    # Instead of taking the top-K similar, cluster into K groups for diversity
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Find the actual vectors closest to these centroids
    idx = pairwise_distances_argmin(kmeans.cluster_centers_, embeddings)
    return embeddings[idx]

Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.

The Log-Sum-Exp Trick for Softmax

import numpy as np

def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))

Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.
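What the failure looks like without the shift, next to the stable version. The function names and the logits around 1000 are illustrative.

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)                 # overflows to inf for large logits
    return e / e.sum()

def shifted_softmax(x):
    e = np.exp(x - np.max(x))     # largest exponent is exactly 0
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    broken = naive_softmax(logits)     # inf / inf gives nan
stable = shifted_softmax(logits)
```

Softmax is invariant to adding a constant to every logit, so the shifted result matches what softmax of [0, 1, 2] would give.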

Vectorized Covariance Matrix Calculation

import numpy as np

def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Same result as np.cov(X, rowvar=False), written out as one matmul
    return (X_centered.T @ X_centered) / (n - 1)

Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.
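A sanity check against NumPy's own routine, on random data. The function is repeated here so the snippet runs on its own.

```python
import numpy as np

def fast_covariance(X):
    X_centered = X - X.mean(axis=0)
    return (X_centered.T @ X_centered) / (X.shape[0] - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

ours = fast_covariance(X)
reference = np.cov(X, rowvar=False)   # rowvar=False: columns are features
```

Note that np.cov treats rows as variables by default, so forgetting rowvar=False on an (n_samples, n_features) matrix silently returns an n_samples × n_samples result.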

Python @dataclass

The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options:

frozen=True - immutable instances
order=True - enables comparison operators
slots=True - use __slots__ for memory efficiency (Python 3.10+)
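A short sketch of two of these options together. The Version class is a made-up example.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True, order=True)
class Version:
    major: int
    minor: int = 0

v1, v2 = Version(1, 2), Version(1, 10)

is_less = v1 < v2                 # compared field by field, like tuples
as_key = {v1: "old", v2: "new"}   # frozen + eq makes instances hashable

try:
    v1.major = 3                  # assignment is blocked on frozen instances
    mutated = True
except FrozenInstanceError:
    mutated = False
```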
