Technical Digital Garden

Bits

A collection of atomic notes, code snippets, and technical 'cheats' I’ve gathered over the years. These are unpolished references intended for quick utility rather than narrative reading.

Quick-fire references

Scan the grid, filter by utility tags, and grab the snippet you need without diving into long-form posts.

Deep Learning

GELU / SwiGLU from Scratch

Deep Learning
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

Why it matters: The gate is a content-dependent multiplier on each hidden unit, so the FFN learns which features to let through, not just how to mix them. Trap: a vanilla FFN uses d_ff = 4 * d_model with 2 matrices; SwiGLU has 3, so matching the parameter count needs d_ff ≈ (8/3) * d_model. Llama uses this ratio, rounded up to a multiple of 256; other SwiGLU models such as Mistral pick a larger d_ff on purpose. Copying 4x into a SwiGLU silently inflates the block by 50%.

Full implementation in rlvr-from-scratch/model/ffn.py.
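The parameter arithmetic behind the 8/3 rule can be checked directly. This is a minimal sketch, assuming bias-free linear layers as in the block above; the helper names and the `d_model` value are illustrative, not taken from the repo.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Vanilla FFN: up and down projections, no biases."""
    return 2 * d_model * d_ff


def swiglu_params(d_model: int, d_ff: int) -> int:
    """SwiGLU FFN: gate, up and down projections, no biases."""
    return 3 * d_model * d_ff


d_model = 4096  # illustrative width
baseline = ffn_params(d_model, 4 * d_model)
naive = swiglu_params(d_model, 4 * d_model)            # 4x copied into SwiGLU
matched = swiglu_params(d_model, round(8 * d_model / 3))

print(naive / baseline)    # 1.5
print(matched / baseline)  # very close to 1.0
```

In practice d_ff is usually rounded to a hardware-friendly multiple, so the match is approximate rather than exact.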

Deep Learning

Pre-Norm vs Post-Norm: Why It Matters

Deep Learning
# Post-norm (Vaswani 2017): norm after the add
def post_norm(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

# Pre-norm (modern default): norm before the sublayer
def pre_norm(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x

Why it matters: In pre-norm the residual stream is never normalized inside the blocks, so every block has a clean identity gradient path back to the input. Deep stacks (≥12 layers) can train with little or no learning-rate warmup and tolerate worse init. Post-norm, by contrast, needs careful warmup schedules to avoid divergence (Xiong et al. 2020, On Layer Normalization in the Transformer Architecture). Most modern LLMs are pre-norm for exactly this reason.

Deep Learning

RMSNorm vs LayerNorm — When and Why

Deep Learning
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    mu  = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    mean_sq = x.pow(2).mean(-1, keepdim=True)  # mean square; rsqrt below gives 1 / RMS
    return x * torch.rsqrt(mean_sq + eps) * gamma

Why it matters: RMSNorm drops the mean-centering step and the bias, normalizing by the root-mean-square instead of the standard deviation: same gain term, fewer ops, and a reported 7–64% shorter running time depending on the model (Zhang & Sennrich 2019). The empirical finding that justified the simplification is that re-centering isn’t actually needed for transformer training to converge, which is why Llama, Qwen, Mistral and Gemma all use RMSNorm by default.
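To see what dropping the re-centering changes, here is a small NumPy sketch (an assumption-laden toy: gain fixed to 1, no bias, random data). On zero-mean input the two norms coincide because the mean square equals the variance; on shifted input they do not.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-5):
    mean_sq = (x ** 2).mean(-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
x_centered = x - x.mean(-1, keepdims=True)

# Zero-mean input: mean(x^2) equals the variance, so the outputs agree.
same_when_centered = np.allclose(layer_norm(x_centered), rms_norm(x_centered))
# Non-zero mean: LayerNorm removes the offset, RMSNorm keeps it.
same_when_shifted = np.allclose(layer_norm(x), rms_norm(x))
```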

Deep Learning

RoPE: Rotary Position Embedding in 15 Lines

Deep Learning
import torch

def precompute_freqs_cis(d_k, end, base=10000.0):
    # One frequency per (even, odd) channel pair; d_k must be even
    freqs = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    t = torch.arange(end, dtype=torch.float32)           # positions 0..end-1
    angles = torch.outer(t, freqs)                       # (T, d_k/2)
    return torch.polar(torch.ones_like(angles), angles)  # complex

def apply_rope(x, freqs_cis):
    # x: (B, H, T, d_k) — RoPE runs per-head, so d_k not d_model
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rot = x_complex * freqs_cis                        # broadcasts over B, H
    return torch.view_as_real(x_rot).flatten(-2).type_as(x)

Why it matters: RoPE rotates Q and K by position-dependent angles, so the dot product Q_iᵀ K_j ends up depending only on the relative offset i−j. Relative position is baked into the attention score itself, with no embedding to add to the input. That’s why it tends to handle longer contexts more gracefully than sinusoidal or learned absolute embeddings, and why most current open LLMs (Llama, Qwen, Mistral, Gemma) use it.
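The relative-offset property can be checked numerically. The sketch below is a single-vector NumPy re-implementation written for illustration (the `rope` helper is hypothetical, not the batched function above); it pairs consecutive channels the same way `view_as_complex` does.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate one vector x of even length d_k to position pos."""
    d_k = x.shape[-1]
    freqs = 1.0 / (base ** (np.arange(0, d_k, 2) / d_k))
    # Treat consecutive pairs (x0, x1), (x2, x3), ... as complex numbers.
    x_complex = x[0::2] + 1j * x[1::2]
    rotated = x_complex * np.exp(1j * pos * freqs)
    out = np.empty_like(x)
    out[0::2], out[1::2] = rotated.real, rotated.imag
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (3) at two different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
# A different offset, for comparison.
score_other = rope(q, 5) @ rope(k, 4)
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs.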

Deep Learning

Scaled Dot-Product Attention in 20 Lines

Deep Learning
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V

Why it matters: For unit-variance Q and K entries, the dot product has variance d_k, so without the √d_k divisor the scores grow with dimension and push softmax into its saturated tail, where gradients vanish and training stalls. The causal mask sets future positions to -inf before softmax, so their weights come out as exactly zero and each token’s representation depends only on its past, which is the property that makes autoregressive generation well-defined.
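A quick simulation of the scaling argument, assuming independent unit-variance entries in Q and K (real activations only approximate this):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 20_000

# Unit-variance queries and keys, one dot product per row.
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
raw = (q * k).sum(-1)
scaled = raw / np.sqrt(d_k)

raw_var = raw.var()        # roughly d_k
scaled_var = scaled.var()  # roughly 1
```

Dividing by √d_k brings the score variance back to about 1 regardless of head size, which keeps softmax out of its flat regions at initialization.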

Deep Learning

Pre-Norm vs Post-Norm: Why It Matters

Deep Learning

The two placement strategies, gradient flow implications.

Post-Norm (original transformer):

x → Attention → Add(x) → Norm → FFN → Add → Norm

Pre-Norm (modern default):

x → Norm → Attention → Add(x) → Norm → FFN → Add

Pre-Norm enables stable training at depth because the residual path is unobstructed — gradients flow directly through the skip connection without passing through normalization.

Draft — expanded in Building a Transformer.

Deep Learning

RoPE: Rotary Position Embedding in 15 Lines

Deep Learning

RoPE implementation — rotation matrix applied to Q and K.

import torch

def apply_rope(x, freqs_cis):
    """
    x: (batch, heads, seq_len, d_k)
    freqs_cis: (seq_len, d_k // 2) complex frequencies
    """
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)

Draft — full derivation coming in the Positional Encoding foundation post.

Deep Learning

Scaled Dot-Product Attention in 20 Lines

Deep Learning

The complete attention function in 20 lines of PyTorch, annotated.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (B, H, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)              # (B, H, T, T)
    return weights @ V                                # (B, H, T, d_k)

Draft — full annotation coming in the Attention from Scratch foundation post.

ML Theory

Bias-Variance Decomposition

ML Theory

$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

  • Bias: model too simple to capture f
  • Variance: model too sensitive to training sample
  • Noise: irreducible — the floor

Why it matters: Every regularization choice (dropout, weight decay, early stopping) accepts a bit more bias in exchange for lower variance. Knowing which side you’re on tells you which knob to turn.

Algorithms

Binary Search (Safe Midpoint)

Algorithms
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2   # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target:  lo = mid + 1
        else:                  hi = mid - 1
    return -1

Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
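One way the template extends to `bisect_left` is sketched below. The `lower_bound` name is illustrative; the half-open range and the `hi = mid` update are the only changes from the exact-match version.

```python
from bisect import bisect_left


def lower_bound(arr, target):
    """Index of the first element >= target (len(arr) if there is none)."""
    lo, hi = 0, len(arr)          # half-open range [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid              # mid may still be the answer, so keep it
    return lo


data = [1, 3, 3, 3, 7, 9]
first_three = lower_bound(data, 3)   # 1
insert_at = lower_bound(data, 4)     # 4
```

For sorted input this returns the same index as the standard library's `bisect_left`.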

Deep Learning

Causal Mask as an Additive Tensor

Deep Learning
import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)

Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
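A sketch of the composition, assuming right-padded batches described by a `lengths` tensor. The `padding_mask` helper and the shapes are illustrative choices, not part of the snippet above.

```python
import torch


def causal_mask(T):
    # -inf strictly above the diagonal, 0.0 elsewhere
    return torch.full((T, T), float("-inf")).triu(diagonal=1)


def padding_mask(lengths, T):
    # lengths: (B,) number of real tokens per sequence
    # returns (B, 1, 1, T): -inf on padded key positions, 0.0 elsewhere
    is_pad = torch.arange(T)[None, :] >= lengths[:, None]          # (B, T)
    mask = torch.zeros(len(lengths), T).masked_fill(is_pad, float("-inf"))
    return mask[:, None, None, :]


T = 4
lengths = torch.tensor([4, 2])                         # second sequence has 2 pad tokens
combined = causal_mask(T) + padding_mask(lengths, T)   # broadcasts to (B, 1, T, T)
```

Adding `combined` to the attention scores applies both constraints at once. One caveat: a query row whose keys are all masked would produce NaN after softmax, so left-padded batches need extra care.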

Math

Chain Rule for Vectors

Math

$$y = f(g(x)), \quad \frac{\partial y}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$$

Shapes: (k × m) · (m × n) = (k × n).

Why it matters: Backprop is this rule applied from the output back toward the input, never materializing full Jacobians: autograd records each op’s backward function and uses it to compute a vector-Jacobian product.
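For linear maps the Jacobians are just matrices, which makes the ordering easy to see. This sketch uses random matrices as stand-ins for the two Jacobians, with shapes following the (k × m) · (m × n) convention above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 3

J_g = rng.normal(size=(m, n))   # dg/dx, shape (m x n)
J_f = rng.normal(size=(k, m))   # df/dg, shape (k x m)
v = rng.normal(size=k)          # upstream gradient dL/dy, shape (k,)

# Full Jacobian first, then contract: builds a (k x n) matrix.
grad_full = v @ (J_f @ J_g)

# Backprop order: push the vector through one op at a time.
grad_vjp = (v @ J_f) @ J_g      # only ever holds vectors of size m, then n
```

Both give the same gradient; the second never forms a matrix-matrix product, which is the saving that matters when the Jacobians are large.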

Deep Learning

Why `.contiguous()` After `transpose()`

Deep Learning
# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)

# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)

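Why it matters: transpose returns a view with permuted strides. view needs strides compatible with the new shape, which a transposed tensor generally lacks. .contiguous() is the copy that fixes the layout (.reshape() makes the same copy only when needed). You hit this exactly once, then never forget.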
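A self-contained reproduction, with illustrative sizes for the merge-heads case:

```python
import torch

B, H, T, d_k = 2, 4, 3, 8
x = torch.randn(B, H, T, d_k)

y = x.transpose(1, 2)            # (B, T, H, d_k), same storage, permuted strides
was_contiguous = y.is_contiguous()

try:
    y.view(B, T, H * d_k)
    view_failed = False
except RuntimeError:
    view_failed = True

merged = y.contiguous().view(B, T, H * d_k)   # explicit copy, then view
merged_alt = y.reshape(B, T, H * d_k)         # reshape copies only when it has to
```

Both `merged` and `merged_alt` hold the same values; the difference is only whether the copy is spelled out.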

Math

Eigendecomposition: What It Buys You

Math

$$A = Q \Lambda Q^{-1} \quad\Rightarrow\quad A^t = Q \Lambda^t Q^{-1}$$

For $x_{t+1} = Ax_t$, iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.

Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
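A numerical sketch of both points, assuming a symmetric A so that the eigenvalues are real and Q⁻¹ = Qᵀ (a general matrix would need `np.linalg.eig` and complex arithmetic):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                                    # symmetric: real eigenvalues
A = 0.9 * A / np.abs(np.linalg.eigvalsh(A)).max()    # rescale so rho(A) = 0.9

lam, Q = np.linalg.eigh(A)             # A = Q diag(lam) Q^T
t = 50

A_t_eig = Q @ np.diag(lam ** t) @ Q.T  # power the diagonal only
A_t_ref = np.linalg.matrix_power(A, t) # reference result

x0 = rng.normal(size=4)
x_t = A_t_eig @ x0                     # state after t steps of x_{t+1} = A x_t
```

With ρ(A) = 0.9 the state shrinks by at least a factor of 0.9 per step, so `x_t` is much smaller than `x0`.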

Code

Einops for Tensor Reshaping

Code
from einops import rearrange

# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)

# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")

Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.

Deep Learning

He Initialization

Deep Learning
import torch.nn as nn

# For ReLU / GELU activations (w is a weight tensor, e.g. layer.weight)
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, std²) with std = sqrt(2 / fan_in)

Why it matters: Variance-preserving across ReLU layers (ReLU zeroes roughly half the activations, so the variance gets a ×2 to compensate). Xavier scales by 2 / (fan_in + fan_out), which suits tanh/sigmoid; in deep ReLU nets it lets activations and gradients shrink layer by layer.
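The ×2 can be seen in a toy forward pass. This is a sketch under simplifying assumptions: square bias-free layers, Gaussian inputs, and a hypothetical `final_mean_square` helper. For square layers, variance 1 / fan_in is what Xavier reduces to.

```python
import numpy as np


def final_mean_square(var_gain, width=512, depth=10, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers, relative to the input."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    start = (h ** 2).mean()
    for _ in range(depth):
        # Weight variance = var_gain / fan_in
        W = rng.normal(scale=np.sqrt(var_gain / width), size=(width, width))
        h = np.maximum(h @ W, 0.0)     # linear layer followed by ReLU
    return (h ** 2).mean() / start


he_ratio = final_mean_square(var_gain=2.0)      # std = sqrt(2 / fan_in)
small_ratio = final_mean_square(var_gain=1.0)   # std = sqrt(1 / fan_in)
```

With gain 2 the signal magnitude stays near its starting level; with gain 1 it roughly halves at every layer, so ten layers lose about three orders of magnitude.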

ML Theory

Jensen's Inequality

ML Theory

For convex f: $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$. For concave f (e.g., log): $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$.

Why it matters: The whole derivation of the ELBO in variational inference starts here. It is also why $\log \mathbb{E}[\cdot]$ and $\mathbb{E}[\log \cdot]$ aren’t interchangeable: in the ELBO, the gap between them is exactly the KL divergence from the variational posterior to the true posterior.
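A sampled check of the concave case, using a log-normal X chosen only because its moments are known in closed form (log E[X] = 0.5 and E[log X] = 0 for these parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # positive samples

log_of_mean = np.log(x.mean())     # log E[X]
mean_of_log = np.log(x).mean()     # E[log X]
gap = log_of_mean - mean_of_log    # >= 0 because log is concave
```

The gap is non-negative for any positive sample, since Jensen's inequality also holds for the empirical distribution.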

Deep Learning

KV-Cache in Five Lines

Deep Learning
if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)

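Why it matters: Without the cache, every decoding step re-projects and re-attends over the whole prefix, which is O(T²) attention work per step. With it, each step costs O(1) projections for the new token plus O(T) attention against the cached keys and values. Single biggest inference optimization.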
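A toy decode loop showing that the cached path reproduces full causal attention. The random `Q_all`, `K_all`, `V_all` tensors stand in for the outputs of the projection layers, and `attend` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d_k = 1, 2, 6, 8
Q_all = torch.randn(B, H, T, d_k)
K_all = torch.randn(B, H, T, d_k)
V_all = torch.randn(B, H, T, d_k)


def attend(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V


# Reference: full causal attention over all T positions at once.
causal = torch.full((T, T), float("-inf")).triu(diagonal=1)
scores = Q_all @ K_all.transpose(-2, -1) / d_k ** 0.5 + causal
full_out = F.softmax(scores, dim=-1) @ V_all

# Incremental decoding: one new token per step, cache grows along dim=2.
kv_cache = None
steps = []
for t in range(T):
    K, V = K_all[:, :, t:t + 1], V_all[:, :, t:t + 1]      # new token only
    if kv_cache is not None:
        K_prev, V_prev = kv_cache
        K = torch.cat([K_prev, K], dim=2)
        V = torch.cat([V_prev, V], dim=2)
    kv_cache = (K, V)
    steps.append(attend(Q_all[:, :, t:t + 1], K, V))       # (B, H, 1, d_k)
cached_out = torch.cat(steps, dim=2)
```

No mask is needed in the cached path: the single query at step t only ever sees keys 0..t.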

ML Theory

Maximum Likelihood = Minimum Cross-Entropy

ML Theory

$$\arg\max_\theta \sum_i \log p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \;-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_i)$$

The RHS is cross-entropy between the empirical distribution and the model.

Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.

Deep Learning

RoPE Applies at `d_k`, Not `d_model`

Deep Learning
# WRONG — rotate before head split
x = apply_rope(x, freqs)            # (B, T, d_model)
Q, K, V = split_heads(project(x))   # breaks relative-position property

# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x))   # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)

Why it matters: RoPE encodes relative position through 2D rotations of Q and K inside each head. Rotate the input at d_model instead and the Q/K projections mix the rotated pairs (and V gets rotated too), losing the ⟨q_m, k_n⟩ = f(m−n) property. Standard RoPE implementations rotate at d_k.

Tools

Seed Everything for Reproducibility

Tools
import random, numpy as np, torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Why it matters: Same seed + same code (on the same hardware and library versions) → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.

Deep Learning

Sinusoidal vs Learned vs Rotary

Deep Learning
| | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Type | Fixed | Parameter | Fixed |
| Applied at | Input embedding | Input embedding | Q and K inside attention |
| Extrapolates past max_len | Yes | No | Yes |
| Encodes relative position | Weakly | No | Explicitly |

Why it matters: LLaMA, Qwen, Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n — relative position becomes a first-class operation, not a learned approximation.

Math

Softmax + Temperature

Math

$$\text{softmax}(x_i; \tau) = \frac{e^{x_i / \tau}}{\sum_j e^{x_j / \tau}}$$

  • τ → 0: argmax (sharp)
  • τ = 1: standard softmax
  • τ → ∞: uniform

Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
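The three regimes can be seen on a tiny example. This is a sketch with made-up logits, and `softmax_t` is an illustrative helper name.

```python
import numpy as np


def softmax_t(logits, tau=1.0):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


logits = [2.0, 1.0, 0.0]
sharp = softmax_t(logits, tau=0.05)    # nearly one-hot on the largest logit
plain = softmax_t(logits, tau=1.0)     # standard softmax
flat = softmax_t(logits, tau=100.0)    # nearly uniform
```

Lower τ concentrates mass on the top logit; higher τ spreads it out, which is why raising the temperature makes LLM sampling more diverse.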

Data

Standardize vs Normalize

Data
import numpy as np

# Standardize: mean 0, std 1 (assumes Gaussian-ish data)
x = (x - x.mean()) / x.std()

# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())

# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))

Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.

Data

Stratified Split

Data
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.

Math

SVD in One Line

Math

$$A = U \Sigma V^T$$

  • U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
  • Columns of V: input directions
  • Columns of U: output directions
  • σᵢ: stretch factor along each direction

Why it matters: Every matrix is a rotation (or reflection), a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm (Eckart–Young). This is the machinery behind PCA, the low-rank idea behind LoRA, and half of numerical linear algebra.
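A short NumPy sketch of rank-k truncation on a random matrix, with sizes chosen arbitrarily. The Frobenius error of the truncation equals the root-sum-square of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted in descending order
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # keep the top-k singular triplets

err = np.linalg.norm(A - A_k, ord="fro")
predicted = np.sqrt((s[k:] ** 2).sum())            # from the discarded singular values
```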

Algorithms

Two Pointers — The Canonical Pattern

Algorithms
def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target:  l += 1
        else:           r -= 1
    return []

Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.

Tools

uv — Fast Python Package Management

Tools
# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync  # reproducible install from uv.lock

Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.

Algorithms

Calculating Average Precision (AP) without Sklearn

Algorithms
import numpy as np

def calculate_ap(recalls, precisions):
    # Pad the curve with sentinels, then take the precision envelope
    # (all-point interpolation): precision never increases as recall grows
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))

    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])

    # Area under the step curve: sum of rectangles wherever recall changes
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap

Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.

Data

Bessel's Correction in Variance Calculation

Data
import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Population Variance (N)
pop_var = np.var(data)

# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)

Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.

Algorithms

Representative Centroid Selection for Long-Context RAG

Algorithms
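from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    # Instead of taking the top-K most similar, cluster and keep one vector per cluster
    embeddings = np.asarray(embeddings)
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Centroids are synthetic means, so map each one to its closest actual vector
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    return embeddings[closest]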

Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.

Deep Learning

The Log-Sum-Exp Trick for Softmax

Deep Learning
import numpy as np

def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))

Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.
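The failure mode is easy to reproduce. A sketch with deliberately large logits; `naive_softmax` is written here only to show the overflow.

```python
import numpy as np


def naive_softmax(x):
    e = np.exp(x)
    return e / e.sum()


def stable_softmax(x):
    shifted = x - np.max(x)          # largest exponent becomes exp(0) = 1
    e = np.exp(shifted)
    return e / e.sum()


logits = np.array([1000.0, 999.0, 998.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)    # exp(1000) overflows to inf, inf / inf is nan

stable = stable_softmax(logits)
```

Subtracting the max changes nothing mathematically, because the shift cancels between numerator and denominator, but it keeps every exponent at or below zero.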

Code

Vectorized Covariance Matrix Calculation

Code
import numpy as np

def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Same result as np.cov(X, rowvar=False), written out as one centered matmul
    return (X_centered.T @ X_centered) / (n - 1)

Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.

Code

Python @dataclass

Code

The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options:

  • frozen=True - immutable instances
  • order=True - enables comparison operators
  • slots=True - use __slots__ for memory efficiency (Python 3.10+)
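A small sketch of `frozen` and `order` together, using a hypothetical `Version` class. `slots=True` is left out so the example also runs on Python versions before 3.10.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True, order=True)
class Version:
    major: int
    minor: int = 0


a, b = Version(1, 2), Version(1, 10)

is_older = a < b                     # order=True compares fields as a tuple
text = repr(a)                       # auto-generated __repr__

try:
    a.major = 2                      # frozen=True blocks attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Frozen dataclasses with the default `eq=True` are also hashable, so instances can be used as dict keys or set members.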