Technical Digital Garden
Bits
A collection of atomic notes, code snippets, and technical 'cheats' I’ve gathered over the years. These are unpolished references intended for quick utility rather than narrative reading.
Quick-fire references
Scan the grid, filter by utility tags, and grab the snippet you need without diving into long-form posts.
GELU / SwiGLU from Scratch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU-gated branch multiplies the linear "up" branch elementwise
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
Why it matters: The gate is a content-dependent multiplier on each hidden unit, so the FFN learns which features to let through, not just how to mix them. Trap: a vanilla FFN uses d_ff = 4 * d_model with 2 weight matrices; SwiGLU has 3, so matching the parameter count requires d_ff ≈ (8/3) * d_model. The original LLaMA models follow this rule (rounded to a hardware-friendly multiple); not every SwiGLU model keeps exactly 8/3, so check the config before copying a ratio. Copying 4x into a SwiGLU silently inflates the block by ~50%.
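The parameter-count trap above can be checked with plain arithmetic. This is an illustrative sketch only: the helper names and d_model = 4096 are assumptions, and biases are ignored because the block above uses bias=False.

```python
def vanilla_ffn_params(d_model: int, d_ff: int) -> int:
    # Two weight matrices: d_model -> d_ff and d_ff -> d_model
    return 2 * d_model * d_ff


def swiglu_params(d_model: int, d_ff: int) -> int:
    # Three weight matrices: gate and up (d_model -> d_ff), down (d_ff -> d_model)
    return 3 * d_model * d_ff


d_model = 4096  # assumed width, for illustration only
baseline = vanilla_ffn_params(d_model, 4 * d_model)

naive = swiglu_params(d_model, 4 * d_model)                # 4x copied into SwiGLU
matched = swiglu_params(d_model, round(8 * d_model / 3))   # the ~8/3 rule

naive_ratio = naive / baseline
matched_ratio = matched / baseline
print(f"naive 4x SwiGLU: {naive_ratio:.3f}x baseline")
print(f"8/3 SwiGLU:      {matched_ratio:.3f}x baseline")
```

Real configs usually round the 8/3 width to a multiple that suits the hardware, so the match is approximate rather than exact.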
Full implementation in rlvr-from-scratch/model/ffn.py.
Pre-Norm vs Post-Norm: Why It Matters
# Post-norm (Vaswani 2017): norm after the add
def post_norm(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

# Pre-norm (modern default): norm before the sublayer
def pre_norm(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x
Why it matters: In pre-norm the residual stream itself is never normalized inside the blocks, so every block has a clean identity gradient path back to the input. Deep stacks (≥12 layers) can train without learning-rate warmup and tolerate worse init. Post-norm, by contrast, needs careful warmup schedules to avoid divergence (Xiong et al. 2020, On Layer Normalization in the Transformer Architecture). Most modern LLMs are pre-norm for this reason.
RMSNorm vs LayerNorm — When and Why
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # Mean of squares; the square root is taken by rsqrt below
    mean_sq = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(mean_sq + eps) * gamma
Why it matters: RMSNorm drops the mean-centering step and the bias, normalizing by the root-mean-square instead of the standard deviation. Same gain term, fewer ops: Zhang & Sennrich 2019 report running-time reductions of roughly 7–64% depending on the model. The empirical finding that justified the simplification: re-centering isn’t actually needed for transformer training to converge, which is why Llama, Qwen, Mistral and Gemma all use RMSNorm by default.
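One way to see what dropping the mean-centering step means: on inputs that are already zero-mean, the variance equals the mean square, so the two norms coincide (with beta = 0). The sketch below uses NumPy ports of the two functions; the `_np` names and the random test data are assumptions made for illustration.

```python
import numpy as np


def layer_norm_np(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)  # population variance (ddof=0)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta


def rms_norm_np(x, gamma, eps=1e-5):
    mean_sq = (x ** 2).mean(-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * gamma


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
gamma = rng.normal(size=16)
beta = np.zeros(16)

# On zero-mean rows the variance equals the mean square, so the two agree.
x_centered = x - x.mean(-1, keepdims=True)
same_when_centered = np.allclose(
    layer_norm_np(x_centered, gamma, beta), rms_norm_np(x_centered, gamma)
)

# With a large offset, RMSNorm keeps the mean while LayerNorm removes it.
x_shifted = x + 5.0
differs_when_shifted = not np.allclose(
    layer_norm_np(x_shifted, gamma, beta), rms_norm_np(x_shifted, gamma)
)
```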
RoPE: Rotary Position Embedding in 15 Lines
import torch

def precompute_freqs_cis(d_k, end, base=10000.0):
    # One rotation frequency per 2D pair of channels
    freqs = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    t = torch.arange(end).float()
    angles = torch.outer(t, freqs)  # (end, d_k/2)
    return torch.polar(torch.ones_like(angles), angles)  # complex, unit modulus

def apply_rope(x, freqs_cis):
    # x: (B, H, T, d_k); RoPE runs per-head, so d_k not d_model
    # freqs_cis: (T, d_k/2); slice the precomputed table to the current T first
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rot = x_complex * freqs_cis  # broadcasts over B, H
    return torch.view_as_real(x_rot).flatten(-2).type_as(x)
Why it matters: RoPE rotates Q and K by position-dependent angles, so the dot product Q_iᵀ K_j ends up depending only on the relative offset i−j — relative position is baked into the attention score itself, no embedding to add to the input. That’s why it extrapolates to longer contexts more gracefully than sinusoidal or learned absolute embeddings, and why every modern LLM (Llama, Qwen, Mistral, Gemma) uses it.
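The relative-offset property is easy to check numerically. The sketch below is a NumPy re-implementation for a single vector (the name `rope_np` and the random q, k are assumptions, not part of the snippet above); it pairs adjacent channels the same way the reshape(..., -1, 2) call does.

```python
import numpy as np


def rope_np(x, pos, base=10000.0):
    """Rotate a single (d_k,) vector as if it sat at integer position `pos`."""
    d_k = x.shape[-1]
    freqs = 1.0 / base ** (np.arange(0, d_k, 2) / d_k)   # (d_k/2,)
    pairs = x[0::2] + 1j * x[1::2]                        # adjacent channels as complex
    rotated = pairs * np.exp(1j * pos * freqs)            # rotate each 2D pair
    out = np.empty_like(x)
    out[0::2], out[1::2] = rotated.real, rotated.imag
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

score_near = rope_np(q, 5) @ rope_np(k, 2)         # offset 3
score_far = rope_np(q, 1005) @ rope_np(k, 1002)    # same offset, shifted by 1000
score_other = rope_np(q, 5) @ rope_np(k, 4)        # offset 1
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it.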
Scaled Dot-Product Attention in 20 Lines
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V
Why it matters: For unit-variance queries and keys, the raw dot product has variance d_k, so without the √d_k divisor the logits grow with dimension and push softmax into its saturated tail, where gradients vanish and training can stall at typical head sizes (d_k ≥ 64). The causal mask sets future positions to -inf before softmax, so their attention weights come out as exactly zero and each token’s representation depends only on its past. That property is what makes autoregressive generation well-defined.
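The growth of unscaled dot products can be checked empirically. A small sketch, assuming q and k entries are independent standard normals (the helper name and sample count are illustrative choices):

```python
import numpy as np


def dot_product_variances(d_k, n_samples=10_000, seed=0):
    """Return (variance of raw q.k, variance of q.k / sqrt(d_k))."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    raw = (q * k).sum(-1)              # unscaled dot products
    scaled = raw / np.sqrt(d_k)        # what attention feeds to softmax
    return raw.var(), scaled.var()


for d_k in (16, 64, 128):
    raw_var, scaled_var = dot_product_variances(d_k)
    print(f"d_k={d_k}: raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

The raw variance tracks d_k while the scaled variance stays near 1, which keeps the softmax input in a similar range regardless of head size.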
Pre-Norm vs Post-Norm: Why It Matters
The two placement strategies, gradient flow implications.
Post-Norm (original transformer):
x → Attention → Add(x) → Norm → FFN → Add → Norm
Pre-Norm (modern default):
x → Norm → Attention → Add(x) → Norm → FFN → Add
Pre-Norm enables stable training at depth because the residual path is unobstructed — gradients flow directly through the skip connection without passing through normalization.
Draft — expanded in Building a Transformer.
RoPE: Rotary Position Embedding in 15 Lines
RoPE implementation — rotation matrix applied to Q and K.
import torch
def apply_rope(x, freqs_cis):
    """
    x: (batch, heads, seq_len, d_k)
    freqs_cis: (seq_len, d_k // 2) complex frequencies
    """
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)
Draft — full derivation coming in the Positional Encoding foundation post.
Scaled Dot-Product Attention in 20 Lines
The complete attention function in 20 lines of PyTorch, annotated.
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (B, H, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # (B, H, T, T)
    return weights @ V  # (B, H, T, d_k)
Draft — full annotation coming in the Attention from Scratch foundation post.
Bias-Variance Decomposition
$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2$
- Bias: model too simple to capture f
- Variance: model too sensitive to training sample
- Noise: irreducible — the floor
Why it matters: Every regularization choice (dropout, weight decay, early stopping) trades bias for variance. Knowing which side you’re on tells you which knob to turn.
Binary Search (Safe Midpoint)
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2  # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target: lo = mid + 1
        else: hi = mid - 1
    return -1
Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
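The extension to bisect_left / bisect_right can be sketched as below. The names `lower_bound` and `upper_bound` are borrowed from C++ and are illustrative; the only change from the exact-match template is a half-open range and a single comparison.

```python
from bisect import bisect_left, bisect_right


def lower_bound(arr, target):
    """First index i with arr[i] >= target (len(arr) if there is none)."""
    lo, hi = 0, len(arr)               # half-open range [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo


def upper_bound(arr, target):
    """First index i with arr[i] > target (len(arr) if there is none)."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo


data = [1, 3, 3, 3, 7, 9]
count_of_3 = upper_bound(data, 3) - lower_bound(data, 3)   # number of 3s
agrees_with_stdlib = (
    lower_bound(data, 3) == bisect_left(data, 3)
    and upper_bound(data, 3) == bisect_right(data, 3)
)
```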
Causal Mask as an Additive Tensor
import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    # -inf strictly above the diagonal (future positions), 0.0 elsewhere
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)
Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
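A small sketch of the "sum causal + padding" idea, written in NumPy to keep it dependency-light. The helper names, the 4-token sequence and the all-zero logits are assumptions for illustration.

```python
import numpy as np


def causal_mask_np(T):
    # 0.0 on and below the diagonal, -inf strictly above it
    rows, cols = np.arange(T)[:, None], np.arange(T)[None, :]
    return np.where(cols > rows, -np.inf, 0.0)


def padding_mask_np(is_real_token):
    # is_real_token: (T,) bool; padded key positions get -inf for every query
    return np.where(is_real_token, 0.0, -np.inf)[None, :]   # (1, T)


def softmax_rows(scores):
    shifted = scores - scores.max(-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(-1, keepdims=True)


T = 4
is_real_token = np.array([True, True, True, False])        # last position is padding
mask = causal_mask_np(T) + padding_mask_np(is_real_token)   # one combined (T, T) tensor

scores = np.zeros((T, T))                                   # hypothetical uniform logits
weights = softmax_rows(scores + mask)
```

Each row still sums to 1, future positions and the padded key get zero weight, and only one mask tensor is passed to attention. This works as long as every query row keeps at least one unmasked key.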
Chain Rule for Vectors
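$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$ for $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, $z \in \mathbb{R}^k$.
Shapes: (k × m) · (m × n) = (k × n).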
Why it matters: Backprop is this, applied right-to-left, never materializing full Jacobians — autograd stores each op’s vector-Jacobian product instead.
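To make the "never materializing full Jacobians" point concrete, here is a minimal NumPy sketch on a hypothetical two-layer function z = B tanh(A x). The full Jacobians are built only to check that chaining vector-Jacobian products gives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 3
A = rng.normal(size=(m, n))      # first op:  y = tanh(A x)
B = rng.normal(size=(k, m))      # second op: z = B y
x = rng.normal(size=n)

y = np.tanh(A @ x)

# Full Jacobians, materialized only to verify the shortcut.
dy_dx = (1.0 - y ** 2)[:, None] * A        # (m, n)
dz_dy = B                                   # (k, m)
dz_dx = dz_dy @ dy_dx                       # (k, n), the chain rule

# Reverse mode: push one upstream gradient v through each op's VJP.
v = rng.normal(size=k)                      # dL/dz for some scalar loss L
grad_y = v @ B                              # (m,)  VJP of z = B y
grad_x = (grad_y * (1.0 - y ** 2)) @ A      # (n,)  VJP of y = tanh(A x)
```

The VJP path only ever holds vectors of size k, m and n, whereas the explicit route needs the (m × n) and (k × n) matrices.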
Why `.contiguous()` After `transpose()`
# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)
# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)
Why it matters: transpose returns a view with permuted strides. view requires contiguous memory. .contiguous() is the copy. You hit this exactly once, then never forget.
Eigendecomposition: What It Buys You
For a diagonalizable $A = V \Lambda V^{-1}$, powers reduce to $A^t = V \Lambda^t V^{-1}$, so iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.
Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
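A quick numerical check of the shortcut, assuming a small symmetric matrix so that np.linalg.eigh applies and the eigenvector matrix is orthogonal. The matrix values and t = 50 are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.2, 0.5]])           # symmetric, so A = V diag(eigvals) V^T
t = 50

eigvals, V = np.linalg.eigh(A)
A_pow_eig = V @ np.diag(eigvals ** t) @ V.T    # power only the diagonal
A_pow_loop = np.linalg.matrix_power(A, t)      # repeated multiplication

spectral_radius = np.abs(eigvals).max()
```

Here the spectral radius is below 1, so the entries of A^t shrink as t grows; scaling A so that the largest |λ| exceeds 1 would make them blow up instead.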
Einops for Tensor Reshaping
from einops import rearrange
# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)
# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")
Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.
He Initialization
import torch.nn as nn
# For ReLU / GELU activations
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, std^2) with std = sqrt(2 / fan_in)
Why it matters: Variance-preserving across ReLU layers (ReLU zeroes roughly half the activations, so ×2 to compensate). Xavier (variance 2 / (fan_in + fan_out)) is designed for tanh/sigmoid; in deep ReLU nets it lets activations and gradients shrink layer by layer.
Jensen's Inequality
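For convex f: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. For concave f (e.g., log): $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$.
Why it matters: Whole derivation of ELBO / variational inference starts here. Also why $\log \mathbb{E}[\cdot]$ and $\mathbb{E}[\log \cdot]$ aren’t interchangeable: in the ELBO setting they differ by the KL between the variational posterior and the true posterior.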
KV-Cache in Five Lines
if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)
Why it matters: Without a cache, every decoding step re-projects and re-attends over the whole prefix, costing O(T) projections and O(T²) attention per step. With it, per-step cost drops to O(1) projection + O(T) attention. Single biggest inference optimization.
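The cache changes the cost, not the result. The NumPy sketch below uses a single head with hypothetical random weights and embeddings (all names are illustrative) and checks that decoding with appended K and V matches re-projecting the whole prefix at every step.

```python
import numpy as np


def attend(q, K, V):
    # q: (t_q, d), K and V: (t_k, d); plain single-head attention, no mask
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))                        # hypothetical token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Without a cache: at step t, re-project K and V for the whole prefix.
no_cache = np.stack([
    attend(x[t:t + 1] @ Wq, x[:t + 1] @ Wk, x[:t + 1] @ Wv)[0]
    for t in range(T)
])

# With a cache: project only the new token and append to the stored K, V.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
with_cache = []
for t in range(T):
    x_t = x[t:t + 1]
    K_cache = np.concatenate([K_cache, x_t @ Wk], axis=0)
    V_cache = np.concatenate([V_cache, x_t @ Wv], axis=0)
    with_cache.append(attend(x_t @ Wq, K_cache, V_cache)[0])
with_cache = np.stack(with_cache)
```

No causal mask is needed in the cached loop because the new query only ever sees keys that already exist.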
Maximum Likelihood = Minimum Cross-Entropy
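$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x_i) = \arg\min_\theta \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i) \right)$
The RHS is cross-entropy between the empirical distribution and the model.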
Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.
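For a categorical model the identity can be verified directly. The labels and model probabilities below are made-up illustrative values.

```python
import numpy as np

samples = np.array([0, 2, 1, 2, 2, 0, 1, 2])    # hypothetical class labels
model_probs = np.array([0.2, 0.3, 0.5])          # hypothetical model distribution

# Average negative log-likelihood of the samples under the model.
avg_nll = -np.log(model_probs[samples]).mean()

# Cross-entropy H(p_hat, q), with p_hat the empirical class frequencies.
p_hat = np.bincount(samples, minlength=3) / len(samples)
cross_entropy = -(p_hat * np.log(model_probs)).sum()

# The likelihood is maximized (cross-entropy minimized) when q equals p_hat.
ce_at_mle = -(p_hat * np.log(p_hat)).sum()
```

The two quantities are the same sum grouped differently: once per sample versus once per class weighted by its frequency.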
RoPE Applies at `d_k`, Not `d_model`
# WRONG — rotate before head split
x = apply_rope(x, freqs) # (B, T, d_model)
Q, K, V = split_heads(project(x)) # breaks relative-position property
# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x)) # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)
Why it matters: RoPE encodes relative position through 2D rotations in the per-head subspace. Apply at d_model and heads mix rotations, losing the ⟨q_m, k_n⟩ = f(m−n) property. Every modern LLM rotates at d_k.
Seed Everything for Reproducibility
import random, numpy as np, torch
def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Why it matters: Same seed + same code → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.
Sinusoidal vs Learned vs Rotary
| | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Type | Fixed | Parameter | Fixed |
| Applied at | Input embedding | Input embedding | Q and K inside attention |
| Extrapolates past max_len | Yes | No | Yes |
| Encodes relative position | Weakly | No | Explicitly |
Why it matters: LLaMA, Qwen, Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n — relative position becomes a first-class operation, not a learned approximation.
Softmax + Temperature
$\mathrm{softmax}(z / \tau)_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$
- τ → 0: argmax (sharp)
- τ = 1: standard softmax
- τ → ∞: uniform
Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
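The three regimes listed above can be seen on a toy example. A minimal NumPy sketch; the function name and the logits are illustrative assumptions.

```python
import numpy as np


def softmax_with_temperature(logits, tau):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


logits = [2.0, 1.0, 0.1]             # hypothetical logits
sharp = softmax_with_temperature(logits, 0.05)     # close to one-hot on the argmax
standard = softmax_with_temperature(logits, 1.0)   # ordinary softmax
flat = softmax_with_temperature(logits, 100.0)     # close to uniform
```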
Standardize vs Normalize
import numpy as np
# Standardize: mean 0, std 1; assumes Gaussian-ish data
x = (x - x.mean()) / x.std()
# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())
# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))
Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.
Stratified Split
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.
SVD in One Line
$A = U \Sigma V^\top$
- U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
- Columns of V: input directions
- Columns of U: output directions
- σᵢ: stretch factor along each direction
Why it matters: Every matrix is a rotation, a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm — this is PCA, LoRA, and half of numerical linear algebra.
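The rank-k claim has a concrete numerical signature: the Frobenius error of the truncation equals the square root of the sum of the discarded σᵢ². A short NumPy sketch on a hypothetical random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(S) @ Vt

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]           # keep the top-k directions

frob_error = np.linalg.norm(A - A_k)               # Frobenius norm by default
expected_error = np.sqrt((S[k:] ** 2).sum())       # energy in the discarded sigmas
```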
Two Pointers — The Canonical Pattern
def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target: l += 1
        else: r -= 1
    return []
Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.
uv — Fast Python Package Management
# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync # reproducible install from uv.lock
Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.
Calculating Average Precision (AP) without Sklearn
import numpy as np

def calculate_ap(recalls, precisions):
    # Expects recalls sorted ascending; pad with sentinels so the curve spans recall 0 to 1
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing (all-point interpolation envelope)
    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])
    # Area under the step curve: sum of (recall change) x (interpolated precision)
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap
Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.
Bessel's Correction in Variance Calculation
import numpy as np
data = [10, 12, 23, 23, 16, 23, 21, 16]
# Population Variance (N)
pop_var = np.var(data)
# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)
Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.
Representative Centroid Selection for Long-Context RAG
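from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    # Instead of taking the top-K similar, cluster and keep one vector per cluster
    embeddings = np.asarray(embeddings)
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Centroids are averages, not real chunks, so map each one to its nearest actual vector
    closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    return embeddings[closest_idx]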
Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.
The Log-Sum-Exp Trick for Softmax
import numpy as np

def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))
Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.
Vectorized Covariance Matrix Calculation
import numpy as np

def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # One centering pass and one matmul; matches np.cov(X, rowvar=False)
    return (X_centered.T @ X_centered) / (n - 1)
Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.
Python @dataclass
The @dataclass decorator auto-generates __init__, __repr__, and __eq__:
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"
Useful options:
- frozen=True: immutable instances
- order=True: enables comparison operators
- slots=True: use __slots__ for memory efficiency
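A short usage sketch of two of these options together. The `Version` class is a hypothetical example; slots=True is left out because it needs Python 3.10 or newer.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True, order=True)
class Version:          # hypothetical example class
    major: int
    minor: int = 0


v1, v2 = Version(1, 2), Version(1, 10)

is_older = v1 < v2                       # order=True compares fields as a tuple
newest = max([v2, v1, Version(0, 9)])

try:
    v1.minor = 3                         # frozen=True blocks attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

unique = {v1, Version(1, 2)}             # frozen + eq makes instances hashable
```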