Published on Mar 30, 2026 by Vitor Sousa
Personal project 🔨 In Development — Phase 1/5 GitHub →
Current phase: Phase 1 — Transformer Internals
Transformer · Pretrain · SFT · GRPO · GDPO

Why this project

Most people learn alignment from blog posts or paper summaries. That gives you intuition — but not the kind of understanding you get from writing the gradient update yourself, watching a reward signal reshape a model’s outputs, and debugging why your policy just collapsed.

This project is an invitation to follow along. Every phase produces code you can run, foundations articles that explain the theory from first principles, and blog posts that show what actually happened when the code met real data. If you’ve ever wanted to truly understand how RLHF and GRPO work — not just “conceptually” but at the level of tensors and loss curves — this is for you.

"nanoGPT for the alignment era." Karpathy showed you can pretrain in one file. This shows you can align in one file too.

What this project builds

A complete, from-scratch implementation of the modern LLM alignment pipeline — from raw attention to GRPO reward optimization:

Phase 1 — Transformer Internals (current)

Scaled dot-product attention, multi-head attention, positional encoding (sinusoidal + RoPE), layer normalization, RMSNorm, feed-forward networks, and a full decoder-only transformer. Every component has shape tests, numerical tests, and gradient tests.
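To give a flavor of the Phase 1 style, here is a minimal sketch of scaled dot-product attention with the kind of shape test each component carries (illustrative shapes; the actual repo's function names and test suite may differ):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k). Returns the same shape."""
    d_k = q.size(-1)
    # similarity scores, scaled to keep softmax gradients well-behaved
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # rows sum to 1
    return weights @ v

# shape test: attention preserves (batch, heads, seq, d_k)
q = k = v = torch.randn(2, 4, 8, 16)
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (2, 4, 8, 16)
```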

Phase 2 — Pretraining

Training loop with AdamW, cosine schedule, gradient clipping, mixed precision. Train a 50M parameter model on TinyStories. Loss curves, gradient norms, and generated text logged at every checkpoint.
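The training-loop pieces above fit together roughly like this sketch (a tiny stand-in model and made-up hyperparameters, not the actual Phase 2 config; mixed precision via `torch.amp` is omitted for brevity):

```python
import math
import torch

model = torch.nn.Linear(32, 32)  # stand-in for the 50M-param transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step, warmup=100, total=1000):
    # linear warmup, then cosine decay toward zero
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

x = torch.randn(8, 32)
loss = model(x).pow(2).mean()  # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
sched.step()
opt.zero_grad()
```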

Phase 3 — Supervised Fine-Tuning

SFT on GSM8K math reasoning. Instruction formatting, learning rate sweeps, baseline accuracy established for RLVR comparison.
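"Instruction formatting" here just means rendering each GSM8K (question, answer) pair into a consistent prompt template before tokenization. A hypothetical template (the actual Phase 3 format may differ):

```python
def format_example(question: str, answer: str) -> str:
    # one consistent template so the model learns where answers begin
    return f"Question: {question}\nAnswer: {answer}"

sample = format_example(
    "Natalia sold 48 clips in April and half as many in May. How many in total?",
    "48 + 24 = 72. The answer is 72.",
)
```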

Phase 4 — GRPO Alignment

The flagship phase. Rollout generation, advantage estimation, the GRPO objective derived and implemented, reward functions for math correctness and format compliance. Full ablation study.
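The core of GRPO's advantage estimation is simple to state: sample a group of G rollouts per prompt, then normalize each rollout's reward against its own group's mean and standard deviation. A minimal sketch (illustrative rewards; the real ones come from the math-correctness and format checks above):

```python
import torch

# G = 4 rollouts for one prompt; 1.0 = correct answer, 0.0 = wrong
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# group-relative advantage: no learned value function needed
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# rollouts above the group mean get positive advantage, below get negative
```

Because the baseline is the group mean rather than a critic network, the advantages always sum to (approximately) zero within a group.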

Phase 5 — GDPO

Multi-objective extension validated against NVlabs/GDPO. Three-way comparison: GRPO-single vs GRPO-multi vs GDPO. Analysis of reward collapse and Pareto frontiers.
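To make the multi-objective setting concrete, here is a generic illustration of normalizing each objective's rewards within the group separately before combining them. This is NOT GDPO's actual aggregation rule (see NVlabs/GDPO for that); it only shows the kind of per-objective signal the three-way comparison manipulates:

```python
import torch

# two reward objectives for the same group of 4 rollouts
correctness = torch.tensor([1.0, 0.0, 1.0, 0.0])
format_ok = torch.tensor([1.0, 1.0, 0.0, 0.0])

def group_norm(r):
    # same group-relative normalization as in GRPO, per objective
    return (r - r.mean()) / (r.std() + 1e-8)

# naive sum of per-objective advantages (GDPO's combination differs)
combined = group_norm(correctness) + group_norm(format_ok)
```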

Follow along

Each phase publishes companion content so you can learn alongside the code:

  • Foundations — deep-dive reference articles on the theory behind each component (attention, positional encoding, transformer architecture, training dynamics)
  • Blog posts — empirical write-ups showing what actually happens at each stage: loss curves, failure modes, ablations, and lessons learned
  • Bits — short observations published as the work progresses

Whether you read the code, the articles, or both — the goal is that you walk away understanding alignment at a level most tutorials never reach.

Constraints
  • Single GPU (Apple Silicon or one A100)
  • Small models (10M–50M parameters)
  • One file per concept, readable top-to-bottom
  • Comprehensive tests for every module
  • NOT a library — a reference implementation
Tech stack

Python · PyTorch · uv · ruff · pytest · GSM8K · TinyStories

More Projects

LoRA and DoRA Implementation

Parameter-efficient fine-tuning from first principles — every matrix decomposition derived and implemented in PyTorch without libraries. Validated against Hugging Face PEFT outputs for correctness.
