Published on Mar 30, 2026 by Vitor Sousa
Personal project 🔨 In Development — Phase 1/5 GitHub →
Current phase: Phase 1 — Transformer Internals
Transformer · Pretrain · SFT · GRPO · GDPO

Why this project

Most people learn alignment from blog posts or paper summaries. That gives you intuition — but not the kind of understanding you get from writing the gradient update yourself, watching a reward signal reshape a model’s outputs, and debugging why your policy just collapsed.

This project is an invitation to follow along. Every phase produces code you can run, foundations articles that explain the theory from first principles, and blog posts that show what actually happened when the code met real data. If you’ve ever wanted to truly understand how RLHF and GRPO work — not just “conceptually” but at the level of tensors and loss curves — this is for you.

"nanoGPT for the alignment era." Karpathy showed you can pretrain in one file. This shows you can align in one file too.

What this project builds

A complete, from-scratch implementation of the modern LLM alignment pipeline — from raw attention to GRPO reward optimization:

Phase 1 — Transformer Internals (current)

Scaled dot-product attention, multi-head attention, positional encoding (sinusoidal + RoPE), layer normalization, RMSNorm, feed-forward networks, and a full decoder-only transformer. Every component has shape tests, numerical tests, and gradient tests.
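To give a flavor of the Phase 1 style, here is a minimal sketch of scaled dot-product attention with the kind of shape test each component carries (illustrative shapes; the actual repo's function names and test suite may differ):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k). Returns the same shape."""
    d_k = q.size(-1)
    # similarity scores, scaled to keep softmax gradients well-behaved
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # rows sum to 1
    return weights @ v

# shape test: attention preserves (batch, heads, seq, d_k)
q = k = v = torch.randn(2, 4, 8, 16)
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (2, 4, 8, 16)
```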

Phase 2 — Pretraining

Training loop with AdamW, cosine schedule, gradient clipping, mixed precision. Train a 50M parameter model on TinyStories. Loss curves, gradient norms, and generated text logged at every checkpoint.
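The training-loop pieces above fit together roughly like this sketch (a tiny stand-in model and made-up hyperparameters, not the actual Phase 2 config; mixed precision via `torch.amp` is omitted for brevity):

```python
import math
import torch

model = torch.nn.Linear(32, 32)  # stand-in for the 50M-param transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step, warmup=100, total=1000):
    # linear warmup, then cosine decay toward zero
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

x = torch.randn(8, 32)
loss = model(x).pow(2).mean()  # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
sched.step()
opt.zero_grad()
```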

Phase 3 — Supervised Fine-Tuning

SFT on GSM8K math reasoning. Instruction formatting, learning rate sweeps, baseline accuracy established for RLVR comparison.
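"Instruction formatting" here just means rendering each GSM8K (question, answer) pair into a consistent prompt template before tokenization. A hypothetical template (the actual Phase 3 format may differ):

```python
def format_example(question: str, answer: str) -> str:
    # one consistent template so the model learns where answers begin
    return f"Question: {question}\nAnswer: {answer}"

sample = format_example(
    "Natalia sold 48 clips in April and half as many in May. How many in total?",
    "48 + 24 = 72. The answer is 72.",
)
```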

Phase 4 — GRPO Alignment

The flagship phase. Rollout generation, advantage estimation, the GRPO objective derived and implemented, reward functions for math correctness and format compliance. Full ablation study.
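The core of GRPO's advantage estimation is simple to state: sample a group of G rollouts per prompt, then normalize each rollout's reward against its own group's mean and standard deviation. A minimal sketch (illustrative rewards; the real ones come from the math-correctness and format checks above):

```python
import torch

# G = 4 rollouts for one prompt; 1.0 = correct answer, 0.0 = wrong
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# group-relative advantage: no learned value function needed
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# rollouts above the group mean get positive advantage, below get negative
```

Because the baseline is the group mean rather than a critic network, the advantages always sum to (approximately) zero within a group.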

Phase 5 — GDPO

Multi-objective extension validated against NVlabs/GDPO. Three-way comparison: GRPO-single vs GRPO-multi vs GDPO. Analysis of reward collapse and Pareto frontiers.
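To make the multi-objective setting concrete, here is a generic illustration of normalizing each objective's rewards within the group separately before combining them. This is NOT GDPO's actual aggregation rule (see NVlabs/GDPO for that); it only shows the kind of per-objective signal the three-way comparison manipulates:

```python
import torch

# two reward objectives for the same group of 4 rollouts
correctness = torch.tensor([1.0, 0.0, 1.0, 0.0])
format_ok = torch.tensor([1.0, 1.0, 0.0, 0.0])

def group_norm(r):
    # same group-relative normalization as in GRPO, per objective
    return (r - r.mean()) / (r.std() + 1e-8)

# naive sum of per-objective advantages (GDPO's combination differs)
combined = group_norm(correctness) + group_norm(format_ok)
```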

Follow along

Each phase publishes companion content so you can learn alongside the code:

  • Foundations — deep-dive reference articles on the theory behind each component (attention, positional encoding, transformer architecture, training dynamics)
  • Blog posts — empirical write-ups showing what actually happens at each stage: loss curves, failure modes, ablations, and lessons learned
  • Bits — short observations published as the work progresses

Whether you read the code, the articles, or both — the goal is that you walk away understanding alignment at a level most tutorials never reach.

Constraints
  • Single GPU (Apple Silicon or one A100)
  • Small models (10M–50M parameters)
  • One file per concept, readable top-to-bottom
  • Comprehensive tests for every module
  • NOT a library — a reference implementation
Tech stack

Python · PyTorch · uv · ruff · pytest · GSM8K · TinyStories

More Projects

LoRA and DoRA Implementation

Parameter-efficient fine-tuning from first principles — every matrix decomposition derived and implemented in PyTorch without libraries. Validated against Hugging Face PEFT outputs for correctness.
