Article Series

Deep-dive series that build understanding across multiple articles, from foundations to production.

Policy Optimization for LLMs: From Fundamentals to Production

A complete series on policy optimization for aligning language models with reinforcement learning, from PPO fundamentals to GRPO and GDPO.

4 parts · ~120 min total reading time

A 5-part journey from decision frameworks and regret theory through algorithm implementations to production deployment of contextual bandit systems.

5 parts · ~113 min total reading time