Articles tagged reinforcement-learning

Comparison diagram showing PPO with value network versus GRPO with group-based advantage estimation

GRPO: Eliminating the Value Network

Group Relative Policy Optimization replaces PPO's learned value function with a simple insight: sample multiple outputs per prompt and use their relative rewards within the group as advantages. The result is roughly 33% memory savings, a simpler implementation, and the algorithm powering DeepSeek-R1.
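The group-relative trick the article describes can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the article; the function name and the example rewards are made up here:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate: standardize each sampled output's
    reward against the mean and std of its own group, so no learned
    value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for one prompt, scored by a reward model
# (scores are hypothetical):
adv = group_relative_advantages([0.1, 0.7, 0.4, 0.9])
```

Because the advantages are centered within each group, they sum to zero per prompt: above-average completions are reinforced and below-average ones are penalized, with no baseline model in the loop.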

Series
Policy Optimization for LLMs: From Fundamentals to Production, Part 3

~32 min

Read article
Diagram showing PPO four-model architecture for LLM training

PPO for Language Models: The RLHF Workhorse

Deep dive into Proximal Policy Optimization, the algorithm behind most LLM alignment. Understand trust regions, the clipped objective, generalized advantage estimation (GAE), and why PPO's four-model architecture creates problems at scale.
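The clipped objective mentioned in the blurb can be sketched directly from its standard form. A minimal NumPy version, with illustrative names and the common default of eps = 0.2 (an assumption here, not a value taken from the article):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate objective: cap how far the probability
    ratio can push the update, approximating a trust region without
    an explicit KL constraint."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_new - logp_old)          # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise minimum is pessimistic: the ratio only
    # helps the objective up to the clip boundary.
    return np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; once the ratio leaves the [1 - eps, 1 + eps] band, the gradient through the clipped branch vanishes, which is what keeps updates small.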

Series
Policy Optimization for LLMs: From Fundamentals to Production, Part 2

~28 min

Read article