Articles tagged reinforcement-learning

Comparison diagram showing PPO with value network versus GRPO with group-based advantage estimation

GRPO: Eliminating the Value Network

Group Relative Policy Optimization replaces PPO's learned value function with a simple insight: sample multiple outputs per prompt and use their relative rewards within the group as advantages. The result is roughly 33% memory savings, a simpler implementation, and the algorithm powering DeepSeek-R1.
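The group-relative trick the article describes can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the article; the function name and the example rewards are made up here:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate: standardize each sampled output's
    reward against the mean and std of its own group, so no learned
    value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for one prompt, scored by a reward model
# (scores are hypothetical):
adv = group_relative_advantages([0.1, 0.7, 0.4, 0.9])
```

Because the advantages are centered within each group, they sum to zero per prompt: above-average completions are reinforced and below-average ones are penalized, with no baseline model in the loop.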

Series
Policy Optimization for LLMs: From Fundamentals to Production, Part 3

~32 min

Read article
Diagram showing PPO four-model architecture for LLM training

PPO for Language Models: The RLHF Workhorse

Deep dive into Proximal Policy Optimization, the algorithm behind most LLM alignment. Understand trust regions, the clipped objective, generalized advantage estimation (GAE), and why PPO's four-model architecture creates problems at scale.
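The clipped objective mentioned in the blurb can be sketched directly from its standard form. A minimal NumPy version, with illustrative names and the common default of eps = 0.2 (an assumption here, not a value taken from the article):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate objective: cap how far the probability
    ratio can push the update, approximating a trust region without
    an explicit KL constraint."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_new - logp_old)          # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise minimum is pessimistic: the ratio only
    # helps the objective up to the clip boundary.
    return np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; once the ratio leaves the [1 - eps, 1 + eps] band, the gradient through the clipped branch vanishes, which is what keeps updates small.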

Series
Policy Optimization for LLMs: From Fundamentals to Production, Part 2

~28 min

Read article