~32 min By Vitor Sousa

GRPO: Eliminating the Value Network

Part 3 of 4: The Memory-Efficient Alternative

TL;DR: GRPO (Group Relative Policy Optimization) replaces PPO’s learned value function with a simple insight: sample G outputs per prompt and use their normalized relative rewards as advantages. This eliminates ~33% of model memory overhead while maintaining (or exceeding) PPO’s performance. The algorithm powers DeepSeekMath’s 51.7% MATH benchmark score and later became central to DeepSeek-R1’s remarkable reasoning capabilities.

Reading time: ~32 minutes

Prerequisites: Part 2: PPO for Language Models covers the clipped objective and GAE that GRPO builds upon.


The Memory Problem Revisited

In Part 2, we saw PPO’s four-model architecture:

| Model | Purpose | Memory (7B) |
| --- | --- | --- |
| Policy π_θ | Generate responses | ~28 GB |
| Reference π_ref | KL anchor | ~14 GB |
| Value V_ψ | Advantage estimation | ~28 GB |
| Reward r_φ | Score responses | ~14 GB |
| **Total** | | ~84 GB |

The value network accounts for one-third of memory, and it’s there for a single purpose: estimating advantages via GAE.

But what if we didn’t need to learn V(s)? What if we could estimate advantages directly from samples?

This is GRPO’s insight: instead of learning “how good is this state?”, just sample multiple completions for the same prompt and compare them. The mean reward becomes your baseline. No neural network required.


Table of Contents

  1. The Core Insight: Group-Relative Advantages
  2. Mathematical Formulation
  3. Why This Works
  4. Outcome vs. Process Supervision
  5. Decoupled KL Regularization
  6. Iterative GRPO
  7. PyTorch Implementation
  8. DeepSeek-R1: GRPO at Scale
  9. A Unified View: GRPO, PPO, DPO, and RFT
  10. Limitations and When to Use Alternatives
  11. Key Takeaways

The Core Insight: Group-Relative Advantages

The Algorithm

For each prompt q, GRPO:

  1. Sample a group of G outputs {o_1, o_2, …, o_G} from the current policy
  2. Score each output with the reward model: r_1, r_2, …, r_G
  3. Normalize rewards using group statistics
  4. Use normalized rewards as advantages for all tokens in each output

The advantage for output o_i is:

\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

where \mathbf{r} = \{r_1, \ldots, r_G\} are the rewards for all outputs in the group.

That’s it. No learned value function. No GAE computation. Just sample, score, normalize.
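
To make the computation concrete, here is a minimal sketch with made-up rewards for a group of G = 8 outputs (the numbers are illustrative, not from the paper):

import torch

# Hypothetical rewards for G = 8 outputs sampled for the same prompt
rewards = torch.tensor([0.1, 0.4, 0.9, 0.3, 0.7, 0.2, 0.8, 0.6])

# Group-relative advantages: subtract the group mean, divide by the group std
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages)         # outputs above the group mean get positive advantages
print(advantages.mean())  # ~0: the group mean acts as the baseline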

Visual Comparison

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'12px', 'fontFamily':'monospace' } }}%% graph TB subgraph PPO["<b>PPO: Learned Value Function</b>"] direction TB P_prompt["Prompt q"] --> P_policy["Policy π_θ"] P_policy --> P_output["Output o"] P_output --> P_reward["Reward r"] P_output --> P_value["<b>Value Network V_ψ</b><br/>━━━━━━━━━━<br/>Same size as policy<br/>Learns V(s) for all states<br/><i>~28 GB memory</i>"] P_value --> P_gae["GAE Computation<br/>━━━━━━━━━━<br/>δ_t = r + γV(s') − V(s)<br/>Â = Σ (γλ)^k δ_{t+k}"] P_reward --> P_gae P_gae --> P_adv["Advantage Â_t<br/><i>per token</i>"] end subgraph GRPO["<b>GRPO: Group Statistics</b>"] direction TB G_prompt["Prompt q"] --> G_policy["Policy π_θ"] G_policy --> G_o1["o₁ → r₁"] G_policy --> G_o2["o₂ → r₂"] G_policy --> G_dots["⋮"] G_policy --> G_oG["o_G → r_G"] G_o1 --> G_norm["<b>Normalize</b><br/>━━━━━━━━━━<br/>Â_i = (r_i − μ) / σ<br/>No neural network<br/><i>0 GB memory</i>"] G_o2 --> G_norm G_dots --> G_norm G_oG --> G_norm G_norm --> G_adv["Advantage Â_i<br/><i>per output</i>"] end style P_value fill:#1e293b,stroke:#ef4444,color:#fecaca,stroke-width:2.5px style P_gae fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px style G_norm fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style PPO fill:#0f172a,stroke:#ef4444,color:#fecaca style GRPO fill:#0f172a,stroke:#10b981,color:#d1fae5

Memory Savings

| Component | PPO | GRPO |
| --- | --- | --- |
| Policy (training) | ~28 GB | ~28 GB |
| Reference (frozen) | ~14 GB | ~14 GB |
| Value (training) | ~28 GB | 0 GB |
| Reward (inference) | ~14 GB | ~14 GB |
| **Total** | ~84 GB | ~56 GB |

A 33% reduction in model memory. For large models, this often makes the difference between fitting on available hardware and needing more GPUs.

Key Insight: GRPO trades compute (sampling G outputs per prompt) for memory (no value network). For LLMs where memory is the bottleneck, this is an excellent trade.
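
As a back-of-the-envelope check, the totals above follow directly from the per-model figures (rough estimates for 7B-scale models, ignoring activations, optimizer state, and KV cache):

# Rough per-model memory from the table above (illustrative GB figures, not measurements)
ppo_models  = {"policy": 28, "reference": 14, "value": 28, "reward": 14}
grpo_models = {"policy": 28, "reference": 14, "reward": 14}  # no value network

ppo_total, grpo_total = sum(ppo_models.values()), sum(grpo_models.values())
print(f"PPO:  {ppo_total} GB")                       # 84 GB
print(f"GRPO: {grpo_total} GB")                      # 56 GB
print(f"Savings: {1 - grpo_total / ppo_total:.0%}")  # 33%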


Mathematical Formulation

Let’s formalize GRPO precisely.

The GRPO Objective

GRPO maximizes:

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \mathcal{L}_{i,t} - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\right]

The Per-Token Loss

Like PPO, GRPO uses the clipped surrogate objective:

\mathcal{L}_{i,t} = \min\left[\rho_{i,t}\, \hat{A}_{i,t},\ \text{clip}(\rho_{i,t},\ 1-\varepsilon,\ 1+\varepsilon)\, \hat{A}_{i,t}\right]

where the importance ratio is:

\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

The clipping mechanism is identical to PPO; what changes is how we compute Â_{i,t}.

Group-Relative Advantage Estimation

For outcome supervision (the standard case), all tokens in output o_i share the same advantage:

\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

where:

  • r_i = r_\phi(q, o_i) is the reward for output o_i
  • \mathbf{r} = \{r_1, r_2, \ldots, r_G\} are the rewards for all G outputs
  • \text{mean}(\mathbf{r}) = \frac{1}{G}\sum_{i=1}^G r_i
  • \text{std}(\mathbf{r}) = \sqrt{\frac{1}{G}\sum_{i=1}^G \left(r_i - \text{mean}(\mathbf{r})\right)^2}

Properties of Group-Relative Advantages

1. Zero-mean (approximately): Within each group, advantages sum to approximately zero: \sum_{i=1}^G \tilde{r}_i \approx 0

This provides balanced training signal: reinforcing good outputs requires suppressing worse ones.

2. Unit variance: By dividing by the group standard deviation, advantages are normalized: \text{Var}(\tilde{r}) = 1

This makes gradients comparable across prompts with different reward scales.

3. Relative, not absolute: The advantage measures performance relative to the group, not absolute quality. An output with reward 0.5 could have positive advantage (if others scored 0.3) or negative (if others scored 0.7).
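
A quick sketch illustrates the third property: the same absolute reward of 0.5 yields a positive advantage in a weak group and a negative one in a strong group (toy numbers, illustrative only):

import torch

def group_adv(rewards):
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-8)

# The first output scores 0.5 in both cases; only its group changes
weak_group   = group_adv([0.5, 0.3, 0.3, 0.3])   # 0.5 is the best  -> positive advantage
strong_group = group_adv([0.5, 0.7, 0.7, 0.7])   # 0.5 is the worst -> negative advantage

print(weak_group[0].item() > 0, strong_group[0].item() < 0)  # True True
print(weak_group.sum().item())  # ~0: advantages sum to zero within each group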


Why This Works

The group-relative approach succeeds for several reasons:

1. Natural Baseline

The mean reward mean(r) estimates the expected reward under the current policy for this prompt:

\mathbb{E}_{o \sim \pi_\theta(O \mid q)}[r(q, o)] \approx \frac{1}{G}\sum_{i=1}^G r_i

With G = 64 samples (as in DeepSeekMath), this estimate is reasonably accurate. This is exactly what we want from a baseline.

2. Matched to Reward Model Training

Reward models are typically trained on comparisons between outputs for the same prompt:

  • “Output A is better than output B for prompt q”

GRPO’s group-relative advantages align with this comparative signal. We’re asking: “Among these G outputs, which are better?”

3. Automatic Scale Normalization

Different prompts may have different “difficulty,” leading to different reward distributions:

  • Easy prompt: All outputs score 0.8-0.9
  • Hard prompt: All outputs score 0.2-0.4

Dividing by std normalizes these, preventing easy prompts from dominating gradients.

4. No Bootstrapping Bias

GAE uses V(s_{t+1}) estimates, introducing bias if the value function is inaccurate. GRPO uses only observed rewards—no bootstrapping, no bias from learned estimates.

5. Simplicity

The entire advantage computation is:

advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

No recursive GAE computation, no value function forward passes, no TD errors.

Key Insight: The group mean is a Monte Carlo estimate of V(s_0) for this prompt. By sampling enough outputs, we get a good baseline without learning anything.


Outcome vs. Process Supervision

GRPO supports two supervision modes.

Outcome Supervision (Default)

A single reward at sequence end:

\hat{A}_{i,t} = \tilde{r}_i \quad \forall\, t \in \{1, \ldots, |o_i|\}

All tokens receive the same advantage—the normalized final reward.

Pros:

  • Simple: Works with any output-level reward model
  • Stable: Single reward signal, no step-level noise

Cons:

  • Coarse credit assignment: Every token gets equal “blame” or “credit”

Process Supervision

Rewards at each reasoning step (e.g., separated by newlines):

Let idx(j) be the token index at the end of step j, and K_i the number of steps in output o_i.

Step 1: Normalize per-step rewards

\tilde{r}_i^{\text{idx}(j)} = \frac{r_i^{\text{idx}(j)} - \text{mean}(\mathbf{R})}{\text{std}(\mathbf{R})}

where R contains all step rewards across all outputs in the group.

Step 2: Compute return-to-go

\hat{A}_{i,t} = \sum_{\text{idx}(j) \geq t} \tilde{r}_i^{\text{idx}(j)}

The advantage at token t is the sum of the normalized rewards of the current and all subsequent steps (those whose end index is at or after t).

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph LR subgraph Process["<b>Process Supervision</b>"] direction LR S1["Step 1<br/>r̃₁ = 0.5<br/>━━━━━<br/>Â = 1.4"] --> S2["Step 2<br/>r̃₂ = −0.2<br/>━━━━━<br/>Â = 0.9"] S2 --> S3["Step 3<br/>r̃₃ = 0.8<br/>━━━━━<br/>Â = 1.1"] S3 --> S4["Step 4<br/>r̃₄ = 0.3<br/>━━━━━<br/>Â = 0.3"] end subgraph Outcome["<b>Outcome Supervision</b>"] direction LR T1["Step 1<br/>━━━━━<br/>Â = 0.7"] --> T2["Step 2<br/>━━━━━<br/>Â = 0.7"] T2 --> T3["Step 3<br/>━━━━━<br/>Â = 0.7"] T3 --> T4["Step 4<br/>r = 0.7<br/>━━━━━<br/>Â = 0.7"] end style S1 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S2 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S3 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S4 fill:#1e293b,stroke:#10b981,color:#d1fae5 style T1 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T2 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T3 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T4 fill:#1e293b,stroke:#06b6d4,color:#cffafe style Process fill:#0f172a,stroke:#10b981,color:#d1fae5 style Outcome fill:#0f172a,stroke:#06b6d4,color:#cffafe
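
Here is a minimal sketch of the two modes for a single output, using the same made-up step rewards as the diagram above (already normalized across the group) and hypothetical step lengths:

import torch

# Hypothetical normalized rewards for the 4 reasoning steps of one output
step_rewards = torch.tensor([0.5, -0.2, 0.8, 0.3])

# Process supervision: return-to-go per step (sum of rewards from this step onward)
step_advantages = torch.flip(torch.cumsum(torch.flip(step_rewards, [0]), 0), [0])
print(step_advantages)  # tensor([1.4000, 0.9000, 1.1000, 0.3000])

# Broadcast each step's advantage to the tokens belonging to that step
tokens_per_step = torch.tensor([5, 7, 6, 4])   # hypothetical step lengths in tokens
token_advantages = torch.repeat_interleave(step_advantages, tokens_per_step)

# Outcome supervision instead assigns one normalized final reward to every token
outcome_advantages = torch.full((int(tokens_per_step.sum()),), 0.7)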

Which to Choose?

| Situation | Recommendation |
| --- | --- |
| Standard RLHF | Outcome supervision |
| Multi-step reasoning (math, code) | Process supervision if a PRM is available |
| No process reward model | Outcome supervision |
| Training stability concerns | Outcome supervision |

DeepSeekMath experiments show process supervision slightly outperforms outcome supervision for math, but the difference is modest. Unless you have a high-quality Process Reward Model (PRM), outcome supervision is the simpler default.


Decoupled KL Regularization

GRPO handles KL penalty differently from standard PPO.

PPO: KL in the Reward

Standard PPO-RLHF adds KL to the per-token reward:

r_t^{\text{total}} = r_t - \beta \log\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{ref}}(o_t \mid q, o_{<t})}

This KL penalty flows into GAE computation, coupling reward signal with regularization.

GRPO: KL in the Loss

GRPO adds KL directly to the loss, decoupled from advantages:

\mathcal{J}_{\text{GRPO}} = \mathcal{L}^{\text{CLIP}} - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]

The Schulman KL Estimator

GRPO uses an unbiased KL estimator:

\mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] = \frac{\pi_{\text{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - \log\frac{\pi_{\text{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - 1

Properties:

  • Non-negative: Always ≥ 0
  • Zero at equality: Exactly 0 when πθ=πref\pi_\theta = \pi_{\text{ref}}
  • Unbiased: Correct expectation under sampling
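
A quick numerical check on a toy 3-token vocabulary shows that averaging this estimator over samples from π_θ recovers the exact KL (illustrative only):

import torch

torch.manual_seed(0)
p_theta = torch.tensor([0.5, 0.3, 0.2])   # toy current policy
p_ref   = torch.tensor([0.4, 0.4, 0.2])   # toy reference policy

exact_kl = (p_theta * (p_theta / p_ref).log()).sum()   # KL(π_θ || π_ref)

# Monte Carlo estimate with the estimator above, sampling tokens from π_θ
samples = torch.multinomial(p_theta, 100_000, replacement=True)
ratio = (p_ref / p_theta)[samples]          # π_ref / π_θ at each sampled token
k3 = ratio - ratio.log() - 1                # non-negative for every sample

print(exact_kl.item(), k3.mean().item())    # the two values should be close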

Why Decouple?

1. Cleaner advantage signal: Advantages reflect only task performance, not regularization.

2. Independent tuning: The KL coefficient β doesn’t interact with advantage normalization.

3. Simpler implementation: No need to track per-token KL during advantage computation.

Key Insight: Decoupling KL from rewards simplifies the algorithm. The policy gradient focuses on task performance; the KL term separately ensures we don’t drift too far from the reference.


Iterative GRPO

As training progresses, the reward model may become a poor discriminator for the improved policy’s outputs. Iterative GRPO addresses this.

The Staleness Problem

  1. Initially: Policy outputs vary in quality; reward model discriminates well
  2. After training: Policy consistently produces good outputs; reward model’s training distribution no longer matches
  3. Result: Degraded training signal

The Solution: Co-evolve Policy and Reward Model

Algorithm: Iterative GRPO

For iteration = 1, ..., I:
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 1: Reset Reference                            │
    └─────────────────────────────────────────────────────┘
    π_ref ← π_θ   (current policy becomes new reference)
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 2: Policy Training                            │
    └─────────────────────────────────────────────────────┘
    For step = 1, ..., M:
        Sample batch of prompts
        Sample G outputs per prompt from π_θ
        Compute rewards using r_φ
        Compute group-relative advantages
        Update π_θ using GRPO objective
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 3: Reward Model Update                        │
    └─────────────────────────────────────────────────────┘
    Generate comparison data from improved π_θ
    Continually train r_φ with:
        - 90% new data (from current policy)
        - 10% historical data (replay buffer)

Key Design Choices

Reference reset: Each iteration starts fresh, allowing larger cumulative policy drift while constraining per-iteration change.

Replay buffer: 10% historical data prevents the reward model from forgetting how to score earlier, easier outputs.

Single GRPO update: DeepSeekMath uses one gradient step per sampled batch (μ = 1), prioritizing fresh samples over reuse.
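
A minimal sketch of the outer loop follows; the three callables (train_grpo_steps, generate_comparisons, update_reward_model) are hypothetical hooks supplied by the caller, not APIs defined in this post:

import copy
import random

def iterative_grpo(policy, reward_model, prompts, train_grpo_steps,
                   generate_comparisons, update_reward_model, iterations=3):
    """Sketch of the iterative GRPO outer loop (hypothetical helper callables)."""
    replay_buffer = []
    for _ in range(iterations):
        # Phase 1: the current policy becomes the new KL reference
        reference = copy.deepcopy(policy)

        # Phase 2: standard GRPO training against the (frozen) reward model
        train_grpo_steps(policy, reference, reward_model, prompts)

        # Phase 3: refresh the reward model with ~90% new comparisons, ~10% replay
        new_data = generate_comparisons(policy, prompts)
        replay = random.sample(replay_buffer, min(len(replay_buffer), len(new_data) // 9))
        update_reward_model(reward_model, new_data + replay)
        replay_buffer.extend(new_data)
    return policy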


PyTorch Implementation

Here’s a complete, educational implementation of GRPO:

"""
Group Relative Policy Optimization (GRPO)
=========================================

Memory-efficient alternative to PPO that eliminates the value network
by computing advantages from groups of sampled outputs.

Reference: DeepSeekMath (Shao et al., 2024)
"""

import torch
import torch.nn.functional as F
from torch import Tensor
from dataclasses import dataclass
from typing import Tuple, List


@dataclass
class GRPOConfig:
    """GRPO hyperparameters."""
    
    # Core settings
    group_size: int = 64           # G: outputs per prompt
    clip_epsilon: float = 0.2      # ε: PPO clipping
    kl_coef: float = 0.04          # β: KL penalty coefficient
    
    # Generation
    max_length: int = 1024
    temperature: float = 1.0
    
    # Training
    learning_rate: float = 1e-6
    max_grad_norm: float = 1.0
    
    # Numerical stability
    eps: float = 1e-8


def compute_group_advantages(rewards: Tensor, eps: float = 1e-8) -> Tensor:
    """
    Compute group-relative advantages.
    
    This is GRPO's key innovation: use group statistics
    instead of a learned value function.
    
    Args:
        rewards: [G] rewards for each output in the group
        eps: Small constant for numerical stability
        
    Returns:
        advantages: [G] normalized advantages
    """
    mean_r = rewards.mean()
    std_r = rewards.std() + eps
    advantages = (rewards - mean_r) / std_r
    return advantages


def compute_schulman_kl(
    policy_log_probs: Tensor,
    reference_log_probs: Tensor,
) -> Tensor:
    """
    Compute Schulman's unbiased KL estimator.
    
    D_KL = π_ref/π_θ - log(π_ref/π_θ) - 1
    
    This estimator is:
    - Non-negative (always >= 0)
    - Unbiased (correct expectation)
    - Zero when π_θ = π_ref
    
    Args:
        policy_log_probs: [batch, seq_len] from π_θ
        reference_log_probs: [batch, seq_len] from π_ref
        
    Returns:
        kl: [batch, seq_len] per-token KL
    """
    log_ratio = reference_log_probs - policy_log_probs  # log(π_ref/π_θ)
    kl = torch.exp(log_ratio) - log_ratio - 1
    return kl


def compute_grpo_loss(
    policy_log_probs: Tensor,
    old_log_probs: Tensor,
    reference_log_probs: Tensor,
    advantages: Tensor,
    response_mask: Tensor,
    clip_epsilon: float = 0.2,
    kl_coef: float = 0.04,
) -> Tuple[Tensor, dict]:
    """
    Compute GRPO loss with clipping and KL penalty.
    
    Args:
        policy_log_probs: [G, seq_len] current policy
        old_log_probs: [G, seq_len] policy at sampling time
        reference_log_probs: [G, seq_len] reference policy
        advantages: [G] per-output advantages (will broadcast)
        response_mask: [G, seq_len] valid token mask
        clip_epsilon: PPO clipping parameter
        kl_coef: KL penalty coefficient
        
    Returns:
        loss: Scalar loss to minimize
        metrics: Dictionary of training metrics
    """
    G, seq_len = policy_log_probs.shape
    
    # =========================================
    # 1. Importance ratio: ρ = π_θ / π_old
    # =========================================
    log_ratio = policy_log_probs - old_log_probs
    ratio = torch.exp(log_ratio)  # [G, seq_len]
    
    # =========================================
    # 2. Expand advantages to token level
    # =========================================
    # [G] -> [G, seq_len]
    token_advantages = advantages.unsqueeze(1).expand(-1, seq_len)
    
    # =========================================
    # 3. PPO clipped objective
    # =========================================
    clipped_ratio = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
    
    policy_loss_1 = ratio * token_advantages
    policy_loss_2 = clipped_ratio * token_advantages
    policy_loss = torch.min(policy_loss_1, policy_loss_2)
    
    # =========================================
    # 4. KL penalty (decoupled from reward)
    # =========================================
    kl_penalty = compute_schulman_kl(policy_log_probs, reference_log_probs)
    
    # =========================================
    # 5. Combine losses
    # =========================================
    # Maximize policy objective, minimize KL
    token_loss = -policy_loss + kl_coef * kl_penalty
    
    # Mask and average
    masked_loss = token_loss * response_mask
    per_output_loss = masked_loss.sum(dim=1) / response_mask.sum(dim=1).clamp(min=1)
    loss = per_output_loss.mean()
    
    # =========================================
    # 6. Metrics
    # =========================================
    with torch.no_grad():
        clip_fraction = ((ratio - 1).abs() > clip_epsilon).float()
        clip_fraction = (clip_fraction * response_mask).sum() / response_mask.sum()
        
        mean_kl = (kl_penalty * response_mask).sum() / response_mask.sum()
    
    metrics = {
        "loss": loss.item(),
        "clip_fraction": clip_fraction.item(),
        "mean_kl": mean_kl.item(),
        "mean_advantage": advantages.mean().item(),
        "advantage_std": advantages.std().item(),
    }
    
    return loss, metrics


class GRPOTrainer:
    """
    GRPO trainer for LLM fine-tuning.
    
    Unlike PPO, requires only THREE models (no value network):
    - policy_model: The LLM being trained
    - reference_model: Frozen copy for KL penalty
    - reward_model: Scores complete responses
    
    Memory savings: ~33% compared to PPO
    """
    
    def __init__(
        self,
        policy_model: torch.nn.Module,
        reference_model: torch.nn.Module,
        reward_model,  # Can be a function or model
        tokenizer,
        config: GRPOConfig,
    ):
        self.policy = policy_model
        self.reference = reference_model
        self.reward_fn = reward_model
        self.tokenizer = tokenizer
        self.config = config
        
        # Freeze reference model
        for param in self.reference.parameters():
            param.requires_grad = False
        
        # Single optimizer (no value network!)
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config.learning_rate,
            betas=(0.9, 0.95),
            weight_decay=0.1
        )
    
    @torch.no_grad()
    def sample_group(
        self,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> Tuple[Tensor, Tensor]:
        """
        Sample G outputs for a single prompt.
        
        Args:
            prompt_ids: [1, prompt_len] token ids
            attention_mask: [1, prompt_len]
            
        Returns:
            output_ids: [G, total_len] generated sequences
            output_masks: [G, total_len] attention masks
        """
        G = self.config.group_size
        device = prompt_ids.device
        
        # Expand prompt for batch generation
        prompt_ids = prompt_ids.expand(G, -1)
        attention_mask = attention_mask.expand(G, -1)
        
        # Sample from policy
        self.policy.eval()
        outputs = self.policy.generate(
            input_ids=prompt_ids,
            attention_mask=attention_mask,
            max_new_tokens=self.config.max_length,
            do_sample=True,
            temperature=self.config.temperature,
            pad_token_id=self.tokenizer.pad_token_id,
            return_dict_in_generate=True,
        )
        
        output_ids = outputs.sequences
        output_masks = (output_ids != self.tokenizer.pad_token_id).long()
        
        return output_ids, output_masks
    
    @torch.no_grad()
    def compute_rewards(
        self,
        prompt_text: str,
        output_ids: Tensor,
        prompt_len: int,
    ) -> Tensor:
        """
        Compute rewards for each output in the group.
        
        Args:
            prompt_text: The original prompt string
            output_ids: [G, total_len] generated sequences
            prompt_len: Length of prompt in tokens
            
        Returns:
            rewards: [G] scalar reward per output
        """
        G = output_ids.shape[0]
        rewards = []
        
        for i in range(G):
            # Extract response tokens
            response_ids = output_ids[i, prompt_len:]
            response_text = self.tokenizer.decode(
                response_ids, skip_special_tokens=True
            )
            
            # Score with reward model
            reward = self.reward_fn(prompt_text, response_text)
            rewards.append(reward)
        
        return torch.tensor(rewards, device=output_ids.device)
    
    def compute_log_probs(
        self,
        model: torch.nn.Module,
        input_ids: Tensor,
        attention_mask: Tensor,
        response_start_idx: int,
    ) -> Tensor:
        """
        Compute per-token log probabilities for response tokens.
        
        Args:
            model: Language model
            input_ids: [batch, seq_len] full sequences
            attention_mask: [batch, seq_len]
            response_start_idx: Where response tokens begin
            
        Returns:
            log_probs: [batch, response_len] per-token log probs
        """
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True,
        )
        logits = outputs.logits  # [batch, seq_len, vocab]
        
        # Shift for next-token prediction
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        
        # Log probabilities
        log_probs = F.log_softmax(shift_logits, dim=-1)
        
        # Gather log probs for actual tokens
        token_log_probs = log_probs.gather(
            dim=-1,
            index=shift_labels.unsqueeze(-1)
        ).squeeze(-1)
        
        # Extract response tokens only
        response_log_probs = token_log_probs[:, response_start_idx - 1:]
        
        return response_log_probs
    
    def train_step(
        self,
        prompt_text: str,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> dict:
        """
        Single GRPO training step.
        
        Args:
            prompt_text: Original prompt string
            prompt_ids: [1, prompt_len] tokenized prompt
            attention_mask: [1, prompt_len]
            
        Returns:
            metrics: Training metrics dictionary
        """
        prompt_len = prompt_ids.shape[1]
        device = prompt_ids.device
        
        # =============================================
        # Step 1: Sample group of outputs
        # =============================================
        output_ids, output_masks = self.sample_group(prompt_ids, attention_mask)
        G = output_ids.shape[0]
        
        # =============================================
        # Step 2: Compute rewards
        # =============================================
        rewards = self.compute_rewards(prompt_text, output_ids, prompt_len)
        
        # =============================================
        # Step 3: Compute group-relative advantages
        # =============================================
        advantages = compute_group_advantages(rewards, self.config.eps)
        
        # =============================================
        # Step 4: Get log probs (old = current pre-update)
        # =============================================
        with torch.no_grad():
            old_log_probs = self.compute_log_probs(
                self.policy, output_ids, output_masks, prompt_len
            )
            reference_log_probs = self.compute_log_probs(
                self.reference, output_ids, output_masks, prompt_len
            )
        
        # =============================================
        # Step 5: Policy gradient update
        # =============================================
        self.policy.train()
        
        policy_log_probs = self.compute_log_probs(
            self.policy, output_ids, output_masks, prompt_len
        )
        
        # Response mask
        response_mask = output_masks[:, prompt_len:]
        
        loss, metrics = compute_grpo_loss(
            policy_log_probs,
            old_log_probs,
            reference_log_probs,
            advantages,
            response_mask,
            self.config.clip_epsilon,
            self.config.kl_coef,
        )
        
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            self.policy.parameters(), 
            self.config.max_grad_norm
        )
        self.optimizer.step()
        
        # Add reward metrics
        metrics["mean_reward"] = rewards.mean().item()
        metrics["reward_std"] = rewards.std().item()
        
        return metrics

Implementation Notes

Group size selection: DeepSeekMath uses G = 64. Larger groups give more stable advantage estimates. A reasonable range is 16-128.

Single update per sample: The implementation does one gradient step per sampled batch. Multiple steps would require re-computing advantages as the policy changes.

KL coefficient: Start with β = 0.04. Lower values allow more divergence; higher values constrain updates.
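
For reference, here is a hypothetical usage sketch of GRPOTrainer with a Hugging Face causal LM; the model name, group size, and reward function are placeholders for a quick smoke test, not recommendations:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # any causal LM works; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)

def reward_fn(prompt: str, response: str) -> float:
    # Placeholder: plug in a reward model or a rule-based checker here
    return float(len(response.strip()) > 0)

config = GRPOConfig(group_size=8, max_length=128)   # small values for testing
trainer = GRPOTrainer(policy, reference, reward_fn, tokenizer, config)

prompt = "Solve: 2x + 5 = 13. Show your reasoning."
enc = tokenizer(prompt, return_tensors="pt")
metrics = trainer.train_step(prompt, enc["input_ids"], enc["attention_mask"])
print(metrics)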


DeepSeek-R1: GRPO at Scale

In January 2025, DeepSeek released DeepSeek-R1, a reasoning model matching OpenAI’s o1 on many benchmarks. GRPO is central to its training.

R1’s Training Pipeline

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph TB subgraph R1Pipeline["<b>DeepSeek-R1 Training Pipeline</b>"] direction TB Stage1["<b>Stage 1: Cold Start</b><br/>━━━━━━━━━━<br/>Small amount of long CoT data<br/>Teaches extended reasoning format"] Stage2["<b>Stage 2: Reasoning RL</b><br/>━━━━━━━━━━<br/>GRPO with rule-based rewards<br/>Math correctness, code execution<br/>Model develops 'thinking' traces"] Stage3["<b>Stage 3: Rejection Sampling + SFT</b><br/>━━━━━━━━━━<br/>Sample from RL checkpoint<br/>Human annotation for quality<br/>SFT on curated reasoning data"] Stage4["<b>Stage 4: Final RL</b><br/>━━━━━━━━━━<br/>Broader GRPO training<br/>Helpfulness + harmlessness<br/>Multi-objective alignment"] Stage1 --> Stage2 Stage2 --> Stage3 Stage3 --> Stage4 end style Stage1 fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style Stage2 fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style Stage3 fill:#1e293b,stroke:#8b5cf6,color:#e9d5ff,stroke-width:2px style Stage4 fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style R1Pipeline fill:none,stroke:#334155,color:#94a3b8,stroke-width:1px

Why GRPO for R1?

1. Long reasoning traces: R1 produces outputs with thousands of tokens. A value network would need to predict final reward from any point in this long sequence—extremely difficult.

2. Verifiable rewards: For math and code, rewards are rule-based (correct/incorrect), so there is no reward-model uncertainty; a sketch of such a reward function follows this list.

3. Memory efficiency: At R1’s scale, GRPO’s memory savings are crucial.

4. Training stability: Simpler architecture means fewer things can go wrong.
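
As an illustration of what “rule-based” means here, the toy sketch below checks a final answer against a known ground truth; real pipelines use far more robust extraction (boxed answers, symbolic equivalence, unit tests for code):

import re

def math_correctness_reward(response: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the text after 'the answer is' matches the
    ground truth after light normalization, else 0.0."""
    match = re.search(r"the answer is\s*(.+)", response, flags=re.IGNORECASE)
    if match is None:
        return 0.0
    normalize = lambda s: re.sub(r"[\s.]+$", "", s).strip().lower()
    return 1.0 if normalize(match.group(1)) == normalize(ground_truth) else 0.0

print(math_correctness_reward("Let me check: 2(4) + 5 = 13. The answer is x = 4.", "x = 4"))  # 1.0
print(math_correctness_reward("So x must equal 5.", "x = 4"))                                 # 0.0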

The “Aha Moment” Phenomenon

The R1 paper describes emergent behavior during GRPO training: the model spontaneously developed “aha moment” patterns—places where it reconsiders and corrects earlier reasoning mistakes.

Example reasoning trace:
"Let me solve this step by step...
 First, I'll set up the equation: 2x + 5 = 13
 Subtracting 5: 2x = 8
 Wait, let me verify that. 13 - 5 = 8. Yes, correct.
 Dividing by 2: x = 4
 
 Let me check: 2(4) + 5 = 8 + 5 = 13 ✓
 
 The answer is x = 4."

This self-correction emerged without explicit supervision—GRPO’s optimization pressure (combined with correctness rewards) naturally discovered it.

R1-Zero: Pure RL Without SFT

DeepSeek also released R1-Zero, trained with GRPO directly from the base model without any SFT stage. Remarkably, it still develops coherent reasoning, though with some readability issues that the full R1 pipeline addresses.

This demonstrates GRPO’s power: given only a correctness signal, the algorithm can induce sophisticated reasoning behavior.

Key Insight: GRPO’s simplicity makes it stable enough for extreme-scale training. DeepSeek-R1 shows that you can train world-class reasoning models without the complexity of PPO’s value network.


A Unified View: GRPO, PPO, DPO, and RFT

The DeepSeekMath paper provides a unified framework for understanding different alignment methods.

The General Policy Gradient

All methods can be viewed through:

\nabla_\theta J = \mathbb{E}_{(q, o) \sim \mathcal{D}}\left[\sum_t \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \cdot w(q, o)\right]

Methods differ in:

  1. Data source D: Where do samples come from?
  2. Gradient coefficient w(q, o): How are outputs weighted?

Method Comparison

| Method | Data Source | Coefficient w | Online? |
| --- | --- | --- | --- |
| SFT | Human demos | w = 1 (uniform) | No |
| RFT | Filtered policy samples | w = 1 (filtered) | No |
| DPO | Preference pairs | w = −βσ′(·) | No |
| Online RFT | Current policy | w = 1 (filtered) | Yes |
| PPO | Current policy | w = A_t (GAE) | Yes |
| GRPO | Current policy | w = r̃ (group-normalized) | Yes |

Key Insights

Online vs. Offline: GRPO, PPO, and Online RFT sample from the current policy and can adapt. DPO and RFT use fixed datasets.

Weighted vs. Filtered: RFT keeps only correct outputs (binary filter). GRPO keeps all outputs but weights by normalized reward (continuous weighting).

Advantage estimation: PPO learns V(s); GRPO uses group statistics; DPO implicitly computes advantages through preferences.
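
The coefficient column of the table can be written out as tiny functions (a toy sketch; the DPO coefficient depends on both responses in a preference pair, so it is omitted here):

import torch

def w_sft(reward, group_rewards):
    return 1.0                            # uniform weight on demonstrations

def w_rft(reward, group_rewards, threshold=1.0):
    return float(reward >= threshold)     # binary filter: keep only correct outputs

def w_grpo(reward, group_rewards, eps=1e-8):
    r = torch.tensor(group_rewards)
    return float((reward - r.mean()) / (r.std() + eps))   # continuous, group-normalized

group = [0.0, 1.0, 1.0, 0.0]   # toy 0/1 correctness rewards for G = 4 outputs
print([w_rft(r, group) for r in group])             # [0.0, 1.0, 1.0, 0.0]
print([round(w_grpo(r, group), 2) for r in group])  # symmetric positive/negative weights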

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph LR subgraph Methods["<b>Alignment Methods: A Unified View</b>"] direction LR SFT["<b>SFT</b><br/>━━━━━<br/>Offline<br/>w = 1"] --> RFT["<b>RFT</b><br/>━━━━━<br/>Offline<br/>w = 1 (filtered)"] RFT --> DPO["<b>DPO</b><br/>━━━━━<br/>Offline<br/>w = −βσ'"] DPO --> PPO["<b>PPO</b><br/>━━━━━<br/>Online<br/>w = GAE"] PPO --> GRPO["<b>GRPO</b><br/>━━━━━<br/>Online<br/>w = (r−μ)/σ"] end style SFT fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style RFT fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style DPO fill:#1e293b,stroke:#8b5cf6,color:#e9d5ff,stroke-width:2px style PPO fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px style GRPO fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style Methods fill:none,stroke:#334155,color:#94a3b8,stroke-width:1px

Limitations and When to Use Alternatives

GRPO Limitations

1. Sampling cost: G outputs per prompt (typically 64). More generation compute than PPO’s single sample.

2. Advantage variance: With small groups, the mean is a noisy baseline; G ≥ 32 is typically needed for stability (see the simulation sketch after this list).

3. Uniform token credit: All tokens get the same advantage. No fine-grained credit assignment without process supervision.

4. Reward model dependency: Like all RL methods, susceptible to reward hacking if the reward model has exploits.

5. No value function benefits: Can’t use value-guided search or intrinsic motivation bonuses.
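
To see why small groups hurt, a quick simulation estimates how noisy the group mean is as a baseline for different G (assuming i.i.d. rewards with standard deviation 0.2; purely illustrative):

import torch

torch.manual_seed(0)
true_mean, reward_std = 0.5, 0.2

for G in [4, 16, 64, 256]:
    # 10,000 simulated groups of size G; how much does the group mean fluctuate?
    group_means = torch.normal(true_mean, reward_std, size=(10_000, G)).mean(dim=1)
    print(f"G={G:>3}: std of the baseline estimate ~ {group_means.std():.4f}")

# The baseline noise shrinks roughly as 1/sqrt(G): quadrupling G halves it.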

When to Use Each Method

| Situation | Recommendation |
| --- | --- |
| Memory-constrained | ✅ GRPO |
| Compute-constrained | Consider PPO |
| Outcome rewards only | ✅ GRPO |
| Dense per-step rewards | Consider PPO (GAE) |
| Offline preferences only | Use DPO |
| No reward model | Use DPO or RFT |
| Value-guided search needed | Use PPO |
| Multi-reward training | Use GDPO (Part 4) |

Key Takeaways

The Core Innovation

\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

Sample multiple outputs, normalize rewards, use as advantages. No value network.

Memory Savings

  • Eliminates ~28GB value network (for 7B model)
  • 33% reduction in model memory
  • Often the difference between 1 GPU and multiple

Design Choices

  • Group size: G = 64 for stability
  • Clip epsilon: ε = 0.2 (standard PPO)
  • KL coefficient: β = 0.04
  • Decoupled KL: In loss, not reward

When GRPO Excels

  • Memory-constrained training
  • Outcome-based rewards
  • Long generations (where value prediction is hard)
  • Verifiable rewards (math, code)

DeepSeek-R1 Connection

GRPO powers R1’s remarkable reasoning through:

  • Rule-based correctness rewards
  • Emergent self-correction (“aha moments”)
  • Stable training at scale

What’s Next

In Part 4: GDPO, we’ll see what happens when GRPO meets multi-reward settings:

  • Why combined reward normalization causes “advantage collapse”
  • GDPO’s solution: decouple normalization per reward
  • Production patterns for tool calling, math, and coding

Further Reading

Original Papers:

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)

Related Work:

Implementations:

Article series

Policy Optimization for LLMs: From Fundamentals to Production

Part 3 of 4

  1. Part 1 Reinforcement Learning Foundations for LLM Alignment
  2. Part 2 PPO for Language Models: The RLHF Workhorse
  3. Part 3 GRPO: Eliminating the Value Network
  4. Part 4 GDPO: Multi-Reward RL Done Right
