~32 min By Vitor Sousa

GRPO: Eliminating the Value Network

Part 3 of 4: The Memory-Efficient Alternative

TL;DR: GRPO (Group Relative Policy Optimization) replaces PPO’s learned value function with a simple insight: sample G outputs per prompt and use their normalized relative rewards as advantages. This eliminates ~33% of model memory overhead while maintaining (or exceeding) PPO’s performance. The algorithm powers DeepSeekMath’s 51.7% MATH benchmark score and later became central to DeepSeek-R1’s remarkable reasoning capabilities.

Reading time: ~32 minutes

Prerequisites: Part 2: PPO for Language Models covers the clipped objective and GAE that GRPO builds upon.


The Memory Problem Revisited

In Part 2, we saw PPO’s four-model architecture:

| Model | Purpose | Memory (7B) |
| --- | --- | --- |
| Policy π_θ | Generate responses | ~28 GB |
| Reference π_ref | KL anchor | ~14 GB |
| Value V_ψ | Advantage estimation | ~28 GB |
| Reward r_φ | Score responses | ~14 GB |
| **Total** | | ~84 GB |

The value network accounts for one-third of memory, and it’s there for a single purpose: estimating advantages via GAE.

But what if we didn’t need to learn V(s)? What if we could estimate advantages directly from samples?

This is GRPO’s insight: instead of learning “how good is this state?”, just sample multiple completions for the same prompt and compare them. The mean reward becomes your baseline. No neural network required.


Table of Contents

  1. The Core Insight: Group-Relative Advantages
  2. Mathematical Formulation
  3. Why This Works
  4. Outcome vs. Process Supervision
  5. Decoupled KL Regularization
  6. Iterative GRPO
  7. PyTorch Implementation
  8. DeepSeek-R1: GRPO at Scale
  9. A Unified View: GRPO, PPO, DPO, and RFT
  10. Limitations and When to Use Alternatives
  11. Key Takeaways

The Core Insight: Group-Relative Advantages

The Algorithm

For each prompt q, GRPO:

  1. Sample a group of G outputs {o_1, o_2, …, o_G} from the current policy
  2. Score each output with the reward model: r_1, r_2, …, r_G
  3. Normalize rewards using group statistics
  4. Use normalized rewards as advantages for all tokens in each output

The advantage for output o_i is:

\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

where \mathbf{r} = \{r_1, \ldots, r_G\} are the rewards for all outputs in the group.

That’s it. No learned value function. No GAE computation. Just sample, score, normalize.
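
To make the computation concrete, here is a minimal sketch with made-up rewards for a group of G = 8 outputs (the numbers are illustrative, not from the paper):

import torch

# Hypothetical rewards for G = 8 outputs sampled for the same prompt
rewards = torch.tensor([0.1, 0.4, 0.9, 0.3, 0.7, 0.2, 0.8, 0.6])

# Group-relative advantages: subtract the group mean, divide by the group std
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages)         # outputs above the group mean get positive advantages
print(advantages.mean())  # ~0: the group mean acts as the baseline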

Visual Comparison

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'12px', 'fontFamily':'monospace' } }}%% graph TB subgraph PPO["<b>PPO: Learned Value Function</b>"] direction TB P_prompt["Prompt q"] --> P_policy["Policy π_θ"] P_policy --> P_output["Output o"] P_output --> P_reward["Reward r"] P_output --> P_value["<b>Value Network V_ψ</b><br/>━━━━━━━━━━<br/>Same size as policy<br/>Learns V(s) for all states<br/><i>~28 GB memory</i>"] P_value --> P_gae["GAE Computation<br/>━━━━━━━━━━<br/>δ_t = r + γV(s') − V(s)<br/>Â = Σ (γλ)^k δ_{t+k}"] P_reward --> P_gae P_gae --> P_adv["Advantage Â_t<br/><i>per token</i>"] end subgraph GRPO["<b>GRPO: Group Statistics</b>"] direction TB G_prompt["Prompt q"] --> G_policy["Policy π_θ"] G_policy --> G_o1["o₁ → r₁"] G_policy --> G_o2["o₂ → r₂"] G_policy --> G_dots["⋮"] G_policy --> G_oG["o_G → r_G"] G_o1 --> G_norm["<b>Normalize</b><br/>━━━━━━━━━━<br/>Â_i = (r_i − μ) / σ<br/>No neural network<br/><i>0 GB memory</i>"] G_o2 --> G_norm G_dots --> G_norm G_oG --> G_norm G_norm --> G_adv["Advantage Â_i<br/><i>per output</i>"] end style P_value fill:#1e293b,stroke:#ef4444,color:#fecaca,stroke-width:2.5px style P_gae fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px style G_norm fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style PPO fill:#0f172a,stroke:#ef4444,color:#fecaca style GRPO fill:#0f172a,stroke:#10b981,color:#d1fae5

Memory Savings

| Component | PPO | GRPO |
| --- | --- | --- |
| Policy (training) | ~28 GB | ~28 GB |
| Reference (frozen) | ~14 GB | ~14 GB |
| Value (training) | ~28 GB | 0 GB |
| Reward (inference) | ~14 GB | ~14 GB |
| **Total** | ~84 GB | ~56 GB |

A 33% reduction in model memory. For large models, this often makes the difference between fitting on available hardware and needing more GPUs.

Key Insight: GRPO trades compute (sampling G outputs per prompt) for memory (no value network). For LLMs where memory is the bottleneck, this is an excellent trade.
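
As a back-of-the-envelope check, the totals above follow directly from the per-model figures (rough estimates for 7B-scale models, ignoring activations, optimizer state, and KV cache):

# Rough per-model memory from the table above (illustrative GB figures, not measurements)
ppo_models  = {"policy": 28, "reference": 14, "value": 28, "reward": 14}
grpo_models = {"policy": 28, "reference": 14, "reward": 14}  # no value network

ppo_total, grpo_total = sum(ppo_models.values()), sum(grpo_models.values())
print(f"PPO:  {ppo_total} GB")                       # 84 GB
print(f"GRPO: {grpo_total} GB")                      # 56 GB
print(f"Savings: {1 - grpo_total / ppo_total:.0%}")  # 33%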


Mathematical Formulation

Let’s formalize GRPO precisely.

The GRPO Objective

GRPO maximizes:

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \mathcal{L}_{i,t} - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\right]

The Per-Token Loss

Like PPO, GRPO uses the clipped surrogate objective:

\mathcal{L}_{i,t} = \min\left[\rho_{i,t}\, \hat{A}_{i,t},\ \text{clip}(\rho_{i,t},\ 1-\varepsilon,\ 1+\varepsilon)\, \hat{A}_{i,t}\right]

where the importance ratio is:

\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

The clipping mechanism is identical to PPO; what changes is how we compute Â_{i,t}.

Group-Relative Advantage Estimation

For outcome supervision (the standard case), all tokens in output o_i share the same advantage:

\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

where:

  • r_i = r_\phi(q, o_i) is the reward for output o_i
  • \mathbf{r} = \{r_1, r_2, \ldots, r_G\} are the rewards for all G outputs
  • \text{mean}(\mathbf{r}) = \frac{1}{G}\sum_{i=1}^G r_i
  • \text{std}(\mathbf{r}) = \sqrt{\frac{1}{G}\sum_{i=1}^G \left(r_i - \text{mean}(\mathbf{r})\right)^2}

Properties of Group-Relative Advantages

1. Zero-mean (approximately): Within each group, advantages sum to approximately zero: \sum_{i=1}^G \tilde{r}_i \approx 0

This provides balanced training signal: reinforcing good outputs requires suppressing worse ones.

2. Unit variance: By dividing by the group standard deviation, advantages are normalized: \text{Var}(\tilde{r}) = 1

This makes gradients comparable across prompts with different reward scales.

3. Relative, not absolute: The advantage measures performance relative to the group, not absolute quality. An output with reward 0.5 could have positive advantage (if others scored 0.3) or negative (if others scored 0.7).
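
A quick sketch illustrates the third property: the same absolute reward of 0.5 yields a positive advantage in a weak group and a negative one in a strong group (toy numbers, illustrative only):

import torch

def group_adv(rewards):
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-8)

# The first output scores 0.5 in both cases; only its group changes
weak_group   = group_adv([0.5, 0.3, 0.3, 0.3])   # 0.5 is the best  -> positive advantage
strong_group = group_adv([0.5, 0.7, 0.7, 0.7])   # 0.5 is the worst -> negative advantage

print(weak_group[0].item() > 0, strong_group[0].item() < 0)  # True True
print(weak_group.sum().item())  # ~0: advantages sum to zero within each group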


Why This Works

The group-relative approach succeeds for several reasons:

1. Natural Baseline

The mean reward mean(r) estimates the expected reward under the current policy for this prompt:

\mathbb{E}_{o \sim \pi_\theta(O \mid q)}[r(q, o)] \approx \frac{1}{G}\sum_{i=1}^G r_i

With G = 64 samples (as in DeepSeekMath), this estimate is reasonably accurate. This is exactly what we want from a baseline.

2. Matched to Reward Model Training

Reward models are typically trained on comparisons between outputs for the same prompt:

  • “Output A is better than output B for prompt q”

GRPO’s group-relative advantages align with this comparative signal. We’re asking: “Among these G outputs, which are better?”

3. Automatic Scale Normalization

Different prompts may have different “difficulty,” leading to different reward distributions:

  • Easy prompt: All outputs score 0.8-0.9
  • Hard prompt: All outputs score 0.2-0.4

Dividing by std normalizes these, preventing easy prompts from dominating gradients.

4. No Bootstrapping Bias

GAE uses V(s_{t+1}) estimates, introducing bias if the value function is inaccurate. GRPO uses only observed rewards—no bootstrapping, no bias from learned estimates.

5. Simplicity

The entire advantage computation is:

advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

No recursive GAE computation, no value function forward passes, no TD errors.

Key Insight: The group mean is a Monte Carlo estimate of V(s_0) for this prompt. By sampling enough outputs, we get a good baseline without learning anything.


Outcome vs. Process Supervision

GRPO supports two supervision modes.

Outcome Supervision (Default)

A single reward at sequence end:

\hat{A}_{i,t} = \tilde{r}_i \quad \forall\, t \in \{1, \ldots, |o_i|\}

All tokens receive the same advantage—the normalized final reward.

Pros:

  • Simple: Works with any output-level reward model
  • Stable: Single reward signal, no step-level noise

Cons:

  • Coarse credit assignment: Every token gets equal “blame” or “credit”

Process Supervision

Rewards at each reasoning step (e.g., separated by newlines):

Let idx(j) be the token index at the end of step j, and K_i the number of steps in output o_i.

Step 1: Normalize per-step rewards

\tilde{r}_i^{\text{idx}(j)} = \frac{r_i^{\text{idx}(j)} - \text{mean}(\mathbf{R})}{\text{std}(\mathbf{R})}

where R contains all step rewards across all outputs in the group.

Step 2: Compute return-to-go

\hat{A}_{i,t} = \sum_{\text{idx}(j) \geq t} \tilde{r}_i^{\text{idx}(j)}

The advantage at token t is the sum of the normalized rewards of the current and all subsequent steps (those whose end index is at or after t).

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph LR subgraph Process["<b>Process Supervision</b>"] direction LR S1["Step 1<br/>r̃₁ = 0.5<br/>━━━━━<br/>Â = 1.4"] --> S2["Step 2<br/>r̃₂ = −0.2<br/>━━━━━<br/>Â = 0.9"] S2 --> S3["Step 3<br/>r̃₃ = 0.8<br/>━━━━━<br/>Â = 1.1"] S3 --> S4["Step 4<br/>r̃₄ = 0.3<br/>━━━━━<br/>Â = 0.3"] end subgraph Outcome["<b>Outcome Supervision</b>"] direction LR T1["Step 1<br/>━━━━━<br/>Â = 0.7"] --> T2["Step 2<br/>━━━━━<br/>Â = 0.7"] T2 --> T3["Step 3<br/>━━━━━<br/>Â = 0.7"] T3 --> T4["Step 4<br/>r = 0.7<br/>━━━━━<br/>Â = 0.7"] end style S1 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S2 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S3 fill:#1e293b,stroke:#10b981,color:#d1fae5 style S4 fill:#1e293b,stroke:#10b981,color:#d1fae5 style T1 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T2 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T3 fill:#1e293b,stroke:#06b6d4,color:#cffafe style T4 fill:#1e293b,stroke:#06b6d4,color:#cffafe style Process fill:#0f172a,stroke:#10b981,color:#d1fae5 style Outcome fill:#0f172a,stroke:#06b6d4,color:#cffafe
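
Here is a minimal sketch of the two modes for a single output, using the same made-up step rewards as the diagram above (already normalized across the group) and hypothetical step lengths:

import torch

# Hypothetical normalized rewards for the 4 reasoning steps of one output
step_rewards = torch.tensor([0.5, -0.2, 0.8, 0.3])

# Process supervision: return-to-go per step (sum of rewards from this step onward)
step_advantages = torch.flip(torch.cumsum(torch.flip(step_rewards, [0]), 0), [0])
print(step_advantages)  # tensor([1.4000, 0.9000, 1.1000, 0.3000])

# Broadcast each step's advantage to the tokens belonging to that step
tokens_per_step = torch.tensor([5, 7, 6, 4])   # hypothetical step lengths in tokens
token_advantages = torch.repeat_interleave(step_advantages, tokens_per_step)

# Outcome supervision instead assigns one normalized final reward to every token
outcome_advantages = torch.full((int(tokens_per_step.sum()),), 0.7)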

Which to Choose?

| Situation | Recommendation |
| --- | --- |
| Standard RLHF | Outcome supervision |
| Multi-step reasoning (math, code) | Process supervision if a PRM is available |
| No process reward model | Outcome supervision |
| Training stability concerns | Outcome supervision |

DeepSeekMath experiments show process supervision slightly outperforms outcome supervision for math, but the difference is modest. Unless you have a high-quality Process Reward Model (PRM), outcome supervision is the simpler default.


Decoupled KL Regularization

GRPO handles KL penalty differently from standard PPO.

PPO: KL in the Reward

Standard PPO-RLHF adds KL to the per-token reward:

r_t^{\text{total}} = r_t - \beta \log\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{ref}}(o_t \mid q, o_{<t})}

This KL penalty flows into GAE computation, coupling reward signal with regularization.

GRPO: KL in the Loss

GRPO adds KL directly to the loss, decoupled from advantages:

\mathcal{J}_{\text{GRPO}} = \mathcal{L}^{\text{CLIP}} - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]

The Schulman KL Estimator

GRPO uses an unbiased KL estimator:

\mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] = \frac{\pi_{\text{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - \log\frac{\pi_{\text{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - 1

Properties:

  • Non-negative: Always ≥ 0
  • Zero at equality: Exactly 0 when πθ=πref\pi_\theta = \pi_{\text{ref}}
  • Unbiased: Correct expectation under sampling
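
A quick numerical check on a toy 3-token vocabulary shows that averaging this estimator over samples from π_θ recovers the exact KL (illustrative only):

import torch

torch.manual_seed(0)
p_theta = torch.tensor([0.5, 0.3, 0.2])   # toy current policy
p_ref   = torch.tensor([0.4, 0.4, 0.2])   # toy reference policy

exact_kl = (p_theta * (p_theta / p_ref).log()).sum()   # KL(π_θ || π_ref)

# Monte Carlo estimate with the estimator above, sampling tokens from π_θ
samples = torch.multinomial(p_theta, 100_000, replacement=True)
ratio = (p_ref / p_theta)[samples]          # π_ref / π_θ at each sampled token
k3 = ratio - ratio.log() - 1                # non-negative for every sample

print(exact_kl.item(), k3.mean().item())    # the two values should be close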

Why Decouple?

1. Cleaner advantage signal: Advantages reflect only task performance, not regularization.

2. Independent tuning: The KL coefficient β doesn’t interact with advantage normalization.

3. Simpler implementation: No need to track per-token KL during advantage computation.

Key Insight: Decoupling KL from rewards simplifies the algorithm. The policy gradient focuses on task performance; the KL term separately ensures we don’t drift too far from the reference.


Iterative GRPO

As training progresses, the reward model may become a poor discriminator for the improved policy’s outputs. Iterative GRPO addresses this.

The Staleness Problem

  1. Initially: Policy outputs vary in quality; reward model discriminates well
  2. After training: Policy consistently produces good outputs; reward model’s training distribution no longer matches
  3. Result: Degraded training signal

The Solution: Co-evolve Policy and Reward Model

Algorithm: Iterative GRPO

For iteration = 1, ..., I:
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 1: Reset Reference                            │
    └─────────────────────────────────────────────────────┘
    π_ref ← π_θ   (current policy becomes new reference)
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 2: Policy Training                            │
    └─────────────────────────────────────────────────────┘
    For step = 1, ..., M:
        Sample batch of prompts
        Sample G outputs per prompt from π_θ
        Compute rewards using r_φ
        Compute group-relative advantages
        Update π_θ using GRPO objective
    
    ┌─────────────────────────────────────────────────────┐
    │ Phase 3: Reward Model Update                        │
    └─────────────────────────────────────────────────────┘
    Generate comparison data from improved π_θ
    Continually train r_φ with:
        - 90% new data (from current policy)
        - 10% historical data (replay buffer)

Key Design Choices

Reference reset: Each iteration starts fresh, allowing larger cumulative policy drift while constraining per-iteration change.

Replay buffer: 10% historical data prevents the reward model from forgetting how to score earlier, easier outputs.

Single GRPO update: DeepSeekMath uses one gradient step per sampled batch (μ = 1), prioritizing fresh samples over reuse.
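
A minimal sketch of the outer loop follows; the three callables (train_grpo_steps, generate_comparisons, update_reward_model) are hypothetical hooks supplied by the caller, not APIs defined in this post:

import copy
import random

def iterative_grpo(policy, reward_model, prompts, train_grpo_steps,
                   generate_comparisons, update_reward_model, iterations=3):
    """Sketch of the iterative GRPO outer loop (hypothetical helper callables)."""
    replay_buffer = []
    for _ in range(iterations):
        # Phase 1: the current policy becomes the new KL reference
        reference = copy.deepcopy(policy)

        # Phase 2: standard GRPO training against the (frozen) reward model
        train_grpo_steps(policy, reference, reward_model, prompts)

        # Phase 3: refresh the reward model with ~90% new comparisons, ~10% replay
        new_data = generate_comparisons(policy, prompts)
        replay = random.sample(replay_buffer, min(len(replay_buffer), len(new_data) // 9))
        update_reward_model(reward_model, new_data + replay)
        replay_buffer.extend(new_data)
    return policy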


PyTorch Implementation

Here’s a complete, educational implementation of GRPO:

"""
Group Relative Policy Optimization (GRPO)
=========================================

Memory-efficient alternative to PPO that eliminates the value network
by computing advantages from groups of sampled outputs.

Reference: DeepSeekMath (Shao et al., 2024)
"""

import torch
import torch.nn.functional as F
from torch import Tensor
from dataclasses import dataclass
from typing import Tuple, List


@dataclass
class GRPOConfig:
    """GRPO hyperparameters."""
    
    # Core settings
    group_size: int = 64           # G: outputs per prompt
    clip_epsilon: float = 0.2      # ε: PPO clipping
    kl_coef: float = 0.04          # β: KL penalty coefficient
    
    # Generation
    max_length: int = 1024
    temperature: float = 1.0
    
    # Training
    learning_rate: float = 1e-6
    max_grad_norm: float = 1.0
    
    # Numerical stability
    eps: float = 1e-8


def compute_group_advantages(rewards: Tensor, eps: float = 1e-8) -> Tensor:
    """
    Compute group-relative advantages.
    
    This is GRPO's key innovation: use group statistics
    instead of a learned value function.
    
    Args:
        rewards: [G] rewards for each output in the group
        eps: Small constant for numerical stability
        
    Returns:
        advantages: [G] normalized advantages
    """
    mean_r = rewards.mean()
    std_r = rewards.std() + eps
    advantages = (rewards - mean_r) / std_r
    return advantages


def compute_schulman_kl(
    policy_log_probs: Tensor,
    reference_log_probs: Tensor,
) -> Tensor:
    """
    Compute Schulman's unbiased KL estimator.
    
    D_KL = π_ref/π_θ - log(π_ref/π_θ) - 1
    
    This estimator is:
    - Non-negative (always >= 0)
    - Unbiased (correct expectation)
    - Zero when π_θ = π_ref
    
    Args:
        policy_log_probs: [batch, seq_len] from π_θ
        reference_log_probs: [batch, seq_len] from π_ref
        
    Returns:
        kl: [batch, seq_len] per-token KL
    """
    log_ratio = reference_log_probs - policy_log_probs  # log(π_ref/π_θ)
    kl = torch.exp(log_ratio) - log_ratio - 1
    return kl


def compute_grpo_loss(
    policy_log_probs: Tensor,
    old_log_probs: Tensor,
    reference_log_probs: Tensor,
    advantages: Tensor,
    response_mask: Tensor,
    clip_epsilon: float = 0.2,
    kl_coef: float = 0.04,
) -> Tuple[Tensor, dict]:
    """
    Compute GRPO loss with clipping and KL penalty.
    
    Args:
        policy_log_probs: [G, seq_len] current policy
        old_log_probs: [G, seq_len] policy at sampling time
        reference_log_probs: [G, seq_len] reference policy
        advantages: [G] per-output advantages (will broadcast)
        response_mask: [G, seq_len] valid token mask
        clip_epsilon: PPO clipping parameter
        kl_coef: KL penalty coefficient
        
    Returns:
        loss: Scalar loss to minimize
        metrics: Dictionary of training metrics
    """
    G, seq_len = policy_log_probs.shape
    
    # =========================================
    # 1. Importance ratio: ρ = π_θ / π_old
    # =========================================
    log_ratio = policy_log_probs - old_log_probs
    ratio = torch.exp(log_ratio)  # [G, seq_len]
    
    # =========================================
    # 2. Expand advantages to token level
    # =========================================
    # [G] -> [G, seq_len]
    token_advantages = advantages.unsqueeze(1).expand(-1, seq_len)
    
    # =========================================
    # 3. PPO clipped objective
    # =========================================
    clipped_ratio = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
    
    policy_loss_1 = ratio * token_advantages
    policy_loss_2 = clipped_ratio * token_advantages
    policy_loss = torch.min(policy_loss_1, policy_loss_2)
    
    # =========================================
    # 4. KL penalty (decoupled from reward)
    # =========================================
    kl_penalty = compute_schulman_kl(policy_log_probs, reference_log_probs)
    
    # =========================================
    # 5. Combine losses
    # =========================================
    # Maximize policy objective, minimize KL
    token_loss = -policy_loss + kl_coef * kl_penalty
    
    # Mask and average
    masked_loss = token_loss * response_mask
    per_output_loss = masked_loss.sum(dim=1) / response_mask.sum(dim=1).clamp(min=1)
    loss = per_output_loss.mean()
    
    # =========================================
    # 6. Metrics
    # =========================================
    with torch.no_grad():
        clip_fraction = ((ratio - 1).abs() > clip_epsilon).float()
        clip_fraction = (clip_fraction * response_mask).sum() / response_mask.sum()
        
        mean_kl = (kl_penalty * response_mask).sum() / response_mask.sum()
    
    metrics = {
        "loss": loss.item(),
        "clip_fraction": clip_fraction.item(),
        "mean_kl": mean_kl.item(),
        "mean_advantage": advantages.mean().item(),
        "advantage_std": advantages.std().item(),
    }
    
    return loss, metrics


class GRPOTrainer:
    """
    GRPO trainer for LLM fine-tuning.
    
    Unlike PPO, requires only THREE models (no value network):
    - policy_model: The LLM being trained
    - reference_model: Frozen copy for KL penalty
    - reward_model: Scores complete responses
    
    Memory savings: ~33% compared to PPO
    """
    
    def __init__(
        self,
        policy_model: torch.nn.Module,
        reference_model: torch.nn.Module,
        reward_model,  # Can be a function or model
        tokenizer,
        config: GRPOConfig,
    ):
        self.policy = policy_model
        self.reference = reference_model
        self.reward_fn = reward_model
        self.tokenizer = tokenizer
        self.config = config
        
        # Freeze reference model
        for param in self.reference.parameters():
            param.requires_grad = False
        
        # Single optimizer (no value network!)
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config.learning_rate,
            betas=(0.9, 0.95),
            weight_decay=0.1
        )
    
    @torch.no_grad()
    def sample_group(
        self,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> Tuple[Tensor, Tensor]:
        """
        Sample G outputs for a single prompt.
        
        Args:
            prompt_ids: [1, prompt_len] token ids
            attention_mask: [1, prompt_len]
            
        Returns:
            output_ids: [G, total_len] generated sequences
            output_masks: [G, total_len] attention masks
        """
        G = self.config.group_size
        device = prompt_ids.device
        
        # Expand prompt for batch generation
        prompt_ids = prompt_ids.expand(G, -1)
        attention_mask = attention_mask.expand(G, -1)
        
        # Sample from policy
        self.policy.eval()
        outputs = self.policy.generate(
            input_ids=prompt_ids,
            attention_mask=attention_mask,
            max_new_tokens=self.config.max_length,
            do_sample=True,
            temperature=self.config.temperature,
            pad_token_id=self.tokenizer.pad_token_id,
            return_dict_in_generate=True,
        )
        
        output_ids = outputs.sequences
        output_masks = (output_ids != self.tokenizer.pad_token_id).long()
        
        return output_ids, output_masks
    
    @torch.no_grad()
    def compute_rewards(
        self,
        prompt_text: str,
        output_ids: Tensor,
        prompt_len: int,
    ) -> Tensor:
        """
        Compute rewards for each output in the group.
        
        Args:
            prompt_text: The original prompt string
            output_ids: [G, total_len] generated sequences
            prompt_len: Length of prompt in tokens
            
        Returns:
            rewards: [G] scalar reward per output
        """
        G = output_ids.shape[0]
        rewards = []
        
        for i in range(G):
            # Extract response tokens
            response_ids = output_ids[i, prompt_len:]
            response_text = self.tokenizer.decode(
                response_ids, skip_special_tokens=True
            )
            
            # Score with reward model
            reward = self.reward_fn(prompt_text, response_text)
            rewards.append(reward)
        
        return torch.tensor(rewards, device=output_ids.device)
    
    def compute_log_probs(
        self,
        model: torch.nn.Module,
        input_ids: Tensor,
        attention_mask: Tensor,
        response_start_idx: int,
    ) -> Tensor:
        """
        Compute per-token log probabilities for response tokens.
        
        Args:
            model: Language model
            input_ids: [batch, seq_len] full sequences
            attention_mask: [batch, seq_len]
            response_start_idx: Where response tokens begin
            
        Returns:
            log_probs: [batch, response_len] per-token log probs
        """
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True,
        )
        logits = outputs.logits  # [batch, seq_len, vocab]
        
        # Shift for next-token prediction
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        
        # Log probabilities
        log_probs = F.log_softmax(shift_logits, dim=-1)
        
        # Gather log probs for actual tokens
        token_log_probs = log_probs.gather(
            dim=-1,
            index=shift_labels.unsqueeze(-1)
        ).squeeze(-1)
        
        # Extract response tokens only
        response_log_probs = token_log_probs[:, response_start_idx - 1:]
        
        return response_log_probs
    
    def train_step(
        self,
        prompt_text: str,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> dict:
        """
        Single GRPO training step.
        
        Args:
            prompt_text: Original prompt string
            prompt_ids: [1, prompt_len] tokenized prompt
            attention_mask: [1, prompt_len]
            
        Returns:
            metrics: Training metrics dictionary
        """
        prompt_len = prompt_ids.shape[1]
        device = prompt_ids.device
        
        # =============================================
        # Step 1: Sample group of outputs
        # =============================================
        output_ids, output_masks = self.sample_group(prompt_ids, attention_mask)
        G = output_ids.shape[0]
        
        # =============================================
        # Step 2: Compute rewards
        # =============================================
        rewards = self.compute_rewards(prompt_text, output_ids, prompt_len)
        
        # =============================================
        # Step 3: Compute group-relative advantages
        # =============================================
        advantages = compute_group_advantages(rewards, self.config.eps)
        
        # =============================================
        # Step 4: Get log probs (old = current pre-update)
        # =============================================
        with torch.no_grad():
            old_log_probs = self.compute_log_probs(
                self.policy, output_ids, output_masks, prompt_len
            )
            reference_log_probs = self.compute_log_probs(
                self.reference, output_ids, output_masks, prompt_len
            )
        
        # =============================================
        # Step 5: Policy gradient update
        # =============================================
        self.policy.train()
        
        policy_log_probs = self.compute_log_probs(
            self.policy, output_ids, output_masks, prompt_len
        )
        
        # Response mask
        response_mask = output_masks[:, prompt_len:]
        
        loss, metrics = compute_grpo_loss(
            policy_log_probs,
            old_log_probs,
            reference_log_probs,
            advantages,
            response_mask,
            self.config.clip_epsilon,
            self.config.kl_coef,
        )
        
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            self.policy.parameters(), 
            self.config.max_grad_norm
        )
        self.optimizer.step()
        
        # Add reward metrics
        metrics["mean_reward"] = rewards.mean().item()
        metrics["reward_std"] = rewards.std().item()
        
        return metrics

Implementation Notes

Group size selection: DeepSeekMath uses G = 64. Larger groups give more stable advantage estimates. A reasonable range is 16-128.

Single update per sample: The implementation does one gradient step per sampled batch. Multiple steps would require re-computing advantages as the policy changes.

KL coefficient: Start with β = 0.04. Lower values allow more divergence; higher values constrain updates.
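
For reference, here is a hypothetical usage sketch of GRPOTrainer with a Hugging Face causal LM; the model name, group size, and reward function are placeholders for a quick smoke test, not recommendations:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # any causal LM works; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)

def reward_fn(prompt: str, response: str) -> float:
    # Placeholder: plug in a reward model or a rule-based checker here
    return float(len(response.strip()) > 0)

config = GRPOConfig(group_size=8, max_length=128)   # small values for testing
trainer = GRPOTrainer(policy, reference, reward_fn, tokenizer, config)

prompt = "Solve: 2x + 5 = 13. Show your reasoning."
enc = tokenizer(prompt, return_tensors="pt")
metrics = trainer.train_step(prompt, enc["input_ids"], enc["attention_mask"])
print(metrics)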


DeepSeek-R1: GRPO at Scale

In January 2025, DeepSeek released DeepSeek-R1, a reasoning model matching OpenAI’s o1 on many benchmarks. GRPO is central to its training.

R1’s Training Pipeline

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph TB subgraph R1Pipeline["<b>DeepSeek-R1 Training Pipeline</b>"] direction TB Stage1["<b>Stage 1: Cold Start</b><br/>━━━━━━━━━━<br/>Small amount of long CoT data<br/>Teaches extended reasoning format"] Stage2["<b>Stage 2: Reasoning RL</b><br/>━━━━━━━━━━<br/>GRPO with rule-based rewards<br/>Math correctness, code execution<br/>Model develops 'thinking' traces"] Stage3["<b>Stage 3: Rejection Sampling + SFT</b><br/>━━━━━━━━━━<br/>Sample from RL checkpoint<br/>Human annotation for quality<br/>SFT on curated reasoning data"] Stage4["<b>Stage 4: Final RL</b><br/>━━━━━━━━━━<br/>Broader GRPO training<br/>Helpfulness + harmlessness<br/>Multi-objective alignment"] Stage1 --> Stage2 Stage2 --> Stage3 Stage3 --> Stage4 end style Stage1 fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style Stage2 fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style Stage3 fill:#1e293b,stroke:#8b5cf6,color:#e9d5ff,stroke-width:2px style Stage4 fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style R1Pipeline fill:none,stroke:#334155,color:#94a3b8,stroke-width:1px

Why GRPO for R1?

1. Long reasoning traces: R1 produces outputs with thousands of tokens. A value network would need to predict final reward from any point in this long sequence—extremely difficult.

2. Verifiable rewards: For math and code, rewards are rule-based (correct/incorrect), so there is no reward-model uncertainty; a sketch of such a reward function follows this list.

3. Memory efficiency: At R1’s scale, GRPO’s memory savings are crucial.

4. Training stability: Simpler architecture means fewer things can go wrong.
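
As an illustration of what “rule-based” means here, the toy sketch below checks a final answer against a known ground truth; real pipelines use far more robust extraction (boxed answers, symbolic equivalence, unit tests for code):

import re

def math_correctness_reward(response: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the text after 'the answer is' matches the
    ground truth after light normalization, else 0.0."""
    match = re.search(r"the answer is\s*(.+)", response, flags=re.IGNORECASE)
    if match is None:
        return 0.0
    normalize = lambda s: re.sub(r"[\s.]+$", "", s).strip().lower()
    return 1.0 if normalize(match.group(1)) == normalize(ground_truth) else 0.0

print(math_correctness_reward("Let me check: 2(4) + 5 = 13. The answer is x = 4.", "x = 4"))  # 1.0
print(math_correctness_reward("So x must equal 5.", "x = 4"))                                 # 0.0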

The “Aha Moment” Phenomenon

The R1 paper describes emergent behavior during GRPO training: the model spontaneously developed “aha moment” patterns—places where it reconsiders and corrects earlier reasoning mistakes.

Example reasoning trace:
"Let me solve this step by step...
 First, I'll set up the equation: 2x + 5 = 13
 Subtracting 5: 2x = 8
 Wait, let me verify that. 13 - 5 = 8. Yes, correct.
 Dividing by 2: x = 4
 
 Let me check: 2(4) + 5 = 8 + 5 = 13 ✓
 
 The answer is x = 4."

This self-correction emerged without explicit supervision—GRPO’s optimization pressure (combined with correctness rewards) naturally discovered it.

R1-Zero: Pure RL Without SFT

DeepSeek also released R1-Zero, trained with GRPO directly from the base model without any SFT stage. Remarkably, it still develops coherent reasoning, though with some readability issues that the full R1 pipeline addresses.

This demonstrates GRPO’s power: given only a correctness signal, the algorithm can induce sophisticated reasoning behavior.

Key Insight: GRPO’s simplicity makes it stable enough for extreme-scale training. DeepSeek-R1 shows that you can train world-class reasoning models without the complexity of PPO’s value network.


A Unified View: GRPO, PPO, DPO, and RFT

The DeepSeekMath paper provides a unified framework for understanding different alignment methods.

The General Policy Gradient

All methods can be viewed through:

\nabla_\theta J = \mathbb{E}_{(q, o) \sim \mathcal{D}}\left[\sum_t \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \cdot w(q, o)\right]

Methods differ in:

  1. Data source D: Where do samples come from?
  2. Gradient coefficient w(q, o): How are outputs weighted?

Method Comparison

| Method | Data Source | Coefficient w | Online? |
| --- | --- | --- | --- |
| SFT | Human demos | w = 1 (uniform) | No |
| RFT | Filtered policy samples | w = 1 (filtered) | No |
| DPO | Preference pairs | w = −βσ′(·) | No |
| Online RFT | Current policy | w = 1 (filtered) | Yes |
| PPO | Current policy | w = A_t (GAE) | Yes |
| GRPO | Current policy | w = r̃ (group-normalized) | Yes |

Key Insights

Online vs. Offline: GRPO, PPO, and Online RFT sample from the current policy and can adapt. DPO and RFT use fixed datasets.

Weighted vs. Filtered: RFT keeps only correct outputs (binary filter). GRPO keeps all outputs but weights by normalized reward (continuous weighting).

Advantage estimation: PPO learns V(s); GRPO uses group statistics; DPO implicitly computes advantages through preferences.
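
The coefficient column of the table can be written out as tiny functions (a toy sketch; the DPO coefficient depends on both responses in a preference pair, so it is omitted here):

import torch

def w_sft(reward, group_rewards):
    return 1.0                            # uniform weight on demonstrations

def w_rft(reward, group_rewards, threshold=1.0):
    return float(reward >= threshold)     # binary filter: keep only correct outputs

def w_grpo(reward, group_rewards, eps=1e-8):
    r = torch.tensor(group_rewards)
    return float((reward - r.mean()) / (r.std() + eps))   # continuous, group-normalized

group = [0.0, 1.0, 1.0, 0.0]   # toy 0/1 correctness rewards for G = 4 outputs
print([w_rft(r, group) for r in group])             # [0.0, 1.0, 1.0, 0.0]
print([round(w_grpo(r, group), 2) for r in group])  # symmetric positive/negative weights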

%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'11px', 'fontFamily':'monospace' } }}%% graph LR subgraph Methods["<b>Alignment Methods: A Unified View</b>"] direction LR SFT["<b>SFT</b><br/>━━━━━<br/>Offline<br/>w = 1"] --> RFT["<b>RFT</b><br/>━━━━━<br/>Offline<br/>w = 1 (filtered)"] RFT --> DPO["<b>DPO</b><br/>━━━━━<br/>Offline<br/>w = −βσ'"] DPO --> PPO["<b>PPO</b><br/>━━━━━<br/>Online<br/>w = GAE"] PPO --> GRPO["<b>GRPO</b><br/>━━━━━<br/>Online<br/>w = (r−μ)/σ"] end style SFT fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style RFT fill:#1e293b,stroke:#64748b,color:#cbd5e1,stroke-width:2px style DPO fill:#1e293b,stroke:#8b5cf6,color:#e9d5ff,stroke-width:2px style PPO fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px style GRPO fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style Methods fill:none,stroke:#334155,color:#94a3b8,stroke-width:1px

Limitations and When to Use Alternatives

GRPO Limitations

1. Sampling cost: G outputs per prompt (typically 64). More generation compute than PPO’s single sample.

2. Advantage variance: With small groups, the mean is a noisy baseline; G ≥ 32 is typically needed for stability (see the simulation sketch after this list).

3. Uniform token credit: All tokens get the same advantage. No fine-grained credit assignment without process supervision.

4. Reward model dependency: Like all RL methods, susceptible to reward hacking if the reward model has exploits.

5. No value function benefits: Can’t use value-guided search or intrinsic motivation bonuses.
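
To see why small groups hurt, a quick simulation estimates how noisy the group mean is as a baseline for different G (assuming i.i.d. rewards with standard deviation 0.2; purely illustrative):

import torch

torch.manual_seed(0)
true_mean, reward_std = 0.5, 0.2

for G in [4, 16, 64, 256]:
    # 10,000 simulated groups of size G; how much does the group mean fluctuate?
    group_means = torch.normal(true_mean, reward_std, size=(10_000, G)).mean(dim=1)
    print(f"G={G:>3}: std of the baseline estimate ~ {group_means.std():.4f}")

# The baseline noise shrinks roughly as 1/sqrt(G): quadrupling G halves it.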

When to Use Each Method

| Situation | Recommendation |
| --- | --- |
| Memory-constrained | ✅ GRPO |
| Compute-constrained | Consider PPO |
| Outcome rewards only | ✅ GRPO |
| Dense per-step rewards | Consider PPO (GAE) |
| Offline preferences only | Use DPO |
| No reward model | Use DPO or RFT |
| Value-guided search needed | Use PPO |
| Multi-reward training | Use GDPO (Part 4) |

Key Takeaways

The Core Innovation

\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}

Sample multiple outputs, normalize rewards, use as advantages. No value network.

Memory Savings

  • Eliminates ~28GB value network (for 7B model)
  • 33% reduction in model memory
  • Often the difference between 1 GPU and multiple

Design Choices

  • Group size: G = 64 for stability
  • Clip epsilon: ε = 0.2 (standard PPO)
  • KL coefficient: β = 0.04
  • Decoupled KL: In loss, not reward

When GRPO Excels

  • Memory-constrained training
  • Outcome-based rewards
  • Long generations (where value prediction is hard)
  • Verifiable rewards (math, code)

DeepSeek-R1 Connection

GRPO powers R1’s remarkable reasoning through:

  • Rule-based correctness rewards
  • Emergent self-correction (“aha moments”)
  • Stable training at scale

What’s Next

In Part 4: GDPO, we’ll see what happens when GRPO meets multi-reward settings:

  • Why combined reward normalization causes “advantage collapse”
  • GDPO’s solution: decouple normalization per reward
  • Production patterns for tool calling, math, and coding

Further Reading

Original Papers:

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)

Related Work:

Implementations:

Article series

Policy Optimization for LLMs: From Fundamentals to Production

Part 3 of 4

  1. Part 1 Reinforcement Learning Foundations for LLM Alignment
  2. Part 2 PPO for Language Models: The RLHF Workhorse
  3. Part 3 GRPO: Eliminating the Value Network
  4. Part 4 GDPO: Multi-Reward RL Done Right
