PPO vs GRPO
A Deep Dive into Two Policy Gradient Implementations
This post compares two policy optimization algorithms implemented in our RL codebase: Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). We’ll examine their architectural differences, loss functions, and training loops from an engineering perspective.
Shared Abstractions: The Policy Interface
Both PPO and GRPO optimize a stochastic policy—a neural network that outputs a probability distribution over actions given a state. Despite their algorithmic differences, both algorithms need the same core operations from a policy:
- Sample actions during environment rollout
- Evaluate action probabilities during training (for importance sampling)
This shared requirement allows us to define a single StochasticPolicy interface that both algorithms consume:
class StochasticPolicy:
    def get_action(states) -> (action, log_prob)
    def evaluate_actions(states, actions) -> (log_probs, entropy)
The benefit is that policy architectures become algorithm-agnostic. Whether using a simple MLP, a continuous Gaussian policy, or a LoRA-finetuned LLM backbone, the same network can be trained with either PPO or GRPO without modification.
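For concreteness, here's a minimal sketch of a discrete-action policy satisfying this interface in PyTorch. The network shape and the Categorical head are illustrative assumptions, not the codebase's actual modules:

import torch
import torch.nn as nn
from torch.distributions import Categorical

class MLPPolicy(nn.Module):
    """Minimal discrete-action StochasticPolicy; trainable with PPO or GRPO."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),   # logits over discrete actions
        )

    def get_action(self, states):
        """Sample actions during environment rollout."""
        dist = Categorical(logits=self.net(states))
        actions = dist.sample()
        return actions, dist.log_prob(actions)

    def evaluate_actions(self, states, actions):
        """Re-evaluate stored actions during training (importance sampling)."""
        dist = Categorical(logits=self.net(states))
        return dist.log_prob(actions), dist.entropy()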
The Core Architectural Difference: Value Functions
The fundamental distinction between PPO and GRPO lies in how they estimate the baseline for variance reduction.
- PPO trains a value network alongside the policy. This network learns to predict expected returns from each state, providing a per-timestep baseline for advantage computation.
- GRPO eliminates the value network entirely. Instead, it collects multiple trajectories per update and normalizes rewards *across the group*. The baseline is simply the mean reward of the batch, with no learned parameters required.
This trade-off is significant: PPO carries more model complexity but needs fewer environment samples per update, while GRPO keeps the architecture simpler but must collect group_dim trajectories before each optimization step.
Loss Function Comparison
Both algorithms use the clipped surrogate objective from PPO, but differ in advantage computation and regularization.
The Clipped Objective (Shared)
Both algorithms compute the probability ratio between current and old policies, then clip it to prevent destructive updates:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]$$

where

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

is the importance sampling ratio.
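As a sketch, this objective is a few lines of PyTorch. The function name and the eps default are illustrative; the leading minus turns the maximized objective into a loss for a standard optimizer:

import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() takes the pessimistic bound; negate because optimizers minimize
    return -torch.min(unclipped, clipped).mean()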
Advantage estimation
PPO uses Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}$$

where

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

is the one-step TD residual computed from the learned value function.
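A minimal sketch of the standard backward recursion for GAE, assuming values holds T+1 entries (the bootstrap value of the final state appended at the end):

import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones: shape (T,); values: shape (T+1,) with bootstrap value."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), zeroed past episode end
        nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages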
Meanwhile, GRPO uses group-relative normalization (which gives rise to its name):

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\!\left(\{r_1, \ldots, r_G\}\right)}{\operatorname{std}\!\left(\{r_1, \ldots, r_G\}\right)}$$

where the mean and standard deviation are computed across the $G$ trajectories collected for the same episode rollout.
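In code this normalization is essentially a one-liner. An illustrative example with made-up rewards for a group of four rollouts:

import torch

rewards = torch.tensor([1.0, 3.0, 0.0, 4.0])  # one reward per rollout in the group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # rollouts above the group mean get positive advantage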
Value loss and regularization
GRPO drops PPO's value loss term, since it has no value function to train.
Regularization also differs: the original PPO algorithm uses an entropy bonus, while some later extensions switched to a KL-divergence penalty that limits how far the policy distribution moves during a training step. The KL penalty is what GRPO uses.
The GRPO paper also introduced a more stable, always non-negative estimator of the KL divergence:

$$\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \approx \frac{\pi_{\text{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - \log\frac{\pi_{\text{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - 1$$
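A sketch of that estimator, given per-action log-probabilities under the current and reference policies; the function name is mine:

import torch

def kl_penalty(logp_theta, logp_ref):
    # ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta: always
    # non-negative, lower variance than the naive log-ratio sample estimate.
    log_ratio = logp_ref - logp_theta
    return (log_ratio.exp() - log_ratio - 1.0).mean()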
Final loss
For PPO, it’s:

$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\!\left[ L^{\text{CLIP}}_t(\theta) - c_1\, L^{\text{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$

while for GRPO it’s:

$$L^{\text{GRPO}}(\theta) = \hat{\mathbb{E}}\!\left[ L^{\text{CLIP}}(\theta) - \beta\, \mathbb{D}_{\text{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\text{ref}} \right] \right]$$

where $c_1$, $c_2$, and $\beta$ are weighting coefficients, $L^{\text{VF}}$ is the value loss, and $S$ is the entropy bonus.
Key Insight: Temporal vs. Batch Normalization
The advantage computation is where these algorithms fundamentally diverge:
- PPO's GAE performs temporal credit assignment: it propagates information backward through time using the learned value function to determine which actions led to good outcomes.
- GRPO's group normalization performs batch comparison: it asks "how did this trajectory's reward compare to other trajectories from the same policy?" This sidesteps temporal credit assignment entirely.
Training Loop Comparison
Because GRPO trains on multiple trajectories from the same initial condition (or, in LLM terms, multiple completions for the same input query), several aspects of the training loop change:
PPO's tensors for states, actions, rewards, etc. have shape (T, ...), where T is the trajectory length. GRPO introduces an additional group dimension G, changing the shape to (T, G, ...).
Because this group dimension acts much like a batch dimension, GRPO's trajectories need to be padded to a common length and masked.
Lastly, because GRPO doesn't use a value function, only the policy module is trained.
The Masking Requirement in GRPO
Since GRPO stacks multiple trajectories of different lengths, it must:
- Pad shorter trajectories to match the longest
- Create a mask tensor indicating valid (non-padded) positions
- Apply the mask during:
  - Advantage normalization (compute mean/std only over valid rewards)
  - Loss computation (exclude padded timesteps from gradients)

This adds implementation complexity but is unavoidable when batching variable-length sequences.
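Here's a sketch of how the padding and masked statistics might fit together. The helper names mirror the pseudocode in the next section but are assumptions about the codebase, and this version normalizes each rollout's total reward across the group:

import torch

def stack_and_pad(reward_seqs):
    """Pad variable-length per-trajectory reward sequences to (T_max, G)."""
    G = len(reward_seqs)
    T_max = max(len(r) for r in reward_seqs)
    rewards = torch.zeros(T_max, G)
    mask = torch.zeros(T_max, G)
    for g, r in enumerate(reward_seqs):
        rewards[: len(r), g] = torch.as_tensor(r)
        mask[: len(r), g] = 1.0   # 1 where the timestep is real, 0 where padded
    return rewards, mask

def group_normalize(rewards, mask, eps=1e-8):
    """Group-relative advantages from padded rewards, ignoring padded steps."""
    returns = (rewards * mask).sum(dim=0)   # (G,) total reward per rollout
    return (returns - returns.mean()) / (returns.std() + eps)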
Pseudocode Comparison
PPO Training Loop:
for epoch in epochs:
    trajectory = collect_episodes(policy, env)      # Shape: (T, ...)
    advantages = compute_gae(trajectory, value_fn)  # Temporal credit assignment
    loss = ppo_loss(policy, value_fn, trajectory, advantages)
    optimizer.step(loss)                            # Updates policy AND value_fn
GRPO Training Loop:
for epoch in epochs:
    trajectories = [collect_episodes(policy, env)
                    for _ in range(group_dim)]      # List of G trajectories
    batched = stack_and_pad(trajectories)           # Shape: (T, G, ...)
    mask = compute_mask(batched)                    # Valid position indicator
    advantages = group_normalize(batched.rewards, mask)  # Batch comparison
    loss = grpo_loss(policy, batched, advantages, mask)
    optimizer.step(loss)                            # Updates policy only
When to Use Each Algorithm
The obvious difference is that PPO is a general-purpose policy optimization algorithm: it can handle LLM tuning as well as other RL use cases.
It also has the edge when environment samples are limited: the value function increases sample efficiency and enables learning from fewer trajectories.
GRPO, on the other hand, was designed specifically for LLM tuning. I took the liberty of unleashing it on a regular RL environment, and it is not well suited to that setting: it's much less sample efficient and assumes a strong pretrained policy as a starting point.
Conclusion
PPO and GRPO share the same clipped surrogate objective but differ fundamentally in how they estimate advantages:
- PPO learns a value function for temporal credit assignment, adding model complexity but reducing sample requirements
- GRPO uses group-relative normalization, trading sample efficiency for architectural simplicity
The shared policy interface in our implementation means you can benchmark both algorithms with identical network architectures—only the training loop and loss computation differ.
Thank you for reading!