Add entropy regularization to GRPO #6140
    else:
        return (x * mask).sum() / completion_token_count

self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())
policy_loss is captured after loss = loss / normalizer, where normalizer = current_gradient_accumulation_steps, so the logged policy_loss is the per-micro-batch contribution, not the step loss; it'll read ~accum× too small. I think we should capture before dividing.
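To make the scaling concern concrete, here is a minimal sketch in plain Python. The loss values and the `accumulation_steps` name are made up for illustration; it only shows that logging after the division shrinks the reported value by the accumulation factor, not how the trainer actually wires this up.

```python
# Hypothetical stand-ins: values are invented, names only echo the comment above.
accumulation_steps = 4  # plays the role of current_gradient_accumulation_steps
micro_batch_losses = [2.0, 1.0, 3.0, 2.0]  # unscaled policy loss per micro-batch


def mean(values):
    return sum(values) / len(values)


logged_before_division = []
logged_after_division = []
for policy_loss in micro_batch_losses:
    # Option A: capture the metric first, then scale for backward.
    logged_before_division.append(policy_loss)
    # Option B: capture after `loss = loss / normalizer`.
    logged_after_division.append(policy_loss / accumulation_steps)

print("logged before dividing:", mean(logged_before_division))
print("logged after dividing: ", mean(logged_after_division))
print("ratio:", mean(logged_before_division) / mean(logged_after_division))
```

With these numbers the value captured after the division is exactly `accumulation_steps` times smaller than the one captured before it.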
Another remark: it can be misleading to log policy_loss when it's not used in the loss. Maybe we should gate its logging.
And just to match the house style:
self._metrics[mode]["policy_loss"].append(self.accelerator.gather(policy_loss).nanmean().item())

# drops below entropy_target again.
if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
    if self.loss_type in ["grpo", "sapo", "luspo"]:
        entropy_loss = ((entropies * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean() / normalizer
The entropy block lumps luspo with grpo/sapo, but luspo's actual loss is (per_token_loss * mask.sum(1,keepdim=True)).mean(). So for luspo the entropy bonus lives on a different scale than the policy term and entropy_coef means something different there. Either give luspo its own branch or note it. To be confirmed, but probably something like:
entropy_loss = (entropies * mask).sum(-1).mean() / normalizer
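A rough way to see the scale gap is to evaluate both reductions on a toy batch. The sketch below uses plain Python lists instead of tensors, with invented entropies and mask; it compares the per-sequence mean from the diff with the per-sequence sum from the suggestion. Whether the sum is the right match for luspo is still to be confirmed, as noted above.

```python
# Toy batch: 2 completions, padded to 4 tokens. All values are invented.
entropies = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
normalizer = 1  # stand-in for the gradient-accumulation normalizer

# Equivalent of (entropies * mask).sum(-1)
seq_sums = [
    sum(e * m for e, m in zip(row_e, row_m))
    for row_e, row_m in zip(entropies, mask)
]
# Equivalent of mask.sum(-1).clamp(min=1.0)
token_counts = [max(sum(row), 1) for row in mask]

# Reduction in the diff: mean over sequences of per-sequence mean entropy.
per_seq_mean = sum(s / n for s, n in zip(seq_sums, token_counts)) / len(seq_sums) / normalizer

# Reduction in the suggestion: mean over sequences of per-sequence entropy sum.
per_seq_sum = sum(seq_sums) / len(seq_sums) / normalizer

print("per-sequence mean reduction:", per_seq_mean)
print("per-sequence sum reduction: ", per_seq_sum)
```

The sum-based value grows with completion length while the mean-based one does not, so the same `entropy_coef` would weigh the bonus differently under the two reductions.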
self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())
if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
    self._metrics[mode]["entropy_loss"].append(world_entropy)
We would have two near-duplicate entropy metrics. We could drop entropy_loss and just log entropy. The existing entropy metric already logs mean entropy; entropy_loss is just a slightly different (global vs. gathered-local-mean) computation of the same quantity.
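For reference, the two computations only diverge when ranks hold different numbers of completion tokens. A small sketch with invented per-rank statistics (plain Python, no distributed setup; `rank_stats` is a hypothetical stand-in):

```python
# Hypothetical per-rank statistics: (sum of masked entropies, number of tokens).
rank_stats = [(6.0, 4), (1.0, 1)]

# "Gathered local mean": each rank averages locally, then the means are averaged.
local_means = [entropy_sum / count for entropy_sum, count in rank_stats]
mean_of_local_means = sum(local_means) / len(local_means)

# "Global": sums and token counts are reduced across ranks before dividing.
global_mean = sum(s for s, _ in rank_stats) / sum(c for _, c in rank_stats)

print("mean of local means:", mean_of_local_means)
print("global token-weighted mean:", global_mean)
```

When every rank has the same token count the two values coincide, which is why they read as near-duplicates in practice.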
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6e8f498.

Add entropy regularization to GRPO.
Fix #3320.
Supersede and close #3628.
This PR adds support for entropy regularization to the GRPO trainer, including both static and adaptive entropy control. Entropy regularization helps encourage exploration and prevents the policy from collapsing to deterministic outputs. The changes include configuration options, loss computation, metrics logging, checkpointing, and documentation updates, as well as new tests to verify the feature.
Changes
GRPO Trainer: Entropy Regularization
Implementation and Loss Function
… GRPOConfig, and an error is raised if entropy regularization is used with the Liger kernel.
… entropy_loss, entropy_coef) are logged, and the adaptive coefficient is updated based on the current entropy.
Configuration and Checkpointing
… GRPOConfig for entropy regularization, including entropy_coef, use_adaptive_entropy, entropy_coef_min, entropy_coef_max, entropy_coef_delta, and entropy_target, with detailed documentation.
Documentation and Testing
Note
Medium Risk
Changes core GRPO loss computation and optimizer-step behavior for all users who enable entropy options; default entropy_coef=0 leaves behavior unchanged, but adaptive entropy and multi-loss-type scaling add complexity where training bugs could affect policy updates.
Overview
Adds entropy regularization to GRPO: the objective becomes policy loss minus entropy_coef times mean per-token entropy, with the bonus using the same token mask as policy loss when top_entropy_quantile is set.
GRPOConfig gains entropy_coef, use_adaptive_entropy, entropy_target, entropy_coef_delta, and min/max bounds. Adaptive mode (Skywork-OR1) updates the coefficient once per optimizer step from global entropy vs. target, applies the bonus only when entropy is at or below target, keeps coef fixed across gradient-accumulation micro-batches, and persists state in entropy_ctrl_state.json on checkpoint resume. Entropy regularization is rejected with Liger kernel.
The trainer logs policy_loss and entropy_coef, with loss-type-specific scaling so entropy_coef means the same across grpo, dr_grpo, dapo, luspo, etc. Docs cover usage and metrics; paper_index adds Skywork-OR1; tests cover static/adaptive training, per-loss-type bonus scale, and adaptive behavior under gradient accumulation.
Reviewed by Cursor Bugbot for commit bccd8eb. Bugbot is set up for automated code reviews on this repo.
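The adaptive behavior summarized above can be sketched as a small controller. This is an illustration under stated assumptions, not the PR's implementation: the `EntropyController` class and `total_loss` helper are hypothetical, the field names merely echo the config options, and the update direction (raise the coefficient when entropy is below target, lower it otherwise, then clamp) is an assumed rule that the summary does not spell out.

```python
from dataclasses import dataclass


@dataclass
class EntropyController:
    """Hypothetical controller; fields echo the GRPOConfig option names."""

    entropy_coef: float
    entropy_target: float
    entropy_coef_delta: float
    entropy_coef_min: float
    entropy_coef_max: float

    def active_coef(self, global_entropy: float) -> float:
        # The bonus applies only when entropy is at or below the target.
        return self.entropy_coef if global_entropy <= self.entropy_target else 0.0

    def step(self, global_entropy: float) -> float:
        # Assumed rule: called once per optimizer step, never per micro-batch.
        if global_entropy < self.entropy_target:
            self.entropy_coef += self.entropy_coef_delta
        else:
            self.entropy_coef -= self.entropy_coef_delta
        # Clamp to the configured bounds.
        self.entropy_coef = min(
            max(self.entropy_coef, self.entropy_coef_min), self.entropy_coef_max
        )
        return self.entropy_coef


def total_loss(policy_loss: float, mean_entropy: float, coef: float) -> float:
    # Objective from the summary: policy loss minus coef times mean entropy.
    return policy_loss - coef * mean_entropy


# Invented settings and entropy readings, chosen only to exercise the clamp.
ctrl = EntropyController(
    entropy_coef=0.0,
    entropy_target=1.0,
    entropy_coef_delta=0.25,
    entropy_coef_min=0.0,
    entropy_coef_max=0.5,
)
history = [ctrl.step(entropy) for entropy in [0.5, 0.5, 0.5, 2.0]]
print("coefficient after each optimizer step:", history)
print("loss with bonus:", total_loss(1.0, 0.5, ctrl.active_coef(0.5)))
```

Keeping the update in a single `step` call per optimizer step is what lets the coefficient stay fixed across gradient-accumulation micro-batches.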