Add entropy regularization to GRPO#6140

Open
albertvillanova wants to merge 41 commits into main from worktree-fix-3320
@albertvillanova albertvillanova commented Jun 22, 2026

Member

Add entropy regularization to GRPO.

Fix #3320.
Supersede and close #3628.

This PR adds support for entropy regularization to the GRPO trainer, including both static and adaptive entropy control. Entropy regularization helps encourage exploration and prevents the policy from collapsing to deterministic outputs. The changes include configuration options, loss computation, metrics logging, checkpointing, and documentation updates, as well as new tests to verify the feature.

Changes

GRPO Trainer: Entropy Regularization

Implementation and Loss Function

  • Added entropy regularization to the GRPO loss, supporting both static and adaptive entropy coefficients. The entropy bonus is applied to the loss, and the coefficient can be updated each optimizer step if adaptive entropy is enabled.
  • The entropy coefficient and adaptive entropy settings are initialized from GRPOConfig, and an error is raised if entropy regularization is used with the Liger kernel.
  • During training, entropy-related metrics (entropy_loss, entropy_coef) are logged, and the adaptive coefficient is updated based on the current entropy.

Configuration and Checkpointing

  • Added new fields to GRPOConfig for entropy regularization, including entropy_coef, use_adaptive_entropy, entropy_coef_min, entropy_coef_max, entropy_coef_delta, and entropy_target, with detailed documentation.
  • When using adaptive entropy, the current coefficient is saved and restored with checkpoints to ensure training is resumable.
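
A minimal sketch of how such a save/restore round trip could look. The file name `entropy_ctrl_state.json` comes from the PR overview. The helper names and the `entropy_coef` JSON key are hypothetical, since the thread does not show the file layout.

```python
import json
import tempfile
from pathlib import Path

# File name as given in the PR overview.
STATE_FILE = "entropy_ctrl_state.json"


def save_entropy_state(checkpoint_dir, entropy_coef):
    """Write the current adaptive coefficient next to the checkpoint."""
    # Assumption: a single "entropy_coef" key; the real layout is not shown in the PR.
    path = Path(checkpoint_dir) / STATE_FILE
    path.write_text(json.dumps({"entropy_coef": entropy_coef}))


def load_entropy_state(checkpoint_dir, default):
    """Return the saved coefficient, or `default` if the checkpoint has none."""
    path = Path(checkpoint_dir) / STATE_FILE
    if not path.exists():
        return default
    return float(json.loads(path.read_text())["entropy_coef"])


with tempfile.TemporaryDirectory() as ckpt:
    # Before anything is saved, resuming falls back to the configured value.
    fresh = load_entropy_state(ckpt, default=0.0)
    save_entropy_state(ckpt, 0.0035)
    restored = load_entropy_state(ckpt, default=0.0)
```

Falling back to a default when the file is missing keeps checkpoints written without adaptive entropy loadable.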

Documentation and Testing

  • Updated the documentation to describe entropy regularization, its configuration, and usage examples for both static and adaptive modes. Also documented new reward metrics related to entropy.
  • Added tests for both static and adaptive entropy regularization to verify training and logging of the new metrics.

Note

Medium Risk
Changes core GRPO loss computation and optimizer-step behavior for all users who enable entropy options; default entropy_coef=0 leaves behavior unchanged, but adaptive entropy and multi-loss-type scaling add complexity where training bugs could affect policy updates.

Overview
Adds entropy regularization to GRPO: the objective becomes policy loss minus entropy_coef times mean per-token entropy, with the bonus using the same token mask as policy loss when top_entropy_quantile is set.
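
As a rough illustration of that objective, the sketch below computes a masked mean of per-token entropies and subtracts it, scaled by the coefficient, from a policy loss. It uses plain Python lists rather than tensors, the function names are hypothetical, and it ignores the loss-type-specific scaling the PR describes.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def masked_mean(values, mask):
    """Mean of `values` over the positions where `mask` is 1."""
    kept = max(sum(mask), 1)  # avoid dividing by zero on an empty mask
    return sum(v * m for v, m in zip(values, mask)) / kept


def regularized_loss(policy_loss, entropies, mask, entropy_coef):
    """Policy loss minus entropy_coef times the masked mean entropy."""
    return policy_loss - entropy_coef * masked_mean(entropies, mask)


# Three token positions; the last one is masked out (for example, padding).
entropies = [
    token_entropy([0.5, 0.5]),    # uncertain token
    token_entropy([1.0, 0.0]),    # deterministic token, zero entropy
    token_entropy([0.25] * 4),    # masked, does not contribute
]
mask = [1, 1, 0]
loss = regularized_loss(0.2, entropies, mask, entropy_coef=0.01)
```

With `entropy_coef=0` the function returns the policy loss unchanged, which matches the stated default behavior.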

GRPOConfig gains entropy_coef, use_adaptive_entropy, entropy_target, entropy_coef_delta, and min/max bounds. Adaptive mode (Skywork-OR1) updates the coefficient once per optimizer step from global entropy vs. target, applies the bonus only when entropy is at or below target, keeps coef fixed across gradient-accumulation micro-batches, and persists state in entropy_ctrl_state.json on checkpoint resume. Entropy regularization is rejected with Liger kernel.
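
The adaptive behavior described above could be sketched as a small controller. The class and method names are hypothetical. The update direction (raise the coefficient by `entropy_coef_delta` when entropy is below target, lower it otherwise, then clamp to the min/max bounds) is an assumption based on this description and is not confirmed by code shown in the thread.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEntropyController:
    """Hypothetical stand-in for the adaptive entropy state kept by the trainer."""

    coef: float
    target: float
    delta: float
    coef_min: float
    coef_max: float

    def effective_coef(self, entropy):
        # The bonus is applied only when entropy is at or below the target.
        return self.coef if entropy <= self.target else 0.0

    def step(self, global_entropy):
        # Assumed rule: push the coefficient up when entropy is too low,
        # down otherwise, and keep it inside [coef_min, coef_max].
        if global_entropy < self.target:
            self.coef += self.delta
        else:
            self.coef -= self.delta
        self.coef = min(max(self.coef, self.coef_min), self.coef_max)
        return self.coef


ctrl = AdaptiveEntropyController(
    coef=0.0, target=0.2, delta=0.001, coef_min=0.0, coef_max=0.005
)

# Call step() once per optimizer step, not once per micro-batch, so the
# coefficient stays fixed across gradient-accumulation micro-batches.
history = [ctrl.step(entropy) for entropy in [0.1, 0.1, 0.3, 0.1]]
```

Calling `step` once per optimizer step, with an entropy value aggregated across processes, is what keeps every micro-batch of a step on the same coefficient.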

The trainer logs policy_loss and entropy_coef, with loss-type-specific scaling so entropy_coef means the same across grpo, dr_grpo, dapo, luspo, etc. Docs cover usage and metrics; paper_index adds Skywork-OR1; tests cover static/adaptive training, per-loss-type bonus scale, and adaptive behavior under gradient accumulation.

Reviewed by Cursor Bugbot for commit bccd8eb.

albertvillanova changed the title from "Add Adaptive Entropy Control to GRPO" to "Add entropy regularization to GRPO" on Jun 22, 2026

@qgallouedec qgallouedec left a comment

Member

thanks, a few remarks

Comment thread trl/trainer/grpo_trainer.py Outdated
else:
    return (x * mask).sum() / completion_token_count

self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())

policy_loss is captured after loss = loss / normalizer, where normalizer = current_gradient_accumulation_steps, so the logged policy_loss is the per-micro-batch contribution, not the step loss; it'll read ~accum× too small. I think we should capture before dividing.
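
To make the scale issue concrete, here is a small sketch with plain floats standing in for per-micro-batch losses. The function name and the numbers are illustrative only and are not taken from the trainer.

```python
def simulate_accumulation(micro_batch_losses):
    """Compare logging the loss before vs. after dividing by the accumulation steps."""
    accum_steps = len(micro_batch_losses)
    logged_before, logged_after = [], []
    step_loss = 0.0
    for raw in micro_batch_losses:
        logged_before.append(raw)      # captured before the division
        scaled = raw / accum_steps     # what backward() would see
        logged_after.append(scaled)    # captured after the division
        step_loss += scaled            # contributions sum to the step loss

    def mean(xs):
        return sum(xs) / len(xs)

    return step_loss, mean(logged_before), mean(logged_after)


step_loss, before, after = simulate_accumulation([0.8, 0.4, 0.6, 0.2])
# `before` matches the step loss; `after` is smaller by a factor of accum_steps (4 here).
```

Averaging the values captured before the division reproduces the step loss, while averaging the scaled values under-reports it by the number of accumulation steps.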

another remark, it can be misleading to have policy_loss logged when it's not used in the loss. Maybe we should gate its logging.

and just to match the house style:

self._metrics[mode]["policy_loss"].append(self.accelerator.gather(policy_loss).nanmean().item())

Comment thread trl/trainer/grpo_trainer.py Outdated
# drops below entropy_target again.
if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
    if self.loss_type in ["grpo", "sapo", "luspo"]:
        entropy_loss = ((entropies * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean() / normalizer

The entropy block lumps luspo with grpo/sapo, but luspo's actual loss is (per_token_loss * mask.sum(1,keepdim=True)).mean(). So for luspo the entropy bonus lives on a different scale than the policy term and entropy_coef means something different there. Either give luspo its own branch or note it. To be confirmed, but probably something like:

entropy_loss = (entropies * mask).sum(-1).mean() / normalizer
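
A quick way to see the scale gap between the two reductions, sketched with nested Python lists in place of tensors. The function names are hypothetical and the division by `normalizer` is left out, since it affects both variants equally.

```python
def per_sequence_mean_entropy(entropies, mask):
    """Mean over sequences of each sequence's masked MEAN entropy (grpo/sapo-style)."""
    per_seq = []
    for row, m in zip(entropies, mask):
        kept = max(sum(m), 1)  # mirrors clamp(min=1.0)
        per_seq.append(sum(e * k for e, k in zip(row, m)) / kept)
    return sum(per_seq) / len(per_seq)


def per_sequence_sum_entropy(entropies, mask):
    """Mean over sequences of each sequence's masked SUM of entropies (proposed for luspo)."""
    per_seq = [sum(e * k for e, k in zip(row, m)) for row, m in zip(entropies, mask)]
    return sum(per_seq) / len(per_seq)


# Two completions: one with 4 valid tokens, one with 2 valid tokens and 2 padded.
entropies = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

mean_style = per_sequence_mean_entropy(entropies, mask)  # 0.75
sum_style = per_sequence_sum_entropy(entropies, mask)    # 2.0
```

The sum-style value grows with completion length, so the same `entropy_coef` yields a much larger bonus under one reduction than under the other.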

Comment thread trl/trainer/grpo_trainer.py Outdated

self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())
if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
    self._metrics[mode]["entropy_loss"].append(world_entropy)

We would have two near-duplicate entropy metrics. We could drop it, and just log entropy. The existing entropy already logs mean entropy; entropy_loss is just a slightly different (global vs gathered-local-mean) computation of the same quantity.
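
The "global vs gathered-local-mean" distinction can be illustrated with per-rank lists of token entropies. This is a toy sketch with hypothetical names and made-up values, not the trainer's code.

```python
def mean_of_local_means(per_rank_entropies):
    """Each rank averages its own tokens, then the per-rank means are averaged."""
    local_means = [sum(xs) / len(xs) for xs in per_rank_entropies]
    return sum(local_means) / len(local_means)


def global_token_mean(per_rank_entropies):
    """All tokens from all ranks are pooled and averaged once."""
    total = sum(sum(xs) for xs in per_rank_entropies)
    count = sum(len(xs) for xs in per_rank_entropies)
    return total / count


# Rank 0 holds four completion tokens, rank 1 holds one.
uneven = [[1.0, 1.0, 1.0, 1.0], [0.0]]
local_avg = mean_of_local_means(uneven)   # 0.5
global_avg = global_token_mean(uneven)    # 0.8

# With equal token counts per rank the two computations agree.
even = [[1.0, 0.0], [0.5, 0.5]]
```

The two numbers diverge only when ranks hold different token counts, which is why the metrics track each other so closely in practice.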

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6e8f498.

Development

Successfully merging this pull request may close these issues.

Add Adaptive Entropy Control to GRPOTrainer

2 participants