Add entropy regularization to GRPO by albertvillanova · Pull Request #6140 · huggingface/trl

albertvillanova · 2026-06-22T11:33:38Z

Add entropy regularization to GRPO.

Fix #3320.
Supersede and close #3628.

This PR adds entropy regularization to the GRPO trainer, with both static and adaptive entropy control. Entropy regularization encourages exploration and helps prevent the policy from collapsing to deterministic outputs. The changes include configuration options, loss computation, metrics logging, checkpointing, and documentation updates, as well as new tests to verify the feature.

Changes

GRPO Trainer: Entropy Regularization

Implementation and Loss Function

Added entropy regularization to the GRPO loss, supporting both static and adaptive entropy coefficients. The entropy bonus is applied to the loss, and the coefficient can be updated each optimizer step if adaptive entropy is enabled.
The entropy coefficient and adaptive entropy settings are initialized from GRPOConfig, and an error is raised if entropy regularization is used with the Liger kernel.
During training, entropy-related metrics (entropy_loss, entropy_coef) are logged, and the adaptive coefficient is updated based on the current entropy.

Configuration and Checkpointing

Added new fields to GRPOConfig for entropy regularization, including entropy_coef, use_adaptive_entropy, entropy_coef_min, entropy_coef_max, entropy_coef_delta, and entropy_target, with detailed documentation.
When using adaptive entropy, the current coefficient is saved and restored with checkpoints to ensure training is resumable.
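
A minimal sketch of what saving and restoring the adaptive coefficient could look like. The file name `entropy_ctrl_state.json` comes from the PR overview; the JSON layout (a single `entropy_coef` key) and the helper names `save_entropy_state` / `load_entropy_state` are assumptions made for illustration, not the trainer's actual code.

```python
import json
import os
import tempfile

STATE_FILE = "entropy_ctrl_state.json"  # name taken from the PR overview


def save_entropy_state(checkpoint_dir, entropy_coef):
    """Write the current adaptive coefficient next to the checkpoint.

    The single-key JSON layout is an assumption for illustration.
    """
    path = os.path.join(checkpoint_dir, STATE_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"entropy_coef": entropy_coef}, f)


def load_entropy_state(checkpoint_dir, default):
    """Return the saved coefficient, or `default` if no state file exists."""
    path = os.path.join(checkpoint_dir, STATE_FILE)
    if not os.path.exists(path):
        return default
    with open(path, encoding="utf-8") as f:
        return json.load(f)["entropy_coef"]


# Usage: round-trip the coefficient through a temporary checkpoint directory.
checkpoint_dir = tempfile.mkdtemp()
save_entropy_state(checkpoint_dir, 0.0075)
restored = load_entropy_state(checkpoint_dir, default=0.0)
```

Falling back to a default when the file is missing keeps checkpoints written without adaptive entropy loadable.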

Documentation and Testing

Updated the documentation to describe entropy regularization, its configuration, and usage examples for both static and adaptive modes. Also documented new reward metrics related to entropy.
Added tests for both static and adaptive entropy regularization to verify training and logging of the new metrics.

Note

Medium Risk
Changes core GRPO loss computation and optimizer-step behavior for all users who enable entropy options; default entropy_coef=0 leaves behavior unchanged, but adaptive entropy and multi-loss-type scaling add complexity where training bugs could affect policy updates.

Overview
Adds entropy regularization to GRPO: the objective becomes policy loss minus entropy_coef times mean per-token entropy, with the bonus using the same token mask as policy loss when top_entropy_quantile is set.
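
The objective described above can be sketched without any framework. This is an illustration only: it uses a plain token-level masked mean, whereas the PR scales the bonus per loss type, and the helper names `masked_mean` and `regularized_loss` are hypothetical.

```python
def masked_mean(values, mask):
    """Mean of `values` over positions where `mask` is 1 (token-level mean)."""
    total = sum(v * m for row_v, row_m in zip(values, mask) for v, m in zip(row_v, row_m))
    count = sum(m for row in mask for m in row)
    return total / max(count, 1)


def regularized_loss(policy_loss, entropies, mask, entropy_coef):
    """Policy loss minus entropy_coef times the mean per-token entropy."""
    return policy_loss - entropy_coef * masked_mean(entropies, mask)


# Two completions, the second padded to length 2 (made-up numbers).
entropies = [[1.0, 3.0], [2.0, 0.0]]
mask = [[1, 1], [1, 0]]
loss = regularized_loss(policy_loss=0.5, entropies=entropies, mask=mask, entropy_coef=0.01)
```

With `entropy_coef=0.0` the function returns the policy loss unchanged, which matches the stated default behavior.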

GRPOConfig gains entropy_coef, use_adaptive_entropy, entropy_target, entropy_coef_delta, and min/max bounds. Adaptive mode (Skywork-OR1) updates the coefficient once per optimizer step from global entropy vs. target, applies the bonus only when entropy is at or below target, keeps coef fixed across gradient-accumulation micro-batches, and persists state in entropy_ctrl_state.json on checkpoint resume. Entropy regularization is rejected with Liger kernel.
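
A hedged sketch of the adaptive controller described above. The field names mirror the config options listed in the PR; the update direction (raise the coefficient by `delta` when entropy is below target, lower it when above, then clamp) is an assumption based on the adaptive entropy control described for Skywork-OR1, and the class itself is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEntropyController:
    coef: float
    target: float
    delta: float
    coef_min: float
    coef_max: float

    def effective_coef(self, entropy):
        """Coefficient actually applied: the bonus is active only at or below target."""
        return self.coef if entropy <= self.target else 0.0

    def step(self, entropy):
        """Update once per optimizer step (not per micro-batch).

        Assumed rule: +delta below target, -delta above target, then clamp.
        """
        if entropy < self.target:
            self.coef += self.delta
        elif entropy > self.target:
            self.coef -= self.delta
        self.coef = min(max(self.coef, self.coef_min), self.coef_max)
        return self.coef


# Made-up settings: entropy stays low for three steps, then recovers.
ctrl = AdaptiveEntropyController(coef=0.0, target=0.2, delta=0.005, coef_min=0.0, coef_max=0.01)
history = [ctrl.step(e) for e in (0.1, 0.1, 0.1, 0.5)]
```

Calling `step` only at optimizer-step boundaries is what keeps the coefficient fixed across gradient-accumulation micro-batches.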

The trainer logs policy_loss and entropy_coef, with loss-type-specific scaling so entropy_coef means the same across grpo, dr_grpo, dapo, luspo, etc. Docs cover usage and metrics; paper_index adds Skywork-OR1; tests cover static/adaptive training, per-loss-type bonus scale, and adaptive behavior under gradient accumulation.

Reviewed by Cursor Bugbot for commit bccd8eb. Bugbot is set up for automated code reviews on this repo.

…_coef

qgallouedec

thanks, a few remarks

qgallouedec · 2026-06-22T17:37:20Z

            else:
                return (x * mask).sum() / completion_token_count

+        self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())


policy_loss is captured after loss = loss / normalizer, where normalizer = current_gradient_accumulation_steps, so the logged policy_loss is the per-micro-batch contribution, not the step loss; it'll read ~accum× too small. I think we should capture before dividing.

another remark, it can be misleading to have policy_loss logged when it's not used in the loss. Maybe we should gate its logging.

and just to match the house style:

self._metrics[mode]["policy_loss"].append(self.accelerator.gather(policy_loss).nanmean().item())
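
To make the scale issue concrete, here is a small arithmetic sketch with made-up numbers. It assumes the logged metric is averaged over the micro-batches of one optimizer step; `accum_steps` and the loss values are hypothetical.

```python
accum_steps = 4  # hypothetical gradient accumulation setting
micro_batch_policy_losses = [0.8, 1.2, 1.0, 1.0]  # made-up per-micro-batch losses


def mean(values):
    return sum(values) / len(values)


# Captured after `loss = loss / normalizer`: each value is already divided.
logged_after = mean([x / accum_steps for x in micro_batch_policy_losses])

# Captured before the division: the average matches the step-level loss.
logged_before = mean(micro_batch_policy_losses)

# Ratio between the two logged values equals the accumulation factor.
ratio = logged_before / logged_after
```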

qgallouedec · 2026-06-22T17:42:06Z

+        # drops below entropy_target again.
+        if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
+            if self.loss_type in ["grpo", "sapo", "luspo"]:
+                entropy_loss = ((entropies * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)).mean() / normalizer


The entropy block lumps luspo with grpo/sapo, but luspo's actual loss is (per_token_loss * mask.sum(1,keepdim=True)).mean(). So for luspo the entropy bonus lives on a different scale than the policy term and entropy_coef means something different there. Either give luspo its own branch or note it. To be confirmed, but probably something like:

entropy_loss = (entropies * mask).sum(-1).mean() / normalizer
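
The scale difference between the two reductions can be seen on a toy batch. This is a framework-free sketch with made-up values and `normalizer` taken as 1; the function names are hypothetical, and the second reduction mirrors the suggestion above, which the reviewer marks as still to be confirmed.

```python
def seq_mean_then_batch_mean(entropies, mask):
    """Per-sequence masked mean, then mean over the batch (grpo/sapo-style branch)."""
    per_seq = []
    for row, m in zip(entropies, mask):
        n_tokens = max(sum(m), 1)
        per_seq.append(sum(e * k for e, k in zip(row, m)) / n_tokens)
    return sum(per_seq) / len(per_seq)


def seq_sum_then_batch_mean(entropies, mask):
    """Per-sequence masked sum, then mean over the batch (suggested luspo-style branch)."""
    per_seq = [sum(e * k for e, k in zip(row, m)) for row, m in zip(entropies, mask)]
    return sum(per_seq) / len(per_seq)


# Two completions: 4 tokens and 2 tokens (padded to length 4).
entropies = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

mean_scale = seq_mean_then_batch_mean(entropies, mask)
sum_scale = seq_sum_then_batch_mean(entropies, mask)
```

The sum-based reduction grows with sequence length, so the same `entropy_coef` weights the bonus differently depending on which reduction is paired with the policy term.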

qgallouedec · 2026-06-22T17:44:59Z


+        self._metrics[mode]["policy_loss"].append(self.accelerator.reduce(policy_loss, reduction="mean").item())
+        if self.entropy_coef != 0.0 or self.use_adaptive_entropy:
+            self._metrics[mode]["entropy_loss"].append(world_entropy)


We would have two near-duplicate entropy metrics. We could drop it, and just log entropy. The existing entropy already logs mean entropy; entropy_loss is just a slightly different (global vs gathered-local-mean) computation of the same quantity.
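
For reference, a sketch of how a global token-weighted mean and a mean of per-process means can differ. The per-rank numbers are made up and the function names are hypothetical; which of the two the trainer uses for each metric is not shown here.

```python
def mean_of_local_means(per_rank):
    """Each rank averages its own tokens first; the ranks are then averaged."""
    return sum(total / count for total, count in per_rank) / len(per_rank)


def global_token_mean(per_rank):
    """Sum entropy and token counts over all ranks, then divide once."""
    return sum(total for total, _ in per_rank) / sum(count for _, count in per_rank)


# (entropy_sum, token_count) per rank; the ranks hold different token counts.
per_rank = [(6.0, 3), (1.0, 1)]

local_mean = mean_of_local_means(per_rank)
global_mean = global_token_mean(per_rank)
```

The two values coincide when every rank holds the same number of unmasked tokens, which is why the metrics are near-duplicates in practice.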

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6e8f498.

…pe normalizes

albertvillanova added 6 commits June 22, 2026 13:20

Add fields to GRPOConfig

ac50a11

Add init fields to GRPOTrainer

dcaaf67

Update _compute_loss

0f6306e

Add checkpoint persistence

9b1cc65

Update GRPO docs

e944713

Add tests

f47d5a5