Open
42 commits
ac50a11
Add fields to GRPOConfig
albertvillanova Jun 22, 2026
dcaaf67
Add init fields to GRPOTrainer
albertvillanova Jun 22, 2026
0f6306e
Update _compute_loss
albertvillanova Jun 22, 2026
9b1cc65
Add checkpoint persistence
albertvillanova Jun 22, 2026
e944713
Update GRPO docs
albertvillanova Jun 22, 2026
f47d5a5
Add tests
albertvillanova Jun 22, 2026
2484e70
Address issues from review
albertvillanova Jun 22, 2026
4507747
Fix wrong entropy for adaptive control
albertvillanova Jun 22, 2026
9b70a4a
Fix Liger skips adaptive entropy guard
albertvillanova Jun 22, 2026
9d79e4a
Fix inconsistent inequality
albertvillanova Jun 22, 2026
46c8a64
Fix mean reduction with sum-count-divide
albertvillanova Jun 22, 2026
3f7a669
Set _last_world_entropy at init
albertvillanova Jun 22, 2026
a05c979
Cache world_entropy at sync point and use that cached value for apply…
albertvillanova Jun 22, 2026
fe03dd1
Persist also _last_world_entropy
albertvillanova Jun 22, 2026
f099349
Add paper_index entry
albertvillanova Jun 22, 2026
5288cd5
Capture the pure policy loss before normalization
albertvillanova Jun 24, 2026
03f4208
Fix luspo loss
albertvillanova Jun 24, 2026
dbc0c75
Gate policy_loss logging and align style
albertvillanova Jun 24, 2026
391da7a
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 24, 2026
506fbf9
Fix entropy state written to wrong path
albertvillanova Jun 24, 2026
8a6b53d
Fix is_world_process_zero() vs args.should_save guard mismatch
albertvillanova Jun 24, 2026
474b30c
Update docs: policy_loss only logged inside entropy block
albertvillanova Jun 24, 2026
a0b9ec6
Log entropy_coef only when sync_gradients=True
albertvillanova Jun 24, 2026
608b1e0
Add guard for entropy-loss dispatch matching policy-loss dispatch
albertvillanova Jun 24, 2026
81841ad
Remove entropy_loss
albertvillanova Jun 24, 2026
bee5126
Gate on train mode to avoid entropy state update during eval
albertvillanova Jun 24, 2026
5c442a0
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 24, 2026
2f34d15
Fix entropy bonus ignores quantile mask
albertvillanova Jun 24, 2026
806078d
Use effective_mask for the world_entropy all-reduce too
albertvillanova Jun 24, 2026
2845ef4
Update docs
albertvillanova Jun 24, 2026
2ed11c0
Use unified formula with mean per-token entropy of active tokens
albertvillanova Jun 24, 2026
7f0562b
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 25, 2026
76255d3
Make three-branch entropy-loss split
albertvillanova Jun 25, 2026
fc76d4b
Compute bonus from frozen state, update per optimizer step
albertvillanova Jun 25, 2026
bed5188
Fix "nearly always triggers" docs
albertvillanova Jun 25, 2026
6e8f498
Add scale test and grad-accumulation adaptive test
albertvillanova Jun 25, 2026
607d911
Fix dr_grpo entropy scale mismatch
albertvillanova Jun 25, 2026
0cfad37
Accumulate to mean per-token entropy, independent of how each loss ty…
albertvillanova Jun 25, 2026
8e05132
Update tests
albertvillanova Jun 25, 2026
f15e04a
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 26, 2026
bccd8eb
Add clarifying sentence
albertvillanova Jun 26, 2026
0f3e145
Merge branch 'main' into worktree-fix-3320
qgallouedec Jun 29, 2026
42 changes: 42 additions & 0 deletions docs/source/grpo_trainer.md
@@ -185,7 +185,9 @@ While training and evaluating, we record the following reward metrics:
- `reward`: The overall average reward after summing rewards across functions (weighted by `reward_weights`).
- `reward_std`: The standard deviation of summed rewards across functions (weighted by `reward_weights`), computed over the full batch.
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `policy_loss`: The policy gradient loss value (before any entropy bonus). Logged when `entropy_coef` is nonzero or `use_adaptive_entropy=True`.
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
- `entropy_coef`: The current entropy regularization coefficient. Logged when `entropy_coef` is nonzero or `use_adaptive_entropy=True`. Updated once per optimizer step when `use_adaptive_entropy=True`.
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region: \\( \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \\). A higher value means more tokens are clipped, which constrains how much the policy \\( \pi_\theta \\) can change.
- `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
@@ -641,6 +643,46 @@ and the reward will be computed as the sum of the rewards from each function, or

Note that [`GRPOTrainer`] supports multiple reward functions of different types. See the parameters documentation for more details.

### Entropy regularization

To encourage exploration and prevent the policy from collapsing to near-deterministic outputs, you can add an entropy bonus to the training objective. The entropy regularization augments the GRPO loss as follows:

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{GRPO}}(\theta) - \alpha \cdot \mathcal{H}(\pi_\theta),
$$

where \\(\mathcal{H}(\pi_\theta)\\) is the mean per-token entropy of the policy and \\(\alpha\\) is the entropy coefficient. The bonus is always the mean per-token entropy regardless of `loss_type`; it is not rescaled to match a loss type's policy normalization (e.g. Dr. GRPO's `batch_size * max_completion_length` denominator), so `entropy_coef` has the same meaning for every loss type.
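
As a rough illustration of the objective above, the following sketch computes a mean per-token entropy over active (unmasked) tokens and subtracts the weighted bonus from a policy loss. It is plain Python, not the trainer's implementation; the helper names (`token_entropy`, `mean_per_token_entropy`, `regularized_loss`) and the toy numbers are made up for this example.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_per_token_entropy(token_probs, mask):
    """Average entropy over the tokens whose mask entry is truthy."""
    active = [token_entropy(p) for p, m in zip(token_probs, mask) if m]
    return sum(active) / len(active) if active else 0.0


def regularized_loss(policy_loss, entropy, entropy_coef):
    """L = L_policy - entropy_coef * H, with H the mean per-token entropy."""
    return policy_loss - entropy_coef * entropy


# Toy batch: three token positions, the last one masked out.
token_probs = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
mask = [1, 1, 0]

entropy = mean_per_token_entropy(token_probs, mask)  # average of ln(2) and 0
loss = regularized_loss(policy_loss=0.3, entropy=entropy, entropy_coef=0.05)
```

Because the bonus enters with a negative sign, a higher entropy lowers the total loss, which is what pushes the policy away from near-deterministic outputs.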

**Static entropy** — a fixed coefficient throughout training:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(entropy_coef=0.05, ...)
```

**Adaptive entropy** — the coefficient is updated each optimizer step based on a target entropy, as introduced in [Skywork-OR1](https://huggingface.co/papers/2505.22312). When the current entropy falls at or below `entropy_target`, the coefficient is incremented by `entropy_coef_delta`; otherwise it is decremented. The coefficient is only applied (i.e. non-zero) while entropy is at or below the target:

```python
training_args = GRPOConfig(
    entropy_coef=0.01,  # initial coefficient
    use_adaptive_entropy=True,
    entropy_target=5.0,  # target mean per-token entropy (nats); tune for your model
    entropy_coef_delta=0.005,  # step size per optimizer step
    entropy_coef_min=0.0,
    entropy_coef_max=1.0,
    ...
)
```
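
The update rule described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the code in [`GRPOTrainer`]: the function names are hypothetical, and the ordering within a step is assumed here (the coefficient applied at a step comes from the state before that step's update).

```python
def update_entropy_coef(coef, entropy, target, delta, coef_min=0.0, coef_max=1.0):
    """Hypothetical controller: adjust the coefficient once per optimizer step."""
    coef = coef + delta if entropy <= target else coef - delta
    # Keep the coefficient inside [coef_min, coef_max].
    return min(max(coef, coef_min), coef_max)


def applied_coef(coef, entropy, target):
    """Hypothetical gate: the bonus is active only at or below the target."""
    return coef if entropy <= target else 0.0


coef = 0.01
applied = []
for entropy in [4.0, 4.5, 5.5, 6.0, 4.8]:  # made-up entropy readings, in nats
    applied.append(applied_coef(coef, entropy, target=5.0))
    coef = update_entropy_coef(coef, entropy, target=5.0, delta=0.005)
```

In this toy run the coefficient rises for the first two steps, falls while entropy is above the target (during which the applied coefficient is zero), and rises again once entropy drops back below 5.0.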

<Tip>

Typical language models have per-token entropies of 2–10 nats, so the default `entropy_target=0.2` almost never triggers regularization — the bonus only engages once entropy is at or below the target, i.e. near-complete collapse. Set it to a value meaningful for your model, e.g. close to the entropy you observe early in training (logged as the `entropy` metric). When using `top_entropy_quantile < 1.0`, `entropy_target` applies to the high-entropy token subset — that subset's entropy will be higher than the logged full-token `entropy`, so calibrate accordingly.

</Tip>
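
One possible way to pick a starting value for `entropy_target` is to average the `entropy` metric over the first few logged steps of a short probe run. The helper below is a hypothetical sketch, not part of TRL; it only assumes a list of log dictionaries shaped like `trainer.state.log_history`, and the sample values are invented.

```python
def suggest_entropy_target(log_history, num_steps=10):
    """Average the first `num_steps` logged `entropy` values (hypothetical helper)."""
    values = [entry["entropy"] for entry in log_history if "entropy" in entry]
    values = values[:num_steps]
    if not values:
        raise ValueError("no 'entropy' entries found in the log history")
    return sum(values) / len(values)


# Invented log entries; in practice, pass trainer.state.log_history instead.
log_history = [
    {"entropy": 3.2, "loss": 0.10},
    {"entropy": 3.0, "loss": 0.09},
    {"train_loss": 0.05},  # entries without an entropy value are skipped
    {"entropy": 2.8, "loss": 0.08},
]
target = suggest_entropy_target(log_history, num_steps=2)
```

The result is only a starting point: with `top_entropy_quantile < 1.0` the controller sees the entropy of the high-entropy subset, which is higher than the logged value, so the target would need to be raised accordingly.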

When `use_adaptive_entropy=True`, the current entropy coefficient `entropy_coef` is saved alongside each checkpoint and restored on resume, so training is fully resumable.

### Rapid Experimentation for GRPO

RapidFire AI is an open-source experimentation engine that sits on top of TRL and lets you launch multiple GRPO configurations at once, even on a single GPU. Instead of trying configurations sequentially, RapidFire lets you **see all their learning curves earlier, stop underperforming runs, and clone promising ones with new settings in flight** without restarting. For more information, see [RapidFire AI Integration](rapidfire_integration).
21 changes: 21 additions & 0 deletions docs/source/paper_index.md
@@ -232,6 +232,27 @@ training_args = GRPOConfig(
)
```

### Skywork-OR1: Open Reasoning Models

**📜 Paper**: https://huggingface.co/papers/2505.22312

Skywork-OR1 is a family of open reasoning models trained with GRPO. The paper introduces **adaptive entropy control**: an entropy regularization term `−α·H(π_θ)` is added to the GRPO objective, and the coefficient `α` is automatically adjusted each optimizer step. When the model's mean per-token entropy falls at or below a target, `α` is incremented to encourage more exploration; otherwise it is decremented. The bonus is only applied while entropy is at or below the target. To replicate this adaptive entropy control, use the following configuration:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_adaptive_entropy=True,  # enable adaptive entropy control (Section 3.3 of the paper)
    entropy_coef=0.01,  # initial entropy regularization coefficient
    entropy_target=5.0,  # target mean per-token entropy (nats); tune for your model
    entropy_coef_delta=0.005,  # step size for coefficient updates per optimizer step
)
trainer = GRPOTrainer(
    ...,
    args=training_args,
)
```

### Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

**📜 Paper**: https://huggingface.co/papers/2506.01939
148 changes: 148 additions & 0 deletions tests/test_grpo_trainer.py
@@ -1474,6 +1474,154 @@ def test_train_with_cast_lm_head_to_fp32(self, model_name):
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    def test_train_with_static_entropy(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=8,  # reduce the completion length to reduce memory usage
            report_to="none",
            entropy_coef=0.1,
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

        trainer.train()

        assert trainer.state.log_history[-1]["train_loss"] is not None
        assert trainer.state.log_history[-1]["policy_loss"] is not None
        assert trainer.state.log_history[-1]["entropy_coef"] is not None

        # Check that the params have changed
        for n, param in previous_trainable_params.items():
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    def test_train_with_adaptive_entropy(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=8,  # reduce the completion length to reduce memory usage
            report_to="none",
            entropy_coef=0.01,
            use_adaptive_entropy=True,
            entropy_target=15.0,  # above any realistic entropy → coef is always incremented
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

        trainer.train()

        assert trainer.state.log_history[-1]["train_loss"] is not None
        assert trainer.state.log_history[-1]["policy_loss"] is not None
        assert trainer.state.log_history[-1]["entropy_coef"] is not None
        # Coefficient should have increased since entropy < target throughout training
        assert trainer.entropy_coef > 0.01

        # Check that the params have changed
        for n, param in previous_trainable_params.items():
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    @pytest.mark.parametrize("loss_type", ["grpo", "dr_grpo", "dapo", "luspo"])
    def test_entropy_bonus_scale(self, loss_type):
        # Regression test: the entropy bonus is the mean per-token entropy H for every loss type (documented
        # objective L = L_policy - entropy_coef * H), so it must not inherit any loss-type-specific policy
        # normalization. A previous "unified" formula divided H by a global token count for the
        # cispo/dapo/vespo family, making the bonus ~1/sequence_length too small; conversely, scaling the
        # bonus like the dr_grpo (fixed budget) or luspo (sequence-weighted) policy term would also be wrong.
        # With gradient_accumulation_steps=1 the per-step entropy contribution to the loss is
        # contrib = policy_loss - loss = entropy_coef * entropy_loss, so contrib / entropy must equal
        # entropy_coef for all loss types.
        entropy_coef = 0.5
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            importance_sampling_level="sequence" if loss_type == "luspo" else "token",
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=16,  # reduce the completion length to reduce memory usage
            gradient_accumulation_steps=1,  # so contrib == entropy_coef * entropy_loss holds per step
            loss_type=loss_type,
            logging_steps=1,
            report_to="none",
            entropy_coef=entropy_coef,
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        trainer.train()

        logs = [h for h in trainer.state.log_history if "policy_loss" in h and "loss" in h and h.get("entropy")]
        assert logs
        ratios = sorted((h["policy_loss"] - h["loss"]) / h["entropy"] for h in logs)
        ratio = ratios[len(ratios) // 2]  # median, robust to per-step noise
        # Every loss type regularizes the mean per-token entropy, so contrib == entropy_coef * entropy.
        assert ratio == pytest.approx(entropy_coef, rel=0.3)

    def test_train_with_adaptive_entropy_gradient_accumulation(self):
        # Adaptive entropy must behave correctly under gradient accumulation: the coefficient and gating are
        # frozen across an accumulation window and the controller updates once per optimizer step (not once
        # per micro-batch). With entropy_target above any realistic entropy the coefficient is incremented by
        # entropy_coef_delta on every optimizer step, so the final value pins down the number of updates.
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=8,  # reduce the completion length to reduce memory usage
            gradient_accumulation_steps=2,  # exercise the accumulation window
            report_to="none",
            entropy_coef=0.01,
            use_adaptive_entropy=True,
            entropy_target=15.0,  # above any realistic entropy → coef incremented once per optimizer step
            entropy_coef_delta=0.005,
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

        trainer.train()

        assert trainer.state.log_history[-1]["train_loss"] is not None
        # Exactly one increment per optimizer step (global_step counts optimizer steps, not micro-batches);
        # a per-micro-batch update would overshoot this.
        expected_coef = min(0.01 + 0.005 * trainer.state.global_step, 1.0)
        assert trainer.entropy_coef == pytest.approx(expected_coef, abs=1e-6)

        # Check that the params have changed
        for n, param in previous_trainable_params.items():
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    def test_train_with_entropy_filter(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(