Changes from 27 commits
41 commits
ac50a11
Add fields to GRPOConfig
albertvillanova Jun 22, 2026
dcaaf67
Add init fields to GRPOTrainer
albertvillanova Jun 22, 2026
0f6306e
Update _compute_loss
albertvillanova Jun 22, 2026
9b1cc65
Add checkpoint persistence
albertvillanova Jun 22, 2026
e944713
Update GRPO docs
albertvillanova Jun 22, 2026
f47d5a5
Add tests
albertvillanova Jun 22, 2026
2484e70
Address issues from review
albertvillanova Jun 22, 2026
4507747
Fix wrong entropy for adaptive control
albertvillanova Jun 22, 2026
9b70a4a
Fix Liger skips adaptive entropy guard
albertvillanova Jun 22, 2026
9d79e4a
Fix inconsistent inequality
albertvillanova Jun 22, 2026
46c8a64
Fix mean reduction with sum-count-divide
albertvillanova Jun 22, 2026
3f7a669
Set _last_world_entropy at init
albertvillanova Jun 22, 2026
a05c979
Cache world_entropy at sync point and use that cached value for apply…
albertvillanova Jun 22, 2026
fe03dd1
Persist also _last_world_entropy
albertvillanova Jun 22, 2026
f099349
Add paper_index entry
albertvillanova Jun 22, 2026
5288cd5
Capture the pure policy loss before normalization
albertvillanova Jun 24, 2026
03f4208
Fix luspo loss
albertvillanova Jun 24, 2026
dbc0c75
Gate policy_loss logging and align style
albertvillanova Jun 24, 2026
391da7a
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 24, 2026
506fbf9
Fix entropy state written to wrong path
albertvillanova Jun 24, 2026
8a6b53d
Fix is_world_process_zero() vs args.should_save guard mismatch
albertvillanova Jun 24, 2026
474b30c
Update docs: policy_loss only logged inside entropy block
albertvillanova Jun 24, 2026
a0b9ec6
Log entropy_coef only when sync_gradients=True
albertvillanova Jun 24, 2026
608b1e0
Add guard for entropy-loss dispatch matching policy-loss dispatch
albertvillanova Jun 24, 2026
81841ad
Remove entropy_loss
albertvillanova Jun 24, 2026
bee5126
Gate on train mode to avoid entropy state update during eval
albertvillanova Jun 24, 2026
5c442a0
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 24, 2026
2f34d15
Fix entropy bonus ignores quantile mask
albertvillanova Jun 24, 2026
806078d
Use effective_mask for the world_entropy all-reduce too
albertvillanova Jun 24, 2026
2845ef4
Update docs
albertvillanova Jun 24, 2026
2ed11c0
Use unified formula with mean per-token entropy of active tokens
albertvillanova Jun 24, 2026
7f0562b
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 25, 2026
76255d3
Make three-branch entropy-loss split
albertvillanova Jun 25, 2026
fc76d4b
Compute bonus from frozen state, update per optimizer step
albertvillanova Jun 25, 2026
bed5188
Fix "nearly always triggers" docs
albertvillanova Jun 25, 2026
6e8f498
Add scale test and grad-accumulation adaptive test
albertvillanova Jun 25, 2026
607d911
Fix dr_grpo entropy scale mismatch
albertvillanova Jun 25, 2026
0cfad37
Accumulate to mean per-token entropy, independent of how each loss ty…
albertvillanova Jun 25, 2026
8e05132
Update tests
albertvillanova Jun 25, 2026
f15e04a
Merge remote-tracking branch 'upstream/main' into worktree-fix-3320
albertvillanova Jun 26, 2026
bccd8eb
Add clarifying sentence
albertvillanova Jun 26, 2026
42 changes: 42 additions & 0 deletions docs/source/grpo_trainer.md
@@ -185,7 +185,9 @@ While training and evaluating, we record the following reward metrics:
- `reward`: The overall average reward after summing rewards across functions (weighted by `reward_weights`).
- `reward_std`: The standard deviation of summed rewards across functions (weighted by `reward_weights`), computed over the full batch.
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `policy_loss`: The policy gradient loss value (before any entropy bonus). Logged when `entropy_coef` is nonzero or `use_adaptive_entropy=True`.
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens from masked sequences are excluded.)
- `entropy_coef`: The current entropy regularization coefficient. Logged when `entropy_coef` is nonzero or `use_adaptive_entropy=True`. Updated once per optimizer step when `use_adaptive_entropy=True`.
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region: \\( \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \\). A higher value means more tokens are clipped, which constrains how much the policy $\pi_\theta$ can change.
- `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
@@ -641,6 +643,46 @@ and the reward will be computed as the sum of the rewards from each function, or

Note that [`GRPOTrainer`] supports multiple reward functions of different types. See the parameters documentation for more details.

### Entropy regularization

To encourage exploration and prevent the policy from collapsing to near-deterministic outputs, you can add an entropy bonus to the training objective. The entropy regularization augments the GRPO loss as follows:

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{GRPO}}(\theta) - \alpha \cdot \mathcal{H}(\pi_\theta),
$$

where \\(\mathcal{H}(\pi_\theta)\\) is the mean per-token entropy of the policy and \\(\alpha\\) is the entropy coefficient.
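
As a minimal sketch of the quantities in this formula (plain Python, not TRL's implementation; `token_entropy`, `mean_entropy`, and `regularized_loss` are hypothetical helper names used only for illustration), the mean per-token entropy can be computed from next-token distributions and subtracted from the policy loss:

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(token_distributions):
    """Mean per-token entropy over a list of next-token distributions."""
    return sum(token_entropy(p) for p in token_distributions) / len(token_distributions)


def regularized_loss(grpo_loss, entropy, alpha):
    """L = L_GRPO - alpha * H: a higher entropy lowers the loss."""
    return grpo_loss - alpha * entropy


# Two tokens over a 4-word toy vocabulary: one uniform, one fully deterministic.
dists = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
h = mean_entropy(dists)  # (ln 4 + 0) / 2 = ln 2
loss = regularized_loss(grpo_loss=1.0, entropy=h, alpha=0.05)
```

Because the entropy term enters with a minus sign, minimizing the loss pushes the policy toward higher entropy whenever \\(\alpha > 0\\).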

**Static entropy** — a fixed coefficient throughout training:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(entropy_coef=0.05, ...)
```

**Adaptive entropy** — the coefficient is updated each optimizer step based on a target entropy, as introduced in [Skywork-OR1](https://huggingface.co/papers/2505.22312). When the current entropy falls at or below `entropy_target`, the coefficient is incremented by `entropy_coef_delta`; otherwise it is decremented. The coefficient is only applied (i.e. non-zero) while entropy is at or below the target:

```python
training_args = GRPOConfig(
    entropy_coef=0.01,  # initial coefficient
    use_adaptive_entropy=True,
    entropy_target=5.0,  # target mean per-token entropy (nats); tune for your model
    entropy_coef_delta=0.005,  # step size per optimizer step
    entropy_coef_min=0.0,
    entropy_coef_max=1.0,
    ...
)
```
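
The update rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the trainer's actual code: `update_entropy_coef` and `effective_coef` are hypothetical names, the decrement is assumed to use the same `entropy_coef_delta` as the increment, and the result is assumed to be clamped to `[entropy_coef_min, entropy_coef_max]`. The exact ordering of the update and the application within a step is not covered here.

```python
def update_entropy_coef(coef, entropy, target, delta, coef_min=0.0, coef_max=1.0):
    """One adaptive step: raise coef when entropy <= target, lower it otherwise, then clamp."""
    coef = coef + delta if entropy <= target else coef - delta
    return min(max(coef, coef_min), coef_max)


def effective_coef(coef, entropy, target):
    """The bonus is only active while entropy is at or below the target."""
    return coef if entropy <= target else 0.0


coef = 0.01
for entropy in [4.0, 4.5, 6.0]:  # hypothetical per-step entropies, with target = 5.0
    coef = update_entropy_coef(coef, entropy, target=5.0, delta=0.005)
# Two increments followed by one decrement: 0.01 -> 0.015 -> 0.02 -> 0.015
```

In this sketch the stored coefficient keeps moving even while the bonus is inactive, so it can shrink back toward `entropy_coef_min` during long stretches above the target.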

<Tip>

Typical language models have per-token entropies of 2–10 nats. Because the bonus is only applied while entropy is at or below the target, the default `entropy_target=0.2` rarely triggers regularization for such models; set it to a value meaningful for your model (e.g. the entropy you observe early in training, logged as the `entropy` metric).

</Tip>
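
One way to pick a starting value for `entropy_target` is to average the `entropy` metric over the first few logged steps of a short run. The helper below is a hypothetical sketch, not part of TRL; it assumes log records shaped like the entries of `trainer.state.log_history`, with the metric stored under the key `entropy`.

```python
def mean_early_entropy(log_history, num_entries=10):
    """Average the `entropy` metric over the first `num_entries` records that contain it."""
    values = [rec["entropy"] for rec in log_history if "entropy" in rec][:num_entries]
    if not values:
        raise ValueError("no `entropy` values found in the log history")
    return sum(values) / len(values)


# Hypothetical log records; real ones would come from a short training run.
logs = [{"entropy": 3.0, "reward": 0.1}, {"reward": 0.2}, {"entropy": 5.0}]
suggested_target = mean_early_entropy(logs)
```

The resulting number is only a starting point; whether the target should sit at, above, or below the early entropy depends on how much exploration the task needs.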

When `use_adaptive_entropy=True`, the current entropy coefficient `entropy_coef` is saved alongside each checkpoint and restored on resume, so training is fully resumable.
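
The save-and-restore behavior can be pictured with the sketch below. It is an illustration only: the file name `entropy_state.json` and the helpers `save_entropy_state` and `load_entropy_state` are hypothetical, and the trainer's real storage format is not described here.

```python
import json
import os
import tempfile


def save_entropy_state(checkpoint_dir, entropy_coef):
    """Write the current coefficient next to the other checkpoint files."""
    path = os.path.join(checkpoint_dir, "entropy_state.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump({"entropy_coef": entropy_coef}, f)


def load_entropy_state(checkpoint_dir, default):
    """Return the saved coefficient, or `default` if the checkpoint has none."""
    path = os.path.join(checkpoint_dir, "entropy_state.json")
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)["entropy_coef"]


with tempfile.TemporaryDirectory() as ckpt:
    save_entropy_state(ckpt, 0.035)
    restored = load_entropy_state(ckpt, default=0.01)
```

Falling back to the configured initial value when no state file exists keeps checkpoints written without adaptive entropy loadable in this sketch.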

### Rapid Experimentation for GRPO

RapidFire AI is an open-source experimentation engine that sits on top of TRL and lets you launch multiple GRPO configurations at once, even on a single GPU. Instead of trying configurations sequentially, RapidFire lets you **see all their learning curves earlier, stop underperforming runs, and clone promising ones with new settings in flight** without restarting. For more information, see [RapidFire AI Integration](rapidfire_integration).
21 changes: 21 additions & 0 deletions docs/source/paper_index.md
@@ -232,6 +232,27 @@ training_args = GRPOConfig(
)
```

### Skywork-OR1: Open Reasoning Models

**📜 Paper**: https://huggingface.co/papers/2505.22312

Skywork-OR1 is a family of open reasoning models trained with GRPO. The paper introduces **adaptive entropy control**: an entropy regularization term `−α·H(π_θ)` is added to the GRPO objective, and the coefficient `α` is automatically adjusted each optimizer step. When the model's mean per-token entropy falls at or below a target, `α` is incremented to encourage more exploration; otherwise it is decremented. The bonus is only applied while entropy is at or below the target. To replicate this adaptive entropy control, use the following configuration:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_adaptive_entropy=True,  # enable adaptive entropy control (Section 3.3 of the paper)
    entropy_coef=0.01,  # initial entropy regularization coefficient
    entropy_target=5.0,  # target mean per-token entropy (nats); tune for your model
    entropy_coef_delta=0.005,  # step size for coefficient updates per optimizer step
)
trainer = GRPOTrainer(
    ...,
    args=training_args,
)
```

### Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

**📜 Paper**: https://huggingface.co/papers/2506.01939
66 changes: 66 additions & 0 deletions tests/test_grpo_trainer.py
@@ -1474,6 +1474,72 @@ def test_train_with_cast_lm_head_to_fp32(self, model_name):
new_param = trainer.model.get_parameter(n)
assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    def test_train_with_static_entropy(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=8,  # reduce the completion length to reduce memory usage
            report_to="none",
            entropy_coef=0.1,
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

        trainer.train()

        assert trainer.state.log_history[-1]["train_loss"] is not None
        assert trainer.state.log_history[-1]["policy_loss"] is not None
        assert trainer.state.log_history[-1]["entropy_coef"] is not None

        # Check that the params have changed
        for n, param in previous_trainable_params.items():
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

    def test_train_with_adaptive_entropy(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        training_args = GRPOConfig(
            output_dir=self.tmp_dir,
            learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
            per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
            num_generations=3,  # reduce the number of generations to reduce memory usage
            max_completion_length=8,  # reduce the completion length to reduce memory usage
            report_to="none",
            entropy_coef=0.01,
            use_adaptive_entropy=True,
            entropy_target=15.0,  # above any realistic entropy → coef is always incremented
        )
        trainer = GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            args=training_args,
            train_dataset=dataset,
        )

        previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

        trainer.train()

        assert trainer.state.log_history[-1]["train_loss"] is not None
        assert trainer.state.log_history[-1]["policy_loss"] is not None
        assert trainer.state.log_history[-1]["entropy_coef"] is not None
        # Coefficient should have increased since entropy < target throughout training
        assert trainer.entropy_coef > 0.01

        # Check that the params have changed
        for n, param in previous_trainable_params.items():
            new_param = trainer.model.get_parameter(n)
            assert not torch.equal(param, new_param), f"Parameter {n} has not changed."

def test_train_with_entropy_filter(self):
dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
training_args = GRPOConfig(
58 changes: 58 additions & 0 deletions trl/trainer/grpo_config.py
@@ -295,6 +295,28 @@ class GRPOConfig(_BaseConfig):
position, improving results. Range: `[0.0-1.0]`. A value of `0.0` masks all but the highest entropy token;
`1.0` keeps all tokens. The paper recommends a value of `0.2`. If used with
`mask_truncated_completions=True`, only tokens from non-truncated completions are considered.
entropy_coef (`float`, *optional*, defaults to `0.0`):
Coefficient of the entropy regularization term in the loss. A positive value adds an entropy bonus that
encourages exploration by keeping the policy from collapsing to near-deterministic outputs. When
`use_adaptive_entropy=True`, this serves as the initial coefficient and is updated each optimizer step.
Has no effect when set to `0.0` (default).
use_adaptive_entropy (`bool`, *optional*, defaults to `False`):
Whether to use adaptive entropy control, introduced in
[Skywork-OR1](https://huggingface.co/papers/2505.22312). When enabled, the entropy coefficient
`entropy_coef` is updated each optimizer step: incremented by `entropy_coef_delta` when the current
entropy is at or below `entropy_target`, and decremented otherwise. The coefficient is only applied when
entropy is at or below `entropy_target`.
entropy_coef_min (`float`, *optional*, defaults to `0.0`):
Lower bound for the entropy coefficient when using adaptive entropy control.
entropy_coef_max (`float`, *optional*, defaults to `1.0`):
Upper bound for the entropy coefficient when using adaptive entropy control.
entropy_coef_delta (`float`, *optional*, defaults to `0.005`):
Step size for adjusting the entropy coefficient at each optimizer step during adaptive entropy control.
entropy_target (`float`, *optional*, defaults to `0.2`):
Target mean per-token entropy (in nats) used by adaptive entropy control. The coefficient is only
applied when the current entropy falls at or below this value. Typical language models have per-token
entropies in the range 2–10 nats; for such models the default of `0.2` rarely triggers regularization,
so users should tune this to a value appropriate for their model and task.
max_tool_calling_iterations (`int`, *optional*):
Maximum number of tool-calling turns when training an agent. If `None`, there is no limit and generation
stops when the model generates a response turn with no tool calls or when the total response length reaches
@@ -832,6 +854,42 @@ class GRPOConfig(_BaseConfig):
"non-truncated completions are considered."
},
)
entropy_coef: float = field(
default=0.0,
metadata={
"help": "Coefficient of the entropy regularization term in the loss. A positive value adds an entropy "
"bonus that encourages exploration. When `use_adaptive_entropy=True`, this serves as the initial "
"coefficient and is updated each optimizer step. Has no effect when set to `0.0` (default)."
},
)
use_adaptive_entropy: bool = field(
default=False,
metadata={
"help": "Whether to use adaptive entropy control, introduced in Skywork-OR1 "
"(https://huggingface.co/papers/2505.22312). When enabled, `entropy_coef` is incremented by "
"`entropy_coef_delta` when entropy is at or below `entropy_target`, and decremented otherwise."
},
)
entropy_coef_min: float = field(
default=0.0,
metadata={"help": "Lower bound for the entropy coefficient when using adaptive entropy control."},
)
entropy_coef_max: float = field(
default=1.0,
metadata={"help": "Upper bound for the entropy coefficient when using adaptive entropy control."},
)
entropy_coef_delta: float = field(
default=0.005,
metadata={"help": "Step size for adjusting the entropy coefficient during adaptive entropy control."},
)
entropy_target: float = field(
default=0.2,
metadata={
"help": "Target mean per-token entropy (nats) for adaptive entropy control. The coefficient is only "
"applied when current entropy is at or below this value. Typical language models have per-token "
"entropies of 2–10 nats; for such models the default of 0.2 rarely triggers regularization, so tune this."
},
)
max_tool_calling_iterations: int | None = field(
default=None,
metadata={