SFT: Truncate during dataset preparation, not collation by qgallouedec · Pull Request #6155 · huggingface/trl

qgallouedec · 2026-06-23T18:09:56Z

Move sequence truncation in SFTTrainer from the data collator into _prepare_dataset.

Why

Truncation is a pure per-example slice. Doing it once during (cached) dataset preparation is cleaner than recomputing it in the collator on every batch, and keeps all dataset shaping (tokenize → build labels → truncate → pack) in one place.

It will also make it possible to drop rows with no trainable tokens, see #6025.

Changes

_prepare_dataset now truncates input_ids and labels to max_length (respecting truncation_mode), right after labels are built. Truncation is skipped when packing is enabled, since packing already chunks sequences to max_length.
build_labels now drops the completion_mask / assistant_masks columns: they're fully baked into labels.
DataCollatorForLanguageModeling no longer truncates: the max_length and truncation_mode arguments are removed.
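
For illustration, the per-example slice described above could look roughly like the sketch below. This is not the code from the PR: `truncate_example` is a hypothetical helper, and the only details taken from the description are the two columns (`input_ids`, `labels`) and the two modes (`keep_start`, `keep_end`).

```python
def truncate_example(example, max_length, truncation_mode="keep_start"):
    """Slice input_ids and labels of one example to at most max_length tokens.

    Hypothetical helper for illustration; the real logic lives in
    SFTTrainer._prepare_dataset and may differ in detail.
    """
    if truncation_mode not in ("keep_start", "keep_end"):
        raise ValueError(f"Unknown truncation_mode: {truncation_mode!r}")

    def _slice(tokens):
        if truncation_mode == "keep_start":
            return tokens[:max_length]
        # keep_end: keep the last max_length tokens (safe for max_length == 0)
        return tokens[max(len(tokens) - max_length, 0):]

    # Only the token-aligned columns are sliced; other columns pass through.
    return {
        key: _slice(value) if key in ("input_ids", "labels") else value
        for key, value in example.items()
    }


example = {"input_ids": [1, 2, 3, 4, 5, 6], "labels": [-100, -100, -100, 4, 5, 6]}
head = truncate_example(example, max_length=4, truncation_mode="keep_start")
tail = truncate_example(example, max_length=4, truncation_mode="keep_end")
```

Because the slice depends only on the example itself, it can be applied once and cached with the rest of dataset preparation rather than repeated for every batch.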

⚠️ Behavior change

With skip_prepare_dataset=True, preparation (and therefore truncation) is skipped. The dataset must already be truncated.
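
Users who pass skip_prepare_dataset=True may want a quick sanity check before training. The sketch below is one way to do that, assuming rows are available as plain dicts with an `input_ids` list; `find_overlong_rows` is a hypothetical helper, not part of TRL.

```python
def find_overlong_rows(rows, max_length):
    """Return the indices of rows whose input_ids exceed max_length.

    Hypothetical pre-flight check for datasets that bypass preparation.
    """
    return [
        index
        for index, row in enumerate(rows)
        if len(row["input_ids"]) > max_length
    ]


rows = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},
    {"input_ids": [1, 2, 3, 4, 5], "labels": [-100, -100, 3, 4, 5]},
]
overlong = find_overlong_rows(rows, max_length=4)
if overlong:
    print(f"{len(overlong)} row(s) exceed max_length: {overlong}")
```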

Note

Medium Risk
Behavior change for skip_prepare_dataset=True and custom collators that relied on collator truncation; default prepared-dataset paths should match prior semantics but truncation timing affects assistant-only loss edge cases.

Overview
SFT sequence truncation now runs in cached dataset preparation instead of on every batch in DataCollatorForLanguageModeling. max_length and truncation_mode are removed from the collator; _prepare_dataset slices input_ids and labels after label building (honoring keep_start / keep_end), skipped when packing is enabled.

Mask columns are dropped after labels are built (assistant_masks / completion_mask removed via build_labels remove_columns), so training rows only carry input_ids and labels.
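
As a rough illustration of why the mask column becomes redundant, the sketch below folds a mask into labels and then drops it. It assumes masked positions are marked with -100 (the default ignore index of PyTorch's cross-entropy loss) and that labels otherwise copy input_ids; `build_labels_from_mask` is a hypothetical stand-in, not the build_labels function from the PR.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def build_labels_from_mask(example, mask_key="completion_mask"):
    """Fold a 0/1 mask into labels and drop the mask column.

    Hypothetical sketch; the actual build_labels in TRL may differ.
    """
    input_ids = example["input_ids"]
    mask = example[mask_key]
    if len(mask) != len(input_ids):
        raise ValueError("mask and input_ids must have the same length")

    labels = [
        token if keep else IGNORE_INDEX
        for token, keep in zip(input_ids, mask)
    ]
    # Everything the mask encoded now lives in labels, so it is not carried on.
    result = {key: value for key, value in example.items() if key != mask_key}
    result["labels"] = labels
    return result


row = build_labels_from_mask(
    {"input_ids": [10, 11, 12, 13], "completion_mask": [0, 0, 1, 1]}
)
```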

⚠️ With skip_prepare_dataset=True, truncation no longer happens anywhere — inputs must already be within max_length. Collator tests for truncation and the test that asserted truncation_mode is passed to the collator were removed; preparation-focused tests were added/updated (including #3927: all-masked labels after aggressive truncation).
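
The all-masked edge case can be reproduced with a toy example. The sketch below assumes -100 marks ignored positions and uses a hypothetical `has_trainable_tokens` helper; it only illustrates how keep_start truncation can cut away every trainable label when the prompt alone is longer than max_length.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def has_trainable_tokens(labels):
    """Return True if at least one label contributes to the loss."""
    return any(label != IGNORE_INDEX for label in labels)


# Six masked prompt tokens followed by two trainable completion tokens.
labels = [IGNORE_INDEX] * 6 + [7, 8]

# keep_start with max_length=4 keeps only prompt positions.
truncated_start = labels[:4]
# keep_end with max_length=4 keeps the completion.
truncated_end = labels[-4:]

before = has_trainable_tokens(labels)
after_keep_start = has_trainable_tokens(truncated_start)
after_keep_end = has_trainable_tokens(truncated_end)
```

Since truncation now happens during preparation, rows like `truncated_start` are visible before training starts, which is what would make filtering them out possible.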

Reviewed by Cursor Bugbot for commit e6eec1f. Bugbot is set up for automated code reviews on this repo.

bot-ci-comment · 2026-06-23T18:13:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0579163751


qgallouedec and others added 2 commits June 23, 2026 18:07

SFT: Truncate during dataset preparation, not collation

0579163

Merge branch 'main' into truncate-during-dataset-preparation

b44a3e0

chatgpt-codex-connector Bot reviewed Jun 23, 2026

Comment thread trl/trainer/sft_trainer.py

qgallouedec and others added 5 commits June 23, 2026 15:19

Merge branch 'main' into truncate-during-dataset-preparation

8a06afe

better doc

9d0e59d

Merge branch 'main' into truncate-during-dataset-preparation

252da69

Merge branch 'main' into truncate-during-dataset-preparation

7174153

Merge branch 'main' into truncate-during-dataset-preparation

736723b

qgallouedec requested review from AmineDiro, albertvillanova and kashif June 25, 2026 22:33

qgallouedec added 5 commits June 26, 2026 10:37

Merge branch 'main' into truncate-during-dataset-preparation

04062c3

Merge branch 'main' into truncate-during-dataset-preparation

a47c142

Merge branch 'main' into truncate-during-dataset-preparation

2a4f780

Merge branch 'main' into truncate-during-dataset-preparation

b3bbb4f

Merge branch 'main' into truncate-during-dataset-preparation

e6eec1f

qgallouedec wants to merge 12 commits into main from truncate-during-dataset-preparation

1 participant