fix(data): make finetuning batch sampler epoch-aware on checkpoint resume by Achyuthan-S · Pull Request #4601 · NVIDIA-NeMo/Megatron-Bridge

Achyuthan-S · 2026-06-30T18:46:09Z

What does this PR do ?

Fixes abnormal loss when resuming fine-tuning from a checkpoint (#4565). The
fine-tuning batch sampler froze consumed_samples and re-applied the resume
offset on every epoch, so after a mid-epoch resume each later epoch skipped the
start of the dataset and replayed only the tail. The model kept re-training on
the same samples, which depressed the loss. This PR makes
MegatronPretrainingBatchSampler epoch-aware and adds deterministic per-epoch
reshuffling, mirroring MegatronPretrainingRandomSampler.
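
To make the failure mode concrete, here is a toy sketch in plain Python. It is not the Megatron-Bridge code: both function names are hypothetical, and batching, data parallelism, and shuffling are deliberately left out. It only contrasts a resume offset that is computed once and reused with one that is recomputed from an advancing consumed_samples.

```python
def frozen_offset_epochs(dataset_size, consumed_samples, num_epochs):
    """Old behavior (toy model): the resume offset is computed once and reused."""
    offset = consumed_samples % dataset_size  # never updated afterwards
    # Every pass restarts at the same offset, so the head is skipped each time.
    return [list(range(offset, dataset_size)) for _ in range(num_epochs)]


def epoch_aware_epochs(dataset_size, consumed_samples, num_epochs):
    """Fixed behavior (toy model): consumed_samples advances as samples are served."""
    epochs = []
    for _ in range(num_epochs):
        offset = consumed_samples % dataset_size
        indices = list(range(offset, dataset_size))
        consumed_samples += len(indices)  # the next pass sees an offset of 0
        epochs.append(indices)
    return epochs


if __name__ == "__main__":
    # Resume after 5 of 8 samples, then run three passes.
    print(frozen_offset_epochs(8, 5, 3))  # [[5, 6, 7], [5, 6, 7], [5, 6, 7]]
    print(epoch_aware_epochs(8, 5, 3))    # [[5, 6, 7], [0, ..., 7], [0, ..., 7]]
```

In the frozen variant samples 0 through 4 are never seen again after the resume, which is the "replayed only the tail" pattern described above.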

Changelog

- data/samplers.py: MegatronPretrainingBatchSampler.__iter__ now advances
  consumed_samples as it yields and computes the within-epoch offset against the
  whole-global-batch multiple, so each cyclic_iter re-iteration starts a fresh
  full epoch instead of permanently re-applying the resume offset and replaying
  the tail.
- data/samplers.py: add shuffle/seed params for deterministic, seed- and
  epoch-derived per-epoch reshuffling (resume-reproducible). Threaded through
  build_pretraining_data_loader. Both default to off → backward compatible.
- data/loaders.py: enable shuffle for the fine-tuning (batch) train
  dataloader, seeded by the dataset seed. Eval/test and pretraining paths are
  unchanged.
- tests/.../data/test_samplers.py: regression tests for resume-serves-full-epoch,
  epoch-boundary resume, per-epoch reshuffle, seed determinism, and shuffle resume
  parity. Existing batch-sampler assertions are unchanged.
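
The epoch accounting in the first changelog item can be sketched as follows. This is a minimal single-rank illustration, not the actual sampler: the class name ToyEpochAwareBatchSampler is hypothetical, and it assumes that "whole-global-batch multiple" means the epoch length is the dataset size rounded down to a multiple of the global batch size.

```python
class ToyEpochAwareBatchSampler:
    """Hypothetical, single-rank stand-in for an epoch-aware batch sampler."""

    def __init__(self, total_samples, consumed_samples, global_batch_size):
        if global_batch_size <= 0 or total_samples < global_batch_size:
            raise ValueError("need at least one full global batch")
        self.consumed_samples = consumed_samples
        self.global_batch_size = global_batch_size
        # Assumption: one epoch is the largest whole multiple of the global
        # batch size that fits in the dataset (the ragged tail is dropped).
        self.epoch_samples = (total_samples // global_batch_size) * global_batch_size

    def __iter__(self):
        # The offset is recomputed on every re-iteration, so it is non-zero
        # only for the first pass after a mid-epoch resume.
        offset = self.consumed_samples % self.epoch_samples
        for start in range(offset, self.epoch_samples, self.global_batch_size):
            stop = min(start + self.global_batch_size, self.epoch_samples)
            batch = list(range(start, stop))
            self.consumed_samples += len(batch)  # advance as batches are yielded
            yield batch


if __name__ == "__main__":
    sampler = ToyEpochAwareBatchSampler(
        total_samples=10, consumed_samples=4, global_batch_size=4
    )
    print(list(sampler))  # rest of the interrupted epoch: [[4, 5, 6, 7]]
    print(list(sampler))  # fresh full epoch: [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Re-iterating the same object stands in for what cyclic_iter does between epochs. Because the counter lives on the sampler and keeps advancing, the second pass starts from index 0.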

Behavior change: fine-tuning data order now reshuffles each epoch
(deterministic and resume-reproducible). SFT loss curves will shift versus the
previous sequential order. New params are optional and default off.
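
One way to get an order that is both reshuffled per epoch and reproducible on resume is to derive each epoch's permutation purely from the seed and the epoch index. The sketch below assumes that approach and uses Python's random module with seed + epoch as the per-epoch seed. Both helper names are hypothetical, and the PR's actual derivation may differ.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Permutation for one epoch, derived only from (seed, epoch)."""
    order = list(range(num_samples))
    # Assumed derivation for illustration; not taken from the PR.
    random.Random(seed + epoch).shuffle(order)
    return order


def stream_from(num_samples, seed, consumed_samples, count):
    """Return the next `count` sample indices after `consumed_samples`."""
    out = []
    while len(out) < count:
        # Epoch and within-epoch offset are both recovered from the counter.
        epoch, offset = divmod(consumed_samples, num_samples)
        remaining = count - len(out)
        take = epoch_order(num_samples, seed, epoch)[offset:offset + remaining]
        out.extend(take)
        consumed_samples += len(take)
    return out


if __name__ == "__main__":
    uninterrupted = stream_from(num_samples=6, seed=3, consumed_samples=0, count=18)
    resumed = stream_from(num_samples=6, seed=3, consumed_samples=4, count=14)
    # A run resumed after 4 samples continues exactly where the first one was.
    print(uninterrupted[4:] == resumed)
```

Since nothing but seed and consumed_samples is needed to rebuild the order, a checkpoint that stores the sample counter is enough to resume in the same place.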

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests? (sampler resume-parity + reshuffle regression tests)
- Did you add or update any necessary documentation? (evaluated: no existing doc covers fine-tuning sampler/resume order; the behavior change is described above. Happy to add a docs note if maintainers prefer.)
- Does the PR affect components that are optional to install? No
  - Reviewer: Does the PR have correct import guards for all optional libraries? N/A

Additional Information

Closes [bug] Abnormal loss fluctuation when resuming training from checkpoint #4565
Suggested labels: bug, area:data, needs-more-tests (this changes SFT data ordering).

…sume

MegatronPretrainingBatchSampler froze consumed_samples and re-applied the resume offset on every cyclic_iter re-iteration. After a mid-epoch resume, each subsequent epoch skipped the head of the dataset and replayed only the tail, re-training the model on the same samples and depressing the loss (matching the report: mid-epoch resume drops loss, epoch-boundary resume is fine).

- Advance consumed_samples as batches yield and compute the within-epoch offset against the whole-global-batch multiple, so each re-iteration starts a fresh full epoch instead of replaying the tail.
- Add deterministic, seed- and epoch-derived per-epoch reshuffling (shuffle/seed params), enabled for the fine-tuning 'batch' train dataloader; reshuffles every epoch and reproduces the same order on resume.
- Add regression tests for resume parity, per-epoch reshuffle, and seed determinism.

Note: this changes the fine-tuning data order (deterministic and resume-reproducible); SFT loss curves will shift versus the previous sequential ordering.

Closes NVIDIA-NeMo#4565

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

copy-pr-bot · 2026-06-30T18:46:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-07-01T04:04:14Z

/ok to test dd99eae

yaoyu-33 · 2026-07-01T16:36:35Z

/ok to test 4c996b0

Achyuthan-S · 2026-07-02T10:57:26Z

Hello @yaoyu-33, it was great contributing. Thank you for the review.

If there are any other issues that I can contribute to and solve, it would be great if you could tag me on them.

Thanks again!

Copilot AI review requested due to automatic review settings June 30, 2026 18:46

Copilot AI reviewed Jun 30, 2026

github-actions Bot added the community-request label Jun 30, 2026

Achyuthan-S mentioned this pull request Jun 30, 2026

[bug] Abnormal loss fluctuation when resuming training from checkpoint #4565

Open

yaoyu-33 added the area:data, bug, and needs-review labels Jun 30, 2026

yaoyu-33 force-pushed the fix/finetune-sampler-resume-shuffle branch from db4ae40 to 1c97cdb July 1, 2026 00:44

fix(data): align batch sampler epoch accounting

dd99eae

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 force-pushed the fix/finetune-sampler-resume-shuffle branch from 1c97cdb to dd99eae July 1, 2026 00:49

copy-pr-bot Bot temporarily deployed to public July 1, 2026 04:04 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 04:42 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 04:43 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 05:07 Inactive

Merge branch 'main' into fix/finetune-sampler-resume-shuffle

4c996b0

copy-pr-bot Bot temporarily deployed to public July 1, 2026 16:37 Inactive

copy-pr-bot Bot temporarily deployed to test July 1, 2026 16:37 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 17:33 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 17:58 Inactive

yaoyu-33 added the ready-to-merge label and removed the needs-review label Jul 2, 2026

Achyuthan-S wants to merge 3 commits into NVIDIA-NeMo:main from Achyuthan-S:fix/finetune-sampler-resume-shuffle
