Skip to content

fix(data): make finetuning batch sampler epoch-aware on checkpoint resume#4601

Open
Achyuthan-S wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
Achyuthan-S:fix/finetune-sampler-resume-shuffle
Open

fix(data): make finetuning batch sampler epoch-aware on checkpoint resume#4601
Achyuthan-S wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
Achyuthan-S:fix/finetune-sampler-resume-shuffle

Conversation

@Achyuthan-S

Copy link
Copy Markdown

What does this PR do ?

Fixes abnormal loss when resuming fine-tuning from a checkpoint (#4565). The
fine-tuning batch sampler froze consumed_samples and re-applied the resume
offset on every epoch, so after a mid-epoch resume each later epoch skipped the
start of the dataset and replayed only the tail — repeated data that depressed
the loss. This makes MegatronPretrainingBatchSampler epoch-aware (and adds
deterministic per-epoch reshuffling), mirroring MegatronPretrainingRandomSampler.

Changelog

  • data/samplers.py: MegatronPretrainingBatchSampler.__iter__ now advances
    consumed_samples as it yields and computes the within-epoch offset against the
    whole-global-batch multiple, so each cyclic_iter re-iteration starts a fresh
    full epoch instead of permanently re-applying the resume offset and replaying
    the tail.
  • data/samplers.py: add shuffle/seed params for deterministic, seed- and
    epoch-derived per-epoch reshuffling (resume-reproducible). Threaded through
    build_pretraining_data_loader. Both default to off → backward compatible.
  • data/loaders.py: enable shuffle for the fine-tuning (batch) train
    dataloader, seeded by the dataset seed. Eval/test and pretraining paths are
    unchanged.
  • tests/.../data/test_samplers.py: regression tests for resume-serves-full-epoch,
    epoch-boundary resume, per-epoch reshuffle, seed determinism, and shuffle resume
    parity. Existing batch-sampler assertions are unchanged.

Behavior change: fine-tuning data order now reshuffles each epoch
(deterministic and resume-reproducible). SFT loss curves will shift versus the
previous sequential order. New params are optional and default off.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? (sampler resume-parity + reshuffle regression tests)
  • Did you add or update any necessary documentation? (evaluated — no existing doc covers fine-tuning sampler/resume order; the behavior change is described above. Happy to add a docs note if maintainers prefer.)
  • Does the PR affect components that are optional to install? No
    • Reviewer: Does the PR have correct import guards for all optional libraries? N/A

Additional Information

…sume

MegatronPretrainingBatchSampler froze consumed_samples and re-applied the
resume offset on every cyclic_iter re-iteration. After a mid-epoch resume, each
subsequent epoch skipped the head of the dataset and replayed only the tail,
re-training the model on the same samples and depressing the loss (matching the
report: mid-epoch resume drops loss, epoch-boundary resume is fine).

- Advance consumed_samples as batches yield and compute the within-epoch offset
  against the whole-global-batch multiple, so each re-iteration starts a fresh
  full epoch instead of replaying the tail.
- Add deterministic, seed- and epoch-derived per-epoch reshuffling (shuffle/seed
  params), enabled for the fine-tuning 'batch' train dataloader; reshuffles every
  epoch and reproduces the same order on resume.
- Add regression tests for resume parity, per-epoch reshuffle, and seed
  determinism.

Note: this changes the fine-tuning data order (deterministic and
resume-reproducible); SFT loss curves will shift versus the previous sequential
ordering.

Closes NVIDIA-NeMo#4565

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Copilot AI review requested due to automatic review settings June 30, 2026 18:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 added area:data Dataset builders, preprocessing, and samplers bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels Jun 30, 2026
@yaoyu-33 yaoyu-33 force-pushed the fix/finetune-sampler-resume-shuffle branch from db4ae40 to 1c97cdb Compare July 1, 2026 00:44
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 force-pushed the fix/finetune-sampler-resume-shuffle branch from 1c97cdb to dd99eae Compare July 1, 2026 00:49
@yaoyu-33

yaoyu-33 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

/ok to test dd99eae

@yaoyu-33

yaoyu-33 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

/ok to test 4c996b0

@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Jul 2, 2026
@Achyuthan-S

Copy link
Copy Markdown
Author

Hello @yaoyu-33 , it was great contributing. Thank you for the review.

If you have any other issues that i can contribute to and solve .., it would be great if you tag me to it whenever .

Thanks again !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:data Dataset builders, preprocessing, and samplers bug Something isn't working community-request ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Abnormal loss fluctuation when resuming training from checkpoint

4 participants