fix(data): make finetuning batch sampler epoch-aware on checkpoint resume #4601
Open
Achyuthan-S wants to merge 3 commits into
Conversation
…sume

MegatronPretrainingBatchSampler froze consumed_samples and re-applied the resume offset on every cyclic_iter re-iteration. After a mid-epoch resume, each subsequent epoch skipped the head of the dataset and replayed only the tail, re-training the model on the same samples and depressing the loss (matching the report: mid-epoch resume drops loss, epoch-boundary resume is fine).

- Advance consumed_samples as batches yield and compute the within-epoch offset against the whole-global-batch multiple, so each re-iteration starts a fresh full epoch instead of replaying the tail.
- Add deterministic, seed- and epoch-derived per-epoch reshuffling (shuffle/seed params), enabled for the fine-tuning 'batch' train dataloader; reshuffles every epoch and reproduces the same order on resume.
- Add regression tests for resume parity, per-epoch reshuffle, and seed determinism.

Note: this changes the fine-tuning data order (deterministic and resume-reproducible); SFT loss curves will shift versus the previous sequential ordering.

Closes NVIDIA-NeMo#4565

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
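For reviewers who want to see the failure mode in isolation, here is a minimal sketch of the behavior the commit message describes. It is a toy model, not the NeMo sampler: `frozen_offset_epochs` and all of its parameters are hypothetical names, and it assumes the dataset size is a whole multiple of the global batch size.

```python
def frozen_offset_epochs(dataset_size, global_batch_size, consumed_samples, num_epochs):
    """Toy model of the reported bug (hypothetical helper, not NeMo code).

    The resume offset is computed once and then reused for every pass,
    so each pass starts mid-dataset instead of at index 0.
    """
    offset = consumed_samples % dataset_size  # frozen: never advanced afterwards
    epochs = []
    for _ in range(num_epochs):
        batches = []
        for start in range(offset, dataset_size, global_batch_size):
            batches.append(list(range(start, start + global_batch_size)))
        epochs.append(batches)
    return epochs


# Resume halfway through an 8-sample dataset with a global batch size of 2.
epochs = frozen_offset_epochs(
    dataset_size=8, global_batch_size=2, consumed_samples=4, num_epochs=3
)
seen = {idx for epoch in epochs for batch in epoch for idx in batch}
# Every pass is [[4, 5], [6, 7]]; samples 0..3 are never served again.
```

In this toy setting only the tail of the dataset is ever seen after the resume, which is the repeated-data pattern blamed for the depressed loss.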
db4ae40 to
1c97cdb
Compare
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
1c97cdb to
dd99eae
Compare
Contributor
/ok to test dd99eae
Contributor
/ok to test 4c996b0
Author
Hello @yaoyu-33, it was great contributing. Thank you for the review. If you have any other issues that I can contribute to and solve, it would be great if you could tag me on them. Thanks again!
What does this PR do?
Fixes abnormal loss when resuming fine-tuning from a checkpoint (#4565). The fine-tuning batch sampler froze consumed_samples and re-applied the resume offset on every epoch, so after a mid-epoch resume each later epoch skipped the start of the dataset and replayed only the tail: repeated data that depressed the loss. This makes MegatronPretrainingBatchSampler epoch-aware (and adds deterministic per-epoch reshuffling), mirroring MegatronPretrainingRandomSampler.

Changelog
- data/samplers.py: MegatronPretrainingBatchSampler.__iter__ now advances consumed_samples as it yields and computes the within-epoch offset against the whole-global-batch multiple, so each cyclic_iter re-iteration starts a fresh full epoch instead of permanently re-applying the resume offset and replaying the tail.
- data/samplers.py: add shuffle/seed params for deterministic, seed- and epoch-derived per-epoch reshuffling (resume-reproducible). Threaded through build_pretraining_data_loader. Both default to off → backward compatible.
- data/loaders.py: enable shuffle for the fine-tuning (batch) train dataloader, seeded by the dataset seed. Eval/test and pretraining paths are unchanged.
- tests/.../data/test_samplers.py: regression tests for resume-serves-full-epoch, epoch-boundary resume, per-epoch reshuffle, seed determinism, and shuffle resume parity. Existing batch-sampler assertions are unchanged.
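The offset change in the first changelog item can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the actual MegatronPretrainingBatchSampler: `EpochAwareBatchSampler` is a hypothetical name, shuffling is left out, and `consumed_samples` is assumed to be a whole multiple of the global batch size.

```python
class EpochAwareBatchSampler:
    """Toy stand-in for an epoch-aware batch sampler (hypothetical, not NeMo code)."""

    def __init__(self, total_samples, consumed_samples, global_batch_size):
        self.consumed_samples = consumed_samples
        self.global_batch_size = global_batch_size
        # Samples served per epoch: the largest whole multiple of the global batch.
        self.epoch_samples = (total_samples // global_batch_size) * global_batch_size

    def __iter__(self):
        # Recomputed on every re-iteration from the live counter, so the
        # resume offset applies only to the first, partial epoch.
        offset = self.consumed_samples % self.epoch_samples
        for start in range(offset, self.epoch_samples, self.global_batch_size):
            # Advance the counter as each batch is handed out.
            self.consumed_samples += self.global_batch_size
            yield list(range(start, start + self.global_batch_size))


# Resume halfway through an 8-sample dataset with a global batch size of 2.
sampler = EpochAwareBatchSampler(total_samples=8, consumed_samples=4, global_batch_size=2)
first_pass = list(sampler)   # finishes the interrupted epoch
second_pass = list(sampler)  # starts again from index 0
```

Because the counter has reached a full-epoch multiple once the first pass ends, the second pass computes an offset of zero and serves the whole dataset.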
Behavior change: fine-tuning data order now reshuffles each epoch
(deterministic and resume-reproducible). SFT loss curves will shift versus the
previous sequential order. New params are optional and default off.
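A resume-reproducible reshuffle can be sketched as follows. The way seed and epoch are combined here is an assumption made for illustration (the PR description only says the order is seed- and epoch-derived), and `epoch_order` and `batches_from` are hypothetical helpers.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Permutation for one epoch, derived only from (seed, epoch)."""
    order = list(range(num_samples))
    # Assumed mixing scheme: any deterministic function of seed and epoch works.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def batches_from(consumed, num_samples, batch_size, seed, num_batches):
    """Serve num_batches batches, starting from a consumed-samples counter."""
    per_epoch = (num_samples // batch_size) * batch_size
    out = []
    while len(out) < num_batches:
        epoch, offset = divmod(consumed, per_epoch)
        order = epoch_order(num_samples, seed, epoch)[:per_epoch]
        out.append(order[offset:offset + batch_size])
        consumed += batch_size
    return out


# An uninterrupted run versus a run resumed after 3 batches (6 samples).
uninterrupted = batches_from(0, num_samples=10, batch_size=2, seed=1234, num_batches=12)
resumed = batches_from(6, num_samples=10, batch_size=2, seed=1234, num_batches=9)
assert resumed == uninterrupted[3:]
```

Since each epoch's order depends only on the seed and the epoch index, a resumed run rebuilds the same permutation and continues from the same position, which is the property the shuffle resume parity test is described as checking.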
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
Additional Information
bug, area:data, needs-more-tests (this changes SFT data ordering).