feat(data): intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP)#4608
Open
sailor1493 wants to merge 2 commits into
Open
feat(data): intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP)#4608sailor1493 wants to merge 2 commits into
sailor1493 wants to merge 2 commits into
Conversation
…packing, scalable DP) Non-colocated MegatronMIMO training (e.g. Qwen3.5-VL: a separate vision encoder and language model wired by BridgeCommunicator) has a per-step data-parallel straggler: per-micro-batch vision load (patch count) is uneven across samples, so a DP rank that draws a heavy-image shard stalls the whole group every step. Main feature - Intra-microbatch Reordering (cf. DistTrain https://arxiv.org/abs/2408.04275, S5.2): rebalance each micro-batch's vision load across the module DP group by a per-sample cost all-gather + ragged all_to_all (GPU-resident, overlapped with compute), so per-rank vision load is even and the straggler tail is removed. Works with heterogeneous DP (vision_dp != language_dp) via a canonical n_groups pairing (vision replica r <-> language replica r) and a variable number of images per sample (0 / 1 / N). Auxiliary features (needed to implement the main feature) - Sequence Packing: Megatron-Core accepts packed sequences, but there was no logic to build and feed one; pack each language shard's real tokens into a single [1, T] THD sequence. - Scalable Data Parallelism: previously only vision DP=1 was supported (every DP worker reads the full global micro-batch and slices locally). Each rank now reads only its disjoint 1/dp shard, and vision DP>1 is supported. All three are off by default, gated by a new MegatronMIMOFeatureConfig. DP loss reduction is unchanged from non-scalable runs. Guarded with explicit errors: in-batch packing under PP>1, non-single (cyclic/batch) sampler, TP>1, and CP>1 (CP is also blocked upstream). Reorder under PP>1 runs but is experimental and not yet correct (vision/language mispairing under the per-stage DP groups; tied embeddings additionally hit the cross-PP embedding all-reduce); see docs/training/mimo-intra-microbatch-reorder.md. Signed-off-by: Yoonsik Kim <yoonsik.kim90@navercorp.com> Signed-off-by: Kayeon Song <kayeon.song@navercorp.com> Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>
- drop the _gather_shard pass-through wrapper; call _apply_sample_dispatch directly from split_microbatch - collapse the empty scalable_dp branch in the forward step - factor 8-byte alignment into _pad_to_align() and use math.prod for sizes No behavior change; reorder_buffer + intra_microbatch_pack unit suites pass (58). Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Main Feature
micro-batch's vision load across the module DP group by a per-sample cost all-gather + ragged
all_to_all(GPU-resident, overlapped with compute) so the per-step DP straggler is removed.Auxiliary Features (needed to implement the main feature)
build and feed a packed sequence; added it (
[1, T]THD).full global micro-batch and slices its own part locally → CPU/IO overhead). Now each rank reads
only its disjoint
1/dpshard, and vision DP>1 is supported.All three are off by default, gated by the new
MegatronMIMOFeatureConfig. No breaking change(purely additive: new config + example CLI flags; no existing key/flag/symbol changed).
Changelog
Config (
src/megatron/bridge/training/config.py)MegatronMIMOFeatureConfig:scalable_dp,intra_microbatch_reorder,overlap_intra_microbatch_reorder,reorder_window_size,pack_sequences_in_batch,cost coefficients (
cost_linear_vit,cost_linear_lm),pad_token_id.finalize()rejects negative coefficients/pad id,reorder_window_size < 1, and an all-zero costwhen reorder is enabled. Wired into
ConfigContainervia a newmimofield.Reorder engine (
src/megatron/bridge/data/megatron_mimo/)reorder_buffer.py(new): per-sample cost all-gather (Gloo) + raggedall_to_all(NCCL) on adedicated CUDA stream; ragged serialize/deserialize;
balanced_assignment(contiguous-block,het-DP canonical
n_groups);split_microbatch/merge_samples; variable-images-per-sample(
cu_img,empty_like_vision);W-micro-batch reorder window with cross-window prefetch overlap.input_ids(
cost = count(image_token) · spatial_merge_size²), identical on the vision and language modules andon every PP stage — so both modules derive the same assignment with no cross-module communication.
intra_microbatch_pack.py(new): pack each language shard's real tokens into a single[1, T]THD sequence (
pack_language_shard/assemble_packed_sequence); packsposition_ids/labels/loss_maskto the same[1, T]on every PP stage so the THD rotary is sized toT.dp_utils.py: scalable-DP sampling info (each rank reads its module-local shard); image-boundaryvision handling; non-scalable
vision_dp > 1explicitly raises (out of scope); colocatesingle-consumer helpers.
Training integration (
src/megatron/bridge/training/)megatron_mimo_step.py: thread the config through the forward step —scalable_dpskips theforward-time local slice (sampler already delivered the shard); optional in-batch sequence packing.
Keeps
input_idson every language PP stage when reorder or packing is active so per-sample lengthsare derivable on stages > 0. DP loss reduction is unchanged from non-scalable runs.
train_megatron_mimo.py: build/route the scalable-DP sampler and the reorder exchange from config.Data / packing (
src/megatron/bridge/data/)datasets/packing_utils.py: shared THD packing helper (placement plan across DP workers).Example (
examples/megatron_mimo/qwen35_vl/finetune_qwen35_vl.py)--scalable-dp,--intra-microbatch-reorder/--no-…,--pack-sequences-in-batch,--reorder-window-size, cost coefficients).NotImplementedErrorfordataloader_type != "single".PP > 1is supported on untied checkpoints (no packing/PP guard); see Known limitations.Docs (
docs/)training/mimo-intra-microbatch-reorder.md: feature, config, validation matrix, single-node throughput,gaps. Index entries:
index.md,training/README.md.Tests (
tests/)data/megatron_mimo/test_reorder_buffer.py(44),data/megatron_mimo/test_intra_microbatch_pack.py(14),
training/test_mimo_feature_config.py(14),data/datasets/test_packing_utils.py, plus updates totraining/megatron_mimo/test_megatron_mimo_step.py— all green.test_groups/megatron_mimo/test_reorder_exchange.py— on-device exchange smoke.Performance
Throughput, single 8×A100-80GB node, vision dp4 / language dp4, PP=1, TP=1, Qwen3.5-0.8B (VL) +
CORD-v2, seq 2048, sequence packing on, MBS/GBS 32 (= 8 examples/rank), patch-only cost.
500 iters, stats over
elapsed time per iterationwith the first 10 iters excluded (compile + theone-time side-NCCL
new_groupwarmup).PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,CUDA_DEVICE_MAX_CONNECTIONS=8.--no-intra-microbatch-reorder)--no-overlap-intra-microbatch-reorder(ms/iter.)
1/dpshard instead of every rank redundantly reading and decoding all 32 samples. mean 1,075 → 798 (1.35×),
and the full-read tail collapses: p99 2,342 → 910, p90 1,523 → 864.
packs its
mbs/dpsamples into one[1, T]THD sequence whoseT = Σimage-placeholder tokens, so anuneven per-rank image load skews
T. The all-to-all evens per-rank patch cost (balance probe: spread≈1.27× → ≈1.07×, ~18–24 of 32 samples exchanged):
inter-percentile spread (p10–p99) tightens 184 → 160 ms.
path:
--no-overlapmean 811 ≈ read-only 798 and behind overlapped 755. So at this config the balancing gain is real only because the exchange is hidden behind compute.buys single-digit %, not multiples. The reorder win grows with imbalance — larger DP (more ranks →
higher chance one rank draws a heavy shard), larger image-size variance, and larger per-rank batch.
Known limitations / out of scope
PP>1is supported with untied checkpoints. Reorder +PP>1and packing +PP>1are fixed andverified at
dp2/dp2/pp2(lm loss < 2, tracking the no-reorder PP=2 baseline). Tied-embedding +PP>1is still not working at the upstream, independent of this feature — use an untied checkpoint(
tie_word_embeddings=false, LM head = copy of the input embedding) forPP>1.singlesampler (guarded with an explicit error),and
TP>1is untested.CP>1is blocked upstream (bridge_communicatorasserts language-grid CP size 1).full on-device path was validated manually (16-GPU multi-node + 8×A100 single-node).
GitHub Actions CI
See the CI section
in the Contributing doc for how to trigger the CI.
Before your PR is "Ready for review"
Pre checks:
Signed-off-by: Yoonsik Kim yoonsik.kim90@navercorp.com
Signed-off-by: Kayeon Song kayeon.song@navercorp.com
Signed-off-by: Chanwoo Park chanwoo.park98@navercorp.com