feat(data): intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP) by sailor1493 · Pull Request #4608 · NVIDIA-NeMo/Megatron-Bridge

sailor1493 · 2026-07-01T06:14:47Z

What does this PR do ?

Closes #4609

Main Feature

Intra-microbatch Reordering (DistTrain §5.2): rebalance each
micro-batch's vision load across the module DP group by a per-sample cost all-gather + ragged
all_to_all (GPU-resident, overlapped with compute) so the per-step DP straggler is removed.
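The rebalancing idea can be illustrated with a single-process emulation. This is a minimal sketch, not the PR's implementation: the all-gather is emulated by flattening per-rank lists, and the balancer is a plain greedy longest-processing-time heuristic standing in for the PR's `balanced_assignment` (which is described as contiguous-block and may behave differently). The function names and the cost values are made up for illustration.

```python
import heapq


def gather_costs(per_rank_costs):
    """Emulate an all-gather: flatten to (src_rank, sample_idx, cost) tuples."""
    return [
        (rank, idx, cost)
        for rank, costs in enumerate(per_rank_costs)
        for idx, cost in enumerate(costs)
    ]


def greedy_balance(per_rank_costs):
    """Assign the heaviest samples first, each to the currently lightest rank."""
    n_ranks = len(per_rank_costs)
    # Deterministic ordering so every rank would derive the same plan.
    samples = sorted(gather_costs(per_rank_costs), key=lambda s: (-s[2], s[0], s[1]))
    heap = [(0, rank) for rank in range(n_ranks)]  # (current load, rank)
    assignment = [[] for _ in range(n_ranks)]
    for src_rank, idx, cost in samples:
        load, dst = heapq.heappop(heap)
        assignment[dst].append((src_rank, idx, cost))
        heapq.heappush(heap, (load + cost, dst))
    return assignment


def spread(loads):
    """Ratio of the heaviest rank to the lightest rank."""
    return max(loads) / min(loads)


# Hypothetical per-sample vision costs on 4 DP ranks, 2 samples each.
per_rank_costs = [[900, 800], [100, 200], [300, 400], [500, 600]]
before = [sum(costs) for costs in per_rank_costs]
after = [sum(c for _, _, c in dst) for dst in greedy_balance(per_rank_costs)]
print(before, round(spread(before), 2))
print(after, round(spread(after), 2))
```

In a real run, the entries whose source rank differs from their destination rank are the samples that would travel through the ragged all_to_all. The sketch ignores any per-rank sample-count constraint.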

Auxiliary Features (needed to implement the main feature)

Sequence Packing: Megatron Core accepts packed sequences, but nothing actually built and fed
one; this PR adds that logic, packing into a single [1, T] THD sequence.
Scalable Data Parallelism: previously only vision DP=1 was supported, and every DP worker read the
full global micro-batch and sliced its own part locally, which costs redundant CPU/IO. Now each
rank reads only its disjoint 1/dp shard, and vision DP>1 is supported.

All three are off by default, gated by the new MegatronMIMOFeatureConfig. No breaking change
(purely additive: new config + example CLI flags; no existing key/flag/symbol changed).
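The difference between the two read paths can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the micro-batch is assumed to divide evenly by the DP size, and shards are assumed to be contiguous index ranges (the PR's sampler may partition differently).

```python
def full_read_then_slice(global_indices, dp_rank, dp_size):
    """Non-scalable path: load every sample, then keep only the local slice."""
    loaded = list(global_indices)  # every rank pays for the whole micro-batch
    per_rank = len(loaded) // dp_size
    local = loaded[dp_rank * per_rank:(dp_rank + 1) * per_rank]
    return local, len(loaded)


def shard_read(global_indices, dp_rank, dp_size):
    """Scalable path: load only the disjoint 1/dp shard owned by this rank."""
    per_rank = len(global_indices) // dp_size
    local = list(global_indices[dp_rank * per_rank:(dp_rank + 1) * per_rank])
    return local, len(local)


micro_batch = list(range(32))  # 32 samples, as in the benchmark's MBS
dp_size = 4
old = [full_read_then_slice(micro_batch, r, dp_size) for r in range(dp_size)]
new = [shard_read(micro_batch, r, dp_size) for r in range(dp_size)]

# Both paths hand each rank the same 8 samples; only the read count differs.
print([reads for _, reads in old])  # samples read per rank, old path
print([reads for _, reads in new])  # samples read per rank, new path
```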

Changelog

Config (`src/megatron/bridge/training/config.py`)

Add MegatronMIMOFeatureConfig: scalable_dp, intra_microbatch_reorder,
overlap_intra_microbatch_reorder, reorder_window_size, pack_sequences_in_batch,
cost coefficients (cost_linear_vit, cost_linear_lm), pad_token_id.
finalize() rejects negative coefficients/pad id, reorder_window_size < 1, and an all-zero cost
when reorder is enabled. Wired into ConfigContainer via a new mimo field.
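The validation rules listed above can be sketched as a small dataclass. The field names come from the changelog; the class name, the default values for the non-boolean fields, the `True` default for the overlap flag, and the use of `ValueError` are assumptions made for illustration and may not match the PR.

```python
from dataclasses import dataclass


@dataclass
class MimoFeatureConfigSketch:
    """Illustrative stand-in for MegatronMIMOFeatureConfig (not the real class)."""

    scalable_dp: bool = False
    intra_microbatch_reorder: bool = False
    overlap_intra_microbatch_reorder: bool = True  # assumed default
    reorder_window_size: int = 1                   # assumed default
    pack_sequences_in_batch: bool = False
    cost_linear_vit: float = 1.0                   # assumed default
    cost_linear_lm: float = 0.0                    # assumed default
    pad_token_id: int = 0                          # assumed default

    def finalize(self) -> None:
        """Reject the invalid combinations named in the changelog."""
        if self.cost_linear_vit < 0 or self.cost_linear_lm < 0:
            raise ValueError("cost coefficients must be non-negative")
        if self.pad_token_id < 0:
            raise ValueError("pad_token_id must be non-negative")
        if self.reorder_window_size < 1:
            raise ValueError("reorder_window_size must be >= 1")
        all_zero = self.cost_linear_vit == 0 and self.cost_linear_lm == 0
        if self.intra_microbatch_reorder and all_zero:
            raise ValueError("reorder needs at least one non-zero cost coefficient")


cfg = MimoFeatureConfigSketch(scalable_dp=True, intra_microbatch_reorder=True)
cfg.finalize()  # passes: window size 1, non-zero ViT cost
```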

Reorder engine (`src/megatron/bridge/data/megatron_mimo/`)

reorder_buffer.py (new): per-sample cost all-gather (Gloo) + ragged all_to_all (NCCL) on a
dedicated CUDA stream; ragged serialize/deserialize; balanced_assignment (contiguous-block,
het-DP canonical n_groups); split_microbatch/merge_samples; variable-images-per-sample
(cu_img, empty_like_vision); W-micro-batch reorder window with cross-window prefetch overlap.
The per-sample cost is the module-independent image-placeholder token count in input_ids
(cost = count(image_token) · spatial_merge_size²), identical on the vision and language modules and
on every PP stage — so both modules derive the same assignment with no cross-module communication.
intra_microbatch_pack.py (new): pack each language shard's real tokens into a single [1, T]
THD sequence (pack_language_shard / assemble_packed_sequence); packs position_ids/labels/
loss_mask to the same [1, T] on every PP stage so the THD rotary is sized to T.
dp_utils.py: scalable-DP sampling info (each rank reads its module-local shard); image-boundary
vision handling; non-scalable vision_dp > 1 explicitly raises (out of scope); colocate
single-consumer helpers.
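The per-sample cost formula above is simple enough to write out directly. The sketch below uses plain Python lists and a made-up `IMAGE_TOKEN_ID`; the real code presumably operates on tensors and reads the token id and `spatial_merge_size` from the model config. Because the cost depends only on `input_ids`, any rank holding the same `input_ids` computes the same values.

```python
def sample_cost(input_ids, image_token_id, spatial_merge_size):
    """cost = count(image_token) * spatial_merge_size**2 for one sample."""
    n_placeholders = sum(1 for tok in input_ids if tok == image_token_id)
    return n_placeholders * spatial_merge_size ** 2


IMAGE_TOKEN_ID = 9999  # hypothetical id, for illustration only

batch = [
    [1, 9999, 9999, 9999, 9999, 5, 6],  # 4 image placeholders
    [1, 2, 3],                          # text-only sample, zero vision cost
    [9999] * 6 + [7],                   # 6 image placeholders
]
costs = [sample_cost(ids, IMAGE_TOKEN_ID, spatial_merge_size=2) for ids in batch]
print(costs)  # [16, 0, 24]
```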

Training integration (`src/megatron/bridge/training/`)

megatron_mimo_step.py: thread the config through the forward step — scalable_dp skips the
forward-time local slice (sampler already delivered the shard); optional in-batch sequence packing.
Keeps input_ids on every language PP stage when reorder or packing is active so per-sample lengths
are derivable on stages > 0. DP loss reduction is unchanged from non-scalable runs.
train_megatron_mimo.py: build/route the scalable-DP sampler and the reorder exchange from config.

Data / packing (`src/megatron/bridge/data/`)

datasets/packing_utils.py: shared THD packing helper (placement plan across DP workers).
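As a rough picture of what packing a shard into one [1, T] THD sequence involves, the sketch below drops padding and concatenates the remaining tokens, recording cumulative sequence boundaries. It is a simplified illustration, not the PR's `pack_language_shard`: it assumes padding can be identified by `pad_token_id` alone, and it restarts plain 1-D position ids per sample, which may differ from the rotary scheme the model actually uses.

```python
def pack_shard(samples, pad_token_id):
    """Concatenate the real tokens of each sample into a single [1, T] row."""
    packed, cu_seqlens = [], [0]
    for ids in samples:
        real = [tok for tok in ids if tok != pad_token_id]
        packed.extend(real)
        cu_seqlens.append(cu_seqlens[-1] + len(real))
    # Positions restart at 0 at every sample boundary (assumed convention).
    position_ids = [
        pos
        for start, end in zip(cu_seqlens, cu_seqlens[1:])
        for pos in range(end - start)
    ]
    return [packed], cu_seqlens, [position_ids]


padded = [
    [5, 6, 7, 0, 0],
    [8, 9, 0, 0, 0],
    [1, 2, 3, 4, 0],
]
tokens, cu_seqlens, position_ids = pack_shard(padded, pad_token_id=0)
print(tokens)        # [[5, 6, 7, 8, 9, 1, 2, 3, 4]]
print(cu_seqlens)    # [0, 3, 5, 9]
print(position_ids)  # [[0, 1, 2, 0, 1, 0, 1, 2, 3]]
```

Here T = 9 instead of the 15 slots of the padded batch, and `cu_seqlens` is what lets attention keep the three samples separate.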

Example (`examples/megatron_mimo/qwen35_vl/finetune_qwen35_vl.py`)

Wire the config + CLI flags (--scalable-dp, --intra-microbatch-reorder/--no-…,
--pack-sequences-in-batch, --reorder-window-size, cost coefficients).
Guard the one remaining unsupported config: NotImplementedError for dataloader_type != "single".
PP > 1 is supported on untied checkpoints (no packing/PP guard); see Known limitations.
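The flag shapes listed above map naturally onto `argparse`. The sketch below is an assumption about how such flags could be declared, not a copy of the example script: the flag names are taken from this PR, but the defaults, the omission of the cost-coefficient flags (whose names are not given here), and the `check_dataloader_type` helper are illustrative.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="MegatronMIMO feature flags (sketch)")
    parser.add_argument("--scalable-dp", action="store_true")
    # BooleanOptionalAction generates the paired --no-... form automatically.
    parser.add_argument("--intra-microbatch-reorder",
                        action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--overlap-intra-microbatch-reorder",
                        action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--pack-sequences-in-batch", action="store_true")
    parser.add_argument("--reorder-window-size", type=int, default=1)
    return parser


def check_dataloader_type(dataloader_type):
    """Mirror the guard described above for unsupported samplers."""
    if dataloader_type != "single":
        raise NotImplementedError(
            f"dataloader_type={dataloader_type!r} is not supported; use 'single'"
        )


args = build_parser().parse_args(
    ["--scalable-dp", "--intra-microbatch-reorder",
     "--no-overlap-intra-microbatch-reorder", "--reorder-window-size", "4"]
)
check_dataloader_type("single")
print(args)
```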

Docs (`docs/`)

training/mimo-intra-microbatch-reorder.md: feature, config, validation matrix, single-node throughput,
gaps. Index entries: index.md, training/README.md.

Tests (`tests/`)

Unit: data/megatron_mimo/test_reorder_buffer.py (44), data/megatron_mimo/test_intra_microbatch_pack.py
(14), training/test_mimo_feature_config.py (14), data/datasets/test_packing_utils.py, plus updates to
training/megatron_mimo/test_megatron_mimo_step.py — all green.
Functional (2-GPU): test_groups/megatron_mimo/test_reorder_exchange.py — on-device exchange smoke.

Performance

Throughput, single 8×A100-80GB node, vision dp4 / language dp4, PP=1, TP=1, Qwen3.5-0.8B (VL) +
CORD-v2, seq 2048, sequence packing on, MBS/GBS 32 (= 8 examples/rank), patch-only cost.
500 iters, stats over elapsed time per iteration with the first 10 iters excluded (compile + the
one-time side-NCCL new_group warmup). PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,
CUDA_DEVICE_MAX_CONNECTIONS=8.

Note: Due to a PP bug regarding tied embeddings, we turn off the tie_embedding option.

| config | p10 | p25 | p50 | p75 | p90 | p99 | mean | max |
|---|---|---|---|---|---|---|---|---|
| base + pack (full-batch read, no scalable-dp) | 826 | 885 | 968 | 1089 | 1523 | 2342 | 1075 | 2659 |
| scalable read only + pack (`--no-intra-microbatch-reorder`) | 726 | 761 | 793 | 832 | 864 | 910 | 798 | 1493 |
| scalable + reorder + pack (default, overlap) | 703 | 736 | 752 | 771 | 786 | 863 | 755 | 1410 |
| scalable + reorder + pack, `--no-overlap-intra-microbatch-reorder` | 747 | 787 | 811 | 831 | 852 | 891 | 811 | 1496 |

(All values in ms/iter.)
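For readers who want to reproduce this kind of summary from their own logs, the sketch below shows one way to compute the same columns from a list of per-iteration times while skipping the first 10 warmup iterations. The PR does not state which percentile method was used; linear interpolation is an assumption, and the input data here is synthetic.

```python
import math


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize(iter_ms, warmup=10):
    """Percentiles, mean, and max over iterations after the warmup window."""
    steady = sorted(iter_ms[warmup:])
    stats = {f"p{q}": percentile(steady, q) for q in (10, 25, 50, 75, 90, 99)}
    stats["mean"] = sum(steady) / len(steady)
    stats["max"] = steady[-1]
    return stats


# Synthetic timings: 10 slow warmup iterations, then 700..800 ms.
iter_ms = [5000] * 10 + list(range(700, 801))
stats = summarize(iter_ms)
print(stats)
```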

Scalable read (base → read-only) is the dominant win, as expected — each rank reads only its 1/dp
shard instead of every rank redundantly reading and decoding all 32 samples. mean 1,075 → 798 (1.35×),
and the full-read tail collapses: p99 2,342 → 910, p90 1,523 → 864.
Reorder pays off only with overlap, as a single-digit-% gain here. With packing, each language rank
packs its mbs/dp samples into one [1, T] THD sequence whose T = Σ image-placeholder tokens, so an
uneven per-rank image load skews T. The all-to-all evens per-rank patch cost (balance probe: spread
≈1.27× → ≈1.07×, ~18–24 of 32 samples exchanged):
- read-only → reorder + overlap: mean 798 → 755 (~5%), p90 864 → 786 (~9%), p99 910 → 863 (~5%);
  inter-percentile spread (p10–p99) tightens 184 → 160 ms.
- Overlap is what makes it net-positive. Without it, the synchronous all-to-all sits on the critical
  path: --no-overlap mean 811 is slightly worse than read-only 798 and well behind overlapped 755. At this config the balancing gain is realized only because the exchange is hidden behind compute.
- We expect a larger gain under more extreme settings (more images per sample, larger images), but CORD-v2 was enough to demonstrate the effect.

The per-rank imbalance here is mild (~1.3×, CORD-v2 natural image-size variance at dp4/dp4), so balancing
buys single-digit %, not multiples. The reorder win grows with imbalance: larger DP (more ranks, so a
higher chance that one rank draws a heavy shard), larger image-size variance, and larger per-rank batch.
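The ratios quoted in this section follow directly from the table. The short check below recomputes them; the numbers are copied from the table above, and the row keys are shortened labels chosen for this sketch.

```python
rows = {
    "base": {"p10": 826, "p90": 1523, "p99": 2342, "mean": 1075},
    "read_only": {"p10": 726, "p90": 864, "p99": 910, "mean": 798},
    "reorder_overlap": {"p10": 703, "p90": 786, "p99": 863, "mean": 755},
    "reorder_no_overlap": {"p10": 747, "p90": 852, "p99": 891, "mean": 811},
}


def speedup(slow, fast, key="mean"):
    """How many times faster `fast` is than `slow` on the given statistic."""
    return rows[slow][key] / rows[fast][key]


def reduction_pct(slow, fast, key):
    """Percentage drop in ms/iter going from `slow` to `fast`."""
    return 100.0 * (1 - rows[fast][key] / rows[slow][key])


def p10_p99_spread(name):
    """Inter-percentile spread in ms."""
    return rows[name]["p99"] - rows[name]["p10"]


print(round(speedup("base", "read_only"), 2))                          # 1.35
print(round(reduction_pct("read_only", "reorder_overlap", "mean"), 1))  # 5.4
print(round(reduction_pct("read_only", "reorder_overlap", "p90"), 1))   # 9.0
print(round(reduction_pct("read_only", "reorder_overlap", "p99"), 1))   # 5.2
print(p10_p99_spread("read_only"), p10_p99_spread("reorder_overlap"))   # 184 160
```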

Known limitations / out of scope

PP>1 is supported with untied checkpoints. Reorder + PP>1 and packing + PP>1 are fixed and
verified at dp2/dp2/pp2 (lm loss < 2, tracking the no-reorder PP=2 baseline). Tied embeddings +
PP>1 still fail upstream, independent of this feature; for PP>1, use an untied checkpoint
(tie_word_embeddings=false, LM head = copy of the input embedding).
Intra-microbatch reordering does not support a non-single sampler (guarded with an explicit error),
and TP>1 is untested. CP>1 is blocked upstream (bridge_communicator asserts language-grid CP size 1).
Distributed exchange needs ≥4 GPUs → CI coverage is the unit-emulated all-to-all + a 2-GPU smoke; the
full on-device path was validated manually (16-GPU multi-node + 8×A100 single-node).
Inter-microbatch reordering (DistTrain §5.3) is out of scope for this PR and left as a follow-up: the transport code is already inter-ready, but the balancer stays intra-only.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests? (3 new unit suites + updates + 1 functional)
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? — No (no new optional deps).

Signed-off-by: Yoonsik Kim <yoonsik.kim90@navercorp.com>
Signed-off-by: Kayeon Song <kayeon.song@navercorp.com>
Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>

copy-pr-bot · 2026-07-01T06:14:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…packing, scalable DP)

Non-colocated MegatronMIMO training (e.g. Qwen3.5-VL: a separate vision encoder and language model wired by BridgeCommunicator) has a per-step data-parallel straggler: per-micro-batch vision load (patch count) is uneven across samples, so a DP rank that draws a heavy-image shard stalls the whole group every step.

Main feature
- Intra-microbatch Reordering (cf. DistTrain https://arxiv.org/abs/2408.04275, S5.2): rebalance each micro-batch's vision load across the module DP group by a per-sample cost all-gather + ragged all_to_all (GPU-resident, overlapped with compute), so per-rank vision load is even and the straggler tail is removed. Works with heterogeneous DP (vision_dp != language_dp) via a canonical n_groups pairing (vision replica r <-> language replica r) and a variable number of images per sample (0 / 1 / N).

Auxiliary features (needed to implement the main feature)
- Sequence Packing: Megatron-Core accepts packed sequences, but there was no logic to build and feed one; pack each language shard's real tokens into a single [1, T] THD sequence.
- Scalable Data Parallelism: previously only vision DP=1 was supported (every DP worker reads the full global micro-batch and slices locally). Each rank now reads only its disjoint 1/dp shard, and vision DP>1 is supported.

All three are off by default, gated by a new MegatronMIMOFeatureConfig. DP loss reduction is unchanged from non-scalable runs.

Guarded with explicit errors: in-batch packing under PP>1, non-single (cyclic/batch) sampler, TP>1, and CP>1 (CP is also blocked upstream).

Reorder under PP>1 runs but is experimental and not yet correct (vision/language mispairing under the per-stage DP groups; tied embeddings additionally hit the cross-PP embedding all-reduce); see docs/training/mimo-intra-microbatch-reorder.md.

Signed-off-by: Yoonsik Kim <yoonsik.kim90@navercorp.com>
Signed-off-by: Kayeon Song <kayeon.song@navercorp.com>
Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>

- drop the _gather_shard pass-through wrapper; call _apply_sample_dispatch directly from split_microbatch
- collapse the empty scalable_dp branch in the forward step
- factor 8-byte alignment into _pad_to_align() and use math.prod for sizes

No behavior change; reorder_buffer + intra_microbatch_pack unit suites pass (58).

Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>

github-actions Bot added the community-request label Jul 1, 2026

sailor1493 mentioned this pull request Jul 1, 2026

[feature] intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP) #4609

Open

yaoyu-33 added the area:data, feature, full-test-suite, and waiting-on-customer labels Jul 1, 2026

sailor1493 added 2 commits July 1, 2026 07:17

sailor1493 force-pushed the main branch from ddafed4 to 30e303c Compare July 1, 2026 07:21

liding-nv self-requested a review July 1, 2026 17:47

sailor1493 wants to merge 2 commits into NVIDIA-NeMo:main from sailor1493:main
