
feat(data): intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP)#4608

Open
sailor1493 wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
sailor1493:main


@sailor1493 sailor1493 commented Jul 1, 2026


What does this PR do ?

Closes #4609

Main Feature

  • Intra-microbatch Reordering (DistTrain §5.2): rebalance each
    micro-batch's vision load across the module DP group by a per-sample cost all-gather + ragged
    all_to_all (GPU-resident, overlapped with compute) so the per-step DP straggler is removed.

Auxiliary Features (needed to implement the main feature)

  • Sequence Packing — Megatron Core accepts packed sequences, but there was no logic to actually
    build and feed a packed sequence; added it ([1, T] THD).
  • Scalable Data Parallelism — previously only vision DP=1 was supported (every DP worker reads the
    full global micro-batch and slices its own part locally → CPU/IO overhead). Now each rank reads
    only its disjoint 1/dp shard, and vision DP>1 is supported.
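
As a rough illustration of the read pattern change described above, the sketch below contrasts the two approaches. The function names `full_read_then_slice` and `shard_read` are hypothetical, and contiguous equal-sized shards are an assumption; the actual sampler in this PR may partition differently.

```python
def full_read_then_slice(global_indices, dp_rank, dp_size):
    """Old pattern: every rank loads the whole micro-batch, then keeps its slice."""
    loaded = list(global_indices)  # all samples are read and decoded on every rank
    per_rank = len(loaded) // dp_size
    kept = loaded[dp_rank * per_rank:(dp_rank + 1) * per_rank]
    return loaded, kept


def shard_read(global_indices, dp_rank, dp_size):
    """Scalable pattern: a rank loads only its disjoint 1/dp shard."""
    if len(global_indices) % dp_size != 0:
        raise ValueError("micro-batch size must be divisible by dp_size")
    per_rank = len(global_indices) // dp_size
    shard = list(global_indices[dp_rank * per_rank:(dp_rank + 1) * per_rank])
    return shard, shard  # loaded == kept: nothing is read and then discarded


indices = list(range(32))  # MBS 32, as in the benchmark below
loaded_old, kept_old = full_read_then_slice(indices, dp_rank=1, dp_size=4)
loaded_new, kept_new = shard_read(indices, dp_rank=1, dp_size=4)
print(len(loaded_old), len(loaded_new))  # 32 8
```

Both patterns hand the same samples to the forward step; only the number of samples each rank has to read and decode changes.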

All three are off by default, gated by the new MegatronMIMOFeatureConfig. No breaking change
(purely additive: new config + example CLI flags; no existing key/flag/symbol changed).

Changelog

Config (src/megatron/bridge/training/config.py)

  • Add MegatronMIMOFeatureConfig: scalable_dp, intra_microbatch_reorder,
    overlap_intra_microbatch_reorder, reorder_window_size, pack_sequences_in_batch,
    cost coefficients (cost_linear_vit, cost_linear_lm), pad_token_id.
  • finalize() rejects negative coefficients/pad id, reorder_window_size < 1, and an all-zero cost
    when reorder is enabled. Wired into ConfigContainer via a new mimo field.
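
A minimal sketch of the validation rules listed above, for readers who want to see them in one place. The class name `MimoFeatureConfigSketch` is a hypothetical stand-in, not the class added by this PR; field names follow the changelog, while every default other than "features off" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MimoFeatureConfigSketch:
    """Hypothetical stand-in for the feature config; defaults are assumed."""

    scalable_dp: bool = False
    intra_microbatch_reorder: bool = False
    overlap_intra_microbatch_reorder: bool = True  # assumed default
    reorder_window_size: int = 1                   # assumed default
    pack_sequences_in_batch: bool = False
    cost_linear_vit: float = 1.0                   # assumed default
    cost_linear_lm: float = 0.0                    # assumed default
    pad_token_id: int = 0                          # assumed default

    def finalize(self) -> None:
        # Reject negative cost coefficients and a negative pad id.
        if self.cost_linear_vit < 0 or self.cost_linear_lm < 0:
            raise ValueError("cost coefficients must be non-negative")
        if self.pad_token_id < 0:
            raise ValueError("pad_token_id must be non-negative")
        # The reorder window must cover at least one micro-batch.
        if self.reorder_window_size < 1:
            raise ValueError("reorder_window_size must be >= 1")
        # With every coefficient at zero all samples cost the same,
        # so there would be nothing to balance.
        if (
            self.intra_microbatch_reorder
            and self.cost_linear_vit == 0
            and self.cost_linear_lm == 0
        ):
            raise ValueError("reorder needs at least one non-zero cost coefficient")


cfg = MimoFeatureConfigSketch(scalable_dp=True, intra_microbatch_reorder=True)
cfg.finalize()  # passes: one cost coefficient is non-zero
```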

Reorder engine (src/megatron/bridge/data/megatron_mimo/)

  • reorder_buffer.py (new): per-sample cost all-gather (Gloo) + ragged all_to_all (NCCL) on a
    dedicated CUDA stream; ragged serialize/deserialize; balanced_assignment (contiguous-block,
    het-DP canonical n_groups); split_microbatch/merge_samples; variable-images-per-sample
    (cu_img, empty_like_vision); W-micro-batch reorder window with cross-window prefetch overlap.
  • The per-sample cost is the module-independent image-placeholder token count in input_ids
    (cost = count(image_token) · spatial_merge_size²), identical on the vision and language modules and
    on every PP stage — so both modules derive the same assignment with no cross-module communication.
  • intra_microbatch_pack.py (new): pack each language shard's real tokens into a single [1, T]
    THD sequence (pack_language_shard / assemble_packed_sequence); packs position_ids/labels/
    loss_mask to the same [1, T] on every PP stage so the THD rotary is sized to T.
  • dp_utils.py: scalable-DP sampling info (each rank reads its module-local shard); image-boundary
    vision handling; non-scalable vision_dp > 1 explicitly raises (out of scope); colocate
    single-consumer helpers.
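
To make the cost formula and the "same assignment everywhere" property concrete, here is a small pure-Python sketch. `sample_cost` follows the formula above. `greedy_assignment` is NOT the PR's contiguous-block `balanced_assignment`; it is a simpler greedy heuristic (largest cost first, to the least-loaded rank) swapped in for illustration. The token id 9 and all function names are hypothetical.

```python
import heapq


def sample_cost(input_ids, image_token_id, spatial_merge_size):
    """cost = count(image_token) * spatial_merge_size**2, derived from input_ids only."""
    n_placeholders = sum(1 for t in input_ids if t == image_token_id)
    return n_placeholders * spatial_merge_size ** 2


def greedy_assignment(costs, n_ranks):
    """Illustrative balancer: largest cost first, always to the least-loaded rank.

    It is deterministic in (costs, n_ranks), so every rank that holds the same
    all-gathered cost list computes the same assignment without further communication.
    """
    order = sorted(range(len(costs)), key=lambda i: (-costs[i], i))
    heap = [(0, r) for r in range(n_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_ranks)]
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + costs[i], rank))
    return assignment


# One sample with three image placeholders and spatial_merge_size = 2.
cost = sample_cost([1, 9, 9, 9, 2], image_token_id=9, spatial_merge_size=2)

# Eight samples on two ranks; the two heavy samples start on the same rank.
costs = [40, 36, 4, 4, 4, 4, 4, 4]
before = [sum(costs[:4]), sum(costs[4:])]
assignment = greedy_assignment(costs, n_ranks=2)
after = [sum(costs[i] for i in shard) for shard in assignment]
print(cost, before, after)  # 12 [84, 16] [52, 48]
```

Because the cost depends only on `input_ids`, the vision and language sides can each run the same function on the same gathered list and arrive at the same placement.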

Training integration (src/megatron/bridge/training/)

  • megatron_mimo_step.py: thread the config through the forward step — scalable_dp skips the
    forward-time local slice (sampler already delivered the shard); optional in-batch sequence packing.
    Keeps input_ids on every language PP stage when reorder or packing is active so per-sample lengths
    are derivable on stages > 0. DP loss reduction is unchanged from non-scalable runs.
  • train_megatron_mimo.py: build/route the scalable-DP sampler and the reorder exchange from config.

Data / packing (src/megatron/bridge/data/)

  • datasets/packing_utils.py: shared THD packing helper (placement plan across DP workers).
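
For readers unfamiliar with the THD layout, the sketch below shows the core idea of packing a padded shard into a single [1, T] sequence with cumulative sequence boundaries. `pack_shard` is a hypothetical name, and identifying padding by comparing against `pad_token_id` is an assumption; the helpers in this PR may use a mask and also pack labels, loss_mask, and position_ids.

```python
def pack_shard(input_ids, pad_token_id):
    """Concatenate the real (non-pad) tokens of each row into one [1, T] sequence.

    Returns the packed sequence and cu_seqlens, where cu_seqlens[i]:cu_seqlens[i + 1]
    is the span of sample i inside the packed sequence.
    """
    packed = []
    cu_seqlens = [0]
    for row in input_ids:
        real = [t for t in row if t != pad_token_id]
        packed.extend(real)
        cu_seqlens.append(len(packed))
    return [packed], cu_seqlens


# Three samples padded to length 5 with pad id 0.
shard = [
    [5, 6, 7, 0, 0],
    [8, 9, 0, 0, 0],
    [1, 2, 3, 4, 0],
]
packed, cu_seqlens = pack_shard(shard, pad_token_id=0)
print(packed)      # [[5, 6, 7, 8, 9, 1, 2, 3, 4]]
print(cu_seqlens)  # [0, 3, 5, 9]
```

Here T = 9 instead of the 15 padded positions, and a rank whose samples carry more real tokens ends up with a longer T, which is the imbalance the reorder step targets.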

Example (examples/megatron_mimo/qwen35_vl/finetune_qwen35_vl.py)

  • Wire the config + CLI flags (--scalable-dp, --intra-microbatch-reorder/--no-…,
    --pack-sequences-in-batch, --reorder-window-size, cost coefficients).
  • Guard the one remaining unsupported config: NotImplementedError for dataloader_type != "single".
    PP > 1 is supported on untied checkpoints (no packing/PP guard); see Known limitations.
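
The flag wiring can be pictured roughly as follows. This is a sketch using the standard library `argparse`, not the example script's actual code: the flag names are the ones listed in this PR, the positive `--overlap-intra-microbatch-reorder` form and all defaults are assumptions, and the cost-coefficient flags are left out because their names are not given here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the feature flags named in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scalable-dp", action="store_true")
    # BooleanOptionalAction (Python 3.9+) also generates the --no-... form.
    parser.add_argument(
        "--intra-microbatch-reorder",
        action=argparse.BooleanOptionalAction,
        default=False,  # assumed default
    )
    parser.add_argument(
        "--overlap-intra-microbatch-reorder",
        action=argparse.BooleanOptionalAction,
        default=True,  # assumed default
    )
    parser.add_argument("--pack-sequences-in-batch", action="store_true")
    parser.add_argument("--reorder-window-size", type=int, default=1)  # assumed default
    return parser


args = build_parser().parse_args(
    [
        "--scalable-dp",
        "--intra-microbatch-reorder",
        "--no-overlap-intra-microbatch-reorder",
        "--pack-sequences-in-batch",
        "--reorder-window-size",
        "4",
    ]
)
print(args.scalable_dp, args.overlap_intra_microbatch_reorder, args.reorder_window_size)
```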

Docs (docs/)

  • training/mimo-intra-microbatch-reorder.md: feature, config, validation matrix, single-node throughput,
    gaps. Index entries: index.md, training/README.md.

Tests (tests/)

  • Unit: data/megatron_mimo/test_reorder_buffer.py (44), data/megatron_mimo/test_intra_microbatch_pack.py
    (14), training/test_mimo_feature_config.py (14), data/datasets/test_packing_utils.py, plus updates to
    training/megatron_mimo/test_megatron_mimo_step.py — all green.
  • Functional (2-GPU): test_groups/megatron_mimo/test_reorder_exchange.py — on-device exchange smoke.

Performance

Throughput, single 8×A100-80GB node, vision dp4 / language dp4, PP=1, TP=1, Qwen3.5-0.8B (VL) +
CORD-v2, seq 2048, sequence packing on, MBS/GBS 32 (= 8 examples/rank), patch-only cost.
500 iters, stats over elapsed time per iteration with the first 10 iters excluded (compile + the
one-time side-NCCL new_group warmup). PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,
CUDA_DEVICE_MAX_CONNECTIONS=8.

Note: Due to a PP bug with tied embeddings, we turn off the tie_embedding option.

| config | p10 | p25 | p50 | p75 | p90 | p99 | mean | max |
|---|---|---|---|---|---|---|---|---|
| base + pack (full-batch read, no scalable-dp) | 826 | 885 | 968 | 1089 | 1523 | 2342 | 1075 | 2659 |
| scalable read only + pack (--no-intra-microbatch-reorder) | 726 | 761 | 793 | 832 | 864 | 910 | 798 | 1493 |
| scalable + reorder + pack (default, overlap) | 703 | 736 | 752 | 771 | 786 | 863 | 755 | 1410 |
| scalable + reorder + pack, --no-overlap-intra-microbatch-reorder | 747 | 787 | 811 | 831 | 852 | 891 | 811 | 1496 |

(All values in ms/iter.)

  1. Scalable read (base → read-only) is the dominant win, as expected — each rank reads only its 1/dp
    shard instead of every rank redundantly reading and decoding all 32 samples. mean 1,075 → 798 (1.35×),
    and the full-read tail collapses: p99 2,342 → 910, p90 1,523 → 864.
  2. Reorder pays off only with overlap, as a single-digit-% gain here. With packing, each language rank
    packs its mbs/dp samples into one [1, T] THD sequence whose T = Σ image-placeholder tokens, so an
    uneven per-rank image load skews T. The all-to-all evens per-rank patch cost (balance probe: spread
    ≈1.27× → ≈1.07×, ~18–24 of 32 samples exchanged):
    • read-only → reorder + overlap: mean 798 → 755 (~5%), p90 864 → 786 (~9%), p99 910 → 863 (~5%);
      inter-percentile spread (p10–p99) tightens 184 → 160 ms.
    • Overlap is what makes it net-positive. Without it, the synchronous all-to-all sits on the critical
      path: --no-overlap mean 811 is slightly worse than read-only (798) and well behind overlapped (755).
      At this config, the balancing gain is real only because the exchange is hidden behind compute.
    • We expect a larger gain in more extreme settings (more images per sample, larger images), but
      CORD-v2 was enough to demonstrate the effect.
  3. The per-rank imbalance here is mild (~1.3×, CORD-v2 natural image-size variance at dp4/dp4), so balancing
    buys single-digit %, not multiples. The reorder win grows with imbalance — larger DP (more ranks →
    higher chance one rank draws a heavy shard), larger image-size variance, and larger per-rank batch.
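
The ratios quoted in the list above can be re-derived from the table with a few lines of arithmetic. The dictionary keys are shorthand labels chosen for this snippet; the numbers are copied from the table.

```python
# ms/iter values copied from the throughput table.
stats = {
    "base":       {"p10": 826, "p90": 1523, "p99": 2342, "mean": 1075},
    "read_only":  {"p10": 726, "p90": 864,  "p99": 910,  "mean": 798},
    "overlap":    {"p10": 703, "p90": 786,  "p99": 863,  "mean": 755},
    "no_overlap": {"p10": 747, "p90": 852,  "p99": 891,  "mean": 811},
}


def pct_gain(before, after):
    """Relative reduction in ms/iter, in percent."""
    return (before - after) / before * 100


scalable_speedup = stats["base"]["mean"] / stats["read_only"]["mean"]
reorder_gain = {
    key: pct_gain(stats["read_only"][key], stats["overlap"][key])
    for key in ("mean", "p90", "p99")
}
spread = {
    name: stats[name]["p99"] - stats[name]["p10"]
    for name in ("read_only", "overlap")
}

print(round(scalable_speedup, 2))                       # 1.35
print({k: round(v, 1) for k, v in reorder_gain.items()})  # {'mean': 5.4, 'p90': 9.0, 'p99': 5.2}
print(spread)                                           # {'read_only': 184, 'overlap': 160}
```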

Known limitations / out of scope

  • PP>1 is supported with untied checkpoints. Reorder + PP>1 and packing + PP>1 are fixed and
    verified at dp2/dp2/pp2 (lm loss < 2, tracking the no-reorder PP=2 baseline). Tied embeddings with
    PP>1 still fail upstream, independent of this feature; for PP>1, use an untied checkpoint
    (tie_word_embeddings=false, LM head = copy of the input embedding).
  • Intra-microbatch reordering does not support a non-single sampler (guarded with an explicit error),
    and TP>1 is untested. CP>1 is blocked upstream (bridge_communicator asserts language-grid CP size 1).
  • Distributed exchange needs ≥4 GPUs → CI coverage is the unit-emulated all-to-all + a 2-GPU smoke; the
    full on-device path was validated manually (16-GPU multi-node + 8×A100 single-node).
  • Inter-microbatch reordering (DistTrain §5.3) is a future follow-up and out of scope for this PR: the
    transport code is already written to carry inter-microbatch exchanges, but the balancer stays intra-only.

GitHub Actions CI

See the CI section
in the Contributing doc for how to trigger the CI.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? (3 new unit suites + updates + 1 functional)
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? — No (no new optional deps).

Signed-off-by: Yoonsik Kim <yoonsik.kim90@navercorp.com>
Signed-off-by: Kayeon Song <kayeon.song@navercorp.com>
Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>


copy-pr-bot Bot commented Jul 1, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 added labels on Jul 1, 2026: area:data (Dataset builders, preprocessing, and samplers), feature (New capabilities, enhancements, or enablement work), full-test-suite, waiting-on-customer (Waiting on the original author to respond)
…packing, scalable DP)

Non-colocated MegatronMIMO training (e.g. Qwen3.5-VL: a separate vision encoder
and language model wired by BridgeCommunicator) has a per-step data-parallel
straggler: per-micro-batch vision load (patch count) is uneven across samples, so
a DP rank that draws a heavy-image shard stalls the whole group every step.

Main feature
- Intra-microbatch Reordering (cf. DistTrain https://arxiv.org/abs/2408.04275,
  S5.2): rebalance each micro-batch's vision load across the module DP group by a
  per-sample cost all-gather + ragged all_to_all (GPU-resident, overlapped with
  compute), so per-rank vision load is even and the straggler tail is removed.
  Works with heterogeneous DP (vision_dp != language_dp) via a canonical n_groups
  pairing (vision replica r <-> language replica r) and a variable number of
  images per sample (0 / 1 / N).

Auxiliary features (needed to implement the main feature)
- Sequence Packing: Megatron-Core accepts packed sequences, but there was no
  logic to build and feed one; pack each language shard's real tokens into a
  single [1, T] THD sequence.
- Scalable Data Parallelism: previously only vision DP=1 was supported (every DP
  worker reads the full global micro-batch and slices locally). Each rank now
  reads only its disjoint 1/dp shard, and vision DP>1 is supported.

All three are off by default, gated by a new MegatronMIMOFeatureConfig. DP loss
reduction is unchanged from non-scalable runs. Guarded with explicit errors:
in-batch packing under PP>1, non-single (cyclic/batch) sampler, TP>1, and CP>1
(CP is also blocked upstream). Reorder under PP>1 runs but is experimental and
not yet correct (vision/language mispairing under the per-stage DP groups; tied
embeddings additionally hit the cross-PP embedding all-reduce); see
docs/training/mimo-intra-microbatch-reorder.md.

Signed-off-by: Yoonsik Kim <yoonsik.kim90@navercorp.com>
Signed-off-by: Kayeon Song <kayeon.song@navercorp.com>
Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>
- drop the _gather_shard pass-through wrapper; call _apply_sample_dispatch
  directly from split_microbatch
- collapse the empty scalable_dp branch in the forward step
- factor 8-byte alignment into _pad_to_align() and use math.prod for sizes

No behavior change; reorder_buffer + intra_microbatch_pack unit suites
pass (58).

Signed-off-by: Chanwoo Park <chanwoo.park98@navercorp.com>

Labels

area:data (Dataset builders, preprocessing, and samplers), community-request, feature (New capabilities, enhancements, or enablement work), full-test-suite, waiting-on-customer (Waiting on the original author to respond)

Development

Successfully merging this pull request may close these issues.

[feature] intra-microbatch reordering for MegatronMIMO (+ sequence packing, scalable DP)

3 participants