[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes by zhongbozhu · Pull Request #5464 · NVIDIA/Megatron-LM

zhongbozhu · 2026-06-24T00:19:05Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Fix moe_single_grouped_weight for BF16, MXFP8, and NVFP4 training, with FP8/FP4 primary weights turned on or off.

Mirror PR to main: #5487

TODOs:

Validate more combinations of toggles in E2E testing

Unit tests with numerical checks pass; E2E validation is pending. test_single_grouped_mxfp8_train_eval_train_matches_train_only is a newly introduced test that exercises reuse_grad_buf_for_mxfp8_param_ag more rigorously, for example by adding checks for train-eval-train switches.

Unit test coverage matrix:

| Precision | Primary Weight Path | Grad Accum Fusion | Comparison | Notes / Transformer Config |
| --- | --- | --- | --- | --- |
| BF16 | BF16 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8=None` `fp4=None` `gradient_accumulation_fusion=False` |
| BF16 | BF16 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8=None` `fp4=None` `gradient_accumulation_fusion=True` |
| MXFP8 | BF16 primary weight, MXFP8 compute | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=False` `reuse_grad_buf_for_mxfp8_param_ag=False` `gradient_accumulation_fusion=False` |
| MXFP8 | BF16 primary weight, MXFP8 compute | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=False` `reuse_grad_buf_for_mxfp8_param_ag=False` `gradient_accumulation_fusion=True` |
| MXFP8 | MXFP8 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=True` `reuse_grad_buf_for_mxfp8_param_ag=True` `gradient_accumulation_fusion=False` |
| MXFP8 | MXFP8 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=True` `reuse_grad_buf_for_mxfp8_param_ag=True` `gradient_accumulation_fusion=True` |
| NVFP4 | BF16 primary weight, NVFP4 compute | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=False` `gradient_accumulation_fusion=False` |
| NVFP4 | BF16 primary weight, NVFP4 compute | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=False` `gradient_accumulation_fusion=True` |
| NVFP4 | NVFP4 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=True` `gradient_accumulation_fusion=False` |
| NVFP4 | NVFP4 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=True` `gradient_accumulation_fusion=True` |

Env: 1 x GB200 node with 4 GPUs; the unit test uses only 2 parallel ranks.

Command:

torchrun --nproc_per_node=2 --log-dir /tmp/mcore-single-weight-ut --tee 0:3 --redirects 3 -m pytest -s -q tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py

[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_ddp_param_data_remap_data_ptr[bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_ddp_param_data_remap_data_ptr[nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_mxfp8_train_eval_train_matches_train_only
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-only-single]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-single-load-discrete]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-discrete-load-single]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-nvfp4]

kunlunl · 2026-06-24T14:52:30Z

/claude strict-review

claude · 2026-06-24T14:57:43Z

Code Review Summary

CRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3

Overall Assessment

This is a well-structured fix for moe_single_grouped_weight across BF16, MXFP8, and NVFP4 with thorough test coverage. The data-flow through quantized/non-quantized param paths in the DDP buffer and distributed optimizer is correct. The register_grouped_linear_params refactor properly addresses the root cause (TE overwriting DDP-managed parameters with fresh meta tensors). The torch.no_grad() additions are necessary to prevent autograd tracking on buffer management ops with tensor subclasses.

Risk level: Low-Medium. The changes are narrowly scoped to the GroupedTensor integration paths and gated behind moe_single_grouped_weight. The FSDP guard is a good safeguard. The numerical parity tests cover the full precision × param-gather × grad-accum-fusion matrix.

Key Findings

IMPORTANT — Unused _unwrap_parameter_data on DistributedOptimizer (distrib_optimizer.py:1118-1121)
Added as a @staticmethod but never called. Duplicates the function in fp8_utils.py. Should be removed (inline comment posted with suggestion block).

IMPORTANT — is_nvfp4tensor not updated to unwrap Parameters (fp4_utils.py:58-60)
The PR updates is_float8tensor and is_mxfp8tensor to handle torch.nn.Parameter-wrapped TE subclasses via _is_instance_or_param_data, but is_nvfp4tensor still uses plain isinstance. This inconsistency could misclassify a Parameter-wrapped NVFP4Tensor in _param_uses_quantized_storage. The fix is straightforward — this file already imports _is_instance_or_param_data indirectly through the fp8_utils imports added in this PR. (Couldn't post inline since these lines aren't in the diff.)

Suggested fix:

def is_nvfp4tensor(tensor: torch.Tensor) -> bool:
    """Check if a tensor is a Transformer Engine NVFP4Tensor."""
    return HAVE_TE_FP4_TENSOR_CLASS and _is_instance_or_param_data(tensor, FP4_TENSOR_CLASS)

Suggestions (posted inline)

copy_tensor_to_quantized_param: document that the plain copy_ fallback relies on TE's overridden method
register_grouped_linear_params: consider clearing stale "weight" in the per-index branch for symmetry
modify_grouped_nvfp4_rowwise_storage: add comment explaining why member views are refreshed eagerly (vs. lazily in the MXFP8 counterpart)

kunlunl · 2026-06-24T14:57:59Z

                        bucket.layerwise_params_list[local_rank]
                    ).detach()
-                    local_slot_view.copy_(flat_local_params)
+                    with torch.no_grad():


Why is this with torch.no_grad() needed?

Removed; I believe they are redundant.

We need to use torch.no_grad() when the mutation is intentional and should not affect gradients.
It looks like you removed this everywhere. Not sure whether this matters, but:
When I tried implementing this feature some time ago, I got the error below in the MXFP8 reuse-grad-buffer case when doing some copying.
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
We need to make sure the unit tests still pass after this change.

Good catch, the E2E test hits a bug that the UT does not capture.

It should be resolved now.
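
For readers following this thread, here is a minimal sketch of the error quoted above, using plain PyTorch only. The names `param` and `new_values` are illustrative stand-ins, not the buffer views touched in this PR; the point is only that an in-place copy into a leaf tensor that requires grad is rejected unless it runs under torch.no_grad().

```python
import torch

# Stand-in for a DDP-managed parameter (assumption: the real code copies into buffer views).
param = torch.nn.Parameter(torch.zeros(4))
new_values = torch.ones(4)

# Outside no_grad, autograd rejects in-place writes to a leaf that requires grad.
try:
    param.copy_(new_values)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

# Inside no_grad, the same copy is treated as an untracked buffer update.
with torch.no_grad():
    param.copy_(new_values)

print(error_message)
print(param.requires_grad, param.grad_fn)
```

The parameter still requires grad afterwards and has no grad_fn, which is the behavior wanted for intentional buffer management writes.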

zhongbozhu · 2026-06-24T20:00:31Z

/ok to test 509c7a6

zhongbozhu · 2026-06-24T20:26:34Z

Note: the GB200 unit test was added in #5477 but is not yet synced to dev.

zhongbozhu · 2026-06-28T06:23:51Z

E2E test with Qwen3.5 VL 35B-A3B SFT - branch dev_fix_single_weight

Before this PR, MoE single weight simply diverged; now it converges well.

The performance benefit comes from lower CPU overhead when quantizing to MXFP8 in the distributed optimizer. In addition, CUDA Graphs can be hard to enable for multimodal SFT as of today.

The green plot (before this PR) had grad norm spikes. With reuse_grad_buf_for_mxfp8_param_ag enabled, the training step right after eval did not clear the param_data buffer: the all-gather had already been done in eval, so it was skipped, and unfortunately the buffer-zeroing operation was skipped along with it.
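
The mechanism can be illustrated with a toy model in pure Python. This is a sketch under simplifying assumptions, not the Megatron-LM implementation: `ToyBucket`, `shared_buf`, and `params_synced` are hypothetical names, and the method `reset_param_sync_dispatch_state` only borrows its name from this PR's diff. It assumes one buffer serves both as the param all-gather target and as the gradient accumulation buffer, and that zeroing happens only as cleanup after an all-gather.

```python
import math


class ToyBucket:
    """Toy model of a bucket whose grad buffer doubles as the param all-gather buffer."""

    def __init__(self, params):
        self.params = list(params)
        self.shared_buf = [0.0] * len(self.params)
        self.params_synced = False  # hypothetical flag: "all-gather already done"

    def all_gather(self):
        # The all-gather lands parameter values in the shared buffer.
        self.shared_buf[:] = self.params
        self.params_synced = True

    def eval_step(self):
        # Eval forces a param sync and never accumulates gradients.
        self.all_gather()

    def reset_param_sync_dispatch_state(self):
        # Forget the eval-time sync so the next training step redoes it.
        self.params_synced = False

    def train_step(self, grads):
        if not self.params_synced:
            self.all_gather()
            # Post-sync cleanup: zero the shared buffer before grad accumulation.
            self.shared_buf = [0.0] * len(self.params)
        # If the sync was skipped, the cleanup above was skipped with it.
        for i, g in enumerate(grads):
            self.shared_buf[i] += g
        grad_norm = math.sqrt(sum(x * x for x in self.shared_buf))
        self.params_synced = False  # the optimizer step invalidates the sync
        return grad_norm


def run(reset_after_eval):
    bucket = ToyBucket(params=[3.0, 4.0])
    norms = [bucket.train_step([0.6, 0.8])]
    bucket.eval_step()
    if reset_after_eval:
        bucket.reset_param_sync_dispatch_state()
    norms.append(bucket.train_step([0.6, 0.8]))
    return norms


buggy_norms = run(reset_after_eval=False)
fixed_norms = run(reset_after_eval=True)
print(buggy_norms, fixed_norms)
```

In the toy model, the step after eval accumulates gradients on top of the gathered parameter values unless the sync state is reset, which shows up as a much larger grad norm for that one step.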

zhongbozhu · 2026-06-28T06:43:16Z

E2E performance benefit shown in Nsys: time spent looping over the MoE weights in the optimizer master weights and quantizing them to MXFP8, discrete weight vs. single weight.

Discrete

Single

kunlunl · 2026-06-30T06:51:25Z

+            if single_grouped_bias:
+                op.register_parameter("bias", linear.get_parameter("bias"))
+                for idx in range(linear.num_gemms):
+                    op.register_parameter(f"bias{idx}", None)


Is None intended to be an observable inactive state here? register_parameter(name, None) excludes the entry from named_parameters(), but the attribute still exists.

    import torch
    m = torch.nn.Module()
    m.register_parameter("bias0", None)
    print(hasattr(m, "bias0"))  # True
    print(dict(m.named_parameters()))  # {}

The existing test_make_fused_ops_attaches_single_grouped_bias_for_fc1 checks that bias0 is absent. If None is the intended representation, could that test assert bias0 is None; otherwise, should the stale attribute be removed?

kunlunl · 2026-06-30T07:12:52Z

+        # dispatched" state. The next forward pre-hook must run post-sync cleanup,
+        # especially when MXFP8 reuses grad_data as the param AG buffer.
+        for model_chunk in self.model_chunks:
+            model_chunk.reset_param_sync_dispatch_state()


Could this be reached with an unfinished param_gather_handle?

Need to double check this one, not sure.

These two lines are needed to avoid the grad_norm spikes. In a training - eval - training sequence, we used to skip the AG for the first training step after eval, because eval already does a forced sync with AG. But this also skipped the zero-grad-buffer operation, so gradient accumulation was done on dirty buffers.

This fix is simple: it triggers a redundant AG, which in turn zeroes out the shared buffer before the same buffer is used for gradient accumulation.

Actually, there should be a better fix than this, one that skips the redundant AG while also eliminating the grad norm spikes.

Maybe add an assertion inside the function to make sure the all-gather handle is None?
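
A rough sketch of what such an assertion could look like, on a toy class rather than the real model chunk. `ToyParamSyncState` and the flag `param_gather_dispatched` are hypothetical; only the names `param_gather_handle` and `reset_param_sync_dispatch_state` come from this thread, and the real method may track different state.

```python
class ToyParamSyncState:
    """Toy stand-in for the per-chunk param sync bookkeeping (names are illustrative)."""

    def __init__(self):
        self.param_gather_handle = None       # in-flight async all-gather, if any
        self.param_gather_dispatched = False  # hypothetical "already dispatched" flag

    def reset_param_sync_dispatch_state(self):
        # Refuse to reset while an all-gather may still be writing into the buffer.
        assert self.param_gather_handle is None, (
            "reset_param_sync_dispatch_state() called with an unfinished param all-gather"
        )
        self.param_gather_dispatched = False


state = ToyParamSyncState()
state.param_gather_dispatched = True
state.reset_param_sync_dispatch_state()
dispatched_after_reset = state.param_gather_dispatched

# Simulate an unfinished all-gather: the reset should now fail loudly.
state.param_gather_handle = object()
try:
    state.reset_param_sync_dispatch_state()
    raised = False
except AssertionError:
    raised = True

print(dispatched_after_reset, raised)
```

Failing loudly here would turn a silent dirty-buffer hazard into an immediate error, at the cost of relying on assertions being enabled at runtime.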

kunlunl · 2026-07-01T06:41:31Z

/ok to test 812f72d

zhongbozhu requested review from a team as code owners June 24, 2026 00:19

zhongbozhu requested review from WanZzzzzz and kunlunl June 24, 2026 00:31

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/optimizer/distrib_optimizer.py Outdated

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/fp8_utils.py

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/transformer/moe/experts.py

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/fp4_utils.py

kunlunl reviewed Jun 24, 2026

zhongbozhu mentioned this pull request Jun 24, 2026

[Main] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes #5487

zhongbozhu force-pushed the dev_fix_single_weight branch from 7cae31a to 0456abf Compare June 24, 2026 20:19

WanZzzzzz approved these changes Jun 26, 2026

zhongbozhu added 6 commits June 26, 2026 16:18

fix single weight - first draft

3c3199a

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

update unit test

b183b83

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

fix for gradient_accumulation_fusion

8d1a8ff

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

checks all ranks

b532964

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

increase UT coverage

5b879dc

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

resolve comments

c24ff37

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

run UT in CI

79706bb

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

zhongbozhu changed the title ~~[Dev] Fix moe single grouped weight feature with fp8 fp4 primary weight support~~ [Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes Jun 28, 2026

lint

a73c86b

Signed-off-by: zhongboz <zhongboz@nvidia.com>

zhongbozhu added 2 commits June 28, 2026 18:10

include checkpointing to the unit test

faa033b

Signed-off-by: zhongboz <zhongboz@nvidia.com>

reapply NVIDIA#4994

a1f8ea2

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>