[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes#5464
[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes#5464zhongbozhu wants to merge 19 commits into
Conversation
|
/claude strict-review |
Code Review SummaryCRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3 Overall AssessmentThis is a well-structured fix for Risk level: Low-Medium. The changes are narrowly scoped to the GroupedTensor integration paths and gated behind Key FindingsIMPORTANT — Unused IMPORTANT — Suggested fix: def is_nvfp4tensor(tensor: torch.Tensor) -> bool:
"""Check if a tensor is a Transformer Engine NVFP4Tensor."""
return HAVE_TE_FP4_TENSOR_CLASS and _is_instance_or_param_data(tensor, FP4_TENSOR_CLASS)Suggestions (posted inline)
|
| bucket.layerwise_params_list[local_rank] | ||
| ).detach() | ||
| local_slot_view.copy_(flat_local_params) | ||
| with torch.no_grad(): |
There was a problem hiding this comment.
Why this with torch.no_grad() needed?
There was a problem hiding this comment.
removed, they are redundant I believe
There was a problem hiding this comment.
we need to use torch.no_grad() when the mutation is intentional and should not affect gradients.
Looks like you removed this everywhere. Not sure if this matters, but:
I tried implementing this feature some time ago and I got below error in the past in mxfp8 reuse grad buffer case when doing some copying.
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
Need to make sure unit test still pass after this change.
There was a problem hiding this comment.
good catch, there is a bug in E2E test not captured in UT
There was a problem hiding this comment.
should be resolved now
|
/ok to test 509c7a6 |
7cae31a to
0456abf
Compare
|
Note: GB200 unit test was added #5477 but not yet synced to dev |
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
| if single_grouped_bias: | ||
| op.register_parameter("bias", linear.get_parameter("bias")) | ||
| for idx in range(linear.num_gemms): | ||
| op.register_parameter(f"bias{idx}", None) |
There was a problem hiding this comment.
Is None intended to be an observable inactive state here? register_parameter(name, None) excludes the entry from named_parameters(), but the attribute still exists.
import torch
m = torch.nn.Module()
m.register_parameter("bias0", None)
print(hasattr(m, "bias0")) # True
print(dict(m.named_parameters())) # {}The existing test_make_fused_ops_attaches_single_grouped_bias_for_fc1 checks that bias0 is absent. If None is the intended representation, could that test assert bias0 is None; otherwise, should the stale attribute be removed?
| # dispatched" state. The next forward pre-hook must run post-sync cleanup, | ||
| # especially when MXFP8 reuses grad_data as the param AG buffer. | ||
| for model_chunk in self.model_chunks: | ||
| model_chunk.reset_param_sync_dispatch_state() |
There was a problem hiding this comment.
Could this be reached with a not finished param_gather_handle?
There was a problem hiding this comment.
Could this be reached with a not finished param_gather_handle?
Need to double check this one, not sure.
These two lines are needed to avoid the grad_norm spikes. Because if we have training - eval -training phases, we used to skip the AG for the first training step because eval will already do a forced sync with AG. But this will also skip the zero grad buf operation, so grad accumulation is done on dirty buffers.
This fix is simple, this will trigger a redundant AG, which will then clear the shared buffer for zeroing out the same buffer for gradient accumulation.
There was a problem hiding this comment.
Actually there should be a better fix than this that skips the redundant AG while also eliminating grad norm spikes.
There was a problem hiding this comment.
Maybe add an assertion inside the function to make sure the all-gather handle is None?
|
/ok to test 812f72d |



What does this PR do ?
Fix
moe_single_grouped_weightwith bf16, mxfp8, nvfp4 training with fp8/fp4 primary weight turned on or off.Mirror PR to main: #5487
TODOs:
Unit tests with numerical checks passed, pending E2E validation.
test_single_grouped_mxfp8_train_eval_train_matches_train_onlyis a newly introduced test targeting to test thereuse_grad_buff_for_mxfp8_param_agrigorously, like adding checks for train-eval-train switches.Unit test coverage matrix:
bf16=Truefp8=Nonefp4=Nonegradient_accumulation_fusion=Falsebf16=Truefp8=Nonefp4=Nonegradient_accumulation_fusion=Truebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Falsereuse_grad_buf_for_mxfp8_param_ag=Falsegradient_accumulation_fusion=Falsebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Falsereuse_grad_buf_for_mxfp8_param_ag=Falsegradient_accumulation_fusion=Truebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Truereuse_grad_buf_for_mxfp8_param_ag=Truegradient_accumulation_fusion=Falsebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Truereuse_grad_buf_for_mxfp8_param_ag=Truegradient_accumulation_fusion=Truebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Falsegradient_accumulation_fusion=Falsebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Falsegradient_accumulation_fusion=Truebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Truegradient_accumulation_fusion=Falsebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Truegradient_accumulation_fusion=TrueEnv: 1 x gb200 node, 4 GPUs, the unit test only uses 2 parallel ranks.
Command:
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.