Description Problem
Issue with recomputing the MTP module. We initially suspected a bug in overlap-grad-reduce, but disabling it leads to an MTP loss gap between MXFP8 and BF16. The gap is not related to the FP8 recipe itself; it is a bug in recompute. Details: Fix MTP recompute crash with packed sequences NVIDIA/Megatron-LM#4593
Issue with MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag: when all three options are enabled, overlap-param-gather must be enabled as well. The root cause has been identified and the fix is Fix mxfp8 param gather numerical issue when DP overlap is off NVIDIA/Megatron-LM#4769. Note that 4769 targets the Megatron-LM dev branch; the corresponding PR to main is Fix mxfp8 param gather numerical issue when DP overlap is off NVIDIA/Megatron-LM#4800.
Issue with HybridEP: the version that ships with the Nemo26.04 container needs to be upgraded to handle the corner case where one rank has zero tokens. We tested the latest HybridEP PR, Optimization of the standalone permute path deepseek-ai/DeepEP#625, and it worked.
Issue with MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag during eval: this has been fixed by [MXFP8]Update param buffer before AG in eval #3727 and [MXFP8/FP4-param-gather] Post processing after forced param AG in eval NVIDIA/Megatron-LM#4563. Note that 4563 targets the Megatron-LM dev branch; the corresponding PR to main is [MXFP8/FP4-param-gather] Post processing after forced param AG in eval NVIDIA/Megatron-LM#4562.
The Megatron-LM dev branch needs a small additional fix for the issue above ("MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag during eval") when training is launched without Megatron-bridge: [MXFP8] Mirror fixes in Mbridge for mxfp8 param gather NVIDIA/Megatron-LM#4818. This PR does not need to be mirrored to main.
A more general fix for MXFP8 param gather, which identifies the bug pattern behind the issues above and refactors the code so that future cases are covered as well: Generalized fix for mxfp8 param gather #3980. The Megatron-LM dev PR is [Dev] Generalized fix for mxfp8 param gather NVIDIA/Megatron-LM#4994 and the Megatron-LM main PR is [Main] Generalized fix for mxfp8 param gather NVIDIA/Megatron-LM#5236.
Fix for Mcore MoE single grouped weights (i.e., packing the MoE experts into a single tensor of shape [E, N, K]) and for the grad norm spike in the training step after eval: [Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes NVIDIA/Megatron-LM#5464 and [Main] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes NVIDIA/Megatron-LM#5487.
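For builds that do not yet include the param-gather fixes listed above, the option constraint (MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag requires overlap-param-gather) could be guarded with a small pre-flight check. This is only a sketch: the `Mxfp8Flags` class and `check_mxfp8_param_gather` function are hypothetical and not part of Megatron-LM or Megatron-bridge, and the field names simply mirror the option names used in this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mxfp8Flags:
    """Hypothetical container; fields mirror the option names in this issue."""
    mxfp8: bool = False
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    overlap_param_gather: bool = False


def check_mxfp8_param_gather(flags: Mxfp8Flags) -> List[str]:
    """Return human-readable problems for the flag combination (empty if none)."""
    problems = []
    # The combination described in this issue: all three options enabled.
    affected = (
        flags.mxfp8
        and flags.fp8_param_gather
        and flags.reuse_grad_buf_for_mxfp8_param_ag
    )
    if affected and not flags.overlap_param_gather:
        problems.append(
            "MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag "
            "is enabled without overlap-param-gather; enable it or use a "
            "build that includes the param-gather fix."
        )
    return problems


if __name__ == "__main__":
    flags = Mxfp8Flags(
        mxfp8=True,
        fp8_param_gather=True,
        reuse_grad_buf_for_mxfp8_param_ag=True,
    )
    for problem in check_mxfp8_param_gather(flags):
        print("WARNING:", problem)
```

The check returns a list rather than raising, so a launcher can decide whether to warn or abort; once a build includes the fix PRs, the check should no longer be needed.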
Minimal repro
Expected behavior
described above
Affected area
area:model
Regression?
Yes
Environment
No response
Logs