[bug] Investigate convergence of performance features with Qwen3.5 VL as proxy model

### Problem

- Issue with recompute of the MTP module. We initially thought it was a bug in overlap-grad-reduce, but disabling it leads to an MTP loss gap between MXFP8 and BF16. The gap is not related to the FP8 recipe itself; it is a bug in recompute. Learn more here: https://github.com/NVIDIA/Megatron-LM/pull/4593
- Issue with MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag: when all three options are enabled, you also have to enable overlap-param-gather. This has been root-caused, and the fix is https://github.com/NVIDIA/Megatron-LM/pull/4769. Note that 4769 is a PR to the Megatron-LM dev branch; the PR to main is https://github.com/NVIDIA/Megatron-LM/pull/4800.
- Issue with HybridEP: the version that comes with the Nemo26.04 container needs to be upgraded to handle the corner case where one rank receives zero tokens. We tested the latest HybridEP PR, https://github.com/deepseek-ai/DeepEP/pull/625, and it worked.
- Issue with MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag during eval. This has been fixed by https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/3727 and https://github.com/NVIDIA/Megatron-LM/pull/4563. Note that 4563 is a PR to the Megatron-LM dev branch; the PR to main is https://github.com/NVIDIA/Megatron-LM/pull/4562.
- The Megatron-LM dev branch needs a small additional fix for the eval issue above (MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag during eval) when training is launched without Megatron-Bridge: https://github.com/NVIDIA/Megatron-LM/pull/4818. This PR doesn't need to be mirrored to main.
- A more general fix for MXFP8 param gather, which identifies the bug pattern behind the issues above and refactors the code so the fix also covers future cases: https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/3980, the Megatron-LM dev PR https://github.com/NVIDIA/Megatron-LM/pull/4994, and the Megatron-LM main PR https://github.com/NVIDIA/Megatron-LM/pull/5236.
- Fixes for Mcore MoE single grouped weights (i.e., packing MoE experts into a single tensor of shape `[E, N, K]`) and for the grad norm spike in the training step after eval: https://github.com/NVIDIA/Megatron-LM/pull/5464 and https://github.com/NVIDIA/Megatron-LM/pull/5487
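
Until the fix for the MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag issue lands in your branch, the flag constraint described above can be checked before launch. The following is a minimal sketch, not Megatron-LM or Megatron-Bridge code: the `PerfFlags` dataclass, its `fp8_recipe` field, and both helper functions are hypothetical names introduced for illustration. Only the option names themselves come from this issue.

```python
from dataclasses import dataclass


@dataclass
class PerfFlags:
    """Hypothetical stand-in for the relevant training options (not a library class)."""

    fp8_recipe: str = "bf16"  # assumption: "mxfp8" denotes the MXFP8 recipe
    fp8_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    overlap_param_gather: bool = False


def needs_overlap_param_gather(flags: PerfFlags) -> bool:
    """True when the three options named in this issue are all enabled."""
    return (
        flags.fp8_recipe == "mxfp8"
        and flags.fp8_param_gather
        and flags.reuse_grad_buf_for_mxfp8_param_ag
    )


def check_flags(flags: PerfFlags) -> None:
    """Raise early instead of hitting the known issue at runtime."""
    if needs_overlap_param_gather(flags) and not flags.overlap_param_gather:
        raise ValueError(
            "MXFP8 + fp8_param_gather + reuse_grad_buf_for_mxfp8_param_ag "
            "currently requires overlap-param-gather to be enabled"
        )
```

Calling `check_flags(PerfFlags(fp8_recipe="mxfp8", fp8_param_gather=True, reuse_grad_buf_for_mxfp8_param_ag=True))` raises `ValueError`; the same call with `overlap_param_gather=True` passes. Once the fix PRs listed above are merged into the branch in use, a guard like this should no longer be needed.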

### Minimal repro

```shell
N/A
```

### Expected behavior

Training converges with the performance features above enabled, as described in the Problem section.

### Affected area

area:model

### Regression?

Yes

### Environment

_No response_

### Logs

```shell

```

[bug] Investigate convergence of performance features with Qwen3.5 VL as proxy model #3801
