perf(qwen): enable full iteration cg for b200 b300 fp8 mx by rhmukundan · Pull Request #4606 · NVIDIA-NeMo/Megatron-Bridge

rhmukundan · 2026-06-30T23:59:16Z

Summary

Enable the FP8-MX full-iteration CUDA graph tuning stack for Qwen3 B200/B300 performance recipes on the r0.5.0 release branch.

This updates the B200/B300 FP8-MX recipe definitions to exercise the same optimization path already used by GB200/GB300 where applicable:

- enables moe_a2a_overlap
- enables full-iteration CUDA graphs with cuda_graph_scope=[]
- enables CuteDSL fused grouped MLP
- enables FP8 dot-product attention
- sets VP=3 for Qwen3 235B B200/B300 FP8-MX V2 to match the GB200 recipe shape
- sets moe_flex_dispatcher_backend="hybridep" for Qwen3 235B B200 FP8-MX V2
- wires the B200 FP8-MX 30B and 235B recipe paths through set_full_iter_cg_configs()
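
For readers who want to picture the combined effect of the items above, here is a minimal sketch. It does not use the Megatron-Bridge classes: `RecipeSketch`, `apply_pr_flags`, and the field names `cutedsl_fused_grouped_mlp`, `fp8_dot_product_attention`, and `virtual_pipeline_size` are hypothetical stand-ins. Only `moe_a2a_overlap`, `cuda_graph_scope=[]`, `moe_flex_dispatcher_backend="hybridep"`, and VP=3 are taken from the summary, and the sketch makes no claim about what `set_full_iter_cg_configs()` does internally.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class RecipeSketch:
    """Hypothetical stand-in for a performance recipe, not the real config class."""

    moe_a2a_overlap: bool = False
    cuda_graph_scope: Optional[List[str]] = None  # None means "not configured" in this sketch
    cutedsl_fused_grouped_mlp: bool = False       # assumed field name
    fp8_dot_product_attention: bool = False       # assumed field name
    virtual_pipeline_size: Optional[int] = None   # "VP" in the summary; name assumed
    moe_flex_dispatcher_backend: Optional[str] = None


def apply_pr_flags(
    base: RecipeSketch,
    *,
    vp: Optional[int] = None,
    dispatcher: Optional[str] = None,
) -> RecipeSketch:
    """Return a copy of `base` with the options listed in the summary switched on."""
    return replace(
        base,
        moe_a2a_overlap=True,
        cuda_graph_scope=[],  # the summary pairs an empty scope with full-iteration graphs
        cutedsl_fused_grouped_mlp=True,
        fp8_dot_product_attention=True,
        virtual_pipeline_size=vp if vp is not None else base.virtual_pipeline_size,
        moe_flex_dispatcher_backend=dispatcher or base.moe_flex_dispatcher_backend,
    )


# Qwen3 235B on B200, FP8-MX V2: VP=3 plus the explicit hybridep dispatcher.
qwen3_235b_b200_fp8_mx_v2 = apply_pr_flags(RecipeSketch(), vp=3, dispatcher="hybridep")

# Qwen3 30B on B200, FP8-MX: the summary lists no VP or dispatcher change for it.
qwen3_30b_b200_fp8_mx = apply_pr_flags(RecipeSketch())

print(qwen3_235b_b200_fp8_mx_v2)
print(qwen3_30b_b200_fp8_mx)
```

Using `dataclasses.replace` on a frozen dataclass keeps each variant an independent copy, so a change to one recipe in the sketch cannot leak into another.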

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>

copy-pr-bot · 2026-06-30T23:59:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-07-01T00:05:59Z

LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. A few observations (no blockers):

(1) The moe_flex_dispatcher_backend=hybridep override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct: the B200 FP8_CS parent chain does not inherit it (unlike B300, which gets it from V1).
(2) The set_full_iter_cg_configs(cfg) call in the pretrain functions is complementary to the static base config fields, so there is no overlap or double-application concern.
(3) The B300 paths for both 235B and 30B already had the set_full_iter_cg_configs gate; this PR fills in the B200 gap symmetrically.

Suggested test cases:

qwen3_235b_a22b_pretrain_config_b200(precision=fp8_mx, config_variant=v2)
qwen3_235b_a22b_pretrain_config_b300(precision=fp8_mx, config_variant=v2)
qwen3_30b_a3b_pretrain_config_b200(precision=fp8_mx, config_variant=v1)

No perf tests impacted.
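
Observation (1) turns on how a derived config inherits fields from its parent. The sketch below illustrates that reasoning only. The `Cfg` class, the variable names, and the `None` default are hypothetical, and the real parent chains in the repository may be built differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Cfg:
    """Hypothetical one-field config used only to illustrate inheritance."""

    moe_flex_dispatcher_backend: Optional[str] = None  # default is an assumption


# Assumed B300 chain: V1 already sets the backend, so a V2 derived from it keeps it.
b300_v1 = Cfg(moe_flex_dispatcher_backend="hybridep")
b300_v2 = replace(b300_v1)

# Assumed B200 chain: the FP8_CS parent never sets the backend ...
b200_fp8_cs = Cfg()
b200_v2_without_override = replace(b200_fp8_cs)

# ... so the B200 V2 variant has to set it explicitly, which is what the PR does.
b200_v2 = replace(b200_fp8_cs, moe_flex_dispatcher_backend="hybridep")

for name, cfg in [
    ("b300_v2", b300_v2),
    ("b200_v2_without_override", b200_v2_without_override),
    ("b200_v2", b200_v2),
]:
    print(name, cfg.moe_flex_dispatcher_backend)
```

In this toy model, only the B200 variant without the override ends up with no dispatcher backend, which is the gap the explicit override closes.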

claude · 2026-07-01T00:08:23Z

LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. Observations (no blockers):

(1) The moe_flex_dispatcher_backend override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct: the B200 FP8_CS parent chain does not inherit it.
(2) The set_full_iter_cg_configs call is complementary to static base config fields, no overlap.
(3) B300 already had the gate, this fills the B200 gap symmetrically.

Suggested test cases:

qwen3_235b_a22b_pretrain_config_b200 fp8_mx v2
qwen3_235b_a22b_pretrain_config_b300 fp8_mx v2
qwen3_30b_a3b_pretrain_config_b200 fp8_mx v1

No perf tests impacted.

rhmukundan added 2 commits June 30, 2026 16:47

perf(qwen): enable full iteration cg for b200 b300 fp8 mx

1f2330a

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>

perf(qwen): use hybridep for b200 235b fp8 mx

2867f00

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>

rhmukundan requested a review from malay-nagda June 30, 2026 23:59

rhmukundan self-assigned this Jun 30, 2026

yaoyu-33 added the area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jul 1, 2026

yaoyu-33 previously approved these changes Jul 1, 2026

yaoyu-33 added the ready-to-merge label (PR is approved, current, and only waiting for CI to pass before merge) and removed the needs-review label (PR is ready for code review and waiting on a reviewer) Jul 1, 2026

rhmukundan dismissed yaoyu-33’s stale review via fb1fe29 July 1, 2026 20:37

rhmukundan force-pushed the rmukundan/qwen3_b300_b200_full_iter_cg branch from fb1fe29 to 2867f00 Compare July 1, 2026 21:06

malay-nagda approved these changes Jul 2, 2026

rhmukundan wants to merge 2 commits into r0.5.0 from rmukundan/qwen3_b300_b200_full_iter_cg
