perf(qwen): enable full iteration cg for b200 b300 fp8 mx #4606

Open
rhmukundan wants to merge 2 commits into r0.5.0 from rmukundan/qwen3_b300_b200_full_iter_cg

@rhmukundan

Contributor

Summary

Enable the FP8-MX full-iteration CUDA graph tuning stack for Qwen3 B200/B300 performance recipes on the r0.5.0 release branch.

This updates the B200/B300 FP8-MX recipe definitions to exercise the same optimization path already used by GB200/GB300 where applicable:

  • enables moe_a2a_overlap
  • enables full-iteration CUDA graphs with cuda_graph_scope=[]
  • enables CuteDSL fused grouped MLP
  • enables FP8 dot-product attention
  • sets VP=3 for Qwen3 235B B200/B300 FP8-MX V2 to match the GB200 recipe shape
  • sets moe_flex_dispatcher_backend="hybridep" for Qwen3 235B B200 FP8-MX V2
  • wires the B200 FP8-MX 30B and 235B recipe paths through set_full_iter_cg_configs()
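The overrides listed above can be pictured as a derived recipe layered on a base config. The following is a minimal sketch under stated assumptions, not the repository's code: `RecipeSketch` is a hypothetical stand-in class, and the field names `vp`, `cutedsl_fused_grouped_mlp`, and `fp8_dot_product_attention` are invented for illustration. Only `moe_a2a_overlap`, `cuda_graph_scope`, and `moe_flex_dispatcher_backend` are names taken from this summary.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class RecipeSketch:
    """Hypothetical stand-in for a perf recipe config (not the real class)."""

    # Field names that appear in the PR summary.
    moe_a2a_overlap: bool = False
    cuda_graph_scope: Optional[List[str]] = None  # None means "unset" in this sketch
    moe_flex_dispatcher_backend: Optional[str] = None
    # Hypothetical field names, invented for illustration only.
    vp: Optional[int] = None
    cutedsl_fused_grouped_mlp: bool = False
    fp8_dot_product_attention: bool = False


# Hypothetical base recipe with none of the tuning stack enabled.
base_b200 = RecipeSketch()

# Derived recipe mirroring the bullet list for Qwen3 235B B200 FP8-MX V2.
b200_fp8_mx_v2 = replace(
    base_b200,
    moe_a2a_overlap=True,
    cuda_graph_scope=[],  # per the summary, the empty scope selects full-iteration graphs
    cutedsl_fused_grouped_mlp=True,
    fp8_dot_product_attention=True,
    vp=3,
    moe_flex_dispatcher_backend="hybridep",
)

print(b200_fp8_mx_v2)
```

`replace` returns a new object, so the base recipe stays unchanged and each variant records only the fields it overrides.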

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rhmukundan requested a review from malay-nagda June 30, 2026 23:59
@rhmukundan self-assigned this Jun 30, 2026
@claude

claude Bot commented Jul 1, 2026

Contributor

LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. A few observations (no blockers):

  (1) The moe_flex_dispatcher_backend=hybridep override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct: the B200 FP8_CS parent chain does not inherit it (unlike B300, which gets it from V1).
  (2) The set_full_iter_cg_configs(cfg) call in the pretrain functions is complementary to the static base config fields, so there is no overlap or double-application concern.
  (3) The B300 paths for both 235B and 30B already had the set_full_iter_cg_configs gate; this PR fills in the B200 gap symmetrically.

Suggested test cases:

  • qwen3_235b_a22b_pretrain_config_b200(precision=fp8_mx, config_variant=v2)
  • qwen3_235b_a22b_pretrain_config_b300(precision=fp8_mx, config_variant=v2)
  • qwen3_30b_a3b_pretrain_config_b200(precision=fp8_mx, config_variant=v1)

No perf tests impacted.
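Observation (1) is about config inheritance: a derived config picks up a field only if some ancestor sets it. A minimal sketch of that reasoning follows, assuming a simplified two-field config. `CfgSketch`, the four variable names, and the `vp` field are hypothetical; the real parent chains live in the recipe files and are not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CfgSketch:
    """Hypothetical, simplified config used only to illustrate inheritance."""

    moe_flex_dispatcher_backend: Optional[str] = None
    vp: Optional[int] = None  # hypothetical field name


# Hypothetical B300 chain: V1 already sets the backend, so V2 inherits it.
b300_v1 = CfgSketch(moe_flex_dispatcher_backend="hybridep")
b300_v2 = replace(b300_v1, vp=3)

# Hypothetical B200 chain: the FP8_CS parent never sets the backend,
# so the FP8-MX V2 config has to set it explicitly.
b200_fp8_cs = CfgSketch()
b200_fp8_mx_v2 = replace(b200_fp8_cs, vp=3, moe_flex_dispatcher_backend="hybridep")

# Without the explicit override, the B200 variant would be left unset.
b200_without_override = replace(b200_fp8_cs, vp=3)

print(b300_v2.moe_flex_dispatcher_backend)
print(b200_fp8_mx_v2.moe_flex_dispatcher_backend)
print(b200_without_override.moe_flex_dispatcher_backend)
```

The third config shows why the explicit override matters in this sketch: dropping it leaves the backend unset on the B200 side even though the B300 side has it.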

@claude

claude Bot commented Jul 1, 2026

Contributor

LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. Observations (no blockers):

  (1) The moe_flex_dispatcher_backend override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct: the B200 FP8_CS parent chain does not inherit it.
  (2) The set_full_iter_cg_configs call is complementary to the static base config fields; there is no overlap.
  (3) B300 already had the gate; this fills the B200 gap symmetrically.

Suggested test cases:

  • qwen3_235b_a22b_pretrain_config_b200 fp8_mx v2
  • qwen3_235b_a22b_pretrain_config_b300 fp8_mx v2
  • qwen3_30b_a3b_pretrain_config_b200 fp8_mx v1

No perf tests impacted.

@yaoyu-33 added the labels area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) Jul 1, 2026
yaoyu-33 previously approved these changes Jul 1, 2026
@yaoyu-33 added the label ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) and removed the label needs-review (PR is ready for code review and waiting on a reviewer) Jul 1, 2026
@rhmukundan force-pushed the rmukundan/qwen3_b300_b200_full_iter_cg branch from fb1fe29 to 2867f00 July 1, 2026 21:06
Labels

  • area:perf: Performance optimizations and benchmarking
  • feature: New capabilities, enhancements, or enablement work
  • ready-to-merge: PR is approved, current, and only waiting for CI to pass before merge

3 participants