perf(qwen): enable full iteration cg for b200 b300 fp8 mx#4606
rhmukundan wants to merge 2 commits into
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. A few observations (no blockers):

1. The moe_flex_dispatcher_backend=hybridep override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct - the B200 FP8_CS parent chain does not inherit it (unlike B300 which gets it from V1).
2. The set_full_iter_cg_configs(cfg) call in the pretrain functions is complementary to the static base config fields, so there is no overlap or double-application concern.
3. The B300 paths for both 235B and 30B already had the set_full_iter_cg_configs gate - this PR fills in the B200 gap symmetrically.

Suggested test cases:

- qwen3_235b_a22b_pretrain_config_b200(precision=fp8_mx, config_variant=v2)
- qwen3_235b_a22b_pretrain_config_b300(precision=fp8_mx, config_variant=v2)
- qwen3_30b_a3b_pretrain_config_b200(precision=fp8_mx, config_variant=v1)

No perf tests impacted.
LGTM. The changes are consistent with the existing GB200/GB300 full-iteration CUDA graph patterns and the B300 30B config that was already updated. Observations (no blockers): (1) The moe_flex_dispatcher_backend override in QWEN3_235B_A22B_PRETRAIN_CONFIG_B200_FP8_MX_V2 is correct - the B200 FP8_CS parent chain does not inherit it. (2) The set_full_iter_cg_configs call is complementary to the static base config fields, so there is no overlap. (3) B300 already had the gate; this PR fills the B200 gap symmetrically. Suggested test cases: qwen3_235b_a22b_pretrain_config_b200 fp8_mx v2, qwen3_235b_a22b_pretrain_config_b300 fp8_mx v2, qwen3_30b_a3b_pretrain_config_b200 fp8_mx v1. No perf tests impacted.
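Observation (1) can be pictured with a small sketch of config inheritance. This is only an illustration under stated assumptions: `RecipeConfig`, the `*_SKETCH` constants, and the use of `dataclasses.replace` are hypothetical stand-ins, not the repository's actual classes or parent chain. It shows why a derived config needs an explicit override when its parent never set the field.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RecipeConfig:
    """Hypothetical stand-in for a workload base config (not the real class)."""

    moe_flex_dispatcher_backend: Optional[str] = None


# Illustrative B300 chain: V1 sets the backend, so V2 inherits it for free.
B300_FP8_MX_V1_SKETCH = RecipeConfig(moe_flex_dispatcher_backend="hybridep")
B300_FP8_MX_V2_SKETCH = replace(B300_FP8_MX_V1_SKETCH)

# Illustrative B200 chain: the FP8_CS parent never sets the backend ...
B200_FP8_CS_V2_SKETCH = RecipeConfig()

# ... so a child derived without an override would silently keep None,
B200_FP8_MX_V2_NO_OVERRIDE = replace(B200_FP8_CS_V2_SKETCH)

# ... and the FP8-MX V2 config has to set the field explicitly.
B200_FP8_MX_V2_SKETCH = replace(
    B200_FP8_CS_V2_SKETCH, moe_flex_dispatcher_backend="hybridep"
)
```

With the explicit override, the B200 and B300 sketches end up with the same backend value even though they reach it through different parents.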
Force-pushed from fb1fe29 to 2867f00.
Summary
Enable the FP8-MX full-iteration CUDA graph tuning stack for Qwen3 B200/B300 performance recipes on the `r0.5.0` release branch.

This updates the B200/B300 FP8-MX recipe definitions to exercise the same optimization path already used by GB200/GB300 where applicable:

- `moe_a2a_overlap`
- `cuda_graph_scope=[]`
- `moe_flex_dispatcher_backend="hybridep"` for Qwen3 235B B200 FP8-MX V2
- `set_full_iter_cg_configs()`
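A rough sketch of how such a gate might look, for readers unfamiliar with the recipe layout. Everything here is an assumption made for illustration: `apply_fp8_mx_stack_sketch`, `set_full_iter_cg_configs_sketch`, the `full_iteration_cuda_graph` field, the string keys, and the choice of `moe_a2a_overlap=True` are hypothetical, and the real `set_full_iter_cg_configs()` body is not shown in this PR text.

```python
from types import SimpleNamespace


def set_full_iter_cg_configs_sketch(cfg):
    """Hypothetical stand-in for set_full_iter_cg_configs(); the real body is unknown."""
    cfg.full_iteration_cuda_graph = True  # assumed field name, for illustration only
    return cfg


def apply_fp8_mx_stack_sketch(cfg, gpu, precision, model, variant):
    """Apply the tuning stack from the summary only on the B200/B300 FP8-MX path."""
    if precision != "fp8_mx" or gpu not in ("b200", "b300"):
        return cfg  # other GPU/precision paths are left untouched

    cfg.moe_a2a_overlap = True  # assumed value; the summary lists only the field name
    cfg.cuda_graph_scope = []   # value taken from the summary

    # Per the summary, only Qwen3 235B B200 FP8-MX V2 needs the explicit backend.
    if (model, gpu, variant) == ("qwen3_235b_a22b", "b200", "v2"):
        cfg.moe_flex_dispatcher_backend = "hybridep"

    return set_full_iter_cg_configs_sketch(cfg)


# Usage: the gated path picks up every setting; an unrelated path gets none.
tuned = apply_fp8_mx_stack_sketch(
    SimpleNamespace(), gpu="b200", precision="fp8_mx",
    model="qwen3_235b_a22b", variant="v2",
)
untouched = apply_fp8_mx_stack_sketch(
    SimpleNamespace(), gpu="h100", precision="bf16",
    model="qwen3_235b_a22b", variant="v2",
)
```

Keeping the gate in one function makes it easy to check that the B200 and B300 paths stay symmetric, which is the gap this PR describes closing.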