Add MiniMax-M3 BF16 support by wuhuikx · Pull Request #1238 · ROCm/ATOM

wuhuikx · 2026-06-16T11:57:32Z

server:

model_path=/shared/data/amd_int/models/MiniMax-M3/
export ATOM_USE_TRITON_MOE="${ATOM_USE_TRITON_MOE:-1}"

python -m atom.entrypoints.openai_server \
  --model "$model_path" \
  -tp 8 --server-port 8013 --trust-remote-code --gpu-memory-utilization 0.7 \
  --block-size 128 \
  --no-enable_prefix_caching
curl:

curl -X POST "http://localhost:8013/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
"prompt": "The capital of China is", "temperature": 0, "top_p": 1, "top_k": 1, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123, "max_tokens": 10}'
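
For scripted smoke tests, the same request can be built from Python with only the standard library. This is a minimal sketch, not part of the PR: `build_request` and `complete` are hypothetical helper names, and the URL assumes the server was started with `--server-port 8013` as above.

```python
import json
import urllib.request

# Assumed endpoint: the port matches --server-port 8013 in the server command.
URL = "http://localhost:8013/v1/completions"


def build_request(prompt: str, max_tokens: int = 10) -> urllib.request.Request:
    """Build the same greedy-decoding completion request as the curl call."""
    payload = {
        "prompt": prompt,
        "temperature": 0, "top_p": 1, "top_k": 1,
        "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0,
        "stream": False, "ignore_eos": False, "n": 1, "seed": 123,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("The capital of China is")` against a live server should return the `text` field of the first choice.
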
accuracy:

model_path=/shared/data/amd_int/models/MiniMax-M3/
lm_eval --model local-completions \
  --model_args "model=$model_path,base_url=http://localhost:8013/v1/completions,num_concurrent=65,max_retries=1,tokenized_requests=False,trust_remote_code=True" \
  --tasks gsm8k \
  --num_fewshot 5 2>&1 | tee ./m3-accuracy.log
curl result:

{"id":"cmpl-b82583aec2994c5ea7371b291a716969","object":"text_completion","created":1781604871,"model":"/shared/data/amd_int/models/MiniMax-M3/","choices":[{"index":0,"text":" Beijing. The the capital of China is Beijing.","finish_reason":"max_tokens"}],"usage":{"prompt_tokens":5,"completion_tokens":10,"total_tokens":15,"ttft_s":0.0599,"tpot_s":0.0156,"latency_s":0.2005}
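
The timing fields in `usage` can be cross-checked with a little arithmetic. The sketch below assumes `ttft_s` is time to first token and `tpot_s` is time per subsequent output token; the PR does not define these fields, so treat that reading as an assumption. Under it, total latency should be roughly `ttft_s + (completion_tokens - 1) * tpot_s`.

```python
import json

# Usage block copied from the curl result above.
usage = json.loads(
    '{"prompt_tokens":5,"completion_tokens":10,"total_tokens":15,'
    '"ttft_s":0.0599,"tpot_s":0.0156,"latency_s":0.2005}'
)


def estimated_latency(u: dict) -> float:
    """Assumed model: the first token costs ttft_s, each later token costs tpot_s."""
    return u["ttft_s"] + (u["completion_tokens"] - 1) * u["tpot_s"]


def decode_tokens_per_second(u: dict) -> float:
    """Single-stream decode rate implied by tpot_s."""
    return 1.0 / u["tpot_s"]


est = estimated_latency(usage)
print(f"estimated {est:.4f}s vs reported {usage['latency_s']:.4f}s")
print(f"decode rate ~{decode_tokens_per_second(usage):.1f} tokens/s")
```

With the numbers above the estimate comes to about 0.2003 s against the reported 0.2005 s, which is consistent with that interpretation of the fields.
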
accuracy result:

local-completions ({'model': '/shared/data/amd_int/models/MiniMax-M3/', 'base_url': 'http://localhost:8013/v1/completions', 'num_concurrent': 65, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1

|Tasks|Version|Filter          |n-shot|Metric     |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9196|±  |0.0075|
|     |       |strict-match    |     5|exact_match|↑  |0.9189|±  |0.0075|
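
To put the two scores in context, the reported standard error can be reproduced from a binomial model. This is a rough sketch, assuming the run covered the full GSM8K test split of 1319 problems (the log shows `limit: None`, but the sample count itself is not printed).

```python
import math

N_GSM8K_TEST = 1319  # assumption: size of the standard GSM8K test split


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


def ci95(p: float, se: float) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval."""
    return (p - 1.96 * se, p + 1.96 * se)


for name, acc in [("flexible-extract", 0.9196), ("strict-match", 0.9189)]:
    se = binomial_stderr(acc, N_GSM8K_TEST)
    lo, hi = ci95(acc, se)
    print(f"{name}: {acc:.4f} +/- {se:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Under that assumption both filters give a standard error of about 0.0075, matching the table, and the 0.0007 gap between them corresponds to roughly one problem out of 1319.
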

Test plan

git diff --check origin/main...HEAD
python3 -m py_compile $(git diff --name-only origin/main...HEAD -- '*.py')
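
The syntax check can also be driven from Python, which makes it easier to skip paths that no longer exist (`git diff --name-only` lists deleted files too). A minimal sketch; `changed_python_files` and `compile_all` are hypothetical helper names, not part of this PR.

```python
import py_compile
import subprocess
from pathlib import Path


def changed_python_files(base: str = "origin/main") -> list[str]:
    """List *.py files changed relative to base (requires a git checkout)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "*.py"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Deleted files still show up in the diff, so keep only paths that exist.
    return [p for p in out.splitlines() if Path(p).is_file()]


def compile_all(paths: list[str]) -> list[str]:
    """Byte-compile each path; return the ones that fail to compile."""
    failed = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            failed.append(path)
    return failed
```

Running `compile_all(changed_python_files())` from the repository root returns an empty list when every changed file compiles.
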

Based on #1235.

Made with Cursor

…ts (#1246)

* feat(minimax_m3): route bf16 MoE through dedicated MiniMaxM3Bf16Experts

Native (unquantized) MiniMax-M3 now runs its routed experts through the M3-owned MiniMaxM3Bf16Experts (custom triton GEMM kernels + SwiGLU-OAI via swiglu_oai_split), instead of the generic FusedMoE -> aiter CK fused_moe path.

The CK fused_moe path needs an OAI-SwiGLU MoE kernel that only exists on the aiter M3_mi355 branch (CK-submodule patch); it is absent from aiter main. The dedicated experts depend only on triton + swiglu_oai_split (no aiter CK / no aiter fused_moe), so bf16 M3 runs correctly on stock aiter main.

- Restore MiniMaxM3Bf16Experts + its triton kernels in atom/model_ops/minimax_m3/moe.py (removed by 8dd1a01 "remove redundant code").
- Restore the torch.ops.aiter.minimax_m3_bf16_experts_forward op registration in atom/model_ops/module_dispatch_ops.py.
- MiniMaxM3MoE selects the dedicated bf16 experts for unquantized checkpoints and keeps FusedMoE for FP4/MXFP4 (which the dedicated path does not implement).

The dedicated experts apply routed_scaling_factor internally and return un-reduced output; the decoder's fused all-reduce reduces it (matching the existing FusedMoE contract), so no extra all-reduce is added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(minimax_m3): keep shared experts standalone for bf16 weight loading

bf16 routes routed experts through MiniMaxM3Bf16Experts with a standalone shared-experts MLP, but the loader was still fusing the checkpoint's block_sparse_moe.shared_experts.* weights into the routed slot (experts.N), leaving the standalone shared_experts.{gate_up_proj,down_proj}.weight at init (114/825 params unloaded across the 57 MoE layers).

Set model.disable_fused_shared_loading=True for the bf16 (dedicated-experts) path so the loader keeps shared-expert weights standalone; FP4 keeps FusedMoE's fused-shared loading. Flag is set on both MiniMaxM3SparseForCausalLM and the text-only wrapper (the instance the weight loader inspects).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
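
The selection rule and the reduction contract described in this commit can be illustrated with a small model. This is only a sketch under stated assumptions: `ExpertPlan`, `plan_experts`, and `allreduce_sum` are hypothetical names, the quantization marker is a plain string, and the per-rank outputs are toy numbers rather than real tensors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpertPlan:
    experts_impl: str
    disable_fused_shared_loading: bool


def plan_experts(quant_method: Optional[str]) -> ExpertPlan:
    """Hypothetical selector: None stands for an unquantized (bf16) checkpoint."""
    if quant_method is None:
        # Dedicated path: shared-expert weights must stay standalone.
        return ExpertPlan("MiniMaxM3Bf16Experts", True)
    # Quantized checkpoints (e.g. FP4/MXFP4) keep the generic fused path.
    return ExpertPlan("FusedMoE", False)


def unloaded_shared_params(num_moe_layers: int) -> int:
    # Two standalone tensors per MoE layer: gate_up_proj.weight and down_proj.weight.
    return 2 * num_moe_layers


def allreduce_sum(partials: list[list[float]]) -> list[float]:
    """Toy stand-in for the decoder's all-reduce: elementwise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]


# Scaling each rank's partial output before the reduction is equivalent to
# scaling the reduced output, so the experts can return un-reduced results.
scale = 2.5
partials = [[1.0, 2.0], [0.5, -1.0]]
scaled_then_reduced = allreduce_sum([[scale * x for x in p] for p in partials])
reduced_then_scaled = [scale * x for x in allreduce_sum(partials)]
```

With 57 MoE layers, `unloaded_shared_params(57)` gives 114, which matches the count of unloaded parameters quoted in the fix.
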

aiter main upstreamed the MiniMax-M3 fused qk-norm + rope + kv-insert op as `fused_qknorm_idxrqknorm` (PR #3754) with a byte-identical signature; the old M3_mi355 name `fused_minimax_m3_qknorm_rope_kv_insert` does not exist on aiter main. The hasattr() capability gate therefore failed and ATOM silently fell back to the unfused qk-norm path. Rename all 3 references (the capability check + both call sites) so ATOM uses the fused kernel when running against aiter main. Pure rename — call arguments unchanged. Applies to both bf16 and FP4 M3 (attention preprocessing). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
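
The failure mode here, a `hasattr()` gate that quietly selects the slow path after a rename, can be made louder. The sketch below is a variation, not what this commit does (the commit is a pure rename): it probes both op names from the message above and logs a warning when neither exists. `resolve_fused_op` is a hypothetical helper, and `SimpleNamespace` objects stand in for the aiter module.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("m3_capability")

# Names taken from the commit message: the aiter main name first,
# then the older M3_mi355 branch name.
FUSED_OP_NAMES = (
    "fused_qknorm_idxrqknorm",
    "fused_minimax_m3_qknorm_rope_kv_insert",
)


def resolve_fused_op(module, names=FUSED_OP_NAMES):
    """Return the first callable found under any of the names, else None."""
    for name in names:
        op = getattr(module, name, None)
        if callable(op):
            return op
    # Make the fallback visible instead of silently using the unfused path.
    logger.warning("no fused qk-norm op found (%s); using the unfused path",
                   ", ".join(names))
    return None


# Stand-ins for three library states: renamed op, old op, no fused op.
new_lib = SimpleNamespace(fused_qknorm_idxrqknorm=lambda *a: "new")
old_lib = SimpleNamespace(fused_minimax_m3_qknorm_rope_kv_insert=lambda *a: "old")
bare_lib = SimpleNamespace()
```

Whether supporting both names is worthwhile depends on how long the M3_mi355 branch needs to stay usable; the warning alone would have surfaced the silent fallback.
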

* fix(spec_decode): drop emitted tokens past the stop position

A spec-decode verify step can accept tokens after EOS (the rejection sampler compares draft vs target argmax and does not inspect EOS). The internal length (num_tokens) was already truncated at the stop, but the online output path streams RequestOutput.output_tokens (an accumulation of new_tokens), not the num_tokens-truncated completion_token_ids. Mirror the truncation on new_tokens so post-stop tokens never reach the client.

Without this, EAGLE leaks a trailing token after the answer; EOS is hidden by skip_special_tokens while the trailing content remains, regressing GSM8K flexible-extract (last-number) while strict-match stays correct.

* feat(minimax_m3): native ATOM EAGLE3 speculative decoding

Enable EAGLE3 draft speculative decoding for MiniMax-M3:

- capture Eagle3 aux hidden states from each selected layer's own fused all-reduce residual (CUDAGraph-safe; avoids the NaN-prone ad-hoc collective) and expose set/get_aux_hidden_state_layers
- map torchspec `layers.0.*` draft weights to `midlayer.*`
- block-paged MHA draft slot mapping via prepare_mtp_decode, and skip the MLA token-granular flat-kv slot derivation for block_size>1 drafts
- persistent sparse_prefix_lens buffer so CUDAGraph spec-verify reads live causal lengths (bound into the refactored sparse prefill metadata)
- recipe + validated GSM8K parity vs non-spec MXFP4

* add quick allreduce int4 for eagle

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
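
The stop-position fix in the first commit above boils down to truncating the newly accepted tokens at the first stop token before they are streamed. A minimal sketch of that idea follows; `truncate_at_stop` is a hypothetical name, and keeping the stop token itself in the output is an assumption based on the note that EOS is later hidden by skip_special_tokens.

```python
def truncate_at_stop(new_tokens: list[int], stop_ids: set[int]) -> list[int]:
    """Keep tokens up to and including the first stop token; drop the rest."""
    for i, tok in enumerate(new_tokens):
        if tok in stop_ids:
            return new_tokens[: i + 1]
    return new_tokens


# A verify step accepted four tokens, but the third one is EOS (id 2 here,
# chosen only for illustration). The trailing token must not reach the client.
accepted = [11, 12, 2, 99]
emitted = truncate_at_stop(accepted, stop_ids={2})
```

Applying the same cut to both the internal length and the streamed tokens keeps the two views of the request consistent.
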

XiaobingSuper and others added 13 commits June 13, 2026 07:28

feat(minimax_m3): add MiniMax M3 sparse model base support

803d7bd

Register MiniMaxM3Sparse model paths, model ops, and vLLM plugin integration for sparse attention and multimodal serving. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(minimax_m3): repair vLLM precision issue

be8abc6

Route MiniMax-M3 sparse attention through precision-safe compiled paths and remove obsolete debug or unreachable branches from the vLLM plugin implementation. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into lirui/minimax_m3_vllm_atom

8546d9c

adapt(minimax_m3): support PTPC FP8 expert checkpoints

418d96e

enable aiter fused_moe

1e49905

adapt(minimax_m3): use fused attention preprocessing

4f526f0

Use the fused MiniMax-M3 QK-norm/RoPE preprocessing path while preserving ATOM sparse attention cache insertion semantics. Co-authored-by: Cursor <cursoragent@cursor.com>

adapt(minimax_m3): split sparse attention cache insert

c85c259

Co-authored-by: Cursor <cursoragent@cursor.com>

adapt(layernorm): defer fused Gemma allreduce RMSNorm

9753300

add atom m3 bf16 support

7e03f12

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

refactor(minimax_m3): drop vllm plugin changes

7d51f8a

Keep the MiniMax-M3 BF16 PR focused on the native ATOM server path by removing the vLLM plugin wiring from the branch. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): add bf16 serving recipe

4f10935

Document the MiniMax-M3 BF16 server, benchmark, and GSM8K validation flow for the native ATOM path. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): update aiter branch in recipe

fdab588

Point the MiniMax-M3 BF16 recipe at the M3_mi355 AITER branch used for validation. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): use latest atom-dev image

3f563a6

Update the MiniMax-M3 BF16 recipe to pull rocm/atom-dev:latest. Co-authored-by: Cursor <cursoragent@cursor.com>

wuhuikx marked this pull request as draft June 16, 2026 15:03

lirui927 and others added 16 commits June 16, 2026 10:43

docs(minimax_m3): add native cudagraph launch option

4dafc59

Document explicit CUDAGraph capture sizes in the MiniMax-M3 BF16 native ATOM server command. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): update serving benchmark shape

2de7091

docs(minimax_m3): add serving benchmark result

0eb1a30

remove redundant code

8dd1a01

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

fix exclude naming (#1240)

05a9792

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

docs(minimax_m3): add MXFP4 accuracy recipe

3428501

Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): reorganize recipe sections

5d432e0

Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): add FP4 serving benchmark result

f2677f0

Co-authored-by: Cursor <cursoragent@cursor.com>

add fused topk_gating (#1248)

77c5df0

Co-authored-by: lirui927 <ruili@amd.com>

docs(minimax_m3): refresh FP4 serving benchmark result

676979a

cache cos_sin_cache (#1262)

30e5c08

Co-authored-by: lirui927 <ruili@amd.com>

Update MiniMax-M3.md

4aa042a

Update MiniMax-M3.md

48bbac9

adopt asm pa for prefill and decode path for minimax m3 (#1263)

d9485f0

* aiter asm pa have right acc Signed-off-by: ganyi <ygan@amd.com> * gluon pa for decode Signed-off-by: ganyi <ygan@amd.com> --------- Signed-off-by: ganyi <ygan@amd.com>

lirui927 and others added 12 commits June 17, 2026 08:17

docs(minimax_m3): enable ASM PA in FP4 recipe

02bd776

Document the ASM paged attention env flag for FP4 serving, accuracy, and benchmark flows. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): drop ASM PA env from FP4 recipe

e07a868

Remove the ASM paged attention environment override now that the default FP4 path has been validated. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(minimax_m3): refresh FP4 TP4 validation results

5d42d49

Update the FP4 GSM8K accuracy and concurrency sweep benchmark results from the latest PR validation run. Co-authored-by: Cursor <cursoragent@cursor.com>

make bf16 gate (#1278)

2da3785

Co-authored-by: xytpai <xytpai@foxmail.com>

refactor(minimax_m3): consolidate sparse metadata helpers

2cf2e5c

add gemma_norm + allreduce fusion (#1279)

acc4a63

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

update recipe (#1291)

1c7edff

Co-authored-by: xytpai <xytpai@foxmail.com>

support mxfp8 path (#1292)

e67424a

* support MiniMax-M3 MXFP8 Co-authored-by: Cursor <cursoragent@cursor.com> * add mxfp8 recipe --------- Co-authored-by: xytpai <xytpai@foxmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

update eagle perf into recipe

a93ffc4

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

fix TP8 accuracy issue for mxfp4 and mxfp8 (#1294)

5ee05af

Co-authored-by: xytpai <xytpai@foxmail.com>

update recipe

009cc18

wuhuikx wants to merge 41 commits into main from wuhuikx/atom-m3-bf16-to-main

7 participants
