
Add MiniMax-M3 BF16 support#1238

Draft
wuhuikx wants to merge 41 commits into
mainfrom
wuhuikx/atom-m3-bf16-to-main


@wuhuikx wuhuikx commented Jun 16, 2026

Collaborator

server:

model_path=/shared/data/amd_int/models/MiniMax-M3/
export ATOM_USE_TRITON_MOE="${ATOM_USE_TRITON_MOE:-1}"

python -m atom.entrypoints.openai_server \
  --model $model_path \
  -tp 8 --server-port 8013 --trust-remote-code --gpu-memory-utilization 0.7 \
  --block-size 128 \
  --no-enable_prefix_caching

curl:

curl -X POST "http://localhost:8013/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of China is", "temperature": 0, "top_p": 1, "top_k": 1, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123, "max_tokens": 10}'

accuracy:

model_path=/shared/data/amd_int/models/MiniMax-M3/
lm_eval --model local-completions \
  --model_args model=$model_path,base_url=http://localhost:8013/v1/completions,num_concurrent=65,max_retries=1,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --num_fewshot 5 2>&1 | tee ./m3-accuracy.log

curl result:

{"id":"cmpl-b82583aec2994c5ea7371b291a716969","object":"text_completion","created":1781604871,"model":"/shared/data/amd_int/models/MiniMax-M3/","choices":[{"index":0,"text":" Beijing. The the capital of China is Beijing.","finish_reason":"max_tokens"}],"usage":{"prompt_tokens":5,"completion_tokens":10,"total_tokens":15,"ttft_s":0.0599,"tpot_s":0.0156,"latency_s":0.2005}}
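
For scripted smoke tests, the curl request and its timing fields can be reproduced from Python. This is a minimal sketch using only the standard library. The endpoint and payload are copied from the curl command. The helper names (`build_payload`, `complete`, `estimated_latency`) are illustrative and not part of ATOM, and the latency model (first token costs `ttft_s`, each later token costs `tpot_s`) is an assumption inferred from the field names.

```python
import json
import urllib.request

# Endpoint copied from the curl command; adjust host/port for your server.
URL = "http://localhost:8013/v1/completions"


def build_payload(prompt: str, max_tokens: int = 10, seed: int = 123) -> dict:
    """Return the same greedy-decoding request body as the curl example."""
    return {
        "prompt": prompt,
        "temperature": 0,
        "top_p": 1,
        "top_k": 1,
        "repetition_penalty": 1.0,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "stream": False,
        "ignore_eos": False,
        "n": 1,
        "seed": seed,
        "max_tokens": max_tokens,
    }


def complete(prompt: str, url: str = URL, timeout: float = 60.0) -> dict:
    """POST the payload and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def estimated_latency(usage: dict) -> float:
    """Assumed model: the first token costs TTFT, each later token costs TPOT."""
    return usage["ttft_s"] + (usage["completion_tokens"] - 1) * usage["tpot_s"]


# Usage numbers reported in the curl result.
reported = {"prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15,
            "ttft_s": 0.0599, "tpot_s": 0.0156, "latency_s": 0.2005}
est = estimated_latency(reported)
print(f"estimated {est:.4f} s vs reported {reported['latency_s']:.4f} s")
```

With the server running, `complete("The capital of China is")["choices"][0]["text"]` returns the generated text. For the reported usage, the estimate is about 0.2003 s against a reported 0.2005 s, so the three timing fields are mutually consistent under this assumption.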
accuracy result:

local-completions ({'model': '/shared/data/amd_int/models/MiniMax-M3/', 'base_url': 'http://localhost:8013/v1/completions', 'num_concurrent': 65, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1

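| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|--------|-------:|--------|------:|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9196 | ± 0.0075 |
|       |   | strict-match | 5 | exact_match | 0.9189 | ± 0.0075 |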
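
As a sanity check on the reported numbers, the standard error can be recomputed from the accuracy alone. The sketch below assumes the run covered the full GSM8K test split of 1319 problems (consistent with `limit: None`) and that the standard error is that of a mean of 0/1 outcomes using the sample standard deviation. Neither assumption is stated in the log.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumption: full GSM8K test split, no --limit


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes (sample std, n - 1)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


for name, acc in [("flexible-extract", 0.9196), ("strict-match", 0.9189)]:
    se = binomial_stderr(acc, GSM8K_TEST_SIZE)
    solved = round(acc * GSM8K_TEST_SIZE)
    print(f"{name}: {solved}/{GSM8K_TEST_SIZE} solved, stderr {se:.4f}")
```

Under these assumptions both filters give a standard error that rounds to 0.0075, and the two scores correspond to 1213 and 1212 solved problems, so the filters differ by a single problem.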

Test plan

  • git diff --check origin/main...HEAD
  • python3 -m py_compile $(git diff --name-only origin/main...HEAD -- '*.py')
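
The byte-compile step of the test plan can also be run from Python, which makes it easier to report every failing file rather than stopping at the first. A minimal sketch, assuming it runs from the repository root; the helper names are illustrative.

```python
import py_compile
import subprocess
from pathlib import Path


def changed_python_files(base: str = "origin/main") -> list[str]:
    """List *.py files changed on this branch relative to `base`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "*.py"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Deleted files still appear in the diff, so keep only existing paths.
    return [p for p in out.splitlines() if Path(p).is_file()]


def compile_all(paths: list[str]) -> dict[str, str]:
    """Byte-compile each file; return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures[path] = exc.msg
    return failures
```

Calling `compile_all(changed_python_files())` returns an empty dict when every changed file compiles. Unlike the shell one-liner, it skips files that the branch deleted.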

Based on #1235.

Made with Cursor

XiaobingSuper and others added 13 commits June 13, 2026 07:28
Register MiniMaxM3Sparse model paths, model ops, and vLLM plugin integration for sparse attention and multimodal serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Route MiniMax-M3 sparse attention through precision-safe compiled paths and remove obsolete debug or unreachable branches from the vLLM plugin implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use the fused MiniMax-M3 QK-norm/RoPE preprocessing path while preserving ATOM sparse attention cache insertion semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Keep the MiniMax-M3 BF16 PR focused on the native ATOM server path by removing the vLLM plugin wiring from the branch.

Co-authored-by: Cursor <cursoragent@cursor.com>
Document the MiniMax-M3 BF16 server, benchmark, and GSM8K validation flow for the native ATOM path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Point the MiniMax-M3 BF16 recipe at the M3_mi355 AITER branch used for validation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Update the MiniMax-M3 BF16 recipe to pull rocm/atom-dev:latest.

Co-authored-by: Cursor <cursoragent@cursor.com>
@wuhuikx wuhuikx marked this pull request as draft June 16, 2026 15:03
lirui927 and others added 16 commits June 16, 2026 10:43
Document explicit CUDAGraph capture sizes in the MiniMax-M3 BF16 native ATOM server command.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: lirui927 <ruili@amd.com>
…ts (#1246)

* feat(minimax_m3): route bf16 MoE through dedicated MiniMaxM3Bf16Experts

Native (unquantized) MiniMax-M3 now runs its routed experts through the
M3-owned MiniMaxM3Bf16Experts (custom triton GEMM kernels + SwiGLU-OAI via
swiglu_oai_split), instead of the generic FusedMoE -> aiter CK fused_moe path.

The CK fused_moe path needs an OAI-SwiGLU MoE kernel that only exists on the
aiter M3_mi355 branch (CK-submodule patch); it is absent from aiter main. The
dedicated experts depend only on triton + swiglu_oai_split (no aiter CK / no
aiter fused_moe), so bf16 M3 runs correctly on stock aiter main.

- Restore MiniMaxM3Bf16Experts + its triton kernels in
  atom/model_ops/minimax_m3/moe.py (removed by 8dd1a01 "remove redundant code").
- Restore the torch.ops.aiter.minimax_m3_bf16_experts_forward op registration
  in atom/model_ops/module_dispatch_ops.py.
- MiniMaxM3MoE selects the dedicated bf16 experts for unquantized checkpoints
  and keeps FusedMoE for FP4/MXFP4 (which the dedicated path does not implement).

The dedicated experts apply routed_scaling_factor internally and return
un-reduced output; the decoder's fused all-reduce reduces it (matching the
existing FusedMoE contract), so no extra all-reduce is added.
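
The reduction contract described above can be illustrated with a toy model. This sketch uses plain Python lists in place of tensors and an element-wise sum in place of the real all-reduce. All names are hypothetical stand-ins and not ATOM code.

```python
def all_reduce_sum(per_rank: list[list[float]]) -> list[float]:
    """Stand-in for a sum all-reduce: element-wise sum over ranks."""
    return [sum(vals) for vals in zip(*per_rank)]


def experts_forward(partial: list[float], routed_scaling_factor: float) -> list[float]:
    """Toy experts: scale the rank-local partial output, do not reduce."""
    return [routed_scaling_factor * v for v in partial]


# Rank-local partial outputs for 4 tensor-parallel ranks (made-up values).
partials = [[1.0, 2.0], [0.5, -1.0], [0.25, 0.0], [0.25, 1.0]]
scale = 2.0

# Each rank scales internally; the decoder's single all-reduce sums them.
scaled = [experts_forward(p, scale) for p in partials]
reduced_once = all_reduce_sum(scaled)

# Because scaling is linear, this equals scaling the reduced output.
assert reduced_once == [scale * v for v in all_reduce_sum(partials)]
print(reduced_once)
```

Since the experts return un-reduced output, exactly one reduction is needed downstream. Reducing inside the experts as well would sum the partial outputs twice.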

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(minimax_m3): keep shared experts standalone for bf16 weight loading

bf16 routes routed experts through MiniMaxM3Bf16Experts with a standalone
shared-experts MLP, but the loader was still fusing the checkpoint's
block_sparse_moe.shared_experts.* weights into the routed slot (experts.N),
leaving the standalone shared_experts.{gate_up_proj,down_proj}.weight at init
(114/825 params unloaded across the 57 MoE layers).

Set model.disable_fused_shared_loading=True for the bf16 (dedicated-experts)
path so the loader keeps shared-expert weights standalone; FP4 keeps FusedMoE's
fused-shared loading. Flag is set on both MiniMaxM3SparseForCausalLM and the
text-only wrapper (the instance the weight loader inspects).
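
The unloaded-parameter count follows from the layer count: 57 MoE layers times 2 standalone shared-expert weights gives 114. The sketch below shows that arithmetic and a simplified view of the loader decision. The function, the name layout, and the `fused_slot` index are hypothetical illustrations, since the commit message only identifies the fused destination as `experts.N`.

```python
NUM_MOE_LAYERS = 57  # from the commit message
STANDALONE_SHARED_WEIGHTS = ("gate_up_proj", "down_proj")

# 57 layers x 2 weights = 114 parameters left at init before the fix.
unloaded = NUM_MOE_LAYERS * len(STANDALONE_SHARED_WEIGHTS)

SHARED_MARKER = "block_sparse_moe.shared_experts."


def destination(ckpt_name: str, disable_fused_shared_loading: bool,
                fused_slot: int) -> str:
    """Pick where a checkpoint weight lands (hypothetical name layout)."""
    if SHARED_MARKER not in ckpt_name:
        return ckpt_name
    if disable_fused_shared_loading:
        # bf16 dedicated-experts path: shared experts stay standalone.
        return ckpt_name
    # FusedMoE path: shared weights are folded into a routed-expert slot.
    return ckpt_name.replace(SHARED_MARKER, f"block_sparse_moe.experts.{fused_slot}.")


name = "layers.3.block_sparse_moe.shared_experts.down_proj.weight"
print(unloaded)
print(destination(name, disable_fused_shared_loading=True, fused_slot=0))
print(destination(name, disable_fused_shared_loading=False, fused_slot=0))
```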

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: lirui927 <ruili@amd.com>
aiter main upstreamed the MiniMax-M3 fused qk-norm + rope + kv-insert op as
`fused_qknorm_idxrqknorm` (PR #3754) with a byte-identical signature; the old
M3_mi355 name `fused_minimax_m3_qknorm_rope_kv_insert` does not exist on aiter
main. The hasattr() capability gate therefore failed and ATOM silently fell
back to the unfused qk-norm path.

Rename all 3 references (the capability check + both call sites) so ATOM uses
the fused kernel when running against aiter main. Pure rename — call arguments
unchanged. Applies to both bf16 and FP4 M3 (attention preprocessing).
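
The failure mode here is a capability check that silently selects the slow path when an op is renamed. A minimal sketch of the same gate that logs the fallback, using `types.SimpleNamespace` objects as stand-ins for the aiter module; `resolve_qknorm_op` and the logger name are illustrative, not ATOM code.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("atom.sketch")

# Op name on aiter main, per the commit message.
FUSED_OP_NAME = "fused_qknorm_idxrqknorm"


def resolve_qknorm_op(ops, unfused_fallback):
    """Return (op, is_fused); warn instead of falling back silently."""
    fused = getattr(ops, FUSED_OP_NAME, None)
    if fused is not None:
        return fused, True
    logger.warning("%s not found; using the unfused qk-norm path", FUSED_OP_NAME)
    return unfused_fallback, False


def unfused(*args):
    return "unfused"


# Stand-ins: one module exposing only the old name, one exposing the new name.
old_branch = SimpleNamespace(fused_minimax_m3_qknorm_rope_kv_insert=lambda *a: "fused")
aiter_main = SimpleNamespace(fused_qknorm_idxrqknorm=lambda *a: "fused")

print(resolve_qknorm_op(aiter_main, unfused)[1])
print(resolve_qknorm_op(old_branch, unfused)[1])
```

A warning at resolution time would have made the silent fallback visible in the server log.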

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* aiter asm pa have right acc

Signed-off-by: ganyi <ygan@amd.com>

* gluon pa for decode

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
lirui927 and others added 12 commits June 17, 2026 08:17
Document the ASM paged attention env flag for FP4 serving, accuracy, and benchmark flows.

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the ASM paged attention environment override now that the default FP4 path has been validated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Update the FP4 GSM8K accuracy and concurrency sweep benchmark results from the latest PR validation run.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
* fix(spec_decode): drop emitted tokens past the stop position

A spec-decode verify step can accept tokens after EOS (the rejection
sampler compares draft vs target argmax and does not inspect EOS). The
internal length (num_tokens) was already truncated at the stop, but the
online output path streams RequestOutput.output_tokens (an accumulation
of new_tokens), not the num_tokens-truncated completion_token_ids. Mirror
the truncation on new_tokens so post-stop tokens never reach the client.

Without this, EAGLE leaks a trailing token after the answer; EOS is hidden
by skip_special_tokens while the trailing content remains, regressing
GSM8K flexible-extract (last-number) while strict-match stays correct.
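
The truncation can be sketched as a pure function over the tokens accepted in one verify step. This is an illustration rather than the ATOM implementation. It assumes the stop token itself is kept and everything after it is dropped, and the token ids are made up.

```python
def truncate_after_stop(new_tokens: list[int], stop_token_ids: set[int]) -> list[int]:
    """Keep tokens up to and including the first stop token."""
    for i, tok in enumerate(new_tokens):
        if tok in stop_token_ids:
            return new_tokens[: i + 1]
    return new_tokens


EOS = 2  # hypothetical EOS token id

# A verify step accepted one draft token past EOS.
accepted = [11, 12, EOS, 99]
print(truncate_after_stop(accepted, {EOS}))
```

Applying the same function to both the internal length and the streamed `new_tokens` keeps the two views of the output in agreement.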

* feat(minimax_m3): native ATOM EAGLE3 speculative decoding

Enable EAGLE3 draft speculative decoding for MiniMax-M3:
- capture Eagle3 aux hidden states from each selected layer's own fused
  all-reduce residual (CUDAGraph-safe; avoids the NaN-prone ad-hoc
  collective) and expose set/get_aux_hidden_state_layers
- map torchspec `layers.0.*` draft weights to `midlayer.*`
- block-paged MHA draft slot mapping via prepare_mtp_decode, and skip the
  MLA token-granular flat-kv slot derivation for block_size>1 drafts
- persistent sparse_prefix_lens buffer so CUDAGraph spec-verify reads live
  causal lengths (bound into the refactored sparse prefill metadata)
- recipe + validated GSM8K parity vs non-spec MXFP4
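
The draft-weight renaming in the list above amounts to a prefix rewrite. A minimal sketch, assuming the torchspec names begin with `layers.0.` exactly; the helper name and the example weight names are hypothetical.

```python
def remap_draft_weight_name(name: str) -> str:
    """Map a torchspec-style `layers.0.*` name to `midlayer.*`."""
    prefix = "layers.0."
    if name.startswith(prefix):
        return "midlayer." + name[len(prefix):]
    return name


for name in ["layers.0.self_attn.q_proj.weight", "embed_tokens.weight"]:
    print(name, "->", remap_draft_weight_name(name))
```

Matching on the full `layers.0.` prefix, including the trailing dot, avoids rewriting names such as `layers.01.*` by accident.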

* add quick allreduce int4 for eagle

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* support MiniMax-M3 MXFP8

Co-authored-by: Cursor <cursoragent@cursor.com>

* add mxfp8 recipe

---------

Co-authored-by: xytpai <xytpai@foxmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
7 participants