Add MiniMax-M3 BF16 support#1238
Draft
wuhuikx wants to merge 41 commits into
Register MiniMaxM3Sparse model paths, model ops, and vLLM plugin integration for sparse attention and multimodal serving. Co-authored-by: Cursor <cursoragent@cursor.com>
Route MiniMax-M3 sparse attention through precision-safe compiled paths and remove obsolete debug or unreachable branches from the vLLM plugin implementation. Co-authored-by: Cursor <cursoragent@cursor.com>
Use the fused MiniMax-M3 QK-norm/RoPE preprocessing path while preserving ATOM sparse attention cache insertion semantics. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Keep the MiniMax-M3 BF16 PR focused on the native ATOM server path by removing the vLLM plugin wiring from the branch. Co-authored-by: Cursor <cursoragent@cursor.com>
Document the MiniMax-M3 BF16 server, benchmark, and GSM8K validation flow for the native ATOM path. Co-authored-by: Cursor <cursoragent@cursor.com>
Point the MiniMax-M3 BF16 recipe at the M3_mi355 AITER branch used for validation. Co-authored-by: Cursor <cursoragent@cursor.com>
Update the MiniMax-M3 BF16 recipe to pull rocm/atom-dev:latest. Co-authored-by: Cursor <cursoragent@cursor.com>
Document explicit CUDAGraph capture sizes in the MiniMax-M3 BF16 native ATOM server command. Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: lirui927 <ruili@amd.com>
…ts (#1246)

* feat(minimax_m3): route bf16 MoE through dedicated MiniMaxM3Bf16Experts

Native (unquantized) MiniMax-M3 now runs its routed experts through the M3-owned MiniMaxM3Bf16Experts (custom triton GEMM kernels + SwiGLU-OAI via swiglu_oai_split), instead of the generic FusedMoE -> aiter CK fused_moe path. The CK fused_moe path needs an OAI-SwiGLU MoE kernel that only exists on the aiter M3_mi355 branch (CK-submodule patch); it is absent from aiter main. The dedicated experts depend only on triton + swiglu_oai_split (no aiter CK / no aiter fused_moe), so bf16 M3 runs correctly on stock aiter main.

- Restore MiniMaxM3Bf16Experts + its triton kernels in atom/model_ops/minimax_m3/moe.py (removed by 8dd1a01 "remove redundant code").
- Restore the torch.ops.aiter.minimax_m3_bf16_experts_forward op registration in atom/model_ops/module_dispatch_ops.py.
- MiniMaxM3MoE selects the dedicated bf16 experts for unquantized checkpoints and keeps FusedMoE for FP4/MXFP4 (which the dedicated path does not implement).

The dedicated experts apply routed_scaling_factor internally and return un-reduced output; the decoder's fused all-reduce reduces it (matching the existing FusedMoE contract), so no extra all-reduce is added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(minimax_m3): keep shared experts standalone for bf16 weight loading

bf16 routes routed experts through MiniMaxM3Bf16Experts with a standalone shared-experts MLP, but the loader was still fusing the checkpoint's block_sparse_moe.shared_experts.* weights into the routed slot (experts.N), leaving the standalone shared_experts.{gate_up_proj,down_proj}.weight at init (114/825 params unloaded across the 57 MoE layers). Set model.disable_fused_shared_loading=True for the bf16 (dedicated-experts) path so the loader keeps shared-expert weights standalone; FP4 keeps FusedMoE's fused-shared loading. Flag is set on both MiniMaxM3SparseForCausalLM and the text-only wrapper (the instance the weight loader inspects).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
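The 114-parameter figure in the weight-loading fix is consistent with two standalone shared-expert weights (gate_up_proj and down_proj) in each of the 57 MoE layers. A minimal sketch of a post-load check that would surface this kind of gap is shown here. The helper name `find_unloaded` and the `layers.{i}` name layout are hypothetical and are not taken from the ATOM loader.

```python
def find_unloaded(expected_names, loaded_names):
    """Return the expected parameter names the loader never wrote, sorted."""
    return sorted(set(expected_names) - set(loaded_names))


# Hypothetical parameter-name layout, for illustration only.
NUM_MOE_LAYERS = 57
SHARED_WEIGHTS = ("gate_up_proj", "down_proj")

expected = [
    f"layers.{i}.block_sparse_moe.shared_experts.{w}.weight"
    for i in range(NUM_MOE_LAYERS)
    for w in SHARED_WEIGHTS
]

# Simulate the bug: shared-expert weights were fused into the routed slot,
# so none of the standalone names were ever written.
loaded = []

missing = find_unloaded(expected, loaded)
print(len(missing))  # 57 layers * 2 weights
```

Running a check like this right after weight loading turns a silent "left at init" failure into an explicit list of names.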
Co-authored-by: lirui927 <ruili@amd.com>
aiter main upstreamed the MiniMax-M3 fused qk-norm + rope + kv-insert op as `fused_qknorm_idxrqknorm` (PR #3754) with a byte-identical signature; the old M3_mi355 name `fused_minimax_m3_qknorm_rope_kv_insert` does not exist on aiter main. The hasattr() capability gate therefore failed and ATOM silently fell back to the unfused qk-norm path. Rename all 3 references (the capability check + both call sites) so ATOM uses the fused kernel when running against aiter main. Pure rename — call arguments unchanged. Applies to both bf16 and FP4 M3 (attention preprocessing). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
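The failure mode described in this commit, a hasattr() gate that quietly selects the slow path when an op is renamed, can be sketched as follows. This is an illustration under stated assumptions, not ATOM's code: `select_qknorm_path` is a hypothetical helper, and the two `SimpleNamespace` objects stand in for the M3_mi355 and main builds of aiter. Logging a warning on fallback is one way to make such a mismatch visible.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("capability_gate")

# Op name cited in the commit message for aiter main.
FUSED_OP_NAME = "fused_qknorm_idxrqknorm"


def select_qknorm_path(ops_module, op_name=FUSED_OP_NAME):
    """Return ("fused", op) if the op exists, else ("unfused", None) with a warning."""
    op = getattr(ops_module, op_name, None)
    if op is None:
        logger.warning("%s not found; falling back to unfused qk-norm", op_name)
        return "unfused", None
    return "fused", op


# Hypothetical stand-ins for two builds of the kernel library.
old_branch = SimpleNamespace(fused_minimax_m3_qknorm_rope_kv_insert=lambda *a: None)
main_branch = SimpleNamespace(fused_qknorm_idxrqknorm=lambda *a: None)

print(select_qknorm_path(old_branch)[0])
print(select_qknorm_path(main_branch)[0])
```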
* aiter asm pa have right acc
Signed-off-by: ganyi <ygan@amd.com>
* gluon pa for decode
Signed-off-by: ganyi <ygan@amd.com>
---------
Signed-off-by: ganyi <ygan@amd.com>
Document the ASM paged attention env flag for FP4 serving, accuracy, and benchmark flows. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the ASM paged attention environment override now that the default FP4 path has been validated. Co-authored-by: Cursor <cursoragent@cursor.com>
Update the FP4 GSM8K accuracy and concurrency sweep benchmark results from the latest PR validation run. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
* fix(spec_decode): drop emitted tokens past the stop position

A spec-decode verify step can accept tokens after EOS (the rejection sampler compares draft vs target argmax and does not inspect EOS). The internal length (num_tokens) was already truncated at the stop, but the online output path streams RequestOutput.output_tokens (an accumulation of new_tokens), not the num_tokens-truncated completion_token_ids. Mirror the truncation on new_tokens so post-stop tokens never reach the client. Without this, EAGLE leaks a trailing token after the answer; EOS is hidden by skip_special_tokens while the trailing content remains, regressing GSM8K flexible-extract (last-number) while strict-match stays correct.

* feat(minimax_m3): native ATOM EAGLE3 speculative decoding

Enable EAGLE3 draft speculative decoding for MiniMax-M3:
- capture Eagle3 aux hidden states from each selected layer's own fused all-reduce residual (CUDAGraph-safe; avoids the NaN-prone ad-hoc collective) and expose set/get_aux_hidden_state_layers
- map torchspec `layers.0.*` draft weights to `midlayer.*`
- block-paged MHA draft slot mapping via prepare_mtp_decode, and skip the MLA token-granular flat-kv slot derivation for block_size>1 drafts
- persistent sparse_prefix_lens buffer so CUDAGraph spec-verify reads live causal lengths (bound into the refactored sparse prefill metadata)
- recipe + validated GSM8K parity vs non-spec MXFP4

* add quick allreduce int4 for eagle

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
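The stop-position fix in the first commit above can be sketched as a truncation applied to the newly accepted tokens before they are streamed. This is a simplified illustration rather than the ATOM implementation: `truncate_at_stop` and the token ids are hypothetical, and the sketch assumes the stop token itself is kept while everything after it is dropped.

```python
def truncate_at_stop(new_tokens, stop_token_ids):
    """Keep tokens up to and including the first stop token; drop the rest."""
    for i, tok in enumerate(new_tokens):
        if tok in stop_token_ids:
            return new_tokens[: i + 1]
    # No stop token in this step: every accepted token is emitted.
    return new_tokens


EOS = 2  # hypothetical EOS token id

# A verify step accepted one extra draft token (999) after EOS.
accepted = [101, 102, EOS, 999]
print(truncate_at_stop(accepted, {EOS}))
```

Applying the same cut to both the internal length and the streamed token list keeps the two views of a request consistent, which is the property the commit restores.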
* support MiniMax-M3 MXFP8
Co-authored-by: Cursor <cursoragent@cursor.com>
* add mxfp8 recipe
---------
Co-authored-by: xytpai <xytpai@foxmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: xytpai <xytpai@foxmail.com>
server:
model_path=/shared/data/amd_int/models/MiniMax-M3/
export ATOM_USE_TRITON_MOE="${ATOM_USE_TRITON_MOE:-1}"
python -m atom.entrypoints.openai_server \
  --model "$model_path" \
  -tp 8 --server-port 8013 --trust-remote-code --gpu-memory-utilization 0.7 \
  --block-size 128 \
  --no-enable_prefix_caching
curl:
curl -X POST "http://localhost:8013/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of China is", "temperature": 0, "top_p": 1, "top_k": 1, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123, "max_tokens": 10
  }'
accuracy:
model_path=/shared/data/amd_int/models/MiniMax-M3/
lm_eval --model local-completions \
  --model_args model=$model_path,base_url=http://localhost:8013/v1/completions,num_concurrent=65,max_retries=1,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --num_fewshot 5 2>&1 | tee ./m3-accuracy.log
curl result:
{"id":"cmpl-b82583aec2994c5ea7371b291a716969","object":"text_completion","created":1781604871,"model":"/shared/data/amd_int/models/MiniMax-M3/","choices":[{"index":0,"text":" Beijing. The the capital of China is Beijing.","finish_reason":"max_tokens"}],"usage":{"prompt_tokens":5,"completion_tokens":10,"total_tokens":15,"ttft_s":0.0599,"tpot_s":0.0156,"latency_s":0.2005}}
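For scripted smoke tests, the same completion request can be issued from Python with only the standard library. The sketch below mirrors the main fields of the curl payload and reads the first choice's text from a response shaped like the one recorded above. The helper names `build_completion_request` and `first_choice_text` are hypothetical, and the actual send is left commented out because it needs the server to be running.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=10, seed=123):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "prompt": prompt,
        "temperature": 0,
        "top_p": 1,
        "top_k": 1,
        "stream": False,
        "n": 1,
        "seed": seed,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response_body):
    """Extract the generated text of the first choice from a JSON response body."""
    return json.loads(response_body)["choices"][0]["text"]


req = build_completion_request("http://localhost:8013", "The capital of China is")

# Sending requires the server from the "server:" step to be up:
# with urllib.request.urlopen(req) as resp:
#     print(first_choice_text(resp.read()))
```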
accuracy result:
local-completions ({'model': '/shared/data/amd_int/models/MiniMax-M3/', 'base_url': 'http://localhost:8013/v1/completions', 'num_concurrent': 65, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1
Test plan
git diff --check origin/main...HEAD
python3 -m py_compile $(git diff --name-only origin/main...HEAD -- '*.py')
Based on #1235.
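The byte-compile step of the test plan can also be run from Python, which makes it easier to collect every failure instead of stopping at the first one. This is a sketch under assumptions: `changed_python_files` and `compile_all` are hypothetical helpers, and files that were deleted on the branch are reported as failures rather than skipped.

```python
import py_compile
import subprocess


def changed_python_files(base="origin/main"):
    """List *.py paths that differ between the merge base with `base` and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "*.py"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def compile_all(paths):
    """Byte-compile each path; return a list of (path, error message) failures."""
    failures = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except (py_compile.PyCompileError, OSError) as exc:
            # OSError covers paths that no longer exist (deleted on the branch).
            failures.append((path, str(exc)))
    return failures


# Usage inside a git checkout:
# for path, err in compile_all(changed_python_files()):
#     print(f"FAILED {path}: {err}")
```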
Made with Cursor