Feat: RTP-LLM plugin GLM5-FP8 support by zhaoan12-prc · Pull Request #1289 · ROCm/ATOM

zhaoan12-prc · 2026-06-18T13:25:10Z

Motivation

Add RTP + ATOM integration for GLM5-FP8 so GLM5 can run through the RTP plugin path with ATOM model loading and attention/MoE execution.

Technical Details

Added GLM5-FP8 RTP plugin integration on top of ATOM model construction and weight loading.
Wired RTP plugin attention/context handling for GLM5 MLA and sparse attention paths.
Kept RTP-specific import and patch behavior isolated under atom/plugin/rtpllm to avoid affecting the main ATOM execution path.
Added/updated RTP plugin tests for GLM5 lifecycle and sparse backend behavior.
RTP-side enablement is controlled through the GLM5 FP8 server environment, including:
- RTP_LLM_EXTERNAL_MODEL_PACKAGES=atom.plugin.rtpllm.models
- LOAD_PYTHON_MODEL=1
- ENABLE_CUDA_GRAPH=1
- FP8_KV_CACHE=1
- MODEL_TYPE=glm_5
- ACT_TYPE=BF16
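
The environment above can be collected in one place before starting the server. The following is a minimal sketch, not the PR's launch script: the variable names and values come from the list above, while `GLM5_FP8_RTP_ENV` and `build_server_env` are hypothetical helper names, and the server launch command is not given in the PR, so it is left out.

```python
import os

# Values copied from the PR description; the dict name is illustrative only.
GLM5_FP8_RTP_ENV = {
    "RTP_LLM_EXTERNAL_MODEL_PACKAGES": "atom.plugin.rtpllm.models",
    "LOAD_PYTHON_MODEL": "1",
    "ENABLE_CUDA_GRAPH": "1",
    "FP8_KV_CACHE": "1",
    "MODEL_TYPE": "glm_5",
    "ACT_TYPE": "BF16",
}


def build_server_env(base=None):
    """Return a copy of `base` (default: the current process environment)
    with the GLM5 FP8 RTP settings applied on top."""
    env = dict(os.environ if base is None else base)
    env.update(GLM5_FP8_RTP_ENV)
    return env


if __name__ == "__main__":
    # Print the settings in `KEY=value` form, e.g. for pasting into a shell.
    for key in GLM5_FP8_RTP_ENV:
        print(f"{key}={build_server_env()[key]}")
```

The returned mapping could be passed as the `env` argument of `subprocess.Popen` when the RTP server is started from Python.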

Test Plan

Verified RTP + ATOM GLM5 with BF16 MLA.
Verified RTP + ATOM GLM5 with FP8 MLA.
Benchmarked pure ATOM FP8 MLA as the baseline for comparison.

Test Result

Verified GLM5 RTP + ATOM with both FP8 MLA and BF16 MLA.
Compared RTP + ATOM FP8 MLA against pure ATOM FP8 MLA baseline.
RTP + ATOM FP8 MLA performance is broadly on par with pure ATOM:
- RTP + ATOM FP8 MLA: TTFT 2290.07 ms, TPOT 34.67 ms
- Pure ATOM FP8 MLA: TTFT 2231.86 ms, TPOT 33.53 ms
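
To make "broadly on par" concrete, the relative overhead of the RTP + ATOM path over the pure ATOM baseline can be computed from the two result lines above. This is a small sketch using only the reported numbers; the helper name `relative_overhead_pct` is illustrative.

```python
def relative_overhead_pct(candidate, baseline):
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0


# (RTP + ATOM FP8 MLA, pure ATOM FP8 MLA), in milliseconds, from the PR.
results = {
    "TTFT": (2290.07, 2231.86),
    "TPOT": (34.67, 33.53),
}

overheads = {
    metric: relative_overhead_pct(candidate, baseline)
    for metric, (candidate, baseline) in results.items()
}

for metric, pct in overheads.items():
    print(f"{metric}: +{pct:.2f}% vs pure ATOM")
```

With the reported figures this gives roughly 2.6% higher TTFT and 3.4% higher TPOT for the RTP + ATOM path.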

Copilot

Pull request overview

Adds RTP-LLM plugin integration for GLM5-FP8 by introducing an ATOM-backed GLM5 wrapper, plus RTP-aware MLA attention/context handling (including sparse top-k plumbing) and associated contract/lifecycle tests. This extends ATOM’s existing plugin framework (prepare/config/rtpllm) so GLM5 can be constructed and loaded via ATOM while running through RTP’s external model path.

Changes:

Added a new ATOMGlm5Moe RTP model wrapper with ATOM-based model construction and plugin-mode weight loading.
Introduced RTP MLA attention adapter + sparse MLA backend contracts, and extended RTP forward-context metadata to support GLM5 MLA/sparse flows.
Added extensive RTP plugin tests covering GLM5 lifecycle, registration, patching/guards, and sparse backend/indexer contracts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

File	Description
tests/plugin/test_rtpllm_prepare_model.py	Adds GLM5-specific prepare_model behavior test (MLA patch re-application).
tests/plugin/test_rtpllm_model_wrapper.py	Extends RTP model wrapper registration test expectations to include GLM5.
tests/plugin/test_rtpllm_glm5_wrapper_lifecycle.py	New lifecycle tests for GLM5 wrapper load/create/runtime behaviors.
tests/plugin/test_rtpllm_glm5_sparse_backend_contract.py	New contract-executable tests for sparse MLA backend behavior and invariants.
tests/plugin/test_rtpllm_glm5_registration.py	New tests asserting GLM5 model registration + alias wiring.
tests/plugin/test_rtpllm_glm5_ownership.py	New tests asserting GLM5 ownership/bridge-mode static contracts.
tests/plugin/test_rtpllm_glm5_mla_patch.py	Guards ensuring old monkey-patch paths are gone; validates MLA patch symbol updates.
tests/plugin/test_rtpllm_glm5_mla_bridge_shape.py	Shape-level execution tests for MLA adapter boundary.
tests/plugin/test_rtpllm_glm5_mha_bridge_guard.py	Static guard tests preventing unwanted MHA/Qwen patch usage + sparse kernel import-time deps.
tests/plugin/test_rtpllm_glm5_indexer_contract.py	New contract tests for indexer/topk buffer behavior and sparse backend threading.
tests/plugin/test_rtpllm_forward_context_semantics.py	Extends forward-context semantics tests (block table recovery, slot mapping, MLA layer mapping).
atom/plugin/rtpllm/utils/forward_context.py	Adds MLA-aware layer mapping, physical/kernel block table handling, and richer plugin metadata building for RTP mode.
atom/plugin/rtpllm/utils/__init__.py	Exposes new forward-context variants (MLA + Qwen3.5 hybrid).
atom/plugin/rtpllm/models/qwen3_5.py	Switches Qwen3.5 RTP runtime to the hybrid forward-context; updates cg prewarm buffers.
atom/plugin/rtpllm/models/glm5.py	Adds new GLM5 RTP wrapper + runtime using ATOM model creation/loading and RTP forward context binding.
atom/plugin/rtpllm/models/base_model_wrapper.py	Registers the GLM5 wrapper in RTP’s model factory + HF-arch mapping.
atom/plugin/rtpllm/models/__init__.py	Makes RTP model wrapper imports resilient when `rtp_llm` is absent; wires GLM5 arch into `_ATOM_SUPPORTED_MODELS`.
atom/plugin/rtpllm/attention_backend/rtp_sparse_mla_backend.py	Introduces sparse MLA backend + custom op registration and topk consumption contract.
atom/plugin/rtpllm/attention_backend/rtp_mla_metadata.py	Adds GLM5 MLA metadata/ownership static contracts.
atom/plugin/rtpllm/attention_backend/rtp_mla_attention.py	Adds RTPMLAAttention adapter + patch hook to swap ATOM Attention symbol for MLA.
atom/plugin/rtpllm/attention_backend/__init__.py	Refactors attention_backend exports to include MLA/sparse, with lazy attribute loading.
atom/plugin/rtpllm/__init__.py	Adds package root to keep import side-effect free.
atom/plugin/prepare.py	Adds GLM5 RTP path hook to apply MLA attention patch during RTP plugin prepare.
atom/plugin/config.py	Adjusts RTP plugin max_num_batched_tokens sizing to respect model max length.
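
Two of the rows above mention import-time behavior: imports that tolerate a missing `rtp_llm`, and lazy attribute loading in the package exports. The sketch below shows one common way to get both in Python, using `importlib.util.find_spec` and a module-level `__getattr__` (PEP 562). It is an assumption about the general technique, not the code in this PR; `has_module`, `make_lazy_package`, `demo_backend`, and the `math.sqrt` export are hypothetical stand-ins.

```python
import importlib
import importlib.util
import types


def has_module(name):
    """Return True if the top-level module `name` is importable,
    without actually importing it."""
    return importlib.util.find_spec(name) is not None


def make_lazy_package(name, lazy_exports):
    """Build a module whose attributes are imported on first access.

    `lazy_exports` maps an exported name to (module_name, symbol)."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        target = lazy_exports.get(attr)
        if target is None:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module_name, symbol = target
        value = getattr(importlib.import_module(module_name), symbol)
        setattr(pkg, attr, value)  # cache so later lookups skip this hook
        return value

    # A `__getattr__` stored in the module namespace is the PEP 562 hook.
    pkg.__getattr__ = __getattr__
    return pkg


# Stand-in export: a real package would list its own backend symbols here.
demo = make_lazy_package("demo_backend", {"sqrt": ("math", "sqrt")})

# Guard optional wiring on availability instead of failing at import time.
RTP_AVAILABLE = has_module("rtp_llm")
```

Accessing `demo.sqrt` triggers the import of `math` only at that point, and code that needs RTP can check `RTP_AVAILABLE` before importing anything from `rtp_llm`.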

Comments suppressed due to low confidence (1)

atom/plugin/rtpllm/models/qwen3_5.py:401

RTPForwardContext._build_req_id_per_token() can require a prewarmed cg_bufs['seq_id_i32'] to stay allocation-free during CUDA-graph capture. This prewarm dict currently only includes the int64 seq_id, so capture would either allocate (if casting) or fail (if we enforce no allocations). Add an int32 seq_id_i32 buffer alongside seq_id.

        self._cg_meta_bufs: dict = {
            "query_start_loc": torch.arange(
                0, max_bs + 1, device=device, dtype=torch.int32
            ),
            "seq_id": torch.arange(0, max_bs, device=device, dtype=torch.int64),
            "block_col": torch.empty(max_bs, device=device, dtype=torch.int32),
            "block_col_i64": torch.empty(max_bs, device=device, dtype=torch.int64),
            "slot_base": torch.empty(max_bs, device=device, dtype=torch.int32),

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings June 18, 2026 13:25

zhaoan12-prc marked this pull request as draft June 18, 2026 13:25

Copilot started reviewing on behalf of zhaoan12-prc June 18, 2026 13:25

Copilot AI reviewed Jun 18, 2026

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated

zhaoan12-prc requested review from Yuechguo and zejunchen-zejun June 19, 2026 05:58

zhaoan12-prc marked this pull request as ready for review June 19, 2026 06:00

Copilot AI review requested due to automatic review settings June 19, 2026 06:00

Copilot started reviewing on behalf of zhaoan12-prc June 19, 2026 06:00

zhaoan12-prc marked this pull request as draft June 19, 2026 06:03

Copilot AI reviewed Jun 19, 2026

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated

Comment thread atom/plugin/rtpllm/models/glm5.py

zhaoan12-prc marked this pull request as ready for review June 19, 2026 07:57

Copilot AI review requested due to automatic review settings June 19, 2026 07:57

Copilot started reviewing on behalf of zhaoan12-prc June 19, 2026 07:57

Copilot AI reviewed Jun 19, 2026

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated

Comment thread atom/plugin/rtpllm/models/glm5.py

Comment thread atom/plugin/rtpllm/utils/forward_context.py

zhaoan12-prc added 15 commits June 19, 2026 17:50

feat: RTPLLM plugin GLM5 integration

8d2791e

feat: RTPLLM GLM5 enable cuda graph

bf1c92f

fix: RTP glm5 qwen35 cuda graph conflict

a49dc53

fix: RTP crash when long input_len > 16384

d9cbe9d

fix:[RTP] making GLM5 run true Sparse MLA

8281090

refactor: RTP glm5 code

8b92a5c

feat: RTP glm5 optimize sparse decode path

27be06e

refactor: RTP remove redundant envs

d6afeda

refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend

0afe687

fix: RTP GLM5 prefil reuse Sparse MLA metadata

b4997d6

fix: RTP GLM5 enable FP8 MLA path

d208756

feat: RTP GLM5 conflict issue after rebase

48089f9

fix: RTP plugin imports conflict after rebase main

d31dbb0

refactor: RTP GLM5 tests merge

21b8465

refactor: cleanup GLM5 RTP sparse MLA backend

a551185

zhaoan12-prc added 4 commits June 19, 2026 17:50

refactor: RTP remove redundant labels

d1ec87b

refactor: RTP GLM5 remove redundant code

8a441f9

refactor: RTP GLM5 remove mla redundant code

3540e0c

fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs

0a3d321

Copilot AI review requested due to automatic review settings June 19, 2026 09:50

zhaoan12-prc force-pushed the feat/rtp_atom_glm5_impl branch from 9d6ed82 to 0a3d321 June 19, 2026 09:50

Copilot started reviewing on behalf of zhaoan12-prc June 19, 2026 09:50

Copilot AI reviewed Jun 19, 2026

Comment thread atom/plugin/rtpllm/utils/forward_context.py

Comment thread atom/plugin/rtpllm/utils/forward_context.py

zhaoan12-prc wants to merge 19 commits into ROCm:main from zhaoan12-prc:feat/rtp_atom_glm5_impl

2 participants
