Feat: RTP-LLM plugin GLM5-FP8 support #1289

Open: zhaoan12-prc wants to merge 19 commits into ROCm:main from zhaoan12-prc:feat/rtp_atom_glm5_impl

Conversation

@zhaoan12-prc

Contributor

Motivation

Add RTP + ATOM integration for GLM5-FP8 so GLM5 can run through the RTP plugin path with ATOM model loading and attention/MoE execution.

Technical Details

  • Added GLM5-FP8 RTP plugin integration on top of ATOM model construction and weight loading.
  • Wired RTP plugin attention/context handling for GLM5 MLA and sparse attention paths.
  • Kept RTP-specific import and patch behavior isolated under atom/plugin/rtpllm to avoid affecting the main ATOM execution path.
  • Added/updated RTP plugin tests for GLM5 lifecycle and sparse backend behavior.
  • RTP-side enablement is controlled through the GLM5 FP8 server environment, including:
    • RTP_LLM_EXTERNAL_MODEL_PACKAGES=atom.plugin.rtpllm.models
    • LOAD_PYTHON_MODEL=1
    • ENABLE_CUDA_GRAPH=1
    • FP8_KV_CACHE=1
    • MODEL_TYPE=glm_5
    • ACT_TYPE=BF16
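
The environment settings listed above can be collected in one place before the server process is started. The following is a minimal sketch, assuming the variables are read from the process environment at startup. The helper name `build_server_env` and the constant `GLM5_FP8_ENV` are hypothetical, and the PR does not show the actual launch command, so none is included.

```python
import os

# Values copied from the PR description. How the RTP server is launched is
# not shown in the PR, so this sketch only prepares the environment mapping.
GLM5_FP8_ENV = {
    "RTP_LLM_EXTERNAL_MODEL_PACKAGES": "atom.plugin.rtpllm.models",
    "LOAD_PYTHON_MODEL": "1",
    "ENABLE_CUDA_GRAPH": "1",
    "FP8_KV_CACHE": "1",
    "MODEL_TYPE": "glm_5",
    "ACT_TYPE": "BF16",
}


def build_server_env(base=None):
    """Return a copy of `base` (default: os.environ) with the GLM5 FP8 settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(GLM5_FP8_ENV)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.Popen` or `subprocess.run`, which keeps the settings out of the parent shell.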

Test Plan

  • Verified RTP + ATOM GLM5 with BF16 MLA.
  • Verified RTP + ATOM GLM5 with FP8 MLA.
  • Benchmarked pure ATOM FP8 MLA as the baseline for comparison.

Test Result

  • Verified GLM5 RTP + ATOM with both FP8 MLA and BF16 MLA.
  • Compared RTP + ATOM FP8 MLA against pure ATOM FP8 MLA baseline.
  • RTP + ATOM FP8 MLA performance is broadly on par with pure ATOM:
    • RTP + ATOM FP8 MLA: TTFT 2290.07 ms, TPOT 34.67 ms
    • Pure ATOM FP8 MLA: TTFT 2231.86 ms, TPOT 33.53 ms
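
To put "broadly on par" in numbers, the relative overhead of the RTP path over the pure ATOM baseline can be computed from the four figures above. This is a small sketch; the function name `relative_overhead_pct` is illustrative.

```python
def relative_overhead_pct(candidate_ms, baseline_ms):
    """Percentage by which `candidate_ms` exceeds `baseline_ms`."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Figures from the Test Result section (milliseconds).
ttft_overhead = relative_overhead_pct(2290.07, 2231.86)
tpot_overhead = relative_overhead_pct(34.67, 33.53)

print(f"TTFT overhead: {ttft_overhead:.2f}%")
print(f"TPOT overhead: {tpot_overhead:.2f}%")
```

With these inputs the RTP + ATOM path is about 2.6% slower on TTFT and about 3.4% slower on TPOT than pure ATOM.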

Copilot AI review requested due to automatic review settings June 18, 2026 13:25
@zhaoan12-prc zhaoan12-prc marked this pull request as draft June 18, 2026 13:25

Copilot AI left a comment


Pull request overview

Adds RTP-LLM plugin integration for GLM5-FP8 by introducing an ATOM-backed GLM5 wrapper, plus RTP-aware MLA attention/context handling (including sparse top-k plumbing) and associated contract/lifecycle tests. This extends ATOM’s existing plugin framework (prepare/config/rtpllm) so GLM5 can be constructed and loaded via ATOM while running through RTP’s external model path.

Changes:

  • Added a new ATOMGlm5Moe RTP model wrapper with ATOM-based model construction and plugin-mode weight loading.
  • Introduced RTP MLA attention adapter + sparse MLA backend contracts, and extended RTP forward-context metadata to support GLM5 MLA/sparse flows.
  • Added extensive RTP plugin tests covering GLM5 lifecycle, registration, patching/guards, and sparse backend/indexer contracts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tests/plugin/test_rtpllm_prepare_model.py` | Adds GLM5-specific prepare_model behavior test (MLA patch re-application). |
| `tests/plugin/test_rtpllm_model_wrapper.py` | Extends RTP model wrapper registration test expectations to include GLM5. |
| `tests/plugin/test_rtpllm_glm5_wrapper_lifecycle.py` | New lifecycle tests for GLM5 wrapper load/create/runtime behaviors. |
| `tests/plugin/test_rtpllm_glm5_sparse_backend_contract.py` | New contract-executable tests for sparse MLA backend behavior and invariants. |
| `tests/plugin/test_rtpllm_glm5_registration.py` | New tests asserting GLM5 model registration + alias wiring. |
| `tests/plugin/test_rtpllm_glm5_ownership.py` | New tests asserting GLM5 ownership/bridge-mode static contracts. |
| `tests/plugin/test_rtpllm_glm5_mla_patch.py` | Guards ensuring old monkey-patch paths are gone; validates MLA patch symbol updates. |
| `tests/plugin/test_rtpllm_glm5_mla_bridge_shape.py` | Shape-level execution tests for MLA adapter boundary. |
| `tests/plugin/test_rtpllm_glm5_mha_bridge_guard.py` | Static guard tests preventing unwanted MHA/Qwen patch usage + sparse kernel import-time deps. |
| `tests/plugin/test_rtpllm_glm5_indexer_contract.py` | New contract tests for indexer/topk buffer behavior and sparse backend threading. |
| `tests/plugin/test_rtpllm_forward_context_semantics.py` | Extends forward-context semantics tests (block table recovery, slot mapping, MLA layer mapping). |
| `atom/plugin/rtpllm/utils/forward_context.py` | Adds MLA-aware layer mapping, physical/kernel block table handling, and richer plugin metadata building for RTP mode. |
| `atom/plugin/rtpllm/utils/__init__.py` | Exposes new forward-context variants (MLA + Qwen3.5 hybrid). |
| `atom/plugin/rtpllm/models/qwen3_5.py` | Switches Qwen3.5 RTP runtime to the hybrid forward-context; updates CUDA-graph (cg) prewarm buffers. |
| `atom/plugin/rtpllm/models/glm5.py` | Adds new GLM5 RTP wrapper + runtime using ATOM model creation/loading and RTP forward context binding. |
| `atom/plugin/rtpllm/models/base_model_wrapper.py` | Registers the GLM5 wrapper in RTP’s model factory + HF-arch mapping. |
| `atom/plugin/rtpllm/models/__init__.py` | Makes RTP model wrapper imports resilient when rtp_llm is absent; wires GLM5 arch into _ATOM_SUPPORTED_MODELS. |
| `atom/plugin/rtpllm/attention_backend/rtp_sparse_mla_backend.py` | Introduces sparse MLA backend + custom op registration and topk consumption contract. |
| `atom/plugin/rtpllm/attention_backend/rtp_mla_metadata.py` | Adds GLM5 MLA metadata/ownership static contracts. |
| `atom/plugin/rtpllm/attention_backend/rtp_mla_attention.py` | Adds RTPMLAAttention adapter + patch hook to swap ATOM Attention symbol for MLA. |
| `atom/plugin/rtpllm/attention_backend/__init__.py` | Refactors attention_backend exports to include MLA/sparse, with lazy attribute loading. |
| `atom/plugin/rtpllm/__init__.py` | Adds package root to keep import side-effect free. |
| `atom/plugin/prepare.py` | Adds GLM5 RTP path hook to apply MLA attention patch during RTP plugin prepare. |
| `atom/plugin/config.py` | Adjusts RTP plugin max_num_batched_tokens sizing to respect model max length. |
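
The "lazy attribute loading" noted for the attention_backend package is commonly done with a module-level `__getattr__` (PEP 562). The sketch below shows that general technique only and is not the PR's actual code. `make_lazy_package` and the attribute-to-module mapping are hypothetical, and standard-library modules stand in for the heavy backend modules.

```python
import importlib
import types


def make_lazy_package(name, lazy_map):
    """Build a module whose attributes are imported on first access.

    `lazy_map` maps an exported attribute name to the module that defines it.
    """
    pkg = types.ModuleType(name)

    def _lazy_getattr(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            target = lazy_map[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target), attr)
        setattr(pkg, attr, value)  # cache so later lookups skip this hook
        return value

    # PEP 562: a callable stored as `__getattr__` in the module namespace
    # is used as the fallback for missing attributes.
    pkg.__getattr__ = _lazy_getattr
    return pkg


# Stand-in example: `math` is imported only when `sqrt` is first requested.
demo = make_lazy_package("demo_backend", {"sqrt": "math"})
print(demo.sqrt(9.0))
```

In a real package the same hook would be a plain `def __getattr__(name)` in `__init__.py`, which lets `import` of the package succeed without pulling in kernel dependencies until an exported symbol is used.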
Comments suppressed due to low confidence (1)

atom/plugin/rtpllm/models/qwen3_5.py:401

  • RTPForwardContext._build_req_id_per_token() can require a prewarmed cg_bufs['seq_id_i32'] to stay allocation-free during CUDA-graph capture. This prewarm dict currently only includes the int64 seq_id, so capture would either allocate (if casting) or fail (if we enforce no allocations). Add an int32 seq_id_i32 buffer alongside seq_id.
        self._cg_meta_bufs: dict = {
            "query_start_loc": torch.arange(
                0, max_bs + 1, device=device, dtype=torch.int32
            ),
            "seq_id": torch.arange(0, max_bs, device=device, dtype=torch.int64),
            "block_col": torch.empty(max_bs, device=device, dtype=torch.int32),
            "block_col_i64": torch.empty(max_bs, device=device, dtype=torch.int64),
            "slot_base": torch.empty(max_bs, device=device, dtype=torch.int32),
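
One way to catch this kind of gap before CUDA-graph capture is to check the prewarm dict against the buffer names the forward context expects. The following is a sketch under assumptions: `missing_cg_buffers` and `REQUIRED_CG_BUFFERS` are hypothetical, the required list is illustrative rather than taken from the PR, and placeholder values stand in for tensors so the check runs without a GPU.

```python
# Hypothetical list of buffer names the forward context is assumed to need.
REQUIRED_CG_BUFFERS = ("query_start_loc", "seq_id", "seq_id_i32")


def missing_cg_buffers(cg_bufs, required=REQUIRED_CG_BUFFERS):
    """Return the required buffer names that are absent from the prewarm dict."""
    return [name for name in required if name not in cg_bufs]


# Placeholder values stand in for the preallocated tensors.
prewarm = {"query_start_loc": object(), "seq_id": object()}
print(missing_cg_buffers(prewarm))
```

Raising an error when the returned list is non-empty would surface the missing int32 buffer at setup time rather than as an allocation during capture.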

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated
@zhaoan12-prc zhaoan12-prc marked this pull request as ready for review June 19, 2026 06:00
Copilot AI review requested due to automatic review settings June 19, 2026 06:00
@zhaoan12-prc zhaoan12-prc marked this pull request as draft June 19, 2026 06:03

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated
Comment thread atom/plugin/rtpllm/models/glm5.py
@zhaoan12-prc zhaoan12-prc marked this pull request as ready for review June 19, 2026 07:57
Copilot AI review requested due to automatic review settings June 19, 2026 07:57

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Comment thread atom/plugin/rtpllm/utils/forward_context.py Outdated
Comment thread atom/plugin/rtpllm/models/glm5.py
Comment thread atom/plugin/rtpllm/utils/forward_context.py
Copilot AI review requested due to automatic review settings June 19, 2026 09:50
@zhaoan12-prc zhaoan12-prc force-pushed the feat/rtp_atom_glm5_impl branch from 9d6ed82 to 0a3d321 Compare June 19, 2026 09:50

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Comment thread atom/plugin/rtpllm/utils/forward_context.py
Comment thread atom/plugin/rtpllm/utils/forward_context.py

2 participants