Feat: RTP-LLM plugin GLM5-FP8 support #1289
Open
zhaoan12-prc wants to merge 19 commits into
Pull request overview
Adds RTP-LLM plugin integration for GLM5-FP8 by introducing an ATOM-backed GLM5 wrapper, plus RTP-aware MLA attention/context handling (including sparse top-k plumbing) and associated contract/lifecycle tests. This extends ATOM’s existing plugin framework (prepare/config/rtpllm) so GLM5 can be constructed and loaded via ATOM while running through RTP’s external model path.
Changes:
- Added a new `ATOMGlm5MoeRTP` model wrapper with ATOM-based model construction and plugin-mode weight loading.
- Introduced an RTP MLA attention adapter and sparse MLA backend contracts, and extended RTP forward-context metadata to support GLM5 MLA/sparse flows.
- Added extensive RTP plugin tests covering GLM5 lifecycle, registration, patching/guards, and sparse backend/indexer contracts.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/plugin/test_rtpllm_prepare_model.py | Adds GLM5-specific prepare_model behavior test (MLA patch re-application). |
| tests/plugin/test_rtpllm_model_wrapper.py | Extends RTP model wrapper registration test expectations to include GLM5. |
| tests/plugin/test_rtpllm_glm5_wrapper_lifecycle.py | New lifecycle tests for GLM5 wrapper load/create/runtime behaviors. |
| tests/plugin/test_rtpllm_glm5_sparse_backend_contract.py | New contract-executable tests for sparse MLA backend behavior and invariants. |
| tests/plugin/test_rtpllm_glm5_registration.py | New tests asserting GLM5 model registration + alias wiring. |
| tests/plugin/test_rtpllm_glm5_ownership.py | New tests asserting GLM5 ownership/bridge-mode static contracts. |
| tests/plugin/test_rtpllm_glm5_mla_patch.py | Guards ensuring old monkey-patch paths are gone; validates MLA patch symbol updates. |
| tests/plugin/test_rtpllm_glm5_mla_bridge_shape.py | Shape-level execution tests for MLA adapter boundary. |
| tests/plugin/test_rtpllm_glm5_mha_bridge_guard.py | Static guard tests preventing unwanted MHA/Qwen patch usage + sparse kernel import-time deps. |
| tests/plugin/test_rtpllm_glm5_indexer_contract.py | New contract tests for indexer/topk buffer behavior and sparse backend threading. |
| tests/plugin/test_rtpllm_forward_context_semantics.py | Extends forward-context semantics tests (block table recovery, slot mapping, MLA layer mapping). |
| atom/plugin/rtpllm/utils/forward_context.py | Adds MLA-aware layer mapping, physical/kernel block table handling, and richer plugin metadata building for RTP mode. |
| `atom/plugin/rtpllm/utils/__init__.py` | Exposes new forward-context variants (MLA + Qwen3.5 hybrid). |
| atom/plugin/rtpllm/models/qwen3_5.py | Switches Qwen3.5 RTP runtime to the hybrid forward-context; updates cg prewarm buffers. |
| atom/plugin/rtpllm/models/glm5.py | Adds new GLM5 RTP wrapper + runtime using ATOM model creation/loading and RTP forward context binding. |
| atom/plugin/rtpllm/models/base_model_wrapper.py | Registers the GLM5 wrapper in RTP’s model factory + HF-arch mapping. |
| `atom/plugin/rtpllm/models/__init__.py` | Makes RTP model wrapper imports resilient when rtp_llm is absent; wires GLM5 arch into _ATOM_SUPPORTED_MODELS. |
| atom/plugin/rtpllm/attention_backend/rtp_sparse_mla_backend.py | Introduces sparse MLA backend + custom op registration and topk consumption contract. |
| atom/plugin/rtpllm/attention_backend/rtp_mla_metadata.py | Adds GLM5 MLA metadata/ownership static contracts. |
| atom/plugin/rtpllm/attention_backend/rtp_mla_attention.py | Adds RTPMLAAttention adapter + patch hook to swap ATOM Attention symbol for MLA. |
| `atom/plugin/rtpllm/attention_backend/__init__.py` | Refactors attention_backend exports to include MLA/sparse, with lazy attribute loading. |
| `atom/plugin/rtpllm/__init__.py` | Adds package root to keep import side-effect free. |
| atom/plugin/prepare.py | Adds GLM5 RTP path hook to apply MLA attention patch during RTP plugin prepare. |
| atom/plugin/config.py | Adjusts RTP plugin max_num_batched_tokens sizing to respect model max length. |
Comments suppressed due to low confidence (1)
atom/plugin/rtpllm/models/qwen3_5.py:401
`RTPForwardContext._build_req_id_per_token()` can require a prewarmed `cg_bufs['seq_id_i32']` to stay allocation-free during CUDA-graph capture. This prewarm dict currently only includes the int64 `seq_id`, so capture would either allocate (if casting) or fail (if we enforce no allocations). Add an int32 `seq_id_i32` buffer alongside `seq_id`.

    self._cg_meta_bufs: dict = {
        "query_start_loc": torch.arange(
            0, max_bs + 1, device=device, dtype=torch.int32
        ),
        "seq_id": torch.arange(0, max_bs, device=device, dtype=torch.int64),
        "block_col": torch.empty(max_bs, device=device, dtype=torch.int32),
        "block_col_i64": torch.empty(max_bs, device=device, dtype=torch.int64),
        "slot_base": torch.empty(max_bs, device=device, dtype=torch.int32),
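
The suggestion in this comment can be sketched as follows. This is a minimal illustration, not the PR's code: `build_cg_meta_bufs` and `seq_id_i32_view` are hypothetical helpers, only the `seq_id` and `seq_id_i32` key names come from the comment, and the device defaults to CPU so the sketch runs without a GPU.

```python
import torch


def build_cg_meta_bufs(max_bs: int, device: str = "cpu") -> dict:
    """Preallocate metadata buffers once, before any graph capture."""
    seq_id = torch.arange(0, max_bs, device=device, dtype=torch.int64)
    return {
        "seq_id": seq_id,
        # int32 twin created up front so capture never has to cast.
        "seq_id_i32": seq_id.to(torch.int32),
    }


def seq_id_i32_view(bufs: dict, batch_size: int) -> torch.Tensor:
    """Return the first batch_size ids as int32 without new tensor storage."""
    buf = bufs.get("seq_id_i32")
    if buf is None:
        # Strict policy: fail loudly instead of allocating during capture.
        raise RuntimeError("seq_id_i32 was not prewarmed")
    return buf[:batch_size]  # basic slicing returns a view of the same storage
```

The slice shares storage with the prewarmed buffer, whereas calling `.to(torch.int32)` on the int64 `seq_id` at capture time would create a new tensor.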
Motivation
Add RTP + ATOM integration for GLM5-FP8 so GLM5 can run through the RTP plugin path with ATOM model loading and attention/MoE execution.
Technical Details
- The plugin code is kept under `atom/plugin/rtpllm` to avoid affecting the main ATOM execution path.
- `RTP_LLM_EXTERNAL_MODEL_PACKAGES=atom.plugin.rtpllm.models`
- `LOAD_PYTHON_MODEL=1`
- `ENABLE_CUDA_GRAPH=1`
- `FP8_KV_CACHE=1`
- `MODEL_TYPE=glm_5`
- `ACT_TYPE=BF16`
Test Plan
Test Result