feat: Simulated KV cache QARL #3012
Draft
mxinO wants to merge 12 commits into
ModelOpt already inserts and calibrates K/V fake quantizers in Megatron and vLLM, but Bridge did not export their scalar amax state. Add semantic Bridge mappings, strict rollout-side state loading, isolated FP8/NVFP4 recipes, and focused coverage so refits cannot silently retain dummy calibration.

Constraint: Native vLLM KV-cache storage and deployment scales are out of scope.
Rejected: Add per-model mappings in NeMo-RL | existing QKV mappings already own architecture-specific semantic prefixes.
Confidence: medium
Scope-risk: moderate
Directive: Keep native/real KV-cache synchronization separate from this simulated-quant path.
Tested: MBridge 68 unit tests; NeMo-RL 59 focused unit tests; FP8/NVFP4 live refit tests; two-step dense FP8 KV distillation; both repository pre-commit suites.
Not-tested: Nano3 hybrid refit exposed an unresolved HF-to-vLLM naming difference and remains pending.
Signed-off-by: Meng Xin <mxin@nvidia.com>
Nano3 exports K/V amax under a backbone root while vLLM stores the same attention state under model and an extra attn child. Normalize only that observed root alias and retain exact unique lookup so unrelated layer stacks cannot be selected silently.

Constraint: Hybrid vLLM attention modules use model.layers.*.mixer.attn while Bridge emits backbone.layers.*.mixer.
Rejected: Match every path below .layers. | could silently bind vision or secondary model stacks.
Confidence: high
Scope-risk: narrow
Directive: Add explicit aliases only when a model demonstrates a distinct HF-to-vLLM root contract; do not restore generic suffix matching.
Tested: 62 focused resolver/backend unit tests; NeMo-RL full pre-commit; Nano3 TP1/EP8 NVFP4-KV distillation smoke job 233795 with loss 0.008854 and grad norm 0.4612.
Signed-off-by: Meng Xin <mxin@nvidia.com>
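The root-alias rule in this commit can be illustrated with a small sketch. It is not the resolver from this branch: `ROOT_ALIASES`, `resolve_runtime_name`, and the shortened `k_amax` buffer names are hypothetical, and the extra `attn` child is left out to keep the example short. It only shows the idea of rewriting one known root prefix and then demanding exactly one exact match.

```python
# Hypothetical sketch: only the single observed root alias is rewritten,
# and the lookup is exact, never a suffix match.
ROOT_ALIASES = {"backbone.": "model."}


def resolve_runtime_name(exported: str, runtime_names: set[str]) -> str:
    """Map an exported amax name to exactly one runtime buffer name."""
    candidates = {exported}
    for src, dst in ROOT_ALIASES.items():
        if exported.startswith(src):
            candidates.add(dst + exported[len(src):])
    matches = sorted(c for c in candidates if c in runtime_names)
    if len(matches) != 1:
        # Zero matches or an ambiguous match both fail loudly.
        raise KeyError(f"expected exactly one match for {exported!r}, got {matches}")
    return matches[0]
```

Because matching is exact, a buffer with the same tail under an unrelated root (for example a vision stack) is never picked up by accident.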
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Use vLLM's model-owned HF mapper and PP-missing-layer contract for K/V amax resolution while retaining only the ModelOpt-specific terminal attention adaptation. Extend the existing packed collective stream to preserve scalar calibrated amax state without changing its wire format.

Constraint: vLLM owns HF-to-runtime naming and rollout pipeline-layer locality.
Rejected: Add more model-specific aliases or padding to the collective stream | duplicates upstream model knowledge or changes the existing transport protocol.
Confidence: high
Scope-risk: moderate
Directive: Keep K/V amax on the existing refit transport and fail closed for missing locally owned buffers.
Tested: MBridge unit suites 68+68; NeMo-RL vLLM suite 69; packed-tensor suite 3; Ruff and Pyrefly; PP=2 smoke job 237613; Nano3 hybrid smoke job 237614.
Not-tested: Integration against the eight origin/main commits newer than this branch base.
Signed-off-by: Meng Xin <mxin@nvidia.com>
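A minimal sketch of the "fail closed for missing locally owned buffers" directive, under stated assumptions: `load_amax`, the `owned_layers` range, and the plain dictionaries standing in for module buffers are all hypothetical, and vLLM's real pipeline-parallel filtering is not reproduced here. The point is the asymmetry: names on layers owned by another pipeline stage are skipped, while a missing buffer on a locally owned layer is an error.

```python
import re

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def load_amax(updates: dict[str, float], buffers: dict[str, float],
              owned_layers: range) -> list[str]:
    """Copy amax updates into local buffers and return the names loaded."""
    loaded = []
    for name, value in updates.items():
        match = _LAYER_RE.search(name)
        if match is not None and int(match.group(1)) not in owned_layers:
            continue  # layer lives on another pipeline stage: skip quietly
        if name not in buffers:
            # Locally owned but absent: refuse to keep stale calibration.
            raise KeyError(f"locally owned amax buffer missing: {name}")
        buffers[name] = value
        loaded.append(name)
    return loaded
```

Raising instead of skipping is what prevents a refit from silently leaving a dummy calibration value in place.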
K/V calibration state follows the same repeated refit snapshot as GEMM activation amax. Expose its buffers to the existing vLLM loader and retain only the runtime Attention-name adjustment required by ModelOpt.

Constraint: ModelOpt K/V amax is calibrated once and remains fixed during QARL training.
Rejected: Maintain a separate K/V resolver and per-refit reset protocol | duplicates vLLM loading and PP handling already used by GEMM amax.
Confidence: high
Scope-risk: narrow
Tested: 44 focused unit tests; FP8 and NVFP4 GPU refit integration job 237657; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>
Input and K/V quantizer amax buffers use one suffix contract and one vLLM loader predicate. The K/V-only inner Attention rename remains the sole specialization.

Constraint: Only enabled activation quantizers whose amax is exported by the policy belong in this suffix set.
Confidence: high
Scope-risk: narrow
Tested: focused activation-amax loader unit test; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>
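One way to picture "one suffix contract and one loader predicate" is a single tuple of suffixes consulted by a single function. The suffix strings and the names `ACTIVATION_AMAX_SUFFIXES` and `is_activation_amax` below are assumptions for illustration, not taken from the diff.

```python
# Assumed suffix set: input and K/V activation quantizers only.
ACTIVATION_AMAX_SUFFIXES = (
    ".input_quantizer._amax",
    ".k_bmm_quantizer._amax",
    ".v_bmm_quantizer._amax",
)


def is_activation_amax(name: str) -> bool:
    """Return True when a state name is an exported activation amax buffer."""
    # str.endswith accepts a tuple, so one call covers the whole contract.
    return name.endswith(ACTIVATION_AMAX_SUFFIXES)
```

Adding a quantizer to the contract then means adding one suffix, with no second code path to keep in sync.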
Document the transport distinction that made scalar quantizer amax safe in aligned IPC/ZMQ refits but exposed shape and alignment constraints in the unpadded collective stream.

Confidence: high
Scope-risk: narrow
Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
Place the scalar-shape and alignment explanation in restore_tensor's docstring because both behaviors are part of the helper's contract.

Confidence: high
Scope-risk: narrow
Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
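The scalar-shape and alignment points can be illustrated with a standard-library analogue. This is not the code in packed_tensor.py: the `pack` helper, the metadata layout, and the signature of `restore_tensor` here are hypothetical, and `struct` stands in for tensor views. It shows two things: a scalar has shape `()` but still occupies one element, and in an unpadded stream a 4-byte value can start at an offset that is not a multiple of 4.

```python
import math
import struct


def pack(items):
    """items: list of (struct_code, shape, flat_values); no padding is added."""
    payload = bytearray()
    meta = []
    for code, shape, values in items:
        n = math.prod(shape)  # math.prod(()) == 1: a scalar is one element
        assert len(values) == n
        meta.append((code, shape, len(payload)))
        payload += struct.pack(f"<{n}{code}", *values)
    return bytes(payload), meta


def restore_tensor(payload, code, shape, offset):
    """Rebuild one entry, keeping scalar shape and tolerating any byte offset."""
    n = math.prod(shape)
    # unpack_from copies the bytes, so a misaligned offset is not a problem.
    flat = struct.unpack_from(f"<{n}{code}", payload, offset)
    return flat[0] if shape == () else list(flat)
```

With three half-precision values followed by one float32 scalar, the scalar lands at byte offset 6, which is exactly the kind of misaligned position an unpadded stream produces.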
Explain why MBridge's HF-semantic K/V amax names need the ModelOpt inner Attention segment before entering the standard vLLM loader.

Directive: Keep model-specific HF mapping and PP filtering in vLLM's normal loader.
Confidence: high
Scope-risk: narrow
Tested: pre-commit on vllm_quant_backend.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
Merge the latest origin/main while preserving branch history and the existing simulated KV-cache QARL work.

Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.
Confidence: high
Scope-risk: moderate
Tested: Branch-wide pre-commit and both MBridge mapping suites.
Not-tested: Distributed KV refit at this exact commit; the following boundary-fix commit carries that verification.
Signed-off-by: Meng Xin <mxin@nvidia.com>
The runtime-name guard treated the tail of self_attn as if the inner vLLM attn module were already present. Require a dot-delimited .attn. segment and cover the real self_attn input shape so K/V amax reaches the existing generic loader.

Constraint: ModelOpt installs K/V BMM quantizers under vLLM attention inner modules.
Rejected: Restore the dedicated K/V resolver | the shared activation-amax loader works once the runtime path is unambiguous.
Confidence: high
Scope-risk: narrow
Tested: FP8 targeted distributed refit job 237882; full FP8/NVFP4 refit and focused unit job 237883; Ruff, formatting, and Pyrefly.
Signed-off-by: Meng Xin <mxin@nvidia.com>
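The boundary bug described here is easy to reproduce in isolation. The sketch below is an assumption-laden illustration, not the function in vllm_quant_backend.py: `to_runtime_kv_name` and the quantizer suffix strings are hypothetical. A check such as `prefix.endswith("attn")` is true for `self_attn`, so the inner segment never gets inserted; comparing the last dot-delimited segment avoids that.

```python
# Assumed K/V quantizer suffixes, for illustration only.
KV_SUFFIXES = (".k_bmm_quantizer._amax", ".v_bmm_quantizer._amax")


def to_runtime_kv_name(name: str) -> str:
    """Insert the inner attn segment for K/V amax names that lack it."""
    for suffix in KV_SUFFIXES:
        if name.endswith(suffix):
            prefix = name[: -len(suffix)]
            # Compare whole segments: "self_attn" must not count as "attn".
            if prefix.split(".")[-1] == "attn":
                return name
            return f"{prefix}.attn{suffix}"
    return name  # non-K/V names pass through unchanged
```

Names that already carry the inner segment, and names that are not K/V amax at all, are returned unchanged, so the function is safe to apply to every entry before the shared loader runs.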
The upstream squash commit 605c8167 contains the reviewed KV-cache amax mapping previously consumed through the temporary b9f18759 branch. Repointing the submodule removes that temporary dependency while preserving the tested mapping and refit behavior. The lockfile was regenerated inside the NeMo-RL container and only consolidates the duplicate Starlette resolution.

Constraint: The merged MBridge commit also advances its Megatron-LM submodule, so distributed refit compatibility must be verified before adoption.
Rejected: Retain the temporary MBridge pin | the mapping is now available from upstream and the merged pin passes the focused compatibility suite.
Confidence: high
Scope-risk: moderate
Directive: Regenerate uv.lock in the project container when changing the MBridge pin; do not edit it manually.
Tested: MBridge quant mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer 3 passed; uv lock --check and focused pre-commit passed with Taplo skipped due to its broken aarch64 source package.
Signed-off-by: Meng Xin <mxin@nvidia.com>
NeMo-RL main at 298159e includes the latest QARL nightly scheduling and unrelated PPO updates. The merge applied cleanly while preserving the simulated KV-cache implementation and the merged MBridge KV mapping pin at 605c8167.

Constraint: Preserve merge topology and upstream history; do not rebase or force-push this branch.
Confidence: high
Scope-risk: moderate
Directive: Keep the MBridge pin at or beyond 605c8167 while simulated KV-cache refit depends on its merged amax mapping.
Tested: MBridge mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer and recipe accounting 18 passed; uv lock check, Ruff, formatting, Pyrefly, recipe minimization, and diff checks passed.
Not-tested: Full nightly training matrix was not rerun.
Signed-off-by: Meng Xin <mxin@nvidia.com>
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information