feat: Simulated KV cache QARL by mxinO · Pull Request #3012 · NVIDIA-NeMo/RL

mxinO · 2026-06-30T11:37:11Z

Additional Information

Depends on NVIDIA-NeMo/Megatron-Bridge#4591 (feat(quant): Add modelopt KV cache amax mapping).

ModelOpt already inserts and calibrates K/V fake quantizers in Megatron and vLLM, but Bridge did not export their scalar amax state. Add semantic Bridge mappings, strict rollout-side state loading, isolated FP8/NVFP4 recipes, and focused coverage so refits cannot silently retain dummy calibration.

Constraint: Native vLLM KV-cache storage and deployment scales are out of scope.
Rejected: Add per-model mappings in NeMo-RL | existing QKV mappings already own architecture-specific semantic prefixes.
Confidence: medium
Scope-risk: moderate
Directive: Keep native/real KV-cache synchronization separate from this simulated-quant path.
Tested: MBridge 68 unit tests; NeMo-RL 59 focused unit tests; FP8/NVFP4 live refit tests; two-step dense FP8 KV distillation; both repository pre-commit suites.
Not-tested: Nano3 hybrid refit exposed an unresolved HF-to-vLLM naming difference and remains pending.
Signed-off-by: Meng Xin <mxin@nvidia.com>
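
The "strict rollout-side state loading" mentioned in this commit can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the PR's code: `load_kv_amax_strict`, the buffer names, and the dict-based state are hypothetical, and plain floats stand in for scalar amax tensors.

```python
from __future__ import annotations


def load_kv_amax_strict(
    rollout_state: dict[str, float], received: dict[str, float]
) -> None:
    """Overwrite every K/V amax buffer in rollout_state, or raise.

    Hypothetical helper: floats stand in for scalar amax tensors.
    """
    missing = sorted(set(rollout_state) - set(received))
    unexpected = sorted(set(received) - set(rollout_state))
    if missing or unexpected:
        # Fail before touching any buffer so a partial refit cannot leave
        # some layers on their dummy calibration values.
        raise KeyError(
            f"K/V amax mismatch: missing={missing}, unexpected={unexpected}"
        )
    for name, value in received.items():
        rollout_state[name] = value
```

Checking the full key set before writing means a failed refit leaves the previous state untouched rather than half-updated.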

Nano3 exports K/V amax under a backbone root while vLLM stores the same attention state under model and an extra attn child. Normalize only that observed root alias and retain exact unique lookup so unrelated layer stacks cannot be selected silently.

Constraint: Hybrid vLLM attention modules use model.layers.*.mixer.attn while Bridge emits backbone.layers.*.mixer.
Rejected: Match every path below .layers. | could silently bind vision or secondary model stacks.
Confidence: high
Scope-risk: narrow
Directive: Add explicit aliases only when a model demonstrates a distinct HF-to-vLLM root contract; do not restore generic suffix matching.
Tested: 62 focused resolver/backend unit tests; NeMo-RL full pre-commit; Nano3 TP1/EP8 NVFP4-KV distillation smoke job 233795 with loss 0.008854 and grad norm 0.4612.
Signed-off-by: Meng Xin <mxin@nvidia.com>
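
As an illustration of the alias-plus-exact-lookup idea in this commit, here is a hedged sketch. The function name, the `ROOT_ALIASES` table, and the quantizer leaf names are assumptions made for the example; only the `backbone.layers.*.mixer` to `model.layers.*.mixer.attn` relationship comes from the commit message.

```python
from __future__ import annotations

# Assumption: the only alias needed is the one observed for Nano3.
ROOT_ALIASES = {"backbone.": "model."}


def resolve_kv_amax_name(exported: str, runtime_names: set[str]) -> str:
    """Map an exported K/V amax name onto exactly one runtime buffer name."""
    candidate = exported
    for src, dst in ROOT_ALIASES.items():
        if exported.startswith(src):
            candidate = dst + exported[len(src):]
            break
    # Hybrid vLLM attention lives under an extra `attn` child of the mixer.
    head, sep, leaf = candidate.rpartition(".mixer.")
    if sep and not leaf.startswith("attn."):
        candidate = f"{head}.mixer.attn.{leaf}"
    # Exact lookup only: no suffix matching, so a vision or secondary stack
    # with the same tail can never be selected by accident.
    if candidate not in runtime_names:
        raise KeyError(f"no runtime buffer for {exported!r} (tried {candidate!r})")
    return candidate
```

A name from an unrelated stack fails loudly instead of binding to whichever buffer happens to share its suffix.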

copy-pr-bot · 2026-06-30T11:37:15Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2026-06-30T11:38:14Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 82b593c (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Use vLLM's model-owned HF mapper and PP-missing-layer contract for K/V amax resolution while retaining only the ModelOpt-specific terminal attention adaptation. Extend the existing packed collective stream to preserve scalar calibrated amax state without changing its wire format.

Constraint: vLLM owns HF-to-runtime naming and rollout pipeline-layer locality.
Rejected: Add more model-specific aliases or padding to the collective stream | duplicates upstream model knowledge or changes the existing transport protocol.
Confidence: high
Scope-risk: moderate
Directive: Keep K/V amax on the existing refit transport and fail closed for missing locally owned buffers.
Tested: MBridge unit suites 68+68; NeMo-RL vLLM suite 69; packed-tensor suite 3; Ruff and Pyrefly; PP=2 smoke job 237613; Nano3 hybrid smoke job 237614.
Not-tested: Integration against the eight origin/main commits newer than this branch base.
Signed-off-by: Meng Xin <mxin@nvidia.com>
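
The directive to "fail closed for missing locally owned buffers" while respecting pipeline-layer locality might look like the sketch below. It does not reproduce vLLM's own PP helper; `load_local_kv_amax`, the `owned_layers` range, and the `.layers.<index>.` naming pattern are assumptions for illustration.

```python
from __future__ import annotations

import re

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def load_local_kv_amax(
    received: dict[str, float],
    local_buffers: dict[str, float],
    owned_layers: range,
) -> list[str]:
    """Copy amax for locally owned layers; skip other stages; fail closed."""
    loaded = []
    for name, value in received.items():
        match = _LAYER_RE.search(name)
        if match is None:
            raise KeyError(f"no layer index in {name!r}")
        if int(match.group(1)) not in owned_layers:
            continue  # belongs to another pipeline stage, not an error
        if name not in local_buffers:
            # Owned layer without a matching buffer: refuse to continue.
            raise KeyError(f"locally owned buffer missing: {name!r}")
        local_buffers[name] = value
        loaded.append(name)
    return loaded
```

The distinction is that a layer owned by another pipeline stage is skipped silently, whereas a layer this rank owns but cannot find is treated as a hard error.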

github-actions · 2026-07-02T05:21:39Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 1eeace9 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

K/V calibration state follows the same repeated refit snapshot as GEMM activation amax. Expose its buffers to the existing vLLM loader and retain only the runtime Attention-name adjustment required by ModelOpt.

Constraint: ModelOpt K/V amax is calibrated once and remains fixed during QARL training.
Rejected: Maintain a separate K/V resolver and per-refit reset protocol | duplicates vLLM loading and PP handling already used by GEMM amax.
Confidence: high
Scope-risk: narrow
Tested: 44 focused unit tests; FP8 and NVFP4 GPU refit integration job 237657; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>

github-actions · 2026-07-02T05:59:23Z

✅ Submodule Fast-Forward Check Results

Check based on commit: d5b1896 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Input and K/V quantizer amax buffers use one suffix contract and one vLLM loader predicate. The K/V-only inner Attention rename remains the sole specialization.

Constraint: Only enabled activation quantizers whose amax is exported by the policy belong in this suffix set.
Confidence: high
Scope-risk: narrow
Tested: focused activation-amax loader unit test; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>
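
A single suffix contract with one predicate could be as small as the following sketch. The constant name, the predicate name, and the exact suffix strings are assumptions for illustration; the PR's real suffix set is not shown in this conversation.

```python
# Assumed suffixes for enabled activation quantizers whose amax is exported.
ACTIVATION_AMAX_SUFFIXES = (
    ".input_quantizer._amax",
    ".k_bmm_quantizer._amax",
    ".v_bmm_quantizer._amax",
)


def is_activation_amax(name: str) -> bool:
    """Return True if `name` ends with one of the shared amax suffixes."""
    # str.endswith accepts a tuple, so one call covers the whole contract.
    return name.endswith(ACTIVATION_AMAX_SUFFIXES)
```

The leading dot in each suffix keeps a module such as `my_input_quantizer` from matching by accident.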

github-actions · 2026-07-02T06:09:49Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 48717d5 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Document the transport distinction that made scalar quantizer amax safe in aligned IPC/ZMQ refits but exposed shape and alignment constraints in the unpadded collective stream.

Confidence: high
Scope-risk: narrow
Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
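
To make the scalar-shape and alignment point concrete, here is a small stdlib-only sketch of an unpadded packed stream. It is not the code in packed_tensor.py: `pack_entries` and `restore_values` are hypothetical names, Python lists stand in for tensors, and `struct` format characters stand in for dtypes.

```python
from __future__ import annotations

import math
import struct

Entry = tuple[str, tuple[int, ...], list[float]]  # (format char, shape, values)


def pack_entries(entries: list[Entry]) -> bytes:
    """Concatenate payloads back to back with no padding between them."""
    return b"".join(
        struct.pack(f"<{len(values)}{fmt}", *values) for fmt, _, values in entries
    )


def restore_values(
    buf: bytes, offset: int, fmt: str, shape: tuple[int, ...]
) -> tuple[list[float], int]:
    """Read one entry at `offset` and return (values, next offset)."""
    # math.prod(()) == 1: a 0-d scalar still occupies exactly one element.
    numel = math.prod(shape)
    # unpack_from copies out of the buffer, so the offset does not need to be
    # a multiple of the element size.
    values = list(struct.unpack_from(f"<{numel}{fmt}", buf, offset))
    return values, offset + numel * struct.calcsize(f"<{fmt}")
```

In this toy stream, three 2-byte values followed by a 4-byte scalar put the scalar at byte offset 6, which is not a multiple of 4. Reading it by copy works at any offset, whereas a reinterpreting view would generally need an aligned start.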

github-actions · 2026-07-02T06:35:16Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 28aeccc (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Place the scalar-shape and alignment explanation in restore_tensor's docstring because both behaviors are part of the helper's contract.

Confidence: high
Scope-risk: narrow
Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>

github-actions · 2026-07-02T06:37:46Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 3ffa2d7 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Explain why MBridge's HF-semantic K/V amax names need the ModelOpt inner Attention segment before entering the standard vLLM loader.

Directive: Keep model-specific HF mapping and PP filtering in vLLM's normal loader.
Confidence: high
Scope-risk: narrow
Tested: pre-commit on vllm_quant_backend.py
Signed-off-by: Meng Xin <mxin@nvidia.com>

github-actions · 2026-07-02T06:41:09Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 39b2d81 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Merge the latest origin/main while preserving branch history and the existing simulated KV-cache QARL work.

Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.
Confidence: high
Scope-risk: moderate
Tested: Branch-wide pre-commit and both MBridge mapping suites.
Not-tested: Distributed KV refit at this exact commit; the following boundary-fix commit carries that verification.
Signed-off-by: Meng Xin <mxin@nvidia.com>

The runtime-name guard treated the tail of self_attn as if the inner vLLM attn module were already present. Require a dot-delimited .attn. segment and cover the real self_attn input shape so K/V amax reaches the existing generic loader.

Constraint: ModelOpt installs K/V BMM quantizers under vLLM attention inner modules.
Rejected: Restore the dedicated K/V resolver | the shared activation-amax loader works once the runtime path is unambiguous.
Confidence: high
Scope-risk: narrow
Tested: FP8 targeted distributed refit job 237882; full FP8/NVFP4 refit and focused unit job 237883; Ruff, formatting, and Pyrefly.
Signed-off-by: Meng Xin <mxin@nvidia.com>
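
The boundary bug described in this commit is easy to reproduce in isolation. The sketch below is an assumption-laden illustration, not the code in vllm_quant_backend.py: the function names and quantizer leaf names are hypothetical, and only the `self_attn` versus `.attn.` distinction comes from the commit message.

```python
# Assumed leaf names for the K/V BMM quantizer amax buffers.
KV_LEAVES = (".k_bmm_quantizer._amax", ".v_bmm_quantizer._amax")


def buggy_has_inner_attn(name: str) -> bool:
    # Substring check: the tail of "self_attn." also contains "attn.".
    return "attn." in name


def has_inner_attn(name: str) -> bool:
    # Dot-delimited check: only a real `attn` path segment matches.
    return ".attn." in name


def to_runtime_kv_name(name: str) -> str:
    """Insert the inner `attn` segment before a K/V quantizer leaf, once."""
    if has_inner_attn(name):
        return name
    for leaf in KV_LEAVES:
        if name.endswith(leaf):
            return f"{name[: -len(leaf)]}.attn{leaf}"
    return name  # not a K/V amax name; leave it untouched
```

With the substring guard, a `self_attn` input is mistaken for an already-adjusted name and passes through unchanged. The dot-delimited guard rewrites it once and leaves the result stable on repeated calls.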

github-actions · 2026-07-02T09:14:13Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 4cf2ef7 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

The upstream squash commit 605c8167 contains the reviewed KV-cache amax mapping previously consumed through the temporary b9f18759 branch. Repointing the submodule removes that temporary dependency while preserving the tested mapping and refit behavior. The lockfile was regenerated inside the NeMo-RL container and only consolidates the duplicate Starlette resolution.

Constraint: The merged MBridge commit also advances its Megatron-LM submodule, so distributed refit compatibility must be verified before adoption.
Rejected: Retain the temporary MBridge pin | the mapping is now available from upstream and the merged pin passes the focused compatibility suite.
Confidence: high
Scope-risk: moderate
Directive: Regenerate uv.lock in the project container when changing the MBridge pin; do not edit it manually.
Tested: MBridge quant mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer 3 passed; uv lock --check and focused pre-commit passed with Taplo skipped due to its broken aarch64 source package.
Signed-off-by: Meng Xin <mxin@nvidia.com>

github-actions · 2026-07-03T06:56:35Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 7d611e9 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

NeMo-RL main at 298159e includes the latest QARL nightly scheduling and unrelated PPO updates. The merge applied cleanly while preserving the simulated KV-cache implementation and the merged MBridge KV mapping pin at 605c8167.

Constraint: Preserve merge topology and upstream history; do not rebase or force-push this branch.
Confidence: high
Scope-risk: moderate
Directive: Keep the MBridge pin at or beyond 605c8167 while simulated KV-cache refit depends on its merged amax mapping.
Tested: MBridge mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer and recipe accounting 18 passed; uv lock check, Ruff, formatting, Pyrefly, recipe minimization, and diff checks passed.
Not-tested: Full nightly training matrix was not rerun.
Signed-off-by: Meng Xin <mxin@nvidia.com>

github-actions · 2026-07-03T07:14:34Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 7c4e34a (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

mxinO added 2 commits June 30, 2026 03:05

github-actions Bot added the Documentation label (Improvements or additions to documentation) Jun 30, 2026

mxinO added 2 commits July 2, 2026 02:12

mxinO changed the title ~~Mxin/simulated kv cache qarl~~ feat: Simulated kv cache QARL Jul 2, 2026

mxinO changed the title ~~feat: Simulated kv cache QARL~~ feat: Simulated KV cache QARL Jul 2, 2026

mxinO mentioned this pull request Jul 2, 2026

feat(quant): Add modelopt KV cache amax mapping. NVIDIA-NeMo/Megatron-Bridge#4591

Merged

5 tasks

feat: Simulated KV cache QARL #3012
mxinO wants to merge 12 commits into main from mxin/simulated-kv-cache-qarl
