feat: Simulated KV cache QARL #3012

Draft
mxinO wants to merge 12 commits into main from mxin/simulated-kv-cache-qarl

@mxinO

@mxinO mxinO commented Jun 30, 2026

Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

mxinO added 2 commits June 30, 2026 03:05
ModelOpt already inserts and calibrates K/V fake quantizers in Megatron and vLLM, but Bridge did not export their scalar amax state. Add semantic Bridge mappings, strict rollout-side state loading, isolated FP8/NVFP4 recipes, and focused coverage so refits cannot silently retain dummy calibration.

Constraint: Native vLLM KV-cache storage and deployment scales are out of scope.

Rejected: Add per-model mappings in NeMo-RL | existing QKV mappings already own architecture-specific semantic prefixes.

Confidence: medium

Scope-risk: moderate

Directive: Keep native/real KV-cache synchronization separate from this simulated-quant path.

Tested: MBridge 68 unit tests; NeMo-RL 59 focused unit tests; FP8/NVFP4 live refit tests; two-step dense FP8 KV distillation; both repository pre-commit suites.

Not-tested: Nano3 hybrid refit exposed an unresolved HF-to-vLLM naming difference and remains pending.
Signed-off-by: Meng Xin <mxin@nvidia.com>

Nano3 exports K/V amax under a backbone root while vLLM stores the same attention state under model and an extra attn child. Normalize only that observed root alias and retain exact unique lookup so unrelated layer stacks cannot be selected silently.

Constraint: Hybrid vLLM attention modules use model.layers.*.mixer.attn while Bridge emits backbone.layers.*.mixer.

Rejected: Match every path below .layers. | could silently bind vision or secondary model stacks.

Confidence: high

Scope-risk: narrow

Directive: Add explicit aliases only when a model demonstrates a distinct HF-to-vLLM root contract; do not restore generic suffix matching.

Tested: 62 focused resolver/backend unit tests; NeMo-RL full pre-commit; Nano3 TP1/EP8 NVFP4-KV distillation smoke job 233795 with loss 0.008854 and grad norm 0.4612.
Signed-off-by: Meng Xin <mxin@nvidia.com>
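
The root-alias normalization described in the Nano3 commit above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the function names, the regular expression, and the `k_bmm_quantizer._amax` buffer names are hypothetical, and only the `backbone.layers.*.mixer` to `model.layers.*.mixer.attn` rewrite comes from the commit message.

```python
import re

# Illustrative assumption: the single root rewrite described in the commit message.
_NANO3_ROOT = re.compile(r"^backbone\.layers\.(\d+)\.mixer\.")


def normalize_root(exported_name: str) -> str:
    """Rewrite only the observed Nano3 root; leave every other name untouched."""
    return _NANO3_ROOT.sub(r"model.layers.\1.mixer.attn.", exported_name, count=1)


def resolve_exact(exported_name: str, runtime_buffers: set) -> str:
    """Return the runtime buffer name, or fail instead of guessing by suffix."""
    candidate = normalize_root(exported_name)
    if candidate not in runtime_buffers:
        raise KeyError(f"no runtime buffer for {exported_name!r} (tried {candidate!r})")
    return candidate


# Hypothetical runtime buffers: a language stack and an unrelated vision stack.
runtime = {
    "model.layers.3.mixer.attn.k_bmm_quantizer._amax",
    "vision.layers.3.mixer.attn.k_bmm_quantizer._amax",
}
resolved = resolve_exact("backbone.layers.3.mixer.k_bmm_quantizer._amax", runtime)
```

Because the lookup is an exact set membership test rather than a match on everything below `.layers.`, the vision entry in this toy example can never be bound by accident, which is the failure mode the "Rejected" trailer names.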
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation label ("Improvements or additions to documentation") Jun 30, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 82b593c (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Use vLLM's model-owned HF mapper and PP-missing-layer contract for K/V amax resolution while retaining only the ModelOpt-specific terminal attention adaptation. Extend the existing packed collective stream to preserve scalar calibrated amax state without changing its wire format.

Constraint: vLLM owns HF-to-runtime naming and rollout pipeline-layer locality.

Rejected: Add more model-specific aliases or padding to the collective stream | duplicates upstream model knowledge or changes the existing transport protocol.

Confidence: high

Scope-risk: moderate

Directive: Keep K/V amax on the existing refit transport and fail closed for missing locally owned buffers.

Tested: MBridge unit suites 68+68; NeMo-RL vLLM suite 69; packed-tensor suite 3; Ruff and Pyrefly; PP=2 smoke job 237613; Nano3 hybrid smoke job 237614.

Not-tested: Integration against the eight origin/main commits newer than this branch base.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 1eeace9 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

K/V calibration state follows the same repeated refit snapshot as GEMM activation amax. Expose its buffers to the existing vLLM loader and retain only the runtime Attention-name adjustment required by ModelOpt.

Constraint: ModelOpt K/V amax is calibrated once and remains fixed during QARL training.

Rejected: Maintain a separate K/V resolver and per-refit reset protocol | duplicates vLLM loading and PP handling already used by GEMM amax.

Confidence: high

Scope-risk: narrow

Tested: 44 focused unit tests; FP8 and NVFP4 GPU refit integration job 237657; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: d5b1896 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Input and K/V quantizer amax buffers use one suffix contract and one vLLM loader predicate. The K/V-only inner Attention rename remains the sole specialization.

Constraint: Only enabled activation quantizers whose amax is exported by the policy belong in this suffix set.

Confidence: high

Scope-risk: narrow

Tested: focused activation-amax loader unit test; pre-commit on changed files
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 48717d5 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Document the transport distinction that made scalar quantizer amax safe in aligned IPC/ZMQ refits but exposed shape and alignment constraints in the unpadded collective stream.

Confidence: high

Scope-risk: narrow

Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
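
One way to picture the scalar-shape and alignment constraints mentioned in the commit above is the plain-Python sketch below. It is not the `packed_tensor.py` implementation: the `pack` and `restore` helpers, the metadata layout, and the buffer names are assumptions made for illustration, and lists of floats stand in for tensors.

```python
import math
import struct


def pack(entries):
    """Concatenate (name, shape, fmt, values) entries into one unpadded byte stream."""
    stream = bytearray()
    meta = []
    for name, shape, fmt, values in entries:
        assert len(values) == math.prod(shape)  # math.prod(()) is 1, so a scalar holds one value
        meta.append((name, shape, fmt, len(stream)))  # byte offset, no padding inserted
        stream += struct.pack(f"<{len(values)}{fmt}", *values)
    return bytes(stream), meta


def restore(stream, meta):
    """Rebuild each entry from the stream, keeping the original (possibly empty) shape."""
    out = {}
    for name, shape, fmt, offset in meta:
        numel = math.prod(shape)
        # unpack_from copies the bytes out, so an offset that is not a multiple
        # of the element size is still readable.
        values = struct.unpack_from(f"<{numel}{fmt}", stream, offset)
        out[name] = (shape, list(values))
    return out


# Hypothetical payload: three half-precision weights followed by one float32 scalar amax.
entries = [
    ("layer.weight", (3,), "e", [1.0, 2.0, 3.0]),
    ("layer.k_bmm_quantizer._amax", (), "f", [0.5]),
]
stream, meta = pack(entries)
restored = restore(stream, meta)
```

In this toy stream the scalar starts at byte offset 6, which is not a multiple of its 4-byte element size, and its shape is the empty tuple rather than `(1,)`. Those are the two properties a restore helper for an unpadded stream has to handle, whereas a transport that pads every entry to an aligned boundary would not run into the offset issue.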
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 28aeccc (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Place the scalar-shape and alignment explanation in restore_tensor's docstring because both behaviors are part of the helper's contract.

Confidence: high

Scope-risk: narrow

Tested: pre-commit on packed_tensor.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 3ffa2d7 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Explain why MBridge's HF-semantic K/V amax names need the ModelOpt inner Attention segment before entering the standard vLLM loader.

Directive: Keep model-specific HF mapping and PP filtering in vLLM's normal loader.

Confidence: high

Scope-risk: narrow

Tested: pre-commit on vllm_quant_backend.py
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 39b2d81 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

mxinO added 2 commits July 2, 2026 02:12
Merge the latest origin/main while preserving branch history and the existing simulated KV-cache QARL work.

Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.

Confidence: high

Scope-risk: moderate

Tested: Branch-wide pre-commit and both MBridge mapping suites.

Not-tested: Distributed KV refit at this exact commit; the following boundary-fix commit carries that verification.
Signed-off-by: Meng Xin <mxin@nvidia.com>

The runtime-name guard treated the tail of self_attn as if the inner vLLM attn module were already present. Require a dot-delimited .attn. segment and cover the real self_attn input shape so K/V amax reaches the existing generic loader.

Constraint: ModelOpt installs K/V BMM quantizers under vLLM attention inner modules.

Rejected: Restore the dedicated K/V resolver | the shared activation-amax loader works once the runtime path is unambiguous.

Confidence: high

Scope-risk: narrow

Tested: FP8 targeted distributed refit job 237882; full FP8/NVFP4 refit and focused unit job 237883; Ruff, formatting, and Pyrefly.
Signed-off-by: Meng Xin <mxin@nvidia.com>
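
The guard fix in the commit above can be illustrated with a small sketch. The exact faulty check is not shown in this conversation, so `naive_has_inner_attn` is a hypothetical reconstruction of the kind of substring test that would misfire on `self_attn`; the quantizer names and all function names are likewise assumptions, not the contents of `vllm_quant_backend.py`.

```python
# Assumed names for the ModelOpt K/V BMM quantizer modules.
KV_QUANTIZERS = ("k_bmm_quantizer", "v_bmm_quantizer")


def naive_has_inner_attn(name: str) -> bool:
    """Hypothetical faulty guard: the tail of 'self_attn.' also matches 'attn.'."""
    return "attn." in name


def has_inner_attn(name: str) -> bool:
    """Stricter guard: require a complete dot-delimited '.attn.' segment."""
    return ".attn." in name


def add_inner_attn(name: str) -> str:
    """Insert the inner 'attn' segment before a K/V quantizer if it is missing."""
    if has_inner_attn(name):
        return name
    parts = name.split(".")
    for i, part in enumerate(parts):
        if part in KV_QUANTIZERS:
            return ".".join(parts[:i] + ["attn"] + parts[i:])
    return name  # not a K/V quantizer buffer, leave untouched


hf_name = "model.layers.0.self_attn.k_bmm_quantizer._amax"
runtime_name = add_inner_attn(hf_name)
```

With the naive guard, `hf_name` would be treated as already carrying the inner module and passed through unchanged. With the dot-delimited guard it is rewritten once, and applying `add_inner_attn` a second time leaves the result as it is.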
@github-actions

github-actions Bot commented Jul 2, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 4cf2ef7 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@mxinO mxinO changed the title from "Mxin/simulated kv cache qarl" to "feat: Simulated kv cache QARL" Jul 2, 2026
@mxinO mxinO changed the title from "feat: Simulated kv cache QARL" to "feat: Simulated KV cache QARL" Jul 2, 2026
The upstream squash commit 605c8167 contains the reviewed KV-cache amax mapping previously consumed through the temporary b9f18759 branch. Repointing the submodule removes that temporary dependency while preserving the tested mapping and refit behavior. The lockfile was regenerated inside the NeMo-RL container and only consolidates the duplicate Starlette resolution.

Constraint: The merged MBridge commit also advances its Megatron-LM submodule, so distributed refit compatibility must be verified before adoption.
Rejected: Retain the temporary MBridge pin | the mapping is now available from upstream and the merged pin passes the focused compatibility suite.
Confidence: high
Scope-risk: moderate
Directive: Regenerate uv.lock in the project container when changing the MBridge pin; do not edit it manually.
Tested: MBridge quant mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer 3 passed; uv lock --check and focused pre-commit passed with Taplo skipped due to its broken aarch64 source package.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 3, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 7d611e9 (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

NeMo-RL main at 298159e includes the latest QARL nightly scheduling and unrelated PPO updates. The merge applied cleanly while preserving the simulated KV-cache implementation and the merged MBridge KV mapping pin at 605c8167.

Constraint: Preserve merge topology and upstream history; do not rebase or force-push this branch.
Confidence: high
Scope-risk: moderate
Directive: Keep the MBridge pin at or beyond 605c8167 while simulated KV-cache refit depends on its merged amax mapping.
Tested: MBridge mapping 68 passed; NeMo-RL FP8/NVFP4 refit and ModelOpt config 46 passed; packed transfer and recipe accounting 18 passed; uv lock check, Ruff, formatting, Pyrefly, recipe minimization, and diff checks passed.
Not-tested: Full nightly training matrix was not rerun.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@github-actions

github-actions Bot commented Jul 3, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 7c4e34a (PR #3012 from mxin/simulated-kv-cache-qarl)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨
