
feat(quant): Add modelopt KV cache amax mapping. #4591

Open
mxinO wants to merge 3 commits into main from mxin/simulated-kv-cache-qarl


@mxinO mxinO commented Jun 30, 2026

Contributor

What does this PR do?

Adds a ModelOpt KV-cache amax mapping, used for NeMo-RL simulated KV-cache quantization.

Changelog

Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.

Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.

Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.
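
The derivation described in the changelog could look roughly like the sketch below. It is not the code in this PR: the dataclasses, the `derive_kv_amax` helper, and the `k_bmm_quantizer._amax` / `v_bmm_quantizer._amax` destination names on both sides are assumptions chosen for illustration. The point is only that the K/V amax names can be computed from a fused-QKV weight mapping instead of being listed per model.

```python
from dataclasses import dataclass
from typing import Optional

# All class, field, and parameter names below are hypothetical and for
# illustration only; they are not the bridge's or ModelOpt's actual API.
QKV_WEIGHT_SUFFIX = ".self_attention.linear_qkv.weight"


@dataclass(frozen=True)
class QKVMapping:
    megatron_param: str  # fused QKV weight name on the Megatron side
    q: str               # HF query projection weight name
    k: str               # HF key projection weight name
    v: str               # HF value projection weight name


@dataclass(frozen=True)
class AmaxMapping:
    megatron_param: str  # scalar amax buffer name on the Megatron side
    hf_param: str        # assumed destination name on the HF/rollout side


def _swap_weight_for_amax(hf_weight: str, quantizer: str) -> Optional[str]:
    """Turn '...k_proj.weight' into '...k_proj.<quantizer>._amax' (assumed naming)."""
    if not hf_weight.endswith(".weight"):
        return None
    return f"{hf_weight[: -len('.weight')]}.{quantizer}._amax"


def derive_kv_amax(mapping: QKVMapping) -> list[AmaxMapping]:
    """Derive K and V amax mappings from one fused-QKV weight mapping."""
    if not mapping.megatron_param.endswith(QKV_WEIGHT_SUFFIX):
        return []  # e.g. a bias mapping: nothing to derive
    # Keep everything up to and including '.self_attention'.
    parent = mapping.megatron_param[: -len(".linear_qkv.weight")]
    derived = []
    for hf_weight, quantizer in ((mapping.k, "k_bmm_quantizer"),
                                 (mapping.v, "v_bmm_quantizer")):
        hf_name = _swap_weight_for_amax(hf_weight, quantizer)
        if hf_name is None:
            return []  # unexpected HF name shape: derive nothing
        derived.append(
            AmaxMapping(f"{parent}.core_attention.{quantizer}._amax", hf_name)
        )
    return derived


qkv = QKVMapping(
    megatron_param="decoder.layers.*.self_attention.linear_qkv.weight",
    q="model.layers.*.self_attn.q_proj.weight",
    k="model.layers.*.self_attn.k_proj.weight",
    v="model.layers.*.self_attn.v_proj.weight",
)
for derived_mapping in derive_kv_amax(qkv):
    print(derived_mapping.megatron_param, "->", derived_mapping.hf_param)
```

Because the layer wildcard is carried through untouched, one derived pair covers every layer that the original QKV mapping covers.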

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Used for NVIDIA-NeMo/RL#3012

Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.

Constraint: Shared/tied-KV mappings with missing HF projections have no general rollout naming contract.

Rejected: Per-model KV mapping lists | duplicate existing QKV naming knowledge and do not scale.

Rejected: Native vLLM FP8 KV scales | real-runtime cache formats are outside this simulated-quant branch.

Confidence: high

Scope-risk: moderate

Directive: Define and test an explicit semantic contract before enabling mappings that allow missing HF names.

Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.

Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.
Signed-off-by: Meng Xin <mxin@nvidia.com>

copy-pr-bot Bot commented Jun 30, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Document that automatic KV-cache amax derivation currently targets only the main policy model. Speculative draft models and MTP layers remain excluded until their runtime quantizer destinations and refit contract are supported.

Confidence: high

Scope-risk: narrow

Tested: Pre-commit hooks for quant_mapping.py; git diff --check.

Not-tested: Runtime behavior is unchanged by this comment-only update.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO added the area:quant Quantization (PTQ, QAT, FP8 recipes) label Jul 2, 2026
Merge the latest origin/main while preserving branch history and the existing simulated KV mapping delta.

Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.

Confidence: high

Scope-risk: moderate

Tested: 68 standalone mapping tests, 68 NeMo-RL-pinned mapping tests, and pre-commit on branch-owned files.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO changed the title from "Preserve calibrated KV state across QARL refits" to "Add modelopt KV cache amax mapping." Jul 2, 2026
@mxinO mxinO marked this pull request as ready for review July 2, 2026 09:22
@mxinO mxinO changed the title from "Add modelopt KV cache amax mapping." to "feat(quant): Add modelopt KV cache amax mapping." Jul 2, 2026
@mxinO mxinO requested a review from yaoyu-33 July 2, 2026 09:23

claude Bot commented Jul 2, 2026

Contributor

LGTM - clean, focused change that derives K/V BMM quantizer amax mappings from eligible fused-QKV mappings, with thorough unit coverage.

Verified while reviewing:

  • derive_kv_bmm_amax_map filters strictly to QKVMapping; ConcatenatedQKVMapping is a sibling class (not a subclass), so fused vision QKV blocks are correctly excluded.
  • Bias mappings are correctly ignored because _derive_qkv_megatron_parent only matches the .self_attention.linear_qkv.weight suffix.
  • Derived mappings are AmaxMapping (replicated, allow_hf_name_mismatch=True), so no TP chunking is applied to these scalars.
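
The first two checks above can be illustrated with a minimal sketch. The class bodies and the `select_eligible` helper are hypothetical stand-ins, not the PR's code; only the class names and the weight suffix come from the review. It shows why an `isinstance` check against `QKVMapping` excludes a sibling class, and why a suffix match on the weight name drops bias mappings.

```python
# Hypothetical stand-ins for the real mapping classes; only the class
# relationship (siblings under a shared base) is taken from the review.
QKV_WEIGHT_SUFFIX = ".self_attention.linear_qkv.weight"


class BaseMapping:
    def __init__(self, megatron_param: str):
        self.megatron_param = megatron_param


class QKVMapping(BaseMapping):
    """Conventional fused-QKV mapping (eligible source)."""


class ConcatenatedQKVMapping(BaseMapping):
    """Sibling of QKVMapping, not a subclass, so isinstance() rejects it."""


def select_eligible(mappings):
    """Keep only QKVMapping instances that map the fused QKV weight."""
    return [
        m
        for m in mappings
        if isinstance(m, QKVMapping)
        and m.megatron_param.endswith(QKV_WEIGHT_SUFFIX)
    ]


candidates = [
    QKVMapping("decoder.layers.*.self_attention.linear_qkv.weight"),
    QKVMapping("decoder.layers.*.self_attention.linear_qkv.bias"),
    ConcatenatedQKVMapping(
        "vision_model.decoder.layers.*.self_attention.linear_qkv.weight"
    ),
]
eligible = select_eligible(candidates)
print([m.megatron_param for m in eligible])
```

Only the first candidate survives: the bias mapping fails the suffix test and the concatenated vision mapping fails the type test.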

No correctness or coverage gaps found.

Suggested test cases:

  • TestDeriveKvBmmAmaxMap::test_derives_kv_bmm_amax_mappings_from_qkv_mapping
  • TestDeriveKvBmmAmaxMap::test_preserves_wildcards_and_language_model_prefixes
  • TestDeriveKvBmmAmaxMap::test_skips_disallowed_qkv_shapes
  • TestDeriveKvBmmAmaxMap::test_skips_mappings_that_allow_missing_hf_projections
  • TestQuantMappingRegistryIntegration::test_quant_mappings_disabled_by_default
  • TestQuantMappingRegistryIntegration::test_kv_bmm_amax_forward_lookup
  • TestQuantMappingRegistryIntegration::test_kv_bmm_amax_reverse_lookup
  • TestQuantMappingRegistryIntegration::test_kv_bmm_amax_coexists_with_weight_and_input_quantizer_mappings
  • TestKvBmmQuantMappingPrefixes::test_registry_preserves_prefixes_and_wildcards
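
As a rough idea of what the forward and reverse lookup tests exercise, the sketch below translates a concrete parameter name through a wildcard pattern pair in both directions. The `translate` helper and both pattern strings are assumptions made for illustration; the actual registry API and naming may differ.

```python
import re
from typing import Optional

# Hypothetical pattern pair; '*' stands for a layer index.
MEGATRON_PATTERN = "decoder.layers.*.self_attention.core_attention.k_bmm_quantizer._amax"
HF_PATTERN = "model.layers.*.self_attn.k_proj.k_bmm_quantizer._amax"


def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a '*' wildcard pattern, capturing each '*' as a layer index."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + r"(\d+)".join(literal_parts) + "$")


def translate(name: str, src_pattern: str, dst_pattern: str) -> Optional[str]:
    """Map a concrete name from src_pattern to dst_pattern, or None if no match."""
    match = _pattern_to_regex(src_pattern).match(name)
    if match is None:
        return None
    result = dst_pattern
    for index in match.groups():
        # Fill wildcards left to right with the captured indices.
        result = result.replace("*", index, 1)
    return result


megatron_name = "decoder.layers.7.self_attention.core_attention.k_bmm_quantizer._amax"
hf_name = translate(megatron_name, MEGATRON_PATTERN, HF_PATTERN)   # forward
round_trip = translate(hf_name, HF_PATTERN, MEGATRON_PATTERN)      # reverse
print(hf_name)
print(round_trip == megatron_name)
```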

No perf tests impacted.

@yaoyu-33 yaoyu-33 added the feature (New capabilities, enhancements, or enablement work) and needs-review (PR is ready for code review and waiting on a reviewer) labels Jul 2, 2026

Labels

area:quant Quantization (PTQ, QAT, FP8 recipes) feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer


2 participants