feat(quant): Add modelopt KV cache amax mapping. by mxinO · Pull Request #4591 · NVIDIA-NeMo/Megatron-Bridge

mxinO · 2026-06-30T11:35:54Z

What does this PR do ?

Add ModelOpt KV cache amax mapping. This is used for NeMo-RL simulated KV cache quantization.

Changelog

Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.
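As a rough illustration of the derivation described above, the sketch below rewrites one fused-QKV weight mapping into two scalar amax mappings, one for K and one for V. The classes `ToyQKVMapping` and `ToyAmaxMapping`, the helper `swap_leaf`, the example parameter names, and the `k_bmm_quantizer._amax` / `v_bmm_quantizer._amax` destination names are all assumptions made for this example. They are not the Megatron-Bridge API, and the code in this PR may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQKVMapping:
    """Stand-in for a fused-QKV mapping: one Megatron name, three HF names."""
    megatron_param: str
    q: str
    k: str
    v: str


@dataclass(frozen=True)
class ToyAmaxMapping:
    """Stand-in for a replicated scalar mapping (no tensor-parallel split)."""
    megatron_param: str
    hf_param: str


def swap_leaf(name: str, old_leaf: str, new_leaf: str) -> str:
    """Replace a trailing old_leaf with new_leaf, keeping prefix and wildcards."""
    if not name.endswith(old_leaf):
        raise ValueError(f"{name!r} does not end with {old_leaf!r}")
    return name[: -len(old_leaf)] + new_leaf


def derive_kv_amax(mapping: ToyQKVMapping) -> list[ToyAmaxMapping]:
    """Derive K and V amax mappings from one fused-QKV weight mapping."""
    derived = []
    for which, hf_weight in (("k", mapping.k), ("v", mapping.v)):
        # Hypothetical destination names; the real quantizer paths may differ.
        megatron = swap_leaf(
            mapping.megatron_param,
            ".linear_qkv.weight",
            f".core_attention.{which}_bmm_quantizer._amax",
        )
        hf = swap_leaf(
            hf_weight,
            f".{which}_proj.weight",
            f".{which}_bmm_quantizer._amax",
        )
        derived.append(ToyAmaxMapping(megatron, hf))
    return derived


# Example names are illustrative; the "*" layer wildcard passes through untouched.
qkv = ToyQKVMapping(
    megatron_param="language_model.decoder.layers.*.self_attention.linear_qkv.weight",
    q="model.language_model.layers.*.self_attn.q_proj.weight",
    k="model.language_model.layers.*.self_attn.k_proj.weight",
    v="model.language_model.layers.*.self_attn.v_proj.weight",
)

for m in derive_kv_amax(qkv):
    print(m.megatron_param, "<->", m.hf_param)
```

Because the derived names come from the existing QKV mapping, the model prefix and the layer wildcard are reused as-is, which is the reason no per-model KV mapping list is needed.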

Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.

Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Used for NVIDIA-NeMo/RL#3012

Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.

Constraint: Shared/tied-KV mappings with missing HF projections have no general rollout naming contract.
Rejected: Per-model KV mapping lists | duplicate existing QKV naming knowledge and do not scale.
Rejected: Native vLLM FP8 KV scales | real-runtime cache formats are outside this simulated-quant branch.
Confidence: high
Scope-risk: moderate
Directive: Define and test an explicit semantic contract before enabling mappings that allow missing HF names.
Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.
Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.
Signed-off-by: Meng Xin <mxin@nvidia.com>

copy-pr-bot · 2026-06-30T11:35:57Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Document that automatic KV-cache amax derivation currently targets only the main policy model. Speculative draft models and MTP layers remain excluded until their runtime quantizer destinations and refit contract are supported.

Confidence: high
Scope-risk: narrow
Tested: Pre-commit hooks for quant_mapping.py; git diff --check.
Not-tested: Runtime behavior is unchanged by this comment-only update.
Signed-off-by: Meng Xin <mxin@nvidia.com>

Merge the latest origin/main while preserving branch history and the existing simulated KV mapping delta.

Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.
Confidence: high
Scope-risk: moderate
Tested: 68 standalone mapping tests, 68 NeMo-RL-pinned mapping tests, and pre-commit on branch-owned files.
Signed-off-by: Meng Xin <mxin@nvidia.com>

claude · 2026-07-02T09:25:48Z

LGTM - clean, focused change that derives K/V BMM quantizer amax mappings from eligible fused-QKV mappings, with thorough unit coverage.

Verified while reviewing:

- derive_kv_bmm_amax_map filters strictly to QKVMapping; ConcatenatedQKVMapping is a sibling class (not a subclass), so fused vision QKV blocks are correctly excluded.
- Bias mappings are correctly ignored because _derive_qkv_megatron_parent only matches the .self_attention.linear_qkv.weight suffix.
- Derived mappings are AmaxMapping (replicated, allow_hf_name_mismatch=True), so no TP chunking is applied to these scalars.

No correctness or coverage gaps found.
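The eligibility checks listed above can be sketched as follows. This is a minimal illustration with toy stand-in classes, not the Megatron-Bridge implementation: `ToyMapping`, `ToyQKVMapping`, `ToyConcatenatedQKVMapping`, the `allow_missing_hf` flag, and the `eligible` helper are hypothetical names, and only the `.self_attention.linear_qkv.weight` suffix is taken from the review. It shows why an `isinstance` check excludes a sibling class, why bias mappings fall out of a suffix match, and how mappings that allow missing HF projections could be skipped.

```python
from dataclasses import dataclass
from typing import Iterable

# Suffix named in the review; bias parameters do not end with it.
QKV_WEIGHT_SUFFIX = ".self_attention.linear_qkv.weight"


@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical common base class for both fused-QKV variants."""
    megatron_param: str
    allow_missing_hf: bool = False  # hypothetical flag name


class ToyQKVMapping(ToyMapping):
    """Stand-in for the conventional fused-QKV mapping."""


class ToyConcatenatedQKVMapping(ToyMapping):
    """Stand-in for the vision-style mapping: a sibling, not a subclass."""


def eligible(mappings: Iterable[ToyMapping]) -> list[ToyMapping]:
    """Keep only mappings from which K/V amax mappings would be derived."""
    keep = []
    for m in mappings:
        # isinstance is False for sibling classes, so vision blocks are dropped.
        if not isinstance(m, ToyQKVMapping):
            continue
        # Only the fused weight qualifies; ".linear_qkv.bias" fails this test.
        if not m.megatron_param.endswith(QKV_WEIGHT_SUFFIX):
            continue
        # Shared/tied-KV style mappings have no agreed naming contract yet.
        if m.allow_missing_hf:
            continue
        keep.append(m)
    return keep


candidates = [
    ToyQKVMapping("decoder.layers.*.self_attention.linear_qkv.weight"),
    ToyQKVMapping("decoder.layers.*.self_attention.linear_qkv.bias"),
    ToyConcatenatedQKVMapping("vision.layers.*.self_attention.linear_qkv.weight"),
    ToyQKVMapping(
        "decoder.layers.*.self_attention.linear_qkv.weight",
        allow_missing_hf=True,
    ),
]

kept = eligible(candidates)
print([m.megatron_param for m in kept])
```

Only the first candidate survives all three checks in this toy setup.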

Suggested test cases:

- TestDeriveKvBmmAmaxMap::test_derives_kv_bmm_amax_mappings_from_qkv_mapping
- TestDeriveKvBmmAmaxMap::test_preserves_wildcards_and_language_model_prefixes
- TestDeriveKvBmmAmaxMap::test_skips_disallowed_qkv_shapes
- TestDeriveKvBmmAmaxMap::test_skips_mappings_that_allow_missing_hf_projections
- TestQuantMappingRegistryIntegration::test_quant_mappings_disabled_by_default
- TestQuantMappingRegistryIntegration::test_kv_bmm_amax_forward_lookup
- TestQuantMappingRegistryIntegration::test_kv_bmm_amax_reverse_lookup
- TestQuantMappingRegistryIntegration::test_kv_bmm_amax_coexists_with_weight_and_input_quantizer_mappings
- TestKvBmmQuantMappingPrefixes::test_registry_preserves_prefixes_and_wildcards

No perf tests impacted.

mxinO added the area:quant Quantization (PTQ, QAT, FP8 recipes) label Jul 2, 2026

mxinO changed the title ~~Preserve calibrated KV state across QARL refits~~ Add modelopt KV cache amax mapping. Jul 2, 2026

mxinO marked this pull request as ready for review July 2, 2026 09:22

mxinO changed the title ~~Add modelopt KV cache amax mapping.~~ feat(quant): Add modelopt KV cache amax mapping. Jul 2, 2026

copy-pr-bot Bot temporarily deployed to public July 2, 2026 09:23 Inactive

mxinO requested a review from yaoyu-33 July 2, 2026 09:23

copy-pr-bot Bot temporarily deployed to test July 2, 2026 09:23 Inactive

copy-pr-bot Bot temporarily deployed to public July 2, 2026 09:32 Inactive

copy-pr-bot Bot temporarily deployed to public July 2, 2026 09:58 Inactive

yaoyu-33 added feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer labels Jul 2, 2026

mxinO wants to merge 3 commits into main from mxin/simulated-kv-cache-qarl


2 participants
