feat(quant): Add modelopt KV cache amax mapping. #4591
Open
mxinO wants to merge 3 commits into
Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.
Constraint: Shared/tied-KV mappings with missing HF projections have no general rollout naming contract.
Rejected: Per-model KV mapping lists | duplicate existing QKV naming knowledge and do not scale.
Rejected: Native vLLM FP8 KV scales | real-runtime cache formats are outside this simulated-quant branch.
Confidence: high
Scope-risk: moderate
Directive: Define and test an explicit semantic contract before enabling mappings that allow missing HF names.
Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.
Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.
Signed-off-by: Meng Xin <mxin@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Document that automatic KV-cache amax derivation currently targets only the main policy model. Speculative draft models and MTP layers remain excluded until their runtime quantizer destinations and refit contract are supported.
Confidence: high
Scope-risk: narrow
Tested: Pre-commit hooks for quant_mapping.py; git diff --check.
Not-tested: Runtime behavior is unchanged by this comment-only update.
Signed-off-by: Meng Xin <mxin@nvidia.com>
Merge the latest origin/main while preserving branch history and the existing simulated KV mapping delta.
Constraint: Upstream updates must use merge commits rather than rebasing or force-pushing.
Confidence: high
Scope-risk: moderate
Tested: 68 standalone mapping tests, 68 NeMo-RL-pinned mapping tests, and pre-commit on branch-owned files.
Signed-off-by: Meng Xin <mxin@nvidia.com>
Contributor
LGTM - clean, focused change that derives K/V BMM quantizer amax mappings from eligible fused-QKV mappings, with thorough unit coverage. Verified while reviewing:
No correctness or coverage gaps found. Suggested test cases:
No perf tests impacted.
What does this PR do?
Adds a ModelOpt KV-cache amax mapping. The mapping is used for NeMo-RL simulated KV-cache quantization.
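For readers unfamiliar with the term, the sketch below illustrates what "simulated" quantization driven by a scalar amax generally means: values are clipped to `[-amax, amax]`, rounded onto a coarse grid, and immediately dequantized, so the tensor stays in floating point. It is not taken from this PR or from ModelOpt. The function name `fake_quantize` and the symmetric 8-bit integer grid are assumptions chosen for illustration only; the K/V quantizers this PR targets may use a different format.

```python
def fake_quantize(values, amax, num_bits=8):
    """Quantize-dequantize `values` on a symmetric grid bounded by `amax`.

    Hypothetical helper for illustration; not a ModelOpt API.
    """
    if amax <= 0:
        raise ValueError("amax must be positive")
    qmax = 2 ** (num_bits - 1) - 1   # largest grid index, 127 for 8 bits
    scale = amax / qmax              # width of one grid step
    out = []
    for v in values:
        clipped = max(-amax, min(amax, v))         # saturate to the calibrated range
        out.append(round(clipped / scale) * scale)  # snap to grid, return as float
    return out


if __name__ == "__main__":
    print(fake_quantize([0.0, 0.3, 5.0], amax=1.0))
```

Because the single scalar `amax` fully determines the grid in this sketch, it is the only calibration state that would need to travel with the model.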
Changelog
Simulated ModelOpt K/V quantizers keep scalar calibration state outside mapped linear weights. Derive replicated semantic mappings from conventional QKV mappings so the existing conversion stream carries that state without a second protocol.
Tested: 68 focused quant-mapping tests in the QARL CUDA/Transformer Engine container; focused ruff and format checks.
Not-tested: Distributed multi-rank conversion is covered by existing replicated mapping machinery, not a new dedicated topology test.
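The derivation described in the changelog can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the classes `QKVMapping` and `ReplicatedAmaxMapping`, the function `derive_kv_amax_mappings`, and the parameter-name patterns (including the `k_bmm_quantizer._amax` suffix) are all hypothetical stand-ins. The sketch assumes that a QKV mapping is eligible only when all three HF projection names are present, which mirrors the stated constraint on shared/tied-KV mappings with missing HF projections.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed suffix of a fused-QKV weight on the training side (illustrative).
QKV_SUFFIX = "linear_qkv.weight"


@dataclass(frozen=True)
class QKVMapping:
    """Hypothetical stand-in for a conventional fused-QKV mapping."""
    megatron_param: str
    q: Optional[str]
    k: Optional[str]
    v: Optional[str]


@dataclass(frozen=True)
class ReplicatedAmaxMapping:
    """Hypothetical mapping for a scalar amax that is replicated, not sharded."""
    megatron_param: str
    hf_param: str


def derive_kv_amax_mappings(qkv_mappings: List[QKVMapping]) -> List[ReplicatedAmaxMapping]:
    derived = []
    for m in qkv_mappings:
        # Skip mappings with missing HF projections: no naming contract exists for them.
        if m.q is None or m.k is None or m.v is None:
            continue
        if not m.megatron_param.endswith(QKV_SUFFIX):
            continue
        src_prefix = m.megatron_param[: -len(QKV_SUFFIX)]
        for proj, hf_name in (("k", m.k), ("v", m.v)):
            # Drop the trailing "<proj>_proj.weight" to get the attention module prefix.
            hf_prefix = hf_name.rsplit(".", 2)[0]
            amax_name = f"{proj}_bmm_quantizer._amax"  # assumed quantizer naming
            derived.append(
                ReplicatedAmaxMapping(src_prefix + amax_name, f"{hf_prefix}.{amax_name}")
            )
    return derived


if __name__ == "__main__":
    full = QKVMapping(
        "decoder.layers.*.self_attention.linear_qkv.weight",
        "model.layers.*.self_attn.q_proj.weight",
        "model.layers.*.self_attn.k_proj.weight",
        "model.layers.*.self_attn.v_proj.weight",
    )
    for mapping in derive_kv_amax_mappings([full]):
        print(mapping)
```

Reusing the existing QKV entries this way avoids maintaining a separate per-model KV list, which is the alternative the commit message rejects.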
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Used for NVIDIA-NeMo/RL#3012