Remove redundant get_kbit_device_map()#6158
Open
qgallouedec wants to merge 8 commits into
Open
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 53275a2. Configure here.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

QLoRA scripts set
model_init_kwargs["device_map"] = get_kbit_device_map()(which returns{"": local_process_index}) before loading the model. This is redundant: it never changes where the model lands. This PR removes the call everywhere and deletes the helper itself.Scope
examples/scripts/(24 files) andtrl/scripts/(5 files): drop thedevice_mapline + import.tests/test_grpo_trainer.py: drop thedevice_map=get_kbit_device_map()kwarg + import.trl/trainer/utils.py: deleteget_kbit_device_map.trl/__init__.py,trl/trainer/__init__.py: remove the public export (sofrom trl import get_kbit_device_mapno longer resolves — minor breaking change for a thin internal helper).Background
A bitsandbytes-quantized model is placed at load time and can't be moved afterward, so QLoRA needed an explicit
device_map. Born in #373 (2023) as inlinedevice_map = {"": 0}on a quantized reward model, generalized to{"": local_process_index}for DDP in #725, then extracted intoget_kbit_device_map()in #1176. A real 2023-era placement problem — which the trainer path now handles on its own (see below).Why it's redundant
Two cases, exhaustively:
MULTI_GPU/DEEPSPEED) — the trainer overridesmodel_init_kwargs["device_map"] = Nonebeforefrom_pretrained, so the value fromget_kbit_device_map()is discarded regardless.transformersauto-places quantized weights on the current CUDA device, which is exactly the{"": 0}the helper would have set.How it was demonstrated
Setup: 8×H100,
Qwen/Qwen2.5-0.5B,--load_in_4bit, bitsandbytes 0.49.2. Every check was run on bothtransformers==4.56.2(the minimum TRL supports) and5.13(current), with identical outcomes.Single-GPU load, no
device_map:from_pretrained(quantization_config=BitsAndBytesConfig(load_in_4bit=True))with nodevice_map: parameters land oncuda:0(not CPU) and a forward pass succeeds. Rules out the "withoutdevice_mapit falls back to CPU and breaks" assumption.DDP placement, no
device_map: 2-process run where the trainer forcesdevice_map=None: each rank still loads onto its own local GPU (rank 0 →cuda:0, rank 1 →cuda:1) and training steps run. Per-rank placement does not depend onget_kbit_device_map().Real script, with vs without the line:
trl/scripts/sft.pyon 2 GPUs (accelerate launch,--load_in_4bit --use_peft), baseline vs. a copy with the line removed. It runs in both cases and the losses are identical step-for-step:train_lossNote
Most edited scripts are trainer-loaded or pass the dict through
model_init_kwargs(the path covered by checks 2–3). A few load directly viafrom_pretrained(e.g.sft_video_llm.py, the GRPO VLM test); these are covered by check 1 for single-GPU and are reasoned-safe under DDP (the per-rank device is set before the model loads).Note
Medium Risk
Minor public API removal and reliance on transformers/trainer device placement for quantized loads; behavior was validated in the PR but QLoRA + multi-GPU remains sensitive.
Overview
Removes the
get_kbit_device_map()helper and stops settingdevice_mapfrom QLoRA example/CLI scripts whenquantization_configis present.quantization_configis still passed unchanged; placement is left to transformers on single process and to trainers that forcedevice_map=Noneunder MULTI_GPU / DEEPSPEED.from trl import get_kbit_device_mapis no longer exported (minor breaking change for external callers).DistillationTrainerandGOLDTrainernow explicitly setdevice_map=Nonewhen loading student/teacher from a path in distributed runs, matching the pattern already used in core trainers like DPO / GRPO.Reviewed by Cursor Bugbot for commit 627c405. Bugbot is set up for automated code reviews on this repo. Configure here.