Skip to content

fix: don't re-flatten vLLM server completion_ids in Online DPO#6146

Open
vineethsaivs wants to merge 1 commit into
huggingface:mainfrom
vineethsaivs:fix/online-dpo-vllm-completion-ids
Open

fix: don't re-flatten vLLM server completion_ids in Online DPO#6146
vineethsaivs wants to merge 1 commit into
huggingface:mainfrom
vineethsaivs:fix/online-dpo-vllm-completion-ids

Conversation

@vineethsaivs

@vineethsaivs vineethsaivs commented Jun 23, 2026

Copy link
Copy Markdown

What does this PR do?

OnlineDPOTrainer._generate_vllm_server re-flattened the vLLM client's completion_ids:

completion_ids = [[comp_id] for prompt_completions in completion_ids for comp_id in prompt_completions]

VLLMClient.generate(...)["completion_ids"] already returns a list[list[int]] with one token-id list per completion (the same shape the colocate path produces and that GRPO uses). The comprehension iterated over every completion and every token, so a completion like [101, 102, 103] became three single-token "completions" [101], [102], [103]. Downstream that makes completion_mask [1, 0, 0, ...] for every row and throws off the per-process row count. The colocate path (_generate_vllm_colocate) and GRPOTrainer never apply this extra flatten, so server-mode Online DPO was the only inconsistent path. This removes the re-flatten so completion_ids is passed through unchanged, and adds a CPU regression test.

This supersedes the abandoned draft #5516 (a WIP with the same one-line removal, untouched since April); I commented on the issue first to coordinate.

Fixes #5514

Before submitting

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.


Note

Low Risk
Small behavioral fix in experimental Online DPO vLLM server path only; aligns with colocate behavior and is covered by a targeted unit test.

Overview
Fixes vLLM server-mode Online DPO generation where _generate_vllm_server incorrectly re-flattened VLLMClient.generate’s completion_ids. The client already returns list[list[int]] (one full token sequence per completion, same as colocate/GRPO); the removed comprehension treated each token as its own completion, breaking completion_mask and per-process row counts.

_generate_vllm_server now passes completion_ids through unchanged after the main-process generate call. A CPU regression test (test_generate_vllm_server_preserves_completion_token_lists) stubs gather/broadcast and the vLLM client to assert multi-token completions stay intact.

Reviewed by Cursor Bugbot for commit 6a9f32a. Bugbot is set up for automated code reviews on this repo. Configure here.

OnlineDPOTrainer._generate_vllm_server re-flattened the completion_ids
returned by the vLLM client, which is already a list[list[int]] with one
token-id list per completion (the same shape the colocate path and GRPO
produce). The comprehension iterated over every completion and every
token, turning each token into its own single-token completion, which
corrupts the completion mask and the per-process row count. Remove the
re-flatten so completion_ids is passed through unchanged.

Added a CPU regression test that mocks the vLLM client and the distributed
gather/broadcast and asserts each completion is preserved.

Fixes huggingface#5514
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice

1 participant