fix: don't re-flatten vLLM server completion_ids in Online DPO by vineethsaivs · Pull Request #6146 · huggingface/trl

vineethsaivs · 2026-06-23T04:45:52Z

What does this PR do?

OnlineDPOTrainer._generate_vllm_server re-flattened the vLLM client's completion_ids:

completion_ids = [[comp_id] for prompt_completions in completion_ids for comp_id in prompt_completions]

VLLMClient.generate(...)["completion_ids"] already returns a list[list[int]] with one token-id list per completion (the same shape the colocate path produces and that GRPO uses). The comprehension iterated over every completion and every token, so a completion like [101, 102, 103] became three single-token "completions": [101], [102], [103]. Downstream, that makes completion_mask [1, 0, 0, ...] for every row and throws off the per-process row count.

The colocate path (_generate_vllm_colocate) and GRPOTrainer never apply this extra flatten, so server-mode Online DPO was the only inconsistent path. This PR removes the re-flatten so completion_ids is passed through unchanged, and adds a CPU regression test.
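To make the failure concrete, here is a minimal standalone sketch of what the removed comprehension does to an already-nested list. The sample token ids and the pad_mask helper are hypothetical and only illustrate the shape problem; they are not TRL code, and the padding length is an arbitrary choice.

```python
# Hypothetical sample of the client's "completion_ids" shape:
# one token-id list per completion.
completion_ids = [[101, 102, 103], [201, 202]]

# The removed comprehension: walks every completion AND every token,
# wrapping each token in its own list.
reflattened = [[comp_id] for prompt_completions in completion_ids for comp_id in prompt_completions]

# The fix: pass the client's output through unchanged.
passed_through = completion_ids


def pad_mask(rows, length):
    """Illustrative helper (an assumption, not TRL's): 1 for real tokens, 0 for padding."""
    return [[1] * len(row) + [0] * (length - len(row)) for row in rows]


print(reflattened)                   # [[101], [102], [103], [201], [202]]
print(pad_mask(reflattened, 3))      # every row is [1, 0, 0]
print(pad_mask(passed_through, 3))   # [[1, 1, 1], [1, 1, 0]]
```

Two completions turn into five rows, which is why both the mask and the row count go wrong.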

This supersedes the abandoned draft #5516 (a WIP with the same one-line removal, untouched since April); I commented on the issue first to coordinate.

Fixes #5514

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. See OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice #5514.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.

Note

Low Risk
Small behavioral fix in experimental Online DPO vLLM server path only; aligns with colocate behavior and is covered by a targeted unit test.

Overview
Fixes vLLM server-mode Online DPO generation where _generate_vllm_server incorrectly re-flattened VLLMClient.generate’s completion_ids. The client already returns list[list[int]] (one full token sequence per completion, same as colocate/GRPO); the removed comprehension treated each token as its own completion, breaking completion_mask and per-process row counts.

_generate_vllm_server now passes completion_ids through unchanged after the main-process generate call. A CPU regression test (test_generate_vllm_server_preserves_completion_token_lists) stubs gather/broadcast and the vLLM client to assert multi-token completions stay intact.
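The idea behind such a regression test can be sketched without TRL installed. In the sketch below, generate_via_server is a hypothetical stand-in for the fixed code path, the mocked client and the prompts keyword are assumptions, and the gather/broadcast stubbing that the real test performs is left out. It is not the test added in this PR.

```python
from unittest.mock import MagicMock


def generate_via_server(client, prompts):
    """Hypothetical stand-in for the fixed server path: no re-flatten."""
    output = client.generate(prompts=prompts)  # keyword name is an assumption
    return output["completion_ids"]


def test_preserves_completion_token_lists():
    # Stub client returning multi-token completions, one list per completion.
    client = MagicMock()
    client.generate.return_value = {"completion_ids": [[101, 102, 103], [201, 202]]}

    result = generate_via_server(client, ["prompt a", "prompt b"])

    # Each completion must stay intact, and the row count must match the input.
    assert result == [[101, 102, 103], [201, 202]]
    assert len(result) == 2


test_preserves_completion_token_lists()
```

Using multi-token completions in the stub matters: with single-token completions, the old re-flatten would have produced the same output and the test would pass either way.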

Reviewed by Cursor Bugbot for commit 6a9f32a. Bugbot is set up for automated code reviews on this repo.

OnlineDPOTrainer._generate_vllm_server re-flattened the completion_ids returned by the vLLM client, which is already a list[list[int]] with one token-id list per completion (the same shape the colocate path and GRPO produce). The comprehension iterated over every completion and every token, turning each token into its own single-token completion, which corrupts the completion mask and the per-process row count. Remove the re-flatten so completion_ids is passed through unchanged. Added a CPU regression test that mocks the vLLM client and the distributed gather/broadcast and asserts each completion is preserved. Fixes huggingface#5514

vineethsaivs wants to merge 1 commit into huggingface:main from vineethsaivs:fix/online-dpo-vllm-completion-ids
