Skip to content

fix distributedAlloc tie-break to preserve distribution across physical GPUs#1788

Open
jonathan-meiri wants to merge 1 commit into
NVIDIA:mainfrom
jonathan-meiri:prefer-distinct-physical-gpu-on-tiebreak
Open

fix distributedAlloc tie-break to preserve distribution across physical GPUs#1788
jonathan-meiri wants to merge 1 commit into
NVIDIA:mainfrom
jonathan-meiri:prefer-distinct-physical-gpu-on-tiebreak

Conversation

@jonathan-meiri

Copy link
Copy Markdown

Summary

distributedAlloc is designed to spread allocations across replicated physical GPUs evenly, but a tie in its sort key — easy to reach once any prior pod has consumed a slot on one of the GPUs — falls through to sort.Slice's arbitrary order. In practice this means the loop keeps picking the next slot on the GPU it just picked from, instead of rotating to a sibling physical GPU that still has capacity.

Contributed by @Meiri28 on behalf of @runatom-ai.

What's here

Two commits, TDD-style:

  1. A failing test describing the bug — 2 physical GPUs, 1 slot on GPU-1 already allocated to another pod, new pod requests 2 slots. Today both slots deterministically land on GPU-0; the test asserts 1-from-each.
  2. The fix — introduce a small pickedFrom map and consult it as a tie-break sort key. The existing ordering remains primary. Behavior is unchanged whenever the primary sort key differs; the change only touches the previously-arbitrary tied case.

Commits are DCO-signed.

@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dims

dims commented May 31, 2026

Copy link
Copy Markdown
Contributor

cc @cdesiniotis @tariq1890

@tariq1890

Copy link
Copy Markdown
Contributor

@jonathan-meiri Can you squash your commit history?

… physical GPUs

distributedAlloc sorts candidate replicas by 'used' (total - available)
per underlying physical device. When two physical devices end up with
the same 'used' count after some picks, sort.Slice is unstable and
breaks the tie arbitrarily, which in practice causes the loop to keep
picking from whichever device's slots happened to land first in the
candidate list rather than rotating to a sibling physical device.

That manifests when the available pool starts uneven (e.g., one slot
on GPU-1 has already been allocated to another pod). The function name
and docstring promise an even spread across replicated GPUs; the
tie-break failure deterministically concentrates the new pod's slots
on the GPU that had more available replicas, leaving the other
physical device(s) untouched.

Introduce a pickedFrom map tracking how many slots have been taken
from each physical device during this allocation, and consult it as a
tie-break sort key. The existing 'used' ordering remains primary;
when two devices tie on that, the one we have not touched (or have
touched the least) this allocation comes first. Behavior is unchanged
whenever 'used' counts differ, including for fresh state and for
cases where only one physical device has free slots.

A new test in internal/rm/allocate_test.go captures the bug
deterministically: 2 physical GPUs, 1 slot on GPU-1 already allocated
to another pod, new pod requests 2 slots. Without the fix the test
fails (both slots land on GPU-0); with the fix it passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
@jonathan-meiri jonathan-meiri force-pushed the prefer-distinct-physical-gpu-on-tiebreak branch from 781587b to faa21d2 Compare June 1, 2026 16:29
@jonathan-meiri

Copy link
Copy Markdown
Author

@tariq1890 done 😄

@rajatchopra

rajatchopra commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

The changes look good to me. However, I feel the entire loop should be optimized. Currently, n needed allocations sort the gpus by available capacity n times. One sort should be enough and then we can decide on distributed allocation or a packed one, or a topology driven one. Also see #1621

@jonathan-meiri

Copy link
Copy Markdown
Author

Thanks @rajatchopra — both points well taken.

On the strategy framework: agreed that's a useful direction. I see #1621 is exploring similar ground; once this lands I'd be glad to help extend that work or contribute additional strategies as follow-ups.

On the loop: you're right, a heap keyed by (used, pickedFrom) brings it to O(n log m). I drafted it on top of this branch in #1826 (opened as draft, not asking for review yet). Happy to keep it as a separate follow-up after this lands, or fold it in here if you'd prefer.

jonathan-meiri pushed a commit to jonathan-meiri/k8s-device-plugin that referenced this pull request Jun 2, 2026
Follow-up to the tie-break fix in PR NVIDIA#1788.

The previous implementation sorted the full candidate list inside the
allocation loop, paying O(n log n) per iteration for n iterations and
giving O(n² log n) overall. Since all annotated replicas from the same
underlying physical device share the same sort key, sorting at the
replica granularity is wasted work — only m (the number of distinct
physical devices contributing candidates) needs to be reordered.

Refactor to:
  - Bucket candidates by their underlying physical device into a small
    gpuAllocState per device, holding `used`, `pickedFrom`, and the
    remaining annotated-ID candidates from that device.
  - Initialize a min-heap of these states ordered primarily by `used`
    (so devices with the fewest already-allocated replicas come first)
    and tie-broken by `pickedFrom` (so devices we have not touched in
    the current allocation are preferred when used counts match).
  - On each iteration pop the best device, take one of its remaining
    replicas, increment its counters, and push it back if more remain.

Total cost drops to O(n log m). The tie-break semantics from PR NVIDIA#1788
are preserved unchanged; existing tests still pass without modification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
@wkd-woo

wkd-woo commented Jun 7, 2026

Copy link
Copy Markdown

Author of #1621 here :) nice catch — I had not noticed the tie case falling through to sort order.

Our PRs touch the same loop in allocate.go, so whichever lands second will need a rebase. Happy to handle that on my side if yours goes in first. Your tie-break should slot into the refactored shape of #1621 pretty naturally too: replicaCount can carry a per-allocation picked count, and the distributed comparator just falls back to it on ties.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants