fix distributedAlloc tie-break to preserve distribution across physical GPUs#1788
fix distributedAlloc tie-break to preserve distribution across physical GPUs#1788jonathan-meiri wants to merge 1 commit into
Conversation
|
@jonathan-meiri Can you squash your commit history? |
… physical GPUs distributedAlloc sorts candidate replicas by 'used' (total - available) per underlying physical device. When two physical devices end up with the same 'used' count after some picks, sort.Slice is unstable and breaks the tie arbitrarily, which in practice causes the loop to keep picking from whichever device's slots happened to land first in the candidate list rather than rotating to a sibling physical device. That manifests when the available pool starts uneven (e.g., one slot on GPU-1 has already been allocated to another pod). The function name and docstring promise an even spread across replicated GPUs; the tie-break failure deterministically concentrates the new pod's slots on the GPU that had more available replicas, leaving the other physical device(s) untouched. Introduce a pickedFrom map tracking how many slots have been taken from each physical device during this allocation, and consult it as a tie-break sort key. The existing 'used' ordering remains primary; when two devices tie on that, the one we have not touched (or have touched the least) this allocation comes first. Behavior is unchanged whenever 'used' counts differ, including for fresh state and for cases where only one physical device has free slots. A new test in internal/rm/allocate_test.go captures the bug deterministically: 2 physical GPUs, 1 slot on GPU-1 already allocated to another pod, new pod requests 2 slots. Without the fix the test fails (both slots land on GPU-0); with the fix it passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com> Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
781587b to
faa21d2
Compare
|
@tariq1890 done 😄 |
|
The changes look good to me. However, I feel the entire loop should be optimized. Currently, |
|
Thanks @rajatchopra — both points well taken. On the strategy framework: agreed that's a useful direction. I see #1621 is exploring similar ground; once this lands I'd be glad to help extend that work or contribute additional strategies as follow-ups. On the loop: you're right, a heap keyed by |
Follow-up to the tie-break fix in PR NVIDIA#1788. The previous implementation sorted the full candidate list inside the allocation loop, paying O(n log n) per iteration for n iterations and giving O(n² log n) overall. Since all annotated replicas from the same underlying physical device share the same sort key, sorting at the replica granularity is wasted work — only m (the number of distinct physical devices contributing candidates) needs to be reordered. Refactor to: - Bucket candidates by their underlying physical device into a small gpuAllocState per device, holding `used`, `pickedFrom`, and the remaining annotated-ID candidates from that device. - Initialize a min-heap of these states ordered primarily by `used` (so devices with the fewest already-allocated replicas come first) and tie-broken by `pickedFrom` (so devices we have not touched in the current allocation are preferred when used counts match). - On each iteration pop the best device, take one of its remaining replicas, increment its counters, and push it back if more remain. Total cost drops to O(n log m). The tie-break semantics from PR NVIDIA#1788 are preserved unchanged; existing tests still pass without modification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com> Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
|
Author of #1621 here :) nice catch — I had not noticed the tie case falling through to sort order. Our PRs touch the same loop in |
Summary
distributedAllocis designed to spread allocations across replicated physical GPUs evenly, but a tie in its sort key — easy to reach once any prior pod has consumed a slot on one of the GPUs — falls through tosort.Slice's arbitrary order. In practice this means the loop keeps picking the next slot on the GPU it just picked from, instead of rotating to a sibling physical GPU that still has capacity.Contributed by @Meiri28 on behalf of @runatom-ai.
What's here
Two commits, TDD-style:
pickedFrommap and consult it as a tie-break sort key. The existing ordering remains primary. Behavior is unchanged whenever the primary sort key differs; the change only touches the previously-arbitrary tied case.Commits are DCO-signed.