perf(rm): replace per-iteration sort in distributedAlloc with a min-heap#1826
Draft
jonathan-meiri wants to merge 2 commits into
Draft
perf(rm): replace per-iteration sort in distributedAlloc with a min-heap#1826jonathan-meiri wants to merge 2 commits into
jonathan-meiri wants to merge 2 commits into
Conversation
… physical GPUs distributedAlloc sorts candidate replicas by 'used' (total - available) per underlying physical device. When two physical devices end up with the same 'used' count after some picks, sort.Slice is unstable and breaks the tie arbitrarily, which in practice causes the loop to keep picking from whichever device's slots happened to land first in the candidate list rather than rotating to a sibling physical device. That manifests when the available pool starts uneven (e.g., one slot on GPU-1 has already been allocated to another pod). The function name and docstring promise an even spread across replicated GPUs; the tie-break failure deterministically concentrates the new pod's slots on the GPU that had more available replicas, leaving the other physical device(s) untouched. Introduce a pickedFrom map tracking how many slots have been taken from each physical device during this allocation, and consult it as a tie-break sort key. The existing 'used' ordering remains primary; when two devices tie on that, the one we have not touched (or have touched the least) this allocation comes first. Behavior is unchanged whenever 'used' counts differ, including for fresh state and for cases where only one physical device has free slots. A new test in internal/rm/allocate_test.go captures the bug deterministically: 2 physical GPUs, 1 slot on GPU-1 already allocated to another pod, new pod requests 2 slots. Without the fix the test fails (both slots land on GPU-0); with the fix it passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com> Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
Follow-up to the tie-break fix in PR NVIDIA#1788. The previous implementation sorted the full candidate list inside the allocation loop, paying O(n log n) per iteration for n iterations and giving O(n² log n) overall. Since all annotated replicas from the same underlying physical device share the same sort key, sorting at the replica granularity is wasted work — only m (the number of distinct physical devices contributing candidates) needs to be reordered. Refactor to: - Bucket candidates by their underlying physical device into a small gpuAllocState per device, holding `used`, `pickedFrom`, and the remaining annotated-ID candidates from that device. - Initialize a min-heap of these states ordered primarily by `used` (so devices with the fewest already-allocated replicas come first) and tie-broken by `pickedFrom` (so devices we have not touched in the current allocation are preferred when used counts match). - On each iteration pop the best device, take one of its remaining replicas, increment its counters, and push it back if more remain. Total cost drops to O(n log m). The tie-break semantics from PR NVIDIA#1788 are preserved unchanged; existing tests still pass without modification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com> Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
82b1b23 to
b9da143
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up optimization stacked on #1788. Until #1788 lands, this PR's cumulative diff includes that commit too — the new work here is in the second commit (the heap refactor). Not for review until #1788 lands.
Replaces the per-iteration
sort.SliceindistributedAllocwith a min-heap keyed by(used, pickedFrom). Brings the loop fromO(n² log n)toO(n log m)wherenis replicas requested andmis the number of physical GPUs touched in this allocation. Same correctness as #1788; same tests pass.Practically,
nandmare small in real configurations and the wall-clock impact is invisible — this is structural cleanliness, not a hot-path speedup.Opened as draft so it doesn't enter the review queue alongside #1788. Happy to mark it ready as a separate follow-up after #1788 lands, or fold the change into #1788 if that's preferred.
Contributed by @Meiri28 on behalf of @runatom-ai.