Skip to content

api/config/v1: keep user-specified rename/devices for shared resources#1825

Open
jonyhy96 wants to merge 3 commits into
NVIDIA:mainfrom
jonyhy96:feature/perdevice-uuid-timeslicing
Open

api/config/v1: keep user-specified rename/devices for shared resources#1825
jonyhy96 wants to merge 3 commits into
NVIDIA:mainfrom
jonyhy96:feature/perdevice-uuid-timeslicing

Conversation

@jonyhy96

@jonyhy96 jonyhy96 commented Jun 2, 2026

Copy link
Copy Markdown

disableResoureRenaming currently resets the Rename and Devices fields of every entry under sharing.timeSlicing.resources (and sharing.mps.resources) to their defaults. That makes it impossible to:

  • pin a shared resource to a specific subset of GPU UUIDs (per-UUID time-slicing), and
  • expose two resource names on the same node, e.g. nvidia.com/gpu for full cards and nvidia.com/gpu.shared for the time-sliced subset.

Preserve both fields and emit a warning instead, so that user intent is honored. Downstream resource-manager code already supports per-UUID device lists and custom rename targets, so no additional plumbing is required.

`disableResoureRenaming` currently resets the `Rename` and `Devices`
fields of every entry under `sharing.timeSlicing.resources` (and
`sharing.mps.resources`) to their defaults. That makes it impossible to:

  * pin a shared resource to a specific subset of GPU UUIDs
    (per-UUID time-slicing), and
  * expose two resource names on the same node, e.g.
    `nvidia.com/gpu` for full cards and `nvidia.com/gpu.shared` for
    the time-sliced subset.

Preserve both fields and emit a warning instead, so that user intent
is honored. Downstream resource-manager code already supports
per-UUID device lists and custom rename targets, so no additional
plumbing is required.

Signed-off-by: haoyun <haoyun.96@bytedance.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rajatchopra

Copy link
Copy Markdown
Contributor

@jonyhy96 Can you provide a corresponding unit test case here? Possibly one that shows how two resource names are used within a node.

Adds TestDisableResoureRenamingKeepsUserSpec covering:

  * per-UUID rename preserved when renameByDefault=false
  * per-UUID devices list preserved when renameByDefault=true
  * two resource names on a single node: nvidia.com/gpu for full cards
    and nvidia.com/gpu.shared for a UUID-selected subset (the use case
    requested by the reviewer)
  * nil receiver is a no-op

This exercises the behavior change in disableResoureRenaming so that
user-specified rename / devices fields survive the call, enabling
per-UUID time-slicing alongside an unmodified nvidia.com/gpu resource
on the same node.

Signed-off-by: haoyun <haoyun.96@bytedance.com>
@jonyhy96 jonyhy96 force-pushed the feature/perdevice-uuid-timeslicing branch from b5d1686 to 85954f6 Compare June 25, 2026 06:41
@jonyhy96

Copy link
Copy Markdown
Author

@rajatchopra Thanks for the review! I've pushed a unit test TestDisableResoureRenamingKeepsUserSpec in api/config/v1/replicas_test.go (commit 85954f6) covering four cases:

  1. Per-UUID rename preserved when renameByDefault=false.

  2. Per-UUID devices list preserved when renameByDefault=true with the default rename target.

  3. Two resource names on a single node — exactly the scenario you asked about: nvidia.com/gpu for full cards and nvidia.com/gpu.shared carrying only a UUID-selected subset (devices: ["GPU-"]). Any GPU on the node that isn't listed stays on the original nvidia.com/gpu resource name, so the two resource names coexist on the same node.

  4. nil receiver no-op for defensive coverage.

All four sub-tests pass locally (go test ./api/config/v1/... is green). PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants