feat(vmpool): add VirtualMachinePool for group VM management by fl64 · Pull Request #2572 · deckhouse/virtualization

fl64 · 2026-07-02T09:00:02Z

Description

DVP has no primitive to manage a group of identical virtual machines whose count changes over time. Every "I need N identical VMs and the number varies" scenario — CI runner fleets, VDI desktop pools — is solved with orchestration outside the platform: users write their own controller/scripts that create and delete VirtualMachines, watch their number, recreate lost ones and clean up after them. This duplicates logic and is error-prone around races and node failures.

This PR introduces VirtualMachinePool (paid editions only, EE/SE+): a namespaced resource that declaratively keeps a requested number of identical VMs and integrates with kubectl scale, HPA and KEDA through the standard scale subresource. Its template is an ordinary VirtualMachineSpec, so a replica is no different from a manually created VM.

This is a draft. The feature is delivered incrementally within this single PR; phases land as separate commits. Already implemented:

CRD VirtualMachinePool with the scale and status subresources, gated behind the VirtualMachinePool module feature gate (default off, locked off in CE).

Controller that maintains the replica count: creates replicas from the template, replaces disappeared ones, scales down (youngest-first for now), and reports status (replicas, readyReplicas, selector, Available/Progressing). It is cache-lag-safe via a ReplicaSet-style expectations tracker, so a lagging informer cache cannot double-create anonymous replicas.

Planned in later phases of this PR: scaleDownPolicy + a /scale guard webhook, addressed scale-down (scaleDownWith), in-place template propagation, and reusable disks.

One implementation note: the controller ships only in paid editions (compiled under the EE build tag), while the CRD/API is installed in every edition; the feature gate stays locked off in CE, so the resource simply does nothing there.

Why do we need it, and what problem does it solve?

Two widespread scenarios suffer most: CI/CD runners (GitLab Runner autoscaling expects a backend that can "give me N more" and reclaim idle ones) and VDI pools (warm desktops that self-heal on node failure). Without a group primitive, DVP cannot serve these natively, so each team reinvents the orchestration, usually with bugs in race and failure handling. VirtualMachinePool gives users a native, declarative backend for autoscaling fleets of VMs without writing their own replica controller.

What is the expected result?

With the VirtualMachinePool feature gate enabled (EE/SE+):

Create a VirtualMachinePool with spec.replicas: N and a spec.virtualMachineTemplate — the controller converges the number of VirtualMachines to N.
kubectl scale virtualmachinepool/<name> --replicas=M (or HPA/KEDA) scales the pool to M.
Deleting or losing a replica triggers a replacement once the old object is gone; a member in Stopped is kept, not duplicated.
kubectl get virtualmachinepool and .status report replicas / readyReplicas and the Available / Progressing conditions.

Checklist

The code is covered by unit tests.
e2e tests passed.
Documentation updated according to the changes.
Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: vmpool
type: feature
summary: "Add VirtualMachinePool (EE/SE+) for declarative group management of virtual machines, scalable via the standard scale subresource, HPA and KEDA."
impact_level: low

Introduce the VirtualMachinePool API type (namespaced, group virtualization.deckhouse.io/v1alpha2) with the scale and status subresources, generated deepcopy/client/lister/informer code and the CRD manifest. Gate the resource behind the VirtualMachinePool module feature gate (EE/SE+, default off; locked off in CE). No controller behaviour yet — the type and gate are the scaffold for the pool controller. Part of the VirtualMachinePool implementation (ADR: architecture-decision-records dvp/2026-06-29-vmpool.md). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add the VirtualMachinePool controller skeleton behind the EE build tag (//go:build EE) and the VirtualMachinePool feature gate: handler-chain reconciler with an empty chain and a primary watch on the resource. It is wired into the controller manager through build-tagged enterprise shims (setup_enterprise_{ee,ce}.go); the CE build compiles a no-op. No reconcile behaviour yet — replica maintenance, template propagation and reusable disks land in the follow-up slices. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

… tag

EE is the default shipped edition (werf.inc.yaml builds with -tags $MODULE_EDITION, default EE), but the unit-test task ran ginkgo without a build tag, so //go:build EE code was never exercised by the unit suite. Run ginkgo with --tags EE so enterprise code and its tests are covered. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add an in-memory, thread-safe expectations tracker (EE) modelled on the Kubernetes ReplicaSet UIDTrackingControllerExpectations: creations are counted, deletions tracked by UID, with a TTL safety valve. The pool reconciler will use it to avoid double-creating anonymous replicas while the informer cache lags behind a Create/Delete. Covered by unit tests (race-clean). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
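The idea behind such a tracker can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code and not the Kubernetes `UIDTrackingControllerExpectations` API: the `Tracker` type, its method names, the `defaultTTL` value and the `fakeClock` helper are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL is an assumed safety-valve value; the PR does not state the real one.
const defaultTTL = 5 * time.Minute

// expectation holds what a pool has requested but the watcher has not yet observed.
type expectation struct {
	creates int                 // creations issued, not yet observed
	deletes map[string]struct{} // UIDs with a deletion issued, not yet observed
	setAt   time.Time           // last time an expectation was raised
}

// Tracker is a hypothetical, simplified expectations tracker keyed by pool name.
type Tracker struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	pools map[string]*expectation
}

func NewTracker(ttl time.Duration, now func() time.Time) *Tracker {
	return &Tracker{ttl: ttl, now: now, pools: map[string]*expectation{}}
}

// raise returns the pool's entry (creating it if needed) and refreshes its
// timestamp. The caller must hold t.mu.
func (t *Tracker) raise(pool string) *expectation {
	e, ok := t.pools[pool]
	if !ok {
		e = &expectation{deletes: map[string]struct{}{}}
		t.pools[pool] = e
	}
	e.setAt = t.now()
	return e
}

func (t *Tracker) ExpectCreations(pool string, n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.raise(pool).creates += n
}

func (t *Tracker) ExpectDeletion(pool, uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.raise(pool).deletes[uid] = struct{}{}
}

func (t *Tracker) CreationObserved(pool string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.pools[pool]; ok && e.creates > 0 {
		e.creates--
	}
}

func (t *Tracker) DeletionObserved(pool, uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.pools[pool]; ok {
		delete(e.deletes, uid)
	}
}

// Satisfied reports whether the reconciler may trust its cached member list:
// nothing is pending, or the pending expectations are older than the TTL.
func (t *Tracker) Satisfied(pool string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.pools[pool]
	if !ok || (e.creates == 0 && len(e.deletes) == 0) {
		return true
	}
	return t.now().Sub(e.setAt) > t.ttl
}

// fakeClock is a manually advanced clock used only by the demo.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

func main() {
	clock := &fakeClock{}
	tr := NewTracker(defaultTTL, clock.Now)
	tr.ExpectCreations("pool-a", 2)
	fmt.Println(tr.Satisfied("pool-a")) // false: two creations are still unobserved
	tr.CreationObserved("pool-a")
	tr.CreationObserved("pool-a")
	fmt.Println(tr.Satisfied("pool-a")) // true: the cache has caught up
}
```

In this sketch a reconciler would skip creating or deleting replicas while `Satisfied` returns false, and the TTL keeps a lost watch event from blocking the pool forever.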

Implement the pool's core reconcile: list members by the managed pool-uid label + controllerRef, create missing replicas from the template (managed labels + controller ownerReference, GenerateName naming) and remove surplus ones, then publish status (replicas, readyReplicas, selector, Available/Progressing conditions). Every create/delete is guarded by the expectations tracker, and a member VirtualMachine watcher re-enqueues the owning pool and records observed creations/deletions — so a lagging informer cache cannot double-create anonymous replicas. Terminating members count toward a scale-down (invariant 2), so a replica already leaving is not over-replaced. Covered by unit tests (fake client, race-clean). The controller stays behind //go:build EE and the feature gate. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add the required spec.scaleDownPolicy enum (NewestFirst / OldestFirst / Explicit) and honour it when the pool is scaled down anonymously via the scale subresource: NewestFirst removes the youngest replicas first, OldestFirst the oldest, and Explicit removes nothing anonymously (such pools shrink only by addressed removal). The scale-subresource guard that rejects anonymous shrink under Explicit is added next. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
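A sketch of how the three policies could pick replicas for an anonymous scale-down. The `replica` type, the `pickVictims` function and the name-based tie-break are assumptions made for illustration; only the policy names and their ordering semantics come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

type ScaleDownPolicy string

const (
	NewestFirst ScaleDownPolicy = "NewestFirst"
	OldestFirst ScaleDownPolicy = "OldestFirst"
	Explicit    ScaleDownPolicy = "Explicit"
)

// replica is a hypothetical, trimmed-down view of a pool member.
type replica struct {
	Name        string
	CreatedUnix int64 // creation timestamp in seconds
}

// pickVictims returns the names of up to surplus replicas to remove anonymously.
// Under Explicit nothing is removed this way.
func pickVictims(policy ScaleDownPolicy, members []replica, surplus int) []string {
	if policy == Explicit || surplus <= 0 {
		return nil
	}
	// Sort a copy so the caller's slice keeps its order.
	sorted := append([]replica(nil), members...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.CreatedUnix == b.CreatedUnix {
			return a.Name < b.Name // assumed tie-break, not stated in the PR
		}
		if policy == NewestFirst {
			return a.CreatedUnix > b.CreatedUnix
		}
		return a.CreatedUnix < b.CreatedUnix
	})
	if surplus > len(sorted) {
		surplus = len(sorted)
	}
	names := make([]string, 0, surplus)
	for _, r := range sorted[:surplus] {
		names = append(names, r.Name)
	}
	return names
}

func main() {
	members := []replica{{"vm-a", 100}, {"vm-b", 300}, {"vm-c", 200}}
	fmt.Println(pickVictims(NewestFirst, members, 2)) // [vm-b vm-c]
	fmt.Println(pickVictims(OldestFirst, members, 1)) // [vm-a]
	fmt.Println(pickVictims(Explicit, members, 2))    // []
}
```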

Add a validating webhook on the virtualmachinepools/scale subresource that rejects a replicas decrease when the pool's scaleDownPolicy is Explicit, pointing the user to scaleDownWith for addressed removal. Growth and no-op scale updates are always allowed. The webhook is registered only in EE builds and self-gates on the VirtualMachinePool feature gate; its ValidatingWebhookConfiguration entry is rendered only when the gate is enabled. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
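The admission rule itself is small. The sketch below shows only the decision, not the webhook wiring; `validateScale` and the error text are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errExplicitShrink carries an illustrative message; the real wording is not given in the PR.
var errExplicitShrink = errors.New("scaleDownPolicy is Explicit: use scaleDownWith to remove specific replicas")

// validateScale rejects a replicas decrease under the Explicit policy.
// Growth and no-op updates are always allowed.
func validateScale(policy string, oldReplicas, newReplicas int32) error {
	if policy == "Explicit" && newReplicas < oldReplicas {
		return errExplicitShrink
	}
	return nil
}

func main() {
	fmt.Println(validateScale("Explicit", 3, 2) != nil)    // true: shrink is rejected
	fmt.Println(validateScale("Explicit", 3, 5) == nil)    // true: growth is allowed
	fmt.Println(validateScale("NewestFirst", 3, 1) == nil) // true: other policies may shrink
}
```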

Add the VirtualMachinePool meta object and the VirtualMachinePoolScaleDownWith body type (targets to remove) to the subresources.virtualization.deckhouse.io API group, with generated deepcopy/conversion/openapi. This is the type surface for the addressed scale-down handle; the aggregated-apiserver REST storage and wiring follow. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Register the virtualmachinepools resource and its scaleDownWith subresource in the existing aggregated apiserver (group subresources.virtualization.deckhouse.io). The handler validates that every target belongs to the pool, deletes them and atomically decrements spec.replicas on the main resource — bypassing the /scale guard, which is what lets Explicit pools shrink by address. The meta-object itself is not served (Get returns NotFound). Enterprise-only: the REST/storage live under //go:build EE and are wired into the apiserver group through a build-tagged hook; the CE build adds nothing. A write-capable client is threaded from the apiserver config. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Let the aggregated apiserver's service account get/update VirtualMachinePool (the scaleDownWith handler decrements spec.replicas) and reach the pool subresources. Grant the Editor cluster role management of VirtualMachinePool, its scale subresource (kubectl scale / HPA) and the scaleDownWith handle for addressed removal. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add the template-hash label (revision marker, not part of the member selector) stamped on every created replica, and report the rollout in status: desiredTemplateHash, updatedReplicas and the Synced condition (True once all live replicas are on the current virtualMachineTemplate). This makes the rollout observable at pool level. In-place patching of existing replicas on a template change follows. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add a template handler that patches each live replica's spec to the current virtualMachineTemplate and marks it on the new revision once applied. Re-patching is avoided with a patched-template-hash annotation (not a spec diff, which the apiserver mutates by defaulting), and the template-hash label is advanced only when the replica is not awaiting a restart, so status.updatedReplicas / restartPendingReplicas and the Synced condition (RolloutInProgress vs RestartPendingApproval) reflect what has effectively landed. Hot/cold is decided by the VM layer. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Replace time.Unix(1_700_000_000, 0) with time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC) in the pool tests — same deterministic clock, but self-explanatory. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Replace the inline dates with a single documented package-level referenceTime var per test package, and drop the clock/when aliases. A comment states the value is arbitrary — tests use only relative offsets and never read the wall clock — so the real-world date is irrelevant. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add spec.virtualDiskTemplates: each entry describes a per-replica disk with a reclaim policy — Delete (default; the disk belongs to its VirtualMachine and is removed with it) or Retain (the disk belongs to the pool, outlives the replica and is reused on scale-up), plus keep (warm buffer) and ttl for Retain disks. This is the schema for reusable disks; the reconcile behaviour (creation, reuse selection, GC) follows. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add an idempotent, self-healing disks handler: for every live member it ensures each Delete-policy virtualDiskTemplate disk exists (owned by the VirtualMachine, named <vm>-<template>, so it cascades away with the replica) and is referenced in the member's blockDeviceRefs. Also fix the template handler to merge block device refs when it patches a member's spec, so per-replica disk refs the pool attached are not wiped by a template change. Retain (reusable) disks come next. Covered by unit tests, including that a template patch keeps disk refs. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Extend the disks handler to Retain-policy templates: a member reuses a free pool-owned disk of the template (Ready and referenced by no live member) or, if none is free, gets a newly created pool-owned disk (named <pool>-<template>-<rand>) that outlives the replica. A per-pass guard prevents handing the same free disk to two members in one reconcile; the authoritative in-use signal is the members' blockDeviceRefs, not the platform InUse condition. Covered by unit tests (create, reuse-free, skip-busy). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The disks handler now ages free Retain disks: it stamps a free-since annotation when a disk leaves every member's blockDeviceRefs (the authoritative free signal — the platform InUse condition is unreliable, it flips on Stop) and clears it on reuse. Disks outside the warm buffer (keep newest) and older than the ttl are deleted with a resourceVersion precondition. free-since is persisted on the disk so the ttl survives controller restarts (in-memory timing would reset every restart and leak disks). Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
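The aging rule can be sketched as a pure function over the free disks. The `freeDisk` type and `expiredDisks` function are hypothetical, timestamps are plain seconds to keep the example short, and "newest" is assumed here to mean most recently freed, which the commit message does not spell out.

```go
package main

import (
	"fmt"
	"sort"
)

// freeDisk is a hypothetical view of a Retain disk that no live member references.
type freeDisk struct {
	Name          string
	FreeSinceUnix int64 // value of the free-since annotation, in seconds
}

// expiredDisks returns the free disks that fall outside the warm buffer of
// size keep and have been free for longer than ttlSeconds.
func expiredDisks(free []freeDisk, keep int, ttlSeconds, nowUnix int64) []string {
	sorted := append([]freeDisk(nil), free...)
	// Most recently freed first (an assumption about what "newest" means).
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FreeSinceUnix > sorted[j].FreeSinceUnix
	})
	var doomed []string
	for i, d := range sorted {
		if i < keep {
			continue // inside the warm buffer, never aged out
		}
		if nowUnix-d.FreeSinceUnix > ttlSeconds {
			doomed = append(doomed, d.Name)
		}
	}
	return doomed
}

func main() {
	free := []freeDisk{{"a", 100}, {"b", 900}, {"c", 500}, {"d", 950}}
	// keep=1 protects "d"; "b" is younger than the ttl; "c" and "a" are aged out.
	fmt.Println(expiredDisks(free, 1, 300, 1000)) // [c a]
}
```

Because the decision depends only on the persisted free-since value, the result is the same after a controller restart, which is the property the commit relies on.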

Add the fallback for reuse-disk collisions: if two live members reference the same pool-owned disk (a cross-pass race after a controller restart), detach it from all but the keeper (the member with BlockDevicesReady, or the lexicographically smallest name) so the others get a fresh disk on the next reconcile — the in-pass guard already prevents the common case. Also add edge-case tests: a Stopped member is counted and neither replaced nor duplicated (invariant 4); nil replicas mean zero; a non-Ready free disk is not reused; free-since is cleared on reuse; disks are not managed for a Terminating member. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
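The keeper rule for a collision could look like the sketch below. The `member` type and `pickKeeper` function are hypothetical; the choice among several members that all report BlockDevicesReady is an assumption (smallest name), since the commit message only covers the single-ready and none-ready cases.

```go
package main

import "fmt"

// member is a hypothetical view of a replica that references the contested disk.
type member struct {
	Name              string
	BlockDevicesReady bool
}

// better reports whether a should keep the disk in preference to b.
func better(a, b member) bool {
	if a.BlockDevicesReady != b.BlockDevicesReady {
		return a.BlockDevicesReady // a ready member wins
	}
	return a.Name < b.Name // otherwise the lexicographically smallest name
}

// pickKeeper returns the member that keeps the disk; every other member
// in sharing should have the disk detached.
func pickKeeper(sharing []member) (string, bool) {
	if len(sharing) == 0 {
		return "", false
	}
	best := sharing[0]
	for _, m := range sharing[1:] {
		if better(m, best) {
			best = m
		}
	}
	return best.Name, true
}

func main() {
	keeper, _ := pickKeeper([]member{{"vm-a", false}, {"vm-c", true}})
	fmt.Println(keeper) // vm-c: the ready member wins over the smaller name
}
```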

The virtualization-controller service account could not list/watch VirtualMachinePool, so the pool controller failed to start its watch and never reconciled. Add virtualmachinepools (+ status, + finalizers) to the controller ClusterRole. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The virtualization-api binary was built without -tags $MODULE_EDITION, so the EE-only aggregated-apiserver registration (compiled under //go:build EE) was dropped and the virtualmachinepools/scaleDownWith subresource returned 404. Build the apiserver with the edition tag like the controller, so the enterprise subresource is served in EE builds. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Reuse-disk selection required Ready, so a freshly created disk (still WaitForFirstConsumer / provisioning) was never considered free and a new one was created on every reconcile until the first bound — creating a burst of surplus disks. Reuse any free pool-owned disk, preferring a Ready one but otherwise attaching a still-provisioning one (attaching is what makes a WaitForFirstConsumer disk bind), and create a new disk only when none is free. Failed/terminating disks are skipped. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
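The corrected selection order can be sketched like this. The `poolDisk` type, the simplified phase names and `pickReusable` are assumptions for illustration; the real VirtualDisk status carries more states than the four modelled here.

```go
package main

import "fmt"

type phase string

// Simplified, assumed phase names.
const (
	phaseReady        phase = "Ready"
	phaseProvisioning phase = "Provisioning" // includes WaitForFirstConsumer in this sketch
	phaseFailed       phase = "Failed"
	phaseTerminating  phase = "Terminating"
)

// poolDisk is a hypothetical view of a pool-owned Retain disk.
type poolDisk struct {
	Name  string
	Phase phase
}

// pickReusable returns a free disk to attach, preferring a Ready one and
// falling back to one that is still provisioning. It returns false when a
// new disk has to be created.
func pickReusable(disks []poolDisk, inUse map[string]bool) (string, bool) {
	fallback := ""
	for _, d := range disks {
		if inUse[d.Name] {
			continue // referenced by a live member's blockDeviceRefs
		}
		switch d.Phase {
		case phaseReady:
			return d.Name, true
		case phaseProvisioning:
			if fallback == "" {
				fallback = d.Name
			}
		}
		// Failed and Terminating disks are skipped.
	}
	return fallback, fallback != ""
}

func main() {
	disks := []poolDisk{{"d1", phaseFailed}, {"d2", phaseProvisioning}, {"d3", phaseReady}}
	name, ok := pickReusable(disks, map[string]bool{"d3": true})
	fmt.Println(name, ok) // d2 true: the Ready disk is busy, so the provisioning one is attached
}
```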

…data

The template metadata embedded metav1.ObjectMeta, which controller-gen renders as an opaque object, so setting template.metadata.labels was rejected by strict decoding. Use a curated metadata struct with labels and annotations so the CRD schema exposes them. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Emit ReplicaSet-style events on the VirtualMachinePool so scaling is visible in kubectl describe / kubectl get events: SuccessfulCreate / FailedCreate on replica creation and SuccessfulDelete / FailedDelete on removal. FailedCreate surfaces admission errors (e.g. an invalid template) directly on the pool instead of only in controller logs. Messages follow the user-facing text conventions (English, full resource names, no internals). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Assert SuccessfulCreate is emitted per created replica, and that a failed creation emits FailedCreate and un-does the expectation (via an interceptor client that rejects Create) so the pool is not wedged. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add tests for the HTTP Connect handler (rejects an empty targets list with BadRequest; removes the target and reports success on a valid body) and for scaleDown returning NotFound when the pool is absent. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The pool has no minReplicas/maxUnavailable, so the Available condition means all desired replicas are ready; rename its reasons MinimumReplicasAvailable/Unavailable to AllReplicasReady/InsufficientReadyReplicas. Broaden the Progressing reason Scaling to ReplicasProgressing (it also covers replacing a lost replica) and make the messages state the situation plainly. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Reference a virtualDiskTemplates disk in the member template's blockDeviceRefs by its template name (a placeholder); the disks handler now resolves it in place — Delete -> <vm>-<template>, Retain -> a reuse disk — instead of always appending. This lets a pool express a per-replica writable root/boot disk with the correct position in the boot order, exactly like an ordinary VirtualMachine. Also sync the in-memory member after each attach so a member with two or more disk templates no longer clobbers earlier refs (and their order) within one reconcile pass. On template rollout the member's resolved refs are preserved instead of re-copying the template placeholders, which would dangle and duplicate. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
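In-place placeholder resolution can be sketched as below. The `ref` type, `resolveRefs` and the `pickRetain` callback are hypothetical; the `<vm>-<template>` naming for Delete disks comes from the commit messages, while the Retain disk name is whatever the reuse logic returns.

```go
package main

import "fmt"

// ref is a hypothetical, trimmed-down block device reference.
type ref struct {
	Kind string
	Name string
}

// resolveRefs returns a copy of refs in which every VirtualDisk entry whose
// name matches a disk template is replaced, at the same position, by the
// concrete disk for this replica. Other entries are left untouched.
func resolveRefs(refs []ref, vmName string, policies map[string]string, pickRetain func(template string) string) []ref {
	out := make([]ref, len(refs))
	copy(out, refs)
	for i, r := range out {
		if r.Kind != "VirtualDisk" {
			continue
		}
		switch policies[r.Name] {
		case "Delete":
			out[i].Name = vmName + "-" + r.Name // disk owned by the VirtualMachine
		case "Retain":
			out[i].Name = pickRetain(r.Name) // reused or freshly created pool-owned disk
		}
	}
	return out
}

func main() {
	refs := []ref{{"VirtualDisk", "root"}, {"VirtualImage", "tools"}, {"VirtualDisk", "cache"}}
	policies := map[string]string{"root": "Delete", "cache": "Retain"}
	resolved := resolveRefs(refs, "vm-1", policies, func(t string) string { return "pool-" + t + "-x1" })
	fmt.Println(resolved) // [{VirtualDisk vm-1-root} {VirtualImage tools} {VirtualDisk pool-cache-x1}]
}
```

Replacing the entry at its original index, instead of appending, is what preserves the boot order described above.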

The pool example used a shared image as the boot device and defined only a cache disk, so it had no per-replica writable root — misleading for the main use cases (CI runners, VDI). Reference a per-replica root VirtualDisk by its template name in boot order, and document how the underlying VM and disk names are formed. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

VirtualMachinePool is a distinct new resource; add "vmpool" to the changelog allowed_sections (both the PR check and the milestone aggregation) so its changelog entries validate instead of failing with "unknown section". Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Show how to autoscale a pool with an HPA on CPU (the pool publishes status.selector, so metrics are read from the replicas directly), note custom and external metrics / KEDA, and warn that scaleDownPolicy: Explicit lets an autoscaler scale up only. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The CRD had no doc-ru companion, failing the DMT linter and the doc-changes validation ("translation file is missing"). Add crds/doc-ru-virtualmachinepools.yaml covering the pool-specific fields (replicas, scaleDownPolicy, virtualMachineTemplate, virtualDiskTemplates, reclaim, status). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

… tag

Paid-edition features in this module (VolumeMigration, USB, the hotplug ones) are ordinary Apache-licensed code compiled in every edition and gated at runtime by their feature gate; only tiny edition shims use //go:build EE. VirtualMachinePool was the odd one out: its controller and apiserver code was //go:build EE, which kept it out of the default (CE) unit-test run and forced the whole suite to build with -tags EE, in turn breaking edition-default tests elsewhere (e.g. vmop's locked-feature case). Align it with the rest: drop the EE build tag (Apache headers), rely on the existing VirtualMachinePool feature gate (locked off in CE) that SetupController and the scale webhook already check, collapse the ee/ce setup and apiserver-install shims into single files, and revert the test:unit and virtualization-api build tweaks. No behaviour change in EE; the code is simply inert in CE. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Two reconcile behaviours for changes to virtualDiskTemplates:
- Resize: when a template's requested size grows, every existing disk of that template is grown to match (increase only; storage cannot shrink).
- Removal: when a template is deleted from the spec (as opposed to a disk merely freed from a scaled-down replica, which is kept for reuse and aged out by ttl), its disks are removed: free ones straight away, attached ones after a hot-unplug. A disk that is a running replica's boot (first) device cannot be hot-unplugged, so it is left until the replica is recreated.

Also sync the in-memory member after detach (as done for attach) so removing several disks from one member in a single pass does not clobber earlier edits, and drop unused test-helper parameters the linter now sees (vmpool test code is linted in CE since the build tag was removed). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

…a root disk

A compact base e2e that shows the feature works end to end: a pool of two tiny VMs (1 core / 5% coreFraction / 512Mi / alpine image) each gets its own Delete-policy root disk, both reach Running, and scaling to three converges. Skipped when the VirtualMachinePool feature gate is disabled. Heavier scenarios (reuse cycle, rollout, resize, HPA) are left for follow-up. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

When a disk template was removed from the spec, prune could delete the disk even if detaching it from a running replica had failed (a resourceVersion conflict on the busy VM). The disk then went Terminating while the VM still referenced it, so the VM hung on "waiting for block device ... is terminating". Delete a removed-template disk only once no live member references it (all detaches succeeded and it is not a boot device). Make detach conflict-safe by re-reading the member and retrying, so a busy running VM no longer loses the race. Found during in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
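The re-read-and-retry pattern for a conflict-safe detach can be sketched against an in-memory stand-in for the API server. Everything here (`vmStore`, its `interfere` hook, `detachWithRetry`, the attempt limit) is hypothetical and only models optimistic concurrency on a version counter; it does not use the real client.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: object was modified")

// vmStore is a hypothetical in-memory stand-in for the API server holding one VM.
type vmStore struct {
	refs      []string         // the VM's block device refs
	version   int              // bumped on every write, like resourceVersion
	interfere func(s *vmStore) // demo hook: a concurrent writer that runs once
}

// Get returns a snapshot of the refs together with the version it was read at.
func (s *vmStore) Get() ([]string, int) {
	return append([]string(nil), s.refs...), s.version
}

// Update writes refs only if version still matches the stored one.
func (s *vmStore) Update(refs []string, version int) error {
	if s.interfere != nil {
		f := s.interfere
		s.interfere = nil
		f(s)
	}
	if version != s.version {
		return errConflict
	}
	s.refs = refs
	s.version++
	return nil
}

// detachWithRetry removes disk from the VM's refs, re-reading the VM and
// retrying when the write loses a race with another writer.
func detachWithRetry(s *vmStore, disk string, attempts int) error {
	for i := 0; i < attempts; i++ {
		refs, version := s.Get() // fresh read on every attempt
		kept := refs[:0]
		for _, r := range refs {
			if r != disk {
				kept = append(kept, r)
			}
		}
		err := s.Update(kept, version)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return fmt.Errorf("detach %q: %w", disk, errConflict)
}

func main() {
	s := &vmStore{refs: []string{"root", "cache"}, version: 1}
	s.interfere = func(s *vmStore) { s.refs = append(s.refs, "extra"); s.version++ }
	fmt.Println(detachWithRetry(s, "cache", 3), s.refs) // <nil> [root extra]
}
```

Only after such a detach has succeeded for every live member would the removed-template disk be safe to delete, which is the ordering the fix enforces.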

github-actions Bot assigned fl64 Jul 2, 2026

fl64 added this to the v1.10.0 milestone Jul 2, 2026

fl64 added 28 commits July 3, 2026 00:24

test(vmpool): use a readable fixed date instead of a raw unix timestamp

6c9513f


fl64 added 5 commits July 3, 2026 00:24

fl64 force-pushed the feat/vmpool/implementation branch from 9111ad6 to 03d7987 Compare July 2, 2026 21:47

fl64 added 3 commits July 3, 2026 01:23

fl64 wants to merge 36 commits into main from feat/vmpool/implementation
