feat(vmpool): add VirtualMachinePool for group VM management#2572

Draft
fl64 wants to merge 36 commits into main from feat/vmpool/implementation

fl64 (Member) commented Jul 2, 2026

Description

DVP has no primitive for managing a group of identical virtual machines whose count changes over time. Every "I need N identical VMs and the number varies" scenario (CI runner fleets, VDI desktop pools) is solved with orchestration outside the platform: users write their own controllers or scripts that create and delete VirtualMachines, track their count, recreate lost ones, and clean up after them. This duplicates logic and is error-prone around races and node failures.

This PR introduces VirtualMachinePool (paid editions only, EE/SE+): a namespaced resource that declaratively keeps a requested number of identical VMs and integrates with kubectl scale, HPA and KEDA through the standard scale subresource. Its template is an ordinary VirtualMachineSpec, so a replica is no different from a manually created VM.

This is a draft. The feature is delivered incrementally within this single PR; phases land as separate commits. Already implemented:

  • CRD VirtualMachinePool with the scale and status subresources, gated behind the VirtualMachinePool module feature gate (default off, locked off in CE).
  • Controller that maintains the replica count: creates replicas from the template, replaces disappeared ones, scales down (youngest-first for now), and reports status (replicas, readyReplicas, selector, Available/Progressing). It is cache-lag-safe via a ReplicaSet-style expectations tracker, so a lagging informer cache cannot double-create anonymous replicas.

Planned in later phases of this PR: scaleDownPolicy + a /scale guard webhook, addressed scale-down (scaleDownWith), in-place template propagation, and reusable disks.

One implementation note: the controller ships only in paid editions (compiled under the EE build tag), while the CRD/API is installed in every edition; the feature gate stays locked off in CE, so the resource simply does nothing there.

Why do we need it, and what problem does it solve?

Two widespread scenarios suffer most: CI/CD runners (GitLab Runner autoscaling expects a backend that can "give me N more" and reclaim idle ones) and VDI pools (warm desktops that self-heal on node failure). Without a group primitive, DVP cannot serve these natively and each team reinvents the orchestration, usually with bugs in race and failure handling. VirtualMachinePool gives users a native, declarative backend for autoscaling fleets of VMs without writing their own replica controller.

What is the expected result?

With the VirtualMachinePool feature gate enabled (EE/SE+):

  1. Create a VirtualMachinePool with spec.replicas: N and a spec.virtualMachineTemplate — the controller converges the number of VirtualMachines to N.
  2. kubectl scale virtualmachinepool/<name> --replicas=M (or HPA/KEDA) scales the pool to M.
  3. Deleting or losing a replica triggers a replacement once the old object is gone; a Stopped member is kept, not duplicated.
  4. kubectl get virtualmachinepool and .status report replicas / readyReplicas and the Available / Progressing conditions.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: vmpool
type: feature
summary: "Add VirtualMachinePool (EE/SE+) for declarative group management of virtual machines, scalable via the standard scale subresource, HPA and KEDA."
impact_level: low

fl64 added this to the v1.10.0 milestone Jul 2, 2026
fl64 added 28 commits July 3, 2026 00:24
Introduce the VirtualMachinePool API type (namespaced, group
virtualization.deckhouse.io/v1alpha2) with the scale and status
subresources, generated deepcopy/client/lister/informer code and the
CRD manifest. Gate the resource behind the VirtualMachinePool module
feature gate (EE/SE+, default off; locked off in CE). No controller
behaviour yet — the type and gate are the scaffold for the pool
controller.

Part of the VirtualMachinePool implementation (ADR: architecture-decision-records dvp/2026-06-29-vmpool.md).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add the VirtualMachinePool controller skeleton behind the EE build tag
(//go:build EE) and the VirtualMachinePool feature gate: handler-chain
reconciler with an empty chain and a primary watch on the resource. It
is wired into the controller manager through build-tagged enterprise
shims (setup_enterprise_{ee,ce}.go); the CE build compiles a no-op.

No reconcile behaviour yet — replica maintenance, template propagation
and reusable disks land in the follow-up slices.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
… tag

EE is the default shipped edition (werf.inc.yaml builds with
-tags $MODULE_EDITION, default EE), but the unit-test task ran ginkgo
without a build tag, so //go:build EE code was never exercised by the
unit suite. Run ginkgo with --tags EE so enterprise code and its tests
are covered.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add an in-memory, thread-safe expectations tracker (EE) modelled on the
Kubernetes ReplicaSet UIDTrackingControllerExpectations: creations are
counted, deletions tracked by UID, with a TTL safety valve. The pool
reconciler will use it to avoid double-creating anonymous replicas while
the informer cache lags behind a Create/Delete. Covered by unit tests
(race-clean).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
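
The expectations tracker described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the PR: the names `Tracker`, `ExpectCreations`, `Satisfied`, `fakeClock` and the `defaultTTL` value are hypothetical. It keeps the three properties the message names: creations are counted, deletions are tracked by UID, and a TTL releases a stuck expectation.

```go
package main

import (
	"sync"
	"time"
)

// defaultTTL is an assumed safety-valve duration, not a value from the PR.
const defaultTTL = 5 * time.Minute

// fakeClock is a controllable clock so the TTL path can be exercised in tests.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// expectation holds the in-flight operations of one pool.
type expectation struct {
	creates int                 // Create calls issued but not yet observed
	deletes map[string]struct{} // UIDs whose deletion is not yet observed
	setAt   time.Time           // when an expectation was last raised
}

// Tracker is an in-memory, mutex-guarded expectations store keyed by pool.
type Tracker struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	pools map[string]*expectation
}

func NewTracker(ttl time.Duration, now func() time.Time) *Tracker {
	return &Tracker{ttl: ttl, now: now, pools: map[string]*expectation{}}
}

// entry returns the expectation for a pool, creating it on first use.
// The caller must hold t.mu.
func (t *Tracker) entry(pool string) *expectation {
	e, ok := t.pools[pool]
	if !ok {
		e = &expectation{deletes: map[string]struct{}{}}
		t.pools[pool] = e
	}
	return e
}

// ExpectCreations is called before issuing n Create calls.
func (t *Tracker) ExpectCreations(pool string, n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e := t.entry(pool)
	e.creates += n
	e.setAt = t.now()
}

// ExpectDeletion is called before deleting the member with the given UID.
func (t *Tracker) ExpectDeletion(pool, uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e := t.entry(pool)
	e.deletes[uid] = struct{}{}
	e.setAt = t.now()
}

// CreationObserved is called when the watcher sees a new member
// (or when a Create call fails and the expectation must be undone).
func (t *Tracker) CreationObserved(pool string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e := t.entry(pool); e.creates > 0 {
		e.creates--
	}
}

// DeletionObserved is called when the watcher sees the member disappear.
func (t *Tracker) DeletionObserved(pool, uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.entry(pool).deletes, uid)
}

// Satisfied reports whether the cache can be trusted for this pool:
// nothing is in flight, or the expectation is older than the TTL.
func (t *Tracker) Satisfied(pool string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	e := t.entry(pool)
	if e.creates == 0 && len(e.deletes) == 0 {
		return true
	}
	return t.now().Sub(e.setAt) > t.ttl
}
```

In this sketch a reconciler would skip creating or deleting replicas while `Satisfied` returns false, which is what keeps a lagging cache from causing a double create.
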
Implement the pool's core reconcile: list members by the managed
pool-uid label + controllerRef, create missing replicas from the
template (managed labels + controller ownerReference, GenerateName
naming) and remove surplus ones, then publish status (replicas,
readyReplicas, selector, Available/Progressing conditions).

Every create/delete is guarded by the expectations tracker, and a
member VirtualMachine watcher re-enqueues the owning pool and records
observed creations/deletions — so a lagging informer cache cannot
double-create anonymous replicas. Terminating members count toward a
scale-down (invariant 2), so a replica already leaving is not
over-replaced. Covered by unit tests (fake client, race-clean).

The controller stays behind //go:build EE and the feature gate.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
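
One possible reading of the replica arithmetic in this commit, written as a small pure function. It assumes that a Terminating member still occupies a slot until its object is gone (so a loss is replaced only afterwards) and that Terminating members are credited against a pending scale-down. `planReplicas` and `plan` are hypothetical names; the real reconciler is additionally gated by the expectations tracker.

```go
package main

// plan is the number of replicas to create or delete in one pass.
type plan struct {
	Create int
	Delete int
}

// planReplicas compares the desired count with the live members.
// active counts members that are not being deleted; terminating counts
// members that already have a deletion timestamp.
func planReplicas(desired, active, terminating int) plan {
	total := active + terminating
	switch {
	case total < desired:
		// Slots are genuinely missing: create the difference.
		return plan{Create: desired - total}
	case active > desired:
		// Surplus beyond what is already leaving: delete only that part.
		return plan{Delete: active - desired}
	default:
		// Either converged, or waiting for terminating members to go away.
		return plan{}
	}
}
```

For example, with `desired=3`, `active=4`, `terminating=1` only one more replica is removed, and with `desired=3`, `active=2`, `terminating=1` nothing is created until the terminating object disappears.
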
Add the required spec.scaleDownPolicy enum (NewestFirst / OldestFirst /
Explicit) and honour it when the pool is scaled down anonymously via the
scale subresource: NewestFirst removes the youngest replicas first,
OldestFirst the oldest, and Explicit removes nothing anonymously (such
pools shrink only by addressed removal). The scale-subresource guard
that rejects anonymous shrink under Explicit is added next. Covered by
unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
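
The anonymous scale-down ordering can be illustrated with a short sketch. The policy names come from the commit message; the `member` type, the use of a Unix creation timestamp, the name-based tie-break and the function name `pickVictims` are assumptions made for the example.

```go
package main

import "sort"

type ScaleDownPolicy string

const (
	NewestFirst ScaleDownPolicy = "NewestFirst"
	OldestFirst ScaleDownPolicy = "OldestFirst"
	Explicit    ScaleDownPolicy = "Explicit"
)

// member is a simplified view of a replica (hypothetical type).
type member struct {
	Name        string
	CreatedUnix int64 // creation time in seconds
}

// pickVictims returns the names of the replicas to remove when the pool
// has `surplus` more replicas than desired. Explicit removes nothing.
func pickVictims(policy ScaleDownPolicy, members []member, surplus int) []string {
	if policy == Explicit || surplus <= 0 {
		return nil
	}
	sorted := append([]member(nil), members...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.CreatedUnix == b.CreatedUnix {
			return a.Name < b.Name // assumed deterministic tie-break
		}
		if policy == NewestFirst {
			return a.CreatedUnix > b.CreatedUnix
		}
		return a.CreatedUnix < b.CreatedUnix
	})
	if surplus > len(sorted) {
		surplus = len(sorted)
	}
	names := make([]string, 0, surplus)
	for _, m := range sorted[:surplus] {
		names = append(names, m.Name)
	}
	return names
}
```
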
Add a validating webhook on the virtualmachinepools/scale subresource
that rejects a replicas decrease when the pool's scaleDownPolicy is
Explicit, pointing the user to scaleDownWith for addressed removal.
Growth and no-op scale updates are always allowed. The webhook is
registered only in EE builds and self-gates on the VirtualMachinePool
feature gate; its ValidatingWebhookConfiguration entry is rendered only
when the gate is enabled. Covered by unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
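
The decision the guard webhook makes reduces to one comparison. The sketch below shows only that rule, with a hypothetical `validateScale` function and error text; the admission plumbing and the feature-gate check are left out.

```go
package main

import "fmt"

// validateScale mirrors the rule described above: a replicas decrease is
// rejected only when the pool's scaleDownPolicy is Explicit. Growth and
// no-op updates are always allowed.
func validateScale(policy string, oldReplicas, newReplicas int32) error {
	if policy == "Explicit" && newReplicas < oldReplicas {
		return fmt.Errorf(
			"scaling down from %d to %d is not allowed with scaleDownPolicy Explicit; use scaleDownWith to remove specific replicas",
			oldReplicas, newReplicas)
	}
	return nil
}
```
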
Add the VirtualMachinePool meta object and the VirtualMachinePoolScaleDownWith
body type (targets to remove) to the subresources.virtualization.deckhouse.io
API group, with generated deepcopy/conversion/openapi. This is the type
surface for the addressed scale-down handle; the aggregated-apiserver REST
storage and wiring follow.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Register the virtualmachinepools resource and its scaleDownWith
subresource in the existing aggregated apiserver (group
subresources.virtualization.deckhouse.io). The handler validates that
every target belongs to the pool, deletes them and atomically
decrements spec.replicas on the main resource — bypassing the /scale
guard, which is what lets Explicit pools shrink by address. The
meta-object itself is not served (Get returns NotFound).

Enterprise-only: the REST/storage live under //go:build EE and are
wired into the apiserver group through a build-tagged hook; the CE
build adds nothing. A write-capable client is threaded from the
apiserver config. Covered by unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
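
A rough sketch of the validation step of the scaleDownWith handler, under simplifying assumptions: members are passed as a set of names, the function only computes the new `spec.replicas`, and duplicate targets are rejected (a choice made for this example, not stated in the commit). The name `scaleDownWith` as a Go function and the error texts are hypothetical.

```go
package main

import "fmt"

// scaleDownWith validates the targets against the pool's members and
// returns the replicas value to write back. The actual deletion of the
// targets and the update of the pool are outside this sketch.
func scaleDownWith(replicas int32, members map[string]bool, targets []string) (int32, error) {
	if len(targets) == 0 {
		return replicas, fmt.Errorf("targets must not be empty")
	}
	seen := make(map[string]bool, len(targets))
	for _, name := range targets {
		if !members[name] {
			return replicas, fmt.Errorf("virtual machine %q does not belong to the pool", name)
		}
		if seen[name] {
			return replicas, fmt.Errorf("virtual machine %q is listed more than once", name)
		}
		seen[name] = true
	}
	next := replicas - int32(len(targets))
	if next < 0 {
		next = 0 // never write a negative replica count
	}
	return next, nil
}
```
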
Let the aggregated apiserver's service account get/update
VirtualMachinePool (the scaleDownWith handler decrements spec.replicas)
and reach the pool subresources. Grant the Editor cluster role
management of VirtualMachinePool, its scale subresource (kubectl scale /
HPA) and the scaleDownWith handle for addressed removal.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add the template-hash label (revision marker, not part of the member
selector) stamped on every created replica, and report the rollout in
status: desiredTemplateHash, updatedReplicas and the Synced condition
(True once all live replicas are on the current virtualMachineTemplate).
This makes the rollout observable at pool level. In-place patching of
existing replicas on a template change follows. Covered by unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add a template handler that patches each live replica's spec to the
current virtualMachineTemplate and marks it on the new revision once
applied. Re-patching is avoided with a patched-template-hash annotation
(not a spec diff, which the apiserver mutates by defaulting), and the
template-hash label is advanced only when the replica is not awaiting a
restart, so status.updatedReplicas / restartPendingReplicas and the
Synced condition (RolloutInProgress vs RestartPendingApproval) reflect
what has effectively landed. Hot/cold is decided by the VM layer.
Covered by unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
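
The re-patch avoidance described here can be sketched as a hash comparison against an annotation rather than a spec diff. The hashing scheme (SHA-256 over the JSON encoding, truncated to 16 hex characters), the annotation key and the function names below are assumptions for illustration; the PR does not state which hash or key it uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// patchedHashAnnotation is a placeholder key, not the one used by the PR.
const patchedHashAnnotation = "example.invalid/patched-template-hash"

// templateHash returns a short, deterministic revision marker for a template.
// encoding/json sorts map keys, so equal templates hash equally.
func templateHash(template any) (string, error) {
	raw, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])[:16], nil
}

// needsPatch reports whether the replica has not yet been patched to the
// desired revision. It compares hashes instead of specs, because the
// apiserver mutates the stored spec by defaulting.
func needsPatch(annotations map[string]string, desiredHash string) bool {
	return annotations[patchedHashAnnotation] != desiredHash
}

// mayAdvanceLabel reports whether the template-hash label can move to the
// desired revision: the patch was applied and no restart is pending.
func mayAdvanceLabel(annotations map[string]string, desiredHash string, awaitingRestart bool) bool {
	return !needsPatch(annotations, desiredHash) && !awaitingRestart
}
```
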
Replace time.Unix(1_700_000_000, 0) with
time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC) in the pool tests — same
deterministic clock, but self-explanatory.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Replace the inline dates with a single documented package-level
referenceTime var per test package, and drop the clock/when aliases. A
comment states the value is arbitrary — tests use only relative offsets
and never read the wall clock — so the real-world date is irrelevant.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add spec.virtualDiskTemplates: each entry describes a per-replica disk
with a reclaim policy — Delete (default; the disk belongs to its
VirtualMachine and is removed with it) or Retain (the disk belongs to
the pool, outlives the replica and is reused on scale-up), plus keep
(warm buffer) and ttl for Retain disks. This is the schema for reusable
disks; the reconcile behaviour (creation, reuse selection, GC) follows.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add an idempotent, self-healing disks handler: for every live member it
ensures each Delete-policy virtualDiskTemplate disk exists (owned by the
VirtualMachine, named <vm>-<template>, so it cascades away with the
replica) and is referenced in the member's blockDeviceRefs.

Also fix the template handler to merge block device refs when it patches
a member's spec, so per-replica disk refs the pool attached are not
wiped by a template change. Retain (reusable) disks come next. Covered
by unit tests, including that a template patch keeps disk refs.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Extend the disks handler to Retain-policy templates: a member reuses a
free pool-owned disk of the template (Ready and referenced by no live
member) or, if none is free, gets a newly created pool-owned disk
(named <pool>-<template>-<rand>) that outlives the replica. A per-pass
guard prevents handing the same free disk to two members in one
reconcile; the authoritative in-use signal is the members'
blockDeviceRefs, not the platform InUse condition. Covered by unit
tests (create, reuse-free, skip-busy).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
The disks handler now ages free Retain disks: it stamps a free-since
annotation when a disk leaves every member's blockDeviceRefs (the
authoritative free signal — the platform InUse condition is unreliable,
it flips on Stop) and clears it on reuse. Disks outside the warm buffer
(keep newest) and older than the ttl are deleted with a resourceVersion
precondition. free-since is persisted on the disk so the ttl survives
controller restarts (in-memory timing would reset every restart and
leak disks). Covered by unit tests.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
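
The aging rule can be expressed as a pure function over the free disks. This sketch assumes that "keep newest" means the most recently freed disks form the warm buffer, and it uses Unix seconds instead of parsed annotation timestamps; `freeDisk` and `expiredDisks` are hypothetical names.

```go
package main

import "sort"

// freeDisk is a Retain disk referenced by no live member.
type freeDisk struct {
	Name          string
	FreeSinceUnix int64 // value of the free-since annotation, in seconds
}

// expiredDisks returns the disks to delete: everything outside the warm
// buffer of `keep` most recently freed disks that has been free for
// longer than ttlSeconds.
func expiredDisks(free []freeDisk, keep int, ttlSeconds, nowUnix int64) []string {
	sorted := append([]freeDisk(nil), free...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FreeSinceUnix > sorted[j].FreeSinceUnix // newest first
	})
	var expired []string
	for i, d := range sorted {
		if i < keep {
			continue // inside the warm buffer, never aged out
		}
		if nowUnix-d.FreeSinceUnix > ttlSeconds {
			expired = append(expired, d.Name)
		}
	}
	return expired
}
```

Because the input is the persisted free-since value rather than an in-memory timer, the same call gives the same answer after a controller restart.
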
Add the fallback for reuse-disk collisions: if two live members reference
the same pool-owned disk (a cross-pass race after a controller restart),
detach it from all but the keeper (the member with BlockDevicesReady, or
the lexicographically smallest name) so the others get a fresh disk on
the next reconcile — the in-pass guard already prevents the common case.

Also add edge-case tests: a Stopped member is counted and neither
replaced nor duplicated (invariant 4); nil replicas mean zero; a
non-Ready free disk is not reused; free-since is cleared on reuse;
disks are not managed for a Terminating member.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
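
The keeper rule for a reuse-disk collision can be sketched as below. The types and the name `chooseKeeper` are hypothetical, and picking the lexicographically smallest member when several report BlockDevicesReady is an assumption of this example.

```go
package main

import "sort"

// holder is a live member that references the contested disk.
type holder struct {
	Name              string
	BlockDevicesReady bool
}

// chooseKeeper returns the member that keeps the disk and the members
// that must have it detached so they get a fresh disk on the next pass.
func chooseKeeper(holders []holder) (keeper string, detach []string) {
	if len(holders) == 0 {
		return "", nil
	}
	sorted := append([]holder(nil), holders...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	keeper = sorted[0].Name // fallback: lexicographically smallest name
	for _, h := range sorted {
		if h.BlockDevicesReady {
			keeper = h.Name // a member that already has the disk attached wins
			break
		}
	}
	for _, h := range sorted {
		if h.Name != keeper {
			detach = append(detach, h.Name)
		}
	}
	return keeper, detach
}
```
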
The virtualization-controller service account could not list/watch
VirtualMachinePool, so the pool controller failed to start its watch and
never reconciled. Add virtualmachinepools (+ status, + finalizers) to
the controller ClusterRole. Found by in-cluster testing.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
The virtualization-api binary was built without -tags $MODULE_EDITION,
so the EE-only aggregated-apiserver registration (compiled under
//go:build EE) was dropped and the virtualmachinepools/scaleDownWith
subresource returned 404. Build the apiserver with the edition tag like
the controller, so the enterprise subresource is served in EE builds.
Found by in-cluster testing.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Reuse-disk selection required Ready, so a freshly created disk (still
WaitForFirstConsumer / provisioning) was never considered free and a new
one was created on every reconcile until the first one bound, creating a
burst of surplus disks. Reuse any free pool-owned disk, preferring a
Ready one but otherwise attaching a still-provisioning one (attaching is
what makes a WaitForFirstConsumer disk bind), and create a new disk only
when none is free. Failed/terminating disks are skipped. Found by
in-cluster testing.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
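
The corrected selection order (prefer Ready, fall back to a still-provisioning disk, skip failed or terminating ones, never hand out the same disk twice in one pass) might look like the sketch below. The phase constants, the `disk` type and `pickReusable` are hypothetical simplifications of the real VirtualDisk status.

```go
package main

type diskPhase string

const (
	phaseReady        diskPhase = "Ready"
	phaseProvisioning diskPhase = "Provisioning"
	phaseFailed       diskPhase = "Failed"
	phaseTerminating  diskPhase = "Terminating"
)

// disk is a pool-owned Retain disk of one template.
type disk struct {
	Name  string
	Phase diskPhase
}

// pickReusable returns a free disk for a member, or false if a new disk
// must be created. inUse is derived from the members' blockDeviceRefs;
// claimed is the per-pass guard and must be a non-nil map, because the
// chosen disk is recorded in it.
func pickReusable(disks []disk, inUse, claimed map[string]bool) (string, bool) {
	fallback := ""
	for _, d := range disks {
		if inUse[d.Name] || claimed[d.Name] {
			continue
		}
		switch d.Phase {
		case phaseReady:
			claimed[d.Name] = true
			return d.Name, true
		case phaseProvisioning:
			if fallback == "" {
				fallback = d.Name // attach later only if no Ready disk is free
			}
		}
		// Failed and Terminating disks are never reused.
	}
	if fallback != "" {
		claimed[fallback] = true
		return fallback, true
	}
	return "", false
}
```
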
…data

The template metadata embedded metav1.ObjectMeta, which controller-gen
renders as an opaque object, so setting template.metadata.labels was
rejected by strict decoding. Use a curated metadata struct with labels
and annotations so the CRD schema exposes them. Found by in-cluster
testing.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Emit ReplicaSet-style events on the VirtualMachinePool so scaling is
visible in kubectl describe / kubectl get events: SuccessfulCreate /
FailedCreate on replica creation and SuccessfulDelete / FailedDelete on
removal. FailedCreate surfaces admission errors (e.g. an invalid
template) directly on the pool instead of only in controller logs.
Messages follow the user-facing text conventions (English, full
resource names, no internals).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Assert SuccessfulCreate is emitted per created replica, and that a
failed creation emits FailedCreate and un-does the expectation (via an
interceptor client that rejects Create) so the pool is not wedged.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Add tests for the HTTP Connect handler (rejects an empty targets list
with BadRequest; removes the target and reports success on a valid body)
and for scaleDown returning NotFound when the pool is absent.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
The pool has no minReplicas/maxUnavailable, so the Available condition
means all desired replicas are ready — rename its reasons
MinimumReplicasAvailable/Unavailable to AllReplicasReady/
InsufficientReadyReplicas. Broaden the Progressing reason Scaling to
ReplicasProgressing (it also covers replacing a lost replica) and make
the messages state the situation plainly.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Reference a virtualDiskTemplates disk in the member template's blockDeviceRefs
by its template name (a placeholder); the disks handler now resolves it in
place — Delete -> <vm>-<template>, Retain -> a reuse disk — instead of always
appending. This lets a pool express a per-replica writable root/boot disk with
the correct position in the boot order, exactly like an ordinary VirtualMachine.

Also sync the in-memory member after each attach so a member with two or more
disk templates no longer clobbers earlier refs (and their order) within one
reconcile pass. On template rollout the member's resolved refs are preserved
instead of re-copying the template placeholders, which would dangle and
duplicate.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
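
In-place placeholder resolution can be illustrated with plain strings. Real blockDeviceRefs carry a kind as well as a name; this sketch keeps only names, and `resolveRefs`, `diskTemplate` and the `reuse` callback are hypothetical. The naming `<vm>-<template>` for Delete-policy disks is taken from the commit messages.

```go
package main

type reclaimPolicy string

const (
	reclaimDelete reclaimPolicy = "Delete"
	reclaimRetain reclaimPolicy = "Retain"
)

type diskTemplate struct {
	Name    string
	Reclaim reclaimPolicy
}

// resolveRefs replaces each template-name placeholder with the concrete
// disk name at the same index, so the boot order is preserved. Refs that
// do not name a template (for example a shared image) are left untouched.
// reuse returns the Retain disk chosen for the given template.
func resolveRefs(vmName string, refs []string, templates []diskTemplate, reuse func(template string) string) []string {
	byName := make(map[string]diskTemplate, len(templates))
	for _, t := range templates {
		byName[t.Name] = t
	}
	resolved := make([]string, len(refs))
	for i, ref := range refs {
		t, isTemplate := byName[ref]
		switch {
		case !isTemplate:
			resolved[i] = ref
		case t.Reclaim == reclaimRetain:
			resolved[i] = reuse(t.Name)
		default:
			resolved[i] = vmName + "-" + t.Name
		}
	}
	return resolved
}
```
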
fl64 added 5 commits July 3, 2026 00:24
The pool example used a shared image as the boot device and defined only a
cache disk, so it had no per-replica writable root — misleading for the main
use cases (CI runners, VDI). Reference a per-replica root VirtualDisk by its
template name in boot order, and document how the underlying VM and disk names
are formed.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
VirtualMachinePool is a distinct new resource; add "vmpool" to the changelog
allowed_sections (both the PR check and the milestone aggregation) so its
changelog entries validate instead of failing with "unknown section".

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
Show how to autoscale a pool with an HPA on CPU (the pool publishes
status.selector, so metrics are read from the replicas directly), note custom
and external metrics / KEDA, and warn that scaleDownPolicy: Explicit lets an
autoscaler scale up only.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
The CRD had no doc-ru companion, failing the DMT linter and the doc-changes
validation ("translation file is missing"). Add crds/doc-ru-virtualmachinepools.yaml
covering the pool-specific fields (replicas, scaleDownPolicy, virtualMachineTemplate,
virtualDiskTemplates, reclaim, status).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
… tag

Paid-edition features in this module (VolumeMigration, USB, the hotplug ones)
are ordinary Apache-licensed code compiled in every edition and gated at runtime
by their feature gate; only tiny edition shims use //go:build EE. VirtualMachinePool
was the odd one out — its controller and apiserver code was //go:build EE, which
kept it out of the default (CE) unit-test run and forced the whole suite to build
with -tags EE, in turn breaking edition-default tests elsewhere (e.g. vmop's
locked-feature case).

Align it with the rest: drop the EE build tag (Apache headers), rely on the
existing VirtualMachinePool feature gate (locked off in CE) that SetupController
and the scale webhook already check, collapse the ee/ce setup and apiserver-install
shims into single files, and revert the test:unit and virtualization-api build
tweaks. No behaviour change in EE; the code is simply inert in CE.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
fl64 force-pushed the feat/vmpool/implementation branch from 9111ad6 to 03d7987 July 2, 2026 21:47
fl64 added 3 commits July 3, 2026 01:23
Two reconcile behaviours for changes to virtualDiskTemplates:

- Resize: when a template's requested size grows, every existing disk of that
  template is grown to match (increase only; storage cannot shrink).
- Removal: when a template is deleted from the spec (as opposed to a disk merely
  freed from a scaled-down replica, which is kept for reuse and aged out by ttl),
  its disks are removed — free ones straight away, attached ones after a
  hot-unplug. A disk that is a running replica's boot (first) device cannot be
  hot-unplugged, so it is left until the replica is recreated.

Also sync the in-memory member after detach (as done for attach) so removing
several disks from one member in a single pass does not clobber earlier edits,
and drop unused test-helper parameters the linter now sees (vmpool test code is
linted in CE since the build tag was removed).

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
…a root disk

A compact base e2e that shows the feature works end to end: a pool of two tiny
VMs (1 core / 5% coreFraction / 512Mi / alpine image) each gets its own
Delete-policy root disk, both reach Running, and scaling to three converges.
Skipped when the VirtualMachinePool feature gate is disabled. Heavier scenarios
(reuse cycle, rollout, resize, HPA) are left for follow-up.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
When a disk template was removed from the spec, prune could delete the disk even
if detaching it from a running replica had failed (a resourceVersion conflict on
the busy VM). The disk then went Terminating while the VM still referenced it, so
the VM hung on "waiting for block device ... is terminating".

Delete a removed-template disk only once no live member references it (all
detaches succeeded and it is not a boot device). Make detach conflict-safe by
re-reading the member and retrying, so a busy running VM no longer loses the race.
Found during in-cluster testing.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
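
The two parts of this fix can be sketched independently of the Kubernetes client: a delete precondition that checks the members' references, and a detach that retries on a conflict. `safeToDelete`, `detachWithRetry` and `errConflict` are hypothetical names; in a real controller the retry would more likely use `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`, with the closure re-reading the member before each attempt.

```go
package main

import "errors"

// errConflict stands in for a resourceVersion conflict from the apiserver.
var errConflict = errors.New("conflict")

// liveMember is a simplified view of a non-terminating replica.
type liveMember struct {
	Name     string
	DiskRefs []string
}

// safeToDelete reports whether no live member still references the disk.
// Only then may a removed-template disk be deleted.
func safeToDelete(diskName string, members []liveMember) bool {
	for _, m := range members {
		for _, ref := range m.DiskRefs {
			if ref == diskName {
				return false
			}
		}
	}
	return true
}

// detachWithRetry runs detach up to `attempts` times, retrying only on a
// conflict. The detach closure is expected to re-read the member itself.
func detachWithRetry(attempts int, detach func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = detach()
		if !errors.Is(err, errConflict) {
			return err // success or a non-retryable error
		}
	}
	return err
}
```
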