Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines #1111

Open
thiyyakat wants to merge 19 commits into gardener:rel-v0.62 from thiyyakat:preservation-hot-fix
@thiyyakat thiyyakat commented Jun 24, 2026

Member

What this PR does / why we need it:
The previous implementation had several correctness issues in the machine preservation flow:

  1. Unconditional uncordon on every reconcile: uncordonNodeIfCordoned was called whenever IsMachineActive returned true and a node name was present. This meant any Running machine with a cordoned backing node — even one cordoned for reasons unrelated to preservation (e.g. Cluster Autoscaler scale-down) — would be unconditionally uncordoned on every reconcile pass.
  2. No gate for non-preservation-bound machines: Every machine was processed through the full preservation logic on every reconcile, even machines with no preservation state at all (no annotations, no PreserveExpiryTime). This caused unnecessary API calls on non-preserved machines.
  3. Stale clone causing conflict errors: The LastAppliedNodePreserveValueAnnotation sync in the defer compared annotations on a clone that had already been updated via UpdateStatus calls inside preserveMachine/stopPreservationIfActive. Because UpdateStatus returns the server's stored metadata (not the in-memory mutation), the annotation diff was never detected and the sync was silently dropped. On paths where the sync did fire, it was using a stale ResourceVersion, producing spurious conflict errors.

This PR fixes all of the above by:

  • Introducing preserveStateInfo struct populated by getPreserveStateInfo at the start of each reconcile, reading node and machine annotation state in a single pass and storing it for the rest of the function. This eliminates the double lister read and decouples the annotation sync from the defer.
  • Adding an explicit isMachinePreservationBound gate: machines with no preservation state (no PreserveExpiryTime, no preserve annotation on machine or node, no LastAppliedNodePreserveValueAnnotation) skip the entire preservation flow and return LongRetry, nil immediately.
  • Moving the LastAppliedNodePreserveValueAnnotation sync to an explicit updatePreserveAnnotationOnMachine call at the end of the function via shouldAnnotationsBeUpdatedOnMachine, rather than a defer that compared potentially stale annotation maps.
  • Moving uncordon logic into stopPreservationIfActive (step 4), so it fires only when preservation is actively being stopped for a Running machine — not on every reconcile of any active machine. The manageMachinePreservation path retains a narrower uncordon call gated on !preservationStopped && PreserveExpiryTime != nil && Phase == Running to handle the case where preserveMachine is called on a Running machine (e.g. preserve=now set by user).
  • Adding IsNotFound handling in the defer so transient not-found errors during preservation (e.g. machine deleted mid-reconcile) return LongRetry rather than ShortRetry.
  • Propagating the updated node object from removePreservationRelatedAnnotationsOnNode (now returns *corev1.Node) so that the subsequent uncordon step in stopPreservationIfActive operates on a current node object rather than a potentially stale one.
  • Removing the redundant PreventAutoPreserveAnnotationValues set in machineutils (it was identical to AllowedPreserveAnnotationValues minus ""); manageAutoPreservationOfFailedMachines in machineset.go now uses AllowedPreserveAnnotationValues directly.
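
The preservation-bound gate described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: the field names are taken from the review threads, but the struct layout, the free-function form `isPreservationBound`, and the choice to treat a non-empty `lastAppliedNodeValue` as "annotation present" are assumptions.

```go
package main

import "fmt"

// preserveStateInfo mirrors the fields mentioned in this PR's review threads.
// The exact layout in the real controller may differ (assumption).
type preserveStateInfo struct {
	nodeAnnotated         bool
	nodeValue             string
	machineAnnotated      bool
	machineValue          string
	lastAppliedNodeValue  string
	preserveExpiryTimeSet bool
}

// isPreservationBound reports whether any preservation state exists at all.
// Machines for which this returns false would skip the preservation flow.
func isPreservationBound(info preserveStateInfo) bool {
	return info.preserveExpiryTimeSet ||
		info.nodeAnnotated ||
		info.machineAnnotated ||
		info.lastAppliedNodeValue != ""
}

func main() {
	plain := preserveStateInfo{}
	preserved := preserveStateInfo{machineAnnotated: true, machineValue: "now"}
	fmt.Println(isPreservationBound(plain))     // false
	fmt.Println(isPreservationBound(preserved)) // true
}
```

Computing the state once and passing it around keeps the gate a pure function, which also makes it easy to unit test.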

Which issue(s) this PR fixes:
Fixes #1110

Special notes for your reviewer:

Release note:

Fix preservation reconcile loop to avoid unconditionally uncordoning nodes unrelated to machine preservation, prevent spurious writes on non-preserved machines, and eliminate conflict errors caused by stale machine object comparison in the defer annotation sync.

@thiyyakat thiyyakat requested a review from a team as a code owner June 24, 2026 10:19
@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 24, 2026
@thiyyakat thiyyakat changed the title Preservation hot fix Fix preservation reconciliation to prevent deviation from regular flow for non-preservation-bound machines Jun 24, 2026
@thiyyakat thiyyakat added the kind/bug Bug label Jun 24, 2026
@gardener-prow gardener-prow Bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 24, 2026
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from 063f8db to f8d53be Compare June 24, 2026 10:23
Comment thread pkg/util/provider/machineutils/utils.go Outdated

Member

After changing the literal value, we should also rename the constant to PreserveMachineAnnotationValueAutoPreserved. This package will be exposed to consumers like DWD, so it is best to finalize semantic name changes now.

Member Author

Addressed in 200ce00

@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from f8d53be to 200ce00 Compare June 24, 2026 10:57
preservationBound := c.isMachinePreservationBound(preserveInfo)
if !preservationBound {
// we clear the error here to prevent preservation logic from interfering with non-preservation-bound machines
err = nil

Member

Why do you need to set err to nil here?
I see err gets populated before this only when calling c.getPreserveStateInfo(), and even then if err != nil, we return

Member Author

You're right. The earlier return defeats the purpose of this. I will probably need to buffer the error until we can determine whether or not the machine is preservation-bound. Thanks

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Addressed in cb105e7. The error from getPreserveStateInfo is now buffered in getErr and is only returned if machine is preservation-bound.
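
A rough sketch of the buffering idea, assuming a simplified stand-in for getPreserveStateInfo. The names `stateInfo`, `readState`, and `reconcile` are hypothetical and exist only for this illustration; only `getErr` is named in the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// stateInfo is a hypothetical, trimmed-down preservation state.
type stateInfo struct{ bound bool }

// readState stands in for getPreserveStateInfo: it may return usable
// (partial) state together with an error.
func readState(bound, fail bool) (stateInfo, error) {
	info := stateInfo{bound: bound}
	if fail {
		return info, errors.New("node lookup failed")
	}
	return info, nil
}

// reconcile buffers the read error in getErr and surfaces it only when the
// machine turns out to be preservation-bound.
func reconcile(bound, fail bool) error {
	info, getErr := readState(bound, fail)
	if !info.bound {
		// Not preservation-bound: the buffered error is dropped so the
		// regular flow is not disturbed.
		return nil
	}
	return getErr
}

func main() {
	fmt.Println(reconcile(false, true)) // <nil>
	fmt.Println(reconcile(true, true))  // node lookup failed
}
```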

Comment on lines +794 to +796
} else if err != nil {
return
}

Member

Similar here. err is populated only by c.getPreserveStateInfo(), and there we return if err != nil

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Addressed in cb105e7

nodeValue string
machineValue string
lastAppliedNodeValue string
PreserveExpiryTimeSet bool

Member

Need not be exported

Suggested change
- PreserveExpiryTimeSet bool
+ preserveExpiryTimeSet bool

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Addressed in cb105e7.

_, err := c.clearMachinePreserveExpiryTime(ctx, machine)
klog.Warningf("Node %q of machine %q not found. Proceeding to stop preservation on machine.", nodeName, machine.Name)
// Node not found, proceed to delete annotations and clear preserveExpiryTime on machine
machine, err = c.removePreserveAnnotationOnMachine(ctx, machine)

Member

Is there a reason this is not gated with removePreservationAnnotations? I see that every other invocation of removePreserveAnnotationOnMachine() within this function is gated.

Member Author

Addressed in cb105e7

// PreventAutoPreserveAnnotationValues contains the values to check if a machine is already annotated for preservation,
// in which case it should not be auto-preserved.
var PreventAutoPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValuePreservedByMCM, PreserveMachineAnnotationValueFalse)
var AllowedPreserveAnnotationValues = sets.New(PreserveMachineAnnotationValueNow, PreserveMachineAnnotationValueWhenFailed, PreserveMachineAnnotationValueAutoPreserved, PreserveMachineAnnotationValueFalse)

Member

Is "" removed intentionally?

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Yes! There is no use for preserve="". The user can either delete the annotation, or set it to false.

machineAnnotationValue: machineutils.PreserveMachineAnnotationValuePreservedByMCM,
},
}),
Entry("when node is annotated and preservation times out, should stop preservation", testCase{

Member

Why is this test removed? This looks like a valid case

Member Author

The test following this one covers this test case and also checks for annotation removal.

retry: machineutils.LongRetry,
},
}),
Entry("when invalid preserve annotation is added on node of un-preserved machine, should do nothing ", testCase{

Member

This looks like a valid case too

Member Author

The test following this one is a duplicate of this one.

return true
}

func (c *controller) getPreserveStateInfo(machine *v1alpha1.Machine) (*preserveStateInfo, error) {

Member

Please consider writing some unit tests for the new helper methods that you introduced

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Addressed in cb105e7

if err != nil {
if apierrors.IsConflict(err) {
if apierrors.IsNotFound(err) {
klog.Warningf("Error during preservation flow:%v", err.Error())

Member

nit

Suggested change
- klog.Warningf("Error during preservation flow:%v", err.Error())
+ klog.Warningf("Error during preservation flow: %v", err.Error())

@thiyyakat thiyyakat Jun 25, 2026

Member Author

Addressed in cb105e7

@gagan16k gagan16k left a comment

Member

Thanks for the fixes!

Comment thread pkg/util/provider/machinecontroller/machine_util.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/controller/deployment_machineset_util.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
if err != nil {
return err
}
func (c *controller) uncordonNodeIfCordoned(ctx context.Context, node *corev1.Node) error {

Member

Should there be a check for ClusterAutoscalerScaleDownDisabledAnnotationByMCMKey annotation before node operations, similar to how it is done in removePreservationRelatedAnnotationsOnNode()?

Member Author

Do you mean to gate the uncordoning? Do you feel the preservation-bound check and Running phase checks are insufficient?

Member

I think it might be worth adding, as the more we can limit CA interactions with this the better. However, could ask for a second opinion

Member Author

The two may not be related though. If we do what you suggest, then there is a chance that the user has set scale-down-disabled on the node in which case we would never uncordon the node even after it recovers from Failed to Running.

The check in removePreservationRelatedAnnotationsOnNode was done to ensure that we don't clear the scale-down-disabled annotation if it has been set by the user.

We limit CA interaction by:

  1. Bypassing preservation logic if the machine is not preservation-bound.
  2. Ensuring CA does not scale down a preserved node. We ensure preserved nodes have the scale-down-disabled annotation, and in CA's ForceDeleteNode we skip nodes that have the scale-down-disabled annotation.

@aaronfern do you think this is sufficient?
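
For reference, the narrower uncordon gate quoted in the PR description (`!preservationStopped && PreserveExpiryTime != nil && Phase == Running`) could be sketched as below. This is only an illustration under assumptions: `machineState`, `newMachineState`, and `shouldUncordon` are hypothetical names, and the phase is modelled as a plain string rather than the real API type.

```go
package main

import (
	"fmt"
	"time"
)

// machineState is a hypothetical, trimmed-down view of the machine status.
type machineState struct {
	phase              string
	preserveExpiryTime *time.Time
}

// newMachineState is a small helper for building example states.
func newMachineState(phase string, preserved bool) machineState {
	m := machineState{phase: phase}
	if preserved {
		expiry := time.Now().Add(time.Hour)
		m.preserveExpiryTime = &expiry
	}
	return m
}

// shouldUncordon mirrors the gate quoted in the PR description: uncordon only
// a Running machine that is still preserved and whose preservation was not
// just stopped in this reconcile.
func shouldUncordon(m machineState, preservationStopped bool) bool {
	return !preservationStopped && m.preserveExpiryTime != nil && m.phase == "Running"
}

func main() {
	fmt.Println(shouldUncordon(newMachineState("Running", true), false))  // true
	fmt.Println(shouldUncordon(newMachineState("Running", false), false)) // false
	fmt.Println(shouldUncordon(newMachineState("Failed", true), false))   // false
}
```

Note that this gate does not consult the scale-down-disabled annotation, which matches the argument above that a user-set annotation should not block uncordoning.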

@gardener-prow gardener-prow Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 25, 2026
@gardener-prow

gardener-prow Bot commented Jun 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from thiyyakat. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- nits
- if not found, machine annotation value must be enforced
- if machine annotation value is invalid, log and continue assuming it does not exist
- gate `removePreserveAnnotationOnMachine` when node is not found
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from 69b9eb6 to cb105e7 Compare June 25, 2026 13:08

@gagan16k gagan16k left a comment

Member

Some more comments, PTAL

Comment thread pkg/util/provider/machinecontroller/machine.go
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine.go
// preservation of the machine object.
effectivePreserveValue := getEffectivePreservationAnnotations(preserveInfo, getErr)

var removeAnnotations, preservationStopped bool

Member

Are these two variables required?
preservationStopped can be removed, as CurrentStatus.PreserveExpiryTime is removed by stopPreservationIfActive, and the check for the time is enough in L853 IMO, but with removeAnnotations it is not as obvious and the changes might be more involved.

Member Author

removeAnnotations serves 2 purposes:

  1. It indicates to stopPreservationIfActive whether the machine and node should be unannotated, in the cases where machine preservation expires or an auto-preserved machine has recovered to Running.
  2. If annotations have been removed by stopPreservationIfActive, updatePreserveAnnotationOnMachine should not run because it will again set LastAppliedNodePreserveValueAnnotation.

Member

removeAnnotations should stay: if it is removed, stopPreservationIfActive will need to actively check for this. preservationStopped can be removed (based on discussion).

Comment thread pkg/util/provider/machinecontroller/machine.go
@thiyyakat thiyyakat force-pushed the preservation-hot-fix branch from d2f0d8f to 91b9476 Compare June 26, 2026 03:17

@r4mek r4mek left a comment

Contributor

Thanks for the PR! I have posted a few comments, some of which relate to code cleanup that you can choose to do now or defer to a later PR.

machineValue string
lastAppliedNodeValue string
preserveExpiryTimeSet bool
}

Contributor

Since the "" annotation value is no longer valid, we can remove the nodeAnnotated and machineAnnotated booleans. An annotation value of "" implies that the annotation is not set.

Also, can we rename nodeValue and machineValue to nodeAnnotationValue and machineAnnotationValue, respectively?

Member Author

The only way to distinguish a node/machine that is annotated (invalidly) with "" from one where the annotation is unset (valid; it means the annotation was deleted) is to check nodeAnnotated/machineAnnotated.
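
In Go, this distinction falls out of the two-value map lookup. A minimal sketch, assuming a placeholder annotation key (`preserveKey` here is hypothetical and not the real annotation key used by MCM):

```go
package main

import "fmt"

// preserveKey is a placeholder key used only for this illustration.
const preserveKey = "example.com/preserve"

// readPreserve returns the annotation value and whether the key is present.
// An empty value with annotated == true means the annotation was set to "",
// while annotated == false means the annotation is absent.
func readPreserve(annotations map[string]string) (value string, annotated bool) {
	value, annotated = annotations[preserveKey]
	return value, annotated
}

func main() {
	_, unset := readPreserve(map[string]string{})
	_, empty := readPreserve(map[string]string{preserveKey: ""})
	fmt.Println(unset, empty) // false true
}
```

Both cases yield the same empty string, so comparing the value alone cannot tell them apart.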

Comment thread pkg/util/provider/machinecontroller/machine.go
}
klog.Warningf("Couldn't find node %q for machine %q", nodeName, machine.Name)
err = nil
}

Contributor

You can move this check right after isMachinePreservationBound() call.

Member Author

We want to check if a machine is preservation-bound at all before we return an error. This is being done so that we don't alter the flow for machines that are not preservation-bound. This is the current issue that the PR is trying to solve.

delete(clonedMachineAnnotations, machineutils.PreserveMachineAnnotationKey)
clonedMachineAnnotations[machineutils.LastAppliedNodePreserveValueAnnotationKey] = nodeAnnotationValue
return nodeAnnotationValue, clonedMachineAnnotations
if info.nodeValue == "" && info.lastAppliedNodeValue == "" {

@r4mek r4mek Jun 26, 2026

Contributor

Why do we have to also check for lastAppliedNodeValue?

@thiyyakat thiyyakat Jun 26, 2026

Member Author

Please check the comment above (sorry, it's long). It gives an example scenario of why lastAppliedNodeValue is needed. If MCM goes down, we won't be able to tell if node was never annotated or if node annotation was deleted.
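
The role of lastAppliedNodeValue can be illustrated with a small classification helper. This is a sketch under assumptions: `classify` and the `nodeChange` constants are hypothetical names, and the real code may combine these checks differently.

```go
package main

import "fmt"

// nodeChange describes what the current and last-applied node annotation
// values together say about the node's preserve annotation.
type nodeChange int

const (
	neverAnnotated    nodeChange = iota // no current value, nothing recorded
	annotationDeleted                   // no current value, but one was applied earlier
	annotationPresent                   // a current value exists
)

// classify uses the value recorded on the machine (lastApplied) to tell a
// deleted node annotation apart from one that was never set. Because the
// record lives on the machine object, it survives a controller restart.
func classify(nodeValue, lastApplied string) nodeChange {
	switch {
	case nodeValue != "":
		return annotationPresent
	case lastApplied != "":
		return annotationDeleted
	default:
		return neverAnnotated
	}
}

func main() {
	fmt.Println(classify("", "") == neverAnnotated)       // true
	fmt.Println(classify("", "now") == annotationDeleted) // true
	fmt.Println(classify("now", "") == annotationPresent) // true
}
```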

// removal of preserveExpiryTime is the last step of stopping preservation
// therefore, if preserveExpiryTime is not set, machine is not preserved
nodeName := machine.Labels[v1alpha1.NodeLabelKey]
if machine.Status.CurrentStatus.PreserveExpiryTime == nil {

Contributor

We have info.preserveExpiryTimeSet and we already check it in the parent function, so there is no need for this.

If we still want to check it again, shouldn't we clear the annotations before returning?

Member Author

info.preserveExpiryTimeSet is only being used to check if a machine is preservation-bound. We do not pass it to stopPreservationIfActive.

klog.V(2).Infof("Preservation of machine %q with no backing node has stopped.", machine.Name)
return true, nil
return machine, nil
}

Contributor

We can extract this block outside, so that we always remove the preserve annotation from the machine and clear the machine's expiry time. Later we can return if nodeName is empty.

Then we won't have to do the same operation again if node is NotFound.

Member Author

Will raise a separate PR for any refactoring needed. Since the purpose of this is to play it safe and quickly deliver a fix without introducing any new regressions, I would like to leave this as-is for now.

}
// Step 2: remove annotations from node
err = c.removePreservationRelatedAnnotationsOnNode(ctx, updatedNode, removePreservationAnnotations)
updatedNode, err = c.removePreservationRelatedAnnotationsOnNode(ctx, updatedNode, removePreservationAnnotations)

Contributor

Can we combine this and the condition update on node as a single operation?

Member Author

We cannot unfortunately. Annotations are updated using Update and conditions using UpdateStatus.

}

// manageMachinePreservation manages machine preservation based on the preserve annotation values on the node and machine objects.
func (c *controller) manageMachinePreservation(ctx context.Context, machine *v1alpha1.Machine) (retry machineutils.RetryPeriod, err error) {

Contributor

We can split this function so that one part handles the case where nodeName is set and the other handles the case where it is not.

@thiyyakat thiyyakat Jun 26, 2026

Member Author

Will raise a separate PR for refactoring. Since the purpose of this is to play it safe and quickly deliver a fix without introducing any new regressions, I would like to leave this as-is for now.

klog.Errorf("error draining preserved node %q for machine %q : %v", nodeName, machine.Name, drainErr)
return drainErr
return nil, drainErr
}

Contributor

I don't think we will reach this codepath as we are gonna skip the if needsUpdate { if we have drainErr. So this block can be shifted after c.drainPreservedNode() call.

Hence, we would be returning without an error even if drain was unsuccessful.

@thiyyakat thiyyakat Jun 26, 2026

Member Author

Why would we skip if needsUpdate{ if we have drainErr that is not nil? In computeNewNodePreservedCondition we set needsUpdate to true if it is the first time that drain is failing.

…encing happens in future edits. Add removed test.
Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/bug Bug size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants