effective-creation-timeout & new metrics for use by dependency-watchdog by elankath · Pull Request #1104 · gardener/machine-controller-manager

elankath · 2026-05-26T18:45:42Z

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #1103
Fixes #1098

Special notes for your reviewer:

Release note:

Support the node.machine.sapcloud.io/effective-creation-timeout annotation on a MachineDeployment to override the shoot-configured machineControllerManager.machineCreationTimeout, for use by dependency-watchdog

Support new metrics: machine_create_duration_seconds, machine_initialize_duration_seconds, machine_join_duration_seconds, machine_drain_duration_seconds, machine_delete_duration_seconds

gardener-prow · 2026-05-26T18:46:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from elankath. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

elankath · 2026-06-10T08:46:31Z

Test Log-1

Testing was done with the virtual provider. Real provider test logs will be attached to the PRs raised for the specific providers.

Effective Machine Creation Timeout

1. Check MCD

k get mcd -n shoot--i034796--aw                                              
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1            2         2                        4m5s
shoot--i034796--aw-weapon-z1           2         2                        4m5s

2. Add effective-creation-timeout annotation to MCD

k annotate mcd --overwrite shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw node.machine.sapcloud.io/effective-creation-timeout="2m"
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 annotated

3. Check annotation propagated to MachineSet

k get mcs -n shoot--i034796--aw
NAME                                 DESIRED   CURRENT   READY   AGE
shoot--i034796--aw-fruit-z1-79c97    2         2                 7m20s
shoot--i034796--aw-weapon-z1-c998d   2         2                 7m20s


k get mcs -n shoot--i034796--aw shoot--i034796--aw-weapon-z1-c998d -o custom-columns=NAME:.metadata.name,ANNOTATIONS:.metadata.annotations
NAME                                 ANNOTATIONS
shoot--i034796--aw-weapon-z1-c998d   map[bingo:lingo deployment.kubernetes.io/desired-replicas:2 deployment.kubernetes.io/max-replicas:3 deployment.kubernetes.io/revision:1 deployment.kubernetes.io/revision-history:1,1 greeting:howdy machine.sapcloud.io/last-deployment-replica-change-by-scaler-time:2026-05-28T14:13:17Z node.machine.sapcloud.io/effective-creation-timeout:2m] <-- Changed to 2m

4. Increase replicas & check annotation propagated to new Machine object

k scale mcd shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw --replicas=3  
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 scaled

k get mc -n shoot--i034796--aw                                               
NAME                                       STATUS   AGE   NODE
shoot--i034796--aw-fruit-z1-79c97-4s9ln             30m
shoot--i034796--aw-fruit-z1-79c97-kgztg             30m
shoot--i034796--aw-weapon-z1-c998d-ksnqm            30m
shoot--i034796--aw-weapon-z1-c998d-mntb7            30m
shoot--i034796--aw-weapon-z1-c998d-p8png            9s


k get mc -n shoot--i034796--aw shoot--i034796--aw-weapon-z1-c998d-p8png -o custom-columns=NAME:.metadata.name,ANNOTATIONS:.metadata.annotations
NAME                                       ANNOTATIONS
shoot--i034796--aw-weapon-z1-c998d-p8png   map[bingo:lingo deployment.kubernetes.io/desired-replicas:3 deployment.kubernetes.io/max-replicas:4 deployment.kubernetes.io/revision:1 deployment.kubernetes.io/revision-history:1,1 greeting:howdy machine.sapcloud.io/last-deployment-replica-change-by-scaler-time:2026-05-28T14:13:17Z node.machine.sapcloud.io/effective-creation-timeout:2m] <-- New Machine object has 2m

5. Case: Creation Success Metrics

The Machine should be created and its new Node should join the cluster. The init and create metrics are then available.

k get no shoot--i034796--aw-weapon-z1-c998d-p8png                            
NAME                                       STATUS   ROLES    AGE   VERSION
shoot--i034796--aw-weapon-z1-c998d-p8png   Ready    <none>   13m

From local prometheus http://127.0.0.1:9090/prometheus
NOTE: Virtual provider randomizes create and join durations.


I0610 11:26:15.480417   13619 machine.go:193] reconcileClusterMachine: Start for "shoot--i034796--aw-weapon-z1-c998d-p8png" with phase:"Running", description:"Machine shoot--i034796--aw-weapon-z1-c998d-p8png successfully joined the cluster in 5m13.183556s"


|mcm_machine_machine_create_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-fruit-z1", namespace="shoot--i034796--aw"}|214|
|mcm_machine_machine_create_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw"}|307|
|mcm_machine_machine_join_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-fruit-z1", namespace="shoot--i034796--aw"}|214|
|mcm_machine_machine_join_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw"}|313|

6. Case: Machine Join Failure

Before state: weapon MCD replica count = 1
Increased the virtual provider's InstanceDelays.JoinMin to 130s, which is above the effective-creation-timeout of 2m.

Increased weapon MCD replica count to 2.

k get mcd -nshoot--i034796--aw                                               
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1    2       2         2            2           3h25m
shoot--i034796--aw-weapon-z1   1       1         1            1           3h25m


k scale mcd shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw --replicas=2 
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 scaled

k get mcd -nshoot--i034796--aw; k get no -n shoot--i034796--aw              
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1    2       2         2            2           3h27m
shoot--i034796--aw-weapon-z1   1       2         2            1           3h27m <-- replicas increased to 2
NAME                                       STATUS     ROLES    AGE    VERSION
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Ready      <none>   172m
shoot--i034796--aw-fruit-z1-79c97-kgztg    Ready      <none>   172m
shoot--i034796--aw-weapon-z1-c998d-4xk78   NotReady   <none>   12s <-- will NOT join within 2m
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Ready      <none>   46m

Machine object 4xk78 is terminated once the effective-creation-timeout of 2m elapses, and a new object xs276 is created. This behaviour will keep repeating. num_failed_join is incremented.

k get mc -nshoot--i034796--aw; k get no -n shoot--i034796--aw
NAME                                       STATUS        AGE     NODE
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Running       3h30m   shoot--i034796--aw-fruit-z1-79c97-4s9ln
shoot--i034796--aw-fruit-z1-79c97-kgztg    Running       3h30m   shoot--i034796--aw-fruit-z1-79c97-kgztg
shoot--i034796--aw-weapon-z1-c998d-4xk78   Terminating   2m31s   shoot--i034796--aw-weapon-z1-c998d-4xk78
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Running       48m     shoot--i034796--aw-weapon-z1-c998d-m2wz9
shoot--i034796--aw-weapon-z1-c998d-xs276   Pending       15s     shoot--i034796--aw-weapon-z1-c998d-xs276
NAME                                       STATUS                        ROLES    AGE     VERSION
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Ready                         <none>   174m
shoot--i034796--aw-fruit-z1-79c97-kgztg    Ready                         <none>   174m
shoot--i034796--aw-weapon-z1-c998d-4xk78   NotReady,SchedulingDisabled   <none>   2m30s
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Ready                         <none>   48m
shoot--i034796--aw-weapon-z1-c998d-xs276   NotReady                      <none>   12s

E0610 14:10:02.395176   26065 machine_util.go:1142] Machine shoot--i034796--aw-weapon-z1-c998d-4xk78 failed to join the cluster in 2m0s minutes.
I0610 14:10:02.395230   26065 metrics.go:240] incremented num_failed_join metric due to "shoot--i034796--aw-weapon-z1-c998d-4xk78" with labels instance_type=m5.xlarge,machine_deployment=shoot--i034796--aw-weapon-z1,namespace=shoot--i034796--aw,zone=eu-west-1c

From Prometheus

|mcm_machine_num_failed_join{instance="localhost:10259", instance_type="m5.xlarge", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw", zone="eu-west-1c"}|1|

aaronfern · 2026-06-23T07:51:23Z

/assign

aaronfern · 2026-06-15T11:38:45Z

+func getMachinesAnnotationSet(template *v1alpha1.MachineTemplateSpec, parentObject metav1.Object) labels.Set {
 	desiredAnnotations := make(labels.Set)
 	maps.Copy(desiredAnnotations, template.Annotations)
+	maps.Copy(desiredAnnotations, parentObject.GetAnnotations())


Will this result in all parent object annotations now being present on the machine object? If that is the case, then there might be additional unnecessary annotations present on the machine object.

Hmm, I took a look at how MCD annotations are copied to MCS annotations. I assumed we copy everything there and therefore should be symmetric when doing MCS->MC, but apparently in copyMachineDeploymentAnnotationsToMachineSet we have a skipCopyAnnotation which checks whether an annotation should be copied. I will reuse this.

Never mind: reuse involves a lot of refactoring due to the different packages. For now, I will just copy the one annotation. Refactoring annotations for DRY requires too many changes.
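A minimal sketch of the "copy just one annotation" approach described above, using plain string maps instead of the real `v1alpha1.MachineTemplateSpec` and `metav1.Object` types. The function name `machineAnnotations` is hypothetical; only the annotation key comes from the PR description.

```go
package main

import (
	"fmt"
	"maps"
)

const effectiveCreationTimeoutKey = "node.machine.sapcloud.io/effective-creation-timeout"

// machineAnnotations is a hypothetical helper: it starts from the template's
// annotations and copies only the effective-creation-timeout key from the
// parent, so unrelated parent annotations do not leak onto the Machine.
func machineAnnotations(template, parent map[string]string) map[string]string {
	desired := make(map[string]string, len(template)+1)
	maps.Copy(desired, template)
	if v, ok := parent[effectiveCreationTimeoutKey]; ok {
		desired[effectiveCreationTimeoutKey] = v
	}
	return desired
}

func main() {
	template := map[string]string{"greeting": "howdy"}
	parent := map[string]string{
		effectiveCreationTimeoutKey:           "2m",
		"deployment.kubernetes.io/revision": "1", // must not be copied
	}
	got := machineAnnotations(template, parent)
	fmt.Println(len(got), got[effectiveCreationTimeoutKey])
}
```

Compared with `maps.Copy(desiredAnnotations, parentObject.GetAnnotations())`, this keeps the Machine's annotation set limited to the template plus the single key that the machine controller needs.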

aaronfern · 2026-06-23T08:29:50Z

This could overwrite the value of machine.Spec.MachineConfiguration.MachineCreationTimeout set by lines 558-566

Ugh, this line was moved upward, but the commit was not pushed to the PR branch. Fixing.

aaronfern · 2026-06-23T08:41:01Z

 		machineName          = createMachineRequest.Machine.Name
 		uninitializedMachine = false
 		addresses            = sets.New[corev1.NodeAddress]()
+		createDuration       time.Duration


nit, but could you move this up to the Declarations block?

aaronfern · 2026-06-23T10:31:30Z

+		Help:      "Duration in seconds to delete a Machine of a MachineDeployment.",
+	}, []string{"namespace", "machine_deployment"})
+
+	MachineNumFailedJoin = prometheus.NewCounterVec(prometheus.CounterOpts{


Please add a docstring for MachineNumFailedJoin

aaronfern · 2026-06-23T10:37:11Z

+	MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Namespace: namespace,
+		Subsystem: machineSubsystem,
+		Name:      "machine_create_duration_seconds",


The name here should be create_duration_seconds.

The machine prefix is obtained from the Subsystem. This can be seen in the test logs you posted, where the metric is called mcm_machine_machine_create_duration_seconds.
Please update this for the other newly introduced metrics as well.

A refactor screwed this up when I moved the metric to MC from MCD metrics package. Fixed.
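To illustrate the naming issue, here is a small sketch of how a fully qualified metric name is assembled from Namespace, Subsystem and Name. `fqName` is a local stand-in written to mirror the underscore-joining behaviour of the Prometheus Go client; the values "mcm" and "machine" are inferred from the metric names in the test log rather than read from the source.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts with underscores. An empty name yields an
// empty result, since a metric cannot consist of prefixes alone.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Name repeats the subsystem, producing the doubled prefix from the test log.
	fmt.Println(fqName("mcm", "machine", "machine_create_duration_seconds"))
	// Name without the prefix, as requested in the review.
	fmt.Println(fqName("mcm", "machine", "create_duration_seconds"))
}
```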

aaronfern · 2026-06-23T10:53:20Z

+
+	// MachineCreateDurationSeconds is the Prometheus gauge metric representing the time duration
+	// in seconds to create a Machine of a MachineDeployment.
+	MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{


I wonder if histograms might be worth considering here instead of gauges.
The disadvantage I see with gauges is that values get lost when multiple machines from the same MCD are in creation simultaneously (the same problem applies to the other metrics too).

I'm not sure how you anticipate dwd using these values; if a few missed values can be tolerated, then that's fine.
I also see that histograms have issues of their own with bucketing, so we won't get accurate values to make decisions.
wdyt?

In our last meeting, madhav asked to use a Gauge here, so I just followed the suggestion. But he was right: practically speaking, the Prometheus server remembers historical values because it scrapes the metric periodically and stores each sample. With PromQL we can do something like max_over_time(machine_create_duration_seconds[1h]). We are not really interested in a histogram, since there is no benefit to remembering every provisioning duration. I am not sure, so let us discuss this. Histograms are also very expensive.
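The trade-off discussed here can be sketched with a toy simulation. A gauge keeps only its most recent write, so a scrape sees whatever value was set last, and `max_over_time` can only look at the scraped samples. The types and function names below are illustrative, and 120 is a made-up value added next to the 214 and 313 from the test log.

```go
package main

import "fmt"

// event is either a gauge write (scrape == false) or a scrape (scrape == true).
type event struct {
	scrape bool
	value  float64
}

// scrapeSamples replays the events against a single gauge and returns the
// value each scrape observed.
func scrapeSamples(events []event) []float64 {
	var gauge float64
	var samples []float64
	for _, e := range events {
		if e.scrape {
			samples = append(samples, gauge)
			continue
		}
		gauge = e.value // a gauge keeps only the most recent write
	}
	return samples
}

// maxOverTime is a stand-in for PromQL's max_over_time over stored samples.
func maxOverTime(samples []float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	best := samples[0]
	for _, s := range samples[1:] {
		if s > best {
			best = s
		}
	}
	return best, true
}

func main() {
	// Two machines finish between the first and second scrape: 313 is overwritten by 120.
	samples := scrapeSamples([]event{
		{value: 214}, {scrape: true},
		{value: 313}, {value: 120}, {scrape: true},
	})
	fmt.Println(samples) // [214 120]
	m, _ := maxOverTime(samples)
	fmt.Println(m) // 214
}
```

In this run the largest duration (313) never reaches the stored samples, which is the loss raised in the review; whether that is acceptable depends on how dependency-watchdog consumes the values.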

elankath · 2026-06-26T06:22:38Z

 		machineName          = createMachineRequest.Machine.Name
 		uninitializedMachine = false
 		addresses            = sets.New[corev1.NodeAddress]()
+		createDuration       time.Duration


I didn't get the comment. This is a zero value, not nil, and it is already in the var declarations block?

OK, got it - moved it upward.

Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>

elankath added 2 commits May 26, 2026 22:20

WIP: metrics & effective creation timeout

efbc070

support for node.machine.sapcloud.io/effective-creation-duration

494e78b

elankath self-assigned this May 26, 2026

elankath requested a review from a team as a code owner May 26, 2026 18:45

gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels May 26, 2026

elankath marked this pull request as draft May 26, 2026 18:45

gardener-prow Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 26, 2026

gardener-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2026

elankath added 3 commits May 28, 2026 14:43

removed max metrics - record all durations

d6d57f5

init,create,join,drain,delete Machine metrics

4fcf8d2

added mcm_machine_num_failed_join

b5420f3

added godoc for IncrementNumFailedToJoin

c645b19

elankath added the kind/enhancement Enhancement, improvement, extension label Jun 10, 2026

gardener-prow Bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 10, 2026

elankath marked this pull request as ready for review June 10, 2026 09:18

gardener-prow Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026

added AnnotationKeyMachineJoinDuration

3a5fe51

gardener-prow Bot assigned aaronfern Jun 23, 2026

aaronfern reviewed Jun 23, 2026

aaronfern removed their assignment Jun 23, 2026

aaronfern reviewed Jun 23, 2026

elankath commented Jun 26, 2026

elankath added 3 commits June 26, 2026 12:04

machine.Spec moved upwards

535bcd2

added godoc for MachineNumFailedJoin

dfd5870

added godoc and fixed metric names

0efcf8a

elankath added 2 commits June 26, 2026 12:24

consistent var name effectiveCreationTimeout

e3ce5cb

better log for overridden machine creation timeout

97b3a51

elankath and others added 3 commits June 26, 2026 14:30

addressed review comments

3284907

Update pkg/util/provider/machinecontroller/machine.go

0d8115d

Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>

Update pkg/util/provider/machinecontroller/machine_util.go

fd926a9

Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>
