
effective-creation-timeout & new metrics for use by dependency-watchdog#1104

Open
elankath wants to merge 15 commits into
gardener:masterfrom
elankath:timeout-metrics

Conversation

@elankath

@elankath elankath commented May 26, 2026

Member

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #1103
Fixes #1098

Special notes for your reviewer:

Release note:

Support the node.machine.sapcloud.io/effective-creation-timeout annotation on MachineDeployment to override the shoot-configured machineControllerManager.machineCreationTimeout, for use by dependency-watchdog.

Support new metrics: machine_create_duration_seconds, machine_initialize_duration_seconds, machine_join_duration_seconds, machine_drain_duration_seconds, machine_delete_duration_seconds

@elankath elankath self-assigned this May 26, 2026
@elankath elankath requested a review from a team as a code owner May 26, 2026 18:45
@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels May 26, 2026
@elankath elankath marked this pull request as draft May 26, 2026 18:45
@gardener-prow gardener-prow Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 26, 2026
@gardener-prow

gardener-prow Bot commented May 26, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from elankath. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2026
@elankath

Member Author

Test Log-1

Testing was done with the virtual provider. Real provider test logs will be attached to the PRs raised for the specific providers.

Effective Machine Creation Timeout

1. Check MCD

k get mcd -n shoot--i034796--aw                                              
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1            2         2                        4m5s
shoot--i034796--aw-weapon-z1           2         2                        4m5s

2. Add effective-creation-timeout annotation to MCD

k annotate mcd --overwrite shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw node.machine.sapcloud.io/effective-creation-timeout="2m"
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 annotated

3. Check annotation propagated to MachineSet

k get mcs -n shoot--i034796--aw
NAME                                 DESIRED   CURRENT   READY   AGE
shoot--i034796--aw-fruit-z1-79c97    2         2                 7m20s
shoot--i034796--aw-weapon-z1-c998d   2         2                 7m20s


k get mcs -n shoot--i034796--aw shoot--i034796--aw-weapon-z1-c998d -o custom-columns=NAME:.metadata.name,ANNOTATIONS:.metadata.annotations
NAME                                 ANNOTATIONS
shoot--i034796--aw-weapon-z1-c998d   map[bingo:lingo deployment.kubernetes.io/desired-replicas:2 deployment.kubernetes.io/max-replicas:3 deployment.kubernetes.io/revision:1 deployment.kubernetes.io/revision-history:1,1 greeting:howdy machine.sapcloud.io/last-deployment-replica-change-by-scaler-time:2026-05-28T14:13:17Z node.machine.sapcloud.io/effective-creation-timeout:2m] <-- Changed to 2m

4. Increase replicas & check annotation propagated to new Machine object

k scale mcd shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw --replicas=3  
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 scaled

k get mc -n shoot--i034796--aw                                               
NAME                                       STATUS   AGE   NODE
shoot--i034796--aw-fruit-z1-79c97-4s9ln             30m
shoot--i034796--aw-fruit-z1-79c97-kgztg             30m
shoot--i034796--aw-weapon-z1-c998d-ksnqm            30m
shoot--i034796--aw-weapon-z1-c998d-mntb7            30m
shoot--i034796--aw-weapon-z1-c998d-p8png            9s


k get mc -n shoot--i034796--aw shoot--i034796--aw-weapon-z1-c998d-p8png -o custom-columns=NAME:.metadata.name,ANNOTATIONS:.metadata.annotations
NAME                                       ANNOTATIONS
shoot--i034796--aw-weapon-z1-c998d-p8png   map[bingo:lingo deployment.kubernetes.io/desired-replicas:3 deployment.kubernetes.io/max-replicas:4 deployment.kubernetes.io/revision:1 deployment.kubernetes.io/revision-history:1,1 greeting:howdy machine.sapcloud.io/last-deployment-replica-change-by-scaler-time:2026-05-28T14:13:17Z node.machine.sapcloud.io/effective-creation-timeout:2m] <-- New Machine object has 2m
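
Once the annotation has reached the Machine object, the controller has to turn it into a timeout. The following is only a minimal sketch of that resolution step, not the PR's actual code: `resolveCreationTimeout` is a hypothetical helper, the 20m fallback is a placeholder for machineControllerManager.machineCreationTimeout, and falling back on an unparsable or non-positive value is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key as given in the release note of this PR.
const effectiveCreationTimeoutAnnotation = "node.machine.sapcloud.io/effective-creation-timeout"

// configuredTimeout stands in for machineControllerManager.machineCreationTimeout.
// The 20m value is an arbitrary placeholder, not taken from the PR.
const configuredTimeout = 20 * time.Minute

// resolveCreationTimeout (hypothetical) returns the annotation value when it
// parses as a positive duration, and the configured fallback otherwise.
func resolveCreationTimeout(annotations map[string]string, fallback time.Duration) time.Duration {
	raw, ok := annotations[effectiveCreationTimeoutAnnotation]
	if !ok {
		return fallback
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		// Assumption: invalid values are ignored rather than treated as errors.
		return fallback
	}
	return d
}

func main() {
	annotated := map[string]string{effectiveCreationTimeoutAnnotation: "2m"}
	fmt.Println(resolveCreationTimeout(annotated, configuredTimeout))           // 2m0s
	fmt.Println(resolveCreationTimeout(map[string]string{}, configuredTimeout)) // 20m0s
}
```
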

5. Case: Creation Success Metrics

The Machine should be created and the new Node should join the cluster, after which the init and create metrics are available.

k get no shoot--i034796--aw-weapon-z1-c998d-p8png                            
NAME                                       STATUS   ROLES    AGE   VERSION
shoot--i034796--aw-weapon-z1-c998d-p8png   Ready    <none>   13m

From local prometheus http://127.0.0.1:9090/prometheus
NOTE: Virtual provider randomizes create and join durations.


I0610 11:26:15.480417   13619 machine.go:193] reconcileClusterMachine: Start for "shoot--i034796--aw-weapon-z1-c998d-p8png" with phase:"Running", description:"Machine shoot--i034796--aw-weapon-z1-c998d-p8png successfully joined the cluster in 5m13.183556s"


|mcm_machine_machine_create_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-fruit-z1", namespace="shoot--i034796--aw"}|214|
|mcm_machine_machine_create_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw"}|307|
|mcm_machine_machine_join_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-fruit-z1", namespace="shoot--i034796--aw"}|214|
|mcm_machine_machine_join_duration_seconds{instance="localhost:10259", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw"}|313|

6. Case: Machine Join Failure

  • Before state: weapon MCD replica count = 1
  • Increased virtual provider InstanceDelays.JoinMin to 130s, which is above the effective-creation-timeout of 2m
  • Increased weapon MCD replica count to 2.
k get mcd -nshoot--i034796--aw                                               
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1    2       2         2            2           3h25m
shoot--i034796--aw-weapon-z1   1       1         1            1           3h25m


k scale mcd shoot--i034796--aw-weapon-z1 -n shoot--i034796--aw --replicas=2 
machinedeployment.machine.sapcloud.io/shoot--i034796--aw-weapon-z1 scaled

k get mcd -nshoot--i034796--aw; k get no -n shoot--i034796--aw              
NAME                           READY   DESIRED   UP-TO-DATE   AVAILABLE   AGE
shoot--i034796--aw-fruit-z1    2       2         2            2           3h27m
shoot--i034796--aw-weapon-z1   1       2         2            1           3h27m <-- replicas increased to 2
NAME                                       STATUS     ROLES    AGE    VERSION
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Ready      <none>   172m
shoot--i034796--aw-fruit-z1-79c97-kgztg    Ready      <none>   172m
shoot--i034796--aw-weapon-z1-c998d-4xk78   NotReady   <none>   12s <-- will NOT join within 2m
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Ready      <none>   46m

Machine object 4xk78 is terminated after the effective-creation-timeout of 2m elapses, and a new object xs276 is created. This behaviour keeps repeating. num_failed_join is incremented.

k get mc -nshoot--i034796--aw; k get no -n shoot--i034796--aw
NAME                                       STATUS        AGE     NODE
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Running       3h30m   shoot--i034796--aw-fruit-z1-79c97-4s9ln
shoot--i034796--aw-fruit-z1-79c97-kgztg    Running       3h30m   shoot--i034796--aw-fruit-z1-79c97-kgztg
shoot--i034796--aw-weapon-z1-c998d-4xk78   Terminating   2m31s   shoot--i034796--aw-weapon-z1-c998d-4xk78
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Running       48m     shoot--i034796--aw-weapon-z1-c998d-m2wz9
shoot--i034796--aw-weapon-z1-c998d-xs276   Pending       15s     shoot--i034796--aw-weapon-z1-c998d-xs276
NAME                                       STATUS                        ROLES    AGE     VERSION
shoot--i034796--aw-fruit-z1-79c97-4s9ln    Ready                         <none>   174m
shoot--i034796--aw-fruit-z1-79c97-kgztg    Ready                         <none>   174m
shoot--i034796--aw-weapon-z1-c998d-4xk78   NotReady,SchedulingDisabled   <none>   2m30s
shoot--i034796--aw-weapon-z1-c998d-m2wz9   Ready                         <none>   48m
shoot--i034796--aw-weapon-z1-c998d-xs276   NotReady                      <none>   12s

E0610 14:10:02.395176   26065 machine_util.go:1142] Machine shoot--i034796--aw-weapon-z1-c998d-4xk78 failed to join the cluster in 2m0s minutes.
I0610 14:10:02.395230   26065 metrics.go:240] incremented num_failed_join metric due to "shoot--i034796--aw-weapon-z1-c998d-4xk78" with labels instance_type=m5.xlarge,machine_deployment=shoot--i034796--aw-weapon-z1,namespace=shoot--i034796--aw,zone=eu-west-1c

From Prometheus

|mcm_machine_num_failed_join{instance="localhost:10259", instance_type="m5.xlarge", job="am_1", machine_deployment="shoot--i034796--aw-weapon-z1", namespace="shoot--i034796--aw", zone="eu-west-1c"}|1|
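
The failure case above comes down to comparing the time a machine has spent trying to join against the effective timeout, and bumping a counter when the timeout is exceeded. This is a rough sketch of that comparison only: `failedJoins` and `recordIfTimedOut` are hypothetical names, and the map is a toy stand-in for the real Prometheus counter.

```go
package main

import (
	"fmt"
	"time"
)

const (
	effectiveTimeout = 2 * time.Minute   // annotation value used in the test log
	joinDelay        = 130 * time.Second // InstanceDelays.JoinMin used in the test log
)

// failedJoins is a toy counter keyed by machine deployment name.
type failedJoins map[string]int

// recordIfTimedOut (hypothetical) increments the counter for the deployment
// and reports true when elapsed exceeds timeout.
func (f failedJoins) recordIfTimedOut(deployment string, elapsed, timeout time.Duration) bool {
	if elapsed <= timeout {
		return false
	}
	f[deployment]++
	return true
}

func main() {
	counts := failedJoins{}
	deployment := "shoot--i034796--aw-weapon-z1"
	fmt.Println(counts.recordIfTimedOut(deployment, joinDelay, effectiveTimeout)) // true
	fmt.Println(counts[deployment])                                               // 1
}
```
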

@elankath elankath added the kind/enhancement Enhancement, improvement, extension label Jun 10, 2026
@gardener-prow gardener-prow Bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 10, 2026
@elankath elankath marked this pull request as ready for review June 10, 2026 09:18
@gardener-prow gardener-prow Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026
@aaronfern

Member

/assign

Comment thread pkg/controller/controller_utils.go Outdated
func getMachinesAnnotationSet(template *v1alpha1.MachineTemplateSpec, parentObject metav1.Object) labels.Set {
desiredAnnotations := make(labels.Set)
maps.Copy(desiredAnnotations, template.Annotations)
maps.Copy(desiredAnnotations, parentObject.GetAnnotations())

Member

Will this result in all parent object annotations now being present on the machine object? If that is the case, there might be additional unnecessary annotations on the machine object.

Member Author

Hmm, I took a look at how MCD annotations are copied to MCS annotations. I assumed we copy everything there and therefore we should be symmetric when doing MCS->MC. But apparently in copyMachineDeploymentAnnotationsToMachineSet we have a skipCopyAnnotation which checks whether an annotation should be copied. I will reuse this.

@elankath elankath Jun 26, 2026

Member Author

Never mind - reuse involves a lot of refactoring because the code lives in different packages. For now, I will just copy the one annotation. Refactoring the annotation handling for DRY would require too many changes.
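
Copying a single annotation instead of the whole parent set could look roughly like the sketch below. It uses plain maps rather than the repository's types, and `copyAnnotation` is a hypothetical helper, not a function from this PR.

```go
package main

import "fmt"

// Annotation key as given in the release note of this PR.
const effectiveCreationTimeoutAnnotation = "node.machine.sapcloud.io/effective-creation-timeout"

// copyAnnotation (hypothetical) copies only the given key from src into dst,
// and leaves dst untouched when src does not carry that key.
func copyAnnotation(dst, src map[string]string, key string) {
	if v, ok := src[key]; ok {
		dst[key] = v
	}
}

func main() {
	machineSetAnnotations := map[string]string{
		"greeting":                         "howdy",
		effectiveCreationTimeoutAnnotation: "2m",
	}
	machineAnnotations := map[string]string{}
	copyAnnotation(machineAnnotations, machineSetAnnotations, effectiveCreationTimeoutAnnotation)
	// Only the timeout annotation is carried over; "greeting" is not.
	fmt.Println(machineAnnotations)
}
```
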

Comment thread pkg/controller/controller_utils.go Outdated

Member

This could overwrite the value of machine.Spec.MachineConfiguration.MachineCreationTimeout set by lines 558-566

@elankath elankath Jun 26, 2026

Member Author

Ugh, this line was moved upward but the commit was not pushed to the PR branch. Fixing.

machineName = createMachineRequest.Machine.Name
uninitializedMachine = false
addresses = sets.New[corev1.NodeAddress]()
createDuration time.Duration

Member

nit, but could you move this up to the Declarations block?

@aaronfern aaronfern removed their assignment Jun 23, 2026
Comment thread pkg/util/provider/machinecontroller/machine.go Outdated
Comment thread pkg/util/provider/machinecontroller/machine_util.go Outdated
Help: "Duration in seconds to delete a Machine of a MachineDeployment.",
}, []string{"namespace", "machine_deployment"})

MachineNumFailedJoin = prometheus.NewCounterVec(prometheus.CounterOpts{

Member

Please add a docstring for MachineNumFailedJoin

Comment thread pkg/util/provider/metrics/metrics.go Outdated
MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: namespace,
Subsystem: machineSubsystem,
Name: "machine_create_duration_seconds",

Member

The name here should be create_duration_seconds.

The machine prefix is already supplied by the Subsystem. This can be seen in the test logs you posted, where the metric is called mcm_machine_machine_create_duration_seconds.
Please update this for the other newly introduced metrics as well.
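
To illustrate the doubled prefix: the client library joins Namespace, Subsystem, and Name with underscores to form the exposed metric name. The sketch below uses a simplified stand-in, `fqName`, rather than the library itself, and the values "mcm" and "machine" are inferred from the metric name in the test log rather than read from the code.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName is a simplified stand-in for how the full metric name is assembled:
// the non-empty parts are joined with underscores.
func fqName(namespace, subsystem, name string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Name repeats the subsystem, so "machine" appears twice.
	fmt.Println(fqName("mcm", "machine", "machine_create_duration_seconds"))
	// Name without the redundant prefix.
	fmt.Println(fqName("mcm", "machine", "create_duration_seconds"))
}
```
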

Member Author

A refactor screwed this up when I moved the metric to MC from MCD metrics package. Fixed.


// MachineCreateDurationSeconds is the Prometheus gauge metric representing the time duration
// in seconds to create a Machine of a MachineDeployment.
MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{

Member

I wonder if histograms might be worth considering here instead of gauges.
The disadvantage I see with gauges is that values get lost when multiple machines from the same mcd are in creation simultaneously (same problem for the other metrics too).

I'm not sure how you anticipate dwd using these values, and if a few missed values can be tolerated then that's fine.
I also see that histograms have issues of their own with bucketing, and we won't get accurate values to make decisions.
wdyt?
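
The lost-value concern can be shown with a toy model. This sketch does not use the Prometheus client: `gauge` is a hypothetical stand-in for one labelled series, and the durations are illustrative. When two machines of the same deployment report between two scrapes, only the last write is visible.

```go
package main

import "fmt"

// gauge is a toy stand-in for one labelled series of a gauge vector.
type gauge struct{ value float64 }

// Set overwrites the current value, as a gauge does.
func (g *gauge) Set(v float64) { g.value = v }

// scrapeAfter applies the observations in order and returns what a single
// scrape taken afterwards would read.
func scrapeAfter(observations ...float64) float64 {
	g := &gauge{}
	for _, v := range observations {
		g.Set(v)
	}
	return g.value
}

func main() {
	// Two machines from the same MachineDeployment finish creating between
	// two scrapes; the first duration is overwritten before it is sampled.
	fmt.Println(scrapeAfter(214, 307)) // 307
}
```
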

@elankath elankath Jun 26, 2026

Member Author

In our last meeting, Madhav asked to use a Gauge here, so I just blindly followed the suggestion. But he was right: practically speaking, the Prometheus server remembers historical values because it scrapes the metric periodically and stores each sample. With PromQL we can do something like max_over_time(machine_create_duration_seconds[1h]). We are not really interested in a histogram since there is no benefit to remembering every provisioning duration. I am not sure, so let us discuss this. A histogram is also very expensive.

machineName = createMachineRequest.Machine.Name
uninitializedMachine = false
addresses = sets.New[corev1.NodeAddress]()
createDuration time.Duration

Member Author

I didn't get the comment. This is a zero value, not nil, and it is already in the var declarations block?

Member Author

OK, got it - moved it upward.

@gardener-prow gardener-prow Bot added cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Jun 26, 2026
elankath and others added 3 commits June 26, 2026 14:30
Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>
Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>

Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


2 participants