effective-creation-timeout & new metrics for use by dependency-watchdog#1104
effective-creation-timeout & new metrics for use by dependency-watchdog#1104elankath wants to merge 15 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Test Log-1Testing done with virtual provider. Real provider test logs will be attached to PR's raised for the specific providers. Effective Machine Creation Timeout1. Check MCD2. Add effective-creation-timeout annotation to MCD2. Check annotation propagated to MachineSet2. Increase replicas & check annotation propagated to new Machine object3. Case: Creation Success MetricsMachines should be created and new Node joins the cluster. From local prometheus http://127.0.0.1:9090/prometheus 4. Case: Machine Join Failure
Machine object From Prometheus |
|
/assign |
| func getMachinesAnnotationSet(template *v1alpha1.MachineTemplateSpec, parentObject metav1.Object) labels.Set { | ||
| desiredAnnotations := make(labels.Set) | ||
| maps.Copy(desiredAnnotations, template.Annotations) | ||
| maps.Copy(desiredAnnotations, parentObject.GetAnnotations()) |
There was a problem hiding this comment.
Will this result in all parent object annotations to now be present on the machine object? if that is the case then there might be additional unnecessary annotations present on the machine object
There was a problem hiding this comment.
Hmm.. I took a look at MCD annotations copied to MCS annotations. I assumed we copy everything there and therefore we should be symmetric when doing MCS->MC. but apparently in copyMachineDeploymentAnnotationsToMachineSetwe have a skipCopyAnnotation which checks whether annotation should be copied. I will reuse this.
There was a problem hiding this comment.
Never mind - re-use involves lot of refactoring due to different packages. For now, will just copy one annotation. Refactoring annotations for DRY makes too many changes.
There was a problem hiding this comment.
This could overwrite the value of machine.Spec.MachineConfiguration.MachineCreationTimeout set by lines 558-566
There was a problem hiding this comment.
Uggh..this line wasmoved upward but commit not pushed to the PR branch. Fixing.
| machineName = createMachineRequest.Machine.Name | ||
| uninitializedMachine = false | ||
| addresses = sets.New[corev1.NodeAddress]() | ||
| createDuration time.Duration |
There was a problem hiding this comment.
nit, but could you move this up to the Declarations block?
| Help: "Duration in seconds to delete a Machine of a MachineDeployment.", | ||
| }, []string{"namespace", "machine_deployment"}) | ||
|
|
||
| MachineNumFailedJoin = prometheus.NewCounterVec(prometheus.CounterOpts{ |
There was a problem hiding this comment.
Please add a docstring for MachineNumFailedJoin
| MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Namespace: namespace, | ||
| Subsystem: machineSubsystem, | ||
| Name: "machine_create_duration_seconds", |
There was a problem hiding this comment.
The name here should be create_duration_seconds.
The prefix machine is obtained from the Subsystem. This can be seen from the test logs you posted, where the metric there is called mcm_machine_machine_create_duration_seconds.
Please update this for the other newly introduced metrics as well
There was a problem hiding this comment.
A refactor screwed this up when I moved the metric to MC from MCD metrics package. Fixed.
|
|
||
| // MachineCreateDurationSeconds is the Prometheus gauge metric representing the time duration | ||
| // in seconds to create a Machine of a MachineDeployment. | ||
| MachineCreateDurationSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{ |
There was a problem hiding this comment.
I wonder if histograms might be a worth considering here instead of gauges.
The disadvantage I see here is of values getting lost when there are multiple machines from the same mcd simultaneously in creation (same problem for the other metrics too).
I'm not sure how you anticipate dwd using these values and if a few missed can be tolerated them that's fine.
I also see that histograms have issues of their own with bucketing and we won't get accurate values to make decisions.
wdyt?
There was a problem hiding this comment.
In the meeting we had last, madhav had asked to use Gauge here, so I just blindly followed the suggestion. But he was right - practically speaking the Prometheus server remembers historical values because it scrapes metric periodically and stores each sample. With promql we can do something like max_over_time(machine_create_duration_seconds[1h]). We are not really interested in a histogram since there is no benefit to remembering every provisioning duration. I am not sure, let us discuss this. Histogram is is also very expensive.
| machineName = createMachineRequest.Machine.Name | ||
| uninitializedMachine = false | ||
| addresses = sets.New[corev1.NodeAddress]() | ||
| createDuration time.Duration |
There was a problem hiding this comment.
Didn't get the comment. This is zero value not nil and it is already in var declarations block ?
There was a problem hiding this comment.
OK, got it - moved it upward.
Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>
Co-authored-by: Aaron Francis Fernandes <79958509+aaronfern@users.noreply.github.com>
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1103
Fixes #1098
Special notes for your reviewer:
Release note: