OCPBUGS-90541: Don't set Progressing=False until all pods available #3034

Open
mdbooth wants to merge 1 commit into openshift:master from mdbooth:fix-progressing-false-timing

Conversation

mdbooth commented Jun 20, 2026

We don't want the ClusterOperator to report Progressing=True whenever a Node reboots. However, we also don't want to report Progressing=False during a CNI rollout until all pods are available. This change ensures both are covered.

Related to openshift/origin#31320

Confirmed with a multi-PR test that this change fixes the origin test.

Summary by CodeRabbit

  • Bug Fixes

    • Improved status logic to better distinguish “rollout in progress” from reboot/churn for DaemonSets, StatefulSets, and Deployments, including more accurate Progressing and Degraded behavior when pods are temporarily unavailable.
    • Enhanced pod failure detection during these rollout phases by tracking and identifying CrashLoopBackOff conditions.
    • Corrected progress transitions when pods become unavailable without prior rollout tracking, treating the scenario as churn instead of rollout.
  • Tests

    • Added coverage for “awaiting pod availability,” including behavior during timeouts and persistence across StatusManager restarts.

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Jun 20, 2026
@openshift-ci-robot

Contributor

@mdbooth: This pull request references Jira Issue OCPBUGS-90541, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (core-networking-bot@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

We don't want the ClusterOperator to report Progressing=True whenever a Node reboots. However, we also don't want to report Progressing=False during a CNI rollout until all pods are available. This change ensures both are covered.

Related to openshift/origin#31320

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 20, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mdbooth
Once this PR has been reviewed and has the lgtm label, please assign jcaamano for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 20, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f7babe90-518a-4fd1-8503-2fbe117c65b6

📥 Commits

Reviewing files that changed from the base of the PR and between d343fbc and 6812a1e.

📒 Files selected for processing (2)
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go

Walkthrough

The PR modifies SetFromPods in pod_status.go to condition the rollout-in-progress path for DaemonSets, StatefulSets, and Deployments on (hadState || *RolloutActive) instead of only *RolloutActive, moves CrashLoopBackOff scanning into that branch, and adjusts progress messages based on hadState. Existing tests are updated to clear tracked state before simulating reboot churn, and five new multi-phase tests are added to validate awaiting-availability and hung-rollout timeout behavior.
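
The condition described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code in pod_status.go: the function name `classifyUnavailable`, its parameters, and the message strings are all hypothetical, and it models only the branch taken when some pods are unavailable.

```go
package main

import "fmt"

// classifyUnavailable is a hypothetical stand-in for the branch that runs
// when a DaemonSet, StatefulSet, or Deployment has unavailable pods.
// hadState means a rollout was already being tracked for the workload;
// rolloutActive means the workload controller is still replacing pods.
// The returned messages are invented for this sketch.
func classifyUnavailable(hadState, rolloutActive bool) (progressing bool, msg string) {
	if hadState || rolloutActive {
		if rolloutActive {
			return true, "rollout in progress"
		}
		// The rollout itself is done, but tracked state remains:
		// keep reporting Progressing until the pods are available.
		return true, "awaiting pod availability"
	}
	// No tracked rollout and none active: treat as reboot churn.
	return false, "pods unavailable, no rollout tracked"
}

func main() {
	cases := []struct {
		name          string
		hadState      bool
		rolloutActive bool
	}{
		{"active rollout", false, true},
		{"awaiting availability", true, false},
		{"node reboot churn", false, false},
	}
	for _, c := range cases {
		p, msg := classifyUnavailable(c.hadState, c.rolloutActive)
		fmt.Printf("%s: Progressing=%t (%s)\n", c.name, p, msg)
	}
}
```

The middle case is the one this PR targets: before the change, only `rolloutActive` was consulted, so a finished rollout whose pods were not yet available fell through to the churn path.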

Changes

hadState-aware rollout detection and tests

| Layer / File(s) | Summary |
|---|---|
| DaemonSet, StatefulSet, Deployment rollout-in-progress logic (pkg/controller/statusmanager/pod_status.go) | For all three workload types, the unavailability branch now enters the rollout-in-progress path when hadState \|\| *RolloutActive is true; progress messages vary by hadState; CrashLoopBackOff pod scanning is moved/added inside the rollout-in-progress branch for non-critical resources. |
| Test constant and existing test adjustments (pkg/controller/statusmanager/status_manager_test.go) | Test constant for release version is introduced; DaemonSet, Deployment, and StatefulSet rollout tests each now complete the rollout and call SetFromPods() to clear hadState before simulating reboot churn; associated assertion comments updated to reference hadState=false behavior. |
| New awaiting-availability and hung-rollout tests (pkg/controller/statusmanager/status_manager_test.go) | Adds TestDaemonSetRolloutWaitsForAvailability, TestDaemonSetRolloutWaitsForAvailabilityAcrossRestart, TestDeploymentRolloutWaitsForAvailability, TestStatefulSetRolloutWaitsForAvailability, and TestDaemonSetHungRolloutDuringAvailabilityWait covering multi-phase Progressing transitions, cross-restart persistence, and hung-rollout degradation. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| No-Sensitive-Data-In-Logs | ❌ Error | Logging statement uses %+v to expand Kubernetes API errors, which may expose tokens or credentials from failed Patch operations. | Replace %+v with %v or %s in the "Failed to set pod state" error log to avoid exposing full error structures containing sensitive API response data. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: preventing Progressing=False from being set until all pods are available, which directly matches the primary objective of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable, static, and deterministic. No dynamic information (pods, timestamps, UUIDs, nodes, namespaces, IPs) found in test titles. Names are descriptive and use fixed st... |
| Test Structure And Quality | ✅ Passed | All five new tests follow quality standards: (1) Single responsibility: each tests one behavior/lifecycle; (2) Setup/cleanup uses established helper patterns with fake client, no explicit cleanup ne... |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. The tests added are standard Go unit tests (testing.T) in status_manager_test.go, to which this MicroShift compatibility check does not apply. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests in PR. New tests are standard Go unit tests (testing.T) in pkg/controller/statusmanager/, not BDD-style Ginkgo tests. Custom check applies only to Ginkgo e2e tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The PR modifies status management logic in pod_status.go and adds tests; it does not add or modify deployment manifests, operator code, or controllers that define scheduling constraints (affinity,... |
| Ote Binary Stdout Contract | ✅ Passed | This PR modifies the Cluster Network Operator, a standard Kubernetes operator, not an OTE binary. The check is inapplicable to non-OTE projects. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added in this PR. The new tests are standard Go unit tests (using testing.T) in pkg/controller/statusmanager/status_manager_test.go, not Ginkgo e2e tests, so the IPv6/disco... |
| No-Weak-Crypto | ✅ Passed | The PR contains no weak cryptography usage (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), no custom crypto implementations, and no non-constant-time secret/token comparisons. The code uses standard eq... |
| Container-Privileges | ✅ Passed | PR introduces no new privileged container settings. Changes are limited to pod status reporting logic in Go source files with no container security context modifications. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17357 characters] ...

red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 172-188: The condition on line 172 requires ReadyReplicas > 0 to
treat a StatefulSet as having an active rollout, but when hadState is true
(indicating tracked state from a previous rollout) and ReadyReplicas is 0 while
UpdatedReplicas equals Replicas, the StatefulSet falls out of the
rollout-in-progress path prematurely, clearing tracked state before pods
recover. Modify the condition to also handle the zero-ready availability case by
checking if we have tracked state or an active rollout regardless of whether
ReadyReplicas is greater than zero, ensuring that pending pods waiting to become
ready are properly tracked as progressing rather than being reported as
non-progressing.

In `@pkg/controller/statusmanager/status_manager_test.go`:
- Around line 2388-2390: The test file hardcodes the release version annotation
as "v1.0.0" while the rollout logic reads the actual RELEASE_VERSION environment
variable. This mismatch causes test assertions to fail when the environment
variable differs from the hardcoded value. Replace all hardcoded "v1.0.0" values
in the Annotations map with the "release.openshift.io/version" key at all
locations (lines 2388-2390, 2562-2564, 2653-2655, 2766-2768, and 2882-2884) by
retrieving the actual RELEASE_VERSION environment variable value using os.Getenv
and using that value instead. This ensures the test annotations match what the
rollout logic actually reads from the environment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d9ebdf1-4112-4d54-a0f5-168c3667049c

📥 Commits

Reviewing files that changed from the base of the PR and between c376140 and d343fbc.

📒 Files selected for processing (2)
  • pkg/controller/statusmanager/pod_status.go
  • pkg/controller/statusmanager/status_manager_test.go

Comment thread pkg/controller/statusmanager/pod_status.go Outdated
Comment thread pkg/controller/statusmanager/status_manager_test.go
@mdbooth

mdbooth commented Jun 20, 2026

Author

/retest-required

@mdbooth

mdbooth commented Jun 20, 2026

Author

Pods are crashlooping due to:

 2026-06-20T18:21:33.522142681Z /usr/libexec/ipsec/addconn: /lib64/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /usr/libexec/ipsec/addconn)

Tracked in https://issues.redhat.com/browse/OCPBUGS-89238

We don't want the ClusterOperator to report Progressing=True whenever a
Node reboots. However, we also don't want to report Progressing=False
during a CNI rollout until all pods are available. This change ensures
both are covered.

mdbooth force-pushed the fix-progressing-false-timing branch from d343fbc to 6812a1e on June 20, 2026 20:44
@openshift-ci

openshift-ci Bot commented Jun 21, 2026

Contributor

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-rhcos10-techpreview | 6812a1e | link | false | /test e2e-aws-ovn-rhcos10-techpreview |
| ci/prow/hypershift-e2e-aks | 6812a1e | link | true | /test hypershift-e2e-aks |
| ci/prow/e2e-ovn-ipsec-step-registry | 6812a1e | link | true | /test e2e-ovn-ipsec-step-registry |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 6812a1e | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp | 6812a1e | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp |
| ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec | 6812a1e | link | true | /test e2e-metal-ipi-ovn-ipv6-ipsec |
| ci/prow/e2e-aws-ovn-windows | 6812a1e | link | true | /test e2e-aws-ovn-windows |
| ci/prow/e2e-aws-ovn-upgrade-ipsec | 6812a1e | link | true | /test e2e-aws-ovn-upgrade-ipsec |
| ci/prow/e2e-aws-ovn-upgrade | 6812a1e | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-fdp-qe | 6812a1e | link | true | /test e2e-aws-ovn-fdp-qe |
| ci/prow/5.0-upgrade-from-stable-4.22-e2e-azure-ovn-upgrade | 6812a1e | link | false | /test 5.0-upgrade-from-stable-4.22-e2e-azure-ovn-upgrade |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


2 participants