networking-calico: warn on stale Felix status updates #13024

Merged: nelljerram merged 2 commits into projectcalico:master from nelljerram:status-stale-update-warning on Jun 23, 2026

@nelljerram

Member

If the StatusWatcher is processing Felix uptime updates whose embedded "time" field is materially older than wall-clock now, this is evidence that we are running behind the rate of updates Felix is producing and a backlog is building up. Customers have hit this in production: Neutron ends up seeing agent up/down transitions hours after they actually happened, and the existing logs give no early warning while the backlog is growing.

Add a rate-limited WARNING in AgentStatusWatcher._on_status_set -- skipped during initial-snapshot replay where old timestamps are expected -- so operators can see the backlog building up long before it grows to hours.
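
The check described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the `StaleStatusChecker` class, the `check` method, the 60-second `STALE_STATUS_THRESHOLD_SECS` value, and the timestamp format are all assumptions made for illustration. Only `processing_snapshot`, `_last_stale_warn`, and the 300-second `STALE_STATUS_WARN_INTERVAL_SECS` appear elsewhere in this conversation.

```python
import logging
import time
from datetime import datetime, timezone

LOG = logging.getLogger(__name__)

STALE_STATUS_THRESHOLD_SECS = 60       # hypothetical: how old counts as "stale"
STALE_STATUS_WARN_INTERVAL_SECS = 300  # interval quoted in the follow-up commit
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"     # assumed format of the "time" field


class StaleStatusChecker:
    """Hypothetical stand-in for the relevant part of StatusWatcher."""

    def __init__(self):
        self.processing_snapshot = False
        self._last_stale_warn = float("-inf")  # "never warned" sentinel

    def check(self, hostname, status_time, now_wall=None, now_mono=None):
        """Return True if a WARNING was logged for this update."""
        # Old timestamps are expected while replaying the initial snapshot.
        if self.processing_snapshot:
            return False
        try:
            reported = datetime.strptime(status_time, TIME_FORMAT)
        except (TypeError, ValueError):
            return False  # unparseable timestamp: nothing to compare against
        reported = reported.replace(tzinfo=timezone.utc)
        if now_wall is None:
            now_wall = datetime.now(timezone.utc)
        if now_mono is None:
            now_mono = time.monotonic()
        lag = (now_wall - reported).total_seconds()
        if lag < STALE_STATUS_THRESHOLD_SECS:
            return False  # update is fresh enough
        if now_mono - self._last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS:
            return False  # already warned recently
        self._last_stale_warn = now_mono
        LOG.warning("Status update from %s is %.0fs old; the watcher may be "
                    "falling behind Felix", hostname, lag)
        return True
```

Staleness is measured against wall-clock time because the Felix timestamp is a wall-clock value, while the rate limit uses a monotonic clock so that clock adjustments cannot shorten or lengthen the quiet period.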

This is the operator-facing piece of the CI-1892 hardening work that still stands on its own merits after PR #12668 split the mech driver into per-process workers. The other defensive fixes from that branch (elector watchdog, etcd-confirmed mastership) no longer carry their weight: the periodic-resync loop they were targeted at is gone, and the existing time-based check in is_master() already catches stale-elector mastership for the continuous loops that remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram requested a review from a team as a code owner June 18, 2026 15:14
Copilot AI review requested due to automatic review settings June 18, 2026 15:14
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Jun 18, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 18, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 18, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented and removed docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact labels Jun 18, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 18, 2026

Copilot AI left a comment

Contributor

Pull request overview

Adds operator-facing visibility into backlog buildup in the OpenStack ML2 Calico driver by warning when Felix status updates being processed are materially older than wall-clock time, with rate limiting and snapshot-replay suppression.

Changes:

  • Add stale-status detection + rate-limited WARNING logging to StatusWatcher, invoked from AgentStatusWatcher._on_status_set.
  • Add focused unit tests covering fresh vs stale updates, snapshot replay suppression, rate limiting, and parse failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py | Introduces stale-status detection helper and logs rate-limited warnings when processing lags behind Felix updates. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_status.py | Adds unit tests for stale-status detection behavior and rate limiting. |

Comment on lines +95 to +97
# Monotonic time of the last stale-status WARNING we logged. Used to rate-limit
# the warning so we do not flood the log when the whole cluster is backlogged.
self._last_stale_warn = 0.0

Member Author

Fixed in 4dee301. Switched both production (self._last_stale_warn = float("-inf")) and the test setUp so the first warning is always permitted regardless of system uptime.

Comment on lines +36 to +38
self.watcher = status.StatusWatcher.__new__(status.StatusWatcher)
self.watcher._last_stale_warn = 0.0
self.watcher.processing_snapshot = False

Member Author

Fixed in the same commit (4dee301) -- test setUp now matches the production sentinel.
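
The setUp quoted above builds the watcher with `__new__` so that `__init__` never runs, then sets only the attributes the code under test reads. A minimal sketch of that pattern follows; the `Watcher` class and its `should_warn` method are hypothetical stand-ins, not the real `StatusWatcher`.

```python
import unittest

WARN_INTERVAL_SECS = 300  # matches the interval quoted in this PR


class Watcher:
    """Hypothetical class whose constructor is too heavy for unit tests."""

    def __init__(self):
        raise RuntimeError("would connect to an external service")

    def should_warn(self, now_mono):
        # Rate limit: allow at most one warning per interval.
        if now_mono - self._last_stale_warn < WARN_INTERVAL_SECS:
            return False
        self._last_stale_warn = now_mono
        return True


class TestWatcher(unittest.TestCase):
    def setUp(self):
        # Allocate the object without calling __init__, then set only the
        # attributes the method under test depends on.
        self.watcher = Watcher.__new__(Watcher)
        self.watcher._last_stale_warn = float("-inf")
        self.watcher.processing_snapshot = False

    def test_first_warning_allowed_soon_after_boot(self):
        self.assertTrue(self.watcher.should_warn(5.0))

    def test_second_warning_suppressed_within_interval(self):
        self.assertTrue(self.watcher.should_warn(5.0))
        self.assertFalse(self.watcher.should_warn(100.0))
        self.assertTrue(self.watcher.should_warn(305.0))
```

Keeping the test sentinel identical to the production one matters here: a test that starts from `0.0` exercises a different initial state from the one the running watcher has.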

monotonic_time() is seconds-since-boot on Linux, so a 0.0 sentinel would
suppress the first stale-update WARNING on any host whose uptime is below
STALE_STATUS_WARN_INTERVAL_SECS (300s): the rate-limit check

    now_mono - self._last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS

would see a freshly-booted host's monotonic time as falsely close to the
0.0 "never warned" sentinel and treat the first stale update as within
the rate-limit window.

Initialise to float("-inf") so the first warning always passes the check
regardless of system uptime, and update the test setUp to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
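
The effect of the two sentinels described in the commit message above can be checked in isolation. In this sketch, `is_rate_limited` is a hypothetical helper that wraps the comparison quoted in the message; it is not a function from the PR.

```python
STALE_STATUS_WARN_INTERVAL_SECS = 300


def is_rate_limited(now_mono, last_stale_warn):
    """Return True if a warning at now_mono would be suppressed."""
    return now_mono - last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS


# A host that booted two minutes ago: monotonic time is about 120 seconds.
uptime = 120.0

# With a 0.0 sentinel the first warning looks like a repeat and is dropped.
suppressed_with_zero = is_rate_limited(uptime, 0.0)

# With float("-inf") the difference is +inf, which is never below the interval.
suppressed_with_neg_inf = is_rate_limited(uptime, float("-inf"))
```

Here `suppressed_with_zero` is `True` and `suppressed_with_neg_inf` is `False`, which is the difference the fix relies on.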

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@nelljerram nelljerram merged commit 5c312a0 into projectcalico:master Jun 23, 2026
3 checks passed
@nelljerram nelljerram deleted the status-stale-update-warning branch June 23, 2026 09:37
Labels

cherry-pick-candidate docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact

4 participants