networking-calico: warn on stale Felix status updates #13024
Merged
nelljerram merged 2 commits on Jun 23, 2026
If the StatusWatcher is processing Felix uptime updates whose embedded "time" field is materially older than wall-clock now, this is evidence that we are running behind the rate of updates Felix is producing and a backlog is building up. Customers have hit this in production: Neutron ends up seeing agent up/down transitions hours after they actually happened, and the existing logs give no early warning while the backlog is growing.
Add a rate-limited WARNING in AgentStatusWatcher._on_status_set -- skipped during initial-snapshot replay where old timestamps are expected -- so operators can see the backlog building up long before it grows to hours.
This is the operator-facing piece of the CI-1892 hardening work that still stands on its own merits after PR projectcalico#12668 split the mech driver into per-process workers. The other defensive fixes from that branch (elector watchdog, etcd-confirmed mastership) no longer carry their weight: the periodic-resync loop they were targeted at is gone, and the existing time-based check in is_master() already catches stale-elector mastership for the continuous loops that remain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
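The check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class name `StaleStatusDetector`, the `check` method, the 60-second `STALE_STATUS_THRESHOLD_SECS` value, and the assumed timestamp format (`YYYY-MM-DDTHH:MM:SSZ` inside a JSON object) are all assumptions made for the example. Only `STALE_STATUS_WARN_INTERVAL_SECS` (300s), `_last_stale_warn` and `processing_snapshot` are names that appear in the PR discussion.

```python
import json
import logging
import time
from datetime import datetime, timezone

LOG = logging.getLogger(__name__)

# Hypothetical staleness threshold; the real value is not shown in this PR page.
STALE_STATUS_THRESHOLD_SECS = 60
# Rate-limit interval named in the PR discussion (300s).
STALE_STATUS_WARN_INTERVAL_SECS = 300


class StaleStatusDetector:
    """Illustrative stand-in for the stale-status logic in StatusWatcher."""

    def __init__(self):
        # True while the initial snapshot is replayed; old timestamps are expected then.
        self.processing_snapshot = False
        # -inf sentinel: the first warning always passes the rate-limit check.
        self._last_stale_warn = float("-inf")

    def check(self, value, now_wall=None, now_mono=None):
        """Return True if a stale-status WARNING was logged for this update."""
        if self.processing_snapshot:
            return False
        try:
            # Assumed format: JSON object with a UTC "time" field.
            reported = datetime.strptime(
                json.loads(value)["time"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
        except (ValueError, KeyError, TypeError):
            # Unparseable status: nothing to compare against, so stay quiet.
            return False
        now_wall = time.time() if now_wall is None else now_wall
        now_mono = time.monotonic() if now_mono is None else now_mono
        age = now_wall - reported.timestamp()
        if age < STALE_STATUS_THRESHOLD_SECS:
            return False
        if now_mono - self._last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS:
            return False
        self._last_stale_warn = now_mono
        LOG.warning("Felix status update is %.0fs old; backlog may be building", age)
        return True
```

Staleness is measured against wall-clock time because the Felix timestamp is a wall-clock value, while the rate limit uses a monotonic clock so that wall-clock adjustments cannot reopen or extend the warning window.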
Pull request overview
Adds operator-facing visibility into backlog buildup in the OpenStack ML2 Calico driver by warning when Felix status updates being processed are materially older than wall-clock time, with rate limiting and snapshot-replay suppression.
Changes:
- Add stale-status detection + rate-limited WARNING logging to `StatusWatcher`, invoked from `AgentStatusWatcher._on_status_set`.
- Add focused unit tests covering fresh vs stale updates, snapshot replay suppression, rate limiting, and parse failures.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py` | Introduces stale-status detection helper and logs rate-limited warnings when processing lags behind Felix updates. |
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_status.py` | Adds unit tests for stale-status detection behavior and rate limiting. |
Comment on lines +95 to +97

    # Monotonic time of the last stale-status WARNING we logged. Used to rate-limit
    # the warning so we do not flood the log when the whole cluster is backlogged.
    self._last_stale_warn = 0.0

Fixed in 4dee301. Switched both production (self._last_stale_warn = float("-inf")) and the test setUp so the first warning is always permitted regardless of system uptime.
Comment on lines +36 to +38

    self.watcher = status.StatusWatcher.__new__(status.StatusWatcher)
    self.watcher._last_stale_warn = 0.0
    self.watcher.processing_snapshot = False

Fixed in the same commit (4dee301) -- test setUp now matches the production sentinel.
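For readers unfamiliar with the `__new__`-based setUp quoted above, the pattern can be sketched like this. The `Watcher` class and its `may_warn` method are hypothetical stand-ins, not the real `StatusWatcher`; the point is only that `__new__` creates the instance without running `__init__`, so the test must set every attribute the code under test reads, including the `float("-inf")` sentinel.

```python
import unittest

STALE_STATUS_WARN_INTERVAL_SECS = 300


class Watcher:
    """Hypothetical stand-in for a watcher whose __init__ needs a live backend."""

    def __init__(self):
        # Stands in for real setup (e.g. connecting to a datastore) that a
        # unit test cannot perform.
        raise RuntimeError("needs a live backend")

    def may_warn(self, now_mono):
        # Hypothetical helper: suppressed during snapshot replay, then rate-limited.
        if self.processing_snapshot:
            return False
        return now_mono - self._last_stale_warn >= STALE_STATUS_WARN_INTERVAL_SECS


class WatcherSetupTest(unittest.TestCase):
    def setUp(self):
        # Bypass __init__ and set only the state the tests rely on.
        self.watcher = Watcher.__new__(Watcher)
        self.watcher._last_stale_warn = float("-inf")
        self.watcher.processing_snapshot = False

    def test_first_warning_permitted_soon_after_boot(self):
        self.assertTrue(self.watcher.may_warn(now_mono=1.0))

    def test_suppressed_during_snapshot(self):
        self.watcher.processing_snapshot = True
        self.assertFalse(self.watcher.may_warn(now_mono=1.0))
```

Because `__init__` never runs, a mismatch between the sentinel in setUp and the one in production can let tests pass against state that real instances never have, which is why the two are kept in step.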
monotonic_time() is seconds-since-boot on Linux, so a 0.0 sentinel would
suppress the first stale-update WARNING on any host whose uptime is below
STALE_STATUS_WARN_INTERVAL_SECS (300s): the rate-limit check
now_mono - self._last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS
would see a freshly-booted host's monotonic time as falsely close to the
0.0 "never warned" sentinel and treat the first stale update as within
the rate-limit window.
Initialise to float("-inf") so the first warning always passes the check
regardless of system uptime, and update the test setUp to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
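The arithmetic behind that sentinel change can be checked in isolation. The helper name `warning_permitted` is made up for this illustration; the comparison inside it is the rate-limit check quoted in the commit message.

```python
STALE_STATUS_WARN_INTERVAL_SECS = 300


def warning_permitted(now_mono, last_stale_warn):
    """Hypothetical helper: True when the rate-limit check lets a warning through."""
    rate_limited = now_mono - last_stale_warn < STALE_STATUS_WARN_INTERVAL_SECS
    return not rate_limited


# A host that booted two minutes ago: monotonic time is about 120 seconds.
uptime = 120.0

# With the old 0.0 sentinel, 120 - 0 = 120 < 300, so the first warning is suppressed.
old_sentinel_result = warning_permitted(uptime, 0.0)

# With float("-inf"), 120 - (-inf) = inf, which is never below 300,
# so the first warning always passes.
new_sentinel_result = warning_permitted(uptime, float("-inf"))
```

Here `old_sentinel_result` is `False` and `new_sentinel_result` is `True`, matching the behaviour the commit message describes.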
coutinhop approved these changes on Jun 22, 2026