networking-calico: defensive fixes for silent elector death and status backlog by nelljerram · Pull Request #12456 · projectcalico/calico

nelljerram · 2026-04-14T10:13:55Z

For CORE-12651 and CI-1892

Three fixes for the issues investigated under CI-1892. None of them changes the election wire protocol or the existing behaviour in the common case; they add extra checks that catch known failure modes.

Elector.healthy()

self._master is a local Python flag that only gets cleared if the election greenlet exits normally via _attempt_step_down() or the refresh-loop finally clause. If the greenlet dies silently - e.g. due to an eventlet-level issue that drops the frame without unwinding Python exceptions - _master stays True indefinitely and the rest of the driver keeps running master-only work (periodic resync, compaction, StatusWatcher) on a node whose etcd lease has long since expired, causing split-brain.

Elector now stamps self._last_refresh on each successful lease refresh, and healthy() returns True only if the greenlet is still alive AND the lease has been refreshed within the last TTL. mech_calico.py callsites that decide whether to do master-only work switch from master() to healthy().
Elector.confirmed_master()

Before each iteration of periodic_resync_thread, also re-read the election key from etcd and confirm the value matches our id_string. If etcd disagrees, clear _master locally and skip the resync cycle. This defends against any residual disagreement between our in-process flag and etcd's ground truth, at the cost of one extra etcd GET per resync tick (negligible vs. the resync itself).
StatusWatcher stale-update WARNING

When an incoming Felix status update has a "time" field more than 5 mins in the past, log a rate-limited WARNING so operators see the backlog building up long before it grows to hours. Skipped during initial-snapshot replay to avoid false positives. Complements the ReportingIntervalSecs / agent_down_time config tuning that was recommended to the customer.

Also promote the surviving LOG.debug calls in election.py to LOG.info so election state transitions and refresh activity are visible in production logs without toggling debug logging.

Unit tests added for healthy(), confirmed_master() and the stale status check.

Release note:

Bugfix: Make OpenStack driver periodic processing more resilient to an eventlet bug that can result in silent greenlet death.

Copilot

Pull request overview

This PR hardens the OpenStack networking-calico ML2 driver against silent leader-election failures and status-processing backlogs by adding stronger “am I really master?” checks and surfacing delayed Felix status updates via warnings.

Changes:

Add Elector.healthy() (lease-refresh freshness + greenlet liveness) and switch master-only loops to use it.
Add Elector.confirmed_master() (extra etcd GET verification) and use it for periodic resync gating.
Add stale Felix status detection with rate-limited warnings, plus unit tests for the new behaviors.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py	Adds stale-status warning logic (with rate limiting) for delayed Felix status updates.
networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py	Adds `healthy()` / `confirmed_master()` and increases election logging verbosity.
networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py	Switches master-only work gating from `master()` to `healthy()` / `confirmed_master()`.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_status.py	New focused unit tests for stale-status handling.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py	Adds unit tests for `healthy()` and `confirmed_master()`.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py	Updates mocks to account for new elector methods used by threads.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_monitor_thread.py	Updates mocks/tests for new elector gating methods.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py	Extends elector stub to implement `healthy()` / `confirmed_master()`.

Copilot · 2026-04-14T10:19:39Z

+    def test_fresh_update_does_not_warn(self):
+        fresh = self._fmt(datetime.now(tz=timezone.utc))
+        with mock.patch.object(status.LOG, "warning") as m_warn:
+            self.watcher._check_for_stale_status("host1", {"time": fresh})
+        m_warn.assert_not_called()
+        self.assertEqual(0.0, self.watcher._last_stale_warn)
+
+    def test_stale_update_warns(self):
+        stale = self._fmt(datetime.now(tz=timezone.utc) - timedelta(hours=1))
+        with mock.patch.object(status.LOG, "warning") as m_warn:
+            self.watcher._check_for_stale_status("host1", {"time": stale})
+        m_warn.assert_called_once()


The new stale-status tests only cover RFC3339 timestamps with a trailing 'Z'. The existing etcd test harness uses a timezone-less value like "2015-08-14T10:37:54"; that currently causes a crash in _check_for_stale_status due to naive/aware datetime subtraction. Please add a unit test case for a timezone-less timestamp so this regression is caught.

Copilot · 2026-04-14T10:19:39Z

+                status_time_str,
+                hostname,
+            )
+            return


_check_for_stale_status() can raise a TypeError when the status "time" parses as a naive datetime (no timezone info). datetime.fromisoformat('2015-08-14T10:37:54') returns a naive datetime, and subtracting it from datetime.now(tz=UTC) will crash the watcher thread. This is already a format used in the existing test harness (see test_plugin_etcd.py), so this needs to be handled (e.g., treat naive timestamps as UTC, or catch TypeError and skip/log).

Suggested change

return

return

if status_time.tzinfo is None:

# Older/test inputs may omit timezone information. Treat those

# naive timestamps as UTC so arithmetic with an aware UTC "now"

# does not raise TypeError.

status_time = status_time.replace(tzinfo=timezone.utc)

…s backlog For CORE-12651 and CI-1892 Three fixes for the issues investigated under CI-1892. None of them changes the election wire protocol or the existing behaviour in the common case; they add extra checks that catch known failure modes. 1. Elector.healthy() self._master is a local Python flag that only gets cleared if the election greenlet exits normally via _attempt_step_down() or the refresh-loop finally clause. If the greenlet dies silently - e.g. due to an eventlet-level issue that drops the frame without unwinding Python exceptions - _master stays True indefinitely and the rest of the driver keeps running master-only work (periodic resync, compaction, StatusWatcher) on a node whose etcd lease has long since expired, causing split-brain. Elector now stamps self._last_refresh on each successful lease refresh, and healthy() returns True only if the greenlet is still alive AND the lease has been refreshed within the last TTL. mech_calico.py callsites that decide whether to do master-only work switch from master() to healthy(). 2. Elector.confirmed_master() Before each iteration of periodic_resync_thread, also re-read the election key from etcd and confirm the value matches our id_string. If etcd disagrees, clear _master locally and skip the resync cycle. This defends against any residual disagreement between our in-process flag and etcd's ground truth, at the cost of one extra etcd GET per resync tick (negligible vs. the resync itself). 3. StatusWatcher stale-update WARNING When an incoming Felix status update has a "time" field more than 5 mins in the past, log a rate-limited WARNING so operators see the backlog building up long before it grows to hours. Skipped during initial-snapshot replay to avoid false positives. Complements the ReportingIntervalSecs / agent_down_time config tuning that was recommended to the customer. Also promote the surviving LOG.debug calls in election.py to LOG.info so election state transitions and refresh activity are visible in production logs without toggling debug logging. Unit tests added for healthy(), confirmed_master() and the stale status check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The silent-greenlet-death issue (CI-1892) can theoretically affect any of the driver's long-running greenlets, not just the elector. If periodic_resync_thread dies silently, Neutron DB / etcd drift goes undetected; if _status_updating_thread dies, Felix agent status stops reaching Neutron; and so on. Track all spawned worker greenlets in self._greenlets as (name, greenlet) pairs. Add _check_greenlets_alive(), which iterates the list and calls sys.exit(1) if any greenlet has .dead == True — the same recovery strategy the elector already uses in its BaseException handler. The process manager (systemd) restarts neutron-server, which is the safest recovery from an unknown-state failure. The check is called from _status_updating_thread (every 5s) and resync_monitor_thread (every resync_max_interval_secs / 5), giving mutual watchdogging: either can detect the other's death, and both can detect death of the remaining greenlets (periodic_resync, periodic_compaction, port_status workers). A greenlet checking itself always sees .dead == False, so no self-check special-casing is needed. The elector greenlet is not included — it has its own healthy() watchdog mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CI-1892 is a private Jira ticket; non-Tigera readers cannot access it. The code comment is self-contained without the reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tatus check datetime.fromisoformat() returns a naive datetime when the input has no timezone suffix (e.g. "2015-08-14T10:37:54" with no trailing "Z"). Subtracting a naive datetime from datetime.now(tz=UTC) raises TypeError. Treat naive timestamps as UTC so the lag computation succeeds. Add a test for this case. Addresses Copilot review feedback on PR projectcalico#12456. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…l_secs == 0 When resync_interval_secs is 0, periodic_resync_thread does a single resync cycle and then exits intentionally. The greenlet becomes .dead, which the mutual watchdog treats as an unexpected death and calls sys.exit(1) — a false positive that was triggered immediately in our DevStack CI (which uses resync_interval_secs = 0 by default). Only register the periodic_resync greenlet in the watchdog list when resync_interval_secs > 0, i.e. when the thread is expected to keep running indefinitely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zhanz1 · 2026-04-14T20:46:31Z

@@ -482,6 +519,29 @@ def _post_fork_init(self, voting=False):
                "Calico mechanism driver initialisation done in process %s", current_pid
            )

+    def _check_greenlets_alive(self):


With the current code (without this change), even though we might have multiple masters doing the same job, the cluster is actually still functional (e.g., it's ok to have two masters writing agent/status updates). My main worry here (with this change) is what if the elector threads on all neutron-servers die at roughly the same time? In this case, everyone will stop doing the job until we restart it here. This may take a LONG time, as this function is called only when we start to watch statuses or in resync monitor (which will sleep based on the configured time).

It's unfortunate that these threads are not reliable and none of the solutions on top my mind handles this well (e.g., have yet another thread to monitor this (or alter the existing monitor thread to be more general-purposed)), but lemme know what do you think?

@zhanz1 Thanks for looking at this. I entirely agree that this is a theoretical problem:

what if the elector threads on all neutron-servers die at roughly the same time?

But do you think there is anything in this PR that has made that more likely? I don't think the PR does that, but perhaps I have missed something.

Ah I didn't realize that the status update and resync monitor thread is constantly (every 5 secs) checking if the threads are alive if it's not the leader, my bad. I was thinking could there be a big gap between all three electors die till at least one the neutron-servers restarts.

No it doesn't make the elector threads on all neutron-servers die at roughly the same time more likely from my reading. I was simply thinking about edge cases because we don't have enough information on why exactly the threads die.

zhanz1 · 2026-04-15T13:32:18Z

+        the safest recovery — the same approach the elector already uses
+        for its own unhandled-exception path.
+        """
+        for name, gt in self._greenlets:


I would be more aggressive - not trusting eventlet at all.

If an elector is not doing it's job (for master, this would be not updating self._last_refresh, for non-master, this would be the last time they watched the vote key change (so something like self._last_checked)), do a restart.

If the port status update worker thread is not consuming events from the queue, do a restart.

github-actions · 2026-06-14T18:35:50Z

This PR is stale because it has been open for 60 days with no activity.

nelljerram · 2026-06-18T15:33:07Z

Re-assessing this PR after #12668 (the worker-split) merged. The picture has changed enough that I'm splitting this into smaller, focused PRs and will close this one once they land.

What was here:

Elector.healthy() — defensive check against silent elector greenthread death.
Elector.confirmed_master() — etcd ground-truth re-read before each periodic_resync_thread iteration.
Mutual cross-greenlet watchdog.
StatusWatcher stale-update WARNING.

What's being carried forward:

Stale-update WARNING → networking-calico: warn on stale Felix status updates #13024. Stands on its own merits independent of the mastership story — operators benefit from early warning of a Felix status backlog either way.
Mastership documentation → mech_calico: document mastership architecture in get_workers docstring #13023. The de facto split since networking-calico: Separate leader jobs into different processes #12668 — static [calico] startup_resync config for the one-shot startup resync, dynamic election for the continuous compaction and status-watching loops — was never written down. Documenting it on get_workers() so future readers don't have to reverse-engineer it from is_master() callsites.

What's being dropped, and why:

Elector.healthy(): largely superseded. is_master() already does a time-based stale-refresh check and warns if the elector hasn't refreshed within MASTER_TIMEOUT. The only thing healthy() added was an explicit _greenlet.dead check — but if the greenlet died, the refresh would stop, and the time check catches that.
Elector.confirmed_master(): targeted at periodic_resync_thread, which no longer exists post-networking-calico: Separate leader jobs into different processes #12668. The resync is now one-shot at process start. The blast radius of stale mastership in the remaining loops is much smaller: compaction has its own etcd-coordinated lock, and stale status-watcher writes are caught by is_master()'s time check.
Mutual cross-greenlet watchdog: designed for the old "one process runs many threads" layout. With networking-calico: Separate leader jobs into different processes #12668's per-worker process split, each worker has one job; if its main greenlet dies the process is effectively dead and Neutron's supervisor respawns it.

Closing this once #13023 and #13024 merge.

nelljerram requested a review from a team as a code owner April 14, 2026 10:13

Copilot AI review requested due to automatic review settings April 14, 2026 10:13

marvin-tigera added this to the Calico v3.33.0 milestone Apr 14, 2026

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 14, 2026

nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Apr 14, 2026

Copilot started reviewing on behalf of nelljerram April 14, 2026 10:14 View session

Copilot AI reviewed Apr 14, 2026

View reviewed changes

nelljerram and others added 4 commits April 14, 2026 12:00

networking-calico: remove CI-1892 reference from code comment

56bf525

CI-1892 is a private Jira ticket; non-Tigera readers cannot access it. The code comment is self-contained without the reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nelljerram force-pushed the openstack-election-hardening branch from b4d8299 to f34c2b3 Compare April 14, 2026 11:06

nelljerram and others added 2 commits April 14, 2026 12:10

make -C networking-calico fmtpy

ac8e9a3

nelljerram force-pushed the openstack-election-hardening branch from 69bd85e to b4a1a4e Compare April 14, 2026 11:33

zhanz1 reviewed Apr 14, 2026

View reviewed changes

zhanz1 reviewed Apr 15, 2026

View reviewed changes

nelljerram mentioned this pull request May 20, 2026

networking-calico: Separate leader jobs into different processes #12668

Merged

github-actions Bot added the stale Issues without recent activity label Jun 14, 2026

github-actions Bot removed the stale Issues without recent activity label Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

networking-calico: defensive fixes for silent elector death and status backlog#12456

networking-calico: defensive fixes for silent elector death and status backlog#12456
nelljerram wants to merge 6 commits into
projectcalico:masterfrom
nelljerram:openstack-election-hardening

nelljerram commented Apr 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

nelljerram Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

nelljerram Apr 14, 2026

Uh oh!

Uh oh!

zhanz1 Apr 14, 2026

Uh oh!

nelljerram Apr 15, 2026

Uh oh!

zhanz1 Apr 15, 2026

Uh oh!

zhanz1 Apr 15, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

nelljerram commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-            return
+            return
+        if status_time.tzinfo is None:
+            # Older/test inputs may omit timezone information.  Treat those
+            # naive timestamps as UTC so arithmetic with an aware UTC "now"
+            # does not raise TypeError.
+            status_time = status_time.replace(tzinfo=timezone.utc)

Conversation

nelljerram commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

nelljerram Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

nelljerram Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zhanz1 Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

nelljerram Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

zhanz1 Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

zhanz1 Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

nelljerram commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nelljerram commented Apr 14, 2026 •

edited

Loading

zhanz1 Apr 15, 2026 •

edited

Loading