Skip to content

networking-calico: defensive fixes for silent elector death and status backlog#12456

Open
nelljerram wants to merge 6 commits into
projectcalico:masterfrom
nelljerram:openstack-election-hardening
Open

networking-calico: defensive fixes for silent elector death and status backlog#12456
nelljerram wants to merge 6 commits into
projectcalico:masterfrom
nelljerram:openstack-election-hardening

Conversation

@nelljerram

@nelljerram nelljerram commented Apr 14, 2026

Copy link
Copy Markdown
Member

For CORE-12651 and CI-1892

Three fixes for the issues investigated under CI-1892. None of them changes the election wire protocol or the existing behaviour in the common case; they add extra checks that catch known failure modes.

  1. Elector.healthy()

    self._master is a local Python flag that only gets cleared if the election greenlet exits normally via _attempt_step_down() or the refresh-loop finally clause. If the greenlet dies silently - e.g. due to an eventlet-level issue that drops the frame without unwinding Python exceptions - _master stays True indefinitely and the rest of the driver keeps running master-only work (periodic resync, compaction, StatusWatcher) on a node whose etcd lease has long since expired, causing split-brain.

    Elector now stamps self._last_refresh on each successful lease refresh, and healthy() returns True only if the greenlet is still alive AND the lease has been refreshed within the last TTL. mech_calico.py callsites that decide whether to do master-only work switch from master() to healthy().

  2. Elector.confirmed_master()

    Before each iteration of periodic_resync_thread, also re-read the election key from etcd and confirm the value matches our id_string. If etcd disagrees, clear _master locally and skip the resync cycle. This defends against any residual disagreement between our in-process flag and etcd's ground truth, at the cost of one extra etcd GET per resync tick (negligible vs. the resync itself).

  3. StatusWatcher stale-update WARNING

    When an incoming Felix status update has a "time" field more than 5 mins in the past, log a rate-limited WARNING so operators see the backlog building up long before it grows to hours. Skipped during initial-snapshot replay to avoid false positives. Complements the ReportingIntervalSecs / agent_down_time config tuning that was recommended to the customer.

Also promote the surviving LOG.debug calls in election.py to LOG.info so election state transitions and refresh activity are visible in production logs without toggling debug logging.

Unit tests added for healthy(), confirmed_master() and the stale status check.

Release note:

Bugfix: Make OpenStack driver periodic processing more resilient to an eventlet bug that can result in silent greenlet death.

@nelljerram nelljerram requested a review from a team as a code owner April 14, 2026 10:13
Copilot AI review requested due to automatic review settings April 14, 2026 10:13
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 14, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 14, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Apr 14, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the OpenStack networking-calico ML2 driver against silent leader-election failures and status-processing backlogs by adding stronger “am I really master?” checks and surfacing delayed Felix status updates via warnings.

Changes:

  • Add Elector.healthy() (lease-refresh freshness + greenlet liveness) and switch master-only loops to use it.
  • Add Elector.confirmed_master() (extra etcd GET verification) and use it for periodic resync gating.
  • Add stale Felix status detection with rate-limited warnings, plus unit tests for the new behaviors.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py Adds stale-status warning logic (with rate limiting) for delayed Felix status updates.
networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py Adds healthy() / confirmed_master() and increases election logging verbosity.
networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Switches master-only work gating from master() to healthy() / confirmed_master().
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_status.py New focused unit tests for stale-status handling.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py Adds unit tests for healthy() and confirmed_master().
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py Updates mocks to account for new elector methods used by threads.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_monitor_thread.py Updates mocks/tests for new elector gating methods.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py Extends elector stub to implement healthy() / confirmed_master().

Comment on lines +44 to +55
def test_fresh_update_does_not_warn(self):
fresh = self._fmt(datetime.now(tz=timezone.utc))
with mock.patch.object(status.LOG, "warning") as m_warn:
self.watcher._check_for_stale_status("host1", {"time": fresh})
m_warn.assert_not_called()
self.assertEqual(0.0, self.watcher._last_stale_warn)

def test_stale_update_warns(self):
stale = self._fmt(datetime.now(tz=timezone.utc) - timedelta(hours=1))
with mock.patch.object(status.LOG, "warning") as m_warn:
self.watcher._check_for_stale_status("host1", {"time": stale})
m_warn.assert_called_once()

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new stale-status tests only cover RFC3339 timestamps with a trailing 'Z'. The existing etcd test harness uses a timezone-less value like "2015-08-14T10:37:54"; that currently causes a crash in _check_for_stale_status due to naive/aware datetime subtraction. Please add a unit test case for a timezone-less timestamp so this regression is caught.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status_time_str,
hostname,
)
return

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_check_for_stale_status() can raise a TypeError when the status "time" parses as a naive datetime (no timezone info). datetime.fromisoformat('2015-08-14T10:37:54') returns a naive datetime, and subtracting it from datetime.now(tz=UTC) will crash the watcher thread. This is already a format used in the existing test harness (see test_plugin_etcd.py), so this needs to be handled (e.g., treat naive timestamps as UTC, or catch TypeError and skip/log).

Suggested change
return
return
if status_time.tzinfo is None:
# Older/test inputs may omit timezone information. Treat those
# naive timestamps as UTC so arithmetic with an aware UTC "now"
# does not raise TypeError.
status_time = status_time.replace(tzinfo=timezone.utc)

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nelljerram and others added 4 commits April 14, 2026 12:00
…s backlog

For CORE-12651 and CI-1892

Three fixes for the issues investigated under CI-1892.  None of them
changes the election wire protocol or the existing behaviour in the
common case; they add extra checks that catch known failure modes.

1. Elector.healthy()

   self._master is a local Python flag that only gets cleared if the
   election greenlet exits normally via _attempt_step_down() or the
   refresh-loop finally clause.  If the greenlet dies silently - e.g.
   due to an eventlet-level issue that drops the frame without
   unwinding Python exceptions - _master stays True indefinitely and
   the rest of the driver keeps running master-only work (periodic
   resync, compaction, StatusWatcher) on a node whose etcd lease has
   long since expired, causing split-brain.

   Elector now stamps self._last_refresh on each successful lease
   refresh, and healthy() returns True only if the greenlet is still
   alive AND the lease has been refreshed within the last TTL.
   mech_calico.py callsites that decide whether to do master-only work
   switch from master() to healthy().

2. Elector.confirmed_master()

   Before each iteration of periodic_resync_thread, also re-read the
   election key from etcd and confirm the value matches our
   id_string.  If etcd disagrees, clear _master locally and skip the
   resync cycle.  This defends against any residual disagreement
   between our in-process flag and etcd's ground truth, at the cost of
   one extra etcd GET per resync tick (negligible vs. the resync
   itself).

3. StatusWatcher stale-update WARNING

   When an incoming Felix status update has a "time" field more than
   5 mins in the past, log a rate-limited WARNING so operators see the
   backlog building up long before it grows to hours.  Skipped during
   initial-snapshot replay to avoid false positives.  Complements the
   ReportingIntervalSecs / agent_down_time config tuning that was
   recommended to the customer.

Also promote the surviving LOG.debug calls in election.py to LOG.info
so election state transitions and refresh activity are visible in
production logs without toggling debug logging.

Unit tests added for healthy(), confirmed_master() and the stale
status check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The silent-greenlet-death issue (CI-1892) can theoretically affect any
of the driver's long-running greenlets, not just the elector.  If
periodic_resync_thread dies silently, Neutron DB / etcd drift goes
undetected; if _status_updating_thread dies, Felix agent status stops
reaching Neutron; and so on.

Track all spawned worker greenlets in self._greenlets as (name,
greenlet) pairs.  Add _check_greenlets_alive(), which iterates the
list and calls sys.exit(1) if any greenlet has .dead == True — the
same recovery strategy the elector already uses in its BaseException
handler.  The process manager (systemd) restarts neutron-server,
which is the safest recovery from an unknown-state failure.

The check is called from _status_updating_thread (every 5s) and
resync_monitor_thread (every resync_max_interval_secs / 5), giving
mutual watchdogging: either can detect the other's death, and both
can detect death of the remaining greenlets (periodic_resync,
periodic_compaction, port_status workers).  A greenlet checking
itself always sees .dead == False, so no self-check special-casing
is needed.

The elector greenlet is not included — it has its own healthy()
watchdog mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI-1892 is a private Jira ticket; non-Tigera readers cannot access it.
The code comment is self-contained without the reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tatus check

datetime.fromisoformat() returns a naive datetime when the input has
no timezone suffix (e.g. "2015-08-14T10:37:54" with no trailing "Z").
Subtracting a naive datetime from datetime.now(tz=UTC) raises
TypeError.  Treat naive timestamps as UTC so the lag computation
succeeds.

Add a test for this case.

Addresses Copilot review feedback on PR projectcalico#12456.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram force-pushed the openstack-election-hardening branch from b4d8299 to f34c2b3 Compare April 14, 2026 11:06
nelljerram and others added 2 commits April 14, 2026 12:10
…l_secs == 0

When resync_interval_secs is 0, periodic_resync_thread does a single
resync cycle and then exits intentionally.  The greenlet becomes
.dead, which the mutual watchdog treats as an unexpected death and
calls sys.exit(1) — a false positive that was triggered immediately
in our DevStack CI (which uses resync_interval_secs = 0 by default).

Only register the periodic_resync greenlet in the watchdog list when
resync_interval_secs > 0, i.e. when the thread is expected to keep
running indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram force-pushed the openstack-election-hardening branch from 69bd85e to b4a1a4e Compare April 14, 2026 11:33
@@ -482,6 +519,29 @@ def _post_fork_init(self, voting=False):
"Calico mechanism driver initialisation done in process %s", current_pid
)

def _check_greenlets_alive(self):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current code (without this change), even though we might have multiple masters doing the same job, the cluster is actually still functional (e.g., it's ok to have two masters writing agent/status updates). My main worry here (with this change) is what if the elector threads on all neutron-servers die at roughly the same time? In this case, everyone will stop doing the job until we restart it here. This may take a LONG time, as this function is called only when we start to watch statuses or in resync monitor (which will sleep based on the configured time).

It's unfortunate that these threads are not reliable and none of the solutions on top my mind handles this well (e.g., have yet another thread to monitor this (or alter the existing monitor thread to be more general-purposed)), but lemme know what do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhanz1 Thanks for looking at this. I entirely agree that this is a theoretical problem:

what if the elector threads on all neutron-servers die at roughly the same time?

But do you think there is anything in this PR that has made that more likely? I don't think the PR does that, but perhaps I have missed something.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't realize that the status update and resync monitor thread is constantly (every 5 secs) checking if the threads are alive if it's not the leader, my bad. I was thinking could there be a big gap between all three electors die till at least one the neutron-servers restarts.

No it doesn't make the elector threads on all neutron-servers die at roughly the same time more likely from my reading. I was simply thinking about edge cases because we don't have enough information on why exactly the threads die.

the safest recovery — the same approach the elector already uses
for its own unhandled-exception path.
"""
for name, gt in self._greenlets:

@zhanz1 zhanz1 Apr 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more aggressive - not trusting eventlet at all.

If an elector is not doing it's job (for master, this would be not updating self._last_refresh, for non-master, this would be the last time they watched the vote key change (so something like self._last_checked)), do a restart.

If the port status update worker thread is not consuming events from the queue, do a restart.

@github-actions

Copy link
Copy Markdown
Contributor

This PR is stale because it has been open for 60 days with no activity.

@github-actions github-actions Bot added the stale Issues without recent activity label Jun 14, 2026
@nelljerram

Copy link
Copy Markdown
Member Author

Re-assessing this PR after #12668 (the worker-split) merged. The picture has changed enough that I'm splitting this into smaller, focused PRs and will close this one once they land.

What was here:

  1. Elector.healthy() — defensive check against silent elector greenthread death.
  2. Elector.confirmed_master() — etcd ground-truth re-read before each periodic_resync_thread iteration.
  3. Mutual cross-greenlet watchdog.
  4. StatusWatcher stale-update WARNING.

What's being carried forward:

What's being dropped, and why:

  • Elector.healthy(): largely superseded. is_master() already does a time-based stale-refresh check and warns if the elector hasn't refreshed within MASTER_TIMEOUT. The only thing healthy() added was an explicit _greenlet.dead check — but if the greenlet died, the refresh would stop, and the time check catches that.
  • Elector.confirmed_master(): targeted at periodic_resync_thread, which no longer exists post-networking-calico: Separate leader jobs into different processes #12668. The resync is now one-shot at process start. The blast radius of stale mastership in the remaining loops is much smaller: compaction has its own etcd-coordinated lock, and stale status-watcher writes are caught by is_master()'s time check.
  • Mutual cross-greenlet watchdog: designed for the old "one process runs many threads" layout. With networking-calico: Separate leader jobs into different processes #12668's per-worker process split, each worker has one job; if its main greenlet dies the process is effectively dead and Neutron's supervisor respawns it.

Closing this once #13023 and #13024 merge.

@github-actions github-actions Bot removed the stale Issues without recent activity label Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants