networking-calico: defensive fixes for silent elector death and status backlog#12456
networking-calico: defensive fixes for silent elector death and status backlog#12456nelljerram wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the OpenStack networking-calico ML2 driver against silent leader-election failures and status-processing backlogs by adding stronger “am I really master?” checks and surfacing delayed Felix status updates via warnings.
Changes:
- Add
Elector.healthy()(lease-refresh freshness + greenlet liveness) and switch master-only loops to use it. - Add
Elector.confirmed_master()(extra etcd GET verification) and use it for periodic resync gating. - Add stale Felix status detection with rate-limited warnings, plus unit tests for the new behaviors.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py | Adds stale-status warning logic (with rate limiting) for delayed Felix status updates. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py | Adds healthy() / confirmed_master() and increases election logging verbosity. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py | Switches master-only work gating from master() to healthy() / confirmed_master(). |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_status.py | New focused unit tests for stale-status handling. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py | Adds unit tests for healthy() and confirmed_master(). |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py | Updates mocks to account for new elector methods used by threads. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_monitor_thread.py | Updates mocks/tests for new elector gating methods. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py | Extends elector stub to implement healthy() / confirmed_master(). |
| def test_fresh_update_does_not_warn(self): | ||
| fresh = self._fmt(datetime.now(tz=timezone.utc)) | ||
| with mock.patch.object(status.LOG, "warning") as m_warn: | ||
| self.watcher._check_for_stale_status("host1", {"time": fresh}) | ||
| m_warn.assert_not_called() | ||
| self.assertEqual(0.0, self.watcher._last_stale_warn) | ||
|
|
||
| def test_stale_update_warns(self): | ||
| stale = self._fmt(datetime.now(tz=timezone.utc) - timedelta(hours=1)) | ||
| with mock.patch.object(status.LOG, "warning") as m_warn: | ||
| self.watcher._check_for_stale_status("host1", {"time": stale}) | ||
| m_warn.assert_called_once() |
There was a problem hiding this comment.
The new stale-status tests only cover RFC3339 timestamps with a trailing 'Z'. The existing etcd test harness uses a timezone-less value like "2015-08-14T10:37:54"; that currently causes a crash in _check_for_stale_status due to naive/aware datetime subtraction. Please add a unit test case for a timezone-less timestamp so this regression is caught.
| status_time_str, | ||
| hostname, | ||
| ) | ||
| return |
There was a problem hiding this comment.
_check_for_stale_status() can raise a TypeError when the status "time" parses as a naive datetime (no timezone info). datetime.fromisoformat('2015-08-14T10:37:54') returns a naive datetime, and subtracting it from datetime.now(tz=UTC) will crash the watcher thread. This is already a format used in the existing test harness (see test_plugin_etcd.py), so this needs to be handled (e.g., treat naive timestamps as UTC, or catch TypeError and skip/log).
| return | |
| return | |
| if status_time.tzinfo is None: | |
| # Older/test inputs may omit timezone information. Treat those | |
| # naive timestamps as UTC so arithmetic with an aware UTC "now" | |
| # does not raise TypeError. | |
| status_time = status_time.replace(tzinfo=timezone.utc) |
…s backlog For CORE-12651 and CI-1892 Three fixes for the issues investigated under CI-1892. None of them changes the election wire protocol or the existing behaviour in the common case; they add extra checks that catch known failure modes. 1. Elector.healthy() self._master is a local Python flag that only gets cleared if the election greenlet exits normally via _attempt_step_down() or the refresh-loop finally clause. If the greenlet dies silently - e.g. due to an eventlet-level issue that drops the frame without unwinding Python exceptions - _master stays True indefinitely and the rest of the driver keeps running master-only work (periodic resync, compaction, StatusWatcher) on a node whose etcd lease has long since expired, causing split-brain. Elector now stamps self._last_refresh on each successful lease refresh, and healthy() returns True only if the greenlet is still alive AND the lease has been refreshed within the last TTL. mech_calico.py callsites that decide whether to do master-only work switch from master() to healthy(). 2. Elector.confirmed_master() Before each iteration of periodic_resync_thread, also re-read the election key from etcd and confirm the value matches our id_string. If etcd disagrees, clear _master locally and skip the resync cycle. This defends against any residual disagreement between our in-process flag and etcd's ground truth, at the cost of one extra etcd GET per resync tick (negligible vs. the resync itself). 3. StatusWatcher stale-update WARNING When an incoming Felix status update has a "time" field more than 5 mins in the past, log a rate-limited WARNING so operators see the backlog building up long before it grows to hours. Skipped during initial-snapshot replay to avoid false positives. Complements the ReportingIntervalSecs / agent_down_time config tuning that was recommended to the customer. Also promote the surviving LOG.debug calls in election.py to LOG.info so election state transitions and refresh activity are visible in production logs without toggling debug logging. Unit tests added for healthy(), confirmed_master() and the stale status check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The silent-greenlet-death issue (CI-1892) can theoretically affect any of the driver's long-running greenlets, not just the elector. If periodic_resync_thread dies silently, Neutron DB / etcd drift goes undetected; if _status_updating_thread dies, Felix agent status stops reaching Neutron; and so on. Track all spawned worker greenlets in self._greenlets as (name, greenlet) pairs. Add _check_greenlets_alive(), which iterates the list and calls sys.exit(1) if any greenlet has .dead == True — the same recovery strategy the elector already uses in its BaseException handler. The process manager (systemd) restarts neutron-server, which is the safest recovery from an unknown-state failure. The check is called from _status_updating_thread (every 5s) and resync_monitor_thread (every resync_max_interval_secs / 5), giving mutual watchdogging: either can detect the other's death, and both can detect death of the remaining greenlets (periodic_resync, periodic_compaction, port_status workers). A greenlet checking itself always sees .dead == False, so no self-check special-casing is needed. The elector greenlet is not included — it has its own healthy() watchdog mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI-1892 is a private Jira ticket; non-Tigera readers cannot access it. The code comment is self-contained without the reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tatus check datetime.fromisoformat() returns a naive datetime when the input has no timezone suffix (e.g. "2015-08-14T10:37:54" with no trailing "Z"). Subtracting a naive datetime from datetime.now(tz=UTC) raises TypeError. Treat naive timestamps as UTC so the lag computation succeeds. Add a test for this case. Addresses Copilot review feedback on PR projectcalico#12456. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b4d8299 to
f34c2b3
Compare
…l_secs == 0 When resync_interval_secs is 0, periodic_resync_thread does a single resync cycle and then exits intentionally. The greenlet becomes .dead, which the mutual watchdog treats as an unexpected death and calls sys.exit(1) — a false positive that was triggered immediately in our DevStack CI (which uses resync_interval_secs = 0 by default). Only register the periodic_resync greenlet in the watchdog list when resync_interval_secs > 0, i.e. when the thread is expected to keep running indefinitely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
69bd85e to
b4a1a4e
Compare
| @@ -482,6 +519,29 @@ def _post_fork_init(self, voting=False): | |||
| "Calico mechanism driver initialisation done in process %s", current_pid | |||
| ) | |||
|
|
|||
| def _check_greenlets_alive(self): | |||
There was a problem hiding this comment.
With the current code (without this change), even though we might have multiple masters doing the same job, the cluster is actually still functional (e.g., it's ok to have two masters writing agent/status updates). My main worry here (with this change) is what if the elector threads on all neutron-servers die at roughly the same time? In this case, everyone will stop doing the job until we restart it here. This may take a LONG time, as this function is called only when we start to watch statuses or in resync monitor (which will sleep based on the configured time).
It's unfortunate that these threads are not reliable and none of the solutions on top my mind handles this well (e.g., have yet another thread to monitor this (or alter the existing monitor thread to be more general-purposed)), but lemme know what do you think?
There was a problem hiding this comment.
@zhanz1 Thanks for looking at this. I entirely agree that this is a theoretical problem:
what if the elector threads on all neutron-servers die at roughly the same time?
But do you think there is anything in this PR that has made that more likely? I don't think the PR does that, but perhaps I have missed something.
There was a problem hiding this comment.
Ah I didn't realize that the status update and resync monitor thread is constantly (every 5 secs) checking if the threads are alive if it's not the leader, my bad. I was thinking could there be a big gap between all three electors die till at least one the neutron-servers restarts.
No it doesn't make the elector threads on all neutron-servers die at roughly the same time more likely from my reading. I was simply thinking about edge cases because we don't have enough information on why exactly the threads die.
| the safest recovery — the same approach the elector already uses | ||
| for its own unhandled-exception path. | ||
| """ | ||
| for name, gt in self._greenlets: |
There was a problem hiding this comment.
I would be more aggressive - not trusting eventlet at all.
If an elector is not doing it's job (for master, this would be not updating self._last_refresh, for non-master, this would be the last time they watched the vote key change (so something like self._last_checked)), do a restart.
If the port status update worker thread is not consuming events from the queue, do a restart.
|
This PR is stale because it has been open for 60 days with no activity. |
|
Re-assessing this PR after #12668 (the worker-split) merged. The picture has changed enough that I'm splitting this into smaller, focused PRs and will close this one once they land. What was here:
What's being carried forward:
What's being dropped, and why:
|
For CORE-12651 and CI-1892
Three fixes for the issues investigated under CI-1892. None of them changes the election wire protocol or the existing behaviour in the common case; they add extra checks that catch known failure modes.
Elector.healthy()
self._master is a local Python flag that only gets cleared if the election greenlet exits normally via _attempt_step_down() or the refresh-loop finally clause. If the greenlet dies silently - e.g. due to an eventlet-level issue that drops the frame without unwinding Python exceptions - _master stays True indefinitely and the rest of the driver keeps running master-only work (periodic resync, compaction, StatusWatcher) on a node whose etcd lease has long since expired, causing split-brain.
Elector now stamps self._last_refresh on each successful lease refresh, and healthy() returns True only if the greenlet is still alive AND the lease has been refreshed within the last TTL. mech_calico.py callsites that decide whether to do master-only work switch from master() to healthy().
Elector.confirmed_master()
Before each iteration of periodic_resync_thread, also re-read the election key from etcd and confirm the value matches our id_string. If etcd disagrees, clear _master locally and skip the resync cycle. This defends against any residual disagreement between our in-process flag and etcd's ground truth, at the cost of one extra etcd GET per resync tick (negligible vs. the resync itself).
StatusWatcher stale-update WARNING
When an incoming Felix status update has a "time" field more than 5 mins in the past, log a rate-limited WARNING so operators see the backlog building up long before it grows to hours. Skipped during initial-snapshot replay to avoid false positives. Complements the ReportingIntervalSecs / agent_down_time config tuning that was recommended to the customer.
Also promote the surviving LOG.debug calls in election.py to LOG.info so election state transitions and refresh activity are visible in production logs without toggling debug logging.
Unit tests added for healthy(), confirmed_master() and the stale status check.
Release note: