-
Notifications
You must be signed in to change notification settings - Fork 1.6k
networking-calico: defensive fixes for silent elector death and status backlog #12456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
5cffab3
33cd9c1
56bf525
f34c2b3
ac8e9a3
b4a1a4e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,7 @@ | |
| from datetime import datetime, timedelta | ||
| import os | ||
| import re | ||
| import sys | ||
| import threading | ||
| import uuid | ||
| from functools import wraps | ||
|
|
@@ -341,6 +342,11 @@ def __init__(self): | |
| # Last resync completion time | ||
| self.last_resync_time = datetime.now() | ||
|
|
||
| # List of (name, greenlet) for the long-running worker greenlets | ||
| # spawned by _post_fork_init. Used by _check_greenlets_alive() to | ||
| # detect silent greenlet death. | ||
| self._greenlets = [] | ||
|
|
||
| # Tell the monkeypatch where we are. | ||
| global mech_driver | ||
| assert mech_driver is None | ||
|
|
@@ -462,13 +468,44 @@ def _post_fork_init(self, voting=False): | |
| # We deliberately do this last, to ensure that all of the setup | ||
| # above is complete before we start running. | ||
| self._epoch += 1 | ||
| eventlet.spawn(self.resync_monitor_thread, self._epoch) | ||
| eventlet.spawn(self.periodic_resync_thread, self._epoch) | ||
| self._greenlets = [] | ||
| self._greenlets.append( | ||
| ( | ||
| "resync_monitor", | ||
| eventlet.spawn(self.resync_monitor_thread, self._epoch), | ||
| ) | ||
| ) | ||
| resync_gt = eventlet.spawn(self.periodic_resync_thread, self._epoch) | ||
| if cfg.CONF.calico.resync_interval_secs > 0: | ||
| # Only watchdog the resync thread if it is expected to | ||
| # keep running. When resync_interval_secs == 0 the | ||
| # thread does a single resync and then exits | ||
| # intentionally. | ||
| self._greenlets.append(("periodic_resync", resync_gt)) | ||
| if cfg.CONF.calico.etcd_compaction_period_mins > 0: | ||
| eventlet.spawn(self.periodic_compaction_thread, self._epoch) | ||
| eventlet.spawn(self._status_updating_thread, self._epoch) | ||
| for _ in range(cfg.CONF.calico.num_port_status_threads): | ||
| eventlet.spawn(self._loop_writing_port_statuses, self._epoch) | ||
| self._greenlets.append( | ||
| ( | ||
| "periodic_compaction", | ||
| eventlet.spawn( | ||
| self.periodic_compaction_thread, self._epoch | ||
| ), | ||
| ) | ||
| ) | ||
| self._greenlets.append( | ||
| ( | ||
| "status_updating", | ||
| eventlet.spawn(self._status_updating_thread, self._epoch), | ||
| ) | ||
| ) | ||
| for i in range(cfg.CONF.calico.num_port_status_threads): | ||
| self._greenlets.append( | ||
| ( | ||
| "port_status_%d" % i, | ||
| eventlet.spawn( | ||
| self._loop_writing_port_statuses, self._epoch | ||
| ), | ||
| ) | ||
| ) | ||
| else: | ||
| LOG.info( | ||
| "PID %s: Not a voting participant; " | ||
|
|
@@ -482,6 +519,29 @@ def _post_fork_init(self, voting=False): | |
| "Calico mechanism driver initialisation done in process %s", current_pid | ||
| ) | ||
|
|
||
| def _check_greenlets_alive(self): | ||
| """Detect if any long-running worker greenlet has silently died. | ||
|
|
||
| Under eventlet, a greenlet can occasionally die without unwinding | ||
| its Python frames (e.g. due to a hub-level error), leaving no | ||
| traceback in the log and no state cleanup. This method provides | ||
| mutual watchdogging: each of the driver's long-running loops | ||
| calls it periodically to verify that the others are still alive. | ||
|
|
||
| If a dead greenlet is found, we log an error and exit. The | ||
| process manager (systemd) will restart neutron-server, which is | ||
| the safest recovery — the same approach the elector already uses | ||
| for its own unhandled-exception path. | ||
| """ | ||
| for name, gt in self._greenlets: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would be more aggressive - not trusting eventlet at all. If an elector is not doing it's job (for master, this would be not updating If the port status update worker thread is not consuming events from the queue, do a restart. |
||
| if gt.dead: | ||
| LOG.error( | ||
| "Worker greenlet %r has unexpectedly died; exiting so " | ||
| "that the process manager can restart neutron-server.", | ||
| name, | ||
| ) | ||
| sys.exit(1) | ||
|
|
||
| @logging_exceptions(LOG) | ||
| def _status_updating_thread(self, expected_epoch): | ||
| """_status_updating_thread | ||
|
|
@@ -493,8 +553,10 @@ def _status_updating_thread(self, expected_epoch): | |
| TrackTask("STATUS_UPDATING") | ||
| LOG.info("Status updating thread started.") | ||
| while self._epoch == expected_epoch: | ||
| # Only handle updates if we are the master node. | ||
| if self.elector.master(): | ||
| self._check_greenlets_alive() | ||
| # Only handle updates if we are healthily the master node. See | ||
| # Elector.healthy() for why we use healthy() rather than master(). | ||
| if self.elector.healthy(): | ||
| if self._etcd_watcher is None: | ||
| LOG.info("Became the master, starting StatusWatcher") | ||
| self._etcd_watcher = StatusWatcher(self) | ||
|
|
@@ -1232,8 +1294,9 @@ def resync_monitor_thread(self, launch_epoch): | |
| LOG.info("Resync monitor thread started") | ||
|
|
||
| while self._epoch == launch_epoch: | ||
| # Only monitor the resync if we are the master node. | ||
| if self.elector.master(): | ||
| self._check_greenlets_alive() | ||
| # Only monitor the resync if we are healthily the master node. | ||
| if self.elector.healthy(): | ||
| LOG.info("I am master: monitoring periodic resync") | ||
|
|
||
| curr_time = datetime.now() | ||
|
|
@@ -1276,8 +1339,13 @@ def periodic_resync_thread(self, launch_epoch): | |
| try: | ||
| LOG.info("Periodic resync thread started") | ||
| while self._epoch == launch_epoch: | ||
| # Only do the resync if we are the master node. | ||
| if self.elector.master(): | ||
| # Only do the resync if we are healthily the master node AND | ||
| # etcd still agrees that we are the master. The extra etcd | ||
| # read (vs. plain healthy()) defends against the in-process | ||
| # flag disagreeing with etcd's ground truth; resync is much | ||
| # more expensive than a single etcd GET, so the extra check | ||
| # is cheap by comparison. | ||
| if self.elector.confirmed_master(): | ||
| LOG.info("I am master: doing periodic resync") | ||
| start_time = datetime.now() | ||
|
|
||
|
|
@@ -1343,8 +1411,8 @@ def periodic_compaction_thread(self, launch_epoch): | |
| try: | ||
| LOG.info("Periodic compaction thread started") | ||
| while self._epoch == launch_epoch: | ||
| # Only do the compaction if we are the master node. | ||
| if self.elector.master(): | ||
| # Only do the compaction if we are healthily the master node. | ||
| if self.elector.healthy(): | ||
| LOG.info("I am master: doing periodic compaction") | ||
|
|
||
| try: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current code (without this change), even though we might have multiple masters doing the same job, the cluster is actually still functional (e.g., it's ok to have two masters writing agent/status updates). My main worry here (with this change) is what if the elector threads on all neutron-servers die at roughly the same time? In this case, everyone will stop doing the job until we restart it here. This may take a LONG time, as this function is called only when we start to watch statuses or in resync monitor (which will sleep based on the configured time).
It's unfortunate that these threads are not reliable and none of the solutions on top my mind handles this well (e.g., have yet another thread to monitor this (or alter the existing monitor thread to be more general-purposed)), but lemme know what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhanz1 Thanks for looking at this. I entirely agree that this is a theoretical problem:
But do you think there is anything in this PR that has made that more likely? I don't think the PR does that, but perhaps I have missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't realize that the status update and resync monitor thread is constantly (every 5 secs) checking if the threads are alive if it's not the leader, my bad. I was thinking could there be a big gap between all three electors die till at least one the neutron-servers restarts.
No it doesn't make
the elector threads on all neutron-servers die at roughly the same timemore likely from my reading. I was simply thinking about edge cases because we don't have enough information on why exactly the threads die.