networking-calico: clean up stale election keys on both ends#13069
networking-calico: clean up stale election keys on both ends#13069nelljerram wants to merge 6 commits into
Conversation
Two complementary fixes for the failure mode where the election key is left in etcd after the previous master has gone away. Without them, a restarted neutron-server has to wait out the full lease TTL (60s) before any node can win the next election. 1. Stale-key cleanup on read (`election._check_master_process`). When `_vote` reads the current election key and the value parses as `<server_id>:<pid>`, look up `/proc/<pid>`; if the process is no longer running on this host, CAS-delete the key against the observed value and restart the election. This is the equivalent of the pre-12668 _check_master_process, restored verbatim aside from docstring and wrapping tidy-ups. Covers SIGKILL / silent greenlet death of a previous master on this host. 2. Clean step-down on graceful shutdown (`CalicoManagerWorker.stop`). `CalicoManagerWorker` now keeps a back-reference to the mech driver and calls `driver.elector.stop()` before chaining to the base `BaseWorker.stop()`. The elector's stop() blocks until its greenlet has exited, which is what runs the `finally: _attempt_step_down()` that deletes the election key. Without this hook the elector greenlet was just killed, the finally never ran, and the key lingered. Covers SIGTERM-initiated shutdown of the master. Unit tests in test_election.py cover the /proc cleanup paths: live PID (no-op), dead PID with successful CAS-delete, dead PID with CAS-mismatch (restart), dead PID with etcd exception (restart), different host (no /proc check), and unparseable key value (warn only). Resolves CORE-13033. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a leader-election failure mode in the Neutron Calico ML2 driver where an election key can be left behind in etcd after the elected master disappears, forcing subsequent elections to wait for the full lease TTL before progressing.
Changes:
- Add stale-election-key cleanup when reading the current master key by checking
/proc/<pid>for same-host master IDs and CAS-deleting the key if the process is gone. - Ensure graceful shutdown triggers a clean step-down by having
CalicoManagerWorker.stop()calldriver.elector.stop()before chaining to the base worker shutdown. - Add unit tests for the new
_check_master_processstale-key cleanup behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py |
Adds a mech-driver back-reference and stops the elector during worker shutdown to ensure step-down cleanup runs. |
networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py |
Passes the mechanism driver into CalicoManagerWorker so the worker can access self.elector at shutdown. |
networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py |
Introduces _check_master_process() and invokes it on initial election-key read to proactively remove stale same-host election keys. |
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py |
Adds direct unit coverage for _check_master_process() across success and failure paths. |
|
@zhanz1 It has turned out that we need to reinstate the Please could you review this PR and let me know your thoughts? |
Cover the three branches the shutdown path can take: - happy path: driver.elector is set, so stop() calls elector.stop() exactly once and then chains to super().stop(); - elector absent: _driver is None / driver has no `elector` attribute / driver.elector is None -- stop() must not raise and must still chain to super(); - elector.stop() raises: the exception is logged and swallowed (we are on the shutdown path; the election key will expire with its lease either way) and super().stop() is still called. Without this, the regression-fix from abaefed ("networking-calico: clean up stale election keys on both ends") is unguarded. Raised in review of projectcalico#13069. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhanz1
left a comment
There was a problem hiding this comment.
Thanks for the fix and I think overall make sense, have some comments.
|
|
||
| def stop(self): | ||
| """Stop service.""" | ||
| """Stop service. |
There was a problem hiding this comment.
This makes sense. IIUC (correct me if I'm wrong), when we do something like systemctl stop neutron-server, it will send a SIGINT to all neutron-server processes, which each process's handler will catch that and call stop?
There was a problem hiding this comment.
Yes, I believe so. So when a shutdown is "graceful" in this way, the master elector deletes its etcd key. However, in case that doesn't happen (for any reason), we also reinstate the check_master_process logic so that a stale etcd key can be immediately identified when the neutron server starts again.
| ) | ||
| self._become_master() | ||
|
|
||
| def _check_master_process(self, master_id): |
There was a problem hiding this comment.
The "master" process over here is really the manager process right? If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?
This was working before, IIUC, because each process can become master and they can check the master process is alive or not. Now with just one process can be master, we kinda lose this ability. Therefore, for checking malfunctioning/failures//crashes, I would think we should rely on TTL? It would probably make sense to make it configurable.
There was a problem hiding this comment.
Yes, the master process is the "manager" process, because that is the one that includes the Elector.
If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?
I'm afraid I don't understand your query here. The specific system test scenario here was the neutron-server being restarted in order to pick up a config change. So the manager process in the old neutron-server is killed, and then a new manager process runs in the new neutron-server.
This was working before, IIUC, because each process can become master and they can check the master process is alive or not. Now with just one process can be master, we kinda lose this ability.
No, I don't think that's right. It was working before because of the _check_master_process logic. That was removed in #12668 , and this PR now reinstates it.
Therefore, for checking malfunctioning/failures//crashes, I would think we should rely on TTL?
What TTL do you have in mind here? I'm afraid I don't understand your suggestion.
There was a problem hiding this comment.
Ah I see what you are referring to here, I believe I misunderstood the intention of this function. Before Chao & I made any changes, IIUC, all neutron-server processes will run _post_fork_init and therefore run Elector. One of them will be elected as leader, and the rest will just do this _check_master_process. With this design, if the old leader process somehow died, then other processes can quickly delete the key and restart the election. When I was writing #12668, I thought this was the intention and hence removed it (because there will only be one process running Elector). I did not think about the restarts.
In terms of the TTL, I mean the lease TTL (MASTER_TIMEOUT). I think it would be helpful to make it configurable so that in case of a machine failure, where no one will delete the election key, we can shrink the time it takes for other machines to step in and become master.
There was a problem hiding this comment.
I agree it might be useful for MASTER_TIMEOUT and MASTER_REFRESH_INTERVAL to be configurable. But I think that is independent of the current PR, isn't it? 10s is already quite low for MASTER_REFRESH_INTERVAL.
There was a problem hiding this comment.
Yeah ofc, this can be a separate PR. While the refresh interval is 10s, I think it's the 60s timeout who's blocking longer for other neutron-server to step up?
There was a problem hiding this comment.
Yes, but (in case it's not clear) the current PR already fixes that. When the Neutron server restarts:
- It immediately creates the ManagerWorker process, ->
_init_start_calico_manager()->self.elector.start() - Elector calls
_vote, which finds the key and calls_check_master_process _check_master_processparses successfully, finds that host matches its own, but PID no longer running, and so deletes the key._votenow either sees the delete, orKeyNotFound, and so calls_become_master.
The important point is that that all happens immediately without waiting for any timeout or refresh interval.
There was a problem hiding this comment.
Yes, this is happy path where the neutron-server is restarted and I think we are on the same page here :D. I'm more referring to when there is a machine failure (i.e., the machine crashed) and thus the neutron-server is never restarted (because it can't), and it would then take this 60s for other machines running neutron-server to become master - and in that case if we want to reduce the downtime, we'll need to reduce the 60s - but this should be another PR as we discussed.
There was a problem hiding this comment.
Thanks, that makes sense. Would you like to prepare that PR?
There was a problem hiding this comment.
Sure, can do when I get a chance.
The 5 TestCalicoManagerWorkerStop tests were failing when run via subunit.run discover under the plugins/ml2/drivers/calico/test/ directory, while passing when run via unittest in isolation. Root cause: that directory's lib.py replaces sys.modules['neutron_lib.worker'] with a MagicMock at module-import time, so that when sibling tests (test_compaction, test_election, test_mech_calico, ...) load lib first, the real neutron_lib.worker.BaseWorker is never imported. When workers.py is then imported via mech_calico's transitive chain, its `worker.BaseWorker` reference is a Mock attribute, and `class CalicoManagerWorker(BaseWorker)` ends up collapsing into a MagicMock instead of becoming a real class. At that point our test patches against a Mock and our super().stop() chain assertions cannot fire. Move the test file to the sibling networking_calico/tests/ directory, where lib.py is not imported and neutron_lib.worker stays real. The two test directories run in separate subunit processes per .testr.conf, so process isolation guarantees this fix holds regardless of test ordering. Also documents the rationale in the file's docstring so future readers don't move it back. Verified: all 5 tests pass under both python -m unittest -v networking_calico.tests.test_workers and python -m subunit.run discover -t . networking_calico/tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhanz1
left a comment
There was a problem hiding this comment.
Make sense to me and thanks for fixing this!
Two complementary fixes for the failure mode where the election key is left in etcd after the previous master has gone away. Without them, a restarted neutron-server has to wait out the full lease TTL (60s) before any node can win the next election.
Stale-key cleanup on read (
election._check_master_process). When_votereads the current election key and the value parses as<server_id>:<pid>, look up/proc/<pid>; if the process is no longer running on this host, CAS-delete the key against the observed value and restart the election. This is the equivalent of the pre-12668 _check_master_process, restored verbatim aside from docstring and wrapping tidy-ups. Covers SIGKILL / silent greenlet death of a previous master on this host.Clean step-down on graceful shutdown (
CalicoManagerWorker.stop).CalicoManagerWorkernow keeps a back-reference to the mech driver and callsdriver.elector.stop()before chaining to the baseBaseWorker.stop(). The elector's stop() blocks until its greenlet has exited, which is what runs thefinally: _attempt_step_down()that deletes the election key. Without this hook the elector greenlet was just killed, the finally never ran, and the key lingered. Covers SIGTERM-initiated shutdown of the master.Unit tests in test_election.py cover the /proc cleanup paths: live PID (no-op), dead PID with successful CAS-delete, dead PID with CAS-mismatch (restart), dead PID with etcd exception (restart), different host (no /proc check), and unparseable key value (warn only).
Resolves CORE-13033.