networking-calico: clean up stale election keys on both ends by nelljerram · Pull Request #13069 · projectcalico/calico

nelljerram · 2026-06-24T14:21:01Z

Two complementary fixes for the failure mode where the election key is left in etcd after the previous master has gone away. Without them, a restarted neutron-server has to wait out the full lease TTL (60s) before any node can win the next election.

1. Stale-key cleanup on read (`election._check_master_process`). When `_vote` reads the current election key and the value parses as `<server_id>:<pid>`, look up `/proc/<pid>`; if the process is no longer running on this host, CAS-delete the key against the observed value and restart the election. This is the equivalent of the pre-12668 `_check_master_process`, restored verbatim aside from docstring and wrapping tidy-ups. Covers SIGKILL / silent greenlet death of a previous master on this host.
2. Clean step-down on graceful shutdown (`CalicoManagerWorker.stop`). `CalicoManagerWorker` now keeps a back-reference to the mech driver and calls `driver.elector.stop()` before chaining to the base `BaseWorker.stop()`. The elector's `stop()` blocks until its greenlet has exited, which is what runs the `finally: _attempt_step_down()` that deletes the election key. Without this hook the elector greenlet was just killed, the `finally` never ran, and the key lingered. Covers SIGTERM-initiated shutdown of the master.
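The stale-key check described in the first fix can be sketched roughly as follows. This is a minimal illustration, not the code from `election.py`: `FakeEtcd`, `delete_if_value`, `check_master_process`, the string return values and the injectable `pid_exists` callable are all hypothetical names introduced here so the sketch runs without etcd.

```python
import os


class FakeEtcd:
    """Hypothetical in-memory stand-in for etcd with a compare-and-delete."""

    def __init__(self):
        self.data = {}

    def delete_if_value(self, key, expected):
        # CAS-delete: remove the key only if it still holds the observed value.
        if self.data.get(key) == expected:
            del self.data[key]
            return True
        return False


def _proc_pid_exists(pid):
    # Linux-specific liveness check, as described in the PR text.
    return os.path.exists("/proc/%d" % pid)


def check_master_process(store, key, master_id, server_id, pid_exists=None):
    """Return 'ok', 'restart' or 'unparseable' for the observed master_id."""
    if pid_exists is None:
        pid_exists = _proc_pid_exists
    try:
        host, pid_text = master_id.rsplit(":", 1)
        pid = int(pid_text)
    except ValueError:
        # Value is not <server_id>:<pid>; the PR says this case only warns.
        return "unparseable"
    if host != server_id:
        # Master lives on another host, so our /proc says nothing about it.
        return "ok"
    if pid_exists(pid):
        return "ok"
    # Dead PID on this host: try the CAS-delete, then restart the election
    # whether or not the delete matched.
    store.delete_if_value(key, master_id)
    return "restart"
```

A caller that gets `"restart"` back would re-enter the election immediately instead of waiting for the lease to expire.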

Unit tests in test_election.py cover the /proc cleanup paths: live PID (no-op), dead PID with successful CAS-delete, dead PID with CAS-mismatch (restart), dead PID with etcd exception (restart), different host (no /proc check), and unparseable key value (warn only).

Resolves CORE-13033.

Copilot

Pull request overview

This PR addresses a leader-election failure mode in the Neutron Calico ML2 driver where an election key can be left behind in etcd after the elected master disappears, forcing subsequent elections to wait for the full lease TTL before progressing.

Changes:

- Add stale-election-key cleanup when reading the current master key by checking `/proc/<pid>` for same-host master IDs and CAS-deleting the key if the process is gone.
- Ensure graceful shutdown triggers a clean step-down by having `CalicoManagerWorker.stop()` call `driver.elector.stop()` before chaining to the base worker shutdown.
- Add unit tests for the new `_check_master_process` stale-key cleanup behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py` | Adds a mech-driver back-reference and stops the elector during worker shutdown to ensure step-down cleanup runs. |
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py` | Passes the mechanism driver into `CalicoManagerWorker` so the worker can access `self.elector` at shutdown. |
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py` | Introduces `_check_master_process()` and invokes it on initial election-key read to proactively remove stale same-host election keys. |
| `networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py` | Adds direct unit coverage for `_check_master_process()` across success and failure paths. |

nelljerram · 2026-06-24T15:41:36Z

@zhanz1 It has turned out that we need to reinstate the _check_master_process logic; otherwise there can be too long a gap in agent status watching when the Neutron server restarts - which in turn can lead to a Felix being considered as dead, and Neutron refusing to bind a port on that Felix's hypervisor.

Please could you review this PR and let me know your thoughts?

Cover the three branches the shutdown path can take:

- happy path: driver.elector is set, so stop() calls elector.stop() exactly once and then chains to super().stop();
- elector absent: _driver is None / driver has no `elector` attribute / driver.elector is None -- stop() must not raise and must still chain to super();
- elector.stop() raises: the exception is logged and swallowed (we are on the shutdown path; the election key will expire with its lease either way) and super().stop() is still called.

Without this, the regression-fix from abaefed ("networking-calico: clean up stale election keys on both ends") is unguarded.

Raised in review of projectcalico#13069.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
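The three shutdown branches listed in this commit message could look something like the sketch below. It is an illustration under assumptions, not the contents of `workers.py`: `BaseWorker` here is a local stand-in for the real base class, and `ManagerWorker`, `RecordingElector` and `FakeDriver` are hypothetical names.

```python
import logging

LOG = logging.getLogger(__name__)


class BaseWorker:
    """Local stand-in for the real base worker class (assumption)."""

    def __init__(self):
        self.base_stopped = False

    def stop(self):
        self.base_stopped = True


class ManagerWorker(BaseWorker):
    """Hypothetical worker that steps the elector down before stopping."""

    def __init__(self, driver=None):
        super().__init__()
        self._driver = driver

    def stop(self):
        # Covers driver is None, driver without an elector attribute,
        # and driver.elector being None.
        elector = getattr(self._driver, "elector", None)
        if elector is not None:
            try:
                elector.stop()
            except Exception:
                # Shutdown path: log and carry on; the key expires with
                # its lease either way.
                LOG.exception("Failed to stop elector during shutdown")
        super().stop()


class RecordingElector:
    """Test double that counts stop() calls and can be made to fail."""

    def __init__(self, fail=False):
        self.fail = fail
        self.stop_calls = 0

    def stop(self):
        self.stop_calls += 1
        if self.fail:
            raise RuntimeError("simulated failure")


class FakeDriver:
    def __init__(self, elector=None):
        self.elector = elector
```

The base `stop()` is called unconditionally at the end, so a failure in the elector cannot prevent the worker itself from shutting down.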

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

zhanz1

Thanks for the fix. I think it makes sense overall; I have some comments.

zhanz1 · 2026-06-24T17:23:49Z


    def stop(self):
-        """Stop service."""
+        """Stop service.


This makes sense. IIUC (correct me if I'm wrong), when we do something like systemctl stop neutron-server, it sends SIGTERM to all neutron-server processes, and each process's handler catches that and calls stop?

Yes, I believe so. So when a shutdown is "graceful" in this way, the master elector deletes its etcd key. However, in case that doesn't happen (for any reason), we also reinstate the `_check_master_process` logic so that a stale etcd key can be immediately identified when the neutron server starts again.
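To illustrate why a graceful stop matters, here is a minimal sketch in which the step-down lives in a `finally` block that only runs if the loop is allowed to exit. It uses a `threading.Thread` and a plain dict purely for illustration; the real elector runs in a greenlet against etcd, and the `Elector` class, its attributes and the 10 ms poll interval here are all assumptions.

```python
import threading


class Elector:
    """Hypothetical elector whose cleanup runs in a finally block."""

    def __init__(self, store, key, value):
        self.store = store
        self.key = key
        self.value = value
        self.became_master = threading.Event()
        self._stop_requested = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        try:
            self.store[self.key] = self.value  # win the election
            self.became_master.set()
            while not self._stop_requested.wait(0.01):
                pass  # a real elector would refresh its lease here
        finally:
            # Step down: delete the election key on the way out.
            self.store.pop(self.key, None)

    def stop(self):
        # Ask the loop to exit, then block until the finally has run.
        self._stop_requested.set()
        self._thread.join()
```

If the thread were simply abandoned instead of stopped and joined, the `finally` would not be guaranteed to run before the process exits, and the key would stay behind until its lease expired.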

zhanz1 · 2026-06-24T17:27:43Z

                )
                self._become_master()

+    def _check_master_process(self, master_id):


The "master" process here is really the manager process, right? If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?

This was working before, IIUC, because each process could become master and could check whether the master process was alive. Now that just one process can be master, we kind of lose this ability. Therefore, for detecting malfunctions/failures/crashes, I would think we should rely on the TTL? It would probably make sense to make it configurable.

Yes, the master process is the "manager" process, because that is the one that includes the Elector.

> If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?

I'm afraid I don't understand your query here. The specific system test scenario here was the neutron-server being restarted in order to pick up a config change. So the manager process in the old neutron-server is killed, and then a new manager process runs in the new neutron-server.

> This was working before, IIUC, because each process could become master and could check whether the master process was alive. Now that just one process can be master, we kind of lose this ability.

No, I don't think that's right. It was working before because of the _check_master_process logic. That was removed in #12668 , and this PR now reinstates it.

> Therefore, for detecting malfunctions/failures/crashes, I would think we should rely on the TTL?

What TTL do you have in mind here? I'm afraid I don't understand your suggestion.

Ah, I see what you are referring to here; I believe I misunderstood the intention of this function. Before Chao & I made any changes, IIUC, all neutron-server processes would run _post_fork_init and therefore run Elector. One of them would be elected as leader, and the rest would just do this _check_master_process. With this design, if the old leader process somehow died, the other processes could quickly delete the key and restart the election. When I was writing #12668, I thought this was the intention and hence removed it (because there will only be one process running Elector). I did not think about restarts.

In terms of the TTL, I mean the lease TTL (MASTER_TIMEOUT). I think it would be helpful to make it configurable so that in case of a machine failure, where no one will delete the election key, we can shrink the time it takes for other machines to step in and become master.

I agree it might be useful for MASTER_TIMEOUT and MASTER_REFRESH_INTERVAL to be configurable. But I think that is independent of the current PR, isn't it? 10s is already quite low for MASTER_REFRESH_INTERVAL.

Yeah, of course, this can be a separate PR. While the refresh interval is 10s, I think it's the 60s timeout that blocks other neutron-server instances from stepping up for longer?

Yes, but (in case it's not clear) the current PR already fixes that. When the Neutron server restarts:

1. It immediately creates the ManagerWorker process -> `_init_start_calico_manager()` -> `self.elector.start()`.
2. Elector calls `_vote`, which finds the key and calls `_check_master_process`.
3. `_check_master_process` parses the key successfully, finds that the host matches its own but that the PID is no longer running, and so deletes the key.
4. `_vote` now either sees the delete, or KeyNotFound, and so calls `_become_master`.

The important point is that all of this happens immediately, without waiting for any timeout or refresh interval.
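The restart sequence above can be walked through with a small simulation. This is a simplified sketch, not the real `_vote`: the dict stands in for etcd, `pid_alive` stands in for the `/proc/<pid>` lookup, and `vote_once` is a hypothetical name.

```python
def vote_once(store, key, my_id, pid_alive):
    """Run one simplified voting pass; return True if we became master.

    store is a plain dict standing in for etcd, and pid_alive is a
    callable standing in for the /proc/<pid> lookup (both assumptions).
    """
    my_host = my_id.rpartition(":")[0]
    current = store.get(key)  # the vote reads the existing key
    if current is not None:
        host, _, pid_text = current.rpartition(":")
        # Stale only if the key names this host and the PID is gone.
        stale = (
            host == my_host
            and pid_text.isdigit()
            and not pid_alive(int(pid_text))
        )
        # Delete only if the key still holds the value we observed.
        if stale and store.get(key) == current:
            del store[key]
            current = None
    if current is None:
        # Key absent (or just removed): claim mastership straight away.
        store[key] = my_id
        return True
    return False
```

With a stale same-host key in the store, a single call both removes the old key and installs the new master ID, with no sleep or lease wait in between. A key owned by another host is left alone, which is the case the lease TTL still has to cover.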

Yes, this is the happy path where the neutron-server is restarted, and I think we are on the same page here :D. I'm referring more to a machine failure (i.e., the machine crashed), where the neutron-server is never restarted (because it can't be). It would then take the full 60s for other machines running neutron-server to become master, and if we want to reduce the downtime in that case, we'll need to reduce the 60s. But this should be another PR, as we discussed.

Thanks, that makes sense. Would you like to prepare that PR?

Sure, can do when I get a chance.
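As a rough idea of what the follow-up discussed in this thread might involve, the sketch below models the two timings as validated settings. Everything here is hypothetical: the `ElectionTimings` name, the field names and the validation rule are assumptions, and a real change would presumably go through the driver's existing configuration mechanism. The defaults of 60s and 10s are the values mentioned in this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectionTimings:
    """Hypothetical container for the election timing settings."""

    master_timeout: int = 60           # lease TTL in seconds
    master_refresh_interval: int = 10  # how often the master refreshes

    def __post_init__(self):
        if self.master_refresh_interval <= 0:
            raise ValueError("refresh interval must be positive")
        # The lease must outlive at least one refresh, otherwise a
        # healthy master would lose its key between refreshes.
        if self.master_timeout <= self.master_refresh_interval:
            raise ValueError("timeout must exceed the refresh interval")
```

An operator who wanted faster failover after a machine crash could then pick, say, `ElectionTimings(master_timeout=20, master_refresh_interval=5)`, trading more frequent etcd writes for a shorter wait.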

The 5 TestCalicoManagerWorkerStop tests were failing when run via subunit.run discover under the plugins/ml2/drivers/calico/test/ directory, while passing when run via unittest in isolation.

Root cause: that directory's lib.py replaces sys.modules['neutron_lib.worker'] with a MagicMock at module-import time, so that when sibling tests (test_compaction, test_election, test_mech_calico, ...) load lib first, the real neutron_lib.worker.BaseWorker is never imported. When workers.py is then imported via mech_calico's transitive chain, its `worker.BaseWorker` reference is a Mock attribute, and `class CalicoManagerWorker(BaseWorker)` ends up collapsing into a MagicMock instead of becoming a real class. At that point our test patches against a Mock and our super().stop() chain assertions cannot fire.

Move the test file to the sibling networking_calico/tests/ directory, where lib.py is not imported and neutron_lib.worker stays real. The two test directories run in separate subunit processes per .testr.conf, so process isolation guarantees this fix holds regardless of test ordering.

Also documents the rationale in the file's docstring so future readers don't move it back.

Verified: all 5 tests pass under both `python -m unittest -v networking_calico.tests.test_workers` and `python -m subunit.run discover -t . networking_calico/tests`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
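The root cause described in this commit message can be reproduced in isolation. The sketch below uses a made-up module name, `hypothetical_worker_lib`, in place of the real library, and a trivial `ManagerWorker` class; it shows only the general gotcha that a class statement whose base is a `MagicMock` attribute does not produce a real class.

```python
import sys
from unittest import mock

MODULE = "hypothetical_worker_lib"  # made-up name, not a real package

# Replace the module with a MagicMock before anything imports it,
# as a shared test helper might do at import time.
sys.modules[MODULE] = mock.MagicMock()
try:
    import hypothetical_worker_lib as worker

    # worker.BaseWorker is a mock attribute, not a class, so Python
    # uses the mock's type as the metaclass for this class statement.
    class ManagerWorker(worker.BaseWorker):
        def stop(self):
            return "stopped"
finally:
    # Undo the substitution so later imports are unaffected.
    del sys.modules[MODULE]

is_real_class = isinstance(ManagerWorker, type)
is_mock = isinstance(ManagerWorker, mock.MagicMock)
```

On CPython, `ManagerWorker` here ends up as a mock object rather than a type, so patching its methods or asserting on a `super()` chain no longer exercises any real code. Running the affected tests in a process where the module is never replaced avoids the problem.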

zhanz1

Makes sense to me, and thanks for fixing this!

Copilot AI review requested due to automatic review settings June 24, 2026 14:21

nelljerram requested a review from a team as a code owner June 24, 2026 14:21

marvin-tigera added this to the Calico v3.33.0 milestone Jun 24, 2026

marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Jun 24, 2026

Copilot started reviewing on behalf of nelljerram June 24, 2026 14:21

github-actions Bot added the cherry-pick-candidate label Jun 24, 2026

Copilot AI reviewed Jun 24, 2026

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py

nelljerram requested a review from Copilot June 24, 2026 16:11

Copilot started reviewing on behalf of nelljerram June 24, 2026 16:11

Copilot AI reviewed Jun 24, 2026

zhanz1 reviewed Jun 24, 2026

nelljerram and others added 3 commits June 24, 2026 23:43

Merge branch 'fix-eventlet-monkey-patch-at-import' into core-13033-election-fixes

9718ece

Merge branch 'fix-eventlet-monkey-patch-at-import' into core-13033-election-fixes

c31426c

zhanz1 reviewed Jun 25, 2026

Merge remote-tracking branch 'origin/master' into core-13033-election-fixes

f8c2c1d

coutinhop approved these changes Jun 26, 2026

5 participants