Skip to content

networking-calico: clean up stale election keys on both ends#13069

Open
nelljerram wants to merge 6 commits into
projectcalico:masterfrom
nelljerram:core-13033-election-fixes
Open

networking-calico: clean up stale election keys on both ends#13069
nelljerram wants to merge 6 commits into
projectcalico:masterfrom
nelljerram:core-13033-election-fixes

Conversation

@nelljerram

Copy link
Copy Markdown
Member

Two complementary fixes for the failure mode where the election key is left in etcd after the previous master has gone away. Without them, a restarted neutron-server has to wait out the full lease TTL (60s) before any node can win the next election.

  1. Stale-key cleanup on read (election._check_master_process). When _vote reads the current election key and the value parses as <server_id>:<pid>, look up /proc/<pid>; if the process is no longer running on this host, CAS-delete the key against the observed value and restart the election. This is the equivalent of the pre-12668 _check_master_process, restored verbatim aside from docstring and wrapping tidy-ups. Covers SIGKILL / silent greenlet death of a previous master on this host.

  2. Clean step-down on graceful shutdown (CalicoManagerWorker.stop). CalicoManagerWorker now keeps a back-reference to the mech driver and calls driver.elector.stop() before chaining to the base BaseWorker.stop(). The elector's stop() blocks until its greenlet has exited, which is what runs the finally: _attempt_step_down() that deletes the election key. Without this hook the elector greenlet was just killed, the finally never ran, and the key lingered. Covers SIGTERM-initiated shutdown of the master.

Unit tests in test_election.py cover the /proc cleanup paths: live PID (no-op), dead PID with successful CAS-delete, dead PID with CAS-mismatch (restart), dead PID with etcd exception (restart), different host (no /proc check), and unparseable key value (warn only).

Resolves CORE-13033.

Two complementary fixes for the failure mode where the election key is
left in etcd after the previous master has gone away.  Without them, a
restarted neutron-server has to wait out the full lease TTL (60s) before
any node can win the next election.

1. Stale-key cleanup on read (`election._check_master_process`).  When
   `_vote` reads the current election key and the value parses as
   `<server_id>:<pid>`, look up `/proc/<pid>`; if the process is no
   longer running on this host, CAS-delete the key against the observed
   value and restart the election.  This is the equivalent of the
   pre-12668 _check_master_process, restored verbatim aside from
   docstring and wrapping tidy-ups.  Covers SIGKILL / silent greenlet
   death of a previous master on this host.

2. Clean step-down on graceful shutdown (`CalicoManagerWorker.stop`).
   `CalicoManagerWorker` now keeps a back-reference to the mech driver
   and calls `driver.elector.stop()` before chaining to the base
   `BaseWorker.stop()`.  The elector's stop() blocks until its greenlet
   has exited, which is what runs the `finally: _attempt_step_down()`
   that deletes the election key.  Without this hook the elector
   greenlet was just killed, the finally never ran, and the key
   lingered.  Covers SIGTERM-initiated shutdown of the master.

Unit tests in test_election.py cover the /proc cleanup paths:
live PID (no-op), dead PID with successful CAS-delete, dead PID with
CAS-mismatch (restart), dead PID with etcd exception (restart),
different host (no /proc check), and unparseable key value (warn only).

Resolves CORE-13033.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 24, 2026 14:21
@nelljerram nelljerram requested a review from a team as a code owner June 24, 2026 14:21
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Jun 24, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 24, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented cherry-pick-candidate labels Jun 24, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a leader-election failure mode in the Neutron Calico ML2 driver where an election key can be left behind in etcd after the elected master disappears, forcing subsequent elections to wait for the full lease TTL before progressing.

Changes:

  • Add stale-election-key cleanup when reading the current master key by checking /proc/<pid> for same-host master IDs and CAS-deleting the key if the process is gone.
  • Ensure graceful shutdown triggers a clean step-down by having CalicoManagerWorker.stop() call driver.elector.stop() before chaining to the base worker shutdown.
  • Add unit tests for the new _check_master_process stale-key cleanup behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py Adds a mech-driver back-reference and stops the elector during worker shutdown to ensure step-down cleanup runs.
networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Passes the mechanism driver into CalicoManagerWorker so the worker can access self.elector at shutdown.
networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py Introduces _check_master_process() and invokes it on initial election-key read to proactively remove stale same-host election keys.
networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py Adds direct unit coverage for _check_master_process() across success and failure paths.

@nelljerram

Copy link
Copy Markdown
Member Author

@zhanz1 It has turned out that we need to reinstate the _check_master_process logic; otherwise there can be too long a gap in agent status watching when the Neutron server restarts - which in turn can lead to a Felix being considered as dead, and Neutron refusing to bind a port on that Felix's hypervisor.

Please could you review this PR and let me know your thoughts?

Cover the three branches the shutdown path can take:

 - happy path: driver.elector is set, so stop() calls elector.stop()
   exactly once and then chains to super().stop();
 - elector absent: _driver is None / driver has no `elector` attribute
   / driver.elector is None -- stop() must not raise and must still
   chain to super();
 - elector.stop() raises: the exception is logged and swallowed (we
   are on the shutdown path; the election key will expire with its
   lease either way) and super().stop() is still called.

Without this, the regression-fix from
abaefed ("networking-calico: clean up stale election keys on both
ends") is unguarded.  Raised in review of projectcalico#13069.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@zhanz1 zhanz1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and I think overall make sense, have some comments.


def stop(self):
"""Stop service."""
"""Stop service.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. IIUC (correct me if I'm wrong), when we do something like systemctl stop neutron-server, it will send a SIGINT to all neutron-server processes, which each process's handler will catch that and call stop?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. So when a shutdown is "graceful" in this way, the master elector deletes its etcd key. However, in case that doesn't happen (for any reason), we also reinstate the check_master_process logic so that a stale etcd key can be immediately identified when the neutron server starts again.

)
self._become_master()

def _check_master_process(self, master_id):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "master" process over here is really the manager process right? If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?

This was working before, IIUC, because each process can become master and they can check the master process is alive or not. Now with just one process can be master, we kinda lose this ability. Therefore, for checking malfunctioning/failures//crashes, I would think we should rely on TTL? It would probably make sense to make it configurable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the master process is the "manager" process, because that is the one that includes the Elector.

If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?

I'm afraid I don't understand your query here. The specific system test scenario here was the neutron-server being restarted in order to pick up a config change. So the manager process in the old neutron-server is killed, and then a new manager process runs in the new neutron-server.

This was working before, IIUC, because each process can become master and they can check the master process is alive or not. Now with just one process can be master, we kinda lose this ability.

No, I don't think that's right. It was working before because of the _check_master_process logic. That was removed in #12668 , and this PR now reinstates it.

Therefore, for checking malfunctioning/failures//crashes, I would think we should rely on TTL?

What TTL do you have in mind here? I'm afraid I don't understand your suggestion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you are referring to here, I believe I misunderstood the intention of this function. Before Chao & I made any changes, IIUC, all neutron-server processes will run _post_fork_init and therefore run Elector. One of them will be elected as leader, and the rest will just do this _check_master_process. With this design, if the old leader process somehow died, then other processes can quickly delete the key and restart the election. When I was writing #12668, I thought this was the intention and hence removed it (because there will only be one process running Elector). I did not think about the restarts.

In terms of the TTL, I mean the lease TTL (MASTER_TIMEOUT). I think it would be helpful to make it configurable so that in case of a machine failure, where no one will delete the election key, we can shrink the time it takes for other machines to step in and become master.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it might be useful for MASTER_TIMEOUT and MASTER_REFRESH_INTERVAL to be configurable. But I think that is independent of the current PR, isn't it? 10s is already quite low for MASTER_REFRESH_INTERVAL.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ofc, this can be a separate PR. While the refresh interval is 10s, I think it's the 60s timeout who's blocking longer for other neutron-server to step up?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but (in case it's not clear) the current PR already fixes that. When the Neutron server restarts:

  1. It immediately creates the ManagerWorker process, -> _init_start_calico_manager() -> self.elector.start()
  2. Elector calls _vote, which finds the key and calls _check_master_process
  3. _check_master_process parses successfully, finds that host matches its own, but PID no longer running, and so deletes the key.
  4. _vote now either sees the delete, or KeyNotFound, and so calls _become_master.

The important point is that that all happens immediately without waiting for any timeout or refresh interval.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is happy path where the neutron-server is restarted and I think we are on the same page here :D. I'm more referring to when there is a machine failure (i.e., the machine crashed) and thus the neutron-server is never restarted (because it can't), and it would then take this 60s for other machines running neutron-server to become master - and in that case if we want to reduce the downtime, we'll need to reduce the 60s - but this should be another PR as we discussed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. Would you like to prepare that PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do when I get a chance.

nelljerram and others added 3 commits June 24, 2026 23:43
The 5 TestCalicoManagerWorkerStop tests were failing when run via
subunit.run discover under the plugins/ml2/drivers/calico/test/
directory, while passing when run via unittest in isolation.

Root cause: that directory's lib.py replaces
sys.modules['neutron_lib.worker'] with a MagicMock at module-import
time, so that when sibling tests (test_compaction, test_election,
test_mech_calico, ...) load lib first, the real
neutron_lib.worker.BaseWorker is never imported.  When workers.py is
then imported via mech_calico's transitive chain, its `worker.BaseWorker`
reference is a Mock attribute, and `class CalicoManagerWorker(BaseWorker)`
ends up collapsing into a MagicMock instead of becoming a real class.
At that point our test patches against a Mock and our super().stop()
chain assertions cannot fire.

Move the test file to the sibling networking_calico/tests/ directory,
where lib.py is not imported and neutron_lib.worker stays real.  The
two test directories run in separate subunit processes per
.testr.conf, so process isolation guarantees this fix holds regardless
of test ordering.  Also documents the rationale in the file's
docstring so future readers don't move it back.

Verified: all 5 tests pass under both
  python -m unittest -v networking_calico.tests.test_workers
and
  python -m subunit.run discover -t . networking_calico/tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zhanz1 zhanz1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me and thanks for fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants