-
Notifications
You must be signed in to change notification settings - Fork 1.6k
networking-calico: clean up stale election keys on both ends #13069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
abaefed
a5a679c
9718ece
e0267b6
c31426c
f8c2c1d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -98,14 +98,40 @@ class CalicoManagerWorker(worker.BaseWorker): | |
| """Service for doing election and compaction. | ||
|
|
||
| The super class will trigger the post_fork_initialize in the mech driver. | ||
|
|
||
| Holds a back-reference to the mech driver so stop() can ask the elector | ||
| to step down cleanly. Without that, neutron-server's SIGTERM would just | ||
| kill the elector greenlet, the ``finally: _attempt_step_down()`` block | ||
| would not run, and the election key would linger in etcd until its lease | ||
| expired -- adding ttl seconds of delay before any node could win the | ||
| next election. | ||
| """ | ||
|
|
||
| def __init__(self, driver=None, *args, **kwargs): | ||
| super(CalicoManagerWorker, self).__init__(*args, **kwargs) | ||
| self._driver = driver | ||
|
|
||
| def start(self, name="calico-manager", desc=None): | ||
| """Start service.""" | ||
| super(CalicoManagerWorker, self).start(name, desc) | ||
|
|
||
| def stop(self): | ||
| """Stop service.""" | ||
| """Stop service. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This makes sense. IIUC (correct me if I'm wrong), when we do something like
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I believe so. So when a shutdown is "graceful" in this way, the master elector deletes its etcd key. However, in case that doesn't happen (for any reason), we also reinstate the check_master_process logic so that a stale etcd key can be immediately identified when the neutron server starts again. |
||
|
|
||
| Step down from mastership before chaining to the base ``stop()``. | ||
| The elector's stop() blocks until its greenlet has exited, which is | ||
| what runs the ``finally: _attempt_step_down()`` that deletes the | ||
| election key in etcd. Guarded against the case where stop() fires | ||
| before post_fork_initialize has populated ``driver.elector``. | ||
| """ | ||
| elector = getattr(self._driver, "elector", None) if self._driver else None | ||
| if elector is not None: | ||
| try: | ||
| elector.stop() | ||
| except Exception: | ||
| # Best-effort -- we are already on the shutdown path, and | ||
| # the key will expire with its lease either way. | ||
| LOG.exception("Error stopping elector during worker shutdown") | ||
| super(CalicoManagerWorker, self).stop() | ||
|
nelljerram marked this conversation as resolved.
|
||
|
|
||
| def wait(self): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,121 @@ | ||
| # -*- coding: utf-8 -*- | ||
| # Copyright (c) 2026 Tigera, Inc. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or | ||
| # implied. See the License for the specific language governing | ||
| # permissions and limitations under the License. | ||
|
|
||
| """Unit tests for ``CalicoManagerWorker.stop()``. | ||
|
|
||
| The shutdown path is the regression fix in | ||
| https://github.com/projectcalico/calico/pull/13069 -- ``stop()`` must ask | ||
| the elector to step down so its ``finally: _attempt_step_down()`` clause | ||
| runs and the election key is removed from etcd promptly, rather than | ||
| lingering for the lease ttl. These tests verify all branches of that | ||
| path: elector present, elector absent, and elector raising. | ||
|
|
||
| Lives under ``networking_calico/tests/`` rather than | ||
| ``networking_calico/plugins/ml2/drivers/calico/test/`` because the | ||
| latter's ``lib.py`` replaces ``sys.modules['neutron_lib.worker']`` with | ||
| a MagicMock at import time -- which collapses | ||
| ``CalicoManagerWorker`` itself into a MagicMock and makes its real | ||
| shutdown code unreachable. Running here keeps ``neutron_lib.worker`` | ||
| real (the two test directories are run in separate subunit processes | ||
| per ``.testr.conf``). | ||
| """ | ||
|
|
||
| import unittest | ||
|
|
||
| import mock | ||
|
|
||
| from networking_calico.plugins.ml2.drivers.calico import workers | ||
|
|
||
|
|
||
| class TestCalicoManagerWorkerStop(unittest.TestCase): | ||
| """Verify ``CalicoManagerWorker.stop()`` shutdown semantics.""" | ||
|
|
||
| def setUp(self): | ||
| super(TestCalicoManagerWorkerStop, self).setUp() | ||
| # Patch the parent's stop() so we can assert it gets chained to | ||
| # without dragging in the real oslo_service shutdown plumbing. | ||
| # create=True because BaseWorker doesn't define stop() itself -- | ||
| # it inherits the abstract method from oslo_service.ServiceBase. | ||
| self.super_stop_p = mock.patch.object( | ||
| workers.worker.BaseWorker, "stop", create=True | ||
| ) | ||
| self.super_stop = self.super_stop_p.start() | ||
| self.addCleanup(self.super_stop_p.stop) | ||
|
|
||
| def _make_worker(self, driver): | ||
| # set_proctitle='off' avoids touching the real process title in | ||
| # the test runner. | ||
| return workers.CalicoManagerWorker(driver=driver, set_proctitle="off") | ||
|
|
||
| def test_stop_steps_down_elector_then_chains_to_super(self): | ||
| # Happy path: driver.elector is set, so stop() must call | ||
| # elector.stop() exactly once, then super().stop(). | ||
| driver = mock.Mock() | ||
| driver.elector = mock.Mock() | ||
| worker = self._make_worker(driver) | ||
|
|
||
| worker.stop() | ||
|
|
||
| driver.elector.stop.assert_called_once_with() | ||
| self.super_stop.assert_called_once_with() | ||
|
|
||
| def test_stop_tolerates_elector_attribute_missing(self): | ||
| # post_fork_initialize hasn't run yet -- driver has no `elector` | ||
| # attribute. stop() must not raise and must still chain to | ||
| # super(). | ||
| driver = mock.Mock(spec=[]) # empty spec => no elector attr | ||
| worker = self._make_worker(driver) | ||
|
|
||
| worker.stop() # must not raise | ||
|
|
||
| self.super_stop.assert_called_once_with() | ||
|
|
||
| def test_stop_tolerates_driver_is_none(self): | ||
| # Defensive case: _driver itself is None (worker was somehow | ||
| # constructed without a driver back-reference). | ||
| worker = self._make_worker(driver=None) | ||
|
|
||
| worker.stop() # must not raise | ||
|
|
||
| self.super_stop.assert_called_once_with() | ||
|
|
||
| def test_stop_tolerates_elector_being_none(self): | ||
| # post_fork_initialize cleared the elector (or never set it), | ||
| # leaving driver.elector = None. | ||
| driver = mock.Mock() | ||
| driver.elector = None | ||
| worker = self._make_worker(driver) | ||
|
|
||
| worker.stop() # must not raise | ||
|
|
||
| self.super_stop.assert_called_once_with() | ||
|
|
||
| def test_stop_swallows_elector_stop_exception(self): | ||
| # elector.stop() can fail (e.g. etcd unreachable during shutdown). | ||
| # The exception must be logged but not propagate, and super().stop() | ||
| # must still be called so the worker process exits cleanly. | ||
| driver = mock.Mock() | ||
| driver.elector.stop.side_effect = RuntimeError("etcd unreachable") | ||
| worker = self._make_worker(driver) | ||
|
|
||
| with mock.patch.object(workers.LOG, "exception") as mock_log_exc: | ||
| worker.stop() # must not raise | ||
|
|
||
| mock_log_exc.assert_called_once() | ||
| self.super_stop.assert_called_once_with() | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| unittest.main() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "master" process over here is really the manager process right? If we run this function in the manager process itself (and it somehow died/malfunctioned), then presumably this function will not run either?
This was working before, IIUC, because each process can become master and they can check the master process is alive or not. Now with just one process can be master, we kinda lose this ability. Therefore, for checking malfunctioning/failures//crashes, I would think we should rely on TTL? It would probably make sense to make it configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the master process is the "manager" process, because that is the one that includes the Elector.
I'm afraid I don't understand your query here. The specific system test scenario here was the neutron-server being restarted in order to pick up a config change. So the manager process in the old neutron-server is killed, and then a new manager process runs in the new neutron-server.
No, I don't think that's right. It was working before because of the
_check_master_processlogic. That was removed in #12668 , and this PR now reinstates it.What TTL do you have in mind here? I'm afraid I don't understand your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see what you are referring to here, I believe I misunderstood the intention of this function. Before Chao & I made any changes, IIUC, all neutron-server processes will run
_post_fork_initand therefore runElector. One of them will be elected as leader, and the rest will just do this_check_master_process. With this design, if the old leader process somehow died, then other processes can quickly delete the key and restart the election. When I was writing #12668, I thought this was the intention and hence removed it (because there will only be one process running Elector). I did not think about the restarts.In terms of the TTL, I mean the lease TTL (
MASTER_TIMEOUT). I think it would be helpful to make it configurable so that in case of a machine failure, where no one will delete the election key, we can shrink the time it takes for other machines to step in and become master.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it might be useful for MASTER_TIMEOUT and MASTER_REFRESH_INTERVAL to be configurable. But I think that is independent of the current PR, isn't it? 10s is already quite low for MASTER_REFRESH_INTERVAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ofc, this can be a separate PR. While the refresh interval is 10s, I think it's the 60s timeout who's blocking longer for other neutron-server to step up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but (in case it's not clear) the current PR already fixes that. When the Neutron server restarts:
_init_start_calico_manager()->self.elector.start()_vote, which finds the key and calls_check_master_process_check_master_processparses successfully, finds that host matches its own, but PID no longer running, and so deletes the key._votenow either sees the delete, orKeyNotFound, and so calls_become_master.The important point is that that all happens immediately without waiting for any timeout or refresh interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is happy path where the
neutron-serveris restarted and I think we are on the same page here :D. I'm more referring to when there is a machine failure (i.e., the machine crashed) and thus theneutron-serveris never restarted (because it can't), and it would then take this 60s for other machines runningneutron-serverto become master - and in that case if we want to reduce the downtime, we'll need to reduce the 60s - but this should be another PR as we discussed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense. Would you like to prepare that PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do when I get a chance.