networking-calico: Separate leader jobs into different processes by zhanz1 · Pull Request #12668 · projectcalico/calico

zhanz1 · 2026-04-30T22:00:15Z

WHY

This is an extension of #12582.

As clusters grow larger, it is hard for a single Python process to do resync, compaction, and status updating at the same time. To address this, let's separate the jobs into multiple processes.

WHAT

This change will introduce four new worker processes, each in charge of:

* CalicoResourceSyncerWorker: Sync resources from Neutron to etcd.
* CalicoManagerWorker: Do leader election and periodic compaction.
* CalicoAgentStatusWatcherWorker: Watch agent status updates and report them to Neutron.
* CalicoEndpointStatusWatcherWorker: Watch endpoint status updates and report them to Neutron.
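
One way to picture the split above is a table that maps each worker type to the long-running tasks its process starts after fork. The following is a minimal stand-alone sketch, not the code in this PR: `BaseWorker` is a stand-in for Neutron's worker base class, and the task names and the `tasks_for` helper are hypothetical.

```python
class BaseWorker:
    """Stand-in for Neutron's worker base class (assumption, not the real API)."""


class CalicoResourceSyncerWorker(BaseWorker):
    pass


class CalicoManagerWorker(BaseWorker):
    pass


class CalicoAgentStatusWatcherWorker(BaseWorker):
    pass


class CalicoEndpointStatusWatcherWorker(BaseWorker):
    pass


# Hypothetical task names, one tuple per worker type, mirroring the list above.
TASKS_BY_WORKER = {
    CalicoResourceSyncerWorker: ("resync",),
    CalicoManagerWorker: ("leader_election", "periodic_compaction"),
    CalicoAgentStatusWatcherWorker: ("watch_agent_status",),
    CalicoEndpointStatusWatcherWorker: ("watch_endpoint_status",),
}


def tasks_for(worker):
    """Return the task names a forked process should start for `worker`.

    Any other worker type (for example an API worker) gets no
    leader-only tasks at all.
    """
    return TASKS_BY_WORKER.get(type(worker), ())
```

With this shape, election and compaction share one process while each watcher gets its own, which matches the responsibilities listed above.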

HOW

Launch neutron-server with calico as the plugin and set up an OpenStack cluster.

TEST

Tested on a virtual OpenStack cluster with 3 instances of neutron-server; everything seems to be working:

# Leader election
### Node 1
2026-04-30 21:37:36.216 2219629 INFO networking_calico.plugins.ml2.drivers.calico.election [-] Successfully become master - key /calico/openstack/v2/no-region/neutron_election, value node-1:2219629
### Node 2
2026-04-30 21:37:38.576 2605430 INFO networking_calico.etcdv3 [-] etcdv3 get key=/calico/openstack/v2/no-region/neutron_election results=[(b'node-1:2219629', {'key': b'/calico/openstack/v2/no-region/neutron_election', 'create_revision': '5287496', 'mod_revision': '5287498', 'version': '2', 'lease': '8674386991848183953'})]

# Periodic resync
2026-04-30 21:45:49.593 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync
2026-04-30 21:48:09.642 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync
2026-04-30 21:50:29.484 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync

# Resync monitor
2026-04-30 21:43:29.522 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [-] I am master: monitoring periodic resync

# Compaction
2026-04-30 21:43:28.665 2605430 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:COMPACTION req-51698c4a-54f4-4537-b982-aab35936ee78 - - - - - -] I am master: doing periodic compaction

# Agent status
2026-04-30 21:56:20.202 2605432 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:3:STATUS_ETCD_WATCHER req-03c102c2-fdd6-4cd8-acf6-48ba45829e9f - - - - - -] Felix on host host-10 is alive; fanning out status report

# Port status
2026-04-30 21:58:03.728 2605433 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:7:STATUS_ETCD_WATCHER req-85d1dc7e-2ef7-4e35-b8d5-fdb0c99b1b7c - - - - - -] Status of port ('host-10', 'ba683f56-d996-4ad8-be1a-f5f294cb620d') on host host-10 changed to up

# Live migration
2026-04-30 21:44:27.425 2605433 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:9:UPDATE_PORT_POSTCOMMIT req-dbed5163-87e4-4b0f-89da-9200fe223980 - - - - - -] Live migration e394c4b2-0043-4009-bea8-db2101fa75e8: destination port 5f69b7d1-9a75-4831-bb0e-2e3a5dbdc038 active on host-10, notifying Nova

I also updated the unit tests to match the new design.

MISC

Release note:

Feature: Split OpenStack driver's leader-only tasks into multiple processes.

nelljerram · 2026-05-01T10:47:18Z

/sem-approve

nelljerram · 2026-05-01T13:18:27Z

/sem-approve

nelljerram · 2026-05-01T13:19:33Z

@zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.

zhanz1 · 2026-05-01T14:17:18Z

> @zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.

Thanks... I thought I had passed `make fmtpy`, but it seems something was missed.

nelljerram · 2026-05-01T14:41:31Z

> > @zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.
>
> Thanks... I thought I had passed `make fmtpy`, but it seems something was missed.

It's a bit confusing. `fmtpy` is an action that edits the files that need it but then returns success. Perhaps we should add a `make check-dirty` after the black line.
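
The idea of failing when a formatter has rewritten files can be sketched as a small check run after the formatting step. This is only an illustration of the concept, not how the repository's `check-dirty` target is implemented; the function names are hypothetical, and the only external call assumed is `git status --porcelain`.

```python
import subprocess
import sys


def tree_is_dirty(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` output lists any changed path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def check_dirty() -> int:
    """Return 1 if the working tree has uncommitted changes, else 0."""
    # --porcelain prints one line per modified or untracked file.
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    if tree_is_dirty(result.stdout):
        print("Formatter changed files; commit them and retry:", file=sys.stderr)
        print(result.stdout, file=sys.stderr, end="")
        return 1
    return 0
```

A wrapper would call `sys.exit(check_dirty())` right after the formatter, so a run that had to edit files no longer reports success.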

nelljerram · 2026-05-20T09:44:10Z

@zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

nelljerram · 2026-05-20T09:48:13Z

> @zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

BTW my plan will be to look again at #12456 after this PR has merged. I think we will find that a lot of #12456 is no longer needed, but there may still be some useful pieces there.

zhanz1 · 2026-05-20T13:03:23Z

> @zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

Yes, I'll work on it, no worries!

> BTW my plan will be to look again at #12456 after this PR has merged. I think we will find that a lot of #12456 is no longer needed, but there may still be some useful pieces there.

Yes, I think the most important part is how we will detect dead greenthreads after we split the work across multiple processes.

As clusters grow larger, it is hard for a single Python process to do resync, compaction, and status updating at the same time. To address this, separate the jobs into multiple processes.

This change will introduce four new worker processes, each in charge of:

* CalicoResourceSyncerWorker: Sync resources from Neutron to etcd.
* CalicoManagerWorker: Do leader election and periodic compaction.
* CalicoAgentStatusWatcherWorker: Watch agent status updates and report them to Neutron.
* CalicoEndpointStatusWatcherWorker: Watch endpoint status updates and report them to Neutron.

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

zhanz1 · 2026-05-20T17:49:47Z

@nelljerram I've rebased and also modified relevant naming and comments a little bit (e.g., voting is no longer a thing), feel free to take a look, thanks!

nelljerram · 2026-05-26T10:07:05Z

/sem-approve

Copilot

Pull request overview

This PR extends the networking-calico Neutron ML2 driver architecture by splitting “leader-only” responsibilities into dedicated neutron-server worker processes (manager/election+compaction, agent status watching, endpoint status watching, plus the existing startup resync worker). The goal is to reduce contention in large OpenStack clusters by avoiding a single process doing all periodic and watcher work.

Changes:

* Add new Neutron BaseWorker marker classes for manager and status-watcher worker processes.
* Refactor CalicoMechanismDriver to initialize common state post-fork and to start role-specific greenlets per worker type; introduce a process-shared “is master” timestamp updated by the elector.
* Split StatusWatcher into AgentStatusWatcher and EndpointStatusWatcher, and update unit tests accordingly.
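
The process-shared "is master" timestamp mentioned above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: it assumes a `multiprocessing.Value` created before the workers fork, and the class name `SharedMasterFlag` and the threshold `MASTER_STALE_AFTER_SECS` are hypothetical. The local name `time_since_last_refreshed` follows the rename requested later in review.

```python
import multiprocessing
import time

# Hypothetical staleness threshold in seconds; the real value is not shown here.
MASTER_STALE_AFTER_SECS = 30.0


class SharedMasterFlag:
    """Process-shared record of when this node last confirmed mastership."""

    def __init__(self):
        # "d" is a C double in shared memory; 0.0 means "never refreshed".
        # Creating it before fork lets every worker process see the same value.
        self._last_refreshed = multiprocessing.Value("d", 0.0)

    def refresh(self, now=None):
        """Called by the elector each time it wins or renews the election."""
        with self._last_refreshed.get_lock():
            self._last_refreshed.value = time.time() if now is None else now

    def is_master(self, now=None):
        """Called by other workers before doing leader-only work."""
        now = time.time() if now is None else now
        with self._last_refreshed.get_lock():
            last = self._last_refreshed.value
        if last == 0.0:
            return False
        time_since_last_refreshed = now - last
        return time_since_last_refreshed < MASTER_STALE_AFTER_SECS
```

Because freshness is time-based, a worker stops treating its node as master once the elector stops refreshing the value, even if the elector died without clearing it.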

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py | Adds new worker classes intended to map to separate neutron-server forked processes. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py | Dispatches worker responsibilities post-fork; adds process-shared master tracking and new init/start helpers. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py | Splits status watching into agent vs endpoint watcher subclasses. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py | Updates elector to publish master “freshness” via a shared value; removes old in-process master flag. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py | Updates plugin tests to match the new watcher/worker structure. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_mech_calico.py | Updates mech driver init tests for new init/start helper methods. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py | Updates election tests for the new elector API and master signaling. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py | Adjusts test stubs/mocks to support `Elector.run()` and `driver.is_master()`. |

nelljerram

I haven't reviewed the test changes in full detail yet, but I think I have enough comments queued up to be worth releasing.

Overall, I really like the shape of this change, so thanks for proposing it. Just some detailed comments...

nelljerram · 2026-05-26T10:22:42Z

+    def _init_and_start_calico_resouce_syncer(self):
+        self.start_up_resync_thread = eventlet.spawn(self._do_startup_resync)
+
+    def _init_and_start_calico_manager(self):


Can we call this "elector" instead of "manager"?

Oh I see, you haven't called it "elector" because it also does compaction. WDYT about making another separate worker process for compaction? That's effectively what will happen when eventlet is removed anyway, and I think it would be cleaner already.

We can, but to me it feels a bit wasteful to spawn a standalone process that only does a check every X seconds. Even with eventlet removed, I feel like we should just turn this into two threads for the same reason. These two components (Elector and compaction) won't really scale as clusters grow larger.


nelljerram · 2026-06-11T23:15:04Z

@zhanz1 I have prepared zhanz1#1 to merge latest master and resolve the conflicts on this PR. If you think that looks good, please merge it; then I'll re-approve workflows and re-review after that.


zhanz1 · 2026-06-12T13:53:17Z

> @zhanz1 I have prepared zhanz1#1 to merge latest master and resolve the conflicts on this PR. If you think that looks good, please merge it; then I'll re-approve workflows and re-review after that.

Many thanks! I have merged it.

nelljerram

Looking good. I still need to review test_plugin_etcd.py, but the comments below cover everything else. I'll also kick off CI now...

nelljerram · 2026-06-15T18:11:30Z

/sem-approve

nelljerram

Two small points for test_plugin_etcd.py

nelljerram · 2026-06-16T12:23:13Z

@zhanz1 Thanks so much for your work on this. CI is looking good, and there are just a few remaining small points:

* removing subclass methods that seem to be unnecessary in the worker classes
* var rename -> time_since_last_refreshed
* removing an eventlet.spawn mock that appears to be not needed
* understanding the c[0] == "" condition.

Happy with everything else, and looking forward to merging this!


nelljerram · 2026-06-17T09:57:39Z

/sem-approve

If the StatusWatcher is processing Felix uptime updates whose embedded "time" field is materially older than wall-clock now, this is evidence that we are running behind the rate of updates Felix is producing and a backlog is building up. Customers have hit this in production: Neutron ends up seeing agent up/down transitions hours after they actually happened, and the existing logs give no early warning while the backlog is growing.

Add a rate-limited WARNING in AgentStatusWatcher._on_status_set -- skipped during initial-snapshot replay where old timestamps are expected -- so operators can see the backlog building up long before it grows to hours.

This is the operator-facing piece of the CI-1892 hardening work that still stands on its own merits after PR #12668 split the mech driver into per-process workers. The other defensive fixes from that branch (elector watchdog, etcd-confirmed mastership) no longer carry their weight: the periodic-resync loop they were targeted at is gone, and the existing time-based check in is_master() already catches stale-elector mastership for the continuous loops that remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
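
The rate-limited staleness warning described above could look roughly like this. It is a hedged sketch of the technique only, not the code merged in #13024: the class name `StaleUpdateWarner`, both thresholds, and the `in_snapshot` flag are hypothetical, and timestamps are passed in as plain seconds so the logic can be exercised without a running watcher.

```python
import logging

LOG = logging.getLogger(__name__)


class StaleUpdateWarner:
    """Warn, at most once per interval, when status updates arrive late."""

    def __init__(self, max_lag_secs=60.0, min_interval_secs=300.0):
        # Both defaults are illustrative assumptions, not values from the PR.
        self.max_lag_secs = max_lag_secs
        self.min_interval_secs = min_interval_secs
        self._last_warned = None

    def check(self, update_time, now, in_snapshot=False):
        """Return True if a warning was logged for this update."""
        if in_snapshot:
            # Old timestamps are expected while replaying the initial snapshot.
            return False
        lag = now - update_time
        if lag <= self.max_lag_secs:
            return False
        recently_warned = (
            self._last_warned is not None
            and now - self._last_warned < self.min_interval_secs
        )
        if recently_warned:
            # Rate limit: stay quiet until the interval has elapsed.
            return False
        self._last_warned = now
        LOG.warning(
            "Felix status update is %.0f seconds old; a backlog may be building",
            lag,
        )
        return True
```

Rate limiting matters here because a backlog means every queued update is late, so an unthrottled warning would flood the log exactly when the process is already behind.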

zhanz1 requested a review from a team as a code owner April 30, 2026 22:00

marvin-tigera added this to the Calico v3.33.0 milestone Apr 30, 2026

marvin-tigera added the release-note-required and docs-pr-required labels Apr 30, 2026

zhanz1 mentioned this pull request May 20, 2026

[PoC] networking-calico: Use a separate process & thread for resync #12582

Closed

zhanz1 and others added 4 commits May 20, 2026 15:14

misc: Minor test text fix

3111aa3

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

make -C networking-calico flake8 fmtpy

6f4e54e

fix: Misc comment updates and remove unnecessary code

a2a0cde

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

zhanz1 force-pushed the separate-leader-job-into-processes branch from 891fec1 to a2a0cde Compare May 20, 2026 15:50

zhanz1 added 3 commits May 20, 2026 17:21

fix: Implement reset for CalicoStartupResyncWorker

8edbb00

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

fix: Fix linter

173da3f

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

fix: Remove unneeded code and update comments

caf928e

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

nelljerram requested a review from Copilot May 26, 2026 10:06

Copilot started reviewing on behalf of nelljerram May 26, 2026 10:06

Copilot AI reviewed May 26, 2026

nelljerram reviewed May 26, 2026

zhanz1 and others added 2 commits May 26, 2026 15:59

fix: Addressing comments

eb205d4

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

Merge remote-tracking branch 'origin/master' into separate-leader-job-into-processes

0a09506

Merge pull request #1 from nelljerram/separate-leader-job-into-processes

16d7cc2

Merge current Calico master

nelljerram requested changes Jun 15, 2026


Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated

nelljerram reviewed Jun 16, 2026


Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py Outdated

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py

nelljerram added the docs-not-required label and removed the docs-pr-required label Jun 16, 2026

zhanz1 added 2 commits June 16, 2026 20:46

fix: Address comments

217220f

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

fix: Add stop and wait functions back

181f5cf

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>

nelljerram approved these changes Jun 17, 2026


nelljerram merged commit 3b42da5 into projectcalico:master Jun 17, 2026
3 checks passed

This was referenced Jun 18, 2026

networking-calico: warn on stale Felix status updates #13024

Merged

networking-calico: defensive fixes for silent elector death and status backlog #12456

Open
