networking-calico: Separate leader jobs into different processes #12668

Merged
nelljerram merged 12 commits into projectcalico:master from zhanz1:separate-leader-job-into-processes
Jun 17, 2026

@zhanz1 zhanz1 commented Apr 30, 2026

Contributor

WHY

This is an extension of #12582.

As clusters grow larger, it is hard for a single Python process to do resync, compaction, and status updating at the same time. To address this, let's separate the jobs into multiple processes.

WHAT

This change will introduce four new worker processes, each in charge of:

  • CalicoResourceSyncerWorker: Sync resources from Neutron to etcd.
  • CalicoManagerWorker: Do leader election and periodic compaction.
  • CalicoAgentStatusWatcherWorker: Watch agent status updates and report them to Neutron.
  • CalicoEndpointStatusWatcherWorker: Watch endpoint status updates and report them to Neutron.
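The split above can be pictured as a lookup from worker type to leader-only job. This is only a minimal sketch: `BaseWorker` here is a local stand-in rather than Neutron's real class, and `ROLE_BY_WORKER` and `role_for` are hypothetical names that do not come from the PR.

```python
class BaseWorker:
    """Stand-in for Neutron's worker base class (assumption, not the real API)."""


class CalicoResourceSyncerWorker(BaseWorker):
    """Marker: this process syncs resources from Neutron to etcd."""


class CalicoManagerWorker(BaseWorker):
    """Marker: this process does leader election and periodic compaction."""


class CalicoAgentStatusWatcherWorker(BaseWorker):
    """Marker: this process watches agent status updates."""


class CalicoEndpointStatusWatcherWorker(BaseWorker):
    """Marker: this process watches endpoint status updates."""


# Hypothetical role table; the descriptions mirror the list in this PR.
ROLE_BY_WORKER = {
    CalicoResourceSyncerWorker: "sync resources from Neutron to etcd",
    CalicoManagerWorker: "leader election and periodic compaction",
    CalicoAgentStatusWatcherWorker: "watch agent status updates",
    CalicoEndpointStatusWatcherWorker: "watch endpoint status updates",
}


def role_for(worker):
    """Return the leader-only job for this worker, or None for any other worker."""
    return ROLE_BY_WORKER.get(type(worker))
```

A worker that is not one of the four marker types gets no leader-only job, which matches the idea that ordinary neutron-server workers are unaffected.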

HOW

Launch neutron-server with calico as the plugin and set up an OpenStack cluster.

TEST

Tested on a virtual OpenStack cluster with 3 instances of neutron-server; everything seems to be working:

Details
# Leader election
### Node 1
2026-04-30 21:37:36.216 2219629 INFO networking_calico.plugins.ml2.drivers.calico.election [-] Successfully become master - key /calico/openstack/v2/no-region/neutron_election, value node-1:2219629
### Node 2
2026-04-30 21:37:38.576 2605430 INFO networking_calico.etcdv3 [-] etcdv3 get key=/calico/openstack/v2/no-region/neutron_election results=[(b'node-1:2219629', {'key': b'/calico/openstack/v2/no-region/neutron_election', 'create_revision': '5287496', 'mod_revision': '5287498', 'version': '2', 'lease': '8674386991848183953'})]

# Periodic resync
2026-04-30 21:45:49.593 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync
2026-04-30 21:48:09.642 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync
2026-04-30 21:50:29.484 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:RESYNC req-0fe39c86-ac77-4462-8dbe-1c863e1fc29c - - - - - -] I am master: doing periodic resync

# Resync monitor
2026-04-30 21:43:29.522 2605431 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [-] I am master: monitoring periodic resync

# Compaction
2026-04-30 21:43:28.665 2605430 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:2:COMPACTION req-51698c4a-54f4-4537-b982-aab35936ee78 - - - - - -] I am master: doing periodic compaction

# Agent status
2026-04-30 21:56:20.202 2605432 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:3:STATUS_ETCD_WATCHER req-03c102c2-fdd6-4cd8-acf6-48ba45829e9f - - - - - -] Felix on host host-10 is alive; fanning out status report

# Port status
2026-04-30 21:58:03.728 2605433 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:7:STATUS_ETCD_WATCHER req-85d1dc7e-2ef7-4e35-b8d5-fdb0c99b1b7c - - - - - -] Status of port ('host-10', 'ba683f56-d996-4ad8-be1a-f5f294cb620d') on host host-10 changed to up

# Live migration
2026-04-30 21:44:27.425 2605433 INFO networking_calico.plugins.ml2.drivers.calico.mech_calico [None CALICO:9:UPDATE_PORT_POSTCOMMIT req-dbed5163-87e4-4b0f-89da-9200fe223980 - - - - - -] Live migration e394c4b2-0043-4009-bea8-db2101fa75e8: destination port 5f69b7d1-9a75-4831-bb0e-2e3a5dbdc038 active on host-10, notifying Nova

Also, updated the unit tests according to the new design.

MISC

Release note:

Feature: Split OpenStack driver's leader-only tasks into multiple processes. 

@zhanz1 zhanz1 requested a review from a team as a code owner April 30, 2026 22:00
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 30, 2026
@marvin-tigera marvin-tigera added the release-note-required and docs-pr-required labels Apr 30, 2026
@nelljerram

Member

/sem-approve

@nelljerram

Member

/sem-approve

@nelljerram

Member

@zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.

@zhanz1

zhanz1 commented May 1, 2026

Contributor Author

> @zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.

Thanks... I thought I had passed make fmtpy, but it seems something was missed.

@nelljerram

Member

> > @zhanz1 I pushed a formatting fix to this to satisfy our linters (black and flake8), and to allow CI to run.
>
> Thanks... I thought I had passed make fmtpy, but it seems something was missed.

It's a bit confusing. fmtpy is an action that edits the files that need it but then returns success. Perhaps we should add a make check-dirty after the black line.

@nelljerram

Member

@zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

@nelljerram

Member

> @zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

BTW my plan will be to look again at #12456 after this PR has merged. I think we will find that a lot of #12456 is no longer needed, but there may still be some useful pieces there.

@zhanz1

zhanz1 commented May 20, 2026

Contributor Author

> @zhanz1 Please could you rework this PR on top of #12658 (given that the latter has now merged)? Or let me know if you would prefer me to do that.

Yes, I'll work on it, no worries!

> BTW my plan will be to look again at #12456 after this PR has merged. I think we will find that a lot of #12456 is no longer needed, but there may still be some useful pieces there.

Yes, I think the most important part is how we will detect dead greenthreads after we split the work across multiple processes.

zhanz1 and others added 4 commits May 20, 2026 15:14
As clusters grow larger, it is hard for a single Python process
to do resync, compaction, and status updating at the same time.
To address this, separate the jobs into multiple processes.

This change will introduce four new worker processes, each in
charge of:

* CalicoResourceSyncerWorker: Sync resources from Neutron to etcd.
* CalicoManagerWorker: Do leader election and periodic compaction.
* CalicoAgentStatusWatcherWorker: Watch agent status updates and
  report them to Neutron.
* CalicoEndpointStatusWatcherWorker: Watch endpoint status updates
  and report them to Neutron.

Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
@zhanz1 zhanz1 force-pushed the separate-leader-job-into-processes branch from 891fec1 to a2a0cde Compare May 20, 2026 15:50
zhanz1 added 3 commits May 20, 2026 17:21
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
@zhanz1

zhanz1 commented May 20, 2026

Contributor Author

@nelljerram I've rebased and also modified relevant naming and comments a little bit (e.g., voting is no longer a thing), feel free to take a look, thanks!

@nelljerram

Member

/sem-approve

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the networking-calico Neutron ML2 driver architecture by splitting “leader-only” responsibilities into dedicated neutron-server worker processes (manager/election+compaction, agent status watching, endpoint status watching, plus the existing startup resync worker). The goal is to reduce contention in large OpenStack clusters by avoiding a single process doing all periodic and watcher work.

Changes:

  • Add new Neutron BaseWorker marker classes for manager and status-watcher worker processes.
  • Refactor CalicoMechanismDriver to initialize common state post-fork and to start role-specific greenlets per worker type; introduce a process-shared “is master” timestamp updated by the elector.
  • Split StatusWatcher into AgentStatusWatcher and EndpointStatusWatcher, and update unit tests accordingly.
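One way to picture the process-shared "is master" timestamp is a `multiprocessing.Value` that the elector refreshes and the other workers read. This is a minimal sketch, not the PR's code: it assumes the shared value is created before forking, and `new_master_stamp`, `mark_master`, and `MASTER_TIMEOUT_SECS` are hypothetical names with an illustrative timeout. Only `is_master` and `time_since_last_refreshed` echo names mentioned in this conversation.

```python
import multiprocessing
import time

# Hypothetical freshness window; the real value is not stated in this PR page.
MASTER_TIMEOUT_SECS = 60.0


def new_master_stamp():
    """Create the shared timestamp; call this before forking worker processes."""
    # "d" is a C double; 0.0 means "never refreshed".
    return multiprocessing.Value("d", 0.0)


def mark_master(shared, now=None):
    """Called by the elector each time it confirms that it still holds mastership."""
    with shared.get_lock():
        shared.value = time.time() if now is None else now


def is_master(shared, now=None, timeout=MASTER_TIMEOUT_SECS):
    """Return True only if the elector refreshed the timestamp recently enough."""
    now = time.time() if now is None else now
    with shared.get_lock():
        time_since_last_refreshed = now - shared.value
    return time_since_last_refreshed < timeout
```

Because mastership is inferred from freshness, a stalled elector process causes the other workers to stop treating themselves as master once the window expires, without any extra signalling.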

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/workers.py | Adds new worker classes intended to map to separate neutron-server forked processes. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py | Dispatches worker responsibilities post-fork; adds process-shared master tracking and new init/start helpers. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/status.py | Splits status watching into agent vs endpoint watcher subclasses. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/election.py | Updates elector to publish master “freshness” via a shared value; removes old in-process master flag. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_plugin_etcd.py | Updates plugin tests to match the new watcher/worker structure. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_mech_calico.py | Updates mech driver init tests for new init/start helper methods. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/test_election.py | Updates election tests for the new elector API and master signaling. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py | Adjusts test stubs/mocks to support Elector.run() and driver.is_master(). |

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated

@nelljerram nelljerram left a comment

Member

I haven't reviewed the test changes in full detail yet, but I think I have enough comments queued up to be worth releasing.

Overall, I really like the shape of this change, so thanks for proposing it. Just some detailed comments...

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
def _init_and_start_calico_resouce_syncer(self):
    self.start_up_resync_thread = eventlet.spawn(self._do_startup_resync)

def _init_and_start_calico_manager(self):

Member

Can we call this "elector" instead of "manager"?

Member

Oh I see, you haven't called it "elector" because it also does compaction. WDYT about making another separate worker process for compaction? That's effectively what will happen when eventlet is removed anyway, and I think it would be cleaner already.

Contributor Author

We can, but to me it feels a bit wasteful to spawn a standalone process that only does a check every X seconds. Even with eventlet removed, I feel like we should just turn this into two threads for the same reason. These two components (Elector and compaction) won't really scale as clusters grow larger.
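The "two threads in one process" idea from this reply could look roughly like the sketch below. It uses the standard `threading` module rather than eventlet, so it illustrates the post-eventlet shape being discussed and not what this PR implements; `run_periodically` and `start_manager_threads` are hypothetical names, and the election and compaction callables are supplied by the caller.

```python
import threading


def run_periodically(task, interval_secs, stop):
    """Call task() every interval_secs until the stop event is set."""
    # Event.wait returns False on timeout and True once stop is set.
    while not stop.wait(interval_secs):
        task()


def start_manager_threads(elect, compact, elect_interval, compact_interval):
    """Run election and compaction as two threads inside one worker process."""
    stop = threading.Event()
    threads = [
        threading.Thread(
            target=run_periodically,
            args=(elect, elect_interval, stop),
            daemon=True,
        ),
        threading.Thread(
            target=run_periodically,
            args=(compact, compact_interval, stop),
            daemon=True,
        ),
    ]
    for thread in threads:
        thread.start()
    return stop, threads
```

Setting the returned event stops both loops, so the two low-frequency jobs share one process and one shutdown path.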

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/test/lib.py Outdated
zhanz1 and others added 2 commits May 26, 2026 15:59
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
@nelljerram

Member

@zhanz1 I have prepared zhanz1#1 to merge latest master and resolve the conflicts on this PR. If you think that looks good, please merge it; then I'll re-approve workflows and re-review after that.

@zhanz1

zhanz1 commented Jun 12, 2026

Contributor Author

> @zhanz1 I have prepared zhanz1#1 to merge latest master and resolve the conflicts on this PR. If you think that looks good, please merge it; then I'll re-approve workflows and re-review after that.

Many thanks! I have merged it.

@nelljerram nelljerram left a comment

Member

Looking good. I still need to review test_plugin_etcd.py, but the comments below cover everything else. I'll also kick off CI now...

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
@nelljerram

Member

/sem-approve

@nelljerram nelljerram left a comment

Member

Two small points for test_plugin_etcd.py

@nelljerram nelljerram added the docs-not-required label and removed the docs-pr-required label Jun 16, 2026
@nelljerram

Member

@zhanz1 Thanks so much for your work on this. CI is looking good, and there are just a few remaining small points:

  • removing subclass methods that seem to be unnecessary in the worker classes
  • var rename -> time_since_last_refreshed
  • removing an eventlet.spawn mock that appears to be not needed
  • understanding the c[0] == "" condition.

Happy with everything else, and looking forward to merging this!

zhanz1 added 2 commits June 16, 2026 20:46
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
Signed-off-by: Zhan Zhang <zzhang953@bloomberg.net>
@nelljerram

Member

/sem-approve

@nelljerram nelljerram merged commit 3b42da5 into projectcalico:master Jun 17, 2026
3 checks passed
nelljerram added a commit that referenced this pull request Jun 23, 2026
If the StatusWatcher is processing Felix uptime updates whose embedded
"time" field is materially older than wall-clock now, this is evidence
that we are running behind the rate of updates Felix is producing and a
backlog is building up.  Customers have hit this in production: Neutron
ends up seeing agent up/down transitions hours after they actually
happened, and the existing logs give no early warning while the backlog
is growing.

Add a rate-limited WARNING in AgentStatusWatcher._on_status_set --
skipped during initial-snapshot replay where old timestamps are
expected -- so operators can see the backlog building up long before it
grows to hours.

This is the operator-facing piece of the CI-1892 hardening work that
still stands on its own merits after PR #12668 split the mech driver
into per-process workers.  The other defensive fixes from that branch
(elector watchdog, etcd-confirmed mastership) no longer carry their
weight: the periodic-resync loop they were targeted at is gone, and the
existing time-based check in is_master() already catches stale-elector
mastership for the continuous loops that remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
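
The rate-limited backlog warning described in the commit message above might look roughly like this. It is a sketch under stated assumptions: the `BacklogWarner` class, its parameter names, and both thresholds are hypothetical, and the actual logic in `AgentStatusWatcher._on_status_set` is not shown on this page.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class BacklogWarner:
    """Warn, at most once per interval, when status updates lag the wall clock."""

    def __init__(self, max_lag_secs=60.0, min_interval_secs=300.0):
        # Both thresholds are illustrative assumptions, not values from the PR.
        self.max_lag_secs = max_lag_secs
        self.min_interval_secs = min_interval_secs
        self._last_warned = None

    def check(self, update_time, now=None, in_snapshot=False):
        """Return True if a warning was logged for this update."""
        if in_snapshot:
            # Old timestamps are expected while replaying the initial snapshot.
            return False
        now = time.time() if now is None else now
        lag = now - update_time
        if lag <= self.max_lag_secs:
            return False
        recently_warned = (
            self._last_warned is not None
            and now - self._last_warned < self.min_interval_secs
        )
        if recently_warned:
            return False
        self._last_warned = now
        LOG.warning(
            "Felix status update is %.0fs behind wall clock; "
            "a status backlog may be building up",
            lag,
        )
        return True
```

Rate limiting keeps the log readable while the backlog persists: the first lagging update warns immediately, and later ones are suppressed until the interval has elapsed.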

Labels

docs-not-required, release-note-required

4 participants