networking-calico: diagnostic instrumentation for fairy-GC-in-hub race by nelljerram · Pull Request #13022 · projectcalico/calico

nelljerram · 2026-06-18T12:22:25Z

We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:

Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
    pool.dispatch.checkin(connection, self)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
    fn(*args, **kw)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
    time.sleep(0)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
    hub.switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
TimeoutError: timed out

This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC kicks in on eventlet's "hub" greenlet. An sqlalchemy connection fairy that is GC'd while the eventlet hub greenlet is the current greenlet triggers oslo.db's _thread_yield checkin listener to call time.sleep(0) -> hub.switch(), which deadlocks because the hub cannot switch to itself. The TimeoutError that eventually fires is silently swallowed by Python's "Exception ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves the connection record's pool state indeterminate.

I think #13015 fixes the primary root cause of this, by adding a transaction wrapper around raw context.session.query calls. The wrapper properly closes the session after those calls, instead of leaking it to GC. However, in case there are any remaining cases, e.g. because a Neutron-framework path outside our control drops a session unclosed, or because a future change reintroduces a code path that bypasses the using pattern -- we want a log line that points at the leaking code path rather than just the in-hub finalizer stack that the existing "Exception ignored in" trace gives us.

This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on sqlalchemy.pool.Pool:

A: checkout listener captures traceback.format_stack() at the moment of every connection-pool checkout, stashing it on connection_record.info["calico_checkout_stack"].
A: checkin listener with insert=True (prepends ahead of oslo.db's _thread_yield) checks whether eventlet.greenthread.getcurrent() is hub.greenlet. If yes, oslo.db is about to deadlock; we log a WARNING containing both the captured checkout-time stack and the current finalizer stack.

Default off, but enabled for DevStack CI. Enable per-deployment with::

[calico]
fairy_gc_diagnostics = True

in neutron.conf. The per-checkout stack capture costs ~50-100us per checkout, so worth turning off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic / scale-test runs.

We have sometimes been seeing this exception in journalctl.txt / neutron-server.log: Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290> Traceback (most recent call last): File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda> and _finalize_fairy( File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy connection_record.checkin() File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin pool.dispatch.checkin(connection, self) File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__ fn(*args, **kw) File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield time.sleep(0) File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep hub.switch() File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch return self.greenlet.switch() TimeoutError: timed out This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC kicks in on eventlet's "hub" greenlet. An sqlalchemy connection fairy that is GC'd while the eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener to call ``time.sleep(0)`` -> ``hub.switch()``, which deadlocks because the hub cannot switch to itself. The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves the connection record's pool state indeterminate. I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw `context.session.query` calls. The wrapper properly closes the session after those calls, instead of leaking it to GC. However, in case there are any remaining cases, e.g. because a Neutron-framework path outside our control drops a session unclosed, or because a future change reintroduces a code path that bypasses the ``using`` pattern -- we want a log line that points at the leaking code path rather than just the in-hub finalizer stack that the existing "Exception ignored in" trace gives us. This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on ``sqlalchemy.pool.Pool``: * A: ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``. * A: ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``) checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``. If yes, oslo.db is about to deadlock; we log a WARNING containing both the captured checkout-time stack and the current finalizer stack. Default off, but enabled for DevStack CI. Enable per-deployment with:: [calico] fairy_gc_diagnostics = True in ``neutron.conf``. The per-checkout stack capture costs ~50-100us per checkout, so worth turning off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic / scale-test runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds opt-in diagnostic instrumentation to the OpenStack networking-calico ML2 mechanism driver to help pinpoint connection/session leaks that lead to SQLAlchemy connection-fairy finalization occurring on the eventlet hub greenlet (triggering the oslo.db _thread_yield deadlock pattern).

Changes:

Introduces [calico] fairy_gc_diagnostics config option and installs diagnostics during driver initialization when enabled.
Adds a new fairy_gc_diagnostics.py module that registers SQLAlchemy Pool checkout/checkin event listeners for stack capture + hub-greenlet detection with WARNING logging.
Enables the diagnostic flag by default in DevStack CI via the plugin script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py`	Adds a config knob and conditionally installs the diagnostic listeners during parent-process initialization.
`networking-calico/networking_calico/plugins/ml2/drivers/calico/fairy_gc_diagnostics.py`	New diagnostics module that hooks SQLAlchemy Pool events, records checkout stacks, and logs when checkin runs on the eventlet hub.
`networking-calico/devstack/plugin.sh`	Turns the diagnostics on for DevStack CI runs.

If eventlet is not importable -- e.g. after the planned eventlet removal from neutron-server -- the per-checkin listener used to catch ImportError under a generic `except Exception` and route it to LOG.exception, producing a full traceback on every DB connection checkin. Move the eventlet import into install() so it runs once. If it fails, log a single WARNING and skip installing the listeners entirely: the race the listeners detect only fires under eventlet, so there is nothing to install when eventlet is absent. With imports verified up-front, the runtime try/except in the listener is no longer needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

The previous wording described forked workers as "sharing" module state and the listeners as inherited "via the shared module state". Both are inaccurate: a fork gives each child a copy of the parent's memory image, not a live shared reference. Rewrite the _INSTALLED flag comment to claim only what the flag actually delivers (intra-process idempotency), and rewrite the install-site comment in mech_calico.py to pin the inheritance story to the SQLAlchemy Pool class -- which is the actual mechanism by which a forked child sees the listeners without re-installing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Two follow-ups from Copilot review: 1. The eventlet-missing and not-monkey-patched early returns left _INSTALLED at False, so repeated install() calls would re-emit the WARNING each time -- contradicting the docstring's "single WARNING" promise. Rename the flag to _INSTALL_ATTEMPTED to reflect what it actually tracks ("decision made", not "listeners active") and set it before the early-return checks. 2. traceback.format_stack() on every pool checkout is unbounded. Cap the checkout-time capture at 50 frames (covers a typical Neutron handler chain, ~30 frames, with headroom) and the finalizer stack at 30 frames (the GC/weakref-finalizer path is shallower). Keeps per-checkout cost bounded and avoids unbounded strings on long-lived connection_record.info entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 18, 2026 12:22

nelljerram requested a review from a team as a code owner June 18, 2026 12:22

marvin-tigera added this to the Calico v3.33.0 milestone Jun 18, 2026

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 18, 2026

Copilot started reviewing on behalf of nelljerram June 18, 2026 12:22 View session

github-actions Bot added the cherry-pick-candidate label Jun 18, 2026

nelljerram force-pushed the fairy-gc-diagnostics branch from 648bdf9 to 1cd3658 Compare June 18, 2026 12:24