networking-calico: diagnostic instrumentation for fairy-GC-in-hub race #13022
Merged
nelljerram merged 6 commits on Jun 23, 2026
We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:
Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
    pool.dispatch.checkin(connection, self)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
    fn(*args, **kw)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
    time.sleep(0)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
    hub.switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
TimeoutError: timed out
This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC
then runs on eventlet's "hub" greenlet. A SQLAlchemy connection fairy that is GC'd while the
eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener,
which calls ``time.sleep(0)`` -> ``hub.switch()``. That call deadlocks because the hub cannot switch
to itself. The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception
ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves
the connection record's pool state indeterminate.
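The "Exception ignored in" behaviour can be reproduced with the standard library alone. The following is a minimal sketch, assuming CPython (where dropping the last reference runs the weakref callback immediately). ``Resource`` and ``failing_finalizer`` are hypothetical stand-ins for the leaked connection fairy and the checkin path; nothing here touches SQLAlchemy, oslo.db or eventlet.

```python
import sys
import weakref


class Resource:
    """Hypothetical stand-in for an object that is leaked to the GC."""


def failing_finalizer(_ref):
    # Stand-in for a finalizer path that ends in TimeoutError.
    raise TimeoutError("timed out")


captured = []
previous_hook = sys.unraisablehook
# Record what Python would otherwise print as "Exception ignored in: ...".
sys.unraisablehook = lambda info: captured.append(info.exc_type)

try:
    leaked = Resource()
    ref = weakref.ref(leaked, failing_finalizer)
    del leaked  # last reference gone: the callback runs and raises
    reached = True  # the exception did not propagate into this code
finally:
    sys.unraisablehook = previous_hook
```

The caller never sees the ``TimeoutError``; it is only reported through ``sys.unraisablehook``, which is why the existing trace shows the finalizer stack and nothing about the code that leaked the object.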
I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw
`context.session.query` calls. The wrapper properly closes the session after those calls, instead
of leaking it to GC. However, some cases may remain, e.g. because a
Neutron-framework path outside our control drops a session unclosed, or because a future change
reintroduces a code path that bypasses the ``using`` pattern. For those cases we want a log line
that points at the leaking code path rather than just the in-hub finalizer stack that the existing
"Exception ignored in" trace gives us.
This commit adds opt-in diagnostics that install two SQLAlchemy event listeners on
``sqlalchemy.pool.Pool``:
* A ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every
  connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``.
* A ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``)
  checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``. If yes, oslo.db is about
  to deadlock; we log a WARNING containing both the captured checkout-time stack and the current
  finalizer stack.
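The two listeners can be sketched as follows. This is an illustration under stated assumptions, not the contents of ``fairy_gc_diagnostics.py``: the function names, the logger name and the ``on_hub`` parameter are hypothetical, and the hub check is injectable only so the sketch can be exercised without eventlet or a database.

```python
import logging
import traceback
from types import SimpleNamespace

LOG = logging.getLogger("fairy_gc_listener_sketch")  # hypothetical name
STACK_KEY = "calico_checkout_stack"  # key named in the description


def running_on_hub():
    """True if the current greenlet is eventlet's hub greenlet."""
    try:
        from eventlet import greenthread, hubs
    except ImportError:
        return False  # no eventlet, so no hub to deadlock
    return greenthread.getcurrent() is hubs.get_hub().greenlet


def on_checkout(dbapi_connection, connection_record, connection_proxy):
    # Remember which code path checked the connection out.
    connection_record.info[STACK_KEY] = "".join(traceback.format_stack())


def make_checkin_listener(on_hub=running_on_hub):
    def on_checkin(dbapi_connection, connection_record):
        if not on_hub():
            return  # ordinary checkin, nothing to report
        LOG.warning(
            "Connection checked in on the eventlet hub greenlet.\n"
            "Checkout stack:\n%s\nFinalizer stack:\n%s",
            connection_record.info.get(STACK_KEY, "<not captured>"),
            "".join(traceback.format_stack()),
        )

    return on_checkin


def install():
    """Register both listeners on the Pool class (requires SQLAlchemy)."""
    from sqlalchemy import event
    from sqlalchemy.pool import Pool

    event.listen(Pool, "checkout", on_checkout)
    # insert=True places the listener at the front of the checkin list.
    event.listen(Pool, "checkin", make_checkin_listener(), insert=True)


# Exercise the checkout listener with a fake record instead of a real pool.
record = SimpleNamespace(info={})
on_checkout(None, record, None)
```

Because the checkin listener only logs and returns, it does not prevent the subsequent ``_thread_yield`` call; it just leaves a WARNING that names the checkout site before the hub wedges.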
Default off, but enabled for DevStack CI. Enable per-deployment with::

    [calico]
    fairy_gc_diagnostics = True

in ``neutron.conf``. The per-checkout stack capture costs ~50-100us per checkout, so it is worth
turning off in normal operation once the issue is diagnosed; it can be left on indefinitely for
diagnostic / scale-test runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
648bdf9 to 1cd3658
Pull request overview
Adds opt-in diagnostic instrumentation to the OpenStack networking-calico ML2 mechanism driver to help pinpoint connection/session leaks that lead to SQLAlchemy connection-fairy finalization occurring on the eventlet hub greenlet (triggering the oslo.db _thread_yield deadlock pattern).
Changes:
- Introduces `[calico] fairy_gc_diagnostics` config option and installs diagnostics during driver initialization when enabled.
- Adds a new `fairy_gc_diagnostics.py` module that registers SQLAlchemy Pool `checkout`/`checkin` event listeners for stack capture + hub-greenlet detection with WARNING logging.
- Enables the diagnostic flag by default in DevStack CI via the plugin script.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py | Adds a config knob and conditionally installs the diagnostic listeners during parent-process initialization. |
| networking-calico/networking_calico/plugins/ml2/drivers/calico/fairy_gc_diagnostics.py | New diagnostics module that hooks SQLAlchemy Pool events, records checkout stacks, and logs when checkin runs on the eventlet hub. |
| networking-calico/devstack/plugin.sh | Turns the diagnostics on for DevStack CI runs. |
If eventlet is not importable -- e.g. after the planned eventlet removal from neutron-server -- the per-checkin listener used to catch ImportError under a generic `except Exception` and route it to LOG.exception, producing a full traceback on every DB connection checkin.

Move the eventlet import into install() so it runs once. If it fails, log a single WARNING and skip installing the listeners entirely: the race the listeners detect only fires under eventlet, so there is nothing to install when eventlet is absent.

With imports verified up-front, the runtime try/except in the listener is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The previous wording described forked workers as "sharing" module state and the listeners as inherited "via the shared module state". Both are inaccurate: a fork gives each child a copy of the parent's memory image, not a live shared reference.

Rewrite the _INSTALLED flag comment to claim only what the flag actually delivers (intra-process idempotency), and rewrite the install-site comment in mech_calico.py to pin the inheritance story to the SQLAlchemy Pool class -- which is the actual mechanism by which a forked child sees the listeners without re-installing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Two follow-ups from Copilot review:
1. The eventlet-missing and not-monkey-patched early returns left
_INSTALLED at False, so repeated install() calls would re-emit the
WARNING each time -- contradicting the docstring's "single WARNING"
promise. Rename the flag to _INSTALL_ATTEMPTED to reflect what it
actually tracks ("decision made", not "listeners active") and set
it before the early-return checks.
2. traceback.format_stack() on every pool checkout is unbounded. Cap
the checkout-time capture at 50 frames (covers a typical Neutron
handler chain, ~30 frames, with headroom) and the finalizer stack
at 30 frames (the GC/weakref-finalizer path is shallower). Keeps
per-checkout cost bounded and avoids unbounded strings on
long-lived connection_record.info entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
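The two follow-ups above can be sketched together. This is a minimal illustration, assuming the flag name and frame limits given in the commit message; ``capture_stack`` and the two parameters of ``install`` are hypothetical hooks introduced so the sketch runs without eventlet or SQLAlchemy, and the not-monkey-patched check is omitted.

```python
import logging
import traceback

LOG = logging.getLogger("fairy_gc_install_sketch")  # hypothetical name

CHECKOUT_STACK_LIMIT = 50   # frames kept at checkout time
FINALIZER_STACK_LIMIT = 30  # frames kept on the finalizer path

_INSTALL_ATTEMPTED = False  # "decision made", not "listeners active"


def capture_stack(limit):
    """Format at most `limit` frames, counted from this call outward.

    The capture_stack frame itself counts toward the limit.
    """
    return "".join(traceback.format_stack(limit=limit))


def install(import_eventlet, register_listeners):
    """Decide once per process whether to install the listeners.

    `import_eventlet` is expected to raise ImportError when eventlet is
    unavailable; `register_listeners` stands in for the code that would
    call sqlalchemy.event.listen().
    """
    global _INSTALL_ATTEMPTED
    if _INSTALL_ATTEMPTED:
        return False
    # Set the flag before the early return so the WARNING is logged once.
    _INSTALL_ATTEMPTED = True
    try:
        import_eventlet()
    except ImportError:
        LOG.warning("eventlet not importable; diagnostics not installed")
        return False
    register_listeners()
    return True
```

Setting the flag before the import check is what makes a second ``install()`` call a silent no-op even when the first call bailed out early.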
coutinhop approved these changes on Jun 22, 2026