Skip to content

networking-calico: diagnostic instrumentation for fairy-GC-in-hub race#13022

Merged
nelljerram merged 6 commits into
projectcalico:masterfrom
nelljerram:fairy-gc-diagnostics
Jun 23, 2026
Merged

networking-calico: diagnostic instrumentation for fairy-GC-in-hub race#13022
nelljerram merged 6 commits into
projectcalico:masterfrom
nelljerram:fairy-gc-diagnostics

Conversation

@nelljerram

Copy link
Copy Markdown
Member

We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:

Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
    pool.dispatch.checkin(connection, self)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
    fn(*args, **kw)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
    time.sleep(0)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
    hub.switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
TimeoutError: timed out

This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC kicks in on eventlet's "hub" greenlet. An sqlalchemy connection fairy that is GC'd while the eventlet hub greenlet is the current greenlet triggers oslo.db's _thread_yield checkin listener to call time.sleep(0) -> hub.switch(), which deadlocks because the hub cannot switch to itself. The TimeoutError that eventually fires is silently swallowed by Python's "Exception ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves the connection record's pool state indeterminate.

I think #13015 fixes the primary root cause of this, by adding a transaction wrapper around raw context.session.query calls. The wrapper properly closes the session after those calls, instead of leaking it to GC. However, in case there are any remaining cases, e.g. because a Neutron-framework path outside our control drops a session unclosed, or because a future change reintroduces a code path that bypasses the using pattern -- we want a log line that points at the leaking code path rather than just the in-hub finalizer stack that the existing "Exception ignored in" trace gives us.

This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on sqlalchemy.pool.Pool:

  • A: checkout listener captures traceback.format_stack() at the moment of every connection-pool checkout, stashing it on connection_record.info["calico_checkout_stack"].

  • A: checkin listener with insert=True (prepends ahead of oslo.db's _thread_yield) checks whether eventlet.greenthread.getcurrent() is hub.greenlet. If yes, oslo.db is about to deadlock; we log a WARNING containing both the captured checkout-time stack and the current finalizer stack.

Default off, but enabled for DevStack CI. Enable per-deployment with::

[calico]
fairy_gc_diagnostics = True

in neutron.conf. The per-checkout stack capture costs ~50-100us per checkout, so worth turning off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic / scale-test runs.

Copilot AI review requested due to automatic review settings June 18, 2026 12:22
@nelljerram nelljerram requested a review from a team as a code owner June 18, 2026 12:22
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Jun 18, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 18, 2026
We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:

Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
    pool.dispatch.checkin(connection, self)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
    fn(*args, **kw)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
    time.sleep(0)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
    hub.switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
TimeoutError: timed out

This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC
kicks in on eventlet's "hub" greenlet.  An sqlalchemy connection fairy that is GC'd while the
eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener
to call ``time.sleep(0)`` -> ``hub.switch()``, which deadlocks because the hub cannot switch to
itself.  The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception
ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves
the connection record's pool state indeterminate.

I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw
`context.session.query` calls.  The wrapper properly closes the session after those calls, instead
of leaking it to GC.  However, in case there are any remaining cases, e.g. because a
Neutron-framework path outside our control drops a session unclosed, or because a future change
reintroduces a code path that bypasses the ``using`` pattern -- we want a log line that points at
the leaking code path rather than just the in-hub finalizer stack that the existing "Exception
ignored in" trace gives us.

This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on
``sqlalchemy.pool.Pool``:

* A: ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every
  connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``.

* A: ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``)
  checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``.  If yes, oslo.db is about
  to deadlock; we log a WARNING containing both the captured checkout-time stack and the current
  finalizer stack.

Default off, but enabled for DevStack CI.  Enable per-deployment with::

    [calico]
    fairy_gc_diagnostics = True

in ``neutron.conf``.  The per-checkout stack capture costs ~50-100us per checkout, so worth turning
off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic /
scale-test runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram force-pushed the fairy-gc-diagnostics branch from 648bdf9 to 1cd3658 Compare June 18, 2026 12:24
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented cherry-pick-candidate labels Jun 18, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds opt-in diagnostic instrumentation to the OpenStack networking-calico ML2 mechanism driver to help pinpoint connection/session leaks that lead to SQLAlchemy connection-fairy finalization occurring on the eventlet hub greenlet (triggering the oslo.db _thread_yield deadlock pattern).

Changes:

  • Introduces [calico] fairy_gc_diagnostics config option and installs diagnostics during driver initialization when enabled.
  • Adds a new fairy_gc_diagnostics.py module that registers SQLAlchemy Pool checkout/checkin event listeners for stack capture + hub-greenlet detection with WARNING logging.
  • Enables the diagnostic flag by default in DevStack CI via the plugin script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Adds a config knob and conditionally installs the diagnostic listeners during parent-process initialization.
networking-calico/networking_calico/plugins/ml2/drivers/calico/fairy_gc_diagnostics.py New diagnostics module that hooks SQLAlchemy Pool events, records checkout stacks, and logs when checkin runs on the eventlet hub.
networking-calico/devstack/plugin.sh Turns the diagnostics on for DevStack CI runs.

If eventlet is not importable -- e.g. after the planned eventlet
removal from neutron-server -- the per-checkin listener used to catch
ImportError under a generic `except Exception` and route it to
LOG.exception, producing a full traceback on every DB connection
checkin.

Move the eventlet import into install() so it runs once.  If it
fails, log a single WARNING and skip installing the listeners
entirely: the race the listeners detect only fires under eventlet,
so there is nothing to install when eventlet is absent.  With
imports verified up-front, the runtime try/except in the listener
is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/mech_calico.py Outdated
The previous wording described forked workers as "sharing" module state
and the listeners as inherited "via the shared module state".  Both are
inaccurate: a fork gives each child a copy of the parent's memory
image, not a live shared reference.

Rewrite the _INSTALLED flag comment to claim only what the flag
actually delivers (intra-process idempotency), and rewrite the
install-site comment in mech_calico.py to pin the inheritance story to
the SQLAlchemy Pool class -- which is the actual mechanism by which a
forked child sees the listeners without re-installing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@nelljerram nelljerram requested a review from Copilot June 18, 2026 14:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Two follow-ups from Copilot review:

1. The eventlet-missing and not-monkey-patched early returns left
   _INSTALLED at False, so repeated install() calls would re-emit the
   WARNING each time -- contradicting the docstring's "single WARNING"
   promise.  Rename the flag to _INSTALL_ATTEMPTED to reflect what it
   actually tracks ("decision made", not "listeners active") and set
   it before the early-return checks.

2. traceback.format_stack() on every pool checkout is unbounded.  Cap
   the checkout-time capture at 50 frames (covers a typical Neutron
   handler chain, ~30 frames, with headroom) and the finalizer stack
   at 30 frames (the GC/weakref-finalizer path is shallower).  Keeps
   per-checkout cost bounded and avoids unbounded strings on
   long-lived connection_record.info entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@nelljerram nelljerram merged commit 50be41e into projectcalico:master Jun 23, 2026
3 checks passed
@nelljerram nelljerram deleted the fairy-gc-diagnostics branch June 23, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants