networking-calico: eliminate "SQL execution without transaction" warnings#13015
Merged
nelljerram merged 5 commits intoJun 23, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces SQLAlchemy “SQL execution without transaction” debug warnings in the OpenStack Calico ML2 driver by ensuring raw SQLAlchemy session.query(...) reads run under an explicit CONTEXT_READER transaction, while keeping calls that rely on Neutron’s @retry_if_session_inactive behavior outside any outer transaction.
Changes:
- Wrap raw per-port DB reads in
get_extra_port_information()withdb_api.CONTEXT_READER.using(context)to ensuresession.in_transaction()is true at execute time. - Refactor QoS handling: inline QoS rule fetch in callers and convert
add_port_qosinto a purebuild_qos_controls(bw_rules, pr_rules)helper. - Inline and remove several small helper methods that previously performed raw per-port queries.
…ings DevStack runs neutron-server with ``[DEFAULT] debug = True``, which enables an ORM-execute listener in ``neutron.objects.base`` that warns on every ``session.execute(...)`` where ``session.in_transaction()`` is False: ORM session: SQL execution without transaction in progress, traceback: ... It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are running in implicit-transaction mode (auto-begun and auto-discarded per statement) instead of the explicit reader/writer pattern. The warning is debug-only and doesn't affect correctness today, but it's real 2.0-readiness signal and very noisy in the journal during scale runs. After PR projectcalico#12930 removed our outer ``CONTEXT_WRITER.using`` wrapper from the mech driver's postcommit body, every raw ``context.session.query(...)`` in ``endpoints.WorkloadEndpointSyncer.get_extra_port_information`` ran with no enginefacade-recognised transaction active, and so tripped the listener. The PR-projectcalico#12930 lesson stands: we can't restore a single broad wrapper, because the body also calls ``@retry_if_session_inactive``- decorated upstream methods (``self.db.get_subnet`` via ``add_port_gateways``, ``self.db.get_security_groups`` via ``add_port_sg_names``) whose retry recovery for the ``_ensure_default_security_group`` IntegrityError race is exactly what gets disabled by an outer transaction. Restructure ``get_extra_port_information`` into three phases by transaction shape: * Phase A -- raw ``context.session.query(...)`` reads (fixed IPs, floating IPs, network, QoS rules) plus ``self.db._get_port_security_group_bindings`` (private upstream helper, no ``@retry_if_session_inactive``). These get a single ``CONTEXT_READER.using(context)`` block so ``session.in_transaction()`` is True at execute time. Safe to wrap because none of these calls goes through a retry-decorated upstream method that would have its retry disabled. All five reads are now inlined into the ``with`` block. The QoS rule queries use ``list(...)`` to force materialisation inside the reader; the other four are list-comprehensions that iterate eagerly before the ``with`` exits. * Phase B -- ``add_port_gateways`` and ``add_port_sg_names`` call into ``@retry_if_session_inactive``-decorated upstream methods and stay outside Phase A's reader for the same reason PR projectcalico#12930 stripped the old wrapper: keep the inner retry decorator enabled. Each upstream call manages its own reader transaction internally, so the warning doesn't fire here either. * Phase C -- ``add_port_interface_name`` (pure compute) and ``add_port_project_data`` (Keystone-backed in-process cache) touch no DB; position relative to A/B is correctness-irrelevant. The QoS handling is also factored: ``add_port_qos`` becomes a pure static method ``build_qos_controls(bw_rules, pr_rules)`` that takes already-materialised rule lists, and the rule fetch lives in the caller -- inside Phase A for the postcommit path, from ``self._bulk`` for the resync path. This removes the previous bulk-vs-postcommit branch from inside the method body and makes the consumer testable in isolation. The four tiny one-call helpers (``get_security_groups_for_port``, ``get_fixed_ips_for_port``, ``get_floating_ips_for_port``, ``get_network_properties_for_port``) are removed entirely; their bodies are inlined. Misleading docstrings on ``add_port_gateways``, ``add_port_sg_names``, and the former ``add_port_qos`` -- left over from the pre-projectcalico#12930 era when an outer ``_txn_from_context`` / ``CONTEXT_WRITER.using`` wrapped the entire postcommit body -- claimed *"This method assumes it's being called from within a database transaction and does not take out another one"*. That assumption was true then; it isn't now. Updated to describe the actual Phase A / Phase B placement. Tests: 185 pass (``make tox-caracal``). No mech-driver semantics change for either the postcommit path or the resync (bulk) path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
868f225 to
6992a84
Compare
Move ``port_extra.qos = self.build_qos_controls(bw_rules, pr_rules)`` from after the ``CONTEXT_READER.using`` block to inside it. Building the QoSControls dict iterates each rule and reads its ``direction``, ``max_kbps``, ``max_burst_kbps`` / ``max_kpps`` columns; doing this after the reader exits relies on the rule ORM objects still being readable in detached state -- which holds today (the columns are eager-loaded and oslo.db's reader-mode default leaves them in place) but would tie our correctness to oslo.db's ``expire_on_commit`` / ``rollback_reader_sessions`` internals. Doing it inside the reader, while the rules are still attached, sidesteps that question. With ``build_qos_controls`` now inside the reader, the ``list(...)`` wrapping that previously materialised the queries is no longer needed -- the lazy ``Query`` objects are iterated by ``build_qos_controls`` while the transaction is still active, so their SQL fires safely. ``build_qos_controls`` becomes shape-symmetric across paths: the postcommit path passes lazy Query objects, the bulk path passes plain Python lists, and either iterates correctly. Tests: tox-caracal passes (185 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
nelljerram
added a commit
to nelljerram/calico
that referenced
this pull request
Jun 18, 2026
We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:
Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
and _finalize_fairy(
File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
connection_record.checkin()
File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
pool.dispatch.checkin(connection, self)
File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
fn(*args, **kw)
File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
time.sleep(0)
File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
hub.switch()
File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
return self.greenlet.switch()
TimeoutError: timed out
This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC
kicks in on eventlet's "hub" greenlet. An sqlalchemy connection fairy that is GC'd while the
eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener
to call ``time.sleep(0)`` -> ``hub.switch()``, which deadlocks because the hub cannot switch to
itself. The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception
ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves
the connection record's pool state indeterminate.
I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw
`context.session.query` calls. The wrapper properly closes the session after those calls, instead
of leaking it to GC. However, in case there are any remaining cases, e.g. because a
Neutron-framework path outside our control drops a session unclosed, or because a future change
reintroduces a code path that bypasses the ``using`` pattern -- we want a log line that points at
the leaking code path rather than just the in-hub finalizer stack that the existing "Exception
ignored in" trace gives us.
This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on
``sqlalchemy.pool.Pool``:
* A: ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every
connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``.
* A: ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``)
checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``. If yes, oslo.db is about
to deadlock; we log a WARNING containing both the captured checkout-time stack and the current
finalizer stack.
Default off, but enabled for DevStack CI. Enable per-deployment with::
[calico]
fairy_gc_diagnostics = True
in ``neutron.conf``. The per-checkout stack capture costs ~50-100us per checkout, so worth turning
off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic /
scale-test runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coutinhop
approved these changes
Jun 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DevStack runs neutron-server with
[DEFAULT] debug = True, which enables an ORM-execute listener inneutron.objects.basethat warns on everysession.execute(...)wheresession.in_transaction()is False:ORM session: SQL execution without transaction in progress, traceback:
...
It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are running in implicit-transaction mode (auto-begun and auto-discarded per statement) instead of the explicit reader/writer pattern. The warning is debug-only and doesn't affect correctness today, but it's real 2.0-readiness signal and very noisy in the journal during scale runs.
After PR #12930 removed our outer
CONTEXT_WRITER.usingwrapper from the mech driver's postcommit body, every rawcontext.session.query(...)inendpoints.WorkloadEndpointSyncer.get_extra_port_informationran with no enginefacade-recognised transaction active, and so tripped the listener. The PR-#12930 lesson stands: we can't restore a single broad wrapper, because the body also calls@retry_if_session_inactive- decorated upstream methods (self.db.get_subnetviaadd_port_gateways,self.db.get_security_groupsviaadd_port_sg_names) whose retry recovery for the_ensure_default_security_groupIntegrityError race is exactly what gets disabled by an outer transaction.Restructure
get_extra_port_informationinto three phases by transaction shape:Phase A -- raw
context.session.query(...)reads (fixed IPs, floating IPs, network, QoS rules) plusself.db._get_port_security_group_bindings(private upstream helper, no@retry_if_session_inactive). These get a singleCONTEXT_READER.using(context)block sosession.in_transaction()is True at execute time. Safe to wrap because none of these calls goes through a retry-decorated upstream method that would have its retry disabled.All five reads are now inlined into the
withblock. The QoS rule queries uselist(...)to force materialisation inside the reader; the other four are list-comprehensions that iterate eagerly before thewithexits.Phase B --
add_port_gatewaysandadd_port_sg_namescall into@retry_if_session_inactive-decorated upstream methods and stay outside Phase A's reader for the same reason PR networking-calico: stop wrapping postcommit hooks in CONTEXT_WRITER #12930 stripped the old wrapper: keep the inner retry decorator enabled. Each upstream call manages its own reader transaction internally, so the warning doesn't fire here either.Phase C --
add_port_interface_name(pure compute) andadd_port_project_data(Keystone-backed in-process cache) touch no DB; position relative to A/B is correctness-irrelevant.The QoS handling is also factored:
add_port_qosbecomes a pure static methodbuild_qos_controls(bw_rules, pr_rules)that takes already-materialised rule lists, and the rule fetch lives in the caller -- inside Phase A for the postcommit path, fromself._bulkfor the resync path. This removes the previous bulk-vs-postcommit branch from inside the method body and makes the consumer testable in isolation.The four tiny one-call helpers (
get_security_groups_for_port,get_fixed_ips_for_port,get_floating_ips_for_port,get_network_properties_for_port) are removed entirely; their bodies are inlined.Misleading docstrings on
add_port_gateways,add_port_sg_names, and the formeradd_port_qos-- left over from the pre-#12930 era when an outer_txn_from_context/CONTEXT_WRITER.usingwrapped the entire postcommit body -- claimed "This method assumes it's being called from within a database transaction and does not take out another one". That assumption was true then; it isn't now. Updated to describe the actual Phase A / Phase B placement.Tests: 185 pass (
make tox-caracal). No mech-driver semantics change for either the postcommit path or the resync (bulk) path.Release note: