networking-calico: eliminate "SQL execution without transaction" warnings by nelljerram · Pull Request #13015 · projectcalico/calico

nelljerram · 2026-06-17T17:31:14Z

DevStack runs neutron-server with [DEFAULT] debug = True, which enables an ORM-execute listener in neutron.objects.base that warns on every session.execute(...) where session.in_transaction() is False:

ORM session: SQL execution without transaction in progress, traceback:
...

It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are running in implicit-transaction mode (auto-begun and auto-discarded per statement) instead of the explicit reader/writer pattern. The warning is debug-only and doesn't affect correctness today, but it's real 2.0-readiness signal and very noisy in the journal during scale runs.

After PR #12930 removed our outer CONTEXT_WRITER.using wrapper from the mech driver's postcommit body, every raw context.session.query(...) in endpoints.WorkloadEndpointSyncer.get_extra_port_information ran with no enginefacade-recognised transaction active, and so tripped the listener. The PR-#12930 lesson stands: we can't restore a single broad wrapper, because the body also calls @retry_if_session_inactive- decorated upstream methods (self.db.get_subnet via add_port_gateways, self.db.get_security_groups via add_port_sg_names) whose retry recovery for the _ensure_default_security_group IntegrityError race is exactly what gets disabled by an outer transaction.

Restructure get_extra_port_information into three phases by transaction shape:

Phase A -- raw context.session.query(...) reads (fixed IPs, floating IPs, network, QoS rules) plus self.db._get_port_security_group_bindings (private upstream helper, no @retry_if_session_inactive). These get a single CONTEXT_READER.using(context) block so session.in_transaction() is True at execute time. Safe to wrap because none of these calls goes through a retry-decorated upstream method that would have its retry disabled.

All five reads are now inlined into the with block. The QoS rule queries use list(...) to force materialisation inside the reader; the other four are list-comprehensions that iterate eagerly before the with exits.
Phase B -- add_port_gateways and add_port_sg_names call into @retry_if_session_inactive-decorated upstream methods and stay outside Phase A's reader for the same reason PR networking-calico: stop wrapping postcommit hooks in CONTEXT_WRITER #12930 stripped the old wrapper: keep the inner retry decorator enabled. Each upstream call manages its own reader transaction internally, so the warning doesn't fire here either.
Phase C -- add_port_interface_name (pure compute) and add_port_project_data (Keystone-backed in-process cache) touch no DB; position relative to A/B is correctness-irrelevant.

The QoS handling is also factored: add_port_qos becomes a pure static method build_qos_controls(bw_rules, pr_rules) that takes already-materialised rule lists, and the rule fetch lives in the caller -- inside Phase A for the postcommit path, from self._bulk for the resync path. This removes the previous bulk-vs-postcommit branch from inside the method body and makes the consumer testable in isolation.

The four tiny one-call helpers (get_security_groups_for_port, get_fixed_ips_for_port, get_floating_ips_for_port, get_network_properties_for_port) are removed entirely; their bodies are inlined.

Misleading docstrings on add_port_gateways, add_port_sg_names, and the former add_port_qos -- left over from the pre-#12930 era when an outer _txn_from_context / CONTEXT_WRITER.using wrapped the entire postcommit body -- claimed "This method assumes it's being called from within a database transaction and does not take out another one". That assumption was true then; it isn't now. Updated to describe the actual Phase A / Phase B placement.

Tests: 185 pass (make tox-caracal). No mech-driver semantics change for either the postcommit path or the resync (bulk) path.

Release note:

TBD

Copilot

Pull request overview

This PR reduces SQLAlchemy “SQL execution without transaction” debug warnings in the OpenStack Calico ML2 driver by ensuring raw SQLAlchemy session.query(...) reads run under an explicit CONTEXT_READER transaction, while keeping calls that rely on Neutron’s @retry_if_session_inactive behavior outside any outer transaction.

Changes:

Wrap raw per-port DB reads in get_extra_port_information() with db_api.CONTEXT_READER.using(context) to ensure session.in_transaction() is true at execute time.
Refactor QoS handling: inline QoS rule fetch in callers and convert add_port_qos into a pure build_qos_controls(bw_rules, pr_rules) helper.
Inline and remove several small helper methods that previously performed raw per-port queries.

…ings DevStack runs neutron-server with ``[DEFAULT] debug = True``, which enables an ORM-execute listener in ``neutron.objects.base`` that warns on every ``session.execute(...)`` where ``session.in_transaction()`` is False: ORM session: SQL execution without transaction in progress, traceback: ... It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are running in implicit-transaction mode (auto-begun and auto-discarded per statement) instead of the explicit reader/writer pattern. The warning is debug-only and doesn't affect correctness today, but it's real 2.0-readiness signal and very noisy in the journal during scale runs. After PR projectcalico#12930 removed our outer ``CONTEXT_WRITER.using`` wrapper from the mech driver's postcommit body, every raw ``context.session.query(...)`` in ``endpoints.WorkloadEndpointSyncer.get_extra_port_information`` ran with no enginefacade-recognised transaction active, and so tripped the listener. The PR-projectcalico#12930 lesson stands: we can't restore a single broad wrapper, because the body also calls ``@retry_if_session_inactive``- decorated upstream methods (``self.db.get_subnet`` via ``add_port_gateways``, ``self.db.get_security_groups`` via ``add_port_sg_names``) whose retry recovery for the ``_ensure_default_security_group`` IntegrityError race is exactly what gets disabled by an outer transaction. Restructure ``get_extra_port_information`` into three phases by transaction shape: * Phase A -- raw ``context.session.query(...)`` reads (fixed IPs, floating IPs, network, QoS rules) plus ``self.db._get_port_security_group_bindings`` (private upstream helper, no ``@retry_if_session_inactive``). These get a single ``CONTEXT_READER.using(context)`` block so ``session.in_transaction()`` is True at execute time. Safe to wrap because none of these calls goes through a retry-decorated upstream method that would have its retry disabled. All five reads are now inlined into the ``with`` block. The QoS rule queries use ``list(...)`` to force materialisation inside the reader; the other four are list-comprehensions that iterate eagerly before the ``with`` exits. * Phase B -- ``add_port_gateways`` and ``add_port_sg_names`` call into ``@retry_if_session_inactive``-decorated upstream methods and stay outside Phase A's reader for the same reason PR projectcalico#12930 stripped the old wrapper: keep the inner retry decorator enabled. Each upstream call manages its own reader transaction internally, so the warning doesn't fire here either. * Phase C -- ``add_port_interface_name`` (pure compute) and ``add_port_project_data`` (Keystone-backed in-process cache) touch no DB; position relative to A/B is correctness-irrelevant. The QoS handling is also factored: ``add_port_qos`` becomes a pure static method ``build_qos_controls(bw_rules, pr_rules)`` that takes already-materialised rule lists, and the rule fetch lives in the caller -- inside Phase A for the postcommit path, from ``self._bulk`` for the resync path. This removes the previous bulk-vs-postcommit branch from inside the method body and makes the consumer testable in isolation. The four tiny one-call helpers (``get_security_groups_for_port``, ``get_fixed_ips_for_port``, ``get_floating_ips_for_port``, ``get_network_properties_for_port``) are removed entirely; their bodies are inlined. Misleading docstrings on ``add_port_gateways``, ``add_port_sg_names``, and the former ``add_port_qos`` -- left over from the pre-projectcalico#12930 era when an outer ``_txn_from_context`` / ``CONTEXT_WRITER.using`` wrapped the entire postcommit body -- claimed *"This method assumes it's being called from within a database transaction and does not take out another one"*. That assumption was true then; it isn't now. Updated to describe the actual Phase A / Phase B placement. Tests: 185 pass (``make tox-caracal``). No mech-driver semantics change for either the postcommit path or the resync (bulk) path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move ``port_extra.qos = self.build_qos_controls(bw_rules, pr_rules)`` from after the ``CONTEXT_READER.using`` block to inside it. Building the QoSControls dict iterates each rule and reads its ``direction``, ``max_kbps``, ``max_burst_kbps`` / ``max_kpps`` columns; doing this after the reader exits relies on the rule ORM objects still being readable in detached state -- which holds today (the columns are eager-loaded and oslo.db's reader-mode default leaves them in place) but would tie our correctness to oslo.db's ``expire_on_commit`` / ``rollback_reader_sessions`` internals. Doing it inside the reader, while the rules are still attached, sidesteps that question. With ``build_qos_controls`` now inside the reader, the ``list(...)`` wrapping that previously materialised the queries is no longer needed -- the lazy ``Query`` objects are iterated by ``build_qos_controls`` while the transaction is still active, so their SQL fires safely. ``build_qos_controls`` becomes shape-symmetric across paths: the postcommit path passes lazy Query objects, the bulk path passes plain Python lists, and either iterates correctly. Tests: tox-caracal passes (185 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

We have sometimes been seeing this exception in journalctl.txt / neutron-server.log: Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290> Traceback (most recent call last): File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda> and _finalize_fairy( File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy connection_record.checkin() File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin pool.dispatch.checkin(connection, self) File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__ fn(*args, **kw) File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield time.sleep(0) File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep hub.switch() File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch return self.greenlet.switch() TimeoutError: timed out This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC kicks in on eventlet's "hub" greenlet. An sqlalchemy connection fairy that is GC'd while the eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener to call ``time.sleep(0)`` -> ``hub.switch()``, which deadlocks because the hub cannot switch to itself. The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves the connection record's pool state indeterminate. I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw `context.session.query` calls. The wrapper properly closes the session after those calls, instead of leaking it to GC. However, in case there are any remaining cases, e.g. because a Neutron-framework path outside our control drops a session unclosed, or because a future change reintroduces a code path that bypasses the ``using`` pattern -- we want a log line that points at the leaking code path rather than just the in-hub finalizer stack that the existing "Exception ignored in" trace gives us. This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on ``sqlalchemy.pool.Pool``: * A: ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``. * A: ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``) checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``. If yes, oslo.db is about to deadlock; we log a WARNING containing both the captured checkout-time stack and the current finalizer stack. Default off, but enabled for DevStack CI. Enable per-deployment with:: [calico] fairy_gc_diagnostics = True in ``neutron.conf``. The per-checkout stack capture costs ~50-100us per checkout, so worth turning off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic / scale-test runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 17, 2026 17:31

nelljerram requested a review from a team as a code owner June 17, 2026 17:31

marvin-tigera added this to the Calico v3.33.0 milestone Jun 17, 2026

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 17, 2026

Copilot started reviewing on behalf of nelljerram June 17, 2026 17:31 View session

github-actions Bot added the cherry-pick-candidate label Jun 17, 2026

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py Outdated

nelljerram force-pushed the eliminate-orm-warnings branch from 868f225 to 6992a84 Compare June 17, 2026 21:05

nelljerram and others added 2 commits June 18, 2026 11:34

Fix awkward wording in comment

ad3e847

nelljerram requested a review from Copilot June 18, 2026 10:45

Copilot started reviewing on behalf of nelljerram June 18, 2026 10:45 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py Outdated

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py

nelljerram and others added 2 commits June 18, 2026 11:51

Potential fix for pull request finding

191e77e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix imprecise comment about build_qos_controls being pure compute

c088407

nelljerram requested a review from Copilot June 18, 2026 10:56

Copilot started reviewing on behalf of nelljerram June 18, 2026 10:57 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

nelljerram mentioned this pull request Jun 18, 2026

networking-calico: diagnostic instrumentation for fairy-GC-in-hub race #13022

Merged

coutinhop approved these changes Jun 22, 2026

View reviewed changes

nelljerram merged commit 59a786e into projectcalico:master Jun 23, 2026
3 checks passed

nelljerram deleted the eliminate-orm-warnings branch June 23, 2026 09:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

networking-calico: eliminate "SQL execution without transaction" warnings#13015

networking-calico: eliminate "SQL execution without transaction" warnings#13015
nelljerram merged 5 commits into
projectcalico:masterfrom
nelljerram:eliminate-orm-warnings

nelljerram commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

nelljerram commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants