Skip to content

networking-calico: eliminate "SQL execution without transaction" warnings#13015

Merged
nelljerram merged 5 commits into
projectcalico:masterfrom
nelljerram:eliminate-orm-warnings
Jun 23, 2026
Merged

networking-calico: eliminate "SQL execution without transaction" warnings#13015
nelljerram merged 5 commits into
projectcalico:masterfrom
nelljerram:eliminate-orm-warnings

Conversation

@nelljerram

Copy link
Copy Markdown
Member

DevStack runs neutron-server with [DEFAULT] debug = True, which enables an ORM-execute listener in neutron.objects.base that warns on every session.execute(...) where session.in_transaction() is False:

ORM session: SQL execution without transaction in progress, traceback:
...

It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are running in implicit-transaction mode (auto-begun and auto-discarded per statement) instead of the explicit reader/writer pattern. The warning is debug-only and doesn't affect correctness today, but it's real 2.0-readiness signal and very noisy in the journal during scale runs.

After PR #12930 removed our outer CONTEXT_WRITER.using wrapper from the mech driver's postcommit body, every raw context.session.query(...) in endpoints.WorkloadEndpointSyncer.get_extra_port_information ran with no enginefacade-recognised transaction active, and so tripped the listener. The PR-#12930 lesson stands: we can't restore a single broad wrapper, because the body also calls @retry_if_session_inactive- decorated upstream methods (self.db.get_subnet via add_port_gateways, self.db.get_security_groups via add_port_sg_names) whose retry recovery for the _ensure_default_security_group IntegrityError race is exactly what gets disabled by an outer transaction.

Restructure get_extra_port_information into three phases by transaction shape:

  • Phase A -- raw context.session.query(...) reads (fixed IPs, floating IPs, network, QoS rules) plus self.db._get_port_security_group_bindings (private upstream helper, no @retry_if_session_inactive). These get a single CONTEXT_READER.using(context) block so session.in_transaction() is True at execute time. Safe to wrap because none of these calls goes through a retry-decorated upstream method that would have its retry disabled.

    All five reads are now inlined into the with block. The QoS rule queries use list(...) to force materialisation inside the reader; the other four are list-comprehensions that iterate eagerly before the with exits.

  • Phase B -- add_port_gateways and add_port_sg_names call into @retry_if_session_inactive-decorated upstream methods and stay outside Phase A's reader for the same reason PR networking-calico: stop wrapping postcommit hooks in CONTEXT_WRITER #12930 stripped the old wrapper: keep the inner retry decorator enabled. Each upstream call manages its own reader transaction internally, so the warning doesn't fire here either.

  • Phase C -- add_port_interface_name (pure compute) and add_port_project_data (Keystone-backed in-process cache) touch no DB; position relative to A/B is correctness-irrelevant.

The QoS handling is also factored: add_port_qos becomes a pure static method build_qos_controls(bw_rules, pr_rules) that takes already-materialised rule lists, and the rule fetch lives in the caller -- inside Phase A for the postcommit path, from self._bulk for the resync path. This removes the previous bulk-vs-postcommit branch from inside the method body and makes the consumer testable in isolation.

The four tiny one-call helpers (get_security_groups_for_port, get_fixed_ips_for_port, get_floating_ips_for_port, get_network_properties_for_port) are removed entirely; their bodies are inlined.

Misleading docstrings on add_port_gateways, add_port_sg_names, and the former add_port_qos -- left over from the pre-#12930 era when an outer _txn_from_context / CONTEXT_WRITER.using wrapped the entire postcommit body -- claimed "This method assumes it's being called from within a database transaction and does not take out another one". That assumption was true then; it isn't now. Updated to describe the actual Phase A / Phase B placement.

Tests: 185 pass (make tox-caracal). No mech-driver semantics change for either the postcommit path or the resync (bulk) path.

Release note:

TBD

Copilot AI review requested due to automatic review settings June 17, 2026 17:31
@nelljerram nelljerram requested a review from a team as a code owner June 17, 2026 17:31
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Jun 17, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 17, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented cherry-pick-candidate labels Jun 17, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented cherry-pick-candidate and removed docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact labels Jun 17, 2026
@nelljerram nelljerram added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact and removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented cherry-pick-candidate labels Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces SQLAlchemy “SQL execution without transaction” debug warnings in the OpenStack Calico ML2 driver by ensuring raw SQLAlchemy session.query(...) reads run under an explicit CONTEXT_READER transaction, while keeping calls that rely on Neutron’s @retry_if_session_inactive behavior outside any outer transaction.

Changes:

  • Wrap raw per-port DB reads in get_extra_port_information() with db_api.CONTEXT_READER.using(context) to ensure session.in_transaction() is true at execute time.
  • Refactor QoS handling: inline QoS rule fetch in callers and convert add_port_qos into a pure build_qos_controls(bw_rules, pr_rules) helper.
  • Inline and remove several small helper methods that previously performed raw per-port queries.

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py Outdated
…ings

DevStack runs neutron-server with ``[DEFAULT] debug = True``, which
enables an ORM-execute listener in ``neutron.objects.base`` that warns
on every ``session.execute(...)`` where ``session.in_transaction()`` is
False:

  ORM session: SQL execution without transaction in progress, traceback:
  ...

It's a SQLAlchemy-2.0-compliance tripwire: it flags statements that are
running in implicit-transaction mode (auto-begun and auto-discarded per
statement) instead of the explicit reader/writer pattern.  The warning
is debug-only and doesn't affect correctness today, but it's real
2.0-readiness signal and very noisy in the journal during scale runs.

After PR projectcalico#12930 removed our outer ``CONTEXT_WRITER.using`` wrapper from
the mech driver's postcommit body, every raw ``context.session.query(...)``
in ``endpoints.WorkloadEndpointSyncer.get_extra_port_information`` ran
with no enginefacade-recognised transaction active, and so tripped the
listener.  The PR-projectcalico#12930 lesson stands: we can't restore a single broad
wrapper, because the body also calls ``@retry_if_session_inactive``-
decorated upstream methods (``self.db.get_subnet`` via
``add_port_gateways``, ``self.db.get_security_groups`` via
``add_port_sg_names``) whose retry recovery for the
``_ensure_default_security_group`` IntegrityError race is exactly what
gets disabled by an outer transaction.

Restructure ``get_extra_port_information`` into three phases by
transaction shape:

* Phase A -- raw ``context.session.query(...)`` reads (fixed IPs,
  floating IPs, network, QoS rules) plus
  ``self.db._get_port_security_group_bindings`` (private upstream
  helper, no ``@retry_if_session_inactive``).  These get a single
  ``CONTEXT_READER.using(context)`` block so ``session.in_transaction()``
  is True at execute time.  Safe to wrap because none of these calls
  goes through a retry-decorated upstream method that would have its
  retry disabled.

  All five reads are now inlined into the ``with`` block.  The QoS
  rule queries use ``list(...)`` to force materialisation inside the
  reader; the other four are list-comprehensions that iterate eagerly
  before the ``with`` exits.

* Phase B -- ``add_port_gateways`` and ``add_port_sg_names`` call into
  ``@retry_if_session_inactive``-decorated upstream methods and stay
  outside Phase A's reader for the same reason PR projectcalico#12930 stripped the
  old wrapper: keep the inner retry decorator enabled.  Each upstream
  call manages its own reader transaction internally, so the warning
  doesn't fire here either.

* Phase C -- ``add_port_interface_name`` (pure compute) and
  ``add_port_project_data`` (Keystone-backed in-process cache) touch
  no DB; position relative to A/B is correctness-irrelevant.

The QoS handling is also factored: ``add_port_qos`` becomes a pure
static method ``build_qos_controls(bw_rules, pr_rules)`` that takes
already-materialised rule lists, and the rule fetch lives in the
caller -- inside Phase A for the postcommit path, from ``self._bulk``
for the resync path.  This removes the previous bulk-vs-postcommit
branch from inside the method body and makes the consumer testable in
isolation.

The four tiny one-call helpers (``get_security_groups_for_port``,
``get_fixed_ips_for_port``, ``get_floating_ips_for_port``,
``get_network_properties_for_port``) are removed entirely; their
bodies are inlined.

Misleading docstrings on ``add_port_gateways``, ``add_port_sg_names``,
and the former ``add_port_qos`` -- left over from the pre-projectcalico#12930 era
when an outer ``_txn_from_context`` / ``CONTEXT_WRITER.using`` wrapped
the entire postcommit body -- claimed *"This method assumes it's being
called from within a database transaction and does not take out
another one"*.  That assumption was true then; it isn't now.  Updated
to describe the actual Phase A / Phase B placement.

Tests: 185 pass (``make tox-caracal``).  No mech-driver semantics
change for either the postcommit path or the resync (bulk) path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram force-pushed the eliminate-orm-warnings branch from 868f225 to 6992a84 Compare June 17, 2026 21:05
nelljerram and others added 2 commits June 18, 2026 11:34
Move ``port_extra.qos = self.build_qos_controls(bw_rules, pr_rules)``
from after the ``CONTEXT_READER.using`` block to inside it.  Building
the QoSControls dict iterates each rule and reads its ``direction``,
``max_kbps``, ``max_burst_kbps`` / ``max_kpps`` columns; doing this
after the reader exits relies on the rule ORM objects still being
readable in detached state -- which holds today (the columns are
eager-loaded and oslo.db's reader-mode default leaves them in place)
but would tie our correctness to oslo.db's
``expire_on_commit`` / ``rollback_reader_sessions`` internals.  Doing
it inside the reader, while the rules are still attached, sidesteps
that question.

With ``build_qos_controls`` now inside the reader, the ``list(...)``
wrapping that previously materialised the queries is no longer
needed -- the lazy ``Query`` objects are iterated by
``build_qos_controls`` while the transaction is still active, so
their SQL fires safely.  ``build_qos_controls`` becomes
shape-symmetric across paths: the postcommit path passes lazy Query
objects, the bulk path passes plain Python lists, and either
iterates correctly.

Tests: tox-caracal passes (185 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread networking-calico/networking_calico/plugins/ml2/drivers/calico/endpoints.py Outdated
nelljerram and others added 2 commits June 18, 2026 11:51
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

nelljerram added a commit to nelljerram/calico that referenced this pull request Jun 18, 2026
We have sometimes been seeing this exception in journalctl.txt / neutron-server.log:

Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda> at 0x72810792e290>
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 544, in checkin
    pool.dispatch.checkin(connection, self)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 346, in __call__
    fn(*args, **kw)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_db/sqlalchemy/engines.py", line 52, in _thread_yield
    time.sleep(0)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 37, in sleep
    hub.switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
TimeoutError: timed out

This occurs when a Neutron DB context uses a session (for some reason) and then leaks it, and GC
kicks in on eventlet's "hub" greenlet.  An sqlalchemy connection fairy that is GC'd while the
eventlet hub greenlet is the current greenlet triggers oslo.db's ``_thread_yield`` checkin listener
to call ``time.sleep(0)`` -> ``hub.switch()``, which deadlocks because the hub cannot switch to
itself.  The ``TimeoutError`` that eventually fires is silently swallowed by Python's "Exception
ignored in" finalizer-exception mechanism, but each occurrence wedges the hub for ~10s and leaves
the connection record's pool state indeterminate.

I _think_ projectcalico#13015 fixes the primary root cause of this, by adding a transaction wrapper around raw
`context.session.query` calls.  The wrapper properly closes the session after those calls, instead
of leaking it to GC.  However, in case there are any remaining cases, e.g. because a
Neutron-framework path outside our control drops a session unclosed, or because a future change
reintroduces a code path that bypasses the ``using`` pattern -- we want a log line that points at
the leaking code path rather than just the in-hub finalizer stack that the existing "Exception
ignored in" trace gives us.

This commit adds opt-in diagnostics that installs two SQLAlchemy event listeners on
``sqlalchemy.pool.Pool``:

* A: ``checkout`` listener captures ``traceback.format_stack()`` at the moment of every
  connection-pool checkout, stashing it on ``connection_record.info["calico_checkout_stack"]``.

* A: ``checkin`` listener with ``insert=True`` (prepends ahead of oslo.db's ``_thread_yield``)
  checks whether ``eventlet.greenthread.getcurrent() is hub.greenlet``.  If yes, oslo.db is about
  to deadlock; we log a WARNING containing both the captured checkout-time stack and the current
  finalizer stack.

Default off, but enabled for DevStack CI.  Enable per-deployment with::

    [calico]
    fairy_gc_diagnostics = True

in ``neutron.conf``.  The per-checkout stack capture costs ~50-100us per checkout, so worth turning
off in normal operation once the issue is diagnosed; left on indefinitely for diagnostic /
scale-test runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nelljerram nelljerram merged commit 59a786e into projectcalico:master Jun 23, 2026
3 checks passed
@nelljerram nelljerram deleted the eliminate-orm-warnings branch June 23, 2026 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants