fix(mcp): close stdio sessions on their owning loop to avoid cross-task cancel-scope error (#3379)#3392
Conversation
…sk cancel-scope error (bytedance#3379) Adopt an owner-task lifecycle for pooled MCP ClientSessions so each session is entered, initialized, and exited within a single asyncio task on its owning event loop. This eliminates the anyio "Attempted to exit cancel scope in a different task than it was entered in" RuntimeError that surfaced when stdio MCP tools were used via the sync tool wrapper (which spins up and tears down event loops across tasks). Also harden the pool lifecycle: - track in-flight session creation per (server, scope) to dedupe concurrent get_session() calls for the same key - make close_scope/close_server/close_all/close_all_sync cover both established entries and in-flight creations so sessions cannot be resurrected or leaked after close - handle cross-loop preemption of an in-flight creation by cancelling the stale owner task instead of only signalling it - define close_all_sync() semantics for a running loop on the current thread (signal-only, async completion) and route reset_mcp_tools_cache through a deterministic async close in that case
|
Triaging this from the issue side. Up front, so it's not hidden: I have a competing PR open for the same bug (#3384, the per-call route), so read my notes with that bias in mind. I'll keep it to facts. The root-cause analysis is right, and the owner-task model is a legitimate way to keep sessions persistent under anyio's same-task rule. It's essentially what hermes-agent does, and persistent MCP connections are the norm elsewhere too — claude-code memoizes one client per server, codex holds an Two things I hit while reviewing: 1. import asyncio, concurrent.futures
from unittest.mock import AsyncMock, MagicMock, patch
from deerflow.mcp.session_pool import MCPSessionPool
def fake_cm(*a, **k):
cm = MagicMock(); s = AsyncMock(); s.initialize = AsyncMock()
cm.__aenter__ = AsyncMock(return_value=s); cm.__aexit__ = AsyncMock(return_value=False)
return cm
async def main():
pool = MCPSessionPool()
with patch("langchain_mcp_adapters.sessions.create_session", side_effect=fake_cm):
await pool.get_session("s", "t1", {"transport": "stdio", "command": "x", "args": []}) # entry owned by THIS loop
ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
fut = ex.submit(asyncio.run, pool.close_all())
fut.result(timeout=8) # raises TimeoutError here; reset_mcp_tools_cache's .result() has no timeout -> hangs forever
asyncio.run(main())Run that and The branch exists specifically for the "reset while a loop is running" case, and 2. The sync-tool path eats most of the persistence benefit. To be fair to the case for persistence: the per-call cost is real. I measured roughly 0.4–0.7s per call for a Python stdio server (subprocess spawn + The real question is for the maintainers: is a stateful/high-frequency stdio server a load-bearing case today, or a future one? The shipped config has github and postgres, both disabled, no Playwright. If it's a future need, it might be cleaner to land persistence as a deliberate single-owner-loop feature rather than threading it through the existing multi-loop pool. If it's needed now, this PR is the right direction — the deadlock just needs sorting first. Root cause and the secondary fixes are solid regardless. Mostly flagging the deadlock so it doesn't bite later. |
There was a problem hiding this comment.
Pull request overview
This PR fixes MCP stdio session pooling teardown by switching to an “owner-task lifecycle” model: each pooled ClientSession is entered/initialized/exited within a dedicated owner task on its owning event loop, preventing anyio’s cross-task cancel-scope RuntimeError (issue #3379). It also adds in-flight creation de-duplication and broadens close paths to cover both established and in-flight sessions.
Changes:
- Reworked
MCPSessionPoolto run session lifecycle inside an owner task, signal-based shutdown, and in-flight creation de-dupe. - Hardened close behavior across
close_scope/close_server/close_all/close_all_sync, including in-flight cancellation. - Added regression tests for cross-task/loop teardown, cancellation mid-init, in-flight close behavior, and
close_all_sync()behavior from a running loop.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
backend/packages/harness/deerflow/mcp/session_pool.py |
Implements owner-task session lifecycle, adds _inflight registry, and revises close/eviction behavior to avoid cross-task cancel-scope exits. |
backend/packages/harness/deerflow/mcp/cache.py |
Updates cache reset to account for new close_all_sync() semantics and avoid teardown issues when resetting cached MCP tools. |
backend/tests/test_mcp_session_pool.py |
Adds comprehensive regression coverage for cross-task/loop teardown and in-flight/cancellation scenarios related to #3379. |
| else: | ||
| # Owning loop exists but is idle; drive it to completion. | ||
| loop.run_until_complete(self._shutdown(close_evt, task, cancel)) |
| try: | ||
| running_loop = asyncio.get_running_loop() | ||
| except RuntimeError: | ||
| running_loop = None | ||
|
|
||
| if running_loop is not None: | ||
| # Inside a running loop, close_all_sync() can only *signal* teardown | ||
| # of sessions owned by this loop and would complete asynchronously. | ||
| # Drive a deterministic close on a separate thread so sessions are | ||
| # fully torn down before reset_session_pool() drops the pool. | ||
| import concurrent.futures | ||
|
|
||
| with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor: | ||
| executor.submit(asyncio.run, pool.close_all()).result() | ||
| else: | ||
| pool.close_all_sync() |
|
Closing my own per-call PR (#3384). Going per-call would regress #3054 — Playwright loses its browser context between calls — which @WillemJiang rightly flagged. This owner-task approach is the one that holds both constraints at once (#3054 persistence + #3379 same-task close), so I'm backing it. The one blocker from my earlier review is still the |
|
Updated this PR to address the remaining reset deadlock called out in review. What changed:
Validation:
|
Thanks @fancyboi999. |
|
@18062706139fcz Please check out the review comments below and fix the lint error.
_CloseTrackingCm is defined twice in the test file — once locally inside test_close_all_sync_from_running_loop_does_not_wait_on_itself (line ~1105) and again at module scope (line ~1151). The module-scope definition shadows the local one for subsequent tests. This works but is confusing and should be deduplicated.
At session_pool.py line ~433–450, when the owning loop "exists but is neither the current loop nor running," the code falls back to call_soon_threadsafe and returns without waiting. The comment says "not expected in practice," but if it ever does happen, the session leaks until the loop runs again. Consider logging at warning level instead of debug to make this discoverable in production:
The new reset_mcp_tools_cache_async function is exported and tested, but appears to have no production call site in this PR. If it's being added speculatively, consider noting that in the PR description. If there's a known future caller, mentioning it would help reviewers understand the intent. |
@WillemJiang Thanks for the review. I pushed an update addressing the remaining feedback:
Validation:
|
Summary
Fixes #3379 — using a stdio MCP tool would raise:
The root cause is an event-loop / task lifecycle mismatch in the MCP session pool. This PR reworks the pool around an owner-task lifecycle model so every pooled
ClientSessionis entered, initialized, and exited within a single asyncio task on its owning loop, and hardens every close path that the new model touches.The problem
ClientSessionis implemented on top of ananyiotask group / cancel scope.anyiorequires that a cancel scope be exited in the same asyncio task that entered it; otherwise it raises theRuntimeErrorabove.The previous pool stored
(session, loop)and re-entered/closed the session's async context manager from whatever task happened to callclose_*. That works for a long-lived async caller, but the sync tool path (make_sync_tool_wrapper→asyncio.run(...)) repeatedly creates and tears down event loops on different tasks/threads. When a pooled session created under one loop/task was later exited from a different task,anyiotripped the cross-task cancel-scope check.Secondary issues uncovered while fixing this:
get_session()mid-initialization could leak the owner coroutine and never close the session (CancelledErroris aBaseException, so it slipped pastexcept Exception).get_session()calls for the same(server, scope)could each build a separate session.close_*only cleaned up established entries, so an in-flight creation could "resurrect" a session after close, or leave a creation task hung oninitialize().close_all_sync()from a loop running on the current thread wouldrun_coroutine_threadsafe(...).result(timeout)on itself → block for the full timeout and return before the session was actually closed.The fix and why
Owner-task lifecycle (core fix). Each session is owned by a dedicated
_run_sessiontask.__aenter__/initialize()/__aexit__all run inside that one task on its loop. Closing is now done by signalling the owner (close_event.set()/task.cancel()) instead of re-entering the context manager from a foreign task. The__aexit__always runs in the owning task'sfinally, so the anyio cancel scope is never exited cross-task. This directly removes the #3379 RuntimeError.In-flight de-duplication. A per-
(server, scope)_inflightregistry lets concurrent callers for the same key await a single shared creation instead of each building their own session.Unified close paths.
close_scope/close_server/close_all/close_all_syncnow cover both_entriesand_inflight. In-flight creations (which may be blocked ininitialize()and therefore deaf toclose_event) are cancelled so they can't be resurrected or hang.Cross-loop preemption. When an in-flight creation belongs to a different/closed loop, it is treated as stale: the stale owner task is cancelled and the current caller becomes the new creator, eliminating a previously possible
AssertionErrorand a hung owner task.close_all_sync()running-loop semantics. Synchronously waiting on a coroutine scheduled onto the loop that is currently running on this very thread is a self-deadlock. The function now detects that case and only signals teardown (completing asynchronously once control returns to the loop); its docstring states this contract. For callers that need a deterministic close from inside a running loop — notablyreset_mcp_tools_cache()— we driveawait close_all()on a dedicated worker thread so sessions are fully torn down before the pool is dropped.Alternatives considered
anyio.from_thread/portal. Adds an anyio portal dependency and still fights the framework's task affinity. The owner-task model expresses the constraint directly.close_all_sync()regardless. Impossible without deadlock when the owning loop is the current running loop; hence the signal-only contract plus the worker-thread deterministic path forreset_mcp_tools_cache().Compatibility / impact
get_session,close_scope,close_server,close_all,close_all_sync,get_session_pool,reset_session_poolkeep their signatures._entriesnow carries owner task + close event;_context_managersremoved;_inflightadded). The only in-repo code that touched private fields is the test suite, which is updated here. No other production module depends on these internals.mcp/tools.py(session reuse) andmcp/cache.py(cache reset); both keep working through the public API.Extensibility / maintenance notes
reset_mcp_tools_cache(); if more teardown call sites appear, that logic is a good candidate to consolidate into a pool helper.Futureper in-flight creation — traded for concurrency correctness and reliable resource cleanup.Tests
backend/tests/test_mcp_session_pool.pycovering: cross-task close, cross-loop close, LRU eviction, in-flight cancellation, init-failure cleanup, same-key concurrency dedupe, cross-loop preemption of a blocked in-flight creation, andclose_all_sync()from a running loop.pytest tests/test_mcp_session_pool.py→ 29 passed.-k "mcp or sync_tool or session or cache"→ 148 passed, 1 skipped.ruff checkclean on the changed files.