fix(mcp): close stdio sessions on their owning loop to avoid cross-task cancel-scope error (#3379) by 18062706139fcz · Pull Request #3392 · bytedance/deer-flow

18062706139fcz · 2026-06-04T08:39:01Z

Summary

Fixes #3379 — using a stdio MCP tool would raise:

RuntimeError: Attempted to exit cancel scope in a different task than it was entered in

The root cause is an event-loop / task lifecycle mismatch in the MCP session pool. This PR reworks the pool around an owner-task lifecycle model so every pooled ClientSession is entered, initialized, and exited within a single asyncio task on its owning loop, and hardens every close path that the new model touches.

The problem

ClientSession is implemented on top of an anyio task group / cancel scope. anyio requires that a cancel scope be exited in the same asyncio task that entered it; otherwise it raises the RuntimeError above.

The previous pool stored (session, loop) and re-entered/closed the session's async context manager from whatever task happened to call close_*. That works for a long-lived async caller, but the sync tool path (make_sync_tool_wrapper → asyncio.run(...)) repeatedly creates and tears down event loops on different tasks/threads. When a pooled session created under one loop/task was later exited from a different task, anyio tripped the cross-task cancel-scope check.
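The failure mode can be reproduced in miniature without anyio. The sketch below uses a hypothetical `TaskBoundScope` class as a stand-in for an anyio cancel scope (it is not anyio's implementation; it only mimics the same-task check) and exits it from a different task than the one that entered it.

```python
import asyncio


class TaskBoundScope:
    """Hypothetical stand-in for an anyio cancel scope: remembers the entering task."""

    async def __aenter__(self):
        self._owner = asyncio.current_task()
        return self

    async def __aexit__(self, *exc_info):
        # Mimics the same-task rule described above.
        if asyncio.current_task() is not self._owner:
            raise RuntimeError("exited in a different task than it was entered in")
        return False


async def exit_from_other_task() -> str:
    scope = TaskBoundScope()
    # Enter in one task ...
    await asyncio.create_task(scope.__aenter__())
    # ... and exit in another, as a close_* call from a foreign task would.
    try:
        await asyncio.create_task(scope.__aexit__(None, None, None))
    except RuntimeError as exc:
        return str(exc)
    return "no error"


result = asyncio.run(exit_from_other_task())
```
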

Secondary issues uncovered while fixing this:

A cancellation of get_session() mid-initialization could leak the owner coroutine and never close the session (CancelledError is a BaseException, so it slipped past except Exception).
Concurrent get_session() calls for the same (server, scope) could each build a separate session.
close_* only cleaned up established entries, so an in-flight creation could "resurrect" a session after close, or leave a creation task hung on initialize().
Calling close_all_sync() from a loop running on the current thread would call run_coroutine_threadsafe(...).result(timeout) against that same loop, so it blocked for the full timeout and returned before the session was actually closed.
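The first item in the list above relies on a plain Python fact: since Python 3.8, asyncio.CancelledError derives from BaseException rather than Exception. A minimal sketch, where `create_session_with_cleanup` is a hypothetical name and `asyncio.sleep` stands in for a slow initialize():

```python
import asyncio


async def create_session_with_cleanup(log: list) -> None:
    try:
        await asyncio.sleep(10)  # stands in for a slow initialize()
    except Exception:
        log.append("except Exception ran")  # not reached on cancellation
    except BaseException:
        log.append("except BaseException ran")  # CancelledError lands here
        raise


async def main() -> list:
    log: list = []
    task = asyncio.create_task(create_session_with_cleanup(log))
    await asyncio.sleep(0)  # let the task start and block in sleep()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("caller saw CancelledError")
    return log


log = asyncio.run(main())
```

Cleanup that must run on cancellation therefore belongs in a `finally` block or an `except BaseException` handler that re-raises.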

The fix and why

Owner-task lifecycle (core fix). Each session is owned by a dedicated _run_session task. __aenter__ / initialize() / __aexit__ all run inside that one task on its loop. Closing is now done by signalling the owner (close_event.set() / task.cancel()) instead of re-entering the context manager from a foreign task. The __aexit__ always runs in the owning task's finally, so the anyio cancel scope is never exited cross-task. This directly removes the #3379 RuntimeError.
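A minimal sketch of the owner-task idea, assuming nothing about the real pool beyond what is described above. `RecordingSession`, `run_session`, and the `ready` future are hypothetical names used for illustration; error handling and the `task.cancel()` path are omitted.

```python
import asyncio


class RecordingSession:
    """Hypothetical stand-in for ClientSession; records which task enters and exits."""

    def __init__(self):
        self.entered_in = None
        self.exited_in = None

    async def __aenter__(self):
        self.entered_in = asyncio.current_task()
        return self

    async def __aexit__(self, *exc_info):
        self.exited_in = asyncio.current_task()
        return False


async def run_session(session, ready, close_event):
    """Owner task: enter, publish the session, wait for the close signal, exit."""
    async with session as s:
        ready.set_result(s)
        await close_event.wait()


async def main():
    session = RecordingSession()
    ready = asyncio.get_running_loop().create_future()
    close_event = asyncio.Event()
    owner = asyncio.create_task(run_session(session, ready, close_event))
    await ready  # the caller receives the session from the owner task

    async def closer():
        # A different task only signals; it never calls __aexit__ itself.
        close_event.set()

    await asyncio.create_task(closer())
    await owner
    return session.entered_in is owner, session.exited_in is owner


entered_in_owner, exited_in_owner = asyncio.run(main())
```

Because the closing task only sets the event, the context manager's exit always runs in the task that entered it.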

In-flight de-duplication. A per-(server, scope) _inflight registry lets concurrent callers for the same key await a single shared creation instead of each building their own session.
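One way such a registry could look, as a sketch only: `DedupingPool` is a hypothetical class, it tracks the in-flight creation as a Task rather than whatever the real pool stores, and `asyncio.sleep` stands in for spawning and initializing a session.

```python
import asyncio


class DedupingPool:
    """Hypothetical sketch: one shared creation per (server, scope) key."""

    def __init__(self):
        self._entries = {}
        self._inflight = {}
        self.created = 0

    async def _create_and_store(self, key):
        try:
            self.created += 1
            await asyncio.sleep(0.01)  # stands in for spawn + initialize()
            session = object()
            self._entries[key] = session
            return session
        finally:
            self._inflight.pop(key, None)

    async def get_session(self, server, scope):
        key = (server, scope)
        if key in self._entries:
            return self._entries[key]
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.create_task(self._create_and_store(key))
            self._inflight[key] = task
        # shield: cancelling one waiter should not cancel the shared creation
        return await asyncio.shield(task)


async def main():
    pool = DedupingPool()
    sessions = await asyncio.gather(*(pool.get_session("s", "t1") for _ in range(5)))
    return pool.created, len({id(s) for s in sessions})


created, distinct = asyncio.run(main())
```
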

Unified close paths. close_scope / close_server / close_all / close_all_sync now cover both _entries and _inflight. In-flight creations (which may be blocked in initialize() and therefore deaf to close_event) are cancelled so they can't be resurrected or hang.

Cross-loop preemption. When an in-flight creation belongs to a different/closed loop, it is treated as stale: the stale owner task is cancelled and the current caller becomes the new creator, eliminating a previously possible AssertionError and a hung owner task.

close_all_sync() running-loop semantics. Synchronously waiting on a coroutine scheduled onto the loop that is currently running on this very thread is a self-deadlock. The function now detects that case and only signals teardown (completing asynchronously once control returns to the loop); its docstring states this contract. For callers that need a deterministic close from inside a running loop — notably reset_mcp_tools_cache() — we drive await close_all() on a dedicated worker thread so sessions are fully torn down before the pool is dropped.
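The signal-only contract can be sketched as follows. `close_sync` and its parameters are hypothetical and much simpler than the real close_all_sync(); the sketch only shows the decision between signalling and blocking.

```python
import asyncio


def close_sync(owner_loop, close_event, owner_task, timeout=5.0):
    """Signal close; block only when blocking cannot deadlock."""
    try:
        running = asyncio.get_running_loop()
    except RuntimeError:
        running = None

    if running is owner_loop:
        # Same thread, same loop: waiting here would block the very loop
        # that has to run the teardown, so only signal.
        close_event.set()
        return "signalled"

    async def _shutdown():
        close_event.set()
        await owner_task

    # The owning loop runs on another thread, so a bounded wait is safe.
    asyncio.run_coroutine_threadsafe(_shutdown(), owner_loop).result(timeout)
    return "closed"


async def demo():
    loop = asyncio.get_running_loop()
    close_event = asyncio.Event()
    owner_task = asyncio.create_task(close_event.wait())

    outcome = close_sync(loop, close_event, owner_task)  # called on the owning loop
    pending_after_call = not owner_task.done()  # teardown has not run yet
    await owner_task  # completes once control returns to the loop
    return outcome, pending_after_call


outcome, pending_after_call = asyncio.run(demo())
```
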

Alternatives considered

Reuse a single global event loop for all MCP sessions. Would also avoid cross-task exit, but forces every sync caller through one shared loop, adds cross-thread scheduling on the hot path, and is a much larger behavioral change. Rejected as over-reaching for this bug.
Re-enter the context manager from the closing task with anyio.from_thread/portal. Adds an anyio portal dependency and still fights the framework's task affinity. The owner-task model expresses the constraint directly.
Skip pooling for stdio entirely. Trivially avoids the crash but throws away the connection-reuse benefit the pool exists for.
Block in close_all_sync() regardless. Impossible without deadlock when the owning loop is the current running loop; hence the signal-only contract plus the worker-thread deterministic path for reset_mcp_tools_cache().

Compatibility / impact

Public API is unchanged: get_session, close_scope, close_server, close_all, close_all_sync, get_session_pool, reset_session_pool keep their signatures.
Only internal state changed (_entries now carries owner task + close event; _context_managers removed; _inflight added). The only in-repo code that touched private fields is the test suite, which is updated here. No other production module depends on these internals.
Direct production consumers are limited to mcp/tools.py (session reuse) and mcp/cache.py (cache reset); both keep working through the public API.

Extensibility / maintenance notes

The owner-task model makes per-session lifecycle explicit, so future transports (beyond stdio) can pool sessions with the same correctness guarantees.
The loop-aware close decision currently lives in reset_mcp_tools_cache(); if more teardown call sites appear, that logic is a good candidate to consolidate into a pool helper.
Steady-state cost is a small constant per session (an owner task + close event) and a transient Future per in-flight creation — traded for concurrency correctness and reliable resource cleanup.

Tests

New/expanded regression tests in backend/tests/test_mcp_session_pool.py covering: cross-task close, cross-loop close, LRU eviction, in-flight cancellation, init-failure cleanup, same-key concurrency dedupe, cross-loop preemption of a blocked in-flight creation, and close_all_sync() from a running loop.
pytest tests/test_mcp_session_pool.py → 29 passed.
Related suite -k "mcp or sync_tool or session or cache" → 148 passed, 1 skipped.
ruff check clean on the changed files.

…sk cancel-scope error (bytedance#3379)

Adopt an owner-task lifecycle for pooled MCP ClientSessions so each session is entered, initialized, and exited within a single asyncio task on its owning event loop. This eliminates the anyio "Attempted to exit cancel scope in a different task than it was entered in" RuntimeError that surfaced when stdio MCP tools were used via the sync tool wrapper (which spins up and tears down event loops across tasks).

Also harden the pool lifecycle:

- track in-flight session creation per (server, scope) to dedupe concurrent get_session() calls for the same key
- make close_scope/close_server/close_all/close_all_sync cover both established entries and in-flight creations so sessions cannot be resurrected or leaked after close
- handle cross-loop preemption of an in-flight creation by cancelling the stale owner task instead of only signalling it
- define close_all_sync() semantics for a running loop on the current thread (signal-only, async completion) and route reset_mcp_tools_cache through a deterministic async close in that case

fancyboi999 · 2026-06-04T10:02:47Z

Triaging this from the issue side. Up front, so it's not hidden: I have a competing PR open for the same bug (#3384, the per-call route), so read my notes with that bias in mind. I'll keep it to facts.

The root-cause analysis is right, and the owner-task model is a legitimate way to keep sessions persistent under anyio's same-task rule. It's essentially what hermes-agent does, and persistent MCP connections are the norm elsewhere too — claude-code memoizes one client per server, codex holds an Arc'd service for the session. So the goal here isn't gold-plating. The secondary fixes are real bugs worth having: CancelledError slipping past except Exception, the concurrent-creation dedup, and the resurrect-after-close guard.

Two things I hit while reviewing:

1. reset_mcp_tools_cache() can deadlock. The new running-loop branch in cache.py starts a worker thread and runs asyncio.run(pool.close_all()). If the calling loop owns any of the entries, close_all() routes their teardown back onto that loop with run_coroutine_threadsafe(...) — but the loop is blocked on the unbounded .result(), so it never runs the scheduled _shutdown. The worker waits forever. Minimal repro against this branch:

import asyncio, concurrent.futures
from unittest.mock import AsyncMock, MagicMock, patch
from deerflow.mcp.session_pool import MCPSessionPool

def fake_cm(*a, **k):
    cm = MagicMock(); s = AsyncMock(); s.initialize = AsyncMock()
    cm.__aenter__ = AsyncMock(return_value=s); cm.__aexit__ = AsyncMock(return_value=False)
    return cm

async def main():
    pool = MCPSessionPool()
    with patch("langchain_mcp_adapters.sessions.create_session", side_effect=fake_cm):
        await pool.get_session("s", "t1", {"transport": "stdio", "command": "x", "args": []})  # entry owned by THIS loop
        ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        fut = ex.submit(asyncio.run, pool.close_all())
        fut.result(timeout=8)   # raises TimeoutError here; reset_mcp_tools_cache's .result() has no timeout -> hangs forever

asyncio.run(main())

Run that and close_all() never returns (the timeout=8 is only there to turn the hang into a visible TimeoutError; the production call in reset_mcp_tools_cache has no timeout).

The branch exists specifically for the "reset while a loop is running" case, and get_cached_mcp_tools() calls reset_mcp_tools_cache() on a stale config, so it looks reachable in practice with live sessions around — I didn't trace the exact production trigger end to end, but the mechanism reproduces in isolation. A timeout on .result() would bound the hang, though the close still wouldn't finish.

2. The sync-tool path eats most of the persistence benefit. make_sync_tool_wrapper drives each call through a fresh asyncio.run loop. A pooled session is bound to the loop that created it, so on that path the cross-loop preemption evicts and rebuilds the session on every call — you pay the full setup cost each time, plus the extra teardown. The persistence win really only lands on the long-lived Gateway loop. hermes-agent gets around this by putting all MCP on one dedicated loop/thread; without that consolidation the pool is carrying a lot of machinery for a payoff that's only partial across deer-flow's two execution contexts.
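The "fresh loop per call" behavior is a property of asyncio.run itself and is easy to confirm in isolation. A small sketch, where `current_loop` is a hypothetical helper:

```python
import asyncio


async def current_loop():
    return asyncio.get_running_loop()


# Each asyncio.run() call creates a new event loop and closes it on return.
first = asyncio.run(current_loop())
second = asyncio.run(current_loop())
```

A session bound to `first` cannot be driven from `second`, since `first` is already closed by the time the second call starts.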

To be fair to the case for persistence: the per-call cost is real. I measured roughly 0.4–0.7s per call for a Python stdio server (subprocess spawn + initialize) against ~2ms for a warm reuse, and an npx-based server would be worse. So persistence genuinely matters for chatty or slow servers.

The real question is for the maintainers: is a stateful/high-frequency stdio server a load-bearing case today, or a future one? The shipped config has github and postgres, both disabled, no Playwright. If it's a future need, it might be cleaner to land persistence as a deliberate single-owner-loop feature rather than threading it through the existing multi-loop pool. If it's needed now, this PR is the right direction — the deadlock just needs sorting first.

Root cause and the secondary fixes are solid regardless. Mostly flagging the deadlock so it doesn't bite later.

Copilot

Pull request overview

This PR fixes MCP stdio session pooling teardown by switching to an “owner-task lifecycle” model: each pooled ClientSession is entered/initialized/exited within a dedicated owner task on its owning event loop, preventing anyio’s cross-task cancel-scope RuntimeError (issue #3379). It also adds in-flight creation de-duplication and broadens close paths to cover both established and in-flight sessions.

Changes:

Reworked MCPSessionPool to run session lifecycle inside an owner task, signal-based shutdown, and in-flight creation de-dupe.
Hardened close behavior across close_scope / close_server / close_all / close_all_sync, including in-flight cancellation.
Added regression tests for cross-task/loop teardown, cancellation mid-init, in-flight close behavior, and close_all_sync() behavior from a running loop.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `backend/packages/harness/deerflow/mcp/session_pool.py` | Implements owner-task session lifecycle, adds `_inflight` registry, and revises close/eviction behavior to avoid cross-task cancel-scope exits. |
| `backend/packages/harness/deerflow/mcp/cache.py` | Updates cache reset to account for new `close_all_sync()` semantics and avoid teardown issues when resetting cached MCP tools. |
| `backend/tests/test_mcp_session_pool.py` | Adds comprehensive regression coverage for cross-task/loop teardown and in-flight/cancellation scenarios related to #3379. |

+        else:
+            # Owning loop exists but is idle; drive it to completion.
+            loop.run_until_complete(self._shutdown(close_evt, task, cancel))


+        try:
+            running_loop = asyncio.get_running_loop()
+        except RuntimeError:
+            running_loop = None
+
+        if running_loop is not None:
+            # Inside a running loop, close_all_sync() can only *signal* teardown
+            # of sessions owned by this loop and would complete asynchronously.
+            # Drive a deterministic close on a separate thread so sessions are
+            # fully torn down before reset_session_pool() drops the pool.
+            import concurrent.futures
+
+            with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
+                executor.submit(asyncio.run, pool.close_all()).result()
+        else:
+            pool.close_all_sync()


fancyboi999 · 2026-06-05T02:07:54Z

Closing my own per-call PR (#3384). Going per-call would regress #3054 — Playwright loses its browser context between calls — which @WillemJiang rightly flagged. This owner-task approach is the one that holds both constraints at once (#3054 persistence + #3379 same-task close), so I'm backing it.

The one blocker from my earlier review is still the reset_mcp_tools_cache() deadlock (repro in the comment above): when reset runs inside a loop that owns sessions, the worker thread's close_all() schedules teardown back onto that blocked loop and never completes. Worth fixing before merge — happy to help if useful.

18062706139fcz · 2026-06-05T03:41:39Z

Updated this PR to address the remaining reset deadlock called out in review.

What changed:

Replaced the running-loop worker-thread path in reset_mcp_tools_cache() with close_all_sync(), which already has the correct owner-loop-aware shutdown behavior
Added reset_mcp_tools_cache_async() for async callers that need deterministic teardown without self-deadlocking
Hardened the idle/non-current branch in _shutdown_entry() to avoid run_until_complete() from async context
Added regression coverage for the running-loop reset deadlock and for deterministic async reset teardown

Validation:

uv run pytest tests/test_mcp_session_pool.py -p no:cacheprovider -q
uv run pytest tests/test_mcp_session_pool.py tests/test_mcp_sync_wrapper.py tests/test_mcp_custom_interceptors.py -p no:cacheprovider -q
uv run ruff check --no-cache packages/harness/deerflow/mcp/cache.py packages/harness/deerflow/mcp/session_pool.py packages/harness/deerflow/mcp/__init__.py tests/test_mcp_session_pool.py

18062706139fcz · 2026-06-05T03:45:17Z

Thanks @fancyboi999.
I fixed this by removing the running-loop worker-thread path from reset_mcp_tools_cache() and routing it back through close_all_sync(), which already handles owner-loop shutdown correctly and avoids the self-deadlock.
I also added an async reset_mcp_tools_cache_async() for callers that need deterministic teardown, plus regression tests covering the running-loop reset case.
This preserves the owner-task lifecycle fix for #3379 and does not regress the persistent-session behavior needed for #3054.
Could you take another look when you have a moment?

WillemJiang · 2026-06-05T14:58:54Z

@18062706139fcz Please check out the review comments below and fix the lint error.

Duplicate _CloseTrackingCm class in tests (minor)

_CloseTrackingCm is defined twice in the test file — once locally inside test_close_all_sync_from_running_loop_does_not_wait_on_itself (line ~1105) and again at module scope (line ~1151). The module-scope definition shadows the local one for subsequent tests. This works but is confusing and should be deduplicated.

_shutdown_entry idle-loop branch is a silent no-op

At session_pool.py line ~433–450, when the owning loop "exists but is neither the current loop nor running," the code falls back to call_soon_threadsafe and returns without waiting. The comment says "not expected in practice," but if it ever does happen, the session leaks until the loop runs again. Consider logging at warning level instead of debug to make this discoverable in production:

  logger.warning(
      "Owning loop for MCP session is idle; signalling close best-effort. "
      "Session may leak until the loop runs again."
  )

reset_mcp_tools_cache_async has no production caller

The new reset_mcp_tools_cache_async function is exported and tested, but appears to have no production call site in this PR. If it's being added speculatively, consider noting that in the PR description. If there's a known future caller, mentioning it would help reviewers understand the intent.

18062706139fcz · 2026-06-07T07:34:14Z

@WillemJiang Thanks for the review. I pushed an update addressing the remaining feedback:

Deduplicated _CloseTrackingCm in test_mcp_session_pool.py so the tests now use a single module-scope helper.
Changed the idle owning-loop branch in _shutdown_entry() from debug logging to warning logging, explicitly noting that the session may leak until the loop runs again.
Removed the speculative reset_mcp_tools_cache_async() API since it had no production caller, including its export and dedicated test coverage.
Kept the running-loop reset regression coverage for reset_mcp_tools_cache() to ensure the sync reset path remains bounded and does not reintroduce the worker-thread deadlock.

Validation:

uv run pytest tests/test_mcp_session_pool.py tests/test_mcp_sync_wrapper.py tests/test_mcp_custom_interceptors.py -p no:cacheprovider -q -> 46 passed
uv run ruff check --no-cache . && uv run ruff format --no-cache --check . -> passed

github-actions bot added the risk:medium, size/XL, and area:mcp labels and removed the size/XL and risk:medium labels on Jun 4, 2026

WillemJiang requested a review from Copilot June 4, 2026 15:22

Copilot started reviewing on behalf of WillemJiang June 4, 2026 15:23 View session

Copilot AI reviewed Jun 4, 2026

fancyboi999 mentioned this pull request Jun 5, 2026

fix(mcp): use per-call stdio sessions to fix cross-task cancel-scope error (#3379) #3384

Closed

3 tasks

fix(mcp): avoid reset deadlock on running loop cache reset

532630a

github-actions bot added the risk:medium and size/XL labels on Jun 5, 2026

fix(mcp): address session pool review feedback

d856786

WillemJiang approved these changes Jun 7, 2026

WillemJiang merged commit d8b728f into bytedance:main Jun 7, 2026
13 checks passed

WillemJiang merged 3 commits into bytedance:main from 18062706139fcz:fix/mcp-session-pool-cross-task-3379