fix(mcp): use per-call stdio sessions to fix cross-task cancel-scope error (#3379)#3384
fix(mcp): use per-call stdio sessions to fix cross-task cancel-scope error (#3379)#3384fancyboi999 wants to merge 2 commits into
Conversation
…error (bytedance#3379) stdio MCP transport (stdio_client + ClientSession) is built on anyio.create_task_group(), whose cancel scope must be entered AND exited in the same asyncio task. MCPSessionPool entered a session in the task handling one tool call and closed it from a different task (LRU eviction, close_all_sync via run_coroutine_threadsafe on config-mtime reset, or the event loop's async-generator GC finalizer), raising "Attempted to exit cancel scope in a different task than it was entered in" — logged as an unretrieved Task exception. bytedance#3203 wrongly assumed only HTTP/SSE used anyio task groups and excluded only them from pooling; stdio's ClientSession (and stdio_client itself) use task groups too, so stdio stayed broken. Remove MCPSessionPool entirely. stdio tools now open+call+close their session within a single coroutine/task (the langchain-mcp-adapters default), consistent with HTTP/SSE. A thin per-call wrapper is kept for stdio only to preserve interceptor-injected header forwarding via MCP call meta (bytedance#3294); HTTP/SSE forward headers natively and pass through unwrapped. Trade-off: stdio tool calls now spawn one subprocess per call and no longer share cross-call server-side state (reverts the bytedance#3089 stateful-session behavior, which was broken anyway). If persistent stdio sessions are ever needed, the correct design is a long-lived owner-task that opens and closes the session in one task — not a shared pool. Tests: delete test_mcp_session_pool.py (it mocked create_session with taskgroup-less AsyncMock context managers, which is exactly why the cross-task bug was never caught). Add test_mcp_stdio_per_call.py with a REAL stdio subprocess cross-task regression test plus a committed minimal stdio server.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds regression coverage and updates MCP tool loading to eliminate stdio session pooling by switching to per-call sessions, preventing anyio cancel-scope cross-task cleanup errors (issue #3379).
Changes:
- Replace stdio MCP session pooling with a per-call wrapper (
_make_per_call_mcp_tool) while keeping HTTP/SSE tools unwrapped. - Remove the
MCPSessionPoolimplementation and its test suite; update cache reset behavior accordingly. - Add new tests (including a real stdio MCP subprocess) to reproduce and lock the cross-task cancel-scope failure mode.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/packages/harness/deerflow/mcp/tools.py | Removes pooling, introduces per-call stdio wrapper that also forwards interceptor headers via meta. |
| backend/packages/harness/deerflow/mcp/session_pool.py | Deletes the persistent session pool implementation. |
| backend/packages/harness/deerflow/mcp/cache.py | Updates cache reset docs/behavior now that there’s no session pool to close. |
| backend/tests/test_mcp_session_pool.py | Removes tests that exercised the deleted session pool behavior. |
| backend/tests/test_mcp_stdio_per_call.py | Adds new unit/integration tests for per-call stdio sessions and cross-task safety (incl. real subprocess). |
| backend/tests/support/stdio_mcp_server.py | Adds a minimal real stdio MCP server used by tests. |
| backend/CLAUDE.md | Documents the new per-call session behavior and rationale. |
Save and restore loop.get_exception_handler() around the real-stdio cross-task test so it cannot leak into subsequent tests (Copilot review).
|
@fancyboi999 The session pool was introduced to fix #3054; if we use per-call stdio sessions, we could introduce the regression issue of #3054. |
|
@WillemJiang you're right, thanks for catching this. I did trace the pool back to #3089 when I wrote this, but I read its docstring as "Playwright, hypothetically" and talked myself out of it because the default config ships no stdio server that needs shared state. Wrong call — #3054 is a real, fixed bug, and going per-call brings it straight back: each call gets its own Playwright browser context, so navigate-then-click lands on a fresh page again. So this PR trades one bug for another, which isn't worth merging. The actual constraint is both things at once: sessions have to stay persistent (#3054) and be closed in the task that opened them (#3379). The owner-task approach in #3392 is what satisfies both, so I'd rather see that land than this. Closing in favor of #3392. One thing worth carrying over: while reviewing #3392 I hit a deadlock in its |
Fixes #3379
Why
Calling any stdio MCP tool (e.g.
semantic-scholar-mcp'ssearch_papers) succeeds, but cleanup then raises:logged as an unretrieved asyncio
Task exception. It reproduces on the latestmain, on both Linux and Windows, in dev and prod.Root cause (first principles): the stdio transport stack is built on
anyio.create_task_group()at two layers —stdio_client(mcp/client/stdio/__init__.py) andClientSession/BaseSession(mcp/shared/session.py). An anyio cancel scope / task group has a hard invariant: it must be entered (__aenter__) and exited (__aexit__) in the same asyncio task — same event loop is not enough.MCPSessionPool(a process-global singleton keyed by(server_name, thread_id)) enters a session in the task handling one tool call, then closes it from a different task:close_all_sync()→run_coroutine_threadsafe(cm.__aexit__, loop)schedules the close as a new task (this is the only wired teardown caller —reset_mcp_tools_cache()invokes it on config-mtime change);get_session()closes another task's session from the current tool-call task;aclose()on the abandonedcreate_sessiongenerator in the finalizer task — this is the exact<async_generator_athrow>task in the issue traceback.Every cross-task exit raises the error, and also leaks
CancelledErrorinto unrelated tasks on the same loop, so simply swallowing theRuntimeErroris not a safe fix.PR #3203 already hit this and excluded HTTP/SSE from pooling — but it assumed only HTTP/SSE used anyio task groups. stdio uses them too, so stdio stayed broken. This PR finishes that line of reasoning.
What changed
langchain-mcp-adaptersdefault and is now uniform across stdio and HTTP/SSE.MCPSessionPoolis removed entirely (~200 lines).meta(PR fix(mcp): add auth interceptor with channel user_id and keep header propagation to mcp tools #3294) —langchain-mcp-adaptersforwards modified headers natively for HTTP/SSE but drops them for stdio. HTTP/SSE tools pass through unwrapped.reset_mcp_tools_cache()no longer closes any sessions — there are none to close.Behavior change to call out: stdio servers previously kept a pooled session across calls within a thread; they now get a fresh session per call. No shipped/default MCP server depends on cross-call stdio state (the default config ships
github/postgres, both disabled), and the pool was crashing on cleanup anyway. If a genuinely stateful stdio server (e.g. Playwright keeping a browser open) is ever needed, the correct design is a long-lived owner-task that opens and closes the session in one task — not a shared pool. That can be added later as an opt-in without reintroducing the cross-task hazard.Surface area
backend/packages/harness/deerflow/mcp/backend/CLAUDE.mdMCP section + test suiteBug fix verification
backend/tests/test_mcp_stdio_per_call.py::test_real_stdio_tool_is_cross_task_safe— spawns a real stdio MCP subprocess (backend/tests/support/stdio_mcp_server.py), invokes the tool from multiple distinct tasks viaasyncio.gather(mirroring LangGraph's parallel tool execution), forces GC, and asserts no cancel-scope error leaks.test_mcp_session_pool.pymockedcreate_sessionwith plainAsyncMockcontext managers that have no task group, so a cross-task__aexit__"worked" in tests. That mock-only suite is deleted and replaced with tests that exercise a real stdio session.mainwith a real stdio session (enter in task A, exit in task B → the exactRuntimeError). The committed tests reference the new per-call wrapper, so they exercise the fixed path and pass on this branch; the pre-fix red is the standalone real-stdio reproduction described above.Validation
End-to-end, against a real stdio MCP server through the actual
initialize_mcp_tools()→MultiServerMCPClient.get_tools()path:reset_mcp_tools_cache()(the former crash trigger) + forced GC produce zero cancel-scope errors and zero unretrieved-task exceptions;