Skip to content

fix(mcp): use per-call stdio sessions to fix cross-task cancel-scope error (#3379)#3384

Closed
fancyboi999 wants to merge 2 commits into
bytedance:mainfrom
fancyboi999:fix/3379-mcp-stdio-per-call-session
Closed

fix(mcp): use per-call stdio sessions to fix cross-task cancel-scope error (#3379)#3384
fancyboi999 wants to merge 2 commits into
bytedance:mainfrom
fancyboi999:fix/3379-mcp-stdio-per-call-session

Conversation

@fancyboi999
Copy link
Copy Markdown
Collaborator

Fixes #3379

Why

Calling any stdio MCP tool (e.g. semantic-scholar-mcp's search_papers) succeeds, but cleanup then raises:

RuntimeError: Attempted to exit cancel scope in a different task than it was entered in

logged as an unretrieved asyncio Task exception. It reproduces on the latest main, on both Linux and Windows, in dev and prod.

Root cause (first principles): the stdio transport stack is built on anyio.create_task_group() at two layers — stdio_client (mcp/client/stdio/__init__.py) and ClientSession/BaseSession (mcp/shared/session.py). An anyio cancel scope / task group has a hard invariant: it must be entered (__aenter__) and exited (__aexit__) in the same asyncio task — same event loop is not enough.

MCPSessionPool (a process-global singleton keyed by (server_name, thread_id)) enters a session in the task handling one tool call, then closes it from a different task:

  • close_all_sync()run_coroutine_threadsafe(cm.__aexit__, loop) schedules the close as a new task (this is the only wired teardown caller — reset_mcp_tools_cache() invokes it on config-mtime change);
  • LRU eviction inside get_session() closes another task's session from the current tool-call task;
  • the event loop's async-generator GC finalizer runs aclose() on the abandoned create_session generator in the finalizer task — this is the exact <async_generator_athrow> task in the issue traceback.

Every cross-task exit raises the error, and also leaks CancelledError into unrelated tasks on the same loop, so simply swallowing the RuntimeError is not a safe fix.

PR #3203 already hit this and excluded HTTP/SSE from pooling — but it assumed only HTTP/SSE used anyio task groups. stdio uses them too, so stdio stayed broken. This PR finishes that line of reasoning.

What changed

  • MCP tool calls now use a per-call session: each invocation opens, uses, and closes its own session inside one coroutine (one task). This is the langchain-mcp-adapters default and is now uniform across stdio and HTTP/SSE.
  • The cross-task MCPSessionPool is removed entirely (~200 lines).
  • A thin per-call wrapper is kept for stdio only, solely to preserve interceptor-injected header forwarding via MCP call meta (PR fix(mcp): add auth interceptor with channel user_id and keep header propagation to mcp tools #3294) — langchain-mcp-adapters forwards modified headers natively for HTTP/SSE but drops them for stdio. HTTP/SSE tools pass through unwrapped.
  • reset_mcp_tools_cache() no longer closes any sessions — there are none to close.

Behavior change to call out: stdio servers previously kept a pooled session across calls within a thread; they now get a fresh session per call. No shipped/default MCP server depends on cross-call stdio state (the default config ships github/postgres, both disabled), and the pool was crashing on cleanup anyway. If a genuinely stateful stdio server (e.g. Playwright keeping a browser open) is ever needed, the correct design is a long-lived owner-task that opens and closes the session in one task — not a shared pool. That can be added later as an opt-in without reintroducing the cross-task hazard.

Surface area

  • Agents / LangGraph — MCP tool execution / session lifecycle under backend/packages/harness/deerflow/mcp/
  • Default behavior change — stdio MCP sessions are now per-call instead of pooled/persistent
  • Docs / testsbackend/CLAUDE.md MCP section + test suite

Bug fix verification

  • Regression test: backend/tests/test_mcp_stdio_per_call.py::test_real_stdio_tool_is_cross_task_safe — spawns a real stdio MCP subprocess (backend/tests/support/stdio_mcp_server.py), invokes the tool from multiple distinct tasks via asyncio.gather (mirroring LangGraph's parallel tool execution), forces GC, and asserts no cancel-scope error leaks.
  • Why the bug was never caught before: the previous test_mcp_session_pool.py mocked create_session with plain AsyncMock context managers that have no task group, so a cross-task __aexit__ "worked" in tests. That mock-only suite is deleted and replaced with tests that exercise a real stdio session.
  • Red→green: the bug is independently reproducible on main with a real stdio session (enter in task A, exit in task B → the exact RuntimeError). The committed tests reference the new per-call wrapper, so they exercise the fixed path and pass on this branch; the pre-fix red is the standalone real-stdio reproduction described above.

Validation

cd backend && make lint          # ruff check + ruff format --check → All checks passed
cd backend && uv run pytest tests/ -k mcp -q   # 83 passed (incl. real-stdio cross-task test)

End-to-end, against a real stdio MCP server through the actual initialize_mcp_tools()MultiServerMCPClient.get_tools() path:

  • parallel cross-task tool calls succeed;
  • reset_mcp_tools_cache() (the former crash trigger) + forced GC produce zero cancel-scope errors and zero unretrieved-task exceptions;
  • a real agent turn that invokes the stdio tool returns the server-produced output verbatim with no error surfaced.

…error (bytedance#3379)

stdio MCP transport (stdio_client + ClientSession) is built on
anyio.create_task_group(), whose cancel scope must be entered AND exited in the
same asyncio task. MCPSessionPool entered a session in the task handling one
tool call and closed it from a different task (LRU eviction, close_all_sync via
run_coroutine_threadsafe on config-mtime reset, or the event loop's
async-generator GC finalizer), raising "Attempted to exit cancel scope in a
different task than it was entered in" — logged as an unretrieved Task exception.

bytedance#3203 wrongly assumed only HTTP/SSE used anyio task groups and excluded only
them from pooling; stdio's ClientSession (and stdio_client itself) use task
groups too, so stdio stayed broken.

Remove MCPSessionPool entirely. stdio tools now open+call+close their session
within a single coroutine/task (the langchain-mcp-adapters default), consistent
with HTTP/SSE. A thin per-call wrapper is kept for stdio only to preserve
interceptor-injected header forwarding via MCP call meta (bytedance#3294); HTTP/SSE
forward headers natively and pass through unwrapped.

Trade-off: stdio tool calls now spawn one subprocess per call and no longer
share cross-call server-side state (reverts the bytedance#3089 stateful-session
behavior, which was broken anyway). If persistent stdio sessions are ever
needed, the correct design is a long-lived owner-task that opens and closes the
session in one task — not a shared pool.

Tests: delete test_mcp_session_pool.py (it mocked create_session with
taskgroup-less AsyncMock context managers, which is exactly why the cross-task
bug was never caught). Add test_mcp_stdio_per_call.py with a REAL stdio
subprocess cross-task regression test plus a committed minimal stdio server.
@github-actions github-actions Bot added risk:medium Medium risk: regular code changes size/XL PR changes 700+ lines area:mcp Model Context Protocol integration area:docs Documentation and Markdown only and removed size/XL PR changes 700+ lines risk:medium Medium risk: regular code changes labels Jun 4, 2026
@fancyboi999 fancyboi999 requested a review from Copilot June 4, 2026 06:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression coverage and updates MCP tool loading to eliminate stdio session pooling by switching to per-call sessions, preventing anyio cancel-scope cross-task cleanup errors (issue #3379).

Changes:

  • Replace stdio MCP session pooling with a per-call wrapper (_make_per_call_mcp_tool) while keeping HTTP/SSE tools unwrapped.
  • Remove the MCPSessionPool implementation and its test suite; update cache reset behavior accordingly.
  • Add new tests (including a real stdio MCP subprocess) to reproduce and lock the cross-task cancel-scope failure mode.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
backend/packages/harness/deerflow/mcp/tools.py Removes pooling, introduces per-call stdio wrapper that also forwards interceptor headers via meta.
backend/packages/harness/deerflow/mcp/session_pool.py Deletes the persistent session pool implementation.
backend/packages/harness/deerflow/mcp/cache.py Updates cache reset docs/behavior now that there’s no session pool to close.
backend/tests/test_mcp_session_pool.py Removes tests that exercised the deleted session pool behavior.
backend/tests/test_mcp_stdio_per_call.py Adds new unit/integration tests for per-call stdio sessions and cross-task safety (incl. real subprocess).
backend/tests/support/stdio_mcp_server.py Adds a minimal real stdio MCP server used by tests.
backend/CLAUDE.md Documents the new per-call session behavior and rationale.

Comment thread backend/packages/harness/deerflow/mcp/tools.py
Comment thread backend/tests/test_mcp_stdio_per_call.py
Save and restore loop.get_exception_handler() around the real-stdio
cross-task test so it cannot leak into subsequent tests (Copilot review).
@WillemJiang
Copy link
Copy Markdown
Collaborator

WillemJiang commented Jun 4, 2026

@fancyboi999 The session pool was introduced to fix #3054; if we use per-call stdio sessions, we could introduce the regression issue of #3054.

@fancyboi999
Copy link
Copy Markdown
Collaborator Author

@WillemJiang you're right, thanks for catching this. I did trace the pool back to #3089 when I wrote this, but I read its docstring as "Playwright, hypothetically" and talked myself out of it because the default config ships no stdio server that needs shared state. Wrong call — #3054 is a real, fixed bug, and going per-call brings it straight back: each call gets its own Playwright browser context, so navigate-then-click lands on a fresh page again.

So this PR trades one bug for another, which isn't worth merging. The actual constraint is both things at once: sessions have to stay persistent (#3054) and be closed in the task that opened them (#3379). The owner-task approach in #3392 is what satisfies both, so I'd rather see that land than this.

Closing in favor of #3392. One thing worth carrying over: while reviewing #3392 I hit a deadlock in its reset_mcp_tools_cache() path (repro is over there) — that should be sorted before it merges.

@fancyboi999 fancyboi999 closed this Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentation and Markdown only area:mcp Model Context Protocol integration risk:medium Medium risk: regular code changes size/XL PR changes 700+ lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] 使用 stdio MCP 工具后出现 "Attempted to exit cancel scope in a different task" RuntimeError

3 participants