Skip to content

Add ToolOrphanRepair capability#184

Open
DouweM wants to merge 4 commits into
mainfrom
capability/tool-orphan-repair
Open

Add ToolOrphanRepair capability#184
DouweM wants to merge 4 commits into
mainfrom
capability/tool-orphan-repair

Conversation

@DouweM
Copy link
Copy Markdown
Contributor

@DouweM DouweM commented Apr 10, 2026

Summary

  • Implements ToolOrphanRepair, a capability that sanitizes message history to fix orphaned tool calls and results before each model request
  • Injects synthetic ToolReturnPart / BuiltinToolReturnPart for calls without matching results, strips ToolReturnPart / RetryPromptPart whose tool_call_id doesn't match any call, and handles trailing responses and empty request edge cases
  • Exports as from pydantic_harness import ToolOrphanRepair

Refs: pydantic/pydantic-ai#4728

Test plan

  • 25 tests covering all repair scenarios: orphaned calls, orphaned returns, orphaned builtin calls, trailing responses, empty requests, warnings, multi-turn conversations, and no-op passthrough
  • pyright strict mode: 0 errors
  • ruff lint + format: clean

🤖 Generated with Claude Code

DouweM and others added 4 commits April 2, 2026 05:27
…sults

Implements a capability that hooks into before_model_request to repair
structurally invalid message history caused by orphaned tool calls and
results in multi-turn conversations. This prevents providers (especially
Anthropic) from rejecting poisoned conversation history with 400 errors.

Repairs: orphaned ToolCallPart (injects synthetic ToolReturnPart),
orphaned BuiltinToolCallPart (injects BuiltinToolReturnPart in same
response), and orphaned ToolReturnPart/RetryPromptPart (strips them).
Also handles trailing responses and empty request edge cases.

Refs: pydantic/pydantic-ai#4728

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each repair site now emits a `logging.debug()` message describing the
specific action taken (synthetic return injected, orphaned return
stripped, trailing response dropped, etc.), complementing the existing
summary UserWarning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…onses

- Test the `before_model_request` capability hook directly
- Test consecutive ModelResponse messages (no interleaved request)
- Mark defensive Phase 6 code as `# pragma: no cover` (unreachable)
- Mark unused `extra_parts` helper branch as `# pragma: no cover`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@DouweM
Copy link
Copy Markdown
Contributor Author

DouweM commented Apr 10, 2026

Originally posted by @DouweM in #132 comment (PR was recreated)

Audit vs prior art: ToolOrphanRepair

Worth adding now:

  • Duplicate tool_call_id deduplication
  • Debug-level logging of specific repairs (not just count)

Follow-up opportunities:

  • Integration with Compaction to catch orphans created by summarization

@DouweM
Copy link
Copy Markdown
Contributor Author

DouweM commented Apr 10, 2026

Originally posted by @adtyavrdhn in #132 comment (PR was recreated)

Notes from comparing with Hermes, Pi-mono, and Mastra

Looked at how other frameworks handle the same problem. A few things worth noting:

Synthetic returns should be marked as errors
Pi-mono marks injected results with isError: true so the model knows the tool didn't actually succeed — it's not a normal result, it's a "this never ran" signal. We can't do this yet because ToolReturnPart doesn't have an is_error field. That's tracked in pydantic/pydantic-ai#4363. Once that lands, the synthetic returns here should set is_error=True.

Duplicate tool_call_id handling
Already flagged in the audit comment. The current set-based matching means if two calls share an ID (provider bug, frontend-generated IDs), only one synthetic return gets created and the other call stays orphaned. Worth adding detection + a warning at minimum.

Tool ID sanitization is not this PR's job
Hermes has an explicit _sanitize_tool_id() that replaces non-[a-zA-Z0-9_-] chars for Anthropic compliance. That's a provider adapter concern — belongs in pydantic-ai's Anthropic model, not in history repair. Mentioning it here just because it causes the same symptom (Anthropic 400s).

Errored/aborted turns — framework handles it (mostly)
Pi-mono skips entire responses with stopReason === "error". Pydantic-ai already handles finish_reason='length' (raises IncompleteToolCall) and auto-generates tool call IDs if missing, so partial tool calls aren't a concern for this capability. finish_reason='error' responses do silently enter history though — that's a framework-level gap, not something for here.

@sarth6
Copy link
Copy Markdown

sarth6 commented May 14, 2026

@DouweM I had an agent compare our internal version of this to yours and see what yours is doing better and vice versa, found a few tests that are probably worth adding - some are tests we added by hitting real production bugs (ex. #1)

  1. End-to-end smoke through the actual provider that motivated the PR. The suite asserts a lot about the structure of the repaired output, but nothing exercises whether Anthropic actually accepts it. We’ve hit cases where the repaired history was “structurally valid” (every call had a return, every return had a call) and still got 400’d because part ordering inside a single message was wrong — for example, tool_result blocks ending up after a UserPromptPart in the same user turn, when Anthropic requires tool_results at the start. A single VCR cassette round-tripping a repaired payload through the Anthropic adapter would catch this whole class of bug. (Right now _repair_response_request_pair appends synthetic returns to kept_parts — worth checking what part order that produces when the original request already has a UserPromptPart.)

  2. Idempotence as an explicit invariant. Re-running the repair on its own output appears to be a no-op given the current algorithm, but it’s a load-bearing property — the moment someone adds part normalization or reordering in a future change, it’s easy to break silently. A repair(repair(x)) == repair(x) test for a few non-trivial shapes locks it.

  3. At least one test seeded from a real captured history. Every test case is hand-constructed. In our experience, the most pernicious bugs come from shapes that real producers generate (streaming timeouts dropping builtin parts mid-stream, frontend-generated IDs that don’t match server expectations, deferred-tool callbacks landing in unexpected order). Synthetic tests can be 100% green while a single captured production payload still fails. Even one fixture file from a real broken conversation would meaningfully increase the suite’s defensive value.

  4. Multi-segment / stitched histories. When a conversation gets reassembled across boundaries — durable workflow resumption, session import, multi-run checkpoints — a tool call and its return can legitimately live in what looks like a discontinuous structure but is actually valid. Worth a test that confirms the algorithm doesn’t false-positive on a valid pair whose call/return parts originated in different runs but were stitched together correctly. (We have specifically seen the “broken cross-run pair” and “valid cross-run pair” cases produce different correct behaviors and they’re easy to conflate.)

  5. Pair-scope choice as a documented test. test_consecutive_model_responses covers two responses in a row, but it doesn’t fully document the algorithm’s narrow (response, request) scope. Worth tests for: three consecutive responses; response → response → request where the first response’s calls would be matched by the eventual request; an interleaving with a system message in between. Your current algorithm drops the first response’s calls in those cases, which may be deliberate — but the test makes the choice explicit and protects it from accidental “fixing” later.

  6. Synthetic-part identifiability for downstream consumers. The empty-request placeholder uses UserPromptPart(content="Continue."). Once that lands in request_context.messages, any downstream consumer (chat UI, transcript export, conversation analytics) has no way to tell whether “Continue.” was a real user message or synthesized by the repair. It’s not strictly a test for the repair logic, but it’s worth a thought-experiment test that asserts how a consumer is supposed to identify synthetic parts — otherwise this string will end up surfaced in user-facing rendering somewhere down the line.

@agnim25
Copy link
Copy Markdown

agnim25 commented May 31, 2026

Super interested in getting this change through! Let me know if there's anything where I can help. Would generally be interested in this and the other reliability initiatives where we'd want this as a capability. @DouweM @sarth6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants