Add ToolOrphanRepair capability by DouweM · Pull Request #184 · pydantic/pydantic-ai-harness

DouweM · 2026-04-10T01:02:49Z

Summary

Implements ToolOrphanRepair, a capability that sanitizes message history to fix orphaned tool calls and results before each model request
Injects synthetic ToolReturnPart / BuiltinToolReturnPart for calls without matching results, strips ToolReturnPart / RetryPromptPart whose tool_call_id doesn't match any call, and handles trailing responses and empty request edge cases
Exports as from pydantic_harness import ToolOrphanRepair

Test plan

25 tests covering all repair scenarios: orphaned calls, orphaned returns, orphaned builtin calls, trailing responses, empty requests, warnings, multi-turn conversations, and no-op passthrough
pyright strict mode: 0 errors
ruff lint + format: clean

🤖 Generated with Claude Code

…sults Implements a capability that hooks into before_model_request to repair structurally invalid message history caused by orphaned tool calls and results in multi-turn conversations. This prevents providers (especially Anthropic) from rejecting poisoned conversation history with 400 errors. Repairs: orphaned ToolCallPart (injects synthetic ToolReturnPart), orphaned BuiltinToolCallPart (injects BuiltinToolReturnPart in same response), and orphaned ToolReturnPart/RetryPromptPart (strips them). Also handles trailing responses and empty request edge cases. Refs: pydantic/pydantic-ai#4728 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each repair site now emits a `logging.debug()` message describing the specific action taken (synthetic return injected, orphaned return stripped, trailing response dropped, etc.), complementing the existing summary UserWarning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…onses - Test the `before_model_request` capability hook directly - Test consecutive ModelResponse messages (no interleaved request) - Mark defensive Phase 6 code as `# pragma: no cover` (unreachable) - Mark unused `extra_parts` helper branch as `# pragma: no cover` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

DouweM · 2026-04-10T15:09:29Z

Originally posted by @DouweM in #132 comment (PR was recreated)

Audit vs prior art: ToolOrphanRepair

Worth adding now:

Duplicate tool_call_id deduplication
Debug-level logging of specific repairs (not just count)

Follow-up opportunities:

Integration with Compaction to catch orphans created by summarization

DouweM · 2026-04-10T15:09:30Z

Originally posted by @adtyavrdhn in #132 comment (PR was recreated)

Notes from comparing with Hermes, Pi-mono, and Mastra

Looked at how other frameworks handle the same problem. A few things worth noting:

Synthetic returns should be marked as errors
Pi-mono marks injected results with isError: true so the model knows the tool didn't actually succeed — it's not a normal result, it's a "this never ran" signal. We can't do this yet because ToolReturnPart doesn't have an is_error field. That's tracked in pydantic/pydantic-ai#4363. Once that lands, the synthetic returns here should set is_error=True.

Duplicate tool_call_id handling
Already flagged in the audit comment. The current set-based matching means if two calls share an ID (provider bug, frontend-generated IDs), only one synthetic return gets created and the other call stays orphaned. Worth adding detection + a warning at minimum.

Tool ID sanitization is not this PR's job
Hermes has an explicit _sanitize_tool_id() that replaces non-[a-zA-Z0-9_-] chars for Anthropic compliance. That's a provider adapter concern — belongs in pydantic-ai's Anthropic model, not in history repair. Mentioning it here just because it causes the same symptom (Anthropic 400s).

Errored/aborted turns — framework handles it (mostly)
Pi-mono skips entire responses with stopReason === "error". Pydantic-ai already handles finish_reason='length' (raises IncompleteToolCall) and auto-generates tool call IDs if missing, so partial tool calls aren't a concern for this capability. finish_reason='error' responses do silently enter history though — that's a framework-level gap, not something for here.

sarth6 · 2026-05-14T19:16:24Z

@DouweM I had an agent compare our internal version of this to yours and see what yours is doing better and vice versa, found a few tests that are probably worth adding - some are tests we added by hitting real production bugs (ex. #1)

End-to-end smoke through the actual provider that motivated the PR. The suite asserts a lot about the structure of the repaired output, but nothing exercises whether Anthropic actually accepts it. We’ve hit cases where the repaired history was “structurally valid” (every call had a return, every return had a call) and still got 400’d because part ordering inside a single message was wrong — for example, tool_result blocks ending up after a UserPromptPart in the same user turn, when Anthropic requires tool_results at the start. A single VCR cassette round-tripping a repaired payload through the Anthropic adapter would catch this whole class of bug. (Right now _repair_response_request_pair appends synthetic returns to kept_parts — worth checking what part order that produces when the original request already has a UserPromptPart.)
Idempotence as an explicit invariant. Re-running the repair on its own output appears to be a no-op given the current algorithm, but it’s a load-bearing property — the moment someone adds part normalization or reordering in a future change, it’s easy to break silently. A repair(repair(x)) == repair(x) test for a few non-trivial shapes locks it.
At least one test seeded from a real captured history. Every test case is hand-constructed. In our experience, the most pernicious bugs come from shapes that real producers generate (streaming timeouts dropping builtin parts mid-stream, frontend-generated IDs that don’t match server expectations, deferred-tool callbacks landing in unexpected order). Synthetic tests can be 100% green while a single captured production payload still fails. Even one fixture file from a real broken conversation would meaningfully increase the suite’s defensive value.
Multi-segment / stitched histories. When a conversation gets reassembled across boundaries — durable workflow resumption, session import, multi-run checkpoints — a tool call and its return can legitimately live in what looks like a discontinuous structure but is actually valid. Worth a test that confirms the algorithm doesn’t false-positive on a valid pair whose call/return parts originated in different runs but were stitched together correctly. (We have specifically seen the “broken cross-run pair” and “valid cross-run pair” cases produce different correct behaviors and they’re easy to conflate.)
Pair-scope choice as a documented test. test_consecutive_model_responses covers two responses in a row, but it doesn’t fully document the algorithm’s narrow (response, request) scope. Worth tests for: three consecutive responses; response → response → request where the first response’s calls would be matched by the eventual request; an interleaving with a system message in between. Your current algorithm drops the first response’s calls in those cases, which may be deliberate — but the test makes the choice explicit and protects it from accidental “fixing” later.
Synthetic-part identifiability for downstream consumers. The empty-request placeholder uses UserPromptPart(content="Continue."). Once that lands in request_context.messages, any downstream consumer (chat UI, transcript export, conversation analytics) has no way to tell whether “Continue.” was a real user message or synthesized by the repair. It’s not strictly a test for the repair logic, but it’s worth a thought-experiment test that asserts how a consumer is supposed to identify synthetic parts — otherwise this string will end up surfaced in user-facing rendering somewhere down the line.

agnim25 · 2026-05-31T05:51:03Z

Super interested in getting this change through! Let me know if there's anything where I can help. Would generally be interested in this and the other reliability initiatives where we'd want this as a capability. @DouweM @sarth6

DouweM and others added 4 commits April 2, 2026 05:27

fix: ruff import ordering

94c4242

DouweM requested review from Kludex, adtyavrdhn, dmontagu, dsfaccini and samuelcolvin as code owners April 10, 2026 01:02

devin-ai-integration Bot reviewed Apr 10, 2026

View reviewed changes

DouweM assigned adtyavrdhn Apr 10, 2026

DouweM removed request for Kludex, adtyavrdhn, dmontagu, dsfaccini and samuelcolvin April 10, 2026 15:12

adtyavrdhn mentioned this pull request Apr 15, 2026

feat: add built-in repair_orphaned_tool_parts history processor pydantic/pydantic-ai#5090

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ToolOrphanRepair capability#184

Add ToolOrphanRepair capability#184
DouweM wants to merge 4 commits into
mainfrom
capability/tool-orphan-repair

DouweM commented Apr 10, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

DouweM commented Apr 10, 2026

Uh oh!

DouweM commented Apr 10, 2026

Uh oh!

sarth6 commented May 14, 2026

Uh oh!

agnim25 commented May 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

DouweM commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

DouweM commented Apr 10, 2026

Audit vs prior art: ToolOrphanRepair

Uh oh!

DouweM commented Apr 10, 2026

Notes from comparing with Hermes, Pi-mono, and Mastra

Uh oh!

sarth6 commented May 14, 2026

Uh oh!

agnim25 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DouweM commented Apr 10, 2026 •

edited

Loading

agnim25 commented May 31, 2026 •

edited

Loading