feat: input and output guardrails (block, redact, retry) #249

Open
dsfaccini wants to merge 17 commits into main from guardrails

Conversation

Contributor

@dsfaccini dsfaccini commented May 21, 2026

Adds two guardrail capabilities with a minimal, callable-based API.

Supersedes #219. Original work by @DEENUU1 (Vstorm) — this PR is #219 plus a main merge and a review-driven redesign, opened from a branch on this repo so it can land while the original fork branch is unreachable for push. Full credit to @DEENUU1; review input from @Kludex and @adtyavrdhn carried over.

What it does

  • InputGuard(guard, parallel=False) — runs before the first model request.
  • OutputGuard(guard) — runs as the model output is processed (after_output_process).

A guard is any sync/async callable. It receives the inspected value — the prompt, or the output — optionally preceded by a RunContext (signature-detected, like pydantic-ai's output validators). It returns a bare bool (True = allow) or a GuardResult.
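The calling convention described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `takes_ctx`, `call_guard`, the two sample guards, and the plain dict standing in for a RunContext are all hypothetical names, and counting parameters is only one simple way to detect a leading context argument.

```python
import asyncio
import inspect
from collections.abc import Callable
from typing import Any


def takes_ctx(func: Callable[..., Any], *, value_params: int = 1) -> bool:
    """Return True if `func` accepts more parameters than the inspected value alone."""
    return len(inspect.signature(func).parameters) > value_params


async def call_guard(guard: Callable[..., Any], ctx: Any, value: Any) -> Any:
    """Call a sync or async guard, passing the context only if the guard asks for it."""
    outcome = guard(ctx, value) if takes_ctx(guard) else guard(value)
    if inspect.isawaitable(outcome):
        outcome = await outcome
    return outcome


def no_secrets(prompt: str) -> bool:
    # Sync guard that only looks at the prompt. True means "allow".
    return "password" not in prompt.lower()


async def within_budget(ctx: dict[str, int], prompt: str) -> bool:
    # Async guard that also reads the (stand-in) context.
    return len(prompt) <= ctx["max_chars"]
```

Both guards can then be driven through the same entry point, for example `asyncio.run(call_guard(no_secrets, None, "hello"))`.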

Guard outcomes

GuardResult, built via classmethods:

| Outcome | InputGuard | OutputGuard |
| --- | --- | --- |
| allow() | send the prompt | return the output |
| block(message=None) | skip the model call; the refusal message becomes the response (SkipModelRequest) | raise OutputBlocked |
| replace(value) | rewrite the prompt sent to the model (redaction) | substitute a sanitized output |
| retry(message) | usage error | send the output back to the model (ModelRetry) |

A guard that raises propagates the exception as a hard failure.
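To make the four outcomes concrete, here is a small sketch of a redacting guard. The `GuardResult` class below is a simplified stand-in written for this example (its field names follow the snippets quoted in the review, but its exact shape is an assumption), and `redact_emails` and the `EMAIL` pattern are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class GuardResult:
    """Simplified stand-in for the PR's GuardResult (assumed shape)."""

    action: Literal["allow", "block", "replace", "retry"]
    message: Optional[str] = None
    replacement: Optional[object] = None

    @classmethod
    def allow(cls) -> "GuardResult":
        return cls(action="allow")

    @classmethod
    def block(cls, message: Optional[str] = None) -> "GuardResult":
        return cls(action="block", message=message)

    @classmethod
    def replace(cls, value: object) -> "GuardResult":
        return cls(action="replace", replacement=value)

    @classmethod
    def retry(cls, message: str) -> "GuardResult":
        return cls(action="retry", message=message)


# Deliberately loose pattern: good enough for an illustration, not for production.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(prompt: str) -> GuardResult:
    """Redact e-mail addresses instead of refusing the whole prompt."""
    redacted = EMAIL.sub("[email]", prompt)
    if redacted != prompt:
        return GuardResult.replace(redacted)
    return GuardResult.allow()
```

A guard like this would fit the InputGuard column of the table: `replace` rewrites the prompt, while a clean prompt passes through with `allow`.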

Observability

replace and block emit spans on the run's OpenTelemetry tracer (guardrail redacted input, guardrail blocked output, …) with guardrail.* attributes, so redactions and refusals are visible in Logfire. Redacted content is attached only when RunContext.trace_include_content is set. retry needs no special tracing — the retried request appears in the trace on its own.

Notable design points

  • OutputGuard uses after_output_process (not after_run) so it can redact and trigger ModelRetry; it runs on the final output only, not streaming partials.
  • InputGuard replace requires sequential mode (a parallel guard races a model call already started with the original prompt); retry is rejected for input.
  • parallel=True trades tokens for latency — sequential never calls the model on a blocked prompt.
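The latency/token trade-off of parallel mode can be illustrated with a small asyncio sketch. It is not the PR's implementation (the snippet under review uses its own task handling inside `wrap_model_request`); `race_guard`, its parameters, and the refusal text are hypothetical, and the guard is reduced to an async callable returning a bool.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def race_guard(
    guard: Callable[[str], Awaitable[bool]],
    handler: Callable[[str], Awaitable[str]],
    prompt: str,
    refusal: str = "Blocked by input guardrail.",
) -> str:
    """Start the guard and the model call together; discard the model call on a block."""
    guard_task = asyncio.ensure_future(guard(prompt))
    handler_task = asyncio.ensure_future(handler(prompt))
    try:
        # The handler is already running while we wait for the verdict,
        # so a blocked prompt may still have cost tokens.
        if not await guard_task:
            return refusal
        return await handler_task
    finally:
        # Cancel whichever task is still pending, then wait for both to settle
        # so no task is left dangling.
        for task in (guard_task, handler_task):
            if not task.done():
                task.cancel()
        await asyncio.gather(guard_task, handler_task, return_exceptions=True)
```

In a sequential variant the handler would only be started after the guard returned True, which is why that mode never calls the model on a blocked prompt.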

Follow-ups

Prepackaged LLM-based guardrails and Presidio/Azure/OpenAI moderation docs are tracked in #248.

Checks

make format && make lint && make typecheck clean. Guardrails: 49 tests, 100% branch coverage; full suite green.

DEENUU1 and others added 11 commits April 24, 2026 12:05
block_message now accepts a callable so the refusal text can reflect the
prompt/output that tripped the guard, rather than being frozen at
construction time.

InputGuard's sequential path moves from before_model_request into
wrap_model_request, so a single hook covers both sequential and parallel
modes instead of two hooks each branching on `parallel`.

Tests move to tests/guardrails/ to match the tests/<capability>/ layout.

dsfaccini added 3 commits May 21, 2026 15:51
A guard now returns either a bare bool or a GuardResult carrying a
refusal message, replacing the separate block_message constructor
field. The message is produced when the guard decides, so it can
reflect the guard's own reasoning rather than a string frozen at
construction time.

Guards (and the GuardResult path) may optionally take a RunContext as
a first parameter, detected from the signature like pydantic-ai's
output validators, so deps- and history-aware guards are possible
without closing over globals. Prompt/output-only guards are unchanged.

A guard now reports one of four outcomes via GuardResult classmethods
(bool shorthand still works): allow, block, replace, retry.

- replace lets a guard redact rather than refuse — InputGuard rewrites
  the prompt sent to the model, OutputGuard substitutes the output.
- retry lets OutputGuard send a bad output back to the model; OutputGuard
  moves from after_run to after_output_process so it can raise ModelRetry
  and return a modified output.
- replace and block emit spans on the run tracer so a redaction or
  refusal is visible in Logfire; redacted content is included only when
  RunContext.trace_include_content is set.

InputGuard replace requires sequential mode and retry is rejected as a
usage error, since neither is meaningful for input.

The coverage gate measures test files too; the _prompt_text helper had
unreached branches. Drop it and assert on message parts inline.

@dsfaccini changed the title from "feat: add input and output guardrails" to "feat: input and output guardrails (block, redact, retry)" May 21, 2026

dsfaccini added 2 commits May 21, 2026 17:33
A pydantic-ai-correctness review surfaced two streaming points. Verified
both empirically:

- InputGuard(parallel=True) works under run_stream() — no deadlock.
- OutputGuard GuardResult.retry() is unsupported under run_stream():
  pydantic-ai does not retry output while streaming, so a retry verdict
  surfaces as UnexpectedModelBehavior.

Document the retry limitation and that OutputGuard screens only the final
output (partial chunks reach the caller first while streaming). Note that
input redaction also rewrites persisted history and targets text prompts.
Add streaming tests for both guards to lock the behavior in.

The test proving InputGuard(parallel=True) does not deadlock under
run_stream() had no timeout — a reintroduced deadlock would hang CI
instead of failing. Wrap it in asyncio.wait_for and document the
reviewed concern it guards against.
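The timeout idea from the commit message above can be sketched as follows. The helper names are hypothetical and the hanging coroutine merely simulates a deadlock; the real test wraps a `run_stream()` call, which is not reproduced here.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def run_with_deadline(make_coro: Callable[[], Awaitable[Any]], timeout: float) -> Any:
    """Fail with a timeout error instead of hanging if the awaited work never finishes."""
    return await asyncio.wait_for(make_coro(), timeout=timeout)


async def simulated_deadlock() -> None:
    # Waits on an event nobody sets, standing in for a reintroduced deadlock.
    await asyncio.Event().wait()


def deadlock_is_reported(timeout: float = 0.05) -> bool:
    """Return True if the simulated deadlock surfaces as a timeout."""
    try:
        asyncio.run(run_with_deadline(simulated_deadlock, timeout))
    except asyncio.TimeoutError:
        return True
    return False
```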
"""


def _takes_ctx(func: Callable[..., object]) -> bool:

Member

We could import this from pydantic-ai. Although it is private, we run the suite, so we won't break the harness.

Contributor Author

Claude here: pydantic-ai does not expose a shared takes_ctx helper — _output.py::OutputValidator and _system_prompt.py each inline the same len(inspect.signature(...).parameters) > N check. Our _takes_ctx mirrors that convention exactly. Happy to swap if a public helper lands.

Member

Oh yeah, it doesn't. We'll need to import from a private module, which I am fine with in this case given it is our private module.



GuardOutcome = bool | GuardResult
"""What a guard callable returns: a bare `bool` (`True` = allow), or a `GuardResult`."""

Member

The public docstrings could use another look; they could be simpler.

outcome = guard(ctx, value) if _takes_ctx(guard) else guard(value)
if inspect.isawaitable(outcome):
    outcome = await outcome
if isinstance(outcome, GuardResult):

Member

Isn't this then enforced based on the type?

Contributor Author

Claude here: the type annotation is just a hint — runtime discrimination of the bool | GuardResult union needs the isinstance check to narrow. Python doesn't enforce union types at runtime.
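The runtime narrowing described in this reply might look like the sketch below. `GuardResult` is again a reduced stand-in, `to_verdict` is a hypothetical helper, and raising `TypeError` for any other return type is an assumption of this example rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class GuardResult:
    """Reduced stand-in: only the fields this example needs."""

    action: str
    message: Optional[str] = None


def to_verdict(outcome: Union[bool, GuardResult]) -> GuardResult:
    """Normalize a guard's return value into a GuardResult at runtime."""
    # The annotation above is not enforced by Python, so each branch checks the type.
    if isinstance(outcome, GuardResult):
        return outcome
    if isinstance(outcome, bool):
        # Bool shorthand: True means allow, False means block.
        return GuardResult(action="allow" if outcome else "block")
    raise TypeError(f"guard returned {type(outcome).__name__}, expected bool or GuardResult")
```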

original prompt.

Scope: the guard runs exactly once per run — on the first model request —
and evaluates the original user prompt. Subsequent model requests in the

Member

Claude has become increasingly verbose. I am sure it could explain the same thing in fewer words. I dunno if they wanna increase their token costs with this ://

raise UserError('InputGuard could not find a user prompt to redact in the request.')
_trace_redaction(ctx, direction='input', original=prompt, replacement=replacement)

async def wrap_model_request(

Member

@adtyavrdhn adtyavrdhn May 26, 2026

I see something tricky here, I think the position for this capability needs to be innermost so that no user morphed messages can sneak by and we can be confident this runs only after all the other capabilities are done.

OutputGuard similarly would need to be outermost, but there are outermost capabilities already, so it needs to be wrapped by instrumentation or something? Look at the deferred caps PR (pydantic/pydantic-ai#5230) for reference; it ended up doing something similar without breaking instrumentation.

Contributor Author

Claude here: applied. InputGuard.get_ordering() → position='innermost' so any message-morphing capability runs first and the guard sees the final prompt. OutputGuard.get_ordering() → position='outermost', wrapped_by=[Instrumentation] so the guard's block/redact spans are always captured by an enclosing Instrumentation span regardless of user list order. Tests added (TestInputGuardOrdering, TestOutputGuardOrdering).

if ctx.partial_output:
    return output
verdict = await _evaluate(self.guard, ctx, output)
if verdict.action == 'allow':

Member

Can we use match instead?

Also, there are multiple if conditions, but this is a verdict, right? More than one cannot be true.

"""Record a zero-duration span marking a guardrail refusal."""
ctx.tracer.start_span(
    f'guardrail blocked {direction}',
    attributes={'guardrail.direction': direction, 'guardrail.action': 'block', 'guardrail.message': message},

Member

Is it all right for us to dump the message as is? Is there a risk that the LLM explains in the message why it blocked it and ends up leaking it anyway?

Contributor Author

Claude here: applied — _trace_block now gates guardrail.message behind RunContext.trace_include_content, matching how _trace_redaction handles content. Default traces stay content-free; debugging can opt in via include_content. The user-facing refusal text (the SkipModelRequest response / OutputBlocked exception) is unchanged.
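The gating described here can be sketched without any tracing dependency by building the attribute dict separately. `block_span_attributes` is a hypothetical helper; the attribute keys are the ones visible in the snippet under review, and the boolean flag stands in for RunContext.trace_include_content.

```python
def block_span_attributes(direction: str, message: str, *, include_content: bool) -> dict[str, str]:
    """Build span attributes for a refusal, attaching the message only on opt-in."""
    attributes = {
        "guardrail.direction": direction,
        "guardrail.action": "block",
    }
    if include_content:
        # The refusal text may quote the guarded value, so it is treated as content.
        attributes["guardrail.message"] = message
    return attributes
```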

_DEFAULT_OUTPUT_RETRY_MESSAGE = 'Output rejected by output guardrail.'


@dataclass

Member

Let us use frozen and kw_only


replacement: object | None = None
"""For `replace`, the value substituted for the inspected one."""

Member

We should add a __post_init__ to verify the integrity of this result

finally:
    for task in (guard_task, handler_task):
        if not task.done():
            task.cancel()

Member

Worth letting Claude check whether this needs a shield: anyio-level cancellations from the graph can seep in and throw cancellations here, which would crash the cancellation. I am still wrapping my head around it, so I am not sure that is 100% possible.

Contributor Author

Claude here: investigated with a regression test (test_no_dangling_tasks_on_outer_cancellation) that cancels the outer task mid-flight and asserts no leftover guard/handler tasks. It passes without asyncio.shield — the outer cancel is consumed by the asyncio.wait above the finally, so task.cancel() + await asyncio.gather(...) complete cleanly. Adding shield would be defensive but could mask a genuine re-cancellation; the test locks the current behavior in.

if verdict.action == 'block':
    message = verdict.message or _DEFAULT_INPUT_BLOCK_MESSAGE
    _trace_block(ctx, direction='input', message=message)
    raise SkipModelRequest(ModelResponse(parts=[TextPart(content=message)]))

Member

If we do this then the message remains in history and the model can see it in the next turn because we only go through the messages and pick the last prompt for InputGuard?

Contributor Author

Claude here: this is intended SkipModelRequest behavior in pydantic-ai — the canned ModelResponse is the assistant's turn, and a multi-turn chat that skips it would be broken. Per the SkipModelRequest docstring, the response is preserved in history. Callers that don't want the refusal carried forward can filter result.all_messages() or start a fresh agent for the next turn.

Member

@adtyavrdhn adtyavrdhn left a comment

Structurally great, have a few concerns.

Address @adtyavrdhn's review on #249.

- `InputGuard.get_ordering()` → `position='innermost'` so any
  message-morphing capability runs first and the guard sees the final
  prompt the model will receive.
- `OutputGuard.get_ordering()` → `position='outermost', wrapped_by=
  [Instrumentation]` so the guard's block/redact spans are always
  captured by an enclosing `Instrumentation` span regardless of user
  list order.
- `GuardResult` is `frozen=True, kw_only=True` with a `__post_init__`
  that rejects field combinations the four-outcome contract does not
  allow (e.g. `replace` without a replacement). `block` with no message
  stays valid — the default kicks in at the use site.
- `_run_guard` and `after_output_process` dispatch via `match action:`
  with `assert_never` exhaustiveness guards.
- `_trace_block` gates the refusal `message` attribute behind
  `ctx.trace_include_content`, matching `_trace_redaction` — the
  message can quote sensitive content from the guarded value.
- New tests: ordering declarations, `__post_init__` validation, frozen
  enforcement, outer-cancellation no-leak regression guard (which
  confirms `asyncio.shield` around the cleanup `gather` is not needed —
  the outer cancel is already consumed by `asyncio.wait`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Member

@adtyavrdhn adtyavrdhn left a comment

Approving post discussion

3 participants