feat: add FileSystem and Shell capabilities by strawgate · Pull Request #260 · pydantic/pydantic-ai-harness

strawgate · 2026-05-27T23:52:13Z

Summary

Adds two new batteries-included capabilities built on the AbstractCapability / FunctionToolset primitives.

`FileSystem`

Sandboxed file-system access scoped to a root directory.

All paths resolved and containment-checked before any I/O (traversal prevention, symlink-aware)
Allow / deny / protected pattern filtering (globs), matched against the canonical path (a leading **/ also covers the root)
Protected defaults: .git/, .env, *.pem, *.key, secrets
Tools: read_file, write_file, edit_file (hash-based optimistic concurrency), list_directory, search_files, find_files, create_directory, file_info
__post_init__ validates that integer limit fields (max_read_lines, max_search_results, max_find_results) are positive non-bool ints

`Shell`

Configurable command execution for agents.

Allow/deny command lists; deny operator list (e.g. >, >>, |)
Interactive command detection (vi, sudo, ssh, …)
Synchronous (run_command) and background (start_command / check_command / stop_command) execution
Output truncated and labelled with stdout/stderr/exit-code markers
Optional persist_cwd so cd is sticky across calls — the working directory is captured out-of-band (a private temp file), so command output can't spoof it
__aexit__ terminates all remaining background processes — the agent runtime enters toolsets via AsyncExitStack, so cleanup is guaranteed on run end (success or error)
for_run returns a fresh toolset per run, so persistent cwd and background processes are isolated across concurrent runs

Robustness

Model-correctable tool errors (missing file, denied path, stale edit, denied command) surface as ModelRetry so the agent can self-correct instead of aborting the run
Both capabilities and their toolsets are generic over AgentDepsT

Test coverage

100% branch coverage (enforced by fail_under = 100)
Mutation testing: 89.7% kill rate (60 provably-equivalent survivors documented in docs/mutation-testing.md)
pyright strict: 0 errors

Validation

make lint && make typecheck && make testcov

All pass.

Follow-ups (tracked separately, out of scope here)

ripgrep-backed search/find/list for large-repo performance (deferred per review)
max_file_size guard + search timeout (ReDoS / OOM hardening)
read-only mode / per-tool selection; deps-scoped roots; HITL approval for destructive ops

- FileSystemToolset: 8 tools (read, write, edit, list, search, find, mkdir, info) with path-traversal prevention, allow/deny patterns, optimistic concurrency - ShellToolset: 1 tool (run_command) with command validation, timeout handling, and async subprocess execution via anyio - 152 tests passing on asyncio backend, 100% branch coverage - Mutation testing: 524/584 killed (89.7%), 60 equivalent mutants documented - All survivors proven equivalent (trampoline defaults, encoding case-insensitivity, unreachable except blocks, dead branches, name=None fallback behavior)

strawgate

have not run through the tests yet

strawgate · 2026-05-28T00:05:32Z

+                continue
+            if _is_binary(raw):
+                continue
+            text = raw.decode('utf-8', errors='replace')


It might be worth seeing if we want to shell out to ripgrep. I have a library https://github.com/strawgate/rpygrep that manages some of that but we can probably find another alternative.

Ripgrep 100%

- Remove bool guard from __post_init__ validation (pedantic AI slop) - Pre-calculate _real_root in __init__ instead of per-call - Extract _first_matching_pattern helper, use in _check_access - Extract _first_denied_operator helper, use in _check_command - Change _format_lines signature from list[str] to Sequence[str] and split lines once at call site in read_file - write_file: raise FileNotFoundError for missing parents instead of auto-creating - list_directory: add pragma: no cover comment on OSError branch - _check_command docstring: document best-effort security boundary - Remove debug = true from [tool.mutmut] - Update tests to match: remove test_bool_max_read_lines_rejected, change test_write_creates_parents -> test_write_nonexistent_parent_raises, add isolation tests for _first_matching_pattern and _first_denied_operator

mpfaffenberger · 2026-05-28T10:40:25Z

+            raise NotADirectoryError(f'Not a directory: {path}')
+
+        entries: list[str] = []
+        for entry in sorted(resolved.iterdir()):


Recursive listing would be extremely slow on large projects. I would recommend adding ripgrep as a hard dependency (even for listing files)

If you want to see Python harness tooling that is battle tested (millions of uses per day) please look at https://github.com/mpfaffenberger/code_puppy

mpfaffenberger · 2026-05-28T10:41:02Z

+                continue
+            if _is_binary(raw):
+                continue
+            text = raw.decode('utf-8', errors='replace')


Ripgrep 100%

mutmut pulls in a large dependency tree and is only used to validate test quality, not for normal development or CI. Keep its config in pyproject.toml (mutmut v3 has no CLI flag to override the config path) but install it ephemerally via 'uv run --with' from a small script. The [tool.mutmut] block is unchanged: paths_to_mutate, tests_dir, also_copy, and pytest_add_cli_args still live in pyproject.toml. Only the dev dependency is removed. docs/mutation-testing.md is updated to reference the new script.

strawgate flagged 'getattr(self, name)' in __post_init__ as 'bad claude'. The runtime isinstance validation is still useful (dataclass field annotations are advisory, not enforced) but the getattr is unnecessary and obscures the intent. Iterate fields directly via a typed dict so pyright doesn't narrow the isinstance check away.

strawgate flagged that 'default: max_read_lines' in a docstring is unhelpful — readers can't tell what the actual default is without scrolling to the class definition. Replace with the literal value in both filesystem and shell toolsets.

dsfaccini · 2026-06-01T18:11:41Z

Claude here: full review of #260 across correctness, security, repo conventions, pyai-idiomatic API usage, and extensibility. CI is green (all 18 checks), but a few things the green build can't catch are worth resolving before merge. Delegated parts of this to subagents (a security pass, a convention-drift pass) and to a fresh Claude session running inside the pydantic-ai core checkout for the API-idiom questions.

I've tried to be concrete and to back the load-bearing claims with sources. Two of these (B3, B4) I know you're skeptical of — I've put the strongest evidence I have on those.

Blockers I'd want fixed before merge

B1 — `search_files` / `find_files` / `list_directory` bypass the entire allow/deny/protected model

FileSystemToolset._check_access runs only on the directory argument, never on the per-result entries produced by rglob / glob / iterdir. Those results are filtered only by part.startswith('.'). Consequences:

denied_patterns does nothing for these three enumeration tools.
The default protected set (*.pem, *.key, **/secrets*) — none of which are dotfiles — is fully readable. search_files(pattern="PRIVATE KEY") will read and return matching lines from every key file in the tree.

This contradicts the security model the toolset's own docstring advertises. Fix: run _check_access (read mode) on each candidate and skip on PermissionError. (This is also where the ripgrep discussion below lands — a ripgrep-backed implementation would re-do this traversal anyway, so it's worth solving them together.)

B2 — recoverable tool errors abort the run instead of letting the model retry

Every tool raises bare PermissionError / FileNotFoundError / ValueError / NotADirectoryError. pyai core only turns ModelRetry (and ValidationError) into a retry prompt the model can recover from — see _tool_manager.py, where only SkipToolExecution / CallDeferred / ApprovalRequired / ModelRetry are special-cased and everything else propagates and ends the run.

So "file not found", "old_text not found", "command denied", "invalid regex" — all conditions the model should be able to correct on its own — currently crash the agent. The new code raises ModelRetry nowhere. Fix: raise ModelRetry for model-correctable conditions; reserve raw exceptions for genuine internal failures.

B3 — mutable run-state on a shared toolset, no `for_run` isolation

This is your "are toolsets supposed to be shareable?" question — and the answer from core is: yes, by default they're shared, and that is exactly why core gives you for_run to opt out when you hold mutable state. Quoting the base contract (AbstractToolset.for_run, pydantic-ai v1.95.1):

Called once per run, before __aenter__. Override this to return a fresh instance for per-run state isolation. Default: return self (shared across runs).

Source: https://github.com/pydantic/pydantic-ai/blob/v1.95.1/pydantic_ai_slim/pydantic_ai/toolsets/abstract.py#L110-L116

So a stateless toolset is fine to share — you're right about that. The issue is specific: get_toolset is called once at agent construction, so a single ShellToolset instance is reused across every run of that agent, and it carries mutable per-run state — self._background (the background-process dict) and self._cwd (persistent cwd). Two concurrent runs of the same agent will corrupt each other: run A's cd moves run B's cwd, and run A's __aexit__ terminates run B's background processes.

The in-repo precedent is our own exemplar: CodeModeToolset.for_run returns replace(self, …) precisely to give each run a fresh instance with reset state, and documents why. The two new toolsets override neither for_run nor for_run_step. Fix: implement for_run to return a fresh instance with an empty _background and reset _cwd. (This also makes the otherwise well-written __aexit__ cleanup sound — __aexit__ fires per run, so on a shared instance it's tearing down state that may belong to another run.)

B4 — `persist_cwd` sentinel is spoofable / silently self-disabling

I know you flagged this thread as "may be dubious… leave for now." I think it's a real correctness footgun, less as a security hole and more as a silent-wrong-behavior one:

_wrap_command_for_cwd skips wrapping whenever ';' in command. That substring test also fires on a ; inside a quoted string (echo "a;b"), so cwd tracking silently turns off for perfectly benign commands.
_extract_cwd_from_output parses a fixed literal __HARNESS_PWD__ out of stdout. cd /tmp; echo __HARNESS_PWD__/etc runs unwrapped (because of the ; skip), exits 0, and the parser moves the persistent cwd to /etc. Any command that emits that literal can steer subsequent commands' working directory.

The exploit window for malicious redirection is narrow (the genuine appended sentinel usually wins via rfind), so I wouldn't sell this as a security hole — but the silent-disable and the spoof-on-;-skip paths are real wrong behavior. Fix: write pwd to an out-of-band channel (a temp file the wrapper writes, read by the toolset) instead of parsing it out of stdout.

Unresolved review threads — are they blockers?

All 14 are still open. Breakdown:

6 are outdated (already addressed by the "non-controversial feedback" commit but never marked resolved): calculate during init, _in_pattern_list helper, split-lines, throw-if-dir-missing, the "best-effort, not a security boundary" doc, and the line-259 comment. Not blockers — just worth resolving so the PR stops looking like it has 14 live objections.
ripgrep (raised by both reviewers, on search_files and list_directory): "recursive listing extremely slow on large projects." Effectively a blocker given two independent asks — either adopt ripgrep or get explicit agreement to defer. Ties into B1.
"make directory protection optional" (_capability.py root_dir): protection is already opt-out via an empty protected_patterns — worth clarifying whether you want protections off by default (opt-in) instead. Design call.
"bad claude" (the __post_init__ getattr-loop validation): trivial — replace the reflective loop with explicit validation.
"constants in docstrings aren't helpful" (read_file docstring default: max_read_lines): doc nit.
pyproject.toml:154 "Remove": ties to D3 below (strip mutmut config from this PR).
the cwd thread: see B4.

No new surprises hiding in the threads — ripgrep and the cwd sentinel are the two substantive open ones.

Repo-convention drift

D1 — no README.md. code_mode and the sibling capability PRs (feat: input and output guardrails (block, redact, retry) #249, Add StepPersistence capability for step-event durability across delegates #251) all ship one, and agent_docs/index.md, capability-authoring.md, and review-checklist.md all require it. Skeleton to match: # Title → The problem → The solution → Usage (runnable Agent(...) example) → feature sections → API. Shell's README must document the safety model.
D2 — AbstractCapability[Any] instead of [AgentDepsT]. Base is Generic[AgentDepsT]; CodeMode, InputGuard, StepPersistence all stay generic, and the repo bans Any in public signatures. These tools don't touch deps, so it's a pure type-hygiene fix: FileSystem(AbstractCapability[AgentDepsT]), FunctionToolset[AgentDepsT], return AgentToolset[AgentDepsT]. (Also Shell.get_toolset is annotated | None but always returns a toolset — tighten it.)
D3 — per-PR mutmut config in repo-wide pyproject.toml. Adds the mutmut dev-dep + a [tool.mutmut] block hardcoding this PR's two files + mutants exclusions + scripts/run-mutmut.sh + docs/mutation-testing.md. No prior capability PR carries this, and it bumps against the "don't edit pyproject.toml/uv.lock, open an issue for deps" rule. Suggest stripping it from the PR (keep the 89.7% result in the PR description) and landing mutation testing repo-wide separately if wanted.
D4 — PR body tool names wrong. Body says list_files / make_dir; code registers list_directory / create_directory. Pick the public names deliberately and make code + body + README agree.
D5 — class docstrings lack the runnable python usage example code_mode has, and the root README.md capability table still shows File system / Shell as :construction: linking to the old PR Add FileSystem and Shell capabilities #177.

pyai-idiomatic API review (is this reinventing core?)

Checked against core source:

Not reinventing implementations — core has no filesystem or shell tools (common_tools/ is web-only), no FS/process sandbox. Harness is the right home. ✅
Missing HITL/approval composition — core ships requires_approval=True, ApprovalRequiredToolset / .approval_required(func), ApprovalRequired / CallDeferred, and the HandleDeferredToolCalls capability. Hard-PermissionError-denying destructive ops (rm, writes to protected paths) is exactly what that flow exists for, and code_mode already integrates with it. This is the highest-leverage design improvement, but it's a bigger conversation — flagging, not asking you to add it silently.
Subclassing FunctionToolset + add_function(self.method, …) for static config is fine and supported; the only problem is the mutable state (→ B3).

Security (beyond B1/B4)

Patterns match the agent-supplied string, not the canonical path (_check_access): config/./secret.txt evades denied=['config/secret.txt'], and fnmatch doesn't implement **, so the default **/secrets* does not protect a root-level secrets.yaml. Match against the resolved relative path.
_real_root is captured once at construction, so a parent-dir symlink swap can defeat containment (precondition: write access to a parent of root).
ReDoS / event-loop block in search_files (unbounded agent-supplied regex, no timeout) and whole-file read_bytes before any size guard (no max-file-size knob).
The shell denylist/allowlist bypasses (tokens[0]-only, /bin/rm, bash -c, $(...), substring operator matching) are inherent and correctly disclaimed — not defects. One ask: lift that "not a security boundary" disclaimer from the private _check_command up into the public Shell dataclass docstring, since denied_commands / denied_operators read as protective there.

Missing knobs / extensibility (opinion — not asking to expand scope here, just flagging what may gate adoption)

Read-only / per-tool selection on FileSystem. Today it's all 8 tools or nothing. "Give the agent read-only repo access" is one of the most common asks and is currently impossible without subclassing. A read_only: bool or a tool-subset selector (mirroring CodeMode.tools) is small and high-value.
deps-scoped root_dir / cwd. Root/cwd are frozen at construction and tools take no RunContext, so you can't scope the sandbox per-run/per-user (multi-tenant). Fixing D2 + B3 is the prerequisite; deriving root from ctx.deps would unlock it. Worth at least not foreclosing.
HITL approval hook (above) — for shell especially, "destructive op → ask a human" is the difference between a demo and something teams deploy.

Lower-priority, worth tracking in issues rather than this PR: max_file_size, env-var control for Shell (inject/scrub secrets from the subprocess env), configurable encoding, and the ripgrep backend.

Happy to implement B1–B4 + D2 and draft the two READMEs if you want — or to open follow-up issues for read-only mode, deps-scoped roots, HITL approval, and ripgrep so they're tracked without bloating this PR.

list_directory, search_files, and find_files previously only checked the root path passed to the tool, so a recursive rglob could return or read through files that the agent would otherwise be denied direct access to. Add _is_accessible(rel, write=True) as a predicate form of _check_access and call it on each entry walked by the three recursive operations, so denied and protected patterns hide children the same way they block direct read/write. Reads of protected files remain allowed in isolation; the listing operations pass write=True to make existence match the read-side policy that protected paths can't be opened.

…toolset docs A file pattern like `src/*.py` in `allowed_patterns` is a per-file rule, but the walkers gated their *root* directory on it too — so `list_directory('.')`, `search_files('.', ...)` and `find_files('*')` always raised, because a directory root never matches a file pattern. The toolset was effectively unusable whenever an allowlist was set. Walkers now skip the allowlist gate on their root (deny/protected patterns still apply) and filter each entry instead, matching how the dotfile and deny/protected filters already behave. The two stubbed tests that documented this limitation are now real end-to-end assertions. Also add README docs for the filesystem and shell toolsets and trim the mutation-testing doc down to the durable how-to-run guidance, dropping the frozen, already-stale result snapshot.

list_directory surfaced dot-entries (.git/, .env, .hidden) while search_files and find_files deliberately skip any dot-prefixed path component, so the three walkers disagreed on what exists and a listing leaked .git/. Apply the same dotfile skip to list_directory; this also makes the existing test_list_hides_protected_entries comment ("hidden by the dotfile filter") true.

Truncation dropped everything past max_output_chars from the start, discarding exactly the part that matters most — the end of stdout, the [stderr] section (appended last), and exit info — which is where errors and stack traces land. Keep the final max_output_chars instead and drop the head. This also removes the convoluted stderr-budget branch, since the tail naturally preserves stderr. Addresses review feedback on PR #260.

dsfaccini · 2026-06-02T01:10:37Z

+    def _wrap_command_for_cwd(self, command: str) -> str:
+        """Append pwd sentinel to command for cwd tracking.
+
+        Commands containing ';' are returned unwrapped because the separator
+        breaks the '&&' success-gating of the sentinel echo.
+        """
+        if ';' in command:
+            return command
+        return f'{command} && echo {_PWD_SENTINEL}$(pwd)'


Claude here: this is the B4 footgun. Two concrete problems with parsing the cwd out of stdout:

1. ; silently disables cwd tracking. Any command containing ; is returned unwrapped, so a real cd is dropped on the floor:

run_command('cd subdir ; true') # persist_cwd=True run_command('pwd') # -> still the OLD dir; the cd never persisted

The substring test also misfires on a ; inside a quoted string, disabling tracking for benign commands.

2. The sentinel is spoofable through that same ;-skip path. A command whose stdout contains the literal __HARNESS_PWD__<dir> moves the tracked cwd with no real cd:

run_command('true ; echo __HARNESS_PWD__/some/existing/dir') # -> self._cwd is now /some/existing/dir; the next command runs there

Regression tests proving both (they fail on this revision, pass after the fix):

async def test_cd_persists_even_with_semicolon(self, persist_toolset): await persist_toolset.run_command('cd subdir ; true') assert 'subdir' in await persist_toolset.run_command('pwd') async def test_output_cannot_spoof_cwd(self, persist_toolset, shell_dir): await persist_toolset.run_command(f'true ; echo __HARNESS_PWD__{shell_dir / "subdir"}') assert persist_toolset._cwd == shell_dir # no real cd happened

Fix (implemented locally): drop the stdout sentinel entirely and capture pwd out-of-band into a private temp file the agent's command can't address — {command}\n__harness_ec=$?\npwd > <random_tempfile>\nexit $__harness_ec — then read the cwd from that file. No ;-skip, nothing to spoof. Happy to push it onto the branch or hand you the diff, whichever you prefer.

…dening, generic typing Address blockers from PR review so the capabilities are robust under real agent use, not just under TestModel: - B2: tool methods raised native exceptions that pyai propagates and aborts the run on. A `_recoverable` decorator now surfaces model-correctable errors (missing file, denied path, stale edit, denied command) as `ModelRetry` so the agent can self-correct. Internal helpers keep raising native exceptions. - B3: ShellToolset held mutable per-run state (`_cwd`, `_background`) on the single instance `get_toolset` builds at construction, so concurrent runs corrupted each other. `for_run` now returns a fresh copy, matching the CodeModeToolset exemplar. - B4: persist_cwd parsed cwd from stdout, which a `;` could silently disable and command output could spoof. Capture `pwd` out-of-band via a private temp file instead. - Sec#3: pattern matching ran against the agent-supplied string, letting `config/./secret.txt` evade a deny rule, and `**/secrets*` missed root-level files (leaking them via search). Match the canonical path; treat a leading `**/` as covering the root. - D2: parametrize FileSystem/Shell and their toolsets on `AgentDepsT` instead of `Any`, matching the rest of the library. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

strawgate added 2 commits May 26, 2026 19:56

Clean-up Filesystem and Shell capabilities

592ba63

strawgate commented May 28, 2026

View reviewed changes

Comment thread pyproject.toml

mpfaffenberger reviewed May 28, 2026

View reviewed changes

This was referenced Jun 1, 2026

Master issue: ship the execution-environment capability (gh-aw shim parity + pluggable backend) #265

Open

feat: execution environments (Local + Docker) #261

Closed

strawgate added 3 commits June 1, 2026 12:18

adtyavrdhn reviewed Jun 1, 2026

View reviewed changes

Comment thread pydantic_ai_harness/shell/_toolset.py Outdated

strawgate and others added 4 commits June 1, 2026 13:39

dsfaccini reviewed Jun 2, 2026

View reviewed changes

dsfaccini merged commit e1654cc into main Jun 2, 2026
18 checks passed

Conversation

strawgate commented May 27, 2026 • edited by dsfaccini Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

FileSystem

Shell

Robustness

Test coverage

Validation

Follow-ups (tracked separately, out of scope here)

Uh oh!

strawgate left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

strawgate May 28, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mpfaffenberger May 28, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger May 28, 2026

Choose a reason for hiding this comment

Uh oh!

dsfaccini commented Jun 1, 2026

Blockers I'd want fixed before merge

B1 — search_files / find_files / list_directory bypass the entire allow/deny/protected model

B2 — recoverable tool errors abort the run instead of letting the model retry

B3 — mutable run-state on a shared toolset, no for_run isolation

B4 — persist_cwd sentinel is spoofable / silently self-disabling

Unresolved review threads — are they blockers?

Repo-convention drift

pyai-idiomatic API review (is this reinventing core?)

Security (beyond B1/B4)

Missing knobs / extensibility (opinion — not asking to expand scope here, just flagging what may gate adoption)

Uh oh!

Uh oh!

dsfaccini Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

strawgate commented May 27, 2026 •

edited by dsfaccini

Loading

`FileSystem`

`Shell`

B1 — `search_files` / `find_files` / `list_directory` bypass the entire allow/deny/protected model

B3 — mutable run-state on a shared toolset, no `for_run` isolation

B4 — `persist_cwd` sentinel is spoofable / silently self-disabling