Skip to content

feat: add FileSystem and Shell capabilities#260

Merged
dsfaccini merged 11 commits into
mainfrom
capability/filesystem-shell-v2
Jun 2, 2026
Merged

feat: add FileSystem and Shell capabilities#260
dsfaccini merged 11 commits into
mainfrom
capability/filesystem-shell-v2

Conversation

@strawgate
Copy link
Copy Markdown
Contributor

@strawgate strawgate commented May 27, 2026

Summary

Adds two new batteries-included capabilities built on the AbstractCapability / FunctionToolset primitives.

FileSystem

Sandboxed file-system access scoped to a root directory.

  • All paths resolved and containment-checked before any I/O (traversal prevention, symlink-aware)
  • Allow / deny / protected pattern filtering (globs), matched against the canonical path (a leading **/ also covers the root)
  • Protected defaults: .git/, .env, *.pem, *.key, secrets
  • Tools: read_file, write_file, edit_file (hash-based optimistic concurrency), list_directory, search_files, find_files, create_directory, file_info
  • __post_init__ validates that integer limit fields (max_read_lines, max_search_results, max_find_results) are positive non-bool ints

Shell

Configurable command execution for agents.

  • Allow/deny command lists; deny operator list (e.g. >, >>, |)
  • Interactive command detection (vi, sudo, ssh, …)
  • Synchronous (run_command) and background (start_command / check_command / stop_command) execution
  • Output truncated and labelled with stdout/stderr/exit-code markers
  • Optional persist_cwd so cd is sticky across calls — the working directory is captured out-of-band (a private temp file), so command output can't spoof it
  • __aexit__ terminates all remaining background processes — the agent runtime enters toolsets via AsyncExitStack, so cleanup is guaranteed on run end (success or error)
  • for_run returns a fresh toolset per run, so persistent cwd and background processes are isolated across concurrent runs

Robustness

  • Model-correctable tool errors (missing file, denied path, stale edit, denied command) surface as ModelRetry so the agent can self-correct instead of aborting the run
  • Both capabilities and their toolsets are generic over AgentDepsT

Test coverage

  • 100% branch coverage (enforced by fail_under = 100)
  • Mutation testing: 89.7% kill rate (60 provably-equivalent survivors documented in docs/mutation-testing.md)
  • pyright strict: 0 errors

Validation

make lint && make typecheck && make testcov

All pass.

Follow-ups (tracked separately, out of scope here)

  • ripgrep-backed search/find/list for large-repo performance (deferred per review)
  • max_file_size guard + search timeout (ReDoS / OOM hardening)
  • read-only mode / per-tool selection; deps-scoped roots; HITL approval for destructive ops

strawgate added 2 commits May 26, 2026 19:56
- FileSystemToolset: 8 tools (read, write, edit, list, search, find, mkdir, info)
  with path-traversal prevention, allow/deny patterns, optimistic concurrency
- ShellToolset: 1 tool (run_command) with command validation, timeout handling,
  and async subprocess execution via anyio
- 152 tests passing on asyncio backend, 100% branch coverage
- Mutation testing: 524/584 killed (89.7%), 60 equivalent mutants documented
- All survivors proven equivalent (trampoline defaults, encoding case-insensitivity,
  unreachable except blocks, dead branches, name=None fallback behavior)
Copy link
Copy Markdown
Contributor Author

@strawgate strawgate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have not run through the tests yet

Comment thread pydantic_ai_harness/filesystem/_capability.py
Comment thread pydantic_ai_harness/filesystem/_capability.py Outdated
Comment thread pydantic_ai_harness/filesystem/_toolset.py Outdated
Comment thread pydantic_ai_harness/filesystem/_toolset.py Outdated
Comment thread pydantic_ai_harness/filesystem/_toolset.py Outdated
Comment thread pydantic_ai_harness/filesystem/_toolset.py
continue
if _is_binary(raw):
continue
text = raw.decode('utf-8', errors='replace')
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth seeing if we want to shell out to ripgrep. I have a library https://github.com/strawgate/rpygrep that manages some of that but we can probably find another alternative.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ripgrep 100%

Comment thread pydantic_ai_harness/shell/_toolset.py Outdated
Comment thread pydantic_ai_harness/shell/_toolset.py Outdated
Comment thread pydantic_ai_harness/shell/_toolset.py Outdated
Comment thread pyproject.toml
- Remove bool guard from __post_init__ validation (pedantic AI slop)
- Pre-calculate _real_root in __init__ instead of per-call
- Extract _first_matching_pattern helper, use in _check_access
- Extract _first_denied_operator helper, use in _check_command
- Change _format_lines signature from list[str] to Sequence[str]
  and split lines once at call site in read_file
- write_file: raise FileNotFoundError for missing parents instead of auto-creating
- list_directory: add pragma: no cover comment on OSError branch
- _check_command docstring: document best-effort security boundary
- Remove debug = true from [tool.mutmut]
- Update tests to match: remove test_bool_max_read_lines_rejected,
  change test_write_creates_parents -> test_write_nonexistent_parent_raises,
  add isolation tests for _first_matching_pattern and _first_denied_operator
raise NotADirectoryError(f'Not a directory: {path}')

entries: list[str] = []
for entry in sorted(resolved.iterdir()):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recursive listing would be extremely slow on large projects. I would recommend adding ripgrep as a hard dependency (even for listing files)

If you want to see Python harness tooling that is battle tested (millions of uses per day) please look at https://github.com/mpfaffenberger/code_puppy

continue
if _is_binary(raw):
continue
text = raw.decode('utf-8', errors='replace')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ripgrep 100%

strawgate added 3 commits June 1, 2026 12:18
mutmut pulls in a large dependency tree and is only used to validate
test quality, not for normal development or CI. Keep its config in
pyproject.toml (mutmut v3 has no CLI flag to override the config path)
but install it ephemerally via 'uv run --with' from a small script.

The [tool.mutmut] block is unchanged: paths_to_mutate, tests_dir,
also_copy, and pytest_add_cli_args still live in pyproject.toml. Only
the dev dependency is removed.

docs/mutation-testing.md is updated to reference the new script.
strawgate flagged 'getattr(self, name)' in __post_init__ as 'bad claude'.
The runtime isinstance validation is still useful (dataclass field
annotations are advisory, not enforced) but the getattr is unnecessary
and obscures the intent. Iterate fields directly via a typed dict so
pyright doesn't narrow the isinstance check away.
strawgate flagged that 'default: max_read_lines' in a docstring is
unhelpful — readers can't tell what the actual default is without
scrolling to the class definition. Replace with the literal value
in both filesystem and shell toolsets.
@dsfaccini
Copy link
Copy Markdown
Contributor

Claude here: full review of #260 across correctness, security, repo conventions, pyai-idiomatic API usage, and extensibility. CI is green (all 18 checks), but a few things the green build can't catch are worth resolving before merge. Delegated parts of this to subagents (a security pass, a convention-drift pass) and to a fresh Claude session running inside the pydantic-ai core checkout for the API-idiom questions.

I've tried to be concrete and to back the load-bearing claims with sources. Two of these (B3, B4) I know you're skeptical of — I've put the strongest evidence I have on those.


Blockers I'd want fixed before merge

B1 — search_files / find_files / list_directory bypass the entire allow/deny/protected model

FileSystemToolset._check_access runs only on the directory argument, never on the per-result entries produced by rglob / glob / iterdir. Those results are filtered only by part.startswith('.'). Consequences:

  • denied_patterns does nothing for these three enumeration tools.
  • The default protected set (*.pem, *.key, **/secrets*) — none of which are dotfiles — is fully readable. search_files(pattern="PRIVATE KEY") will read and return matching lines from every key file in the tree.

This contradicts the security model the toolset's own docstring advertises. Fix: run _check_access (read mode) on each candidate and skip on PermissionError. (This is also where the ripgrep discussion below lands — a ripgrep-backed implementation would re-do this traversal anyway, so it's worth solving them together.)

B2 — recoverable tool errors abort the run instead of letting the model retry

Every tool raises bare PermissionError / FileNotFoundError / ValueError / NotADirectoryError. pyai core only turns ModelRetry (and ValidationError) into a retry prompt the model can recover from — see _tool_manager.py, where only SkipToolExecution / CallDeferred / ApprovalRequired / ModelRetry are special-cased and everything else propagates and ends the run.

So "file not found", "old_text not found", "command denied", "invalid regex" — all conditions the model should be able to correct on its own — currently crash the agent. The new code raises ModelRetry nowhere. Fix: raise ModelRetry for model-correctable conditions; reserve raw exceptions for genuine internal failures.

B3 — mutable run-state on a shared toolset, no for_run isolation

This is your "are toolsets supposed to be shareable?" question — and the answer from core is: yes, by default they're shared, and that is exactly why core gives you for_run to opt out when you hold mutable state. Quoting the base contract (AbstractToolset.for_run, pydantic-ai v1.95.1):

Called once per run, before __aenter__. Override this to return a fresh instance for per-run state isolation. Default: return self (shared across runs).

Source: https://github.com/pydantic/pydantic-ai/blob/v1.95.1/pydantic_ai_slim/pydantic_ai/toolsets/abstract.py#L110-L116

So a stateless toolset is fine to share — you're right about that. The issue is specific: get_toolset is called once at agent construction, so a single ShellToolset instance is reused across every run of that agent, and it carries mutable per-run state — self._background (the background-process dict) and self._cwd (persistent cwd). Two concurrent runs of the same agent will corrupt each other: run A's cd moves run B's cwd, and run A's __aexit__ terminates run B's background processes.

The in-repo precedent is our own exemplar: CodeModeToolset.for_run returns replace(self, …) precisely to give each run a fresh instance with reset state, and documents why. The two new toolsets override neither for_run nor for_run_step. Fix: implement for_run to return a fresh instance with an empty _background and reset _cwd. (This also makes the otherwise well-written __aexit__ cleanup sound — __aexit__ fires per run, so on a shared instance it's tearing down state that may belong to another run.)

B4 — persist_cwd sentinel is spoofable / silently self-disabling

I know you flagged this thread as "may be dubious… leave for now." I think it's a real correctness footgun, less as a security hole and more as a silent-wrong-behavior one:

  • _wrap_command_for_cwd skips wrapping whenever ';' in command. That substring test also fires on a ; inside a quoted string (echo "a;b"), so cwd tracking silently turns off for perfectly benign commands.
  • _extract_cwd_from_output parses a fixed literal __HARNESS_PWD__ out of stdout. cd /tmp; echo __HARNESS_PWD__/etc runs unwrapped (because of the ; skip), exits 0, and the parser moves the persistent cwd to /etc. Any command that emits that literal can steer subsequent commands' working directory.

The exploit window for malicious redirection is narrow (the genuine appended sentinel usually wins via rfind), so I wouldn't sell this as a security hole — but the silent-disable and the spoof-on-;-skip paths are real wrong behavior. Fix: write pwd to an out-of-band channel (a temp file the wrapper writes, read by the toolset) instead of parsing it out of stdout.


Unresolved review threads — are they blockers?

All 14 are still open. Breakdown:

  • 6 are outdated (already addressed by the "non-controversial feedback" commit but never marked resolved): calculate during init, _in_pattern_list helper, split-lines, throw-if-dir-missing, the "best-effort, not a security boundary" doc, and the line-259 comment. Not blockers — just worth resolving so the PR stops looking like it has 14 live objections.
  • ripgrep (raised by both reviewers, on search_files and list_directory): "recursive listing extremely slow on large projects." Effectively a blocker given two independent asks — either adopt ripgrep or get explicit agreement to defer. Ties into B1.
  • "make directory protection optional" (_capability.py root_dir): protection is already opt-out via an empty protected_patterns — worth clarifying whether you want protections off by default (opt-in) instead. Design call.
  • "bad claude" (the __post_init__ getattr-loop validation): trivial — replace the reflective loop with explicit validation.
  • "constants in docstrings aren't helpful" (read_file docstring default: max_read_lines): doc nit.
  • pyproject.toml:154 "Remove": ties to D3 below (strip mutmut config from this PR).
  • the cwd thread: see B4.

No new surprises hiding in the threads — ripgrep and the cwd sentinel are the two substantive open ones.


Repo-convention drift

  • D1 — no README.md. code_mode and the sibling capability PRs (feat: input and output guardrails (block, redact, retry) #249, Add StepPersistence capability for step-event durability across delegates #251) all ship one, and agent_docs/index.md, capability-authoring.md, and review-checklist.md all require it. Skeleton to match: # Title → The problem → The solution → Usage (runnable Agent(...) example) → feature sections → API. Shell's README must document the safety model.
  • D2 — AbstractCapability[Any] instead of [AgentDepsT]. Base is Generic[AgentDepsT]; CodeMode, InputGuard, StepPersistence all stay generic, and the repo bans Any in public signatures. These tools don't touch deps, so it's a pure type-hygiene fix: FileSystem(AbstractCapability[AgentDepsT]), FunctionToolset[AgentDepsT], return AgentToolset[AgentDepsT]. (Also Shell.get_toolset is annotated | None but always returns a toolset — tighten it.)
  • D3 — per-PR mutmut config in repo-wide pyproject.toml. Adds the mutmut dev-dep + a [tool.mutmut] block hardcoding this PR's two files + mutants exclusions + scripts/run-mutmut.sh + docs/mutation-testing.md. No prior capability PR carries this, and it bumps against the "don't edit pyproject.toml/uv.lock, open an issue for deps" rule. Suggest stripping it from the PR (keep the 89.7% result in the PR description) and landing mutation testing repo-wide separately if wanted.
  • D4 — PR body tool names wrong. Body says list_files / make_dir; code registers list_directory / create_directory. Pick the public names deliberately and make code + body + README agree.
  • D5 — class docstrings lack the runnable python usage example code_mode has, and the root README.md capability table still shows File system / Shell as :construction: linking to the old PR Add FileSystem and Shell capabilities #177.

pyai-idiomatic API review (is this reinventing core?)

Checked against core source:

  • Not reinventing implementations — core has no filesystem or shell tools (common_tools/ is web-only), no FS/process sandbox. Harness is the right home. ✅
  • Missing HITL/approval composition — core ships requires_approval=True, ApprovalRequiredToolset / .approval_required(func), ApprovalRequired / CallDeferred, and the HandleDeferredToolCalls capability. Hard-PermissionError-denying destructive ops (rm, writes to protected paths) is exactly what that flow exists for, and code_mode already integrates with it. This is the highest-leverage design improvement, but it's a bigger conversation — flagging, not asking you to add it silently.
  • Subclassing FunctionToolset + add_function(self.method, …) for static config is fine and supported; the only problem is the mutable state (→ B3).

Security (beyond B1/B4)

  • Patterns match the agent-supplied string, not the canonical path (_check_access): config/./secret.txt evades denied=['config/secret.txt'], and fnmatch doesn't implement **, so the default **/secrets* does not protect a root-level secrets.yaml. Match against the resolved relative path.
  • _real_root is captured once at construction, so a parent-dir symlink swap can defeat containment (precondition: write access to a parent of root).
  • ReDoS / event-loop block in search_files (unbounded agent-supplied regex, no timeout) and whole-file read_bytes before any size guard (no max-file-size knob).
  • The shell denylist/allowlist bypasses (tokens[0]-only, /bin/rm, bash -c, $(...), substring operator matching) are inherent and correctly disclaimed — not defects. One ask: lift that "not a security boundary" disclaimer from the private _check_command up into the public Shell dataclass docstring, since denied_commands / denied_operators read as protective there.

Missing knobs / extensibility (opinion — not asking to expand scope here, just flagging what may gate adoption)

  1. Read-only / per-tool selection on FileSystem. Today it's all 8 tools or nothing. "Give the agent read-only repo access" is one of the most common asks and is currently impossible without subclassing. A read_only: bool or a tool-subset selector (mirroring CodeMode.tools) is small and high-value.
  2. deps-scoped root_dir / cwd. Root/cwd are frozen at construction and tools take no RunContext, so you can't scope the sandbox per-run/per-user (multi-tenant). Fixing D2 + B3 is the prerequisite; deriving root from ctx.deps would unlock it. Worth at least not foreclosing.
  3. HITL approval hook (above) — for shell especially, "destructive op → ask a human" is the difference between a demo and something teams deploy.

Lower-priority, worth tracking in issues rather than this PR: max_file_size, env-var control for Shell (inject/scrub secrets from the subprocess env), configurable encoding, and the ripgrep backend.


Happy to implement B1–B4 + D2 and draft the two READMEs if you want — or to open follow-up issues for read-only mode, deps-scoped roots, HITL approval, and ripgrep so they're tracked without bloating this PR.

Comment thread pydantic_ai_harness/shell/_toolset.py Outdated
strawgate and others added 4 commits June 1, 2026 13:39
list_directory, search_files, and find_files previously only checked the
root path passed to the tool, so a recursive rglob could return or read
through files that the agent would otherwise be denied direct access to.

Add _is_accessible(rel, write=True) as a predicate form of _check_access
and call it on each entry walked by the three recursive operations, so
denied and protected patterns hide children the same way they block
direct read/write. Reads of protected files remain allowed in isolation;
the listing operations pass write=True to make existence match the
read-side policy that protected paths can't be opened.
…toolset docs

A file pattern like `src/*.py` in `allowed_patterns` is a per-file rule, but
the walkers gated their *root* directory on it too — so `list_directory('.')`,
`search_files('.', ...)` and `find_files('*')` always raised, because a
directory root never matches a file pattern. The toolset was effectively
unusable whenever an allowlist was set.

Walkers now skip the allowlist gate on their root (deny/protected patterns
still apply) and filter each entry instead, matching how the dotfile and
deny/protected filters already behave. The two stubbed tests that documented
this limitation are now real end-to-end assertions.

Also add README docs for the filesystem and shell toolsets and trim the
mutation-testing doc down to the durable how-to-run guidance, dropping the
frozen, already-stale result snapshot.
list_directory surfaced dot-entries (.git/, .env, .hidden) while search_files
and find_files deliberately skip any dot-prefixed path component, so the three
walkers disagreed on what exists and a listing leaked .git/. Apply the same
dotfile skip to list_directory; this also makes the existing
test_list_hides_protected_entries comment ("hidden by the dotfile filter")
true.
Truncation dropped everything past max_output_chars from the start, discarding
exactly the part that matters most — the end of stdout, the [stderr] section
(appended last), and exit info — which is where errors and stack traces land.
Keep the final max_output_chars instead and drop the head. This also removes
the convoluted stderr-budget branch, since the tail naturally preserves stderr.

Addresses review feedback on PR #260.
Comment thread pydantic_ai_harness/shell/_toolset.py Outdated
Comment on lines +153 to +161
def _wrap_command_for_cwd(self, command: str) -> str:
"""Append pwd sentinel to command for cwd tracking.

Commands containing ';' are returned unwrapped because the separator
breaks the '&&' success-gating of the sentinel echo.
"""
if ';' in command:
return command
return f'{command} && echo {_PWD_SENTINEL}$(pwd)'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude here: this is the B4 footgun. Two concrete problems with parsing the cwd out of stdout:

1. ; silently disables cwd tracking. Any command containing ; is returned unwrapped, so a real cd is dropped on the floor:

run_command('cd subdir ; true')   # persist_cwd=True
run_command('pwd')                # -> still the OLD dir; the cd never persisted

The substring test also misfires on a ; inside a quoted string, disabling tracking for benign commands.

2. The sentinel is spoofable through that same ;-skip path. A command whose stdout contains the literal __HARNESS_PWD__<dir> moves the tracked cwd with no real cd:

run_command('true ; echo __HARNESS_PWD__/some/existing/dir')
# -> self._cwd is now /some/existing/dir; the next command runs there

Regression tests proving both (they fail on this revision, pass after the fix):

async def test_cd_persists_even_with_semicolon(self, persist_toolset):
    await persist_toolset.run_command('cd subdir ; true')
    assert 'subdir' in await persist_toolset.run_command('pwd')

async def test_output_cannot_spoof_cwd(self, persist_toolset, shell_dir):
    await persist_toolset.run_command(f'true ; echo __HARNESS_PWD__{shell_dir / "subdir"}')
    assert persist_toolset._cwd == shell_dir   # no real cd happened

Fix (implemented locally): drop the stdout sentinel entirely and capture pwd out-of-band into a private temp file the agent's command can't address — {command}\n__harness_ec=$?\npwd > <random_tempfile>\nexit $__harness_ec — then read the cwd from that file. No ;-skip, nothing to spoof. Happy to push it onto the branch or hand you the diff, whichever you prefer.

…dening, generic typing

Address blockers from PR review so the capabilities are robust under real agent
use, not just under TestModel:

- B2: tool methods raised native exceptions that pyai propagates and aborts the
  run on. A `_recoverable` decorator now surfaces model-correctable errors
  (missing file, denied path, stale edit, denied command) as `ModelRetry` so the
  agent can self-correct. Internal helpers keep raising native exceptions.
- B3: ShellToolset held mutable per-run state (`_cwd`, `_background`) on the
  single instance `get_toolset` builds at construction, so concurrent runs
  corrupted each other. `for_run` now returns a fresh copy, matching the
  CodeModeToolset exemplar.
- B4: persist_cwd parsed cwd from stdout, which a `;` could silently disable and
  command output could spoof. Capture `pwd` out-of-band via a private temp file
  instead.
- Sec#3: pattern matching ran against the agent-supplied string, letting
  `config/./secret.txt` evade a deny rule, and `**/secrets*` missed root-level
  files (leaking them via search). Match the canonical path; treat a leading
  `**/` as covering the root.
- D2: parametrize FileSystem/Shell and their toolsets on `AgentDepsT` instead of
  `Any`, matching the rest of the library.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dsfaccini dsfaccini merged commit e1654cc into main Jun 2, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants