feat: add FileSystem and Shell capabilities#260
Conversation
- FileSystemToolset: 8 tools (read, write, edit, list, search, find, mkdir, info) with path-traversal prevention, allow/deny patterns, optimistic concurrency - ShellToolset: 1 tool (run_command) with command validation, timeout handling, and async subprocess execution via anyio - 152 tests passing on asyncio backend, 100% branch coverage - Mutation testing: 524/584 killed (89.7%), 60 equivalent mutants documented - All survivors proven equivalent (trampoline defaults, encoding case-insensitivity, unreachable except blocks, dead branches, name=None fallback behavior)
strawgate
left a comment
There was a problem hiding this comment.
have not run through the tests yet
| continue | ||
| if _is_binary(raw): | ||
| continue | ||
| text = raw.decode('utf-8', errors='replace') |
There was a problem hiding this comment.
It might be worth seeing if we want to shell out to ripgrep. I have a library https://github.com/strawgate/rpygrep that manages some of that but we can probably find another alternative.
- Remove bool guard from __post_init__ validation (pedantic AI slop) - Pre-calculate _real_root in __init__ instead of per-call - Extract _first_matching_pattern helper, use in _check_access - Extract _first_denied_operator helper, use in _check_command - Change _format_lines signature from list[str] to Sequence[str] and split lines once at call site in read_file - write_file: raise FileNotFoundError for missing parents instead of auto-creating - list_directory: add pragma: no cover comment on OSError branch - _check_command docstring: document best-effort security boundary - Remove debug = true from [tool.mutmut] - Update tests to match: remove test_bool_max_read_lines_rejected, change test_write_creates_parents -> test_write_nonexistent_parent_raises, add isolation tests for _first_matching_pattern and _first_denied_operator
| raise NotADirectoryError(f'Not a directory: {path}') | ||
|
|
||
| entries: list[str] = [] | ||
| for entry in sorted(resolved.iterdir()): |
There was a problem hiding this comment.
Recursive listing would be extremely slow on large projects. I would recommend adding ripgrep as a hard dependency (even for listing files)
If you want to see Python harness tooling that is battle tested (millions of uses per day) please look at https://github.com/mpfaffenberger/code_puppy
| continue | ||
| if _is_binary(raw): | ||
| continue | ||
| text = raw.decode('utf-8', errors='replace') |
mutmut pulls in a large dependency tree and is only used to validate test quality, not for normal development or CI. Keep its config in pyproject.toml (mutmut v3 has no CLI flag to override the config path) but install it ephemerally via 'uv run --with' from a small script. The [tool.mutmut] block is unchanged: paths_to_mutate, tests_dir, also_copy, and pytest_add_cli_args still live in pyproject.toml. Only the dev dependency is removed. docs/mutation-testing.md is updated to reference the new script.
strawgate flagged 'getattr(self, name)' in __post_init__ as 'bad claude'. The runtime isinstance validation is still useful (dataclass field annotations are advisory, not enforced) but the getattr is unnecessary and obscures the intent. Iterate fields directly via a typed dict so pyright doesn't narrow the isinstance check away.
strawgate flagged that 'default: max_read_lines' in a docstring is unhelpful — readers can't tell what the actual default is without scrolling to the class definition. Replace with the literal value in both filesystem and shell toolsets.
|
Claude here: full review of #260 across correctness, security, repo conventions, pyai-idiomatic API usage, and extensibility. CI is green (all 18 checks), but a few things the green build can't catch are worth resolving before merge. Delegated parts of this to subagents (a security pass, a convention-drift pass) and to a fresh Claude session running inside the I've tried to be concrete and to back the load-bearing claims with sources. Two of these (B3, B4) I know you're skeptical of — I've put the strongest evidence I have on those. Blockers I'd want fixed before mergeB1 —
|
list_directory, search_files, and find_files previously only checked the root path passed to the tool, so a recursive rglob could return or read through files that the agent would otherwise be denied direct access to. Add _is_accessible(rel, write=True) as a predicate form of _check_access and call it on each entry walked by the three recursive operations, so denied and protected patterns hide children the same way they block direct read/write. Reads of protected files remain allowed in isolation; the listing operations pass write=True to make existence match the read-side policy that protected paths can't be opened.
…toolset docs
A file pattern like `src/*.py` in `allowed_patterns` is a per-file rule, but
the walkers gated their *root* directory on it too — so `list_directory('.')`,
`search_files('.', ...)` and `find_files('*')` always raised, because a
directory root never matches a file pattern. The toolset was effectively
unusable whenever an allowlist was set.
Walkers now skip the allowlist gate on their root (deny/protected patterns
still apply) and filter each entry instead, matching how the dotfile and
deny/protected filters already behave. The two stubbed tests that documented
this limitation are now real end-to-end assertions.
Also add README docs for the filesystem and shell toolsets and trim the
mutation-testing doc down to the durable how-to-run guidance, dropping the
frozen, already-stale result snapshot.
list_directory surfaced dot-entries (.git/, .env, .hidden) while search_files
and find_files deliberately skip any dot-prefixed path component, so the three
walkers disagreed on what exists and a listing leaked .git/. Apply the same
dotfile skip to list_directory; this also makes the existing
test_list_hides_protected_entries comment ("hidden by the dotfile filter")
true.
Truncation dropped everything past max_output_chars from the start, discarding exactly the part that matters most — the end of stdout, the [stderr] section (appended last), and exit info — which is where errors and stack traces land. Keep the final max_output_chars instead and drop the head. This also removes the convoluted stderr-budget branch, since the tail naturally preserves stderr. Addresses review feedback on PR #260.
| def _wrap_command_for_cwd(self, command: str) -> str: | ||
| """Append pwd sentinel to command for cwd tracking. | ||
|
|
||
| Commands containing ';' are returned unwrapped because the separator | ||
| breaks the '&&' success-gating of the sentinel echo. | ||
| """ | ||
| if ';' in command: | ||
| return command | ||
| return f'{command} && echo {_PWD_SENTINEL}$(pwd)' |
There was a problem hiding this comment.
Claude here: this is the B4 footgun. Two concrete problems with parsing the cwd out of stdout:
1. ; silently disables cwd tracking. Any command containing ; is returned unwrapped, so a real cd is dropped on the floor:
run_command('cd subdir ; true') # persist_cwd=True
run_command('pwd') # -> still the OLD dir; the cd never persisted
The substring test also misfires on a ; inside a quoted string, disabling tracking for benign commands.
2. The sentinel is spoofable through that same ;-skip path. A command whose stdout contains the literal __HARNESS_PWD__<dir> moves the tracked cwd with no real cd:
run_command('true ; echo __HARNESS_PWD__/some/existing/dir')
# -> self._cwd is now /some/existing/dir; the next command runs there
Regression tests proving both (they fail on this revision, pass after the fix):
async def test_cd_persists_even_with_semicolon(self, persist_toolset):
await persist_toolset.run_command('cd subdir ; true')
assert 'subdir' in await persist_toolset.run_command('pwd')
async def test_output_cannot_spoof_cwd(self, persist_toolset, shell_dir):
await persist_toolset.run_command(f'true ; echo __HARNESS_PWD__{shell_dir / "subdir"}')
assert persist_toolset._cwd == shell_dir # no real cd happenedFix (implemented locally): drop the stdout sentinel entirely and capture pwd out-of-band into a private temp file the agent's command can't address — {command}\n__harness_ec=$?\npwd > <random_tempfile>\nexit $__harness_ec — then read the cwd from that file. No ;-skip, nothing to spoof. Happy to push it onto the branch or hand you the diff, whichever you prefer.
…dening, generic typing Address blockers from PR review so the capabilities are robust under real agent use, not just under TestModel: - B2: tool methods raised native exceptions that pyai propagates and aborts the run on. A `_recoverable` decorator now surfaces model-correctable errors (missing file, denied path, stale edit, denied command) as `ModelRetry` so the agent can self-correct. Internal helpers keep raising native exceptions. - B3: ShellToolset held mutable per-run state (`_cwd`, `_background`) on the single instance `get_toolset` builds at construction, so concurrent runs corrupted each other. `for_run` now returns a fresh copy, matching the CodeModeToolset exemplar. - B4: persist_cwd parsed cwd from stdout, which a `;` could silently disable and command output could spoof. Capture `pwd` out-of-band via a private temp file instead. - Sec#3: pattern matching ran against the agent-supplied string, letting `config/./secret.txt` evade a deny rule, and `**/secrets*` missed root-level files (leaking them via search). Match the canonical path; treat a leading `**/` as covering the root. - D2: parametrize FileSystem/Shell and their toolsets on `AgentDepsT` instead of `Any`, matching the rest of the library. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds two new batteries-included capabilities built on the
AbstractCapability/FunctionToolsetprimitives.FileSystemSandboxed file-system access scoped to a root directory.
**/also covers the root).git/,.env,*.pem,*.key, secretsread_file,write_file,edit_file(hash-based optimistic concurrency),list_directory,search_files,find_files,create_directory,file_info__post_init__validates that integer limit fields (max_read_lines,max_search_results,max_find_results) are positive non-bool intsShellConfigurable command execution for agents.
>,>>,|)run_command) and background (start_command/check_command/stop_command) executionpersist_cwdsocdis sticky across calls — the working directory is captured out-of-band (a private temp file), so command output can't spoof it__aexit__terminates all remaining background processes — the agent runtime enters toolsets viaAsyncExitStack, so cleanup is guaranteed on run end (success or error)for_runreturns a fresh toolset per run, so persistent cwd and background processes are isolated across concurrent runsRobustness
ModelRetryso the agent can self-correct instead of aborting the runAgentDepsTTest coverage
fail_under = 100)docs/mutation-testing.md)Validation
All pass.
Follow-ups (tracked separately, out of scope here)
max_file_sizeguard + search timeout (ReDoS / OOM hardening)