1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -25,3 +25,4 @@ wheels/
# Hypothesis
.hypothesis/
.vscode/
mutants/
4 changes: 2 additions & 2 deletions README.md
@@ -103,8 +103,8 @@ We studied leading coding agents, agent frameworks, and Claw-style assistants to
|---|---|---|---|---|
| **Tools & execution** | **Code mode** | Sandboxed Python execution via [Monty](https://github.com/pydantic/monty) -- one `run_code` call replaces N tool calls | :white_check_mark: [Docs](pydantic_ai_harness/code_mode/) | |
| | **Tool search** | Progressive tool discovery for large tool sets | :white_check_mark: [Pydantic AI](https://pydantic.dev/docs/ai/tools-toolsets/toolsets/#deferred-loading) | |
| | **File system** | Read, write, edit, search files with path traversal prevention | :construction: [PR #177](https://github.com/pydantic/pydantic-ai-harness/pull/177) | [pydantic-ai-backend](https://github.com/vstorm-co/pydantic-ai-backend) (vstorm‑co) |
| | **Shell** | Execute commands with allowlists, denylists, and timeouts | :construction: [PR #177](https://github.com/pydantic/pydantic-ai-harness/pull/177) | [pydantic-ai-backend](https://github.com/vstorm-co/pydantic-ai-backend) (vstorm‑co) |
| | **File system** | Read, write, edit, search files with path traversal prevention | :white_check_mark: [Docs](pydantic_ai_harness/filesystem/) | [pydantic-ai-backend](https://github.com/vstorm-co/pydantic-ai-backend) (vstorm‑co) |
| | **Shell** | Execute commands with allowlists, denylists, and timeouts | :white_check_mark: [Docs](pydantic_ai_harness/shell/) | [pydantic-ai-backend](https://github.com/vstorm-co/pydantic-ai-backend) (vstorm‑co) |
| | **Repo context injection** | Auto-load CLAUDE.md/AGENTS.md and repo structure | :construction: [PR #175](https://github.com/pydantic/pydantic-ai-harness/pull/175) | [pydantic-deep](https://github.com/vstorm-co/pydantic-deepagents) (vstorm‑co) |
| | **Verification loop** | Run tests after edits, auto-fix failures | :construction: [PR #169](https://github.com/pydantic/pydantic-ai-harness/pull/169) | |
| **Context management** | **Sliding window** | Trim conversation history to stay within token limits | :construction: [PR #191](https://github.com/pydantic/pydantic-ai-harness/pull/191) | [summarization-pydantic-ai](https://github.com/vstorm-co/summarization-pydantic-ai) (vstorm‑co) |
47 changes: 47 additions & 0 deletions docs/mutation-testing.md
@@ -0,0 +1,47 @@
# Mutation Testing

Mutation testing complements the 100% branch-coverage requirement: coverage
shows that every line and branch runs, while mutation testing shows that the
assertions actually pin the behavior down.

Covers `pydantic_ai_harness/filesystem/_toolset.py` and
`pydantic_ai_harness/shell/_toolset.py`.

Run with [mutmut](https://mutmut.readthedocs.io/) v3 via `scripts/run-mutmut.sh`,
which installs mutmut ephemerally with `uv run --with` — no dev dependency
required.

```bash
scripts/run-mutmut.sh run --max-children 1
scripts/run-mutmut.sh results
scripts/run-mutmut.sh show <mutant-name>
```

## Interpreting survivors

A surviving mutant is either a missing test or an equivalent mutant — a change
that produces behavior no test could distinguish from the original. Triage each
survivor; the recurring equivalent-mutant categories in this codebase are:

- **Trampoline default params** — mutmut v3 wraps functions, and the wrapper
keeps the original defaults, so a mutated default is never observed.
- **Omitted `name=` in `add_function()`** — pydantic-ai falls back to
`method.__name__`, which equals the explicit name being mutated away.
- **`'utf-8'` encoding mutations** — Python's codec lookup is case-insensitive
and UTF-8 is the default text encoding, so case/omission changes are no-ops.
- **`errors='replace'` mutations** — exercised only by invalid bytes; valid
UTF-8 test data never invokes the error handler.
- **Unreachable `except` blocks** (marked `pragma: no cover`) — paths that
can't be triggered in the test environment.
- **`CancelScope(shield=True)` flips** — require an outer cancellation during
the near-instant cleanup window.

Anything outside these categories should be treated as a real gap and killed
with a new test.
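
The two encoding categories above can be checked directly in an interpreter. The following is a minimal sketch using only the standard library; the byte strings are made-up sample data, not taken from the project's test suite, and it covers `bytes.decode` only (other APIs that accept an `encoding` argument may have different defaults).

```python
import codecs

# Codec names are normalized before lookup, so 'UTF-8' and 'utf-8'
# resolve to the same codec.
same_codec = codecs.lookup('UTF-8').name == codecs.lookup('utf-8').name

valid = 'héllo'.encode('utf-8')

# bytes.decode() defaults to UTF-8, so omitting the argument changes nothing.
same_default = valid.decode() == valid.decode('utf-8') == valid.decode('UTF-8')

# With valid input the error handler is never consulted, so 'strict' and
# 'replace' give the same result and a mutated handler goes unnoticed.
same_on_valid = valid.decode('utf-8', errors='strict') == valid.decode('utf-8', errors='replace')

# Only invalid bytes tell the two handlers apart.
invalid = b'caf\xff'
replaced = invalid.decode('utf-8', errors='replace')
try:
    invalid.decode('utf-8', errors='strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

A test that feeds invalid bytes like `invalid` through the code under test would turn an `errors='replace'` survivor into a killable mutant.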

## Limitations

Trio-parametrized tests are excluded during mutation testing (`-k 'not trio'`
in `pyproject.toml [tool.mutmut]`) because trio segfaults in mutmut's
subprocess environment on Python 3.14 / macOS. The kill rate is unaffected —
the trio tests exercise the same code paths as the asyncio tests.
14 changes: 12 additions & 2 deletions pydantic_ai_harness/__init__.py
@@ -1,16 +1,26 @@
"""The batteries for your Pydantic AI agent -- the official capability library."""
"""Pydantic AI capability library."""

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .code_mode import CodeMode
    from .filesystem import FileSystem
    from .shell import Shell

__all__ = ['CodeMode']
__all__ = ['CodeMode', 'FileSystem', 'Shell']


def __getattr__(name: str) -> object:
    if name == 'CodeMode':
        from .code_mode import CodeMode

        return CodeMode
    elif name == 'FileSystem':
        from .filesystem import FileSystem

        return FileSystem
    elif name == 'Shell':
        from .shell import Shell

        return Shell
    raise AttributeError(f'module {__name__!r} has no attribute {name!r}')
136 changes: 136 additions & 0 deletions pydantic_ai_harness/filesystem/README.md
@@ -0,0 +1,136 @@
# FileSystem

Give an agent sandboxed, pattern-filtered access to a directory tree.

## The problem

Letting an agent touch the filesystem directly is risky: path traversal
(`../../etc/passwd`), symlinks that escape the project, clobbering `.git`, or
leaking `.env` secrets. Hand-rolling the guards around every tool call is
repetitive and easy to get subtly wrong.

## The solution

`FileSystem` exposes a fixed set of file tools, all scoped to a single
`root_dir`. Every path is resolved and containment-checked (symlinks included)
before any I/O, and access is filtered through allow / deny / protected glob
patterns.

```python
from pydantic_ai import Agent
from pydantic_ai_harness import FileSystem

agent = Agent(
    'anthropic:claude-sonnet-4-6',
    capabilities=[FileSystem(root_dir='./workspace')],
)

result = agent.run_sync('Read config.toml and tell me the package name.')
print(result.output)
```

## Tools

| Tool | Purpose |
|---|---|
| `read_file` | Read a text file with line numbers and a content hash. Binary files are detected and not dumped. |
| `write_file` | Create or overwrite a file. Optional `expected_hash` rejects stale writes (optimistic concurrency). |
| `edit_file` | Exact-string replacement; `old_text` must match exactly once. Optional `expected_hash`. |
| `list_directory` | List a directory's entries with type indicators and sizes. |
| `search_files` | Regex search over file contents, optionally narrowed by an `include_glob`. |
| `find_files` | Glob search over file names (e.g. `*.py`, `**/*.json`). |
| `create_directory` | Create a directory and any missing parents. |
| `file_info` | Metadata for a file or directory (size, type, line count, hash, symlink target). |
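
The `expected_hash` behavior of `write_file` and `edit_file` can be illustrated with a small standalone sketch. The helpers `content_hash` and `write_if_unchanged` are hypothetical names, and the use of SHA-256 is an assumption; the toolset's actual hash function and error type are not shown here.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Hypothetical helper: hex digest of file contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def write_if_unchanged(path: Path, new_text: str, expected_hash: str | None = None) -> str:
    """Write new_text, refusing if the file changed since the caller last read it."""
    if expected_hash is not None and path.exists():
        if content_hash(path.read_bytes()) != expected_hash:
            raise ValueError('file changed since last read; re-read before writing')
    data = new_text.encode('utf-8')
    path.write_bytes(data)
    return content_hash(data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'config.toml'
    target.write_bytes(b'name = "old"\n')
    seen = content_hash(target.read_bytes())  # hash the agent got from its read

    # Another writer changes the file after the agent's read.
    target.write_bytes(b'name = "other"\n')

    try:
        write_if_unchanged(target, 'name = "new"\n', expected_hash=seen)
        stale_rejected = False
    except ValueError:
        stale_rejected = True

    # After re-reading, the hash matches and the write goes through.
    fresh = content_hash(target.read_bytes())
    write_if_unchanged(target, 'name = "new"\n', expected_hash=fresh)
    final_bytes = target.read_bytes()
```

The check and the write are two separate steps in this sketch, so it detects stale reads by a single cooperative agent rather than guarding against concurrent writers.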

## Security model

- **Containment.** Paths resolve relative to `root_dir`; anything resolving
outside (via `..`, an absolute path, or a symlink) is rejected. Symlinks
are resolved with `os.path.realpath` *before* the containment check, so the
check applies to the real target rather than the link path, which narrows
the TOCTOU window.
- **Binary detection.** `read_file` returns a placeholder instead of dumping
binary bytes into the model context.
- **Optimistic concurrency.** `write_file`/`edit_file` accept an
`expected_hash` so an agent operating on a stale read is told to re-read
rather than silently overwriting newer content.
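
The containment rule above can be sketched with the standard library alone. `resolve_inside` is a hypothetical helper written for illustration, not the toolset's implementation; it assumes a POSIX system where creating symlinks is permitted.

```python
import os
import tempfile


def resolve_inside(root_dir: str, user_path: str) -> str:
    """Hypothetical helper: resolve user_path under root_dir or raise PermissionError."""
    root = os.path.realpath(root_dir)
    # os.path.join discards root when user_path is absolute, so an absolute
    # path is resolved as given and then fails the containment check below.
    candidate = os.path.realpath(os.path.join(root, user_path))
    if os.path.commonpath([root, candidate]) != root:
        raise PermissionError(f'{user_path!r} resolves outside the root')
    return candidate


def is_rejected(root_dir: str, user_path: str) -> bool:
    try:
        resolve_inside(root_dir, user_path)
    except PermissionError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, 'workspace')
    os.makedirs(os.path.join(root, 'src'))
    outside = os.path.join(tmp, 'outside.txt')
    with open(outside, 'w', encoding='utf-8') as fh:
        fh.write('secret\n')
    # A symlink that lives inside the root but points outside it.
    os.symlink(outside, os.path.join(root, 'link.txt'))

    rejected = {
        'inside': is_rejected(root, 'src/main.py'),
        'dotdot': is_rejected(root, '../outside.txt'),
        'absolute': is_rejected(root, outside),
        'symlink': is_rejected(root, 'link.txt'),
    }
```

Comparing with `os.path.commonpath` rather than a string prefix avoids treating a sibling such as `workspace-old` as being inside `workspace`.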

## Pattern filtering

Three independent glob lists control access. Patterns are matched with
`fnmatch`, whose `*` spans `/`, so `*.py` matches `src/main.py` and you rarely
need `**`.
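
The matching behavior is easy to confirm with the standard-library `fnmatch` function. The paths below are illustrative, and the sketch matches the bare relative path against each pattern; whether the toolset normalizes paths or tries additional forms before matching is not shown here.

```python
from fnmatch import fnmatch

# (path, pattern) pairs; fnmatch treats the path as a plain string,
# so '*' also matches '/' characters.
cases = [
    ('src/main.py', '*.py'),                 # nested file matches a bare '*.py'
    ('src/pkg/mod.py', 'src/*.py'),          # 'src/*.py' also reaches deeper levels
    ('.git/config', '.git/*'),               # everything under .git/
    ('config/secrets.yaml', '**/secrets*'),  # '**' behaves like a single '*'
    ('secrets.yaml', '**/secrets*'),         # no '/' in the path, so no match
    ('README.md', '*.py'),                   # different extension
]

results = [fnmatch(path, pattern) for path, pattern in cases]
```

Under plain `fnmatch`, a pattern that contains a literal `/` only matches paths that also contain one, which is why the top-level `secrets.yaml` case differs from the nested one.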

| Field | Effect |
|---|---|
| `allowed_patterns` | If non-empty, only matching paths are accessible (allowlist). |
| `denied_patterns` | Matching paths are always rejected (denylist). |
| `protected_patterns` | Matching paths are read-only — reads succeed, writes are rejected. |

`protected_patterns` defaults to `.git/*`, `.env`, `.env.*`, `*.pem`, `*.key`,
and `**/secrets*`. Pass an empty list to disable protection.

### Direct access vs. walkers

The three rules apply at two different granularities:

- **Direct access** (`read_file`, `write_file`, `edit_file`, `file_info`,
`create_directory`) gates the operation's target path. You must name a path
that the patterns permit.
- **Walkers** (`list_directory`, `search_files`, `find_files`) gate their root
by deny/protected patterns, but **not** by `allowed_patterns` — a directory
root like `.` never matches a file pattern such as `src/*.py`, so requiring
it to would make every listing fail. Instead, the root is always walked and
each **entry** is filtered against all three lists. A directory listing can
never surface a path the agent couldn't otherwise read or write.

So with `allowed_patterns=['*.py']`, `list_directory('.')` succeeds and shows
only the `.py` entries; `read_file('notes.md')` is rejected.

> Dotfiles and dot-directories (`.git`, `.env`, `.github`, …) are skipped by
> all three walkers — `list_directory`, `search_files`, and `find_files` —
> regardless of patterns.
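
The difference between gating a target and filtering entries can be sketched as follows. `readable` and `list_entries` are hypothetical functions, not the toolset's API, and the sketch models only `allowed_patterns` and `denied_patterns`; how `protected_patterns` affects listings is left out as an explicit simplification.

```python
from fnmatch import fnmatch


def readable(path: str, allowed: list[str], denied: list[str]) -> bool:
    """Hypothetical per-path gate: deny wins, then the allowlist applies if non-empty."""
    if any(fnmatch(path, pattern) for pattern in denied):
        return False
    return not allowed or any(fnmatch(path, pattern) for pattern in allowed)


def list_entries(entries: list[str], allowed: list[str], denied: list[str]) -> list[str]:
    """Hypothetical walker: the root is not allowlist-checked, but each entry is."""
    visible = []
    for entry in entries:
        if entry.startswith('.'):
            continue  # dotfiles and dot-directories are always skipped
        if readable(entry, allowed, denied):
            visible.append(entry)
    return visible


entries = ['main.py', 'notes.md', '.env', 'setup.py', 'data.csv']
allowed = ['*.py']

# The listing of '.' succeeds even though '.' itself does not match '*.py'.
listing = list_entries(entries, allowed, denied=[])

# A direct read of a non-matching path is still rejected.
can_read_notes = readable('notes.md', allowed, denied=[])
```

Because the walker applies the same gate to each entry that a direct read would, the listing never reveals a name the agent could not also open.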

## Configuration

```python
FileSystem(
    root_dir='.',                 # str | Path: sandbox root
    allowed_patterns=[],          # allowlist globs (empty = allow all)
    denied_patterns=[],           # denylist globs
    protected_patterns=[...],     # read-only globs (defaults to secrets/.git)
    max_read_lines=2000,          # cap for a single read_file
    max_search_results=1000,      # cap for search_files
    max_find_results=1000,        # cap for find_files
)
```

The integer limits must be positive; they are validated at construction.
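
Construction-time validation of this kind can be sketched with a plain dataclass. `Limits` is a hypothetical stand-in that carries only the three integer fields, so the example runs without the package installed; it is an illustration of the check, not the class itself.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical stand-in for the three integer limits on FileSystem."""

    max_read_lines: int = 2000
    max_search_results: int = 1000
    max_find_results: int = 1000

    def __post_init__(self) -> None:
        # Dataclass annotations are not enforced at runtime, so check explicitly.
        for name in ('max_read_lines', 'max_search_results', 'max_find_results'):
            value = getattr(self, name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f'{name} must be a positive integer, got {value!r}')


def rejects(**kwargs: object) -> bool:
    """Return True if constructing Limits with these arguments raises ValueError."""
    try:
        Limits(**kwargs)
    except ValueError:
        return True
    return False


outcomes = {
    'zero': rejects(max_read_lines=0),
    'string': rejects(max_search_results='1000'),
    'valid': rejects(max_find_results=50),
}
```

Failing at construction means a bad value from a YAML or JSON spec surfaces when the agent is built rather than on the first tool call.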

## Agent spec (YAML/JSON)

`FileSystem` works with Pydantic AI's
[agent spec](https://ai.pydantic.dev/agent-spec/):

```yaml
# agent.yaml
model: anthropic:claude-sonnet-4-6
capabilities:
- FileSystem:
root_dir: ./workspace
allowed_patterns: ['*.py', '*.toml']
```

```python
from pydantic_ai import Agent
from pydantic_ai_harness import FileSystem

agent = Agent.from_file('agent.yaml', custom_capability_types=[FileSystem])
```

Pass `custom_capability_types` so the spec loader knows how to instantiate
`FileSystem`.

## Further reading

- [Pydantic AI capabilities](https://ai.pydantic.dev/capabilities/)
- [Toolsets](https://ai.pydantic.dev/toolsets/)
6 changes: 6 additions & 0 deletions pydantic_ai_harness/filesystem/__init__.py
@@ -0,0 +1,6 @@
"""Filesystem capability: gives agents configurable, sandboxed file system access."""

from pydantic_ai_harness.filesystem._capability import FileSystem
from pydantic_ai_harness.filesystem._toolset import FileSystemToolset

__all__ = ['FileSystem', 'FileSystemToolset']
81 changes: 81 additions & 0 deletions pydantic_ai_harness/filesystem/_capability.py
@@ -0,0 +1,81 @@
"""Filesystem capability that provides sandboxed file system access."""

from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

from pydantic_ai.capabilities import AbstractCapability
from pydantic_ai.tools import AgentDepsT
from pydantic_ai.toolsets import AgentToolset

from pydantic_ai_harness.filesystem._toolset import FileSystemToolset

_DEFAULT_PROTECTED: list[str] = [
    '.git/*',
    '.env',
    '.env.*',
    '*.pem',
    '*.key',
    '**/secrets*',
]


@dataclass
class FileSystem(AbstractCapability[AgentDepsT]):
    """File system access scoped to a root directory.

    All paths are resolved relative to `root_dir`. Traversal above the root
    is rejected. Symlinks are resolved before authorization.
    """

    root_dir: str | Path = '.'
    """Root directory for all file operations. Defaults to the current directory."""

    allowed_patterns: Sequence[str] = field(default_factory=list[str])
    """If non-empty, only paths matching at least one glob pattern are accessible."""

    denied_patterns: Sequence[str] = field(default_factory=list[str])
    """Paths matching any of these glob patterns are rejected."""

    protected_patterns: Sequence[str] = field(default_factory=lambda: list(_DEFAULT_PROTECTED))
    """Paths matching these patterns are read-only (writes are rejected).

    Defaults to protecting `.git/`, `.env`, key files, and secrets.
    Set to an empty list to disable protection.
    """

    max_read_lines: int = 2000
    """Maximum number of lines returned by a single `read_file` call."""

    max_search_results: int = 1000
    """Maximum number of matches returned by `search_files`."""

    max_find_results: int = 1000
    """Maximum number of matches returned by `find_files`."""

    def __post_init__(self) -> None:
        # Runtime validation: dataclass field annotations are advisory, not enforced.
        # A config-driven caller could pass a string that would otherwise propagate.
        values: dict[str, Any] = {
            'max_read_lines': self.max_read_lines,
            'max_search_results': self.max_search_results,
            'max_find_results': self.max_find_results,
        }
        for name, value in values.items():
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f'{name} must be a positive integer, got {value!r}')

    def get_toolset(self) -> AgentToolset[AgentDepsT]:
        """Build and return the filesystem toolset."""
        return FileSystemToolset[AgentDepsT](
            root_dir=Path(self.root_dir),
            allowed_patterns=self.allowed_patterns,
            denied_patterns=self.denied_patterns,
            protected_patterns=self.protected_patterns,
            max_read_lines=self.max_read_lines,
            max_search_results=self.max_search_results,
            max_find_results=self.max_find_results,
        )