63 changes: 63 additions & 0 deletions PLAN.md
@@ -0,0 +1,63 @@
# Compaction Capability — Implementation Plan

Closes #21

## Overview

This PR adds three compaction-related capabilities to `pydantic-harness`:

1. **`SlidingWindow`** — Zero-cost message trimming via a configurable sliding window.
2. **`LimitWarner`** — Injects warning messages when the agent approaches iteration, context-window, or total-token limits.
3. **`Compaction`** — LLM-powered summarization that replaces older messages with a compact summary.

All three are `AbstractCapability` subclasses that operate via the `before_model_request` hook, modifying `request_context.messages` before each model call.

## Design Decisions

### Tool-call / tool-return pair safety

The most critical invariant: trimming or compacting must **never** orphan a `ToolCallPart` without its corresponding `ToolReturnPart` (or vice versa). Doing so causes HTTP 400 errors from LLM providers.

The implementation uses a `_is_safe_cutoff()` function that searches around a proposed cutoff point for tool-call pairs that would be split. If a cutoff is unsafe, it walks backward to find a safe one. This approach is adapted from [vstorm-co/summarization-pydantic-ai](https://github.com/vstorm-co/summarization-pydantic-ai)'s `_cutoff.py`.
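
The backward walk can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's `_is_safe_cutoff()`: `Msg` is a hypothetical stand-in that records only tool-call ids, and `is_safe_cutoff` / `find_safe_cutoff` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Msg:
    """Hypothetical stand-in for a message: ids of the tool calls it makes and returns it carries."""

    calls: set[str] = field(default_factory=set)
    returns: set[str] = field(default_factory=set)


def is_safe_cutoff(messages: list[Msg], cutoff: int) -> bool:
    """True if keeping messages[cutoff:] splits no tool-call / tool-return pair."""
    dropped_calls: set[str] = set()
    dropped_returns: set[str] = set()
    for m in messages[:cutoff]:
        dropped_calls |= m.calls
        dropped_returns |= m.returns
    kept_calls: set[str] = set()
    kept_returns: set[str] = set()
    for m in messages[cutoff:]:
        kept_calls |= m.calls
        kept_returns |= m.returns
    # A pair is split when one half is dropped and the other half is kept.
    return not (dropped_calls & kept_returns) and not (dropped_returns & kept_calls)


def find_safe_cutoff(messages: list[Msg], proposed: int) -> int:
    """Walk backward from the proposed cutoff until no pair is split (0 keeps everything)."""
    cutoff = max(0, min(proposed, len(messages)))
    while cutoff > 0 and not is_safe_cutoff(messages, cutoff):
        cutoff -= 1
    return cutoff
```

Walking backward only ever keeps more messages than requested, so the fallback errs toward retaining context rather than sending a request the provider would reject.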

### Trigger and retention modes

Both `SlidingWindow` and `Compaction` support two trigger modes:
- `max_messages` — fire when message count exceeds threshold
- `max_tokens` — fire when estimated token count exceeds threshold

Both also support two retention modes:
- `keep_messages` — retain N tail messages
- `keep_tokens` — retain messages fitting within a token budget
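
How the two trigger modes and two retention modes might combine is sketched below. It assumes messages are plain strings and that a token counter is passed in; `should_trigger` and `tail_start` are hypothetical helper names, not functions from the PR.

```python
from typing import Callable, Optional


def should_trigger(
    messages: list[str],
    count_tokens: Callable[[str], int],
    max_messages: Optional[int] = None,
    max_tokens: Optional[int] = None,
) -> bool:
    """Fire when either configured threshold is exceeded."""
    if max_messages is not None and len(messages) > max_messages:
        return True
    if max_tokens is not None and sum(count_tokens(m) for m in messages) > max_tokens:
        return True
    return False


def tail_start(
    messages: list[str],
    count_tokens: Callable[[str], int],
    keep_messages: Optional[int] = None,
    keep_tokens: Optional[int] = None,
) -> int:
    """Index of the first retained message under the chosen retention mode."""
    if keep_messages is not None:
        return max(0, len(messages) - keep_messages)
    if keep_tokens is None:
        return 0  # no retention mode configured: keep everything
    budget = keep_tokens
    start = len(messages)
    # Walk from the newest message backward while the tail still fits the budget.
    for i in range(len(messages) - 1, -1, -1):
        cost = count_tokens(messages[i])
        if cost > budget:
            break
        budget -= cost
        start = i
    return start
```

The index returned by `tail_start` is only a proposal: it would still need to pass the tool-pair safety check described above before anything is dropped.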

### Token estimation

A simple `estimate_token_count()` function approximates tokens at ~4 characters per token. This avoids requiring a tokenizer dependency while providing reasonable estimates for threshold detection.
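
As a rough illustration of the heuristic, here is a sketch that works on a plain string. The name `approx_token_count` and its signature are assumptions made for this example; the PR's `estimate_token_count()` may take different arguments and round differently.

```python
import math


def approx_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate the token count as ceil(len(text) / chars_per_token)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)
```

Rounding up means any non-empty text counts as at least one token, which biases the estimate slightly high; that is the safer direction for threshold detection.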

### LimitWarner design

Warnings are injected as a trailing `ModelRequest` with a `UserPromptPart` (not a system message), because models tend to pay more attention to user messages. A `[LimitWarner]` marker enables stripping previous warnings before injecting new ones, preventing warning accumulation.
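
The strip-then-inject step can be sketched with plain strings standing in for messages. This is an illustration of the marker idea only; `inject_warning` is a hypothetical helper, and the real capability operates on `ModelRequest` / `UserPromptPart` objects.

```python
MARKER = '[LimitWarner]'


def inject_warning(messages: list[str], warning: str) -> list[str]:
    """Drop earlier marked warnings, then append one fresh warning at the tail."""
    kept = [m for m in messages if not m.startswith(MARKER)]
    return kept + [f'{MARKER} {warning}']
```

Because every injected warning carries the marker, calling the helper on each request leaves at most one warning in the history, always in the trailing position.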

### Compaction summarization

The `Compaction` capability creates a temporary `pydantic_ai.Agent` with the configured summarization model. System prompts from the beginning of the conversation are preserved and prepended to the summary message.

## Dependencies

- Requires `pydantic-ai-slim` with the capabilities branch (not yet on PyPI).
- For local development, add a `[tool.uv.sources]` override pointing to the capabilities branch checkout.

## Files

- `src/pydantic_harness/compaction.py` — All three capabilities plus helpers
- `src/pydantic_harness/__init__.py` — Package exports
- `tests/test_compaction.py` — 81 tests covering all code paths
- `pyproject.toml` — Coverage threshold adjustment (98% due to branch coverage of elif chains)

## References

- [pydantic/pydantic-ai#4137](https://github.com/pydantic/pydantic-ai/issues/4137) — First-class Context Compaction API
- [pydantic/pydantic-ai#4267](https://github.com/pydantic/pydantic-ai/issues/4267) — Anthropic Compactions
- [pydantic/pydantic-ai#4013](https://github.com/pydantic/pydantic-ai/issues/4013) — OpenAI Compactions
- [pydantic/pydantic-harness#35](https://github.com/pydantic/pydantic-harness/issues/35) — Expose context window size on ModelProfile
- [vstorm-co/summarization-pydantic-ai](https://github.com/vstorm-co/summarization-pydantic-ai) — Prior art for cutoff logic
13 changes: 13 additions & 0 deletions pydantic_ai_harness/experimental/__init__.py
@@ -0,0 +1,13 @@
"""Experimental pydantic-ai-harness capabilities.

Anything under `pydantic_ai_harness.experimental` may change or be removed in any release,
without a deprecation period. Importing an experimental capability emits a
`HarnessExperimentalWarning` that tells you how to silence the whole category at once.

Importing this module on its own does **not** emit a warning, so you can pull in
`HarnessExperimentalWarning` to silence the warnings before importing a capability.
"""

from pydantic_ai_harness.experimental._warn import HarnessExperimentalWarning

__all__ = ['HarnessExperimentalWarning']
40 changes: 40 additions & 0 deletions pydantic_ai_harness/experimental/_warn.py
@@ -0,0 +1,40 @@
"""Experimental-feature warning machinery for pydantic-ai-harness."""

from __future__ import annotations

import warnings


class HarnessExperimentalWarning(UserWarning):
    """Signals that a pydantic-ai-harness feature is experimental.

    Experimental features may change or be removed in any release, without a deprecation
    period. Silence every experimental-harness warning at once with::

        import warnings
        from pydantic_ai_harness.experimental import HarnessExperimentalWarning

        warnings.filterwarnings('ignore', category=HarnessExperimentalWarning)
    """


_SILENCE_HINT = (
    ' import warnings\n'
    ' from pydantic_ai_harness.experimental import HarnessExperimentalWarning\n'
    " warnings.filterwarnings('ignore', category=HarnessExperimentalWarning)"
)


def warn_experimental(feature: str) -> None:
    """Emit a `HarnessExperimentalWarning` for *feature*, including how to silence all of them.

    One filter silences the whole category (every experimental capability), so users never
    need a suppression line per capability.
    """
    warnings.warn(
        f'`pydantic_ai_harness.experimental.{feature}` is experimental: its API may change or be '
        f'removed in any release, without a deprecation period.\n\n'
        f'Silence all pydantic-ai-harness experimental warnings with:\n\n{_SILENCE_HINT}\n',
        category=HarnessExperimentalWarning,
        stacklevel=2,
    )
124 changes: 124 additions & 0 deletions pydantic_ai_harness/experimental/compaction/README.md
@@ -0,0 +1,124 @@
# Compaction capabilities

> [!WARNING]
> **Experimental.** These capabilities live under `pydantic_ai_harness.experimental` and may
> change or be removed in any release, without a deprecation period. Import them from the
> experimental path — there is no top-level export:
>
> ```python
> from pydantic_ai_harness.experimental.compaction import TieredCompaction
> ```
>
> Importing any experimental capability emits a `HarnessExperimentalWarning`. Silence **all**
> harness experimental warnings with a single filter (no per-capability lines needed):
>
> ```python
> import warnings
> from pydantic_ai_harness.experimental import HarnessExperimentalWarning
>
> warnings.filterwarnings('ignore', category=HarnessExperimentalWarning)
> ```

A menu of strategies for keeping an agent's conversation history within a model's context
window. Each is a Pydantic AI `Capability` that runs in the `before_model_request` hook; edits
**persist** into the run's message history, so a trim/clear/summary carries forward to later
steps (it is not recomputed from the full history every turn).

All strategies preserve tool-call / tool-return **pairing**: core does not validate this, and
providers reject a request that contains an orphaned call or return. The zero-LLM strategies never
call a model.

## The menu

| Capability | Cost | What it does | Reach for it when |
|---|---|---|---|
| `SlidingWindow` | zero-LLM | Drops the oldest whole messages down to a tail | You only need the recent turns and can discard old context entirely |
| `ClearToolResults` | zero-LLM | Blanks the content of old tool *results* in place, keeping the last `keep_pairs` | Tool outputs dominate context and can be re-fetched on demand (the cheap first tier) |
| `DeduplicateFileReads` | zero-LLM | Blanks every file read superseded by a newer read of the same file | The agent re-reads files and only the latest version matters |
| `SummarizingCompaction` | one LLM call | Summarizes older messages into a structured summary, keeping the recent tail | Old context still matters but must be compressed; use behind the cheap tiers |
| `TieredCompaction` | escalates | Runs cheap passes first, summarizes only if still over `target_tokens` | You want the SOTA default: spend the expensive summary only when needed |
| `LimitWarner` | zero-LLM | Injects an URGENT/CRITICAL warning as limits approach | You want the agent to wrap up rather than have its history rewritten |

## Triggers

Every size-based strategy triggers on `max_messages` and/or `max_tokens` (estimated). Token counts
use a ~4-chars-per-token heuristic by default; pass a `tokenizer` callable (e.g. one backed by
`tiktoken`) for accuracy. `DeduplicateFileReads` runs on every request when no trigger is set (it is
cheap and near-lossless). `TieredCompaction` triggers and stops on a single `target_tokens` budget.
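
A pluggable tokenizer might look like the sketch below. The `Tokenizer` alias, `count_tokens`, and the toy `whitespace_tokenizer` are assumptions for illustration; the exact signature the capabilities expect for `tokenizer` is not shown in this README.

```python
from typing import Callable, Optional, Sequence

# Assumed shape for the example: a callable from text to a token count.
Tokenizer = Callable[[str], int]


def count_tokens(texts: Sequence[str], tokenizer: Optional[Tokenizer] = None) -> int:
    """Sum token counts, using `tokenizer` when given and the ~4-chars heuristic otherwise."""
    if tokenizer is None:
        # Ceiling division: every started group of 4 characters counts as one token.
        return sum((len(t) + 3) // 4 for t in texts)
    return sum(tokenizer(t) for t in texts)


def whitespace_tokenizer(text: str) -> int:
    """Toy tokenizer for the example; a real one might wrap a library such as tiktoken."""
    return len(text.split())
```

Swapping the callable changes only how size is measured; the trigger thresholds themselves stay the same.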

## Cost: why summarization is the last resort

Summarization turns input tokens into output tokens, which are billed at a premium and generated
serially, so it is genuinely expensive. The zero-LLM strategies touch only the cheaper input side.
The field consensus (Anthropic, OpenCode, Letta) is to clear/dedupe first and summarize only when
that is not enough, which is exactly what `TieredCompaction` encodes:

```python
from pydantic_ai import Agent

from pydantic_ai_harness.experimental.compaction import (
    ClearToolResults,
    DeduplicateFileReads,
    SummarizingCompaction,
    TieredCompaction,
)

agent = Agent(
    'openai:gpt-4o',
    capabilities=[
        TieredCompaction(
            tiers=[
                # `my_file_key` is defined in the `DeduplicateFileReads.file_key` section.
                DeduplicateFileReads(file_key=my_file_key),
                ClearToolResults(max_tokens=1, keep_pairs=3),
                # No `model` given, so the summary call inherits the running agent's model.
                SummarizingCompaction(max_messages=1, keep_messages=20),
            ],
            target_tokens=120_000,
        )
    ],
)
```
A tier inside `TieredCompaction` is driven directly by the orchestrator, which re-measures after each
and stops once under `target_tokens` — so a tier's own `max_*` trigger is irrelevant there (set it to
anything valid). Any object with `async def compact(messages, ctx) -> list[ModelMessage]`
(`CompactionStrategy`) can be a tier, so you can plug in your own.
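
The orchestration described above can be sketched with stand-in types. Everything here is hypothetical: messages are plain strings instead of `ModelMessage`, `Tier` mirrors the documented `compact(messages, ctx)` shape, and `run_tiers`, `DropEmpty`, and `KeepLast` are illustrative names rather than harness APIs.

```python
import asyncio
from typing import Any, Protocol


class Tier(Protocol):
    """Stand-in for the documented strategy shape: an async `compact(messages, ctx)`."""

    async def compact(self, messages: list[str], ctx: Any) -> list[str]: ...


class DropEmpty:
    """Hypothetical cheap tier: removes empty or whitespace-only messages."""

    async def compact(self, messages: list[str], ctx: Any) -> list[str]:
        return [m for m in messages if m.strip()]


class KeepLast:
    """Hypothetical lossy tier: keeps only the last `n` messages."""

    def __init__(self, n: int) -> None:
        self.n = n

    async def compact(self, messages: list[str], ctx: Any) -> list[str]:
        return messages[-self.n:] if self.n > 0 else []


def estimate(messages: list[str]) -> int:
    """~4 characters per token, rounded up per message."""
    return sum((len(m) + 3) // 4 for m in messages)


async def run_tiers(messages: list[str], tiers: list[Tier], target_tokens: int, ctx: Any = None) -> list[str]:
    """Run tiers in order, re-measuring before each and stopping once at or under budget."""
    for tier in tiers:
        if estimate(messages) <= target_tokens:
            break
        messages = await tier.compact(messages, ctx)
    return messages
```

Ordering the tiers from cheapest to most lossy means the later, costlier tiers run only when the earlier ones did not reclaim enough.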

## Cache tradeoff (read before using `ClearToolResults`)

Clearing or deduplicating rewrites message content, which invalidates the provider's prompt cache
from the edit point onward, so the next request pays a cache-write. Set `min_clear_tokens` on
`ClearToolResults` to skip a clear that would reclaim too little to justify invalidating the cache.
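
The threshold check might work along these lines. This is a sketch under assumptions: tool results are plain strings, `PLACEHOLDER` is an invented replacement text, and `clear_tool_results` is a hypothetical helper, not the `ClearToolResults` implementation.

```python
PLACEHOLDER = '[cleared]'  # hypothetical replacement text for a blanked result


def tokens(text: str) -> int:
    """~4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def clear_tool_results(results: list[str], keep_pairs: int, min_clear_tokens: int) -> list[str]:
    """Blank all but the last `keep_pairs` results, unless too few tokens would be reclaimed."""
    cut = max(0, len(results) - keep_pairs)
    old = results[:cut]
    # Count only the net saving: the placeholder itself still costs a few tokens.
    reclaimed = sum(max(0, tokens(r) - tokens(PLACEHOLDER)) for r in old)
    if reclaimed < min_clear_tokens:
        return list(results)  # not worth invalidating the prompt cache
    return [PLACEHOLDER] * cut + results[cut:]
```

Results that were already blanked contribute nothing to `reclaimed`, so repeated passes over the same history do not keep rewriting it.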

## Model inheritance

`SummarizingCompaction(model=...)` accepts a model name or `Model`; when left `None` it inherits the
running agent's model. No token caps are imposed on the summary call.

## Usage accounting

The summary call is a real request to the model, so its full usage (tokens **and** the request
itself) is folded into the run's `ctx.usage`. This is deliberate: it keeps cost honest, keeps the
request count consistent (a model request that didn't count as one would be the surprise), and lets a
`UsageLimits` request limit catch a runaway compaction. A run-request / iteration limiter will
therefore see compaction calls among its requests.

## `DeduplicateFileReads.file_key`

There is no default `file_key`: identifying a file read is agent-specific, and a wrong guess would
drop live data. Supply a callable mapping a `ToolCallPart` to a stable file key, or `None` when the
call is not a file read:

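```python
from pydantic_ai.messages import ToolCallPart


def my_file_key(call: ToolCallPart) -> str | None:
    # Only `read_file` calls count as file reads; every other tool is left alone.
    if call.tool_name != 'read_file':
        return None
    args = call.args
    # Args that are not a dict (e.g. a raw JSON string) are treated as "not a file read".
    return args.get('path') if isinstance(args, dict) else None
```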
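
How a file key could drive deduplication is sketched below with simplified data. Each read is a hypothetical `(key, content)` tuple, where `key` is what a `file_key` callable returned for the originating call; `dedupe_reads` and `PLACEHOLDER` are illustrative names, not harness APIs.

```python
from typing import Optional

PLACEHOLDER = '[superseded by a newer read]'  # hypothetical replacement text


def dedupe_reads(reads: list[tuple[Optional[str], str]]) -> list[str]:
    """Keep the newest content per file key, blank older reads, leave key=None untouched."""
    latest: dict[str, int] = {}
    for i, (key, _) in enumerate(reads):
        if key is not None:
            latest[key] = i  # later reads overwrite earlier indices
    return [
        content if key is None or latest[key] == i else PLACEHOLDER
        for i, (key, content) in enumerate(reads)
    ]
```

Returning `None` for anything that is not clearly a file read is the conservative choice: such results are never blanked.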

## Out of scope

These strategies compress or drop context *inside* the window. Moving large tool outputs *out* of the
window (overflowing them to a file the agent, or a subagent, can query on demand) is a separate
capability, not lossy truncation. Prefer it over capping individual tool outputs.
28 changes: 28 additions & 0 deletions pydantic_ai_harness/experimental/compaction/__init__.py
@@ -0,0 +1,28 @@
"""Compaction capabilities: keep an agent's conversation history within the context window.

Each capability lives in its own module; shared utilities (token estimation, the
`CompactionStrategy` protocol, tool-pair-safe cutoffs, in-place clearing) live in `_shared`.
"""

from pydantic_ai_harness.experimental._warn import warn_experimental
from pydantic_ai_harness.experimental.compaction._clear_tool_results import ClearToolResults
from pydantic_ai_harness.experimental.compaction._deduplicate_file_reads import DeduplicateFileReads
from pydantic_ai_harness.experimental.compaction._limit_warner import LimitWarner, WarningKind
from pydantic_ai_harness.experimental.compaction._shared import CompactionStrategy, estimate_token_count
from pydantic_ai_harness.experimental.compaction._sliding_window import SlidingWindow
from pydantic_ai_harness.experimental.compaction._summarizing_compaction import SummarizingCompaction
from pydantic_ai_harness.experimental.compaction._tiered_compaction import TieredCompaction

warn_experimental('compaction')

__all__ = [
    'ClearToolResults',
    'CompactionStrategy',
    'DeduplicateFileReads',
    'LimitWarner',
    'SlidingWindow',
    'SummarizingCompaction',
    'TieredCompaction',
    'WarningKind',
    'estimate_token_count',
]