From 0fab8703fe8d9cede6f55ebd1c8d0ab8079b53bc Mon Sep 17 00:00:00 2001
From: MavenTheAI
Date: Fri, 29 May 2026 11:32:29 -0700
Subject: [PATCH 1/2] docs(evals): add index page for /evals/

---
 AGENTS.md                         |  2 +-
 docs/agent.md                     |  2 +-
 docs/{evals.md => evals/index.md} | 30 +++++++++++++++---------------
 docs/index.md                     |  2 +-
 docs/install.md                   |  2 +-
 docs/version-policy.md            |  2 +-
 mkdocs.yml                        |  4 ++--
 7 files changed, 22 insertions(+), 22 deletions(-)
 rename docs/{evals.md => evals/index.md} (88%)

diff --git a/AGENTS.md b/AGENTS.md
index 4fc0ae0ece..6c9187ee45 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -79,7 +79,7 @@ The repo contains a `uv` workspace defining multiple Python packages:
 - `pydantic-ai-slim` in `pydantic_ai_slim/`: the [agent framework](docs/agent.md), including the `Agent` class and `Model` classes for each model provider/API
   - This is a slim package with minimal dependencies and optional dependency groups for each model provider (e.g. `openai`, `anthropic`, `google`) or integration (e.g. `logfire`, `mcp`, `temporal`).
 - `pydantic-graph` in `pydantic_graph/`: the type-hint based [graph library](docs/graph.md) that powers the agent loop
-- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals.md) for evaluating the arbitrary stochastic functions including LLMs and agents
+- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals/index.md) for evaluating the arbitrary stochastic functions including LLMs and agents
 - `clai` in `clai/`: a [CLI](docs/cli.md) (with an optional [web UI](docs/web.md)) to chat with Pydantic AI agents
 - `pydantic-ai` defined in `pyproject.toml` at the root, bringing in the packages above as well the optional dependency groups for all model providers and select integrations.
 
diff --git a/docs/agent.md b/docs/agent.md
index 1cc2981923..02f64d856b 100644
--- a/docs/agent.md
+++ b/docs/agent.md
@@ -1178,7 +1178,7 @@ This visibility is invaluable for:
 
 ### Systematic Testing with Evals
 
-For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals.md) provides a code-first framework for testing AI systems:
+For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals/) provides a code-first framework for testing AI systems:
 
 ```python {test="skip" lint="skip" format="skip"}
 from pydantic_evals import Case, Dataset
diff --git a/docs/evals.md b/docs/evals/index.md
similarity index 88%
rename from docs/evals.md
rename to docs/evals/index.md
index eb77f4dec4..fd21ca38d0 100644
--- a/docs/evals.md
+++ b/docs/evals/index.md
@@ -20,33 +20,33 @@ title: Pydantic Evals
 **Getting Started:**
 
 - [Installation](#installation)
-- [Quick Start](evals/quick-start.md)
-- [Core Concepts](evals/core-concepts.md)
+- [Quick Start](quick-start.md)
+- [Core Concepts](core-concepts.md)
 
 **Evaluators:**
 
-- [Evaluators Overview](evals/evaluators/overview.md) - Compare evaluator types and learn when to use each approach
-- [Built-in Evaluators](evals/evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
-- [LLM as a Judge](evals/evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
-- [Custom Evaluators](evals/evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
-- [Span-Based Evaluation](evals/evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
+- [Evaluators Overview](evaluators/overview.md) - Compare evaluator types and learn when to use each approach
+- [Built-in Evaluators](evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
+- [LLM as a Judge](evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
+- [Custom Evaluators](evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
+- [Span-Based Evaluation](evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
 
 **How-To Guides:**
 
-- [Logfire Integration](evals/how-to/logfire-integration.md) - Visualize results
-- [Dataset Management](evals/how-to/dataset-management.md) - Save, load, generate
-- [Concurrency & Performance](evals/how-to/concurrency.md) - Control parallel execution
-- [Retry Strategies](evals/how-to/retry-strategies.md) - Handle transient failures
-- [Metrics & Attributes](evals/how-to/metrics-attributes.md) - Track custom data
-- [Case Lifecycle Hooks](evals/how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
+- [Logfire Integration](how-to/logfire-integration.md) - Visualize results
+- [Dataset Management](how-to/dataset-management.md) - Save, load, generate
+- [Concurrency & Performance](how-to/concurrency.md) - Control parallel execution
+- [Retry Strategies](how-to/retry-strategies.md) - Handle transient failures
+- [Metrics & Attributes](how-to/metrics-attributes.md) - Track custom data
+- [Case Lifecycle Hooks](how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
 
 **Examples:**
 
-- [Simple Validation](evals/examples/simple-validation.md) - Basic example
+- [Simple Validation](examples/simple-validation.md) - Basic example
 
 **Reference:**
 
-- [API Documentation](api/pydantic_evals/dataset.md)
+- [API Documentation](../api/pydantic_evals/dataset.md)
 
 ## Code-First Evaluation
 
diff --git a/docs/index.md b/docs/index.md
index 1de6301d11..24d23572c7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,7 +27,7 @@ Tightly [integrates](logfire.md) with [Pydantic Logfire](https://pydantic.dev/lo
 Designed to give your IDE or AI coding agent as much context as possible for auto-completion and [type checking](agent.md#static-type-checking), moving entire classes of errors from runtime to write-time for a bit of that Rust "if it compiles, it works" feel.
 
 5. **Powerful Evals**:
-Enables you to systematically test and [evaluate](evals.md) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
+Enables you to systematically test and [evaluate](evals/) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
 
 6. **Extensible by Design**:
 Build agents from composable [capabilities](capabilities.md) that bundle tools, hooks, instructions, and model settings into reusable units. Use built-in capabilities for [web search](capabilities.md#provider-adaptive-tools), [thinking](capabilities.md#thinking), and [MCP](capabilities.md#provider-adaptive-tools), pick from the [Pydantic AI Harness](harness/overview.md) capability library, build your own, or install [third-party capability packages](extensibility.md). Define agents entirely in [YAML/JSON](agent-spec.md) — no code required.
diff --git a/docs/install.md b/docs/install.md
index fc8c63f7ef..70bba0f797 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -41,7 +41,7 @@ pip/uv-add "pydantic-ai-slim[openai]"
 `pydantic-ai-slim` has the following optional groups:
 
 * `logfire` — installs [Pydantic Logfire](logfire.md) dependency `logfire` [PyPI ↗](https://pypi.org/project/logfire){:target="_blank"}
-* `evals` — installs [Pydantic Evals](evals.md) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
+* `evals` — installs [Pydantic Evals](evals/) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
 * `openai` — installs [OpenAI Model](models/openai.md) dependency `openai` [PyPI ↗](https://pypi.org/project/openai){:target="_blank"}
 * `vertexai` — installs `GoogleVertexProvider` dependencies `google-auth` [PyPI ↗](https://pypi.org/project/google-auth){:target="_blank"} and `requests` [PyPI ↗](https://pypi.org/project/requests){:target="_blank"}
 * `google` — installs [Google Model](models/google.md) dependency `google-genai` [PyPI ↗](https://pypi.org/project/google-genai){:target="_blank"}
diff --git a/docs/version-policy.md b/docs/version-policy.md
index 9db735a19a..a527da1145 100644
--- a/docs/version-policy.md
+++ b/docs/version-policy.md
@@ -12,7 +12,7 @@ The following changes will **NOT** be considered breaking changes, and may occur
 
 * Bug fixes that may result in existing code breaking, provided that such code was relying on undocumented features/constructs/assumptions.
 * Adding new [message parts][pydantic_ai.messages], [stream events][pydantic_ai.messages.AgentStreamEvent], or optional fields (including fields with default values) on existing message (part) and event types. Always code defensively when consuming message parts or event streams, and use the [`ModelMessagesTypeAdapter`][pydantic_ai.messages.ModelMessagesTypeAdapter] to (de)serialize message histories.
-* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals.md) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
+* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals/) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
 * Changing how `__repr__` behaves, even of public classes.
 
 In all cases we will aim to minimize churn and do so only when justified by the increase of quality of Pydantic AI for users.
diff --git a/mkdocs.yml b/mkdocs.yml
index 681a18fc50..391c880af7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -71,7 +71,7 @@ nav:
       - Code Mode: harness/code-mode.md
 
   - Pydantic Evals:
-      - Overview: evals.md
+      - Overview: evals/index.md
       - Getting Started:
           - Quick Start: evals/quick-start.md
           - Core Concepts: evals/core-concepts.md
@@ -415,7 +415,7 @@ plugins:
         API Reference:
           - api/*.md
         Evals:
-          - evals.md
+          - evals/index.md
           - evals/*.md
         Durable Execution:
           - durable_execution/*.md

From c20ced03b99a400c968d75731ed5770f5229a045 Mon Sep 17 00:00:00 2001
From: MavenTheAI
Date: Fri, 29 May 2026 11:48:11 -0700
Subject: [PATCH 2/2] Docs: keep evals entrypoint file

---
 AGENTS.md     | 2 +-
 docs/evals.md | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)
 create mode 100644 docs/evals.md

diff --git a/AGENTS.md b/AGENTS.md
index 6c9187ee45..4fc0ae0ece 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -79,7 +79,7 @@ The repo contains a `uv` workspace defining multiple Python packages:
 - `pydantic-ai-slim` in `pydantic_ai_slim/`: the [agent framework](docs/agent.md), including the `Agent` class and `Model` classes for each model provider/API
   - This is a slim package with minimal dependencies and optional dependency groups for each model provider (e.g. `openai`, `anthropic`, `google`) or integration (e.g. `logfire`, `mcp`, `temporal`).
 - `pydantic-graph` in `pydantic_graph/`: the type-hint based [graph library](docs/graph.md) that powers the agent loop
-- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals/index.md) for evaluating the arbitrary stochastic functions including LLMs and agents
+- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals.md) for evaluating the arbitrary stochastic functions including LLMs and agents
 - `clai` in `clai/`: a [CLI](docs/cli.md) (with an optional [web UI](docs/web.md)) to chat with Pydantic AI agents
 - `pydantic-ai` defined in `pyproject.toml` at the root, bringing in the packages above as well the optional dependency groups for all model providers and select integrations.
 
diff --git a/docs/evals.md b/docs/evals.md
new file mode 100644
index 0000000000..f28734669b
--- /dev/null
+++ b/docs/evals.md
@@ -0,0 +1,5 @@
+# Evals
+
+The Evals documentation has moved to `docs/evals/index.md`.
+
+- Continue here: [Evals](evals/index.md)