From 0fab8703fe8d9cede6f55ebd1c8d0ab8079b53bc Mon Sep 17 00:00:00 2001
From: MavenTheAI <maventheai@gmail.com>
Date: Fri, 29 May 2026 11:32:29 -0700
Subject: [PATCH 1/2] docs(evals): add index page for /evals/

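Moving docs/evals.md to docs/evals/index.md changes the directory its
relative links resolve against, so links to sibling pages drop the
`evals/` prefix and the API reference link gains `../`.

A minimal sketch of that path arithmetic, assuming relative Markdown
links resolve against the directory of the file that contains them.
`resolve` is a hypothetical helper for illustration only; it is not part
of MkDocs or this repository:

```python
import posixpath


def resolve(source_page: str, link: str) -> str:
    """Resolve a relative link against the page that contains it (hypothetical helper)."""
    base_dir = posixpath.dirname(source_page)
    return posixpath.normpath(posixpath.join(base_dir, link))


# Before the move, links in docs/evals.md needed the evals/ prefix.
assert resolve("docs/evals.md", "evals/quick-start.md") == "docs/evals/quick-start.md"

# After the move, the same target is a sibling of docs/evals/index.md.
assert resolve("docs/evals/index.md", "quick-start.md") == "docs/evals/quick-start.md"

# The API reference lives outside docs/evals/, so it needs a ../ prefix.
assert (
    resolve("docs/evals/index.md", "../api/pydantic_evals/dataset.md")
    == "docs/api/pydantic_evals/dataset.md"
)
```
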
---
 AGENTS.md                         |  2 +-
 docs/agent.md                     |  2 +-
 docs/{evals.md => evals/index.md} | 30 +++++++++++++++---------------
 docs/index.md                     |  2 +-
 docs/install.md                   |  2 +-
 docs/version-policy.md            |  2 +-
 mkdocs.yml                        |  4 ++--
 7 files changed, 22 insertions(+), 22 deletions(-)
 rename docs/{evals.md => evals/index.md} (88%)

diff --git a/AGENTS.md b/AGENTS.md
index 4fc0ae0ece..6c9187ee45 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -79,7 +79,7 @@ The repo contains a `uv` workspace defining multiple Python packages:
 - `pydantic-ai-slim` in `pydantic_ai_slim/`: the [agent framework](docs/agent.md), including the `Agent` class and `Model` classes for each model provider/API
     - This is a slim package with minimal dependencies and optional dependency groups for each model provider (e.g. `openai`, `anthropic`, `google`) or integration (e.g. `logfire`, `mcp`, `temporal`).
 - `pydantic-graph` in `pydantic_graph/`: the type-hint based [graph library](docs/graph.md) that powers the agent loop
-- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals.md) for evaluating the arbitrary stochastic functions including LLMs and agents
+- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals/index.md) for evaluating the arbitrary stochastic functions including LLMs and agents
 - `clai` in `clai/`: a [CLI](docs/cli.md) (with an optional [web UI](docs/web.md)) to chat with Pydantic AI agents
 - `pydantic-ai` defined in `pyproject.toml` at the root, bringing in the packages above as well the optional dependency groups for all model providers and select integrations.
 
diff --git a/docs/agent.md b/docs/agent.md
index 1cc2981923..02f64d856b 100644
--- a/docs/agent.md
+++ b/docs/agent.md
@@ -1178,7 +1178,7 @@ This visibility is invaluable for:
 
 ### Systematic Testing with Evals
 
-For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals.md) provides a code-first framework for testing AI systems:
+For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals/index.md) provides a code-first framework for testing AI systems:
 
 ```python {test="skip" lint="skip" format="skip"}
 from pydantic_evals import Case, Dataset
diff --git a/docs/evals.md b/docs/evals/index.md
similarity index 88%
rename from docs/evals.md
rename to docs/evals/index.md
index eb77f4dec4..fd21ca38d0 100644
--- a/docs/evals.md
+++ b/docs/evals/index.md
@@ -20,33 +20,33 @@ title: Pydantic Evals
 **Getting Started:**
 
 - [Installation](#installation)
-- [Quick Start](evals/quick-start.md)
-- [Core Concepts](evals/core-concepts.md)
+- [Quick Start](quick-start.md)
+- [Core Concepts](core-concepts.md)
 
 **Evaluators:**
 
-- [Evaluators Overview](evals/evaluators/overview.md) - Compare evaluator types and learn when to use each approach
-- [Built-in Evaluators](evals/evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
-- [LLM as a Judge](evals/evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
-- [Custom Evaluators](evals/evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
-- [Span-Based Evaluation](evals/evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
+- [Evaluators Overview](evaluators/overview.md) - Compare evaluator types and learn when to use each approach
+- [Built-in Evaluators](evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
+- [LLM as a Judge](evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
+- [Custom Evaluators](evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
+- [Span-Based Evaluation](evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
 
 **How-To Guides:**
 
-- [Logfire Integration](evals/how-to/logfire-integration.md) - Visualize results
-- [Dataset Management](evals/how-to/dataset-management.md) - Save, load, generate
-- [Concurrency & Performance](evals/how-to/concurrency.md) - Control parallel execution
-- [Retry Strategies](evals/how-to/retry-strategies.md) - Handle transient failures
-- [Metrics & Attributes](evals/how-to/metrics-attributes.md) - Track custom data
-- [Case Lifecycle Hooks](evals/how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
+- [Logfire Integration](how-to/logfire-integration.md) - Visualize results
+- [Dataset Management](how-to/dataset-management.md) - Save, load, generate
+- [Concurrency & Performance](how-to/concurrency.md) - Control parallel execution
+- [Retry Strategies](how-to/retry-strategies.md) - Handle transient failures
+- [Metrics & Attributes](how-to/metrics-attributes.md) - Track custom data
+- [Case Lifecycle Hooks](how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
 
 **Examples:**
 
-- [Simple Validation](evals/examples/simple-validation.md) - Basic example
+- [Simple Validation](examples/simple-validation.md) - Basic example
 
 **Reference:**
 
-- [API Documentation](api/pydantic_evals/dataset.md)
+- [API Documentation](../api/pydantic_evals/dataset.md)
 
 ## Code-First Evaluation
 
diff --git a/docs/index.md b/docs/index.md
index 1de6301d11..24d23572c7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,7 +27,7 @@ Tightly [integrates](logfire.md) with [Pydantic Logfire](https://pydantic.dev/lo
 Designed to give your IDE or AI coding agent as much context as possible for auto-completion and [type checking](agent.md#static-type-checking), moving entire classes of errors from runtime to write-time for a bit of that Rust "if it compiles, it works" feel.
 
 5. **Powerful Evals**:
-Enables you to systematically test and [evaluate](evals.md) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
+Enables you to systematically test and [evaluate](evals/index.md) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
 
 6. **Extensible by Design**:
 Build agents from composable [capabilities](capabilities.md) that bundle tools, hooks, instructions, and model settings into reusable units. Use built-in capabilities for [web search](capabilities.md#provider-adaptive-tools), [thinking](capabilities.md#thinking), and [MCP](capabilities.md#provider-adaptive-tools), pick from the [Pydantic AI Harness](harness/overview.md) capability library, build your own, or install [third-party capability packages](extensibility.md). Define agents entirely in [YAML/JSON](agent-spec.md) — no code required.
diff --git a/docs/install.md b/docs/install.md
index fc8c63f7ef..70bba0f797 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -41,7 +41,7 @@ pip/uv-add "pydantic-ai-slim[openai]"
 `pydantic-ai-slim` has the following optional groups:
 
 * `logfire` — installs [Pydantic Logfire](logfire.md) dependency `logfire` [PyPI ↗](https://pypi.org/project/logfire){:target="_blank"}
-* `evals` — installs [Pydantic Evals](evals.md) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
+* `evals` — installs [Pydantic Evals](evals/index.md) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
 * `openai` — installs [OpenAI Model](models/openai.md) dependency `openai` [PyPI ↗](https://pypi.org/project/openai){:target="_blank"}
 * `vertexai` — installs `GoogleVertexProvider` dependencies `google-auth` [PyPI ↗](https://pypi.org/project/google-auth){:target="_blank"} and `requests` [PyPI ↗](https://pypi.org/project/requests){:target="_blank"}
 * `google` — installs [Google Model](models/google.md) dependency `google-genai` [PyPI ↗](https://pypi.org/project/google-genai){:target="_blank"}
diff --git a/docs/version-policy.md b/docs/version-policy.md
index 9db735a19a..a527da1145 100644
--- a/docs/version-policy.md
+++ b/docs/version-policy.md
@@ -12,7 +12,7 @@ The following changes will **NOT** be considered breaking changes, and may occur
 
 * Bug fixes that may result in existing code breaking, provided that such code was relying on undocumented features/constructs/assumptions.
 * Adding new [message parts][pydantic_ai.messages], [stream events][pydantic_ai.messages.AgentStreamEvent], or optional fields (including fields with default values) on existing message (part) and event types. Always code defensively when consuming message parts or event streams, and use the [`ModelMessagesTypeAdapter`][pydantic_ai.messages.ModelMessagesTypeAdapter] to (de)serialize message histories.
-* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals.md) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
+* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals/index.md) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
 * Changing how `__repr__` behaves, even of public classes.
 
 In all cases we will aim to minimize churn and do so only when justified by the increase of quality of Pydantic AI for users.
diff --git a/mkdocs.yml b/mkdocs.yml
index 681a18fc50..391c880af7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -71,7 +71,7 @@ nav:
       - Code Mode: harness/code-mode.md
 
   - Pydantic Evals:
-      - Overview: evals.md
+      - Overview: evals/index.md
       - Getting Started:
           - Quick Start: evals/quick-start.md
           - Core Concepts: evals/core-concepts.md
@@ -415,7 +415,7 @@ plugins:
         API Reference:
           - api/*.md
         Evals:
-          - evals.md
+          - evals/index.md
           - evals/*.md
         Durable Execution:
           - durable_execution/*.md

From c20ced03b99a400c968d75731ed5770f5229a045 Mon Sep 17 00:00:00 2001
From: MavenTheAI <maventheai@gmail.com>
Date: Fri, 29 May 2026 11:48:11 -0700
Subject: [PATCH 2/2] docs(evals): keep evals entrypoint file

---
 AGENTS.md     | 2 +-
 docs/evals.md | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)
 create mode 100644 docs/evals.md

diff --git a/AGENTS.md b/AGENTS.md
index 6c9187ee45..4fc0ae0ece 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -79,7 +79,7 @@ The repo contains a `uv` workspace defining multiple Python packages:
 - `pydantic-ai-slim` in `pydantic_ai_slim/`: the [agent framework](docs/agent.md), including the `Agent` class and `Model` classes for each model provider/API
     - This is a slim package with minimal dependencies and optional dependency groups for each model provider (e.g. `openai`, `anthropic`, `google`) or integration (e.g. `logfire`, `mcp`, `temporal`).
 - `pydantic-graph` in `pydantic_graph/`: the type-hint based [graph library](docs/graph.md) that powers the agent loop
-- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals/index.md) for evaluating the arbitrary stochastic functions including LLMs and agents
+- `pydantic-evals` in `pydantic_evals/`: the [evaluation framework](docs/evals.md) for evaluating the arbitrary stochastic functions including LLMs and agents
 - `clai` in `clai/`: a [CLI](docs/cli.md) (with an optional [web UI](docs/web.md)) to chat with Pydantic AI agents
 - `pydantic-ai` defined in `pyproject.toml` at the root, bringing in the packages above as well the optional dependency groups for all model providers and select integrations.
 
diff --git a/docs/evals.md b/docs/evals.md
new file mode 100644
index 0000000000..f28734669b
--- /dev/null
+++ b/docs/evals.md
@@ -0,0 +1,5 @@
+# Evals
+
+The Evals documentation has moved to `docs/evals/index.md`.
+
+- Continue here: [Evals](evals/index.md)