diff --git a/docs/agent.md b/docs/agent.md
index 1cc2981923..02f64d856b 100644
--- a/docs/agent.md
+++ b/docs/agent.md
@@ -1178,7 +1178,7 @@ This visibility is invaluable for:
 
 ### Systematic Testing with Evals
 
-For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals.md) provides a code-first framework for testing AI systems:
+For systematic evaluation of agent behavior beyond runtime debugging, [Pydantic Evals](evals/index.md) provides a code-first framework for testing AI systems:
 
 ```python {test="skip" lint="skip" format="skip"}
 from pydantic_evals import Case, Dataset
diff --git a/docs/evals.md b/docs/evals.md
index eb77f4dec4..f28734669b 100644
--- a/docs/evals.md
+++ b/docs/evals.md
@@ -1,286 +1,5 @@
----
-title: Pydantic Evals
----
+# Evals
 
-# Pydantic Evals
+The Evals documentation has moved to `docs/evals/index.md`.
 
-**Pydantic Evals** is a powerful evaluation framework for systematically testing and evaluating AI systems, from simple LLM calls to complex multi-agent applications.
-
-## Design Philosophy
-
-!!! note "Code-First Approach"
-    Pydantic Evals follows a code-first philosophy where all evaluation components are defined in Python. This differs from platforms with web-based configuration. You write and run evals in code, and can write the results to disk or view them in your terminal or in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
-
-!!! danger "Evals are an Emerging Practice"
-    Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored. We've designed Pydantic Evals to be flexible and useful without being too opinionated.
-
-
-## Quick Navigation
-
-**Getting Started:**
-
-- [Installation](#installation)
-- [Quick Start](evals/quick-start.md)
-- [Core Concepts](evals/core-concepts.md)
-
-**Evaluators:**
-
-- [Evaluators Overview](evals/evaluators/overview.md) - Compare evaluator types and learn when to use each approach
-- [Built-in Evaluators](evals/evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
-- [LLM as a Judge](evals/evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
-- [Custom Evaluators](evals/evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
-- [Span-Based Evaluation](evals/evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
-
-**How-To Guides:**
-
-- [Logfire Integration](evals/how-to/logfire-integration.md) - Visualize results
-- [Dataset Management](evals/how-to/dataset-management.md) - Save, load, generate
-- [Concurrency & Performance](evals/how-to/concurrency.md) - Control parallel execution
-- [Retry Strategies](evals/how-to/retry-strategies.md) - Handle transient failures
-- [Metrics & Attributes](evals/how-to/metrics-attributes.md) - Track custom data
-- [Case Lifecycle Hooks](evals/how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
-
-**Examples:**
-
-- [Simple Validation](evals/examples/simple-validation.md) - Basic example
-
-**Reference:**
-
-- [API Documentation](api/pydantic_evals/dataset.md)
-
-## Code-First Evaluation
-
-Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
-
-When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your python code (IDE, terminal, etc). You also get a report object back that you can serialize and store or send to a notebook or other application for further visualization and analysis.
-
-If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a observability layer - you write and run evals in code, then view and analyze results in the web UI.
-
-## Installation
-
-To install the Pydantic Evals package, run:
-
-```bash
-pip/uv-add pydantic-evals
-```
-
-`pydantic-evals` does not depend on `pydantic-ai`, but has an optional dependency on `logfire` if you'd like to
-use OpenTelemetry traces in your evals, or send evaluation results to [logfire](https://pydantic.dev/logfire).
-
-```bash
-pip/uv-add 'pydantic-evals[logfire]'
-```
-
-## Pydantic Evals Data Model
-
-Pydantic Evals is built around a simple data model:
-
-### Data Model Diagram
-
-```
-Dataset (1) ──────────── (Many) Case
-│                        │
-│                        │
-└─── (Many) Experiment ──┴─── (Many) Case results
-     │
-     └─── (1) Task
-     │
-     └─── (Many) Evaluator
-```
-
-### Key Relationships
-
-1. **Dataset → Cases**: One Dataset contains many Cases
-2. **Dataset → Experiments**: One Dataset can be used across many Experiments over time
-3. **Experiment → Case results**: One Experiment generates results by executing each Case
-4. **Experiment → Task**: One Experiment evaluates one defined Task
-5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all Cases, and Case-specific Evaluators against their respective Cases
-
-### Data Flow
-
-1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python
-2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
-3. **Cases run**: Each Case is executed against the Task
-4. **Evaluation**: Evaluators score the Task outputs for each Case
-5. **Results**: All Case results are collected into a summary report
-
-!!! note "A metaphor"
-
-    A useful metaphor (although not perfect) is to think of evals like a **Unit Testing** framework:
-
-    - **Cases + Evaluators** are your individual unit tests - each one
-    defines a specific scenario you want to test, complete with inputs
-    and expected outcomes. Just like a unit test, a case asks: _"Given
-    this input, does my system produce the right output?"_
-
-    -  **Datasets** are like test suites - they are the scaffolding that holds your unit
-    tests together. They group related cases and define shared
-    evaluation criteria that should apply across all tests in the suite.
-
-    - **Experiments** are like running your entire test suite and getting a
-    report. When you execute `dataset.evaluate_sync(my_ai_function)`,
-    you're running all your cases against your AI system and
-    collecting the results - just like running `pytest` and getting a
-    summary of passes, failures, and performance metrics.
-
-    The key difference from traditional unit testing is that AI systems are
-    probabilistic. If you're type checking you'll still get a simple pass/fail,
-    but scores for text outputs are likely qualitative and/or categorical,
-    and more open to interpretation.
-
-For a deeper understanding, see [Core Concepts](evals/core-concepts.md).
-
-## Datasets and Cases
-
-In Pydantic Evals, everything begins with [`Dataset`][pydantic_evals.dataset.Dataset]s and [`Case`][pydantic_evals.dataset.Case]s:
-
-- **[`Dataset`][pydantic_evals.dataset.Dataset]**: A collection of test Cases designed for the evaluation of a specific task or function
-- **[`Case`][pydantic_evals.dataset.Case]**: A single test scenario corresponding to Task inputs, with optional expected outputs, metadata, and case-specific evaluators
-
-```python {title="simple_eval_dataset.py"}
-from pydantic_evals import Case, Dataset
-
-case1 = Case(
-    name='simple_case',
-    inputs='What is the capital of France?',
-    expected_output='Paris',
-    metadata={'difficulty': 'easy'},
-)
-
-dataset = Dataset(name='capital_quiz', cases=[case1])
-```
-
-_(This example is complete, it can be run "as is")_
-
-See [Dataset Management](evals/how-to/dataset-management.md) to learn about saving, loading, and generating datasets.
-
-## Evaluators
-
-[`Evaluator`][pydantic_evals.evaluators.Evaluator]s analyze and score the results of your Task when tested against a Case.
-
-These can be deterministic, code-based checks (such as testing model output format with a regex, or checking for the appearance of PII or sensitive data), or they can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations, or instruction-following.
-
-While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs.
-
-Pydantic Evals includes several [built-in evaluators](evals/evaluators/built-in.md) and allows you to define [custom evaluators](evals/evaluators/custom.md):
-
-```python {title="simple_eval_evaluator.py" requires="simple_eval_dataset.py"}
-from dataclasses import dataclass
-
-from pydantic_evals.evaluators import Evaluator, EvaluatorContext
-from pydantic_evals.evaluators.common import IsInstance
-
-from simple_eval_dataset import dataset
-
-dataset.add_evaluator(IsInstance(type_name='str'))  # (1)!
-
-
-@dataclass
-class MyEvaluator(Evaluator):
-    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:  # (2)!
-        if ctx.output == ctx.expected_output:
-            return 1.0
-        elif (
-            isinstance(ctx.output, str)
-            and ctx.expected_output.lower() in ctx.output.lower()
-        ):
-            return 0.8
-        else:
-            return 0.0
-
-
-dataset.add_evaluator(MyEvaluator())
-```
-
-1. You can add built-in evaluators to a dataset using the [`add_evaluator`][pydantic_evals.dataset.Dataset.add_evaluator] method.
-2. This custom evaluator returns a simple score based on whether the output matches the expected output.
-
-_(This example is complete, it can be run "as is")_
-
-Learn more:
-
-- [Evaluators Overview](evals/evaluators/overview.md) - When to use different types
-- [Built-in Evaluators](evals/evaluators/built-in.md) - Complete reference
-- [LLM Judge](evals/evaluators/llm-judge.md) - Using LLMs as evaluators
-- [Custom Evaluators](evals/evaluators/custom.md) - Write your own logic
-- [Span-Based Evaluation](evals/evaluators/span-based.md) - Analyze execution traces
-
-## Running Experiments
-
-Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment".
-
-Putting the above two examples together and using the more declarative `evaluators` kwarg to [`Dataset`][pydantic_evals.dataset.Dataset]:
-
-```python {title="simple_eval_complete.py"}
-from pydantic_evals import Case, Dataset
-from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance
-
-case1 = Case(  # (1)!
-    name='simple_case',
-    inputs='What is the capital of France?',
-    expected_output='Paris',
-    metadata={'difficulty': 'easy'},
-)
-
-
-class MyEvaluator(Evaluator[str, str]):
-    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
-        if ctx.output == ctx.expected_output:
-            return 1.0
-        elif (
-            isinstance(ctx.output, str)
-            and ctx.expected_output.lower() in ctx.output.lower()
-        ):
-            return 0.8
-        else:
-            return 0.0
-
-
-dataset = Dataset(
-    name='capital_quiz',
-    cases=[case1],
-    evaluators=[IsInstance(type_name='str'), MyEvaluator()],  # (2)!
-)
-
-
-async def guess_city(question: str) -> str:  # (3)!
-    return 'Paris'
-
-
-report = dataset.evaluate_sync(guess_city)  # (4)!
-report.print(include_input=True, include_output=True, include_durations=False)  # (5)!
-"""
-                              Evaluation Summary: guess_city
-┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
-┃ Case ID     ┃ Inputs                         ┃ Outputs ┃ Scores            ┃ Assertions ┃
-┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
-│ simple_case │ What is the capital of France? │ Paris   │ MyEvaluator: 1.00 │ ✔          │
-├─────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┤
-│ Averages    │                                │         │ MyEvaluator: 1.00 │ 100.0% ✔   │
-└─────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┘
-"""
-```
-
-1. Create a [test case][pydantic_evals.dataset.Case] as above
-2. Create a [`Dataset`][pydantic_evals.dataset.Dataset] with test cases and [`evaluators`][pydantic_evals.dataset.Dataset.evaluators]
-3. Our function to evaluate.
-4. Run the evaluation with [`evaluate_sync`][pydantic_evals.dataset.Dataset.evaluate_sync], which runs the function against all test cases in the dataset, and returns an [`EvaluationReport`][pydantic_evals.reporting.EvaluationReport] object.
-5. Print the report with [`print`][pydantic_evals.reporting.EvaluationReport.print], which shows the results of the evaluation. We have omitted duration here just to keep the printed output from changing from run to run.
-
-_(This example is complete, it can be run "as is")_
-
-See [Quick Start](evals/quick-start.md) for more examples and [Concurrency & Performance](evals/how-to/concurrency.md) to learn about controlling parallel execution.
-
-## API Reference
-
-For comprehensive coverage of all classes, methods, and configuration options, see the detailed [API Reference documentation](https://ai.pydantic.dev/api/pydantic_evals/dataset/).
-
-## Next Steps
-
-<!-- TODO - this would be the perfect place for a full tutorial or case study  -->
-1. **Start with simple evaluations** using [Quick Start](evals/quick-start.md)
-2. **Understand the data model** with [Core Concepts](evals/core-concepts.md)
-3. **Explore built-in evaluators** in [Built-in Evaluators](evals/evaluators/built-in.md)
-4. **Integrate with Logfire** for visualization: [Logfire Integration](evals/how-to/logfire-integration.md)
-5. **Build comprehensive test suites** with [Dataset Management](evals/how-to/dataset-management.md)
-6. **Implement custom evaluators** for domain-specific metrics: [Custom Evaluators](evals/evaluators/custom.md)
+- Continue here: [Evals](evals/index.md)
diff --git a/docs/evals/index.md b/docs/evals/index.md
new file mode 100644
index 0000000000..fd21ca38d0
--- /dev/null
+++ b/docs/evals/index.md
@@ -0,0 +1,286 @@
+---
+title: Pydantic Evals
+---
+
+# Pydantic Evals
+
+**Pydantic Evals** is a powerful evaluation framework for systematically testing and evaluating AI systems, from simple LLM calls to complex multi-agent applications.
+
+## Design Philosophy
+
+!!! note "Code-First Approach"
+    Pydantic Evals follows a code-first philosophy where all evaluation components are defined in Python. This differs from platforms with web-based configuration. You write and run evals in code, and can write the results to disk or view them in your terminal or in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
+
+!!! danger "Evals are an Emerging Practice"
+    Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored. We've designed Pydantic Evals to be flexible and useful without being too opinionated.
+
+
+## Quick Navigation
+
+**Getting Started:**
+
+- [Installation](#installation)
+- [Quick Start](quick-start.md)
+- [Core Concepts](core-concepts.md)
+
+**Evaluators:**
+
+- [Evaluators Overview](evaluators/overview.md) - Compare evaluator types and learn when to use each approach
+- [Built-in Evaluators](evaluators/built-in.md) - Complete reference for exact match, instance checks, and other ready-to-use evaluators
+- [LLM as a Judge](evaluators/llm-judge.md) - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
+- [Custom Evaluators](evaluators/custom.md) - Implement domain-specific scoring logic and custom evaluation metrics
+- [Span-Based Evaluation](evaluators/span-based.md) - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on _how_ the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
+
+**How-To Guides:**
+
+- [Logfire Integration](how-to/logfire-integration.md) - Visualize results
+- [Dataset Management](how-to/dataset-management.md) - Save, load, generate
+- [Concurrency & Performance](how-to/concurrency.md) - Control parallel execution
+- [Retry Strategies](how-to/retry-strategies.md) - Handle transient failures
+- [Metrics & Attributes](how-to/metrics-attributes.md) - Track custom data
+- [Case Lifecycle Hooks](how-to/lifecycle.md) - Per-case setup, teardown, and context enrichment
+
+**Examples:**
+
+- [Simple Validation](examples/simple-validation.md) - Basic example
+
+**Reference:**
+
+- [API Documentation](../api/pydantic_evals/dataset.md)
+
+## Code-First Evaluation
+
+Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
+
+When you run an _Experiment_, you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc.). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.
+
+If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer - you write and run evals in code, then view and analyze results in the web UI.
+
+## Installation
+
+To install the Pydantic Evals package, run:
+
+```bash
+pip/uv-add pydantic-evals
+```
+
+`pydantic-evals` does not depend on `pydantic-ai`, but has an optional dependency on `logfire` if you'd like to
+use OpenTelemetry traces in your evals, or send evaluation results to [Logfire](https://pydantic.dev/logfire).
+
+```bash
+pip/uv-add 'pydantic-evals[logfire]'
+```
+
+## Pydantic Evals Data Model
+
+Pydantic Evals is built around a simple data model:
+
+### Data Model Diagram
+
+```
+Dataset (1) ──────────── (Many) Case
+│                        │
+│                        │
+└─── (Many) Experiment ──┴─── (Many) Case results
+     │
+     └─── (1) Task
+     │
+     └─── (Many) Evaluator
+```
+
+### Key Relationships
+
+1. **Dataset → Cases**: One Dataset contains many Cases
+2. **Dataset → Experiments**: One Dataset can be used across many Experiments over time
+3. **Experiment → Case results**: One Experiment generates results by executing each Case
+4. **Experiment → Task**: One Experiment evaluates one defined Task
+5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all Cases, and Case-specific Evaluators against their respective Cases
+
+### Data Flow
+
+1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python
+2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
+3. **Cases run**: Each Case is executed against the Task
+4. **Evaluation**: Evaluators score the Task outputs for each Case
+5. **Results**: All Case results are collected into a summary report
+
+!!! note "A metaphor"
+
+    A useful metaphor (although not perfect) is to think of evals like a **Unit Testing** framework:
+
+    - **Cases + Evaluators** are your individual unit tests - each one
+    defines a specific scenario you want to test, complete with inputs
+    and expected outcomes. Just like a unit test, a case asks: _"Given
+    this input, does my system produce the right output?"_
+
+    - **Datasets** are like test suites - they are the scaffolding that holds your unit
+    tests together. They group related cases and define shared
+    evaluation criteria that should apply across all tests in the suite.
+
+    - **Experiments** are like running your entire test suite and getting a
+    report. When you execute `dataset.evaluate_sync(my_ai_function)`,
+    you're running all your cases against your AI system and
+    collecting the results - just like running `pytest` and getting a
+    summary of passes, failures, and performance metrics.
+
+    The key difference from traditional unit testing is that AI systems are
+    probabilistic. If you're type checking you'll still get a simple pass/fail,
+    but scores for text outputs are likely qualitative and/or categorical,
+    and more open to interpretation.
+
+For a deeper understanding, see [Core Concepts](core-concepts.md).
+
+## Datasets and Cases
+
+In Pydantic Evals, everything begins with [`Dataset`][pydantic_evals.dataset.Dataset]s and [`Case`][pydantic_evals.dataset.Case]s:
+
+- **[`Dataset`][pydantic_evals.dataset.Dataset]**: A collection of test Cases designed for the evaluation of a specific task or function
+- **[`Case`][pydantic_evals.dataset.Case]**: A single test scenario corresponding to Task inputs, with optional expected outputs, metadata, and case-specific evaluators
+
+```python {title="simple_eval_dataset.py"}
+from pydantic_evals import Case, Dataset
+
+case1 = Case(
+    name='simple_case',
+    inputs='What is the capital of France?',
+    expected_output='Paris',
+    metadata={'difficulty': 'easy'},
+)
+
+dataset = Dataset(name='capital_quiz', cases=[case1])
+```
+
+_(This example is complete, it can be run "as is")_
+
+See [Dataset Management](how-to/dataset-management.md) to learn about saving, loading, and generating datasets.
+
+## Evaluators
+
+[`Evaluator`][pydantic_evals.evaluators.Evaluator]s analyze and score the results of your Task when tested against a Case.
+
+These can be deterministic, code-based checks (such as testing model output format with a regex, or checking for the appearance of PII or sensitive data), or they can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations, or instruction-following.
+
+While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs.
+
+Pydantic Evals includes several [built-in evaluators](evaluators/built-in.md) and allows you to define [custom evaluators](evaluators/custom.md):
+
+```python {title="simple_eval_evaluator.py" requires="simple_eval_dataset.py"}
+from dataclasses import dataclass
+
+from pydantic_evals.evaluators import Evaluator, EvaluatorContext
+from pydantic_evals.evaluators.common import IsInstance
+
+from simple_eval_dataset import dataset
+
+dataset.add_evaluator(IsInstance(type_name='str'))  # (1)!
+
+
+@dataclass
+class MyEvaluator(Evaluator):
+    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:  # (2)!
+        if ctx.output == ctx.expected_output:
+            return 1.0
+        elif (
+            isinstance(ctx.output, str)
+            and ctx.expected_output.lower() in ctx.output.lower()
+        ):
+            return 0.8
+        else:
+            return 0.0
+
+
+dataset.add_evaluator(MyEvaluator())
+```
+
+1. You can add built-in evaluators to a dataset using the [`add_evaluator`][pydantic_evals.dataset.Dataset.add_evaluator] method.
+2. This custom evaluator returns a simple score based on whether the output matches the expected output.
+
+_(This example is complete, it can be run "as is")_
+
+Learn more:
+
+- [Evaluators Overview](evaluators/overview.md) - When to use different types
+- [Built-in Evaluators](evaluators/built-in.md) - Complete reference
+- [LLM Judge](evaluators/llm-judge.md) - Using LLMs as evaluators
+- [Custom Evaluators](evaluators/custom.md) - Write your own logic
+- [Span-Based Evaluation](evaluators/span-based.md) - Analyze execution traces
+
+## Running Experiments
+
+Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment".
+
+Putting the above two examples together and using the more declarative `evaluators` kwarg to [`Dataset`][pydantic_evals.dataset.Dataset]:
+
+```python {title="simple_eval_complete.py"}
+from pydantic_evals import Case, Dataset
+from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance
+
+case1 = Case(  # (1)!
+    name='simple_case',
+    inputs='What is the capital of France?',
+    expected_output='Paris',
+    metadata={'difficulty': 'easy'},
+)
+
+
+class MyEvaluator(Evaluator[str, str]):
+    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
+        if ctx.output == ctx.expected_output:
+            return 1.0
+        elif (
+            isinstance(ctx.output, str)
+            and ctx.expected_output.lower() in ctx.output.lower()
+        ):
+            return 0.8
+        else:
+            return 0.0
+
+
+dataset = Dataset(
+    name='capital_quiz',
+    cases=[case1],
+    evaluators=[IsInstance(type_name='str'), MyEvaluator()],  # (2)!
+)
+
+
+async def guess_city(question: str) -> str:  # (3)!
+    return 'Paris'
+
+
+report = dataset.evaluate_sync(guess_city)  # (4)!
+report.print(include_input=True, include_output=True, include_durations=False)  # (5)!
+"""
+                              Evaluation Summary: guess_city
+┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ Case ID     ┃ Inputs                         ┃ Outputs ┃ Scores            ┃ Assertions ┃
+┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
+│ simple_case │ What is the capital of France? │ Paris   │ MyEvaluator: 1.00 │ ✔          │
+├─────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┤
+│ Averages    │                                │         │ MyEvaluator: 1.00 │ 100.0% ✔   │
+└─────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┘
+"""
+```
+
+1. Create a [test case][pydantic_evals.dataset.Case] as above
+2. Create a [`Dataset`][pydantic_evals.dataset.Dataset] with test cases and [`evaluators`][pydantic_evals.dataset.Dataset.evaluators]
+3. Our function to evaluate.
+4. Run the evaluation with [`evaluate_sync`][pydantic_evals.dataset.Dataset.evaluate_sync], which runs the function against all test cases in the dataset, and returns an [`EvaluationReport`][pydantic_evals.reporting.EvaluationReport] object.
+5. Print the report with [`print`][pydantic_evals.reporting.EvaluationReport.print], which shows the results of the evaluation. We have omitted duration here just to keep the printed output from changing from run to run.
+
+_(This example is complete, it can be run "as is")_
+
+See [Quick Start](quick-start.md) for more examples and [Concurrency & Performance](how-to/concurrency.md) to learn about controlling parallel execution.
+
+## API Reference
+
+For comprehensive coverage of all classes, methods, and configuration options, see the detailed [API Reference documentation](https://ai.pydantic.dev/api/pydantic_evals/dataset/).
+
+## Next Steps
+
+<!-- TODO - this would be the perfect place for a full tutorial or case study  -->
+1. **Start with simple evaluations** using [Quick Start](quick-start.md)
+2. **Understand the data model** with [Core Concepts](core-concepts.md)
+3. **Explore built-in evaluators** in [Built-in Evaluators](evaluators/built-in.md)
+4. **Integrate with Logfire** for visualization: [Logfire Integration](how-to/logfire-integration.md)
+5. **Build comprehensive test suites** with [Dataset Management](how-to/dataset-management.md)
+6. **Implement custom evaluators** for domain-specific metrics: [Custom Evaluators](evaluators/custom.md)
diff --git a/docs/index.md b/docs/index.md
index 1de6301d11..24d23572c7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,7 +27,7 @@ Tightly [integrates](logfire.md) with [Pydantic Logfire](https://pydantic.dev/lo
 Designed to give your IDE or AI coding agent as much context as possible for auto-completion and [type checking](agent.md#static-type-checking), moving entire classes of errors from runtime to write-time for a bit of that Rust "if it compiles, it works" feel.
 
 5. **Powerful Evals**:
-Enables you to systematically test and [evaluate](evals.md) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
+Enables you to systematically test and [evaluate](evals/index.md) the performance and accuracy of the agentic systems you build, and monitor the performance over time in Pydantic Logfire.
 
 6. **Extensible by Design**:
 Build agents from composable [capabilities](capabilities.md) that bundle tools, hooks, instructions, and model settings into reusable units. Use built-in capabilities for [web search](capabilities.md#provider-adaptive-tools), [thinking](capabilities.md#thinking), and [MCP](capabilities.md#provider-adaptive-tools), pick from the [Pydantic AI Harness](harness/overview.md) capability library, build your own, or install [third-party capability packages](extensibility.md). Define agents entirely in [YAML/JSON](agent-spec.md) — no code required.
diff --git a/docs/install.md b/docs/install.md
index fc8c63f7ef..70bba0f797 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -41,7 +41,7 @@ pip/uv-add "pydantic-ai-slim[openai]"
 `pydantic-ai-slim` has the following optional groups:
 
 * `logfire` — installs [Pydantic Logfire](logfire.md) dependency `logfire` [PyPI ↗](https://pypi.org/project/logfire){:target="_blank"}
-* `evals` — installs [Pydantic Evals](evals.md) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
+* `evals` — installs [Pydantic Evals](evals/index.md) dependency `pydantic-evals` [PyPI ↗](https://pypi.org/project/pydantic-evals){:target="_blank"}
 * `openai` — installs [OpenAI Model](models/openai.md) dependency `openai` [PyPI ↗](https://pypi.org/project/openai){:target="_blank"}
 * `vertexai` — installs `GoogleVertexProvider` dependencies `google-auth` [PyPI ↗](https://pypi.org/project/google-auth){:target="_blank"} and `requests` [PyPI ↗](https://pypi.org/project/requests){:target="_blank"}
 * `google` — installs [Google Model](models/google.md) dependency `google-genai` [PyPI ↗](https://pypi.org/project/google-genai){:target="_blank"}
diff --git a/docs/version-policy.md b/docs/version-policy.md
index 9db735a19a..a527da1145 100644
--- a/docs/version-policy.md
+++ b/docs/version-policy.md
@@ -12,7 +12,7 @@ The following changes will **NOT** be considered breaking changes, and may occur
 
 * Bug fixes that may result in existing code breaking, provided that such code was relying on undocumented features/constructs/assumptions.
 * Adding new [message parts][pydantic_ai.messages], [stream events][pydantic_ai.messages.AgentStreamEvent], or optional fields (including fields with default values) on existing message (part) and event types. Always code defensively when consuming message parts or event streams, and use the [`ModelMessagesTypeAdapter`][pydantic_ai.messages.ModelMessagesTypeAdapter] to (de)serialize message histories.
-* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals.md) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
+* Changing OpenTelemetry span attributes. Because different [observability platforms](logfire.md#using-opentelemetry) support different versions of the [OpenTelemetry Semantic Conventions for Generative AI systems](https://opentelemetry.io/docs/specs/semconv/gen-ai/), Pydantic AI lets you configure the [instrumentation version](logfire.md#configuring-data-format), but the default version may change in a minor release. Span attributes for [Pydantic Evals](evals/index.md) may also change as we iterate on Evals support in [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
 * Changing how `__repr__` behaves, even of public classes.
 
 In all cases we will aim to minimize churn and do so only when justified by the increase of quality of Pydantic AI for users.
diff --git a/mkdocs.yml b/mkdocs.yml
index 681a18fc50..391c880af7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -71,7 +71,7 @@ nav:
       - Code Mode: harness/code-mode.md
 
   - Pydantic Evals:
-      - Overview: evals.md
+      - Overview: evals/index.md
       - Getting Started:
           - Quick Start: evals/quick-start.md
           - Core Concepts: evals/core-concepts.md
@@ -415,7 +415,7 @@ plugins:
         API Reference:
           - api/*.md
         Evals:
-          - evals.md
+          - evals/index.md
           - evals/*.md
         Durable Execution:
           - durable_execution/*.md