chore: allow additional fields to EvaluationData and flexible experiment report type by poshinchen · Pull Request #232 · strands-agents/evals

poshinchen · 2026-05-22T16:51:37Z

Description

Two related extensibility changes to Case / EvaluationData and Experiment so subclasses can carry typed fields end-to-end and customize the report element type without fighting overrides.

1. EvaluationData accepts extra fields from Case subclasses.

Added model_config = ConfigDict(extra="allow") to EvaluationData.
Experiment._run_task_async now passes any subclass-only Case fields through via getattr (not model_dump, so nested BaseModels reach the evaluator with their types intact).
A Case subclass that adds a typed field (e.g. config: SomeModel) no longer needs to flatten that data into metadata — the typed object reaches the evaluator directly.
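The difference between getattr and model_dump passthrough can be seen in a minimal sketch. The Case and EvaluationData classes below are simplified stand-ins rather than the real strands_evals classes, and subclass_only_fields is a hypothetical helper name, not the code in Experiment._run_task_async.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, ConfigDict


class SomeModel(BaseModel):
    label: str
    weight: float


class Case(BaseModel):
    """Simplified stand-in for the library's Case (assumption)."""

    input: str
    metadata: Optional[Dict[str, Any]] = None


class EvaluationData(BaseModel):
    """Simplified stand-in for the library's EvaluationData (assumption)."""

    model_config = ConfigDict(extra="allow")

    input: str
    actual_output: Optional[str] = None


class ConfiguredCase(Case):
    # Typed field that exists only on the subclass.
    config: SomeModel


def subclass_only_fields(case: Case) -> Dict[str, Any]:
    """Hypothetical helper: collect fields declared on the subclass but not on Case.

    getattr returns the live attribute, so nested models keep their type.
    """
    return {
        name: getattr(case, name)
        for name in type(case).model_fields
        if name not in Case.model_fields
    }


case = ConfiguredCase(input="hi", config=SomeModel(label="x", weight=0.7))

# model_dump flattens the nested model into a plain dict.
dumped = case.model_dump()

# getattr keeps the SomeModel instance; extra="allow" lets EvaluationData accept it.
data = EvaluationData(input=case.input, actual_output="hello", **subclass_only_fields(case))
```

With this shape, an evaluator can read data.config.label directly, whereas the model_dump route would hand it dumped["config"]["label"].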

2. Experiment is generic over its report element type.

New ReportT = TypeVar("ReportT", bound=EvaluationReport, default=EvaluationReport).
Experiment[InputT, OutputT, ReportT] with a report_cls: type[ReportT] = EvaluationReport class attribute.
run_evaluations and run_evaluations_async return list[ReportT].
Subclasses bind ReportT = MyReport and set report_cls = MyReport; no # type: ignore[override] needed.
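The report_cls pattern described above can be sketched with the standard library alone. Everything here is an illustrative stand-in: the real Experiment takes cases and evaluators, and the real ReportT carries a PEP 696 default, which this sketch leaves out so that it runs without typing_extensions.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Type, TypeVar


@dataclass
class EvaluationReport:
    """Stand-in report type (assumption, not the library class)."""

    scores: List[float] = field(default_factory=list)


@dataclass
class MyReport(EvaluationReport):
    """Custom report that adds a derived metric."""

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)


InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")
# The PR also sets default=EvaluationReport (PEP 696); omitted here.
ReportT = TypeVar("ReportT", bound=EvaluationReport)


class Experiment(Generic[InputT, OutputT, ReportT]):
    # Class attribute that decides which report type gets instantiated.
    report_cls: Type[ReportT] = EvaluationReport  # type: ignore[assignment]

    def __init__(self, scores_per_evaluator: List[List[float]]) -> None:
        # Hypothetical constructor: pre-computed scores stand in for real evaluations.
        self._scores = scores_per_evaluator

    def run_evaluations(self) -> List[ReportT]:
        # One report per evaluator, built from whatever report_cls is bound to.
        return [self.report_cls(scores=list(s)) for s in self._scores]


class MyExperiment(Experiment[str, str, MyReport]):
    # Binding ReportT in the base list and setting report_cls keeps the
    # return type list[MyReport] without overriding run_evaluations.
    report_cls = MyReport


default_reports = Experiment([[1.0]]).run_evaluations()
custom_reports = MyExperiment([[1.0, 0.5]]).run_evaluations()
```

Because the return type is expressed through ReportT, a type checker sees custom_reports as a list of MyReport, so calling mean() on an element needs no cast.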

Related Issues

n/a

Documentation PR

n/a

Type of Change

New feature (extensibility hooks; backwards-compatible)

Testing

hatch test — 1069 unit tests pass (1065 existing + 4 new)
mypy src/strands_evals/experiment.py tests/strands_evals/test_experiment.py — clean
New tests cover both extension points: typed subclass-field passthrough into EvaluationData, an evaluator reading that typed field end-to-end, the default report_cls behavior, and a subclass swapping report_cls.
Default callers (Experiment[str, str], Experiment(cases=..., evaluators=...)) unchanged at runtime and in type-checker output thanks to the PEP 696 TypeVar default.
I ran hatch run prepare

Backwards compatibility

Runtime: identical for default usage. Static typing: existing Experiment[str, str] annotations keep working because ReportT defaults via PEP 696 (already supported by the repo's pinned typing_extensions>=4.13.2). Modern mypy/pyright handle this; only callers on outdated type-checkers would see a "too few type arguments" warning.

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2026-05-22T16:55:37Z

Issue: Missing tests for the two new features introduced in this PR.

Extra fields passthrough: No test verifies that a Case subclass with custom fields correctly propagates those fields to EvaluationData via extra_fields.
ReportT / report_cls: No test verifies that a subclass of Experiment can bind ReportT to a custom report type and that run_evaluations returns instances of that type.

Per AGENTS.md: "new code needs a mirrored test file in tests/strands_evals/..."

Suggestion: Add at least:

A test with a Case subclass that has an extra field, runs through _run_task_async, and asserts the field appears on the resulting EvaluationData
A test with a custom Experiment subclass that sets report_cls to a custom EvaluationReport subclass and verifies the returned reports are of the correct type

github-actions · 2026-05-22T16:55:41Z

Assessment: Request Changes

This PR introduces two useful extensibility mechanisms: extra field passthrough from Case subclasses to EvaluationData, and a customizable report type via ReportT/report_cls. However, there are gaps that should be addressed before merging.

Review Categories

Testing: No tests exist for either new feature. Both the extra fields mechanism and the report_cls customization pattern need unit test coverage.
Safety: The extra fields passthrough has a collision risk where a Case subclass field could silently override an EvaluationData declared field. A guard against this would prevent subtle bugs.
Documentation: The PR description is empty — the motivation, use cases, and backward compatibility implications should be documented for reviewers and future maintainers.
API Design: The ReportT addition with TypeVar(default=...) is backward compatible thanks to typing-extensions>=4.13.2, but should be explicitly called out as a public API surface change.

The extensibility patterns themselves are well-designed and align with the project's goals.

…ent report type

…e parse The previous explanatory comment started with "type: ignore[assignment]:" which mypy parsed as a malformed inline directive. Rephrase the leading words so mypy treats it as a plain comment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-22T17:06:45Z

Assessment: Approve

All concerns from the previous review have been addressed: collision protection added, docstrings document the extra="allow" tradeoff, the type: ignore is well-commented, the PR description explains motivation and backward compatibility, and 4 new tests cover both extension points end-to-end.

The changes are minimal (+29/-9 in production code), backward-compatible (PEP 696 TypeVar default), and well-tested. LGTM.

github-actions · 2026-05-22T17:07:09Z

Assessment: Approve

All issues from the previous review have been addressed. The PR is well-scoped, backward-compatible, and now includes comprehensive tests covering both extension points.

Changes verified

Collision guard: Extra fields filter now excludes EvaluationData.model_fields (prevents silent overwrites)
Documentation: Docstring documents the extra="allow" tradeoff; type: ignore has a clear 3-line rationale
Tests: 4 new tests verify typed passthrough, evaluator access to extra fields, default report_cls behavior, and subclass report type swapping
Backward compat: PEP 696 TypeVar default ensures Experiment[str, str] continues to work

Clean, well-designed extensibility hooks.

github-actions · 2026-05-22T17:20:53Z

Assessment: Approve

Clean, minimal extensibility changes that are well-tested and backward-compatible. Previous review concerns have been fully addressed.

Verification

Collision guard: Extra field passthrough filters against both Case.model_fields and EvaluationData.model_fields — prevents silent overwrites
Type safety: ReportT uses PEP 696 TypeVar(default=...) backed by typing-extensions>=4.13.2 — existing Experiment[str, str] annotations continue working
Testing: 4 new tests cover typed passthrough, evaluator access to extras, default report type, and custom report subclass
Documentation: Docstring documents the extra="allow" tradeoff; type: ignore is well-justified
Style: New logging at line 464 uses %s interpolation per AGENTS.md; no new f-string logger calls introduced
No new dependencies: Uses existing pydantic, typing_extensions
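A rough sketch of the collision guard described in this review, assuming the filter is a set difference against both models' declared fields. The class bodies and the extra_fields helper are hypothetical; only the filtering idea comes from the review text.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, ConfigDict


class Case(BaseModel):
    """Simplified stand-in for the library's Case (assumption)."""

    input: str
    expected_output: Optional[str] = None


class EvaluationData(BaseModel):
    """Simplified stand-in for the library's EvaluationData (assumption)."""

    model_config = ConfigDict(extra="allow")

    input: str
    actual_output: Optional[str] = None


class ClashingCase(Case):
    # Collides with a field that EvaluationData declares itself.
    actual_output: str = "from the case"
    # Genuinely new field that should pass through.
    difficulty: int = 3


def extra_fields(case: Case) -> Dict[str, Any]:
    """Hypothetical helper: pass through only names neither base model declares."""
    reserved = set(Case.model_fields) | set(EvaluationData.model_fields)
    return {
        name: getattr(case, name)
        for name in type(case).model_fields
        if name not in reserved
    }


case = ClashingCase(input="hi")
extras = extra_fields(case)

# The task's output wins; the clashing subclass field never reaches the kwargs.
data = EvaluationData(input=case.input, actual_output="from the task", **extras)
```

Without the EvaluationData.model_fields half of the filter, the clashing name would arrive twice in the constructor call, which is the overwrite scenario the earlier review flagged.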

Well-scoped PR with good attention to backward compatibility.

jjbuck · 2026-06-01T16:23:37Z

+    rejected — the tradeoff for supporting subclass passthrough.
    """

+    model_config = ConfigDict(extra="allow")


Let's double-check that there's not a subtle gotcha lurking here.

The pydantic extras are preserved as live objects in memory, but they have no field annotation, so model_validate can't reconstruct their type. Any EvaluationDataStore round-trips via model_dump/model_validate, so on a cache hit a typed config: SomeModel comes back as a plain dict:

FRESH path : config -> SomeModel (config.label works)
CACHED path: config -> dict (config.label raises AttributeError)

So an evaluator written with evaluation_case.config.label passes on first run and raises an exception on the second once evaluation_data_store= is set. The new tests only cover the cache-miss path.

Please either:

add a cache-path test (save → load → assert the evaluator still reads the field), and document that extras survive as dict through a store; or

have subclasses declare the field on a typed EvaluationData subclass instead of relying on extra="allow".

Minimal repro:

ed = EvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
EvaluationData.model_validate(ed.model_dump()).config  # -> dict, not SomeModel
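The round trip described in this comment, together with the second suggested remedy, can be sketched as follows. EvaluationData is a simplified stand-in, and ConfiguredEvaluationData is a hypothetical name for a typed subclass that declares the field. No real EvaluationDataStore is involved; model_dump followed by model_validate stands in for a save and load.

```python
from pydantic import BaseModel, ConfigDict


class SomeModel(BaseModel):
    label: str
    weight: float


class EvaluationData(BaseModel):
    """Simplified stand-in for the library's EvaluationData (assumption)."""

    model_config = ConfigDict(extra="allow")

    input: str


class ConfiguredEvaluationData(EvaluationData):
    # Declaring the field gives model_validate an annotation to rebuild from.
    config: SomeModel


# Extras path: the field has no annotation on EvaluationData.
fresh = EvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
cached = EvaluationData.model_validate(fresh.model_dump())

# Declared path: the same data, but the subclass annotates config.
typed_fresh = ConfiguredEvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
typed_cached = ConfiguredEvaluationData.model_validate(typed_fresh.model_dump())
```

On the extras path an evaluator would have to read cached.config["label"] after a cache hit, while on the declared path typed_cached.config.label keeps working.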

Yes, I think this is not a blocker for now, but we definitely need to think about how to recover the original types when EvaluationData is loaded back.

poshinchen temporarily deployed to auto-approve May 22, 2026 16:51 — with GitHub Actions Inactive

poshinchen changed the title ~~chore: allow additional fields to EvaluatioonData and flexible experi…~~ chore: allow additional fields to EvaluatioonData and flexible experiment report type May 22, 2026

poshinchen changed the title ~~chore: allow additional fields to EvaluatioonData and flexible experiment report type~~ chore: allow additional fields to EvaluationData and flexible experiment report type May 22, 2026

github-actions Bot added the strands-running label May 22, 2026

github-actions Bot reviewed May 22, 2026

Comment thread src/strands_evals/experiment.py Outdated

github-actions Bot reviewed May 22, 2026

Comment thread src/strands_evals/types/evaluation.py

github-actions Bot reviewed May 22, 2026

Comment thread src/strands_evals/experiment.py

github-actions Bot reviewed May 22, 2026

Comment thread src/strands_evals/experiment.py

github-actions Bot removed the strands-running label May 22, 2026

poshinchen force-pushed the chore/add-kwargs-and-report-type branch from e4f85a5 to f17db97 Compare May 22, 2026 17:02

chore: allow additional fields to EvaluationData and flexible experim…

280b176

…ent report type

poshinchen temporarily deployed to auto-approve May 22, 2026 17:02 — with GitHub Actions Inactive

poshinchen force-pushed the chore/add-kwargs-and-report-type branch from f17db97 to 280b176 Compare May 22, 2026 17:02

poshinchen temporarily deployed to auto-approve May 22, 2026 17:02 — with GitHub Actions Inactive

github-actions Bot added the strands-running label May 22, 2026

poshinchen temporarily deployed to auto-approve May 22, 2026 17:04 — with GitHub Actions Inactive

github-actions Bot removed the strands-running label May 22, 2026

poshinchen mentioned this pull request May 22, 2026

feat(redteam): add built-in red teaming support #184

Merged

6 tasks

jjbuck reviewed Jun 1, 2026

poshinchen mentioned this pull request Jun 2, 2026

fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives #240

Merged

yonib05 added the area-core (Core eval framework: Case, Experiment, task handler, evaluation data stores) and chore (Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact) labels Jun 11, 2026

poshinchen wants to merge 2 commits into strands-agents:main from poshinchen:chore/add-kwargs-and-report-type


3 participants
