chore: allow additional fields to EvaluationData and flexible experiment report type#232

Open
poshinchen wants to merge 2 commits into strands-agents:main from poshinchen:chore/add-kwargs-and-report-type

Conversation

@poshinchen poshinchen commented May 22, 2026

Contributor

Description

Two related extensibility changes to Case / EvaluationData and Experiment so subclasses can carry typed fields end-to-end and customize the report element type without fighting overrides.

1. EvaluationData accepts extra fields from Case subclasses.

  • Added model_config = ConfigDict(extra="allow") to EvaluationData.
  • Experiment._run_task_async now passes any subclass-only Case fields through via getattr (not model_dump, so nested BaseModels reach the evaluator with their types intact).
  • A Case subclass that adds a typed field (e.g. config: SomeModel) no longer needs to flatten that data into metadata — the typed object reaches the evaluator directly.
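The passthrough described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Case` and `EvaluationData` here are simplified stand-ins for the library classes, and `SomeModel`, `ConfiguredCase`, and `subclass_only_fields` are hypothetical names introduced for the example.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, ConfigDict


class SomeModel(BaseModel):
    # Hypothetical typed payload carried by a Case subclass.
    label: str
    weight: float


class Case(BaseModel):
    # Simplified stand-in for the library's Case.
    name: str
    input: str


class EvaluationData(BaseModel):
    # Simplified stand-in for the library's EvaluationData.
    # extra="allow" lets undeclared keyword arguments be stored on the instance.
    model_config = ConfigDict(extra="allow")

    input: str
    actual_output: Optional[str] = None


class ConfiguredCase(Case):
    # Subclass-only typed field that should reach the evaluator unchanged.
    config: SomeModel


def subclass_only_fields(case: Case) -> Dict[str, Any]:
    """Collect fields declared on a Case subclass but not on Case itself."""
    extra_names = set(type(case).model_fields) - set(Case.model_fields)
    # getattr keeps nested BaseModel instances as objects;
    # model_dump() would convert them to plain dicts.
    return {name: getattr(case, name) for name in extra_names}


case = ConfiguredCase(name="c1", input="hi", config=SomeModel(label="x", weight=0.7))
data = EvaluationData(
    input=case.input,
    actual_output="hello",
    **subclass_only_fields(case),
)
```

With this shape, an evaluator can read `data.config.label` directly instead of digging the value out of `metadata`.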

2. Experiment is generic over its report element type.

  • New ReportT = TypeVar("ReportT", bound=EvaluationReport, default=EvaluationReport).
  • Experiment[InputT, OutputT, ReportT] with a report_cls: type[ReportT] = EvaluationReport class attribute.
  • run_evaluations and run_evaluations_async return list[ReportT].
  • Subclasses bind ReportT = MyReport and set report_cls = MyReport; no # type: ignore[override] needed.
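A rough sketch of the generic pattern in item 2, assuming plain stand-in classes rather than the library's real `Experiment` and `EvaluationReport`. `MyReport`, `MyExperiment`, and the hard-coded scores are hypothetical; the `type: ignore[assignment]` on the default mirrors the one the review comments mention, but its exact placement in the PR is an assumption.

```python
from typing import Generic, List, Type

from typing_extensions import TypeVar  # PEP 696 defaults via typing_extensions


class EvaluationReport:
    # Stand-in for the library's report type.
    def __init__(self, scores: List[float]) -> None:
        self.scores = scores


class MyReport(EvaluationReport):
    # Hypothetical custom report with one extra helper.
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)


InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")
# Two-argument annotations such as Experiment[str, str] fall back to this default.
ReportT = TypeVar("ReportT", bound=EvaluationReport, default=EvaluationReport)


class Experiment(Generic[InputT, OutputT, ReportT]):
    # The default is only correct when ReportT is EvaluationReport, which a
    # type checker cannot prove, hence the ignore.
    report_cls: Type[ReportT] = EvaluationReport  # type: ignore[assignment]

    def run_evaluations(self) -> List[ReportT]:
        # Build reports through report_cls so subclasses control the element type.
        return [self.report_cls(scores=[1.0, 0.5])]


class MyExperiment(Experiment[str, str, MyReport]):
    # Bind ReportT to MyReport and swap the class used to build reports.
    report_cls = MyReport


default_reports = Experiment().run_evaluations()
custom_reports = MyExperiment().run_evaluations()
```

Here `custom_reports` is typed as `list[MyReport]`, so `custom_reports[0].mean()` type-checks without an override annotation.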

Related Issues

n/a

Documentation PR

n/a

Type of Change

New feature (extensibility hooks; backwards-compatible)

Testing

  • hatch test — 1069 unit tests pass (1065 existing + 4 new)

  • mypy src/strands_evals/experiment.py tests/strands_evals/test_experiment.py — clean

  • New tests cover both extension points: typed subclass-field passthrough into EvaluationData, an evaluator reading that typed field end-to-end, the default report_cls behavior, and a subclass swapping report_cls.

  • Default callers (Experiment[str, str], Experiment(cases=..., evaluators=...)) unchanged at runtime and in type-checker output thanks to the PEP 696 TypeVar default.

  • I ran hatch run prepare

Backwards compatibility

Runtime: identical for default usage. Static typing: existing Experiment[str, str] annotations keep working because ReportT defaults via PEP 696 (already supported by the repo's pinned typing_extensions>=4.13.2). Modern mypy/pyright handle this; only callers on outdated type-checkers would see a "too few type arguments" warning.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@poshinchen poshinchen changed the title from "chore: allow additional fields to EvaluatioonData and flexible experi…" to "chore: allow additional fields to EvaluatioonData and flexible experiment report type" May 22, 2026
@poshinchen poshinchen changed the title from "chore: allow additional fields to EvaluatioonData and flexible experiment report type" to "chore: allow additional fields to EvaluationData and flexible experiment report type" May 22, 2026
Comment thread src/strands_evals/experiment.py Outdated
Comment thread src/strands_evals/types/evaluation.py
@github-actions

Issue: Missing tests for the two new features introduced in this PR.

  1. Extra fields passthrough: No test verifies that a Case subclass with custom fields correctly propagates those fields to EvaluationData via extra_fields.
  2. ReportT / report_cls: No test verifies that a subclass of Experiment can bind ReportT to a custom report type and that run_evaluations returns instances of that type.

Per AGENTS.md: "new code needs a mirrored test file in tests/strands_evals/..."

Suggestion: Add at least:

  • A test with a Case subclass that has an extra field, runs through _run_task_async, and asserts the field appears on the resulting EvaluationData
  • A test with a custom Experiment subclass that sets report_cls to a custom EvaluationReport subclass and verifies the returned reports are of the correct type

Comment thread src/strands_evals/experiment.py
Comment thread src/strands_evals/experiment.py
@github-actions

Assessment: Request Changes

This PR introduces two useful extensibility mechanisms: extra field passthrough from Case subclasses to EvaluationData, and a customizable report type via ReportT/report_cls. However, there are gaps that should be addressed before merging.

Review Categories
  • Testing: No tests exist for either new feature. Both the extra fields mechanism and the report_cls customization pattern need unit test coverage.
  • Safety: The extra fields passthrough has a collision risk where a Case subclass field could silently override an EvaluationData declared field. A guard against this would prevent subtle bugs.
  • Documentation: The PR description is empty — the motivation, use cases, and backward compatibility implications should be documented for reviewers and future maintainers.
  • API Design: The ReportT addition with TypeVar(default=...) is backward compatible thanks to typing-extensions>=4.13.2, but should be explicitly called out as a public API surface change.

The extensibility patterns themselves are well-designed and align with the project's goals.

@poshinchen poshinchen force-pushed the chore/add-kwargs-and-report-type branch from e4f85a5 to f17db97 Compare May 22, 2026 17:02
…e parse

The previous explanatory comment started with "type: ignore[assignment]:"
which mypy parsed as a malformed inline directive. Rephrase the leading
words so mypy treats it as a plain comment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

Assessment: Approve

All concerns from the previous review have been addressed: collision protection added, docstrings document the extra="allow" tradeoff, the type: ignore is well-commented, the PR description explains motivation and backward compatibility, and 4 new tests cover both extension points end-to-end.

The changes are minimal (+29/-9 in production code), backward-compatible (PEP 696 TypeVar default), and well-tested. LGTM.

@github-actions

Assessment: Approve

All issues from the previous review have been addressed. The PR is well-scoped, backward-compatible, and now includes comprehensive tests covering both extension points.

Changes verified
  • Collision guard: Extra fields filter now excludes EvaluationData.model_fields (prevents silent overwrites)
  • Documentation: Docstring documents the extra="allow" tradeoff; type: ignore has a clear 3-line rationale
  • Tests: 4 new tests verify typed passthrough, evaluator access to extra fields, default report_cls behavior, and subclass report type swapping
  • Backward compat: PEP 696 TypeVar default ensures Experiment[str, str] continues to work

Clean, well-designed extensibility hooks.

@github-actions

Assessment: Approve

Clean, minimal extensibility changes that are well-tested and backward-compatible. Previous review concerns have been fully addressed.

Verification
  • Collision guard: Extra field passthrough filters against both Case.model_fields and EvaluationData.model_fields — prevents silent overwrites
  • Type safety: ReportT uses PEP 696 TypeVar(default=...) backed by typing-extensions>=4.13.2 — existing Experiment[str, str] annotations continue working
  • Testing: 4 new tests cover typed passthrough, evaluator access to extras, default report type, and custom report subclass
  • Documentation: Docstring documents the extra="allow" tradeoff; type: ignore is well-justified
  • Style: New logging at line 464 uses %s interpolation per AGENTS.md; no new f-string logger calls introduced
  • No new dependencies: Uses existing pydantic, typing_extensions
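The collision guard noted in the verification list could look something like the sketch below. It assumes simplified stand-ins for `Case` and `EvaluationData`; `ClashingCase` and `passthrough_fields` are hypothetical names, and the real filter in `_run_task_async` may be structured differently.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, ConfigDict


class Case(BaseModel):
    # Simplified stand-in for the library's Case.
    name: str
    input: str


class EvaluationData(BaseModel):
    # Simplified stand-in for the library's EvaluationData.
    model_config = ConfigDict(extra="allow")

    input: str
    actual_output: Optional[str] = None


class ClashingCase(Case):
    # Same name as a declared EvaluationData field: must not be passed through.
    actual_output: str = "from-case"
    # Genuinely new field: safe to pass through.
    tag: str = "smoke"


def passthrough_fields(case: Case) -> Dict[str, Any]:
    """Return subclass-only fields that cannot shadow a declared field."""
    reserved = set(Case.model_fields) | set(EvaluationData.model_fields)
    return {
        name: getattr(case, name)
        for name in type(case).model_fields
        if name not in reserved
    }


case = ClashingCase(name="c1", input="hi")
extras = passthrough_fields(case)
# Without the filter, actual_output would be supplied twice here.
data = EvaluationData(input=case.input, actual_output="from-task", **extras)
```

Filtering against both field sets means the task's own `actual_output` wins and only `tag` travels as an extra.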

Well-scoped PR with good attention to backward compatibility.

rejected — the tradeoff for supporting subclass passthrough.
"""

model_config = ConfigDict(extra="allow")

Collaborator

Let's double-check that there's not a subtle gotcha lurking here.

The pydantic extras are preserved as live objects in memory, but they have no field annotation, so model_validate can't reconstruct their type. Any EvaluationDataStore round-trips via model_dump/model_validate, so on a cache hit a typed config: SomeModel comes back as a plain dict:

  FRESH path : config -> SomeModel   (config.label works)
  CACHED path: config -> dict        (config.label raises AttributeError)

So an evaluator written with evaluation_case.config.label passes on first run and raises an exception on the second once evaluation_data_store= is set. The new tests only cover the cache-miss path.

Please either:

  1. add a cache-path test (save → load → assert the evaluator still reads the field), and document that extras survive as dict through a store; or
  2. have subclasses declare the field on a typed EvaluationData subclass instead of relying on extra="allow".

Minimal repro:

  ed = EvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
  EvaluationData.model_validate(ed.model_dump()).config  # -> dict, not SomeModel
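A self-contained version of the repro, extended to contrast it with option 2. `EvaluationData` is a simplified stand-in for the library class, `SomeModel` follows the shape used in the repro, and `TypedEvaluationData` is a hypothetical subclass that declares the field instead of relying on `extra="allow"`.

```python
from pydantic import BaseModel, ConfigDict


class SomeModel(BaseModel):
    label: str
    weight: float


class EvaluationData(BaseModel):
    # Simplified stand-in; extras are stored but carry no type annotation.
    model_config = ConfigDict(extra="allow")

    input: str


class TypedEvaluationData(EvaluationData):
    # Option 2: declare the field so validation can rebuild the model.
    config: SomeModel


# Extras path: the live object is typed, the round-tripped one is not.
fresh = EvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
cached = EvaluationData.model_validate(fresh.model_dump())

# Declared-field path: the type survives the same round trip.
typed_fresh = TypedEvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
typed_cached = TypedEvaluationData.model_validate(typed_fresh.model_dump())
```

After the round trip, `cached.config` is a plain dict, so attribute access such as `cached.config.label` fails, while `typed_cached.config.label` still works.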

Contributor Author

Yes. I don't think this is a blocker for now, but we definitely need to think about how to recover the typed fields when loading EvaluationData from a store.

@yonib05 yonib05 added area-core Core eval framework: Case, Experiment, task handler, evaluation data stores chore Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact labels Jun 11, 2026

3 participants