chore: allow additional fields to EvaluationData and flexible experiment report type#232
chore: allow additional fields to EvaluationData and flexible experiment report type#232poshinchen wants to merge 2 commits into
Conversation
|
Issue: Missing tests for the two new features introduced in this PR.
Per AGENTS.md: "new code needs a mirrored test file in Suggestion: Add at least:
|
|
Assessment: Request Changes This PR introduces two useful extensibility mechanisms: extra field passthrough from Review Categories
The extensibility patterns themselves are well-designed and align with the project's goals. |
e4f85a5 to
f17db97
Compare
f17db97 to
280b176
Compare
…e parse The previous explanatory comment started with "type: ignore[assignment]:" which mypy parsed as a malformed inline directive. Rephrase the leading words so mypy treats it as a plain comment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Assessment: Approve All concerns from the previous review have been addressed: collision protection added, docstrings document the The changes are minimal (+29/-9 in production code), backward-compatible (PEP 696 TypeVar default), and well-tested. LGTM. |
|
Assessment: Approve All issues from the previous review have been addressed. The PR is well-scoped, backward-compatible, and now includes comprehensive tests covering both extension points. Changes verified
Clean, well-designed extensibility hooks. |
|
Assessment: Approve Clean, minimal extensibility changes that are well-tested and backward-compatible. Previous review concerns have been fully addressed. Verification
Well-scoped PR with good attention to backward compatibility. |
| rejected — the tradeoff for supporting subclass passthrough. | ||
| """ | ||
|
|
||
| model_config = ConfigDict(extra="allow") |
There was a problem hiding this comment.
Let's double-check that there's not a subtle gotcha lurking here.
The pydantic extras are preserved as live objects in memory, but they have no field annotation, so model_validate can't reconstruct their type. Any EvaluationDataStore round-trips via model_dump/model_validate, so on a cache hit a typed config: SomeModel comes back as a plain dict:
FRESH path : config -> SomeModel (config.label works)
CACHED path: config -> dict (config.label raises AttributeError)
So an evaluator written with evaluation_case.config.label passes on first run and raises an exception on the second once evaluation_data_store= is set. The new tests only cover the cache-miss path.
Please either:
- add a cache-path test (save → load → assert the evaluator still reads the field), and document that extras survive as dict through a store; or
- have subclasses declare the field on a typed
EvaluationDatasubclass instead of relying onextra="allow".
Minimal repro:
ed = EvaluationData(input="hi", config=SomeModel(label="x", weight=0.7))
EvaluationData.model_validate(ed.model_dump()).config # -> dict, not SomeModel
There was a problem hiding this comment.
yes I think this is not a blocker as of now, but we definitely need to think about how to achieve the recovery when loading from the EvaluationData.
Description
Two related extensibility changes to
Case/EvaluationDataandExperimentso subclasses can carry typed fields end-to-end and customize the report element type without fighting overrides.1.
EvaluationDataaccepts extra fields fromCasesubclasses.model_config = ConfigDict(extra="allow")toEvaluationData.Experiment._run_task_asyncnow passes any subclass-onlyCasefields through viagetattr(notmodel_dump, so nestedBaseModels reach the evaluator with their types intact).Casesubclass that adds a typed field (e.g.config: SomeModel) no longer needs to flatten that data intometadata— the typed object reaches the evaluator directly.2.
Experimentis generic over its report element type.ReportT = TypeVar("ReportT", bound=EvaluationReport, default=EvaluationReport).Experiment[InputT, OutputT, ReportT]with areport_cls: type[ReportT] = EvaluationReportclass attribute.run_evaluationsandrun_evaluations_asyncreturnlist[ReportT].ReportT = MyReportand setreport_cls = MyReport; no# type: ignore[override]needed.Related Issues
n/a
Documentation PR
n/a
Type of Change
New feature (extensibility hooks; backwards-compatible)
Testing
hatch test— 1069 unit tests pass (1065 existing + 4 new)mypy src/strands_evals/experiment.py tests/strands_evals/test_experiment.py— cleanNew tests cover both extension points: typed subclass-field passthrough into
EvaluationData, an evaluator reading that typed field end-to-end, the defaultreport_clsbehavior, and a subclass swappingreport_cls.Default callers (
Experiment[str, str],Experiment(cases=..., evaluators=...)) unchanged at runtime and in type-checker output thanks to the PEP 696 TypeVar default.I ran
hatch run prepareBackwards compatibility
Runtime: identical for default usage. Static typing: existing
Experiment[str, str]annotations keep working becauseReportTdefaults via PEP 696 (already supported by the repo's pinnedtyping_extensions>=4.13.2). Modern mypy/pyright handle this; only callers on outdated type-checkers would see a "too few type arguments" warning.Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.