chore(cli): added generate command for experiment generation #260
poshinchen wants to merge 1 commit into
def test_generate_writes_to_stdout(tmp_path: Path, capsys):
    fake = AsyncMock(return_value=_experiment_with(num_cases=3))
    with patch.object(
        __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator,
Issue: This test uses __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator with patch.object, while every other test in this file patches the clean string target "strands_evals.cli.commands.generate.ExperimentGenerator.from_context_async". The dynamic __import__ form is harder to read and inconsistent with the rest of the suite.
Suggestion: Use the same string-target patch(...) form as the other tests for consistency.
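For illustration, a minimal sketch of the string-target form the suggestion refers to. It assumes a hypothetical stand-in module named demo_generate (registered in sys.modules so the snippet runs without strands_evals installed); in the real test the target string would be "strands_evals.cli.commands.generate.ExperimentGenerator.from_context_async", and the return value would be an experiment rather than a placeholder string.

```python
import asyncio
import sys
import types
from unittest.mock import AsyncMock, patch

# Hypothetical stand-in for strands_evals.cli.commands.generate.
demo = types.ModuleType("demo_generate")


class ExperimentGenerator:
    @classmethod
    async def from_context_async(cls, context):
        # The real method would call a model; the test must never reach this.
        raise RuntimeError("would call the model")


demo.ExperimentGenerator = ExperimentGenerator
sys.modules["demo_generate"] = demo


def run_with_string_target():
    fake = AsyncMock(return_value="fake-experiment")
    # String target: patch() imports the module and resolves the attribute path itself,
    # so no __import__ call is needed at the call site.
    with patch("demo_generate.ExperimentGenerator.from_context_async", fake):
        result = asyncio.run(demo.ExperimentGenerator.from_context_async("ctx"))
    fake.assert_awaited_once_with("ctx")
    return result
```

Both forms replace the same attribute for the duration of the with block; the string form just keeps the target readable and matches the other tests in the file.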
Assessment: Comment
Clean, well-scoped CLI wrapper that faithfully follows the existing
Review Categories
Nice attention to fast, CLI-flavored error messages and the inline rationale comments; they make the mode semantics easy to follow.
13c48a3 to 75c84ff
Assessment: Approve (pending one doc fix)
Re-reviewed after the latest push. Three of four round-1 items are resolved cleanly:
Round-1 follow-up status
I re-verified locally: 101 CLI tests pass (14 in
Description
Adds the strands-evals generate subcommand to the CLI, wrapping ExperimentGenerator for synthesizing starter test sets from the shell.
Two mutually-exclusive modes:
--context CONTEXT: uses ExperimentGenerator.from_context_async to create cases (and an optional rubric) from a free-form description. Accepts --num-topics N for TopicPlanner-driven diversity and --evaluator {OutputEvaluator|TrajectoryEvaluator|InteractionsEvaluator} to attach a generated rubric.
--experiment EXPERIMENT_FILE: uses ExperimentGenerator.from_experiment_async to derive new cases from an existing serialized Experiment, inheriting the source's evaluators. Accepts --extra-information TEXT for steering and --custom-evaluator MODULE:CLASS (repeatable) so the source experiment can load custom Evaluator subclasses.
Shared flags:
--num-cases, --task-description, --model, -o/--output.
With -o the experiment is written via Experiment.to_file (.json auto-appended); without it, the JSON document is emitted on stdout. A one-line generated N case(s), M evaluator(s) summary always lands on stderr for scripting.
Notes
--evaluator/--num-topics are not allowed with --experiment (the underlying API ignores them); --extra-information requires --experiment. Both surface as exit-code 2 with a clear message.
Examples
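As an illustration of the mode rules in the Notes, here is a minimal argparse sketch. It is not the PR's implementation: the helper names build_parser and parse are hypothetical, only a subset of the flags is modelled, and the error wording is made up. It relies on two documented argparse behaviours: a required mutually exclusive group rejects both "neither" and "both", and parser.error prints a message to stderr and exits with status 2.

```python
import argparse

EVALUATORS = ["OutputEvaluator", "TrajectoryEvaluator", "InteractionsEvaluator"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="strands-evals generate")
    # Exactly one source must be given: --context or --experiment.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--context")
    source.add_argument("--experiment")
    parser.add_argument("--num-topics", type=int)
    parser.add_argument("--evaluator", choices=EVALUATORS)
    parser.add_argument("--extra-information")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.experiment is not None:
        # Runtime guard: these flags only make sense in --context mode.
        if args.evaluator is not None or args.num_topics is not None:
            parser.error("--evaluator/--num-topics are not allowed with --experiment")
    elif args.extra_information is not None:
        # Runtime guard: steering text only applies to an existing experiment.
        parser.error("--extra-information requires --experiment")
    return args
```

For example, parse(["--context", "billing bot", "--num-topics", "2"]) returns a namespace, while parse(["--experiment", "exp.json", "--evaluator", "OutputEvaluator"]) exits with status 2.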
I've executed:
Generate the experiment from scratch
Confirmed that the run command can execute the experiment. Then executed the next command to generate another experiment based on the one generated above.
And confirmed that the run command can execute that experiment too.
Related Issues
N/A
Documentation PR
N/A.
Type of Change
New feature.
Testing
13 new unit tests in tests/strands_evals/cli/test_generate.py covering stdout/file output, .json extension auto-append, pass-through of all flags to from_context_async, from_experiment_async happy path, --extra-information pass-through, mode mutual exclusion, missing-source argparse rejection, and the three runtime guards (--evaluator/--num-topics rejected with --experiment; --extra-information requires --experiment).
All 100 CLI tests pass; mypy -p src and ruff check/ruff format clean. from_context_async and from_experiment_async are mocked with AsyncMock, so tests are hermetic: no Bedrock calls.
I ran hatch run prepare
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.