chore(cli): added generate command for experiment generation by poshinchen · Pull Request #260 · strands-agents/evals

poshinchen · 2026-06-11T19:55:51Z

Description

Adds the strands-evals generate subcommand to the CLI, wrapping ExperimentGenerator for synthesizing starter test sets from the shell.

Two mutually-exclusive modes:

--context CONTEXT

Wraps ExperimentGenerator.from_context_async to create cases (and an optional rubric) from a free-form description.
Supports --num-topics N for TopicPlanner-driven diversity and --evaluator {OutputEvaluator|TrajectoryEvaluator|InteractionsEvaluator} to attach a generated rubric.

--experiment EXPERIMENT_FILE

Wraps ExperimentGenerator.from_experiment_async to derive new cases from an existing serialized Experiment, inheriting the source's evaluators.
Supports --extra-information TEXT for steering and --custom-evaluator MODULE:CLASS (repeatable) so the source experiment can load custom Evaluator subclasses.
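
For readers unfamiliar with the MODULE:CLASS convention, here is a minimal sketch of how such a spec could be resolved with importlib. The helper name load_class is hypothetical and is not taken from this PR; the actual CLI may resolve and validate custom evaluators differently.

```python
import importlib


def load_class(spec: str) -> type:
    """Resolve a 'MODULE:CLASS' string to the class object it names.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected MODULE:CLASS, got {spec!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{spec!r} does not name a class")
    return obj


# Usage with a stdlib class, purely to show the mechanics.
ordered_dict_cls = load_class("collections:OrderedDict")
```

A repeatable flag would simply call a helper like this once per occurrence and collect the resulting classes.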

Shared flags: --num-cases, --task-description, --model, -o/--output.
With -o, the experiment is written via Experiment.to_file (.json is auto-appended); without it, the JSON document is emitted on stdout. A one-line "generated N case(s), M evaluator(s)" summary always lands on stderr for scripting.
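
The output behavior described above can be sketched roughly as follows. This is an illustration under assumptions: resolve_output_path and emit are hypothetical names, and a plain dict with json.dumps stands in for the real Experiment and Experiment.to_file, whose exact extension handling is not shown in this PR description.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def resolve_output_path(output: str) -> Path:
    """Append .json unless the name already ends with it (assumed behavior)."""
    path = Path(output)
    if path.suffix == ".json":
        return path
    return path.with_name(path.name + ".json")


def emit(document: dict, num_cases: int, num_evaluators: int, output: Optional[str] = None) -> None:
    """Send the JSON document to a file or stdout, and the summary to stderr."""
    text = json.dumps(document, indent=2)
    if output is None:
        print(text)  # stdout carries only the JSON document
    else:
        resolve_output_path(output).write_text(text + "\n", encoding="utf-8")
    # The summary goes to stderr in both modes, so it never mixes with the JSON.
    print(f"generated {num_cases} case(s), {num_evaluators} evaluator(s)", file=sys.stderr)
```

Keeping the summary on stderr means a caller can pipe stdout straight into another tool without filtering out the status line.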

Notes

CLI-level guards reject silent flag drops: --evaluator and --num-topics are not allowed with --experiment (the underlying API ignores them), and --extra-information requires --experiment. Both surface as exit code 2 with a clear message.
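
A minimal sketch of how guards like these can be expressed with argparse, assuming a mutually exclusive source group and post-parse validation. The function names and error wording here are illustrative, not copied from generate.py. The one thing that is standard library behavior is that parser.error prints the message to stderr and exits with status 2.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate-sketch")
    # Exactly one source mode must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--context")
    source.add_argument("--experiment")
    parser.add_argument("--evaluator")
    parser.add_argument("--num-topics", type=int)
    parser.add_argument("--extra-information")
    return parser


def validate(parser: argparse.ArgumentParser, args: argparse.Namespace) -> None:
    """Reject flags that the selected mode would otherwise silently ignore."""
    if args.experiment is not None:
        if args.evaluator is not None:
            parser.error("--evaluator is not allowed with --experiment")
        if args.num_topics is not None:
            parser.error("--num-topics is not allowed with --experiment")
    elif args.extra_information is not None:
        parser.error("--extra-information requires --experiment")


def parse(argv: list) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    validate(parser, args)
    return args
```

Calling parse(["--experiment", "e.json", "--evaluator", "OutputEvaluator"]) in this sketch raises SystemExit with code 2 rather than dropping the flag.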

Examples

I ran the following. First, generate an experiment from scratch:

strands-evals generate \
    --context "A customer support bot for an e-commerce site. Tools: lookup_order(id), refund(order_id), check_inventory(sku)" \
    --task-description "answer customer questions and resolve issues" \
    --evaluator OutputEvaluator \
    --num-cases 10 \
    -o generated_support_context.json

Confirmed that the run command can execute the generated experiment. Then ran the next command to derive a second experiment from the one generated above:

strands-evals generate \
    --experiment generated_support_context.json \
    --task-description "answer customer questions in a friendlier tone" \
    --extra-information "Bias cases toward refund-and-return scenarios; avoid SKU lookups" \
    --num-cases 12 \
    -o generated_support_context_v2.json

Confirmed that the run command can execute this experiment as well.

Related Issues

N/A

Documentation PR

N/A.

Type of Change

New feature.

Testing

13 new unit tests in tests/strands_evals/cli/test_generate.py covering stdout/file output, .json extension auto-append, pass-through of all flags to from_context_async, from_experiment_async happy path, --extra-information pass-through, mode mutual exclusion, missing-source argparse rejection, and the three runtime guards (--evaluator / --num-topics rejected with --experiment; --extra-information requires --experiment).
All 100 CLI tests pass; mypy -p src, ruff check, and ruff format are clean.
from_context_async and from_experiment_async are mocked with AsyncMock, so tests are hermetic — no Bedrock calls.
I ran hatch run prepare.
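
To illustrate the hermetic approach, here is a small self-contained sketch of replacing an async classmethod with AsyncMock. FakeGenerator and run_generate are hypothetical stand-ins, not the real ExperimentGenerator or CLI entry point; only the mocking mechanics are the point.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeGenerator:
    """Hypothetical stand-in for a generator that would call a remote model."""

    @classmethod
    async def from_context_async(cls, context: str, num_cases: int) -> list:
        raise RuntimeError("would hit the network")


async def run_generate(context: str, num_cases: int) -> list:
    # The code under test awaits the generator exactly once.
    return await FakeGenerator.from_context_async(context=context, num_cases=num_cases)


def test_generate_is_hermetic() -> list:
    fake = AsyncMock(return_value=["case-1", "case-2"])
    # While the patch is active, the real coroutine is never entered.
    with patch.object(FakeGenerator, "from_context_async", fake):
        result = asyncio.run(run_generate("support bot", 2))
    fake.assert_awaited_once_with(context="support bot", num_cases=2)
    return result
```

Because the mock returns a canned value, the test exercises argument plumbing and output handling without any model or network dependency.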

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2026-06-11T20:11:51Z

+def test_generate_writes_to_stdout(tmp_path: Path, capsys):
+    fake = AsyncMock(return_value=_experiment_with(num_cases=3))
+    with patch.object(
+        __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator,


Issue: This test uses __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator with patch.object, while every other test in this file patches the clean string target "strands_evals.cli.commands.generate.ExperimentGenerator.from_context_async". The dynamic __import__ form is harder to read and inconsistent with the rest of the suite.

Suggestion: Use the same string-target patch(...) form as the other tests for consistency.
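
As a generic illustration of the two forms (using os.path.exists from the standard library rather than anything from this repository), both of the following patch the same attribute, but the string target states the location directly:

```python
import os.path
from unittest.mock import patch

MISSING = "/no/such/path/for/this/sketch"

# Dynamic form: resolve the module object first, then patch an attribute on it.
with patch.object(__import__("os.path", fromlist=["exists"]), "exists", return_value=True):
    dynamic_result = os.path.exists(MISSING)

# String-target form: the dotted path names the module and attribute in one place.
with patch("os.path.exists", return_value=True):
    string_result = os.path.exists(MISSING)

# Outside both context managers the original function is restored.
restored_result = os.path.exists(MISSING)
```

The behavior is the same in both cases; the difference is readability and consistency with the other tests.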

github-actions · 2026-06-11T20:11:54Z

Assessment: Comment

Clean, well-scoped CLI wrapper that faithfully follows the existing ~50-line command pattern (argparse → resolve → one library call → write). I verified the 13 tests pass and ruff/mypy are clean, the ExperimentGenerator signatures match the calls, and the to_file/to_dict round-trip is consistent. One correctness gap and a stale-docs item are worth addressing before merge.

Review Categories

Correctness: One flag (--custom-evaluator) escapes the "no silent flag drops" guard rail that the PR otherwise applies consistently — it's accepted but ignored in --context mode despite being documented as --experiment only.
Documentation: AGENTS.md still describes four CLI subcommands and omits generate.py from the directory tree; repo guidelines require updating it on structural changes. The PR's own "updated documentation" checklist item is unchecked.
Testing: Solid coverage of both modes, output sinks, and all guards. Minor consistency/robustness nits (one test uses a dynamic __import__ patch target; pass-through tests could assert kwargs as a single dict).

Nice attention to fast, CLI-flavored error messages and the inline rationale comments — they make the mode semantics easy to follow.

github-actions · 2026-06-11T20:24:47Z

Assessment: Approve (pending one doc fix)

Re-reviewed after the latest push. Three of four round-1 items are resolved cleanly:

Round-1 follow-up status

✅ --custom-evaluator silent drop — now guarded in the else branch ("--custom-evaluator requires --experiment.") with a dedicated test (test_generate_custom_evaluator_requires_experiment). The _validate_args docstring documents the symmetry too.
✅ __import__ patch target — replaced with the clean string-target patch(...) form, consistent with the rest of the suite.
✅ Per-field kwargs assertion — converted to single dict-equality (popping the non-comparable evaluator class), which now guards against stray kwargs.
⏳ AGENTS.md stale — the only remaining item; see inline comment. Still says "four subcommands" and omits generate.py from the tree.
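
The single dict-equality assertion mentioned in the list above can be sketched as follows. The names check_call_kwargs and OutputEvaluatorStub are hypothetical, and the kwargs shown are illustrative rather than the real from_context_async signature.

```python
from unittest.mock import MagicMock


class OutputEvaluatorStub:
    """Hypothetical evaluator class, checked by identity rather than equality."""


def check_call_kwargs(mock: MagicMock, expected: dict, evaluator_cls: type) -> None:
    """Assert the mock was called with exactly the expected keyword arguments."""
    kwargs = dict(mock.call_args.kwargs)
    # Pop the class first so the remaining kwargs compare as one plain dict.
    assert kwargs.pop("evaluator") is evaluator_cls
    # Whole-dict equality fails on missing keys and on unexpected extra keys.
    assert kwargs == expected


generate = MagicMock()
generate(context="support bot", num_cases=10, num_topics=None, evaluator=OutputEvaluatorStub)
check_call_kwargs(
    generate,
    {"context": "support bot", "num_cases": 10, "num_topics": None},
    OutputEvaluatorStub,
)
```

Per-field assertions would pass even if an unexpected keyword slipped through; comparing the whole dict is what catches stray kwargs.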

I re-verified locally: 101 CLI tests pass (14 in test_generate.py, including the new guard test), and ruff + mypy are clean. Nice, thorough turnaround on the feedback — once AGENTS.md is updated this is good to merge.

poshinchen temporarily deployed to auto-approve June 11, 2026 19:56 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 11, 2026

github-actions Bot reviewed Jun 11, 2026

Comment thread src/strands_evals/cli/commands/generate.py Outdated

github-actions Bot reviewed Jun 11, 2026

Comment thread tests/strands_evals/cli/test_generate.py Outdated

github-actions Bot removed the strands-running label Jun 11, 2026

chore(cli): added generate command for experiment generation

75c84ff

poshinchen force-pushed the chore/cli-generate branch from 13c48a3 to 75c84ff Compare June 11, 2026 20:20

poshinchen temporarily deployed to auto-approve June 11, 2026 20:20 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 11, 2026

github-actions Bot removed the strands-running label Jun 11, 2026

chore(cli): added generate command for experiment generation #260
poshinchen wants to merge 1 commit into strands-agents:main from poshinchen:chore/cli-generate
