chore(cli): added generate command for experiment generation #260

Open
poshinchen wants to merge 1 commit into strands-agents:main from poshinchen:chore/cli-generate

poshinchen commented Jun 11, 2026

Contributor

Description

Adds the strands-evals generate subcommand to the CLI, wrapping ExperimentGenerator for synthesizing starter test sets from the shell.

Two mutually exclusive modes:

  1. --context CONTEXT
  • Wraps ExperimentGenerator.from_context_async to create cases (and an optional rubric) from a free-form description.
  • Supports --num-topics N for TopicPlanner-driven diversity and --evaluator {OutputEvaluator|TrajectoryEvaluator|InteractionsEvaluator} to attach a generated rubric.
  2. --experiment EXPERIMENT_FILE
  • Wraps ExperimentGenerator.from_experiment_async to derive new cases from an existing serialized Experiment, inheriting the source's evaluators.
  • Supports --extra-information TEXT for steering and --custom-evaluator MODULE:CLASS (repeatable) so the source experiment can load custom Evaluator subclasses.
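
For illustration, the two modes could be wired up with a required mutually exclusive argparse group. This is a minimal sketch, not the PR's actual parser: build_parser is a hypothetical name, and the flag types and defaults are assumptions based only on the description above.

```python
import argparse

EVALUATOR_CHOICES = ["OutputEvaluator", "TrajectoryEvaluator", "InteractionsEvaluator"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the generate subcommand's parser."""
    parser = argparse.ArgumentParser(prog="strands-evals generate")

    # Exactly one source must be given; argparse rejects both or neither.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--context")
    source.add_argument("--experiment", metavar="EXPERIMENT_FILE")

    # Flags described as meaningful only with --context.
    parser.add_argument("--num-topics", type=int)
    parser.add_argument("--evaluator", choices=EVALUATOR_CHOICES)

    # Flags described as meaningful only with --experiment.
    parser.add_argument("--extra-information", metavar="TEXT")
    parser.add_argument(
        "--custom-evaluator", action="append", default=[], metavar="MODULE:CLASS"
    )
    return parser


args = build_parser().parse_args(["--context", "A support bot", "--num-topics", "3"])
print(args.context, args.num_topics, args.custom_evaluator)
```

With required=True on the group, omitting both sources or passing both is rejected by argparse itself, which matches the "mode mutual exclusion" and "missing-source argparse rejection" cases listed under Testing.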

Shared flags: --num-cases, --task-description, --model, -o/--output.
With -o, the experiment is written via Experiment.to_file (.json is auto-appended); without it, the JSON document is emitted on stdout. A one-line "generated N case(s), M evaluator(s)" summary is always written to stderr, so stdout stays usable for scripting.
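
The output behavior described above can be sketched as follows. The emit helper is hypothetical, and plain json plus pathlib stand in for Experiment.to_file, whose real implementation is not shown in this PR description.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def emit(document: dict, num_cases: int, num_evaluators: int,
         output: Optional[str] = None) -> None:
    """Hypothetical sink: a file when output is given, stdout otherwise."""
    if output is not None:
        path = Path(output)
        # Stand-in for the .json auto-append attributed to Experiment.to_file.
        if path.suffix != ".json":
            path = path.with_name(path.name + ".json")
        path.write_text(json.dumps(document, indent=2))
    else:
        json.dump(document, sys.stdout, indent=2)
        sys.stdout.write("\n")
    # The summary goes to stderr so stdout carries only the JSON document.
    print(f"generated {num_cases} case(s), {num_evaluators} evaluator(s)",
          file=sys.stderr)


emit({"cases": []}, num_cases=0, num_evaluators=1)
```

Keeping the summary off stdout means a pipeline such as piping the command into a JSON processor would receive only the document, under the assumptions above.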

Notes

  1. CLI-level guards reject flags that would otherwise be silently dropped: --evaluator and --num-topics are not allowed with --experiment (the underlying API ignores them), and --extra-information requires --experiment.
  2. Each guard exits with code 2 and a clear message.
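
A guard of this kind might look like the sketch below. It assumes the checks go through argparse's parser.error, which prints the message and exits with status 2; the function name validate_args and the message wording are illustrative, not taken from the PR.

```python
import argparse


def validate_args(parser: argparse.ArgumentParser, args: argparse.Namespace) -> None:
    """Hypothetical guard: reject flags the selected mode would silently ignore."""
    if args.experiment is not None:
        if args.evaluator is not None:
            parser.error("--evaluator is not allowed with --experiment.")
        if args.num_topics is not None:
            parser.error("--num-topics is not allowed with --experiment.")
    elif args.extra_information is not None:
        parser.error("--extra-information requires --experiment.")


parser = argparse.ArgumentParser(prog="strands-evals generate")
parser.add_argument("--context")
parser.add_argument("--experiment")
parser.add_argument("--evaluator")
parser.add_argument("--num-topics", type=int)
parser.add_argument("--extra-information")

# A valid combination passes silently.
validate_args(parser, parser.parse_args(["--context", "c", "--num-topics", "2"]))
```

Failing at the CLI layer gives the user an immediate usage error instead of a generated experiment that quietly lacks the requested rubric or steering text.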

Examples

I ran the following commands. First, generate an experiment from scratch:

strands-evals generate \
    --context "A customer support bot for an e-commerce site. Tools: lookup_order(id), refund(order_id), check_inventory(sku)" \
    --task-description "answer customer questions and resolve issues" \
    --evaluator OutputEvaluator \
    --num-cases 10 \
    -o generated_support_context.json

I confirmed that the run command can execute the generated experiment, then generated a second experiment from it:

strands-evals generate \
    --experiment generated_support_context.json \
    --task-description "answer customer questions in a friendlier tone" \
    --extra-information "Bias cases toward refund-and-return scenarios; avoid SKU lookups" \
    --num-cases 12 \
    -o generated_support_context_v2.json

I confirmed that the run command can execute this experiment as well.

Related Issues

N/A

Documentation PR

N/A.

Type of Change

New feature.

Testing

  • 13 new unit tests in tests/strands_evals/cli/test_generate.py covering stdout/file output, .json extension auto-append, pass-through of all flags to from_context_async, from_experiment_async happy path, --extra-information pass-through, mode mutual exclusion, missing-source argparse rejection, and the three runtime guards (--evaluator / --num-topics rejected with --experiment; --extra-information requires --experiment).

  • All 100 CLI tests pass; mypy -p src and ruff check/ruff format clean.

  • from_context_async and from_experiment_async are mocked with AsyncMock, so tests are hermetic — no Bedrock calls.

  • I ran hatch run prepare
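
The hermetic mocking mentioned in the list above can be illustrated with a small self-contained sketch. The ExperimentGenerator class and run_generate function here are stand-ins defined only for this example; the real signatures are not shown in this PR description.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class ExperimentGenerator:
    """Stand-in class; the real generator would call a model."""

    async def from_context_async(self, context: str, num_cases: int = 5) -> dict:
        raise RuntimeError("network call; must not run in unit tests")


async def run_generate(context: str, num_cases: int) -> dict:
    # Hypothetical command body: a single library call.
    generator = ExperimentGenerator()
    return await generator.from_context_async(context=context, num_cases=num_cases)


def check_generate_is_hermetic() -> dict:
    fake = AsyncMock(return_value={"cases": ["a", "b", "c"]})
    # Replace the coroutine method so no real generation is attempted.
    with patch.object(ExperimentGenerator, "from_context_async", fake):
        result = asyncio.run(run_generate("support bot", 3))
    fake.assert_awaited_once_with(context="support bot", num_cases=3)
    return result


print(check_generate_is_hermetic())
```

Because the stand-in method raises if it is ever reached, the check would fail loudly should the patch stop applying, which is the property a hermetic test relies on.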

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread src/strands_evals/cli/commands/generate.py Outdated
def test_generate_writes_to_stdout(tmp_path: Path, capsys):
    fake = AsyncMock(return_value=_experiment_with(num_cases=3))
    with patch.object(
        __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator,


Issue: This test uses __import__("strands_evals.cli.commands.generate", fromlist=["ExperimentGenerator"]).ExperimentGenerator with patch.object, while every other test in this file patches the clean string target "strands_evals.cli.commands.generate.ExperimentGenerator.from_context_async". The dynamic __import__ form is harder to read and inconsistent with the rest of the suite.

Suggestion: Use the same string-target patch(...) form as the other tests for consistency.
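
To make the contrast concrete, here is a small sketch of the two patch forms. It uses the standard library's json.JSONEncoder.encode as a stand-in target, since the PR's module is not importable here; both forms replace the same attribute.

```python
import json
from unittest.mock import patch

# Dynamic form: resolve the class via __import__, then patch an attribute on it.
with patch.object(
    __import__("json", fromlist=["JSONEncoder"]).JSONEncoder,
    "encode",
    return_value="stub",
):
    dynamic_result = json.JSONEncoder().encode({"a": 1})

# String-target form: the same patch, with the target readable at a glance.
with patch("json.JSONEncoder.encode", return_value="stub"):
    string_result = json.JSONEncoder().encode({"a": 1})

# Outside both blocks the original method is restored.
restored = json.JSONEncoder().encode({"a": 1})
print(dynamic_result, string_result, restored)
```

The two forms behave the same in this sketch, so the choice comes down to readability and consistency with the rest of the suite.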

Comment thread tests/strands_evals/cli/test_generate.py Outdated
@github-actions


Assessment: Comment

Clean, well-scoped CLI wrapper that faithfully follows the existing ~50-line command pattern (argparse → resolve → one library call → write). I verified the 13 tests pass and ruff/mypy are clean, the ExperimentGenerator signatures match the calls, and the to_file/to_dict round-trip is consistent. One correctness gap and a stale-docs item are worth addressing before merge.

Review Categories
  • Correctness: One flag (--custom-evaluator) escapes the "no silent flag drops" guard rail that the PR otherwise applies consistently — it's accepted but ignored in --context mode despite being documented as --experiment only.
  • Documentation: AGENTS.md still describes four CLI subcommands and omits generate.py from the directory tree; repo guidelines require updating it on structural changes. The PR's own "updated documentation" checklist item is unchecked.
  • Testing: Solid coverage of both modes, output sinks, and all guards. Minor consistency/robustness nits (one test uses a dynamic __import__ patch target; pass-through tests could assert kwargs as a single dict).

Nice attention to fast, CLI-flavored error messages and the inline rationale comments — they make the mode semantics easy to follow.

@github-actions


Assessment: Approve (pending one doc fix)

Re-reviewed after the latest push. Three of four round-1 items are resolved cleanly:

Round-1 follow-up status
  • --custom-evaluator silent drop — now guarded in the else branch (--custom-evaluator requires --experiment.) with a dedicated test (test_generate_custom_evaluator_requires_experiment). The _validate_args docstring documents the symmetry too.
  • __import__ patch target — replaced with the clean string-target patch(...) form, consistent with the rest of the suite.
  • Per-field kwargs assertion — converted to single dict-equality (popping the non-comparable evaluator class), which now guards against stray kwargs.
  • AGENTS.md stale — the only remaining item; see inline comment. Still says "four subcommands" and omits generate.py from the tree.
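
The single dict-equality assertion mentioned in the list above could look roughly like this. The keyword names, the OutputEvaluator stand-in class, and the comparable_kwargs helper are assumptions for illustration; the actual test body is not shown in this thread.

```python
import asyncio
from unittest.mock import AsyncMock


class OutputEvaluator:
    """Stand-in for an evaluator class passed through as a keyword argument."""


def comparable_kwargs(mock: AsyncMock) -> dict:
    """Copy the recorded kwargs and check the evaluator separately by identity."""
    kwargs = dict(mock.call_args.kwargs)
    assert kwargs.pop("evaluator") is OutputEvaluator
    return kwargs


fake = AsyncMock(return_value=None)
asyncio.run(
    fake(context="support bot", num_cases=10, num_topics=3, evaluator=OutputEvaluator)
)

# One equality check covers every remaining kwarg, so a stray extra key fails it.
assert comparable_kwargs(fake) == {
    "context": "support bot",
    "num_cases": 10,
    "num_topics": 3,
}
```

Compared with asserting each field individually, whole-dict equality also fails when an unexpected keyword is passed, which is the "stray kwargs" protection the review refers to.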

I re-verified locally: 101 CLI tests pass (14 in test_generate.py, including the new guard test), and ruff + mypy are clean. Nice, thorough turnaround on the feedback — once AGENTS.md is updated this is good to merge.
