diff --git a/.github/labelers/area.yml b/.github/labelers/area.yml new file mode 100644 index 00000000..ab4b7579 --- /dev/null +++ b/.github/labelers/area.yml @@ -0,0 +1,53 @@ +# .github/labelers/area.yml +# Classifies issues/PRs by technical area/subsystem of the Strands Evals SDK. + +instructions: | + This is the Strands Evals SDK repository (Python only). + Assign an area based on which subsystem the issue concerns. Each label below maps to a package under src/strands_evals/. + + Precedence — chaos, redteam, and multimodal own their own evaluators, generators, and sessions, so the specialized area always wins over the general one: + - Anything about fault injection, recovery strategy, partial completion, or the chaos evaluators belongs to area-chaos (NOT area-evaluators). + - Anything about adversarial/attack generation, attack strategies, or the attack-success evaluator belongs to area-redteam (NOT area-evaluators or area-generators). + - Image/MLLM-as-judge and image-to-text evaluation belongs to area-multimodal (NOT area-evaluators). + + General mappings (use only when the precedence rules above do not apply): + - General-purpose quality metrics (correctness, coherence, faithfulness, helpfulness, refusal, stereotyping, conciseness, relevance, instruction-following, goal success, harmfulness, tool selection/parameter, trajectory, output, interactions) belong to area-evaluators. + - Actor simulation, tool simulation, and multi-turn user simulation that the SDK generates belong to area-simulation. + - Failure detection and root cause analysis of a session belong to area-detectors. + - General experiment generation and topic planning belong to area-generators. + - Ingesting trace or session data that already exists elsewhere (CloudWatch, Langfuse, OpenSearch providers; session mappers; trace/tool/graph/swarm extractors; OTEL telemetry) belongs to area-tracing. This is about reading external data IN, as opposed to area-simulation which generates new conversations. + - CLI commands (run, report, validate, diagnose) and terminal/console output belong to area-cli. area-cli is strictly the command-line and console layer. Do NOT use it for web-based UIs, GUIs, dashboards, hosting, or any non-terminal interface — those have no dedicated area, so assign no area label unless they clearly concern area-core primitives. + - Core framework primitives (Case, Experiment, task handling, result/data stores) belong to area-core. area-core is NOT a catch-all: use it only when the issue is genuinely about these shared primitives, not when it concerns a specific subsystem above. + + Cross-cutting labels (not tied to one subsystem): + - area-devx covers how it feels to use the SDK: papercuts, confusing or awkward public APIs, error-message quality, ergonomics, and surprising behavior that does not match what a developer would reasonably expect. It is appropriate even for a specific subsystem when the issue is about that subsystem's public API being confusing, awkward, or surprising to use (as opposed to a clear functional bug in its internal behavior). It is NOT a catch-all for any generic "make it better" request that lacks a concrete usability concern. + - area-community is NOT a catch-all. Use it only for repo health, governance, contributor process, release process, and CI dependency bumps. If no area clearly applies, assign no area label rather than defaulting to area-community. + + Do not force an area: if no area clearly applies, assign no area label rather than guessing. + Order matters: list the most important label first, because at most 2 labels are kept. Rank by specificity — concrete subsystems (area-evaluators, area-detectors, area-simulation, area-tracing, etc.) come first, then the broader area-core, and finally the softer cross-cutting labels (area-devx, area-community) which are not tied to a feature area. When an issue reports a bug in a concrete subsystem, that subsystem takes priority. Example: "the OutputEvaluator constructor args are confusing and its error message is unhelpful" -> area-evaluators, then area-devx (subsystem first, usability second). + +labels: + area-evaluators: + description: "Evaluators: output, trajectory, tool use, interactions, and LLM-as-judge quality metrics (correctness, faithfulness, helpfulness, etc.)" + area-multimodal: + description: "Multimodal evaluation: MLLM-as-judge evaluators and image-to-text rubrics" + area-simulation: + description: "Conversation simulation: actor simulator, tool simulator, profiles, multi-turn interactions" + area-detectors: + description: "Failure detection and root cause analysis of agent sessions" + area-chaos: + description: "Chaos/fault injection: experiments, recovery strategy, partial completion, failure communication" + area-redteam: + description: "Red teaming: adversarial generation, attack strategies, attack success evaluation" + area-generators: + description: "Automated experiment generation and topic planning" + area-tracing: + description: "Trace/session ingestion: providers (CloudWatch, Langfuse, OpenSearch), session mappers, extractors, telemetry/OTEL" + area-cli: + description: "CLI commands (run, report, validate, diagnose) and console display" + area-core: + description: "Core eval framework: Case, Experiment, task handler, evaluation data stores" + area-devx: + description: "Developer experience: papercuts, confusing or awkward public APIs, error messages, ergonomics, surprising behavior" + area-community: + description: "Repo health, governance, contributor process, release process, and CI dependency bumps. Not a fallback for issues that fit no other area." diff --git a/.github/labelers/type.yml b/.github/labelers/type.yml new file mode 100644 index 00000000..db0e15ad --- /dev/null +++ b/.github/labelers/type.yml @@ -0,0 +1,24 @@ +# .github/labelers/type.yml +# Classifies issues/PRs by type (bug, feature, etc.) +# Use max_labels: 1 in the workflow since these are mutually exclusive. + +instructions: | + Choose exactly one type. You are given only the title and body — there is no diff or file list, so judge from those. + The conventional-commit prefix in the title is authoritative when present, including its optional scope (e.g. "feat(core):" counts as "feat:"). Match the prefix regardless of scope: + - "feat" -> enhancement + - "fix" -> bug + - "chore", "ci", "build", "refactor", "perf", "style", "test" -> chore + - "docs" -> follow the documentation rules below + The prefix wins even when the description sounds user-facing: a "perf:" or "refactor:" title is a chore. When no conventional-commit prefix is present, fall back to the body: maintenance with no user-facing impact (dependency bumps, CI config, internal refactors) is a chore, while anything that adds or changes user-facing functionality is an enhancement. + If the title starts with [BUG] it is a bug. If the title starts with [FEATURE] it is an enhancement. + Documentation improvements or corrections are bugs. Requests for new docs or content additions are enhancements. Documentation questions are questions. + +labels: + bug: + description: "Something is broken or not working as documented" + enhancement: + description: "New feature request or improvement to existing functionality" + question: + description: "User asking for help, clarification, or how to do something" + chore: + description: "Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact" diff --git a/.github/workflows/issue-labeler.yml b/.github/workflows/issue-labeler.yml new file mode 100644 index 00000000..34f29459 --- /dev/null +++ b/.github/workflows/issue-labeler.yml @@ -0,0 +1,44 @@ +name: Issue Labeler + +on: + issues: + types: [opened] + pull_request_target: + types: [opened] + +permissions: + issues: write + pull-requests: write + id-token: write + contents: read + +jobs: + label-area: + name: "Label: Area" + runs-on: ubuntu-latest + timeout-minutes: 2 + steps: + - uses: actions/checkout@v6 + with: + sparse-checkout: .github/labelers + sparse-checkout-cone-mode: false + - uses: strands-agents/devtools/issue-labeler@main + with: + aws_role_arn: ${{ secrets.AWS_ROLE_ARN }} + config_path: '.github/labelers/area.yml' + max_labels: '2' + + label-type: + name: "Label: Type" + runs-on: ubuntu-latest + timeout-minutes: 2 + steps: + - uses: actions/checkout@v6 + with: + sparse-checkout: .github/labelers + sparse-checkout-cone-mode: false + - uses: strands-agents/devtools/issue-labeler@main + with: + aws_role_arn: ${{ secrets.AWS_ROLE_ARN }} + config_path: '.github/labelers/type.yml' + max_labels: '1'