53 changes: 53 additions & 0 deletions .github/labelers/area.yml
@@ -0,0 +1,53 @@
# .github/labelers/area.yml
# Classifies issues/PRs by technical area/subsystem of the Strands Evals SDK.

instructions: |
This is the Strands Evals SDK repository (Python only).
Assign an area based on which subsystem the issue concerns. Each label below maps to a package under src/strands_evals/.

Precedence — chaos, redteam, and multimodal own their own evaluators, generators, and sessions, so the specialized area always wins over the general one:
- Anything about fault injection, recovery strategy, partial completion, or the chaos evaluators belongs to area-chaos (NOT area-evaluators).
- Anything about adversarial/attack generation, attack strategies, or the attack-success evaluator belongs to area-redteam (NOT area-evaluators or area-generators).
- Image/MLLM-as-judge and image-to-text evaluation belongs to area-multimodal (NOT area-evaluators).

General mappings (use only when the precedence rules above do not apply):
- General-purpose quality metrics (correctness, coherence, faithfulness, helpfulness, refusal, stereotyping, conciseness, relevance, instruction-following, goal success, harmfulness, tool selection/parameter, trajectory, output, interactions) belong to area-evaluators.
- Actor simulation, tool simulation, and multi-turn user simulation that the SDK generates belong to area-simulation.
- Failure detection and root cause analysis of a session belong to area-detectors.
- General experiment generation and topic planning belong to area-generators.
- Ingesting trace or session data that already exists elsewhere (CloudWatch, Langfuse, OpenSearch providers; session mappers; trace/tool/graph/swarm extractors; OTEL telemetry) belongs to area-tracing. This is about reading external data IN, as opposed to area-simulation which generates new conversations.
- CLI commands (run, report, validate, diagnose) and terminal/console output belong to area-cli. area-cli is strictly the command-line and console layer. Do NOT use it for web-based UIs, GUIs, dashboards, hosting, or any non-terminal interface — those have no dedicated area, so assign no area label unless they clearly concern area-core primitives.
- Core framework primitives (Case, Experiment, task handling, result/data stores) belong to area-core. area-core is NOT a catch-all: use it only when the issue is genuinely about these shared primitives, not when it concerns a specific subsystem above.

Cross-cutting labels (not tied to one subsystem):
- area-devx covers how it feels to use the SDK: papercuts, confusing or awkward public APIs, error-message quality, ergonomics, and surprising behavior that does not match what a developer would reasonably expect. It is appropriate even for a specific subsystem when the issue is about that subsystem's public API being confusing, awkward, or surprising to use (as opposed to a clear functional bug in its internal behavior). It is NOT a catch-all for any generic "make it better" request that lacks a concrete usability concern.
- area-community is NOT a catch-all. Use it only for repo health, governance, contributor process, release process, and CI dependency bumps. If no area clearly applies, assign no area label rather than defaulting to area-community.

Do not force an area: if no area clearly applies, assign no area label rather than guessing.
Order matters: list the most important label first, because at most 2 labels are kept. Rank by specificity — concrete subsystems (area-evaluators, area-detectors, area-simulation, area-tracing, etc.) come first, then the broader area-core, and finally the softer cross-cutting labels (area-devx, area-community) which are not tied to a feature area. When an issue reports a bug in a concrete subsystem, that subsystem takes priority. Example: "the OutputEvaluator constructor args are confusing and its error message is unhelpful" -> area-evaluators, then area-devx (subsystem first, usability second).

labels:
  area-evaluators:
    description: "Evaluators: output, trajectory, tool use, interactions, and LLM-as-judge quality metrics (correctness, faithfulness, helpfulness, etc.)"
  area-multimodal:
    description: "Multimodal evaluation: MLLM-as-judge evaluators and image-to-text rubrics"
  area-simulation:
    description: "Conversation simulation: actor simulator, tool simulator, profiles, multi-turn interactions"
  area-detectors:
    description: "Failure detection and root cause analysis of agent sessions"
  area-chaos:
    description: "Chaos/fault injection: experiments, recovery strategy, partial completion, failure communication"
  area-redteam:
    description: "Red teaming: adversarial generation, attack strategies, attack success evaluation"
  area-generators:
    description: "Automated experiment generation and topic planning"
  area-tracing:
    description: "Trace/session ingestion: providers (CloudWatch, Langfuse, OpenSearch), session mappers, extractors, telemetry/OTEL"
  area-cli:
    description: "CLI commands (run, report, validate, diagnose) and console display"
  area-core:
    description: "Core eval framework: Case, Experiment, task handler, evaluation data stores"
  area-devx:
    description: "Developer experience: papercuts, confusing or awkward public APIs, error messages, ergonomics, surprising behavior"
  area-community:
    description: "Repo health, governance, contributor process, release process, and CI dependency bumps. Not a fallback for issues that fit no other area."
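
Review note on area.yml: the ordering rule in the instructions (concrete subsystems first, then area-core, then the cross-cutting labels, with at most 2 kept) and the precedence rule (the specialized area wins over the general one) can be illustrated with a small deterministic sketch in Python. This is not how the labeler action works internally; it only shows the outcome the instructions ask for. The names `SUBSYSTEMS`, `CROSS_CUTTING`, `OVERRIDES`, `tier`, and `rank_and_trim` are hypothetical, and reading "NOT area-evaluators" as "drop that label when the specialized one is present" is an assumption.

```python
from typing import Iterable, List

# Hypothetical tiers mirroring the ranking described in the instructions.
SUBSYSTEMS = {
    "area-evaluators", "area-multimodal", "area-simulation",
    "area-detectors", "area-chaos", "area-redteam",
    "area-generators", "area-tracing", "area-cli",
}
CROSS_CUTTING = {"area-devx", "area-community"}

# Assumed reading of the precedence rules: a specialized area
# suppresses the general area(s) it is contrasted with.
OVERRIDES = {
    "area-chaos": {"area-evaluators"},
    "area-redteam": {"area-evaluators", "area-generators"},
    "area-multimodal": {"area-evaluators"},
}


def tier(label: str) -> int:
    """Lower tier means more specific, so it is listed first."""
    if label in SUBSYSTEMS:
        return 0
    if label == "area-core":
        return 1
    if label in CROSS_CUTTING:
        return 2
    raise ValueError(f"unknown area label: {label}")


def rank_and_trim(candidates: Iterable[str], max_labels: int = 2) -> List[str]:
    """Apply precedence, rank by specificity, and keep at most max_labels."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    suppressed = set()
    for label in unique:
        suppressed |= OVERRIDES.get(label, set())
    kept = [label for label in unique if label not in suppressed]
    # sorted() is stable, so labels within a tier keep their input order.
    return sorted(kept, key=tier)[:max_labels]


# The example from the instructions: subsystem first, usability second.
print(rank_and_trim(["area-devx", "area-evaluators"]))
# → ['area-evaluators', 'area-devx']
```

An empty candidate list returns an empty list, which matches the "do not force an area" rule.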
24 changes: 24 additions & 0 deletions .github/labelers/type.yml
@@ -0,0 +1,24 @@
# .github/labelers/type.yml
# Classifies issues/PRs by type (bug, feature, etc.)
# Use max_labels: 1 in the workflow since these are mutually exclusive.

instructions: |
Choose exactly one type. You are given only the title and body — there is no diff or file list, so judge from those.
The conventional-commit prefix in the title is authoritative when present, including its optional scope (e.g. "feat(core):" counts as "feat:"). Match the prefix regardless of scope:
- "feat" -> enhancement
- "fix" -> bug
- "chore", "ci", "build", "refactor", "perf", "style", "test" -> chore
- "docs" -> follow the documentation rules below
The prefix wins even when the description sounds user-facing: a "perf:" or "refactor:" title is a chore. When no conventional-commit prefix is present, fall back to the body: maintenance with no user-facing impact (dependency bumps, CI config, internal refactors) is a chore, while anything that adds or changes user-facing functionality is an enhancement.
If the title starts with [BUG] it is a bug. If the title starts with [FEATURE] it is an enhancement.
Documentation improvements or corrections are bugs. Requests for new docs or content additions are enhancements. Documentation questions are questions.

labels:
  bug:
    description: "Something is broken or not working as documented"
  enhancement:
    description: "New feature request or improvement to existing functionality"
  question:
    description: "User asking for help, clarification, or how to do something"
  chore:
    description: "Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact"
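
Review note on type.yml: the title rules above are deterministic enough to sketch in Python, which may help when checking whether the labeler's output matches the intent. This is an illustration under stated assumptions, not part of the labeler action. The names `PREFIX_TO_TYPE`, `DOCS_RULES`, and `type_from_title` are hypothetical; matching only lowercase prefixes and accepting the optional `!` marker from the Conventional Commits specification are assumptions not stated in the config.

```python
import re
from typing import Optional

# Mapping taken from the instructions in type.yml.
PREFIX_TO_TYPE = {
    "feat": "enhancement",
    "fix": "bug",
    **{p: "chore" for p in
       ("chore", "ci", "build", "refactor", "perf", "style", "test")},
}

# Hypothetical sentinel: "docs" titles defer to the documentation rules,
# which need the body to tell bug, enhancement, and question apart.
DOCS_RULES = "docs-rules"

# prefix, optional "(scope)", optional "!" (assumption), then a colon.
_CONVENTIONAL = re.compile(r"^(?P<prefix>[a-z]+)(?:\([^)]*\))?!?:")


def type_from_title(title: str) -> Optional[str]:
    """Return a type label from the title alone, or None to fall back to the body."""
    stripped = title.strip()
    if stripped.startswith("[BUG]"):
        return "bug"
    if stripped.startswith("[FEATURE]"):
        return "enhancement"
    match = _CONVENTIONAL.match(stripped)
    if match is None:
        return None
    prefix = match.group("prefix")
    if prefix == "docs":
        return DOCS_RULES
    # Unknown prefixes (e.g. "update:") also fall back to the body.
    return PREFIX_TO_TYPE.get(prefix)


print(type_from_title("feat(core): add result store"))  # → enhancement
print(type_from_title("perf: speed up trace parsing"))  # → chore
```

The scope is ignored by design, so "feat(core):" and "feat:" resolve to the same label, as the instructions require.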
44 changes: 44 additions & 0 deletions .github/workflows/issue-labeler.yml
@@ -0,0 +1,44 @@
name: Issue Labeler

on:
  issues:
    types: [opened]
  pull_request_target:
    types: [opened]

permissions:
  issues: write
  pull-requests: write
  id-token: write
  contents: read

jobs:
  label-area:
    name: "Label: Area"
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v6
        with:
          sparse-checkout: .github/labelers
          sparse-checkout-cone-mode: false
      - uses: strands-agents/devtools/issue-labeler@main
        with:
          aws_role_arn: ${{ secrets.AWS_ROLE_ARN }}
          config_path: '.github/labelers/area.yml'
          max_labels: '2'

  label-type:
    name: "Label: Type"
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v6
        with:
          sparse-checkout: .github/labelers
          sparse-checkout-cone-mode: false
      - uses: strands-agents/devtools/issue-labeler@main
        with:
          aws_role_arn: ${{ secrets.AWS_ROLE_ARN }}
          config_path: '.github/labelers/type.yml'
          max_labels: '1'