feat(issue-labeler): add LLM issue labeler for area and type #255
Merged
+121
−0
5 commits
b4ca6f9 feat(issue-labeler): add LLM issue labeler for area and type (yonib05)
cd98e74 fix(issue-labeler): scope area-cli to terminal, not web UIs (yonib05)
147ecb7 fix(issue-labeler): clarify type.yml prefix rules (yonib05)
64c23f7 fix(issue-labeler): add area label priority ordering (yonib05)
317af73 feat(issue-labeler): add area-devx and area-community to evals (yonib05)

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # .github/labelers/area.yml | ||
| # Classifies issues/PRs by technical area/subsystem of the Strands Evals SDK. | ||
| | ||
| instructions: | | ||
| This is the Strands Evals SDK repository (Python only). | ||
| Assign an area based on which subsystem the issue concerns. Each label below maps to a package under src/strands_evals/. | ||
| | ||
| Precedence — chaos, redteam, and multimodal own their own evaluators, generators, and sessions, so the specialized area always wins over the general one: | ||
| - Anything about fault injection, recovery strategy, partial completion, or the chaos evaluators belongs to area-chaos (NOT area-evaluators). | ||
| - Anything about adversarial/attack generation, attack strategies, or the attack-success evaluator belongs to area-redteam (NOT area-evaluators or area-generators). | ||
| - Image/MLLM-as-judge and image-to-text evaluation belongs to area-multimodal (NOT area-evaluators). | ||
| | ||
| General mappings (use only when the precedence rules above do not apply): | ||
| - General-purpose quality metrics (correctness, coherence, faithfulness, helpfulness, refusal, stereotyping, conciseness, relevance, instruction-following, goal success, harmfulness, tool selection/parameter, trajectory, output, interactions) belong to area-evaluators. | ||
| - Actor simulation, tool simulation, and multi-turn user simulation that the SDK generates belong to area-simulation. | ||
| - Failure detection and root cause analysis of a session belong to area-detectors. | ||
| - General experiment generation and topic planning belong to area-generators. | ||
| - Ingesting trace or session data that already exists elsewhere (CloudWatch, Langfuse, OpenSearch providers; session mappers; trace/tool/graph/swarm extractors; OTEL telemetry) belongs to area-tracing. This is about reading external data IN, as opposed to area-simulation which generates new conversations. | ||
| - CLI commands (run, report, validate, diagnose) and terminal/console output belong to area-cli. area-cli is strictly the command-line and console layer. Do NOT use it for web-based UIs, GUIs, dashboards, hosting, or any non-terminal interface — those have no dedicated area, so assign no area label unless they clearly concern area-core primitives. | ||
| - Core framework primitives (Case, Experiment, task handling, result/data stores) belong to area-core. area-core is NOT a catch-all: use it only when the issue is genuinely about these shared primitives, not when it concerns a specific subsystem above. | ||
| | ||
| Cross-cutting labels (not tied to one subsystem): | ||
| - area-devx covers how it feels to use the SDK: papercuts, confusing or awkward public APIs, error-message quality, ergonomics, and surprising behavior that does not match what a developer would reasonably expect. It is appropriate even for a specific subsystem when the issue is about that subsystem's public API being confusing, awkward, or surprising to use (as opposed to a clear functional bug in its internal behavior). It is NOT a catch-all for any generic "make it better" request that lacks a concrete usability concern. | ||
| - area-community is NOT a catch-all. Use it only for repo health, governance, contributor process, release process, and CI dependency bumps. If no area clearly applies, assign no area label rather than defaulting to area-community. | ||
| | ||
| Do not force an area: if no area clearly applies, assign no area label rather than guessing. | ||
| Order matters: list the most important label first, because at most 2 labels are kept. Rank by specificity — concrete subsystems (area-evaluators, area-detectors, area-simulation, area-tracing, etc.) come first, then the broader area-core, and finally the softer cross-cutting labels (area-devx, area-community) which are not tied to a feature area. When an issue reports a bug in a concrete subsystem, that subsystem takes priority. Example: "the OutputEvaluator constructor args are confusing and its error message is unhelpful" -> area-evaluators, then area-devx (subsystem first, usability second). | ||
| | ||
| labels: | ||
| area-evaluators: | ||
| description: "Evaluators: output, trajectory, tool use, interactions, and LLM-as-judge quality metrics (correctness, faithfulness, helpfulness, etc.)" | ||
| area-multimodal: | ||
| description: "Multimodal evaluation: MLLM-as-judge evaluators and image-to-text rubrics" | ||
| area-simulation: | ||
| description: "Conversation simulation: actor simulator, tool simulator, profiles, multi-turn interactions" | ||
| area-detectors: | ||
| description: "Failure detection and root cause analysis of agent sessions" | ||
| area-chaos: | ||
| description: "Chaos/fault injection: experiments, recovery strategy, partial completion, failure communication" | ||
| area-redteam: | ||
| description: "Red teaming: adversarial generation, attack strategies, attack success evaluation" | ||
| area-generators: | ||
| description: "Automated experiment generation and topic planning" | ||
| area-tracing: | ||
| description: "Trace/session ingestion: providers (CloudWatch, Langfuse, OpenSearch), session mappers, extractors, telemetry/OTEL" | ||
| area-cli: | ||
| description: "CLI commands (run, report, validate, diagnose) and console display" | ||
| area-core: | ||
| description: "Core eval framework: Case, Experiment, task handler, evaluation data stores" | ||
| area-devx: | ||
| description: "Developer experience: papercuts, confusing or awkward public APIs, error messages, ergonomics, surprising behavior" | ||
| area-community: | ||
| description: "Repo health, governance, contributor process, release process, and CI dependency bumps. Not a fallback for issues that fit no other area." |
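
The ordering rule in the `area.yml` instructions (concrete subsystems first, then `area-core`, then the cross-cutting labels, with at most 2 labels kept) can be sketched in Python. This is only an illustration of the ranking described in the prompt, not the labeler's implementation: the actual labeling is done by an LLM, and `rank_area_labels` and `SPECIFICITY_TIER` are hypothetical names that do not appear in the PR.

```python
# Hypothetical sketch of the "order matters" rule from area.yml.
# Tier 0: concrete subsystems (any label not listed below).
# Tier 1: the broader area-core.
# Tier 2: cross-cutting labels not tied to a feature area.
SPECIFICITY_TIER = {
    "area-core": 1,
    "area-devx": 2,
    "area-community": 2,
}


def rank_area_labels(candidates, max_labels=2):
    """Order candidate labels by specificity and keep at most max_labels."""
    # Drop duplicates while preserving the original order.
    unique = list(dict.fromkeys(candidates))
    # sorted() is stable, so labels in the same tier keep their input order.
    ranked = sorted(unique, key=lambda label: SPECIFICITY_TIER.get(label, 0))
    return ranked[:max_labels]


if __name__ == "__main__":
    # Mirrors the OutputEvaluator example in the instructions:
    # subsystem first, usability second.
    print(rank_area_labels(["area-devx", "area-evaluators"]))
```

With `max_labels` set to 2, as in the workflow's `label-area` job, a third candidate such as `area-community` would be dropped after ranking.
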
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| # .github/labelers/type.yml | ||
| # Classifies issues/PRs by type (bug, feature, etc.) | ||
| # Use max_labels: 1 in the workflow since these are mutually exclusive. | ||
| | ||
| instructions: | | ||
| Choose exactly one type. You are given only the title and body — there is no diff or file list, so judge from those. | ||
| The conventional-commit prefix in the title is authoritative when present, including its optional scope (e.g. "feat(core):" counts as "feat:"). Match the prefix regardless of scope: | ||
| - "feat" -> enhancement | ||
| - "fix" -> bug | ||
| - "chore", "ci", "build", "refactor", "perf", "style", "test" -> chore | ||
| - "docs" -> follow the documentation rules below | ||
| The prefix wins even when the description sounds user-facing: a "perf:" or "refactor:" title is a chore. When no conventional-commit prefix is present, fall back to the body: maintenance with no user-facing impact (dependency bumps, CI config, internal refactors) is a chore, while anything that adds or changes user-facing functionality is an enhancement. | ||
| If the title starts with [BUG] it is a bug. If the title starts with [FEATURE] it is an enhancement. | ||
| Documentation improvements or corrections are bugs. Requests for new docs or content additions are enhancements. Documentation questions are questions. | ||
| | ||
| labels: | ||
| bug: | ||
| description: "Something is broken or not working as documented" | ||
| enhancement: | ||
| description: "New feature request or improvement to existing functionality" | ||
| question: | ||
| description: "User asking for help, clarification, or how to do something" | ||
| chore: | ||
| description: "Maintenance tasks, dependency updates, CI changes, refactoring with no user-facing impact" |
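
The title-prefix rules in the `type.yml` instructions can be approximated deterministically. The sketch below is an assumption-laden illustration, not part of the PR: `type_from_title`, `PREFIX_TO_TYPE`, and the regular expression are hypothetical, and the real classification is performed by the LLM reading these instructions. It returns `None` whenever the instructions say to fall back to the body (no recognized prefix, or a `docs` prefix).

```python
import re

# Prefix -> label mapping taken from the type.yml instructions.
PREFIX_TO_TYPE = {
    "feat": "enhancement",
    "fix": "bug",
    "chore": "chore",
    "ci": "chore",
    "build": "chore",
    "refactor": "chore",
    "perf": "chore",
    "style": "chore",
    "test": "chore",
}

# A lowercase prefix, an optional "(scope)", then a colon.
# The scope is matched but ignored, so "feat(core):" counts as "feat:".
CONVENTIONAL_PREFIX = re.compile(r"^(?P<prefix>[a-z]+)(?:\([^)]*\))?:")


def type_from_title(title):
    """Return a type label decided by the title alone, or None if the body is needed."""
    stripped = title.strip()
    match = CONVENTIONAL_PREFIX.match(stripped)
    if match:
        prefix = match.group("prefix")
        if prefix in PREFIX_TO_TYPE:
            return PREFIX_TO_TYPE[prefix]
        # "docs" (and any unknown prefix) is not decided by the title:
        # the documentation rules depend on what the body asks for.
        return None
    if stripped.startswith("[BUG]"):
        return "bug"
    if stripped.startswith("[FEATURE]"):
        return "enhancement"
    return None


if __name__ == "__main__":
    for title in ("feat(core): add retries", "perf: faster scoring", "docs: fix typo"):
        print(title, "->", type_from_title(title))
```

A `None` result corresponds to the cases the instructions leave to judgment, such as distinguishing a documentation correction from a request for new docs.
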
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| name: Issue Labeler | ||
| | ||
| on: | ||
| issues: | ||
| types: [opened] | ||
| pull_request_target: | ||
| types: [opened] | ||
| | ||
| permissions: | ||
| issues: write | ||
| pull-requests: write | ||
| id-token: write | ||
| contents: read | ||
| | ||
| jobs: | ||
| label-area: | ||
| name: "Label: Area" | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 2 | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| with: | ||
| sparse-checkout: .github/labelers | ||
| sparse-checkout-cone-mode: false | ||
| - uses: strands-agents/devtools/issue-labeler@main | ||
| with: | ||
| aws_role_arn: ${{ secrets.AWS_ROLE_ARN }} | ||
| config_path: '.github/labelers/area.yml' | ||
| max_labels: '2' | ||
| | ||
| label-type: | ||
| name: "Label: Type" | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 2 | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| with: | ||
| sparse-checkout: .github/labelers | ||
| sparse-checkout-cone-mode: false | ||
| - uses: strands-agents/devtools/issue-labeler@main | ||
| with: | ||
| aws_role_arn: ${{ secrets.AWS_ROLE_ARN }} | ||
| config_path: '.github/labelers/type.yml' | ||
| max_labels: '1' | ||