Merged
262 changes: 0 additions & 262 deletions .flue/PLAN.md

This file was deleted.

27 changes: 14 additions & 13 deletions .flue/README.md
@@ -2,7 +2,7 @@

Experimental Flue-powered investigation bot for `emdash-cms/emdash` issues. Runs as a GitHub Actions workflow when a maintainer applies the `bot:repro` label. Not deployed as a Cloudflare Worker.

For the design rationale, see [PLAN.md](./PLAN.md) and the [PR description](https://github.com/emdash-cms/emdash/pull/1090). Astro's analogous setup (`.flue/agents/issue-triage.ts` in `withastro/astro`) is the closest reference.
For the design rationale, see the [PR description](https://github.com/emdash-cms/emdash/pull/1090). Astro's analogous setup (`.flue/agents/issue-triage.ts` in `withastro/astro`) is the closest reference.

## What it does

@@ -13,24 +13,25 @@ When a maintainer adds `bot:repro` to an issue:
- `repro-api` — `pnpm test`, CLI commands, direct API hits, no browser
- `repro-admin` — `agent-browser` against `pnpm dev` with the dev-bypass auth shortcut
- `repro-public` — `agent-browser` against the rendered public site
3. **Diagnose** — read the source paths that explain the symptom, rate confidence honestly.
3. **Diagnose** — read the source paths that explain the symptom, rate confidence in the root cause, choose a fix approach (`mechanical` / `clear-best-option` / `needs-design-decision`), and write a concrete proposed fix.
4. **Verify** — decide whether the behaviour is a bug or intended-by-design. Gates the fix stage.
5. **Fix** — conditional on `verdict=bug` AND `confidence=high`. Writes the change, runs the reproduce test, runs the broader package tests, typecheck, lint, format. Stages but does not commit.
5. **Fix** — conditional on `verdict=bug`, `confidence!=low`, and `fixApproach!=needs-design-decision`. Runs on a cheaper model (kimi-k2.6) in its own session — diagnose already produced the plan, so this stage is guided implementation. Writes the change, runs the reproduce test, the broader package tests, typecheck, lint, format. Stages but does not commit.

The orchestrator (`.github/workflows/investigate.yml`) reads the structured JSON output and performs all GitHub writes — labels, comments, branch pushes, PR creation. The agent itself has no write access to GitHub.

## Trigger and label state

| Label | Set by | Meaning |
| -------------------------- | ---------- | ------------------------------------------------ |
| `bot:repro` | Maintainer | Investigation requested |
| `triage/reproducing` | Bot | Investigation in progress |
| `triage/reproduced` | Bot | Reproduced; no fix attempted (or fix abandoned) |
| `triage/awaiting-reporter` | Bot | Fix pushed; reporter asked to verify |
| `triage/verified` | Bot | Reporter confirmed; PR opened |
| `triage/not-reproduced` | Bot | Could not observe the reported behaviour |
| `triage/skipped` | Bot | Declined (non-bug, requires external data, etc.) |
| `triage/failed` | Bot | Gave up after retries |
| Label | Set by | Meaning |
| -------------------------- | ---------- | ------------------------------------------------------------ |
| `bot:repro` | Maintainer | Investigation requested |
| `triage/reproducing` | Bot | Investigation in progress |
| `triage/reproduced` | Bot | Confirmed bug; needs a maintainer (no fix, or fix abandoned) |
| `triage/by-design` | Bot | Reproduced, but the behaviour appears intentional |
| `triage/awaiting-reporter` | Bot | Fix pushed; reporter asked to verify |
| `triage/verified` | Bot | Reporter confirmed; PR opened |
| `triage/not-reproduced` | Bot | Could not observe the reported behaviour |
| `triage/skipped` | Bot | Declined (non-bug, requires external data, etc.) |
| `triage/failed` | Bot | Gave up after retries |

The bot owns every label except `bot:repro`. Maintainers don't manage state directly — they trigger by adding `bot:repro` and re-trigger by removing/re-adding it.
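
As a rough illustration of the ownership rule, the label table can be held as data and queried. This is only a sketch: the `LABELS` constant and the `botOwnedLabels` / `isBotOwned` helpers are hypothetical names introduced here, not part of the workflow, and the entries simply transcribe the updated table above.

```typescript
// Hypothetical data structure: a transcription of the label table, not the
// orchestrator's real implementation.
type SetBy = "Maintainer" | "Bot";

const LABELS: Record<string, { setBy: SetBy; meaning: string }> = {
  "bot:repro": { setBy: "Maintainer", meaning: "Investigation requested" },
  "triage/reproducing": { setBy: "Bot", meaning: "Investigation in progress" },
  "triage/reproduced": { setBy: "Bot", meaning: "Confirmed bug; needs a maintainer" },
  "triage/by-design": { setBy: "Bot", meaning: "Reproduced, but appears intentional" },
  "triage/awaiting-reporter": { setBy: "Bot", meaning: "Fix pushed; reporter asked to verify" },
  "triage/verified": { setBy: "Bot", meaning: "Reporter confirmed; PR opened" },
  "triage/not-reproduced": { setBy: "Bot", meaning: "Could not observe the reported behaviour" },
  "triage/skipped": { setBy: "Bot", meaning: "Declined" },
  "triage/failed": { setBy: "Bot", meaning: "Gave up after retries" },
};

// Every label the bot manages, i.e. everything except `bot:repro`.
function botOwnedLabels(): string[] {
  return Object.entries(LABELS)
    .filter(([, info]) => info.setBy === "Bot")
    .map(([name]) => name);
}

// True only for labels listed in the table with "Bot" in the "Set by" column.
function isBotOwned(label: string): boolean {
  return LABELS[label]?.setBy === "Bot";
}
```

A maintainer-facing check could use `isBotOwned` to warn when someone edits a `triage/*` label by hand, though nothing in this README says such a check exists.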

11 changes: 6 additions & 5 deletions .flue/skills/_INVESTIGATE.md
@@ -53,7 +53,7 @@ If the reproduce stage returns `skipped: true`, do not run diagnose or fix. Run

### 3. Diagnose

Follow `../diagnose.md`. Feed it the reproduce notes. It returns a root cause (file plus approximate line plus prose), a confidence rating, and hypothesis notes if confidence is lower than `high`.
Follow `../diagnose.md`. Feed it the reproduce notes. It returns a root cause (file plus approximate line plus prose), a confidence rating in that cause, a fix approach (`mechanical`, `clear-best-option`, or `needs-design-decision`), a concrete proposed fix, and hypothesis notes covering alternative causes. Confidence rates the _cause_; fix approach rates the _fix_ -- the two are independent, so a confidently-located bug whose fix is one clear backwards-compatible change is `high` + `clear-best-option`, not `medium`.

If the reproduce stage failed to reproduce (`reproduced: false`, not skipped), still run diagnose -- often the issue text alone is enough to identify the code path, and the bot's comment is more useful with a guess than without one. Diagnose should lower its own confidence accordingly.
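
The hand-off from reproduce to diagnose described above can be sketched as a small decision function. The `ReproduceResult` and `NextStep` shapes are assumptions for illustration: only the `skipped` and `reproduced` flags appear in this document, and the real stage output may carry more fields.

```typescript
// Hypothetical shapes: only `skipped` and `reproduced` are named in the text.
interface ReproduceResult {
  skipped: boolean;
  reproduced: boolean;
}

interface NextStep {
  runDiagnose: boolean;
  // Diagnose is expected to lower its own confidence when nothing was observed.
  expectLoweredConfidence: boolean;
}

function afterReproduce(result: ReproduceResult): NextStep {
  if (result.skipped) {
    // A skipped reproduce means neither diagnose nor fix should run.
    return { runDiagnose: false, expectLoweredConfidence: false };
  }
  // A failed (but not skipped) reproduce still proceeds to diagnose.
  return { runDiagnose: true, expectLoweredConfidence: !result.reproduced };
}
```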

@@ -63,14 +63,15 @@ Follow `../verify.md`. It looks at the diagnosed code, the surrounding documenta

### 5. Fix (conditional)

Only run `../fix.md` when **both** of the following hold:
Only run `../fix.md` when **all** of the following hold:

- `verify.verdict === 'bug'`
- `diagnose.confidence === 'high'`
- `diagnose.confidence !== 'low'` (the cause is pinned with at least medium confidence)
- `diagnose.fixApproach !== 'needs-design-decision'` (the fix is `mechanical` or `clear-best-option`)

Any other combination: skip fix. The bot will post the diagnosis and verify reasoning as a comment, and a human takes it from there. Attempting a fix at medium or low confidence wastes runner minutes and produces noisy diffs that have to be thrown away.
Any other combination: skip fix. The bot posts the diagnosis (including the proposed fix or, for a design decision, the options) and verify reasoning as a comment, and a human takes it from there. The gate is deliberately broader than the old `confidence === 'high'` rule, which conflated "is the cause certain?" with "is the fix obvious?" and starved the fix stage of real, fixable bugs. The output is not a merge -- it is a candidate branch the reporter is asked to verify and a maintainer reviews -- so a clear, test-backed fix is worth attempting even when it is more than a one-line change.
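
The updated gate can be read as a single predicate. The sketch below is illustrative only: the type and function names are hypothetical, and the field names (`verdict`, `confidence`, `fixApproach`) are taken from the expressions in the list above rather than from the real schema, which this document does not show.

```typescript
// Hypothetical types mirroring the values named in the skill text.
type Verdict = "bug" | "intended-behavior" | "unclear";
type Confidence = "high" | "medium" | "low";
type FixApproach = "mechanical" | "clear-best-option" | "needs-design-decision";

interface VerifyResult {
  verdict: Verdict;
}

interface DiagnoseResult {
  confidence: Confidence;
  fixApproach: FixApproach;
}

// All three conditions must hold; any other combination skips the fix stage.
function shouldRunFix(verify: VerifyResult, diagnose: DiagnoseResult): boolean {
  return (
    verify.verdict === "bug" &&
    diagnose.confidence !== "low" &&
    diagnose.fixApproach !== "needs-design-decision"
  );
}
```

Under this reading, a `medium` + `clear-best-option` diagnosis of a confirmed bug now reaches the fix stage, whereas the old `confidence === 'high'` rule would have stopped it.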

When you do invoke fix, carry its result forward. Fix returns whether the change actually built and tested clean, a conventional-commit-style message, the list of files changed, and notes. The orchestrator is responsible for committing and pushing -- you do not.
The fix stage runs on a cheaper model than the reasoning stages: diagnose has already produced a concrete plan, so fix is guided implementation rather than open-ended investigation. Carry its result forward. Fix returns whether the change actually built and tested clean, a conventional-commit-style message, the list of files changed, and notes. The orchestrator is responsible for committing and pushing -- you do not.

## Output

24 changes: 16 additions & 8 deletions .flue/skills/diagnose/SKILL.md
@@ -35,19 +35,27 @@ You read code. You do not modify it. No edits, no test runs, no demo boots. The
- Lingui `t` called at module scope.
- Physical Tailwind class (`ml-*`, `text-left`) where a logical class belongs.
5. **Pin the location.** Identify the file and the smallest range of lines that contain the bug. A single line is ideal; a function-sized range is acceptable when the bug is structural. If you cannot get below file-level, you do not yet have a diagnosis -- search more.
6. **Rate confidence honestly.**
- **High** -- the root cause is mechanical and obvious. There is one line or a tightly-scoped block that, when changed in a specific way, would fix the bug without ambiguity. A junior engineer pointed at this code would arrive at the same fix.
- **Medium** -- you have identified the right code, but the correct fix involves design choices (which behaviour is the right one, whether to add a new parameter, whether to change the contract). A maintainer needs to decide before code is written.
- **Low** -- there are multiple plausible causes and you cannot rule them out without instrumentation or further testing. Or the candidate code is the right area but no specific bug is visible in it.
Rate down, not up. The fix stage only runs at `high`; over-rating produces wasted runs and rejected diffs.
7. **Write hypothesis notes when confidence is below high.** What else might be going on? What would you test to find out? This is the most valuable part of the comment for a maintainer reading a `medium` or `low` diagnosis.
6. **Rate your confidence in the root cause.** This axis is only about how sure you are that you have found the code responsible -- _not_ about how easy the fix is. Keep the two separate; the next step rates the fix.
- **High** -- you traced the symptom to a specific file and line range and can explain the mechanism end to end. Another engineer reading your diagnosis would agree this is the cause.
- **Medium** -- you have the right area and a strong candidate, but you could not fully confirm the mechanism (reproduce was skipped or failed, or there is a second plausible cause you cannot rule out by reading alone).
- **Low** -- multiple plausible causes you cannot distinguish without instrumentation, or the candidate code is the right area but no specific defect is visible in it.
Rate honestly in both directions. The fix stage does not run at `low`, but it _does_ run at `medium` when the fix is clear, so do not reflexively rate down -- a confidently-located cause is `high` even when the fix involves choosing between options. That choice is the next field's job, not this one's.
7. **Choose a fix approach.** This is independent of confidence. Judge how clear the _fix_ is, given the cause:
- **mechanical** -- there is one obviously-correct change: a single line or tightly-scoped block, no judgement calls. (A missing `await`, a wrong comparison operator, a missing `locale` filter.)
- **clear-best-option** -- the fix is bigger than a one-liner, or several shapes exist, but one is clearly the right call: it is backwards-compatible, matches patterns already in the codebase, and the reproduce test can confirm it. Name that option and say why it beats the alternatives. (Example: issue #1178 hard-codes `c.title` in a SELECT; probing the column list and selecting `title` only when it exists is backwards-compatible and matches the bug's shape, whereas every alternative either breaks the documented API or is a larger redesign. The sibling code in the same file is often direct evidence of intended behaviour -- if one branch already does the right thing, mirroring it is `clear-best-option`, not a design decision.)
- **needs-design-decision** -- choosing correctly requires a judgement only a maintainer should make: a new public API or option, a shared component that does not exist yet, a behavioural-contract change, or a security / performance tradeoff. Do not guess; lay out the options.
The fix stage runs for `mechanical` and `clear-best-option` and defers `needs-design-decision` to a human. Do not retreat to `needs-design-decision` just because more than one fix is conceivable -- reserve it for when the _right_ choice genuinely belongs to a maintainer.
8. **Write the proposed fix, always.** For `mechanical` / `clear-best-option`: describe the specific change -- which file, what to add/remove/change, and how the reproduce test proves it -- in enough detail that the fix stage can implement it directly without re-deriving your reasoning. (A cheaper model implements it; the more concrete your plan, the better the result.) For `needs-design-decision`: lay out the viable options and the tradeoff that distinguishes them, and name your recommendation if you have one. This becomes the maintainer's starting point.
9. **Write hypothesis notes for alternative _causes_.** Distinct from the proposed fix (which is about the remedy): what _other_ root causes did you consider, and how did you rule them in or out? Empty only when the cause is genuinely unambiguous. This is the most valuable part of the comment for a maintainer reading a `medium` or `low` diagnosis.

## Output

Return:

- A root cause: the file path with approximate line number (e.g. `packages/core/src/api/handlers/menus.ts:142`), followed by prose explaining what is wrong and why it produces the reported symptom.
- A confidence rating: `high`, `medium`, or `low`.
- Hypothesis notes: empty if confidence is `high`; otherwise a short paragraph listing the alternative causes you considered and what would distinguish them.
- A confidence rating in the root cause: `high`, `medium`, or `low`.
- A fix approach: `mechanical`, `clear-best-option`, or `needs-design-decision`.
- A proposed fix: the concrete change to make (`mechanical` / `clear-best-option`) or the options a maintainer must choose between (`needs-design-decision`). Never empty.
- Hypothesis notes: the alternative _causes_ you considered and what distinguishes them; empty only when the cause is unambiguous.

Be specific. "Probably in the menu code somewhere" is not a diagnosis. "`resolveContentUrl` in `packages/core/src/menus/index.ts:87` issues three queries per item and the third is the missing-locale fallback path -- on a primary-locale request it is dead code, but it still runs" is.
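
One way to picture the updated output is as a typed record plus a sanity check. This is a sketch under stated assumptions: the `DiagnoseOutput` interface, its field names other than `confidence` and `fixApproach`, and the `validateDiagnose` helper are all hypothetical. The third check, which flags empty hypothesis notes below `high` confidence, is an inference from "empty only when the cause is unambiguous", not a rule the skill states in those words.

```typescript
type Confidence = "high" | "medium" | "low";
type FixApproach = "mechanical" | "clear-best-option" | "needs-design-decision";

// Hypothetical shape for the five items in the output list.
interface DiagnoseOutput {
  rootCause: string; // e.g. "packages/core/src/api/handlers/menus.ts:142 ..."
  confidence: Confidence;
  fixApproach: FixApproach;
  proposedFix: string;
  hypothesisNotes: string;
}

// Returns a list of problems; an empty list means the output looks well-formed.
function validateDiagnose(out: DiagnoseOutput): string[] {
  const problems: string[] = [];
  // The root cause should carry a path with an approximate line number.
  if (!/\S+:\d+/.test(out.rootCause)) {
    problems.push("rootCause should name a file path with an approximate line number");
  }
  // The proposed fix is required for every fix approach.
  if (out.proposedFix.trim() === "") {
    problems.push("proposedFix must never be empty");
  }
  // Assumption: a cause rated below `high` is not unambiguous, so notes are expected.
  if (out.confidence !== "high" && out.hypothesisNotes.trim() === "") {
    problems.push("hypothesisNotes should list alternative causes below high confidence");
  }
  return problems;
}
```
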
8 changes: 6 additions & 2 deletions .flue/skills/fix/SKILL.md
@@ -5,7 +5,11 @@ description: Write the fix when verify says bug and diagnose says high confidenc

# Fix

You are only here because verify returned `bug` and diagnose returned `high` confidence. The orchestrator decided this is worth attempting an automated fix. Your job is to write that fix, prove it works, leave the working tree in a state the orchestrator can commit and push, and report what you did.
You are here because verify returned `bug`, diagnose pinned the cause with at least `medium` confidence, and diagnose rated the fix `mechanical` or `clear-best-option`. Diagnose handed you a **proposed fix** -- a concrete plan naming the file and the change. Your job is to implement that plan, prove it works, leave the working tree in a state the orchestrator can commit, and report what you did. The hard reasoning is already done; do not re-litigate the diagnosis unless reading the code convinces you it is wrong (in which case abandon -- see below).

Read diagnose's proposed fix first and treat it as your spec. Implement that change. If, once you are in the code, the plan turns out to be wrong or incomplete, do not improvise a different large change -- abandon with `fixed: false` and say why, so a human can re-diagnose.

**What your output is, and is not.** You are not merging anything, and you are not even opening a PR. The orchestrator pushes your staged change to a `bot/fix-<n>` branch and asks the original reporter to install a preview build and confirm it resolves their issue. A maintainer reviews before anything lands on `main`. So the bar is "a correct, conventions-respecting change that makes the reproduce test pass" -- not "a perfect, unimprovable patch." A clear, test-backed fix is worth shipping for verification even when it is more than a one-liner. Equally: do not gold-plate, do not expand scope, do not refactor beyond the diagnosed bug.

You can edit source. You can run tests, lint, typecheck, and format. You cannot commit, push, open a PR, or touch any GitHub state.

@@ -23,7 +27,7 @@ You can edit source. You can run tests, lint, typecheck, and format. You cannot

1. **Re-read the diagnose root cause.** That is your target. The fix should land in the file and approximate line diagnose named. If your work drifts to a different file, stop and reconsider -- diagnose may have been wrong, in which case the right answer is to abandon, not to wander.
2. **Establish a regression test where one is feasible.** Reproduce confirmed the bug through agent-browser, not a test, so there is usually no failing test on disk yet. If the bug is unit- or integration-testable (a handler, a query, a pure function, an API route), write a `vitest` test now that fails for the reported reason -- run it with `pnpm --filter <package> test <path>` and confirm it fails before you touch the fix. A bug with a testable surface and no regression test is not fixed. If the bug only manifests in the browser (admin UI interaction, rendered output), do not write a browser test -- the bot cannot run one reliably here; instead verify the fix through agent-browser and describe the manual verification in your notes so the maintainer can add a durable test when landing it.
3. **Write the smallest fix that resolves the bug.** Follow EmDash's conventions:
3. **Implement diagnose's proposed fix -- the smallest change that fully resolves the bug.** Start from the plan diagnose gave you; the change should land in the file and approximate line it named. Follow EmDash's conventions:
- Internal imports end with `.js`. Type-only imports use `import type`.
- Routes that change state start with `export const prerender = false;`.
- Never interpolate values into SQL. Use Kysely's `sql` tagged template; use `sql.ref()` for identifiers; validate dynamic identifiers with `validateIdentifier()` before any `sql.raw()`.
2 changes: 1 addition & 1 deletion .flue/skills/verify/SKILL.md
@@ -45,4 +45,4 @@ Return:
- A verdict: `bug`, `intended-behavior`, or `unclear`.
- Reasoning: the prose that supports the verdict, with paths to the comments, docs, or tests you relied on.

The orchestrator uses your verdict as a gate. `bug` plus a `high`-confidence diagnose triggers the fix stage. Anything else stops here and produces a comment-only outcome.
The orchestrator uses your verdict as a gate. `bug` triggers the fix stage when diagnose also pinned the cause (confidence not `low`) and rated the fix `mechanical` or `clear-best-option`. A `bug` whose fix `needs-design-decision`, an `unclear` verdict, or `intended-behavior` all stop here and produce a comment-only outcome for a maintainer.