46 commits
5ace511
docs(cross-model-review-eval): add brainstorm and eval plan
jaybna May 24, 2026
759cc57
feat(cross-model-eval): add corpus manifest stub and pre-registration…
jaybna May 24, 2026
1de3c89
feat(cross-model-eval): add runner skeleton, record schema, and carri…
jaybna May 24, 2026
e98bf5a
feat(cross-model-eval): add CLI arms (b, c) with arm-b isolation prob…
jaybna May 24, 2026
4d36a30
feat(cross-model-eval): add in-process arm prompts (U4)
jaybna May 25, 2026
27ffe44
feat(cross-model-eval): add blinded judge rubric, dedup/integrity, an…
jaybna May 25, 2026
251f153
feat(cross-model-eval): add decision-artifact template (U7)
jaybna May 25, 2026
0ca59ed
fix(cross-model-eval): make arm-b runnable and parse numbered finding…
jaybna May 25, 2026
313e8e9
docs(cross-model-eval): capture arm-isolation methodology lesson
jaybna May 25, 2026
b0a5cbc
feat(cross-model-eval): add critique.sh one-liner for quick cross-mod…
jaybna May 25, 2026
c9cd048
docs(cross-model-eval): generalize the run reference (drop internal r…
jaybna May 25, 2026
3ae2683
feat(cross-model-eval): add known-bug corpus builder for the code-rev…
jaybna May 25, 2026
78c4dc3
feat(cross-model-eval): add GT-match scoring for the code-review brea…
jaybna May 25, 2026
4cea60f
feat(cross-model-eval): wire the code-review eval pipeline end-to-end
jaybna May 25, 2026
d73a25d
fix(cross-model-eval): give GT-match findings arm-opaque unique ids t…
jaybna May 25, 2026
3f01a4d
docs(cross-model-eval): record the first code-review run findings
jaybna May 25, 2026
2db9787
feat(cross-model-eval): add corpus quality gate to scan-fixes
jaybna May 25, 2026
a317091
docs(cross-model-eval): record the gated second run + the metric finding
jaybna May 25, 2026
aa5d82f
feat(cross-model-eval): add finding-yield metric alongside GT-match
jaybna May 25, 2026
c5f6700
feat(cross-model-eval): wire gemini CLI as the Gemini-family arm; dep…
jaybna May 25, 2026
c86da36
docs(cross-model-eval): record run 3 (gemini + finding-yield) and the…
jaybna May 25, 2026
8cefdd9
test(cross-model-eval): genericize a code-path test fixture filename
jaybna May 26, 2026
8e80383
fix(cross-model-eval): switch critique.sh from deprecated agy to gemini
jaybna May 26, 2026
1dfba11
feat(cross-model-eval): add panel-critique.sh (fair 6-lens compare) +…
jaybna May 26, 2026
82f4325
docs(cross-model-eval): correct the plan-review finding after the fai…
jaybna May 26, 2026
93e0ef6
fix(cross-model-eval): make finding-count reliable (JSON output) + co…
jaybna May 26, 2026
e1d9ee0
feat(cross-model-eval): add a non-Claude blind judge (the last decisi…
jaybna May 26, 2026
9fa8a7a
docs(cross-model-eval): add the decision-grade run procedure (pre-reg…
jaybna May 26, 2026
1706594
docs(cross-model-eval): decision-grade run record (non-Claude judge, …
jaybna May 26, 2026
dba311d
fix(cross-model-eval): harden the decision spine against invalid/part…
jaybna May 26, 2026
6072ec5
feat(cross-model-eval): seatbelt deny-write floor for the agy arm + P…
jaybna May 28, 2026
e1226ed
docs(ce-deep-review): Phase 0 arm-posture findings, onboarding, plan …
jaybna May 28, 2026
006b709
feat(ce-deep-review-beta): Phase 1 thin slice — panel + consent gate …
jaybna May 28, 2026
aa55833
docs(ce-deep-review): discoverability for the beta thin slice
jaybna May 28, 2026
766c730
fix(ce-deep-review-beta): make consent legible to the auto-mode egres…
jaybna May 29, 2026
09d7f73
docs(ce-deep-review): add v4 plan reconciled to committed reality; ma…
jaybna May 29, 2026
335bf6c
docs(ce-deep-review): reconcile v4 plan to OD-4 resolution (dogfood #2)
jaybna May 29, 2026
cf1b388
feat(ce-deep-review): make agy (Antigravity) the default cross-model …
jaybna May 29, 2026
72ca96b
chore(ce-deep-review): retire the gemini arm from the skill (EOL 2026…
jaybna May 29, 2026
df80731
feat(ce-deep-review): parallelize cross-model dispatch + define --mod…
jaybna May 29, 2026
c3badcf
feat(ce-deep-review): verify cross-model findings with a quote-grep b…
jaybna May 29, 2026
62fa59b
feat(ce-deep-review): reconcile into the verified .deep-review.md sid…
jaybna May 29, 2026
b9cd13a
feat(ce-deep-review): verifier-rate measurement + RU6 validation/cleanup
jaybna May 29, 2026
2cc9c70
fix(ce-deep-review): fold markdown emphasis in verify-findings normalize
jaybna May 29, 2026
becffb9
fix(review): address Codex review feedback on #858
jaybna May 29, 2026
1ae8b3b
fix(ce-deep-review): whitelist verify/reconcile helpers + collision-s…
jaybna May 29, 2026
152 changes: 152 additions & 0 deletions docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md

Large diffs are not rendered by default.

228 changes: 228 additions & 0 deletions docs/brainstorms/2026-05-28-ce-deep-review-requirements.md

Large diffs are not rendered by default.

334 changes: 334 additions & 0 deletions docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md

Large diffs are not rendered by default.

544 changes: 544 additions & 0 deletions docs/plans/2026-05-28-001-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

637 changes: 637 additions & 0 deletions docs/plans/2026-05-28-002-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

484 changes: 484 additions & 0 deletions docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

357 changes: 357 additions & 0 deletions docs/plans/2026-05-28-004-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/skills/README.md
@@ -125,6 +125,7 @@ Invoked when a specific need arises — not part of any chain.
| Skill | Description |
|-------|-------------|
| [`/ce-polish-beta`](./ce-polish-beta.md) | Conversational UX polish — start dev server, open browser, iterate together; auto-detects 8 frameworks |
| [`/ce-deep-review-beta`](./ce-deep-review.md) | Deep cross-model plan review — Claude panel + consent-gated non-Claude reviewer CLIs for decorrelated findings (thin slice: findings unverified) |

---

94 changes: 94 additions & 0 deletions docs/skills/ce-deep-review-onboarding.md
@@ -0,0 +1,94 @@
# ce-deep-review — onboarding & setup

`ce-deep-review-beta` runs a high-stakes plan through the Claude `ce-doc-review` panel **plus**
one or more non-Claude reviewer CLIs (cross-model decorrelation), verifies every cross-model
finding against the plan, and writes a reconciled sidecar. This doc covers the per-developer
setup it needs.

> **You are responsible for vendor data-handling.** When you opt a model in at the consent gate,
> the plan content is sent to that vendor. You are responsible for having configured each vendor
> with an appropriate data-handling policy (paid plan + DPA where applicable) per your
> organization's requirements. The skill does not verify this for you.

## v1 cross-model arms (status as of 2026-05-28 Phase 0 validation)

| Arm | Status | Why |
|---|---|---|
| **codex** (OpenAI) | ✅ available | `-s read-only` posture; strong, precise reviewer (clean negative control in prior eval) |
| **agy** (Antigravity) | ✅ available, OS-sandboxed | Viable on 1.0.3; read-only floor enforced via a macOS seatbelt profile (agy's own flags don't confine the FS) |
| **grok** (xAI) | ⏸️ deferred | grok 0.2.8 headless reviewer is blocked by a relay-auth bug; re-enabled after a grok fix/version bump (see `docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md`) |

You need **at least one** arm available. With none, the skill still runs the Claude panel and
writes a `*.panel-review.md` (it reports that no cross-model arm ran rather than refusing to run).

## codex

- Install the OpenAI `codex` CLI and sign in so it runs non-interactively.
- Verify: `codex exec -s read-only --skip-git-repo-check - <<<'say hi'` returns a response.
- No env var required; auth is via codex's own login.

## agy (Antigravity)

- Install `agy` (Antigravity CLI) and sign in to a **paid Antigravity plan**, accepting the
appropriate **DPA** with Google for the content you'll send.
- Auth lands at `~/.gemini/oauth_creds.json` (OAuth; agy auto-refreshes via its `refresh_token`,
so a stale `expiry_date` is fine — it refreshes on use).
- Verify: `agy -p "say hi"` returns a non-empty response.
- **Posture:** agy's `--sandbox` flag does **not** restrict the filesystem, so `ce-deep-review`
runs agy inside a macOS `sandbox-exec` (seatbelt) profile that blocks writes to the repo under
review and to sensitive paths (credentials, shell and git dotfiles) at the process boundary. Reads
are not confined. No action needed from you; just be aware the floor is OS-enforced.

## grok (xAI) — deferred

`grok login` authenticates you, and `grok models` will show you logged in — but grok 0.2.8's
**headless `-p` reviewer** currently fails (`Transport channel closed / AuthorizationRequired` at
the WebSocket-relay layer), independent of login state. grok is therefore deferred from v1. When a
future grok version fixes the relay path, re-run the U1 validation and re-enable the arm with the
documented posture (clean cwd + `--tools ""` + `--permission-mode plan` + `--disable-web-search`
+ `--no-subagents` + `--sandbox read-only` + a generous `--max-turns`).

## gitleaks (recommended, not required)

The consent gate previews your plan for secret/PII-shaped content before egress using `gitleaks`.

- Install: `brew install gitleaks`.
- If gitleaks is **not** installed, the gate still opens but shows a "content preview unavailable —
you are the sole filter" notice and escalates the responsibility acknowledgment. Installing it
upgrades the preview from manual-only to automated + manual.
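
A minimal sketch of the fallback described above, assuming the gate only needs to know whether a `gitleaks` binary is on `PATH`. The function name `preview_mode` and the two returned labels are hypothetical and are not taken from the skill's code.

```python
import shutil
from typing import Callable, Optional


def preview_mode(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return how the consent gate can preview the plan before egress.

    Hypothetical helper: the labels are illustrative, not the skill's own.
    """
    # shutil.which returns the binary's path, or None when it is not on PATH.
    if which("gitleaks"):
        return "automated+manual"
    return "manual-only"


if __name__ == "__main__":
    print(preview_mode())
```

Passing `which` as a parameter keeps the check testable without touching the real `PATH`.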

## Egress permission (auto-mode)

The cross-model dispatch shells out to send your plan to the consented vendors. Under Claude
Code's **default auto-mode**, that `bash` call is screened by a permission classifier that reasons
about whether the conversation authorized the egress — the skill's `allowed-tools` declaration is
**not** sufficient on its own (verified 2026-05-28).

- **Interactive runs (the normal case):** no setup needed. The consent gate's options are phrased
as explicit egress authorizations (`Send the plan to agy (Antigravity)`), which is what the
classifier reads. Selecting a model and proceeding clears the dispatch. If a run is still blocked,
the skill restates your consent and retries, then offers to let you re-issue the command via the
`!` prefix.
- **Unattended / headless runs** (no interactive consent turn — e.g. `/loop`, scheduled, or
piped): add a durable allow rule to your settings so the dispatch is pre-authorized. In
`~/.claude/settings.json` (or project `.claude/settings.json`):

```json
{ "permissions": { "allow": ["Bash(bash *panel-critique.sh*)"] } }
```

> **Caveat (untested):** the interactive consent path above is empirically confirmed to clear the
> classifier; whether a `permissions.allow` rule *alone* bypasses it for fully-headless runs is
> not yet verified. Add the rule for headless use, but expect the interactive path to be the
> reliable one until the headless path is confirmed. See
> `docs/solutions/skill-design/2026-05-28-od4-egress-classifier-consent-scope.md`.

## First run

```
/ce-deep-review-beta docs/plans/<your-plan>.md
```

(The beta is invoked explicitly — typed slash command or an explicit skill call. It does not
auto-trigger.) You'll get the Claude panel, then a consent gate listing the arms available in your
environment (default: none selected — opt in per model), then a verified reconciled sidecar at
`<plan>.deep-review.md`.
62 changes: 62 additions & 0 deletions docs/skills/ce-deep-review.md
@@ -0,0 +1,62 @@
# ce-deep-review (beta)

> **Beta.** `ce-deep-review-beta` is invoked explicitly (it does not auto-trigger). Cross-model
> findings are **verdict-tagged by a deterministic quote-grep backstop** (CONFIRMED / NOT-FOUND-IN-DOC
> / NEEDS-HUMAN) and reconciled into the verified `.deep-review.md` sidecar; NEEDS-HUMAN findings
> still need your judgment.

## What it does

Runs a high-stakes plan through two passes:

1. **Claude panel (no egress)** — invokes `ce-doc-review` headless: the six-persona panel
(coherence, feasibility, security, scope, product, adversarial).
2. **Cross-model panel (egress, with consent)** — after a single consent gate, fans the plan
across the non-Claude reviewer CLIs you opt in to, for *decorrelated* findings the Claude panel
may have missed.

It then verifies each cross-model finding against the plan (a deterministic quote-grep backstop —
CONFIRMED / NOT-FOUND-IN-DOC / NEEDS-HUMAN, blind to the producing model), reconciles them with the
trusted panel findings, and writes a verified sidecar next to the plan at `<plan>.deep-review.md`
(`skill_phase: verified`). An existing verified sidecar is rotated to `<plan>.deep-review.<ISO>.md`
(5 most recent kept); a thin-slice `<plan>.deep-review-draft.md`, if present, is left in place.
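
The quote-grep backstop can be pictured with a short sketch. It assumes each cross-model finding carries a quote it claims to have taken from the plan, and that a finding with no usable quote falls to NEEDS-HUMAN; both are assumptions, not confirmed behavior of the skill. The names `normalize` and `verdict` are hypothetical.

```python
import re

_EMPHASIS = re.compile(r"[*_`]+")
_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Fold markdown emphasis markers and collapse whitespace for comparison."""
    without_emphasis = _EMPHASIS.sub("", text)
    return _WHITESPACE.sub(" ", without_emphasis).strip().lower()


def verdict(quote: str, plan_text: str) -> str:
    """Classify one finding by its quoted evidence alone.

    The producing model's name is not an input, so the check stays blind to it.
    """
    needle = normalize(quote)
    if not needle:
        # Assumption: nothing to grep for means a person has to judge the finding.
        return "NEEDS-HUMAN"
    if needle in normalize(plan_text):
        return "CONFIRMED"
    return "NOT-FOUND-IN-DOC"
```

Because the check is a plain substring match on normalized text, the same inputs always give the same verdict.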

## How it differs from `ce-doc-review`

`ce-doc-review` is the no-egress single-panel review. `ce-deep-review` adds cross-model
decorrelation — sending the plan to external vendors (with explicit per-model consent) to surface
issues a single model family tends to miss. Use it for genuinely high-stakes plans (irreversible
migrations, credentials, privacy, data cutover), not routine ones.

## Arms (v1)

| Arm | Status |
|-----|--------|
| **codex** (OpenAI) | available |
| **agy** (Antigravity) | available — the non-codex arm; macOS-only (its read-only floor is a seatbelt) |
| **grok** (xAI) | deferred — blocked by a grok 0.2.8 headless relay-auth bug |
| **gemini** (Google) | retired from the skill (410s 2026-06-18); arm retained only in the cross-model eval |

You need at least one arm installed + authenticated. With none, the skill runs the Claude panel and
writes a `*.panel-review.md` (it reports that no cross-model arm ran rather than refusing to run).

## Consent & safety

- **One gate.** Before any egress, a single interaction previews the plan for secret-shaped content
(`gitleaks`, if installed), takes **per-model opt-in** (default: none selected), and captures your
acknowledgment that you are responsible for each vendor's data-handling policy.
- **Graceful without gitleaks.** If `gitleaks` isn't installed the gate still opens, tells you no
automated scan ran (you're the sole filter), and escalates the acknowledgment.
- **Egress = consent.** Only the models you select receive the plan.

See [`ce-deep-review-onboarding.md`](./ce-deep-review-onboarding.md) for per-CLI setup (codex, agy
paid-plan + DPA, gitleaks).

## Quick use

```
/ce-deep-review-beta docs/plans/my-plan.md
```

You'll get the Claude panel, then a consent gate listing the arms available in your environment,
then the cross-model findings + a sidecar.
128 changes: 128 additions & 0 deletions docs/solutions/skill-design/2026-05-28-agy-arm-posture-validation.md
@@ -0,0 +1,128 @@
---
title: "agy arm posture validation (ce-deep-review Phase 0, U2): agy 1.0.3 is a viable reviewer but needs an OS sandbox for the read-only floor"
date: 2026-05-28
last_updated: 2026-05-28
category: skill-design
module: compound-engineering / cross-model-review-eval
tags: [cross-model, agy, antigravity, sandbox, seatbelt, validation, ce-deep-review, phase-0]
problem_type: integration_issue
---

# agy arm posture validation (ce-deep-review Phase 0 / U2)

Empirical re-validation of Antigravity (`agy`) as a cross-model reviewer arm for `ce-deep-review`,
run 2026-05-28. Plan: `docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md` (U2).
Supersedes the agy verdict in `cross-model-eval-first-run-2026-05-25.md` ("agy stays dropped"),
which was measured on agy **1.0.2**.

## Verdict: agy 1.0.3 is VIABLE, but its read-only floor must be enforced by an OS sandbox

- ✅ **Viability fixed in 1.0.3.** agy returns clean, specific JSON review findings. The 1.0.2
failure that got it dropped (empty output / its own CLI monologue) is **gone**.
- 🔴 **agy has no flag that delivers R5's read-only/no-tools floor.** `--sandbox` does **not**
confine the filesystem; agy has FS read+write tools and no web-search-disable flag.
- ✅ **Decision (Phase 0 gate):** enforce the floor at the **OS layer** (macOS `sandbox-exec`/
seatbelt), not via agy flags. PoC confirms OS write-confinement works; the production profile is
being finalized (see "OS sandbox" below).

## Environment

- `agy 1.0.3` at `~/.local/bin/agy`.
- Auth: OAuth credentials at `~/.gemini/oauth_creds.json` (keys: `access_token`, `refresh_token`,
`id_token`, `expiry_date`, `scope`, `token_type`). agy state under `~/.gemini/antigravity-cli/`.

## Viability (the headline change from 1.0.2)

`agy --print "<review instruction>"` with the doc on **stdin** returns a clean JSON array of
findings. On a benign planted-flaw doc it correctly surfaced both planted issues
(destructive-before-confirm; plaintext-password) as specific, well-phrased findings, exit 0, no
monologue, no tool use. Parseable directly by the existing `arms.py parse_findings()` (which
tolerates a fenced `json` block). **The 1.0.2 blocker is fixed; agy is a usable reviewer on 1.0.3.**
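
For illustration, fence-tolerant parsing of that output might look like the sketch below. This is not the real `parse_findings()` from `arms.py`; the function name `parse_findings_sketch` and the `title` field used in the usage line are hypothetical, since the finding schema is not shown here.

```python
import json
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block

_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*\n(.*?)\n\s*" + re.escape(FENCE),
    re.DOTALL,
)


def parse_findings_sketch(raw: str) -> list:
    """Parse a JSON array of findings, with or without a surrounding json fence."""
    match = _FENCED.search(raw)
    # Use the fenced body when present, otherwise treat the whole output as JSON.
    payload = match.group(1) if match else raw
    findings = json.loads(payload.strip())
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    return findings
```

Usage: `parse_findings_sketch('[{"title": "plaintext password"}]')` returns a one-element list, and the same array wrapped in a json fence parses to the same value.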

## CLI surface (agy 1.0.3)

`-p`/`--print`/`--prompt` (single-shot, prints response), prompt via arg **or stdin**;
`--print-timeout <dur>` (default 5m); boolean `--sandbox` ("terminal restrictions enabled");
`--add-dir <dir>` (add workspace dir); `--dangerously-skip-permissions`; `--continue`.
**No `--approval-mode`/`--permission-mode`/plan-mode. No `--output-format`. No `--disable-web-search`.**
This confirms the brainstorm's agy surface assumptions. Note: the harness passes the doc via
**stdin** (not `-p "<inline>"`), so the plan's earlier "`-p` argument-length cap" concern is moot.

## Offline auth signal (R9) — do NOT gate on expiry

`~/.gemini/oauth_creds.json` carries `expiry_date` in **ms**. Observed: `expiry_date` was ~52h in
the **past**, yet `agy --print` still worked — agy **silently refreshes** via the `refresh_token`.

**R9 offline-detection rule for agy:** "available" iff `~/.gemini/oauth_creds.json` exists, is
non-empty JSON, and contains a non-empty `refresh_token`. **Do NOT** require `expiry_date` in the
future — that would false-negative (mark agy unavailable when it actually works). This corrects the
v3 plan's assumed expiry check and the brainstorm's `AV_API_KEY` env-var assumption (no env var is
used).
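
The rule above translates almost directly into code. The sketch below is one possible reading of it, with a hypothetical function name `agy_available`; it deliberately never looks at `expiry_date`.

```python
import json
from pathlib import Path

DEFAULT_CREDS = Path.home() / ".gemini" / "oauth_creds.json"


def agy_available(creds_path: Path = DEFAULT_CREDS) -> bool:
    """Offline check: True iff the creds file holds a non-empty refresh_token.

    expiry_date is ignored on purpose, since agy refreshes a stale token on use.
    """
    try:
        text = creds_path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return False  # missing, unreadable, or not text
    if not text.strip():
        return False
    try:
        creds = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(creds, dict):
        return False
    token = creds.get("refresh_token")
    return isinstance(token, str) and bool(token.strip())
```

The check makes no network call, so it can run before the consent gate lists the available arms.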

## Posture floor: agy flags do NOT enforce it

R5 requires every non-Claude arm to run read-only, no-web-search, no-tools — symmetric with codex
`-s read-only`. Empirical test (`agy --sandbox`, clean cwd, prompted to read an out-of-workspace
sentinel and write a canary):

- 🔴 **Read leak:** agy **read** `/var/folders/.../secret.txt` (outside the workspace) and printed the sentinel token.
- 🔴 **Write:** agy **created** `/tmp/agy-canary-*.txt`.
- No `--disable-web-search` exists, so the web-search tool can't be flag-disabled either.

`--sandbox` restricts *terminal command execution*, not the FS read/write tools. **No agy flag
combination delivers R5's floor** — so, per R5/U2, the agy arm would be "unavailable" unless the
floor is supplied externally.

Normal operation caveat: when given a plain review prompt (doc on stdin, "return findings"), agy
does **not** touch the filesystem — the leak/write only happened because the prompt explicitly
asked. But R5's floor is a hard guarantee, not a best-effort behavior, and a hostile/garbage plan
doc could induce FS access. Hence the OS sandbox.

## OS sandbox (the chosen mechanism) — PoC + status

Decision: wrap every non-Claude arm in a macOS `sandbox-exec` (seatbelt) profile that enforces the
floor at the process boundary, independent of the CLI's own flags (the same seatbelt mechanism grok
uses internally).

**Iteration (2026-05-28) — what failed, what works:**
- ✅ A `(deny file-write*)` profile **blocks** agy's writes (PoC: `$HOME` canary never created).
- ❌ **`(deny file-write*)` (deny-all, allowlist needed) HANGS agy** (>11–25 min, ignoring its own
`--print-timeout`): it retries denied writes to un-allowlisted state paths and blocks at the
syscall level. Its write-set is too large/dynamic to enumerate (denials don't surface in the
sandbox log).
- ❌ **Any `(deny file-read* ...)` rule ALSO hangs agy** (it stats `~/.config`-ish paths during
init and wedges on a denied read).
- ✅ **`(deny file-write* <specific paths>)` with `(allow default)` works** — agy writes its own
state freely (no hang) and reviews cleanly, while writes to the named sensitive paths are blocked.

**Validated production floor — deny-WRITE-only denylist.** Template:
`scripts/eval/cross_model_review/validation/agy-readonly.sb.tmpl` (substitute `__REPO_DIR__` +
`__HOME__`). `(allow default)` then `(deny file-write* ...)` for: the repo under review, `~/.ssh`,
`~/.aws`, `~/.config/gcloud`, `~/.zshrc`, `~/.gitconfig`, `~/.netrc`. Network allowed (vendor API).
Invoke: `sandbox-exec -f <generated.sb> agy --print ...` from a clean cwd.

- **Gotcha — canonicalize paths.** macOS seatbelt matches canonical paths; a `mktemp -d`
`/var/folders/...` path silently won't match its `/private/var/...` real path (deny won't fire).
Substitute the **real** repo path (`git rev-parse --show-toplevel` + `pwd -P`; `/Users/...` is
already canonical).
- **Validated by `agy-smoke.sh`** (committed alongside the template): `PASS(floor)` write-to-repo
blocked + `PASS(viable)` agy returns 2 findings on the sentinel under the sandbox. Re-runnable.
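
The canonicalization gotcha can be handled while rendering the template. The following is a sketch only: `render_profile` is a hypothetical name, and it assumes the template uses exactly the two placeholders named above.

```python
import os


def render_profile(template_text: str, repo_dir: str, home_dir: str) -> str:
    """Fill the seatbelt template with canonical (symlink-resolved) paths."""
    # realpath resolves symlinks such as /var -> /private/var on macOS, so the
    # deny rules name the path that seatbelt actually matches against.
    real_repo = os.path.realpath(repo_dir)
    real_home = os.path.realpath(home_dir)
    return (
        template_text
        .replace("__REPO_DIR__", real_repo)
        .replace("__HOME__", real_home)
    )
```

Resolving both paths at render time means a `mktemp -d` repo path and a `/Users/...` path go through the same code, and the already-canonical one is simply returned unchanged.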

**What this floor does and doesn't enforce:**
- ✅ Blocks agy modifying the repo, credentials (`~/.ssh`/`~/.aws`/gcloud), and shell/git dotfiles.
- ✅ Network allowed for the vendor API; combined with clean cwd, agy has no ambient repo context.
- ⚠️ **Does NOT block agy *reading* secrets** (deny-read hangs agy). Secret-read-then-exfil via an
induced/injected finding is a **documented residual**, mitigated by: clean cwd, a review-only
prompt, and the fact that v1 reviews the user's *own* internal plans. It is a real prompt-injection
vector for **untrusted** docs — out of scope for v1's threat model; revisit if untrusted-doc review
is ever in scope (would need a confinable agy or an OS read-jail agy tolerates).

**Integration point (for the harness work, post-Phase-0):** `arms.py`'s agy branch should generate
the concrete `.sb` from the template (real repo path + `$HOME`) and wrap the agy invocation in
`sandbox-exec -f <profile>`. The arm continues to pass the doc via stdin from a clean cwd.
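
A rough sketch of that wrapping, assuming the profile has already been generated. The function names and the 600-second outer timeout are hypothetical; the timeout is included only because agy was observed ignoring its own `--print-timeout` when wedged.

```python
import subprocess
import tempfile
from typing import List


def sandboxed_agy_argv(profile_path: str, instruction: str) -> List[str]:
    """Build the argv that runs agy under a seatbelt profile."""
    return ["sandbox-exec", "-f", profile_path, "agy", "--print", instruction]


def run_agy_review(profile_path: str, instruction: str, doc_text: str,
                   timeout_s: float = 600.0) -> str:
    """Run one review from a clean cwd with the doc on stdin (macOS only)."""
    with tempfile.TemporaryDirectory() as clean_cwd:
        result = subprocess.run(
            sandboxed_agy_argv(profile_path, instruction),
            input=doc_text,        # the plan goes in via stdin, not via -p
            capture_output=True,
            text=True,
            cwd=clean_cwd,         # no ambient repo context
            timeout=timeout_s,     # outer guard in case agy hangs
            check=True,
        )
    return result.stdout
```

Keeping argv construction separate from execution lets the command shape be unit-tested on machines without `sandbox-exec` or agy installed.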

## Phase 0 gate consequence

agy is **viable and accepted for v1, confined via the OS sandbox** (not via agy flags). Combined
with grok being dropped (separate doc), v1's cross-model arms are **codex + agy**. The brainstorm's
R5 (agy posture) and Dependencies/Assumptions (auth mechanism) are corrected accordingly.