EveryInc · jaybna · May 24, 2026
diff --git a/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md b/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
diff --git a/docs/brainstorms/2026-05-28-ce-deep-review-requirements.md b/docs/brainstorms/2026-05-28-ce-deep-review-requirements.md
diff --git a/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md b/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
diff --git a/docs/plans/2026-05-28-001-feat-ce-deep-review-skill-plan.md b/docs/plans/2026-05-28-001-feat-ce-deep-review-skill-plan.md
diff --git a/docs/plans/2026-05-28-002-feat-ce-deep-review-skill-plan.md b/docs/plans/2026-05-28-002-feat-ce-deep-review-skill-plan.md
diff --git a/docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md b/docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md
diff --git a/docs/skills/ce-deep-review-onboarding.md b/docs/skills/ce-deep-review-onboarding.md
@@ -0,0 +1,68 @@
+# ce-deep-review — onboarding & setup
+
+`ce-deep-review-beta` runs a high-stakes plan through the Claude `ce-doc-review` panel **plus**
+one or more non-Claude reviewer CLIs (cross-model decorrelation), verifies every cross-model
+finding against the plan, and writes a reconciled sidecar. This doc covers the per-developer
+setup it needs.
+
+> **You are responsible for vendor data-handling.** When you opt a model in at the consent gate,
+> the plan content is sent to that vendor. Configure each vendor with an appropriate
+> data-handling policy (paid plan + DPA where applicable) per your organization's requirements
+> before opting in. The skill does not verify this for you.
+
+## v1 cross-model arms (status as of 2026-05-28 Phase 0 validation)
+
+| Arm | Status | Why |
+|---|---|---|
+| **codex** (OpenAI) | ✅ available | `-s read-only` posture; strong, precise reviewer (clean negative control in prior eval) |
+| **agy** (Antigravity) | ✅ available, OS-sandboxed | Viable on 1.0.3; read-only floor enforced via a macOS seatbelt profile (agy's own flags don't confine the FS) |
+| **grok** (xAI) | ⏸️ deferred | grok 0.2.8 headless reviewer is blocked by a relay-auth bug; re-enabled after a grok fix/version bump (see `docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md`) |
+
+You need **at least one** arm available. With none, the skill still runs the Claude panel and
+writes a `*.panel-review.md` (it will not degrade silently, but it does not refuse to run).
+
+## codex
+
+- Install the OpenAI `codex` CLI and sign in so it runs non-interactively.
+- Verify: `codex exec -s read-only --skip-git-repo-check - <<<'say hi'` returns a response.
+- No env var required; auth is via codex's own login.
+
+## agy (Antigravity)
+
+- Install `agy` (Antigravity CLI) and sign in to a **paid Antigravity plan**, accepting the
+  appropriate **DPA** with Google for the content you'll send.
+- Auth lands at `~/.gemini/oauth_creds.json` (OAuth). agy refreshes the token on use via its
+  `refresh_token`, so a stale `expiry_date` is fine.
+- Verify: `agy -p "say hi"` returns a non-empty response.
+- **Posture:** agy's `--sandbox` flag does **not** restrict the filesystem, so `ce-deep-review`
+  runs agy inside a macOS `sandbox-exec` (seatbelt) profile that blocks writes to the repo under
+  review and to credential and dotfile paths at the process boundary (reads are not blocked).
+  No action needed from you; just be aware the floor is OS-enforced.
+
+## grok (xAI) — deferred
+
+`grok login` authenticates you, and `grok models` will show you logged in — but grok 0.2.8's
+**headless `-p` reviewer** currently fails (`Transport channel closed / AuthorizationRequired` at
+the WebSocket-relay layer), independent of login state. grok is therefore deferred from v1. When a
+future grok version fixes the relay path, re-run the U1 validation and re-enable the arm with the
+documented posture (clean cwd + `--tools ""` + `--permission-mode plan` + `--disable-web-search`
++ `--no-subagents` + `--sandbox read-only` + a generous `--max-turns`).
+
+## gitleaks (recommended, not required)
+
+The consent gate previews your plan for secret/PII-shaped content before egress using `gitleaks`.
+
+- Install: `brew install gitleaks`.
+- If gitleaks is **not** installed, the gate still opens but shows a "content preview unavailable —
+  you are the sole filter" notice and escalates the responsibility acknowledgment. Installing it
+  upgrades the preview from manual-only to automated + manual.
+
+## First run
+
+```
+/ce-deep-review-beta docs/plans/<your-plan>.md
+```
+
+(The beta is invoked explicitly — typed slash command or an explicit skill call. It does not
+auto-trigger.) You'll get the Claude panel, then a consent gate listing the arms available in your
+environment (default: none selected — opt in per model), then a verified reconciled sidecar at
+`<plan>.deep-review.md`.
diff --git a/docs/solutions/skill-design/2026-05-28-agy-arm-posture-validation.md b/docs/solutions/skill-design/2026-05-28-agy-arm-posture-validation.md
@@ -0,0 +1,128 @@
+---
+title: "agy arm posture validation (ce-deep-review Phase 0, U2): agy 1.0.3 is a viable reviewer but needs an OS sandbox for the read-only floor"
+date: 2026-05-28
+last_updated: 2026-05-28
+category: skill-design
+module: compound-engineering / cross-model-review-eval
+tags: [cross-model, agy, antigravity, sandbox, seatbelt, validation, ce-deep-review, phase-0]
+problem_type: integration_issue
+---
+
+# agy arm posture validation (ce-deep-review Phase 0 / U2)
+
+Empirical re-validation of Antigravity (`agy`) as a cross-model reviewer arm for `ce-deep-review`,
+run 2026-05-28. Plan: `docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md` (U2).
+Supersedes the agy verdict in `cross-model-eval-first-run-2026-05-25.md` ("agy stays dropped"),
+which was measured on agy **1.0.2**.
+
+## Verdict: agy 1.0.3 is VIABLE, but its read-only floor must be enforced by an OS sandbox
+
+- ✅ **Viability fixed in 1.0.3.** agy returns clean, specific JSON review findings. The 1.0.2
+  failure that got it dropped (empty output / its own CLI monologue) is **gone**.
+- 🔴 **agy has no flag that delivers R5's read-only/no-tools floor.** `--sandbox` does **not**
+  confine the filesystem; agy has FS read+write tools and no web-search-disable flag.
+- ✅ **Decision (Phase 0 gate):** enforce the floor at the **OS layer** (macOS `sandbox-exec`/
+  seatbelt), not via agy flags. PoC confirms OS write-confinement works; the validated production
+  profile is a deny-write-only denylist (see "OS sandbox" below).
+
+## Environment
+
+- `agy 1.0.3` at `~/.local/bin/agy`.
+- Auth: OAuth credentials at `~/.gemini/oauth_creds.json` (keys: `access_token`, `refresh_token`,
+  `id_token`, `expiry_date`, `scope`, `token_type`). agy state under `~/.gemini/antigravity-cli/`.
+
+## Viability (the headline change from 1.0.2)
+
+`agy --print "<review instruction>"` with the doc on **stdin** returns a clean JSON array of
+findings. On a benign planted-flaw doc it correctly surfaced both planted issues
+(destructive-before-confirm; plaintext-password) as specific, well-phrased findings, exit 0, no
+monologue, no tool use. The output parses directly with the existing `arms.py`
+`parse_findings()`, which tolerates a `` ```json `` fence. **The 1.0.2 blocker is fixed; agy is a
+usable reviewer on 1.0.3.**
+
+## CLI surface (agy 1.0.3)
+
+- `-p`/`--print`/`--prompt`: single-shot, prints the response; prompt via arg **or stdin**.
+- `--print-timeout <dur>` (default 5m).
+- `--sandbox`: boolean ("terminal restrictions enabled").
+- `--add-dir <dir>` (add workspace dir); `--dangerously-skip-permissions`; `--continue`.
+- **No `--approval-mode`/`--permission-mode`/plan-mode. No `--output-format`. No `--disable-web-search`.**
+
+This confirms the brainstorm's agy surface assumptions. Note: the harness passes the doc via
+**stdin** (not `-p "<inline>"`), so the plan's earlier "`-p` argument-length cap" concern is moot.
+
+## Offline auth signal (R9) — do NOT gate on expiry
+
+`~/.gemini/oauth_creds.json` carries `expiry_date` in **ms**. Observed: `expiry_date` was ~52h in
+the **past**, yet `agy --print` still worked — agy **silently refreshes** via the `refresh_token`.
+
+**R9 offline-detection rule for agy:** "available" iff `~/.gemini/oauth_creds.json` exists, is
+non-empty JSON, and contains a non-empty `refresh_token`. **Do NOT** require `expiry_date` in the
+future — that would false-negative (mark agy unavailable when it actually works). This corrects the
+v3 plan's assumed expiry check and the brainstorm's `AV_API_KEY` env-var assumption (no env var is
+used).
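+
+The rule above can be sketched as a small offline check. This is a minimal sketch, not the
+harness's actual code: the function name `agy_available` and the constant `AGY_CREDS` are
+hypothetical; only the file path and the `refresh_token` key come from the observations above.
+
+```python
+import json
+from pathlib import Path
+
+# Path taken from this doc; the constant and function names are hypothetical.
+AGY_CREDS = Path.home() / ".gemini" / "oauth_creds.json"
+
+
+def agy_available(creds_path: Path = AGY_CREDS) -> bool:
+    """Offline check: True iff the creds file holds a non-empty refresh_token."""
+    try:
+        data = json.loads(creds_path.read_text(encoding="utf-8"))
+    except (OSError, ValueError):
+        # Missing, unreadable, empty, or non-JSON file: treat as unavailable.
+        return False
+    if not isinstance(data, dict):
+        return False
+    token = data.get("refresh_token")
+    # Deliberately ignore expiry_date: agy refreshes a stale token on use.
+    return isinstance(token, str) and bool(token.strip())
+```
+
+Because the check never reads `expiry_date`, a credentials file whose token expired days ago
+still counts as available, which matches the observed behavior.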
+
+## Posture floor: agy flags do NOT enforce it
+
+R5 requires every non-Claude arm to run read-only, no-web-search, no-tools — symmetric with codex
+`-s read-only`. Empirical test (`agy --sandbox`, clean cwd, prompted to read an out-of-workspace
+sentinel and write a canary):
+
+- 🔴 **Read leak:** agy **read** `/var/folders/.../secret.txt` (outside the workspace) and printed the sentinel token.
+- 🔴 **Write:** agy **created** `/tmp/agy-canary-*.txt`.
+- No `--disable-web-search` exists, so the web-search tool can't be flag-disabled either.
+
+`--sandbox` restricts *terminal command execution*, not the FS read/write tools. **No agy flag
+combination delivers R5's floor** — so, per R5/U2, the agy arm would be "unavailable" unless the
+floor is supplied externally.
+
+Normal operation caveat: when given a plain review prompt (doc on stdin, "return findings"), agy
+does **not** touch the filesystem — the leak/write only happened because the prompt explicitly
+asked. But R5's floor is a hard guarantee, not a best-effort behavior, and a hostile/garbage plan
+doc could induce FS access. Hence the OS sandbox.
+
+## OS sandbox (the chosen mechanism) — PoC + status
+
+Decision: wrap every non-Claude arm in a macOS `sandbox-exec` (seatbelt) profile that enforces the
+floor at the process boundary, independent of the CLI's own flags (the same seatbelt mechanism grok
+uses internally).
+
+**Iteration (2026-05-28) — what failed, what works:**
+- ✅ A `(deny file-write*)` rule does **block** agy's writes at the OS layer (PoC: `$HOME` canary
+  never created).
+- ❌ **A blanket `(deny file-write*)` (deny-all, which would need an allowlist) HANGS agy**
+  (>11–25 min, ignoring its own `--print-timeout`): it retries denied writes to un-allowlisted
+  state paths and blocks at the syscall level. An allowlist is not practical because its write-set
+  is too large/dynamic to enumerate (denials don't surface in the sandbox log).
+- ❌ **Any `(deny file-read* ...)` rule ALSO hangs agy** (it stats `~/.config`-ish paths during
+  init and wedges on a denied read).
+- ✅ **`(deny file-write* <specific paths>)` with `(allow default)` works** — agy writes its own
+  state freely (no hang) and reviews cleanly, while writes to the named sensitive paths are blocked.
+
+**Validated production floor — deny-WRITE-only denylist.** Template:
+`scripts/eval/cross_model_review/validation/agy-readonly.sb.tmpl` (substitute `__REPO_DIR__` +
+`__HOME__`). `(allow default)` then `(deny file-write* ...)` for: the repo under review, `~/.ssh`,
+`~/.aws`, `~/.config/gcloud`, `~/.zshrc`, `~/.gitconfig`, `~/.netrc`. Network allowed (vendor API).
+Invoke: `sandbox-exec -f <generated.sb> agy --print ...` from a clean cwd.
+
+- **Gotcha — canonicalize paths.** macOS seatbelt matches canonical paths; a `mktemp -d`
+  `/var/folders/...` path silently won't match its `/private/var/...` real path (deny won't fire).
+  Substitute the **real** repo path (`git rev-parse --show-toplevel` + `pwd -P`; `/Users/...` is
+  already canonical).
+- **Validated by `agy-smoke.sh`** (committed alongside the template): `PASS(floor)` write-to-repo
+  blocked + `PASS(viable)` agy returns 2 findings on the sentinel under the sandbox. Re-runnable.
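+
+The canonicalization gotcha can be handled when the template is rendered. The sketch below
+assumes the template is plain text containing the `__REPO_DIR__` and `__HOME__` placeholders
+named above; the function name `render_profile` is hypothetical, and the sketch does not escape
+quote characters in paths.
+
+```python
+import os
+from pathlib import Path
+from typing import Optional
+
+
+def render_profile(template_text: str, repo_dir: str, home: Optional[str] = None) -> str:
+    """Fill __REPO_DIR__ and __HOME__ with canonical (symlink-resolved) paths."""
+    # realpath resolves symlinks, so a /var/folders/... path becomes its
+    # /private/var/... real path on macOS and the deny rule can match.
+    real_repo = os.path.realpath(repo_dir)
+    real_home = os.path.realpath(home if home is not None else str(Path.home()))
+    return (
+        template_text
+        .replace("__REPO_DIR__", real_repo)
+        .replace("__HOME__", real_home)
+    )
+```
+
+Resolving both paths in one place means callers cannot accidentally pass a symlinked temp
+directory straight into the profile.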
+
+**What this floor does and doesn't enforce:**
+- ✅ Blocks agy modifying the repo, credentials (`~/.ssh`/`~/.aws`/gcloud), and shell/git dotfiles.
+- ✅ Network allowed for the vendor API; combined with clean cwd, agy has no ambient repo context.
+- ⚠️ **Does NOT block agy *reading* secrets** (deny-read hangs agy). Secret-read-then-exfil via an
+  induced/injected finding is a **documented residual**, mitigated by: clean cwd, a review-only
+  prompt, and the fact that v1 reviews the user's *own* internal plans. It is a real prompt-injection
+  vector for **untrusted** docs — out of scope for v1's threat model; revisit if untrusted-doc review
+  is ever in scope (would need a confinable agy or an OS read-jail agy tolerates).
+
+**Integration point (for the harness work, post-Phase-0):** `arms.py`'s agy branch should generate
+the concrete `.sb` from the template (real repo path + `$HOME`) and wrap the agy invocation in
+`sandbox-exec -f <profile>`. The arm continues to pass the doc via stdin from a clean cwd.
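+
+A minimal sketch of that wrapping, under stated assumptions: `agy_argv` and `run_agy` are
+hypothetical names, not the existing `arms.py` API, and the outer timeout is an added safeguard
+(chosen because agy was seen ignoring its own `--print-timeout` when wedged), not something the
+validation prescribes. Only `sandbox-exec -f`, `agy --print`, stdin, and the clean cwd come from
+the findings above.
+
+```python
+import subprocess
+import tempfile
+from typing import List
+
+
+def agy_argv(profile_path: str, instruction: str) -> List[str]:
+    """Command line for agy confined by a generated seatbelt profile."""
+    return ["sandbox-exec", "-f", profile_path, "agy", "--print", instruction]
+
+
+def run_agy(profile_path: str, instruction: str, doc_text: str,
+            timeout_s: float = 600.0) -> subprocess.CompletedProcess:
+    """Run agy from a fresh empty cwd with the doc on stdin (macOS only)."""
+    with tempfile.TemporaryDirectory() as clean_cwd:
+        return subprocess.run(
+            agy_argv(profile_path, instruction),
+            input=doc_text,        # doc goes via stdin, not inline in the prompt
+            text=True,
+            capture_output=True,
+            cwd=clean_cwd,         # no ambient repo context
+            timeout=timeout_s,     # raises TimeoutExpired if agy hangs
+            check=False,           # caller inspects returncode and stdout
+        )
+```
+
+Keeping the argv builder separate from the runner lets the command line be checked on any
+platform, while `run_agy` itself only works where `sandbox-exec` and `agy` are installed.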
+
+## Phase 0 gate consequence
+
+agy is **viable and accepted for v1, confined via the OS sandbox** (not via agy flags). Combined
+with grok being dropped (separate doc), v1's cross-model arms are **codex + agy**. The brainstorm's
+R5 (agy posture) and Dependencies/Assumptions (auth mechanism) are corrected accordingly.
diff --git a/docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md b/docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md
@@ -0,0 +1,109 @@
+---
+title: "grok arm posture validation (ce-deep-review Phase 0, U1): grok 0.2.8 headless is blocked by a relay-auth bug — deferred from v1"
+date: 2026-05-28
+last_updated: 2026-05-28
+category: skill-design
+module: compound-engineering / cross-model-review-eval
+tags: [cross-model, grok, sandbox, validation, ce-deep-review, phase-0]
+problem_type: integration_issue
+---
+
+# grok arm posture validation (ce-deep-review Phase 0 / U1)
+
+Empirical validation of the Grok Build CLI as a cross-model reviewer arm for `ce-deep-review`,
+run on the original developer's machine on 2026-05-28. Plan: `docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md` (U1).
+
+## Verdict: grok DEFERRED from v1 (Phase 0 gate "drop grok")
+
+grok 0.2.8 **cannot complete a single headless `-p` review on this machine** due to a
+worker/relay authentication bug. The arm *design* is sound (all required flags exist; the
+sandbox posture is ideal), so this is "drop from v1 and re-test on a version bump," not "wrong
+approach." v1 ships without grok (codex + agy).
+
+## Environment
+
+- `grok 0.2.8 (730d2470cda)` at `~/.grok/bin/grok`.
+- Auth: `~/.grok/auth.json` (OIDC cached token). `grok models` reports "You are logged in with
+  grok.com" — **shell-level auth is healthy** (log: `auth_mode: Oidc`, `is_expired: false`,
+  `cached_token handler set api_key (SessionToken)`).
+- Offline auth signal (R9): presence of `~/.grok/auth.json` containing a non-empty
+  `https://auth.x.ai::<id>` scope entry. (No `XAI_API_KEY` env var in use; no flat `expires_at`.)
+
+## CLI surface (grok 0.2.8) — all U1-assumed flags are present
+
+Confirmed via `grok --help`:
+
+- `--permission-mode <MODE>` — values include `plan` (read-only). ✅
+- `--disable-web-search` ✅
+- `--sandbox <PROFILE>` (env `GROK_SANDBOX`) ✅
+- `-p, --single <PROMPT>` (single-turn, prints to stdout and exits) ✅; also `--prompt-file <PATH>`, `--prompt-json <JSON>`.
+- `--output-format <plain|json|streaming-json>` ✅
+- `--no-subagents`, `--verbatim`, `--max-turns <N>`, `--cwd <CWD>` ✅
+- Tool control: `--tools <allowlist>`, `--disallowed-tools`, `--allow`, `--deny`.
+
+The brainstorm's grok flag assumptions hold against 0.2.8. (`--max-turns 1`, however, is **wrong** — see below.)
+
+## Sandbox posture (validated, ideal) — `read-only`
+
+grok ships **built-in seatbelt profiles** (custom ones live in `~/.grok/sandbox.toml`). From
+`~/.grok/sandbox-events.jsonl` (`platform: macos/seatbelt`, `enforced: true`):
+
+| profile | restrict_network | workspace writable | notes |
+|---|---|---|---|
+| `workspace` | false | yes | default dev posture |
+| `read-only` | **true** | **no** (RW only `~/.grok` + tmp) | **ideal arm posture** |
+| `strict` | true | yes (system paths RO) | workspace RW |
+
+`read-only` gives the floor R5 wants: the model's web-search/fetch **tools** are network-blocked
+and the workspace is not writable. (grok's own control-plane API to xAI is a separate transport,
+not blocked by the tool-network restriction — so the arm can still produce a review.)
+
+## The blocker: headless `-p` worker relay-auth failure
+
+Every headless `-p` invocation fails:
+
+```
+ERROR worker quit with fatal: Transport channel closed, when Auth(AuthorizationRequired)
+ERROR error= Internal error: "max_turns exceeded: limit is N, but got N+2 messages"
+```
+
+Reproduced under all of:
+- trivial prompt (`say hi`), full review prompt; `--output-format json` and `plain`;
+- clean cwd (`--cwd <tmp>`), tools disabled (`--tools ""`), `--no-subagents`, `--disable-web-search`, `--permission-mode plan`;
+- `--max-turns` 1, 5, 8, 10, 30 (message count always creeps ~2 over the limit — the worker spins retrying the failed auth, burning messages until max-turns trips);
+- **after `grok login`** (user re-ran) and **after `grok agent --reauth`** (user re-ran).
+
+Root cause (diagnosed via `~/.grok/logs/unified.jsonl` + `grok agent headless --help`): the
+**shell** process auths fine, but the headless **agent worker runs "over the Grok WebSocket
+relay"** (a separate auth path from the shell login), and *that* relay auth fails with
+`AuthorizationRequired`. Because the shell login is healthy, neither `grok login` nor
+`grok agent --reauth` clears it. This is a grok 0.2.8 headless/relay bug on this machine, not a
+stale credential.
+
+Secondary observation (isolation): with tools enabled in the repo cwd, grok went **agentic** —
+it tried to use the `qmd` MCP and search `docs/plans/` ("There are many plans in docs/plans/…
+qmd__search") instead of reviewing the inline text. Confirms the arm must run from a **clean cwd
+with tools disabled** (both to keep it a single-shot reviewer and to prevent ambient-repo egress).
+
+## When grok is fixed: the validated would-be posture
+
+**Re-probe:** `scripts/eval/cross_model_review/validation/grok-smoke.sh` runs the intended posture
+against the sentinel and reports `BLOCKED` (relay bug still present) vs `PASS` (relay fixed → arm
+can ship). Run it after any grok version bump. Land this in `arms.py` once it passes:
+
+```
+grok --cwd <clean-tmp> -p "<lens rubric + doc>" \
+  --output-format json --disable-web-search --no-subagents \
+  --tools "" --permission-mode plan --sandbox read-only \
+  --max-turns <adequate, NOT 1>
+```
+
+- `--max-turns 1` is wrong: a single review uses ~6+ internal messages. Use a generous bound (or omit) so a legitimate single-shot review isn't cut off.
+- `--sandbox read-only` enforces the FS+network-tool floor at the seatbelt layer (defense-in-depth beyond `--permission-mode plan` + `--tools ""`).
+- Pass the doc via stdin or `--prompt-file` (consistent with the harness's isolation model).
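+
+For when the relay bug is fixed, the posture above could be assembled as follows. This is a
+sketch only: `grok_argv` is a hypothetical helper, and the default of 30 turns is a placeholder
+for "generous", not a validated value. Every flag is one confirmed in the CLI surface section.
+
+```python
+from typing import List
+
+
+def grok_argv(clean_cwd: str, prompt: str, max_turns: int = 30) -> List[str]:
+    """Command line for the would-be grok reviewer posture."""
+    if max_turns <= 1:
+        # A single review was observed to need several internal messages.
+        raise ValueError("max_turns must be greater than 1")
+    return [
+        "grok", "--cwd", clean_cwd, "-p", prompt,
+        "--output-format", "json",
+        "--disable-web-search", "--no-subagents",
+        "--tools", "",                  # empty allowlist: tools disabled
+        "--permission-mode", "plan",
+        "--sandbox", "read-only",
+        "--max-turns", str(max_turns),
+    ]
+```
+
+Passing the arguments as a list (rather than a shell string) keeps the empty `--tools` value
+intact as its own argument.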
+
+## Phase 0 gate consequence
+
+Per the plan's Phase 0 gate ("grok validation fails → drop grok from v1"): **grok is dropped from
+v1.** Combined with the agy posture finding (separate doc), v1's cross-model arms are codex + agy
+(with agy confined via an OS sandbox — see the agy validation doc). Re-test grok on a version bump.