5ace511
docs(cross-model-review-eval): add brainstorm and eval plan
jaybna May 24, 2026
759cc57
feat(cross-model-eval): add corpus manifest stub and pre-registration…
jaybna May 24, 2026
1de3c89
feat(cross-model-eval): add runner skeleton, record schema, and carri…
jaybna May 24, 2026
e98bf5a
feat(cross-model-eval): add CLI arms (b, c) with arm-b isolation prob…
jaybna May 24, 2026
4d36a30
feat(cross-model-eval): add in-process arm prompts (U4)
jaybna May 25, 2026
27ffe44
feat(cross-model-eval): add blinded judge rubric, dedup/integrity, an…
jaybna May 25, 2026
251f153
feat(cross-model-eval): add decision-artifact template (U7)
jaybna May 25, 2026
0ca59ed
fix(cross-model-eval): make arm-b runnable and parse numbered finding…
jaybna May 25, 2026
313e8e9
docs(cross-model-eval): capture arm-isolation methodology lesson
jaybna May 25, 2026
b0a5cbc
feat(cross-model-eval): add critique.sh one-liner for quick cross-mod…
jaybna May 25, 2026
c9cd048
docs(cross-model-eval): generalize the run reference (drop internal r…
jaybna May 25, 2026
3ae2683
feat(cross-model-eval): add known-bug corpus builder for the code-rev…
jaybna May 25, 2026
78c4dc3
feat(cross-model-eval): add GT-match scoring for the code-review brea…
jaybna May 25, 2026
4cea60f
feat(cross-model-eval): wire the code-review eval pipeline end-to-end
jaybna May 25, 2026
d73a25d
fix(cross-model-eval): give GT-match findings arm-opaque unique ids t…
jaybna May 25, 2026
3f01a4d
docs(cross-model-eval): record the first code-review run findings
jaybna May 25, 2026
2db9787
feat(cross-model-eval): add corpus quality gate to scan-fixes
jaybna May 25, 2026
a317091
docs(cross-model-eval): record the gated second run + the metric finding
jaybna May 25, 2026
aa5d82f
feat(cross-model-eval): add finding-yield metric alongside GT-match
jaybna May 25, 2026
c5f6700
feat(cross-model-eval): wire gemini CLI as the Gemini-family arm; dep…
jaybna May 25, 2026
c86da36
docs(cross-model-eval): record run 3 (gemini + finding-yield) and the…
jaybna May 25, 2026
8cefdd9
test(cross-model-eval): genericize a code-path test fixture filename
jaybna May 26, 2026
8e80383
fix(cross-model-eval): switch critique.sh from deprecated agy to gemini
jaybna May 26, 2026
1dfba11
feat(cross-model-eval): add panel-critique.sh (fair 6-lens compare) +…
jaybna May 26, 2026
82f4325
docs(cross-model-eval): correct the plan-review finding after the fai…
jaybna May 26, 2026
93e0ef6
fix(cross-model-eval): make finding-count reliable (JSON output) + co…
jaybna May 26, 2026
e1d9ee0
feat(cross-model-eval): add a non-Claude blind judge (the last decisi…
jaybna May 26, 2026
9fa8a7a
docs(cross-model-eval): add the decision-grade run procedure (pre-reg…
jaybna May 26, 2026
1706594
docs(cross-model-eval): decision-grade run record (non-Claude judge, …
jaybna May 26, 2026
dba311d
fix(cross-model-eval): harden the decision spine against invalid/part…
jaybna May 26, 2026
6072ec5
feat(cross-model-eval): seatbelt deny-write floor for the agy arm + P…
jaybna May 28, 2026
e1226ed
docs(ce-deep-review): Phase 0 arm-posture findings, onboarding, plan …
jaybna May 28, 2026
006b709
feat(ce-deep-review-beta): Phase 1 thin slice — panel + consent gate …
jaybna May 28, 2026
aa55833
docs(ce-deep-review): discoverability for the beta thin slice
jaybna May 28, 2026
766c730
fix(ce-deep-review-beta): make consent legible to the auto-mode egres…
jaybna May 29, 2026
09d7f73
docs(ce-deep-review): add v4 plan reconciled to committed reality; ma…
jaybna May 29, 2026
335bf6c
docs(ce-deep-review): reconcile v4 plan to OD-4 resolution (dogfood #2)
jaybna May 29, 2026
cf1b388
feat(ce-deep-review): make agy (Antigravity) the default cross-model …
jaybna May 29, 2026
72ca96b
chore(ce-deep-review): retire the gemini arm from the skill (EOL 2026…
jaybna May 29, 2026
df80731
feat(ce-deep-review): parallelize cross-model dispatch + define --mod…
jaybna May 29, 2026
c3badcf
feat(ce-deep-review): verify cross-model findings with a quote-grep b…
jaybna May 29, 2026
62fa59b
feat(ce-deep-review): reconcile into the verified .deep-review.md sid…
jaybna May 29, 2026
b9cd13a
feat(ce-deep-review): verifier-rate measurement + RU6 validation/cleanup
jaybna May 29, 2026
2cc9c70
fix(ce-deep-review): fold markdown emphasis in verify-findings normalize
jaybna May 29, 2026
becffb9
fix(review): address Codex review feedback on #858
jaybna May 29, 2026
1ae8b3b
fix(ce-deep-review): whitelist verify/reconcile helpers + collision-s…
jaybna May 29, 2026
152 changes: 152 additions & 0 deletions docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md

Large diffs are not rendered by default.

226 changes: 226 additions & 0 deletions docs/brainstorms/2026-05-28-ce-deep-review-requirements.md

Large diffs are not rendered by default.

334 changes: 334 additions & 0 deletions docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md

Large diffs are not rendered by default.

544 changes: 544 additions & 0 deletions docs/plans/2026-05-28-001-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

637 changes: 637 additions & 0 deletions docs/plans/2026-05-28-002-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

483 changes: 483 additions & 0 deletions docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md

Large diffs are not rendered by default.

68 changes: 68 additions & 0 deletions docs/skills/ce-deep-review-onboarding.md
@@ -0,0 +1,68 @@
# ce-deep-review — onboarding & setup

`ce-deep-review-beta` runs a high-stakes plan through the Claude `ce-doc-review` panel **plus**
one or more non-Claude reviewer CLIs (cross-model decorrelation), verifies every cross-model
finding against the plan, and writes a reconciled sidecar. This doc covers the per-developer
setup it needs.

> **You are responsible for vendor data-handling.** When you opt a model in at the consent gate,
> the plan content is sent to that vendor. You are responsible for having configured each vendor
> with an appropriate data-handling policy (paid plan + DPA where applicable) per your
> organization's requirements. The skill does not verify this for you.

## v1 cross-model arms (status as of 2026-05-28 Phase 0 validation)

| Arm | Status | Why |
|---|---|---|
| **codex** (OpenAI) | ✅ available | `-s read-only` posture; strong, precise reviewer (clean negative control in prior eval) |
| **agy** (Antigravity) | ✅ available, OS-sandboxed | Viable on 1.0.3; read-only floor enforced via a macOS seatbelt profile (agy's own flags don't confine the FS) |
| **grok** (xAI) | ⏸️ deferred | grok 0.2.8 headless reviewer is blocked by a relay-auth bug; re-enabled after a grok fix/version bump (see `docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md`) |

You need **at least one** arm available. With none, the skill still runs the Claude panel and
writes a `*.panel-review.md`: it reports the missing arms rather than refusing to run.

## codex

- Install the OpenAI `codex` CLI and sign in so it runs non-interactively.
- Verify: `codex exec -s read-only --skip-git-repo-check - <<<'say hi'` returns a response.
- No env var required; auth is via codex's own login.

## agy (Antigravity)

- Install `agy` (Antigravity CLI) and sign in to a **paid Antigravity plan**, accepting the
appropriate **DPA** with Google for the content you'll send.
- Auth lands at `~/.gemini/oauth_creds.json` (OAuth; agy auto-refreshes via its `refresh_token`,
so a stale `expiry_date` is fine — it refreshes on use).
- Verify: `agy -p "say hi"` returns a non-empty response.
- **Posture:** agy's `--sandbox` flag does **not** restrict the filesystem, so `ce-deep-review`
runs agy inside a macOS `sandbox-exec` (seatbelt) profile that blocks writes to the repo under
review, credential directories, and shell/git dotfiles at the process boundary. Reads are not
blocked. No action is needed from you; just be aware that the floor is OS-enforced.

## grok (xAI) — deferred

`grok login` authenticates you, and `grok models` will show you logged in — but grok 0.2.8's
**headless `-p` reviewer** currently fails (`Transport channel closed / AuthorizationRequired` at
the WebSocket-relay layer), independent of login state. grok is therefore deferred from v1. When a
future grok version fixes the relay path, re-run the U1 validation and re-enable the arm with the
documented posture (clean cwd + `--tools ""` + `--permission-mode plan` + `--disable-web-search`
+ `--no-subagents` + `--sandbox read-only` + a generous `--max-turns`).

## gitleaks (recommended, not required)

The consent gate previews your plan for secret/PII-shaped content before egress using `gitleaks`.

- Install: `brew install gitleaks`.
- If gitleaks is **not** installed, the gate still opens but shows a "content preview unavailable —
you are the sole filter" notice and escalates the responsibility acknowledgment. Installing it
upgrades the preview from manual-only to automated + manual.
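
The fallback above can be sketched as a small PATH check. This is a minimal illustration, not the skill's actual gate code: `preview_mode` and the two mode strings are hypothetical names, and the only real call is the standard library's `shutil.which`.

```python
import shutil


def preview_mode(which=shutil.which):
    """Pick the consent-gate preview mode from whether gitleaks is on PATH.

    `preview_mode` and the returned mode strings are hypothetical names
    used for illustration only.
    """
    if which("gitleaks") is not None:
        # gitleaks found: automated scan plus the manual review.
        return "automated+manual"
    # gitleaks missing: the developer is the sole filter.
    return "manual-only"
```

Passing `which` as a parameter keeps the check easy to exercise without touching the real PATH.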

## First run

```
/ce-deep-review-beta docs/plans/<your-plan>.md
```

(The beta is invoked explicitly — typed slash command or an explicit skill call. It does not
auto-trigger.) You'll get the Claude panel, then a consent gate listing the arms available in your
environment (default: none selected — opt in per model), then a verified reconciled sidecar at
`<plan>.deep-review.md`.
128 changes: 128 additions & 0 deletions docs/solutions/skill-design/2026-05-28-agy-arm-posture-validation.md
@@ -0,0 +1,128 @@
---
title: "agy arm posture validation (ce-deep-review Phase 0, U2): agy 1.0.3 is a viable reviewer but needs an OS sandbox for the read-only floor"
date: 2026-05-28
last_updated: 2026-05-28
category: skill-design
module: compound-engineering / cross-model-review-eval
tags: [cross-model, agy, antigravity, sandbox, seatbelt, validation, ce-deep-review, phase-0]
problem_type: integration_issue
---

# agy arm posture validation (ce-deep-review Phase 0 / U2)

Empirical re-validation of Antigravity (`agy`) as a cross-model reviewer arm for `ce-deep-review`,
run 2026-05-28. Plan: `docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md` (U2).
Supersedes the agy verdict in `cross-model-eval-first-run-2026-05-25.md` ("agy stays dropped"),
which was measured on agy **1.0.2**.

## Verdict: agy 1.0.3 is VIABLE, but its read-only floor must be enforced by an OS sandbox

- ✅ **Viability fixed in 1.0.3.** agy returns clean, specific JSON review findings. The 1.0.2
failure that got it dropped (empty output / its own CLI monologue) is **gone**.
- 🔴 **agy has no flag that delivers R5's read-only/no-tools floor.** `--sandbox` does **not**
confine the filesystem; agy has FS read+write tools and no web-search-disable flag.
- ✅ **Decision (Phase 0 gate):** enforce the floor at the **OS layer** (macOS `sandbox-exec`/
seatbelt), not via agy flags. The PoC confirms OS write-confinement works, and the validated
production profile is a deny-write-only denylist (see "OS sandbox" below).

## Environment

- `agy 1.0.3` at `~/.local/bin/agy`.
- Auth: OAuth credentials at `~/.gemini/oauth_creds.json` (keys: `access_token`, `refresh_token`,
`id_token`, `expiry_date`, `scope`, `token_type`). agy state under `~/.gemini/antigravity-cli/`.

## Viability (the headline change from 1.0.2)

`agy --print "<review instruction>"` with the doc on **stdin** returns a clean JSON array of
findings. On a benign planted-flaw doc it correctly surfaced both planted issues
(destructive-before-confirm; plaintext-password) as specific, well-phrased findings, with exit 0,
no monologue, and no tool use. The output is parseable directly by the existing `arms.py`
`parse_findings()`, which tolerates a `json`-tagged code fence around the array.
**The 1.0.2 blocker is fixed; agy is a usable reviewer on 1.0.3.**
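
As a rough illustration of what "tolerates a fence" can mean, the sketch below strips an optional code fence before parsing the JSON array. It is not the real `parse_findings()` from `arms.py`; the function name `parse_findings_sketch` and the `title` field in the usage are assumptions made for the example.

```python
import json

# Built at runtime so this example contains no literal fence line.
FENCE = "`" * 3


def parse_findings_sketch(raw: str) -> list:
    """Parse a JSON array of findings, tolerating an optional code fence.

    Hypothetical stand-in for illustration; not the harness's parser.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a "json" tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    findings = json.loads(text)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    return findings
```

Both a bare array and a fenced array go through the same path, so the caller does not need to know which form the model returned.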

## CLI surface (agy 1.0.3)

`-p`/`--print`/`--prompt` (single-shot, prints response), prompt via arg **or stdin**;
`--print-timeout <dur>` (default 5m); boolean `--sandbox` ("terminal restrictions enabled");
`--add-dir <dir>` (add workspace dir); `--dangerously-skip-permissions`; `--continue`.
**No `--approval-mode`/`--permission-mode`/plan-mode. No `--output-format`. No `--disable-web-search`.**
This confirms the brainstorm's agy surface assumptions. Note: the harness passes the doc via
**stdin** (not `-p "<inline>"`), so the plan's earlier "`-p` argument-length cap" concern is moot.

## Offline auth signal (R9) — do NOT gate on expiry

`~/.gemini/oauth_creds.json` carries `expiry_date` in **ms**. Observed: `expiry_date` was ~52h in
the **past**, yet `agy --print` still worked — agy **silently refreshes** via the `refresh_token`.

**R9 offline-detection rule for agy:** "available" iff `~/.gemini/oauth_creds.json` exists, is
non-empty JSON, and contains a non-empty `refresh_token`. **Do NOT** require `expiry_date` in the
future — that would false-negative (mark agy unavailable when it actually works). This corrects the
v3 plan's assumed expiry check and the brainstorm's `AV_API_KEY` env-var assumption (no env var is
used).
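
The R9 rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's implementation: `creds_text_usable` and `agy_available` are hypothetical names; the credentials path and the `refresh_token` / `expiry_date` keys come from this doc.

```python
import json
from pathlib import Path


def creds_text_usable(text: str) -> bool:
    """True if the text is a JSON object with a non-empty refresh_token."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    if not isinstance(data, dict):
        return False
    token = data.get("refresh_token")
    # expiry_date is deliberately ignored: agy refreshes a stale token on use.
    return isinstance(token, str) and bool(token.strip())


def agy_available(creds_path=None) -> bool:
    """Offline availability check for the agy arm (hypothetical helper)."""
    path = creds_path or Path("~/.gemini/oauth_creds.json").expanduser()
    try:
        return creds_text_usable(Path(path).read_text())
    except (OSError, ValueError):
        # Missing, unreadable, or undecodable file counts as unavailable.
        return False
```

Note that a credentials file whose `expiry_date` is in the past still counts as available, which matches the observed refresh behavior.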

## Posture floor: agy flags do NOT enforce it

R5 requires every non-Claude arm to run read-only, no-web-search, no-tools — symmetric with codex
`-s read-only`. Empirical test (`agy --sandbox`, clean cwd, prompted to read an out-of-workspace
sentinel and write a canary):

- 🔴 **Read leak:** agy **read** `/var/folders/.../secret.txt` (outside the workspace) and printed the sentinel token.
- 🔴 **Write:** agy **created** `/tmp/agy-canary-*.txt`.
- No `--disable-web-search` exists, so the web-search tool can't be flag-disabled either.

`--sandbox` restricts *terminal command execution*, not the FS read/write tools. **No agy flag
combination delivers R5's floor** — so, per R5/U2, the agy arm would be "unavailable" unless the
floor is supplied externally.

Normal operation caveat: when given a plain review prompt (doc on stdin, "return findings"), agy
does **not** touch the filesystem — the leak/write only happened because the prompt explicitly
asked. But R5's floor is a hard guarantee, not a best-effort behavior, and a hostile/garbage plan
doc could induce FS access. Hence the OS sandbox.

## OS sandbox (the chosen mechanism) — PoC + status

Decision: wrap every non-Claude arm in a macOS `sandbox-exec` (seatbelt) profile that enforces the
floor at the process boundary, independent of the CLI's own flags (the same seatbelt mechanism grok
uses internally).

**Iteration (2026-05-28) — what failed, what works:**
- ✅ A `(deny file-write*)` profile **blocks** agy's writes (PoC: `$HOME` canary never created).
- ❌ **`(deny file-write*)` (deny-all, allowlist needed) HANGS agy** (>11–25 min, ignoring its own
`--print-timeout`): it retries denied writes to un-allowlisted state paths and blocks at the
syscall level. Its write-set is too large/dynamic to enumerate (denials don't surface in the
sandbox log).
- ❌ **Any `(deny file-read* ...)` rule ALSO hangs agy** (it stats `~/.config`-ish paths during
init and wedges on a denied read).
- ✅ **`(deny file-write* <specific paths>)` with `(allow default)` works** — agy writes its own
state freely (no hang) and reviews cleanly, while writes to the named sensitive paths are blocked.

**Validated production floor — deny-WRITE-only denylist.** Template:
`scripts/eval/cross_model_review/validation/agy-readonly.sb.tmpl` (substitute `__REPO_DIR__` +
`__HOME__`). `(allow default)` then `(deny file-write* ...)` for: the repo under review, `~/.ssh`,
`~/.aws`, `~/.config/gcloud`, `~/.zshrc`, `~/.gitconfig`, `~/.netrc`. Network allowed (vendor API).
Invoke: `sandbox-exec -f <generated.sb> agy --print ...` from a clean cwd.

- **Gotcha — canonicalize paths.** macOS seatbelt matches canonical paths; a `mktemp -d`
`/var/folders/...` path silently won't match its `/private/var/...` real path (deny won't fire).
Substitute the **real** repo path (`git rev-parse --show-toplevel` + `pwd -P`; `/Users/...` is
already canonical).
- **Validated by `agy-smoke.sh`** (committed alongside the template): `PASS(floor)` write-to-repo
blocked + `PASS(viable)` agy returns 2 findings on the sentinel under the sandbox. Re-runnable.
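
The canonicalization gotcha can be illustrated with a small template renderer. This is a sketch under assumptions: `render_profile` is a hypothetical helper, and `SAMPLE_TEMPLATE` is an illustrative profile written for this example, not the contents of the committed `agy-readonly.sb.tmpl`. Only the `__REPO_DIR__` and `__HOME__` placeholders are taken from this doc.

```python
import os

# Illustrative only; the committed template's exact contents are not shown here.
SAMPLE_TEMPLATE = (
    "(version 1)\n"
    "(allow default)\n"
    '(deny file-write* (subpath "__REPO_DIR__") (subpath "__HOME__/.ssh"))\n'
)


def render_profile(template: str, repo_dir: str, home: str) -> str:
    """Fill the seatbelt template with canonical (symlink-resolved) paths.

    os.path.realpath resolves symlinks, so a /var/folders/... path becomes
    its /private/var/... real path on macOS and the deny rule can match.
    """
    real_repo = os.path.realpath(repo_dir)
    real_home = os.path.realpath(home)
    return (
        template
        .replace("__REPO_DIR__", real_repo)
        .replace("__HOME__", real_home)
    )
```

Resolving the path inside the renderer means callers cannot accidentally pass a non-canonical path that silently disables the deny rule.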

**What this floor does and doesn't enforce:**
- ✅ Blocks agy modifying the repo, credentials (`~/.ssh`/`~/.aws`/gcloud), and shell/git dotfiles.
- ✅ Network allowed for the vendor API; combined with clean cwd, agy has no ambient repo context.
- ⚠️ **Does NOT block agy *reading* secrets** (deny-read hangs agy). Secret-read-then-exfil via an
induced/injected finding is a **documented residual**, mitigated by: clean cwd, a review-only
prompt, and the fact that v1 reviews the user's *own* internal plans. It is a real prompt-injection
vector for **untrusted** docs — out of scope for v1's threat model; revisit if untrusted-doc review
is ever in scope (would need a confinable agy or an OS read-jail agy tolerates).

**Integration point (for the harness work, post-Phase-0):** `arms.py`'s agy branch should generate
the concrete `.sb` from the template (real repo path + `$HOME`) and wrap the agy invocation in
`sandbox-exec -f <profile>`. The arm continues to pass the doc via stdin from a clean cwd.
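
A minimal sketch of that integration point, assuming a profile has already been generated: `agy_command` and `run_agy_review` are hypothetical names, and the outer `timeout_s` value is an assumption added as a guard because agy was observed to ignore its own `--print-timeout` under some profiles. The `sandbox-exec -f` and `agy --print` invocations are the ones documented above.

```python
import subprocess
import tempfile


def agy_command(profile_path: str, instruction: str) -> list:
    """Build the sandboxed agy argv (no shell, so no quoting pitfalls)."""
    return ["sandbox-exec", "-f", profile_path, "agy", "--print", instruction]


def run_agy_review(profile_path, instruction, doc_text, timeout_s=600):
    """Run agy under the seatbelt profile with the doc on stdin.

    Uses a fresh temporary directory as the clean cwd. The timeout is an
    assumed outer guard; subprocess.run kills the child when it expires.
    """
    with tempfile.TemporaryDirectory() as clean_cwd:
        return subprocess.run(
            agy_command(profile_path, instruction),
            input=doc_text,
            text=True,
            capture_output=True,
            cwd=clean_cwd,
            timeout=timeout_s,
        )
```

Keeping the argv builder separate from the runner makes the command shape checkable without agy or `sandbox-exec` installed.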

## Phase 0 gate consequence

agy is **viable and accepted for v1, confined via the OS sandbox** (not via agy flags). Combined
with grok being dropped (separate doc), v1's cross-model arms are **codex + agy**. The brainstorm's
R5 (agy posture) and Dependencies/Assumptions (auth mechanism) are corrected accordingly.
109 changes: 109 additions & 0 deletions docs/solutions/skill-design/2026-05-28-grok-arm-posture-validation.md
@@ -0,0 +1,109 @@
---
title: "grok arm posture validation (ce-deep-review Phase 0, U1): grok 0.2.8 headless is blocked by a relay-auth bug — deferred from v1"
date: 2026-05-28
last_updated: 2026-05-28
category: skill-design
module: compound-engineering / cross-model-review-eval
tags: [cross-model, grok, sandbox, validation, ce-deep-review, phase-0]
problem_type: integration_issue
---

# grok arm posture validation (ce-deep-review Phase 0 / U1)

Empirical validation of the Grok Build CLI as a cross-model reviewer arm for `ce-deep-review`,
run on the original developer's machine on 2026-05-28. Plan: `docs/plans/2026-05-28-003-feat-ce-deep-review-skill-plan.md` (U1).

## Verdict: grok DEFERRED from v1 (Phase 0 gate "drop grok")

grok 0.2.8 **cannot complete a single headless `-p` review on this machine** due to a
worker/relay authentication bug. The arm *design* is sound (all required flags exist; the
sandbox posture is ideal), so this is "drop from v1 and re-test on a version bump," not "wrong
approach." v1 ships without grok (codex + agy).

## Environment

- `grok 0.2.8 (730d2470cda)` at `~/.grok/bin/grok`.
- Auth: `~/.grok/auth.json` (OIDC cached token). `grok models` reports "You are logged in with
grok.com" — **shell-level auth is healthy** (log: `auth_mode: Oidc`, `is_expired: false`,
`cached_token handler set api_key (SessionToken)`).
- Offline auth signal (R9): presence of `~/.grok/auth.json` containing a non-empty
`https://auth.x.ai::<id>` scope entry. (No `XAI_API_KEY` env var in use; no flat `expires_at`.)
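
The offline signal above might be checked as sketched below. The layout of `~/.grok/auth.json` is not documented here, so this example assumes, purely for illustration, that the file is a JSON object whose keys are scope strings; `grok_auth_present` is a hypothetical name and the real structure may differ.

```python
SCOPE_PREFIX = "https://auth.x.ai::"


def grok_auth_present(data) -> bool:
    """True if parsed auth data has a non-empty entry under an x.ai scope key.

    ASSUMPTION: the file is a JSON object keyed by scope strings. Adjust
    this to the real layout before relying on it.
    """
    if not isinstance(data, dict):
        return False
    return any(
        isinstance(key, str) and key.startswith(SCOPE_PREFIX) and bool(value)
        for key, value in data.items()
    )
```

As with agy, the check looks only at presence, not expiry, since no flat `expires_at` field is available to gate on.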

## CLI surface (grok 0.2.8) — all U1-assumed flags are present

Confirmed via `grok --help`:

- `--permission-mode <MODE>` — values include `plan` (read-only). ✅
- `--disable-web-search` ✅
- `--sandbox <PROFILE>` (env `GROK_SANDBOX`) ✅
- `-p, --single <PROMPT>` (single-turn, prints to stdout and exits) ✅; also `--prompt-file <PATH>`, `--prompt-json <JSON>`.
- `--output-format <plain|json|streaming-json>` ✅
- `--no-subagents`, `--verbatim`, `--max-turns <N>`, `--cwd <CWD>` ✅
- Tool control: `--tools <allowlist>`, `--disallowed-tools`, `--allow`, `--deny`.

The brainstorm's grok flag assumptions hold against 0.2.8. (`--max-turns 1`, however, is **wrong** — see below.)

## Sandbox posture (validated, ideal) — `read-only`

grok ships **built-in seatbelt profiles** (custom ones live in `~/.grok/sandbox.toml`). From
`~/.grok/sandbox-events.jsonl` (`platform: macos/seatbelt`, `enforced: true`):

| profile | restrict_network | workspace writable | notes |
|---|---|---|---|
| `workspace` | false | yes | default dev posture |
| `read-only` | **true** | **no** (RW only `~/.grok` + tmp) | **ideal arm posture** |
| `strict` | true | yes (system paths RO) | workspace RW |

`read-only` gives the floor R5 wants: the model's web-search/fetch **tools** are network-blocked
and the workspace is not writable. (grok's own control-plane API to xAI is a separate transport,
not blocked by the tool-network restriction — so the arm can still produce a review.)

## The blocker: headless `-p` worker relay-auth failure

Every headless `-p` invocation fails:

```
ERROR worker quit with fatal: Transport channel closed, when Auth(AuthorizationRequired)
ERROR error= Internal error: "max_turns exceeded: limit is N, but got N+2 messages"
```

Reproduced under all of:
- trivial prompt (`say hi`), full review prompt; `--output-format json` and `plain`;
- clean cwd (`--cwd <tmp>`), tools disabled (`--tools ""`), `--no-subagents`, `--disable-web-search`, `--permission-mode plan`;
- `--max-turns` 1, 5, 8, 10, 30 (message count always creeps ~2 over the limit — the worker spins retrying the failed auth, burning messages until max-turns trips);
- **after `grok login`** (user re-ran) and **after `grok agent --reauth`** (user re-ran).

Root cause (diagnosed via `~/.grok/logs/unified.jsonl` + `grok agent headless --help`): the
**shell** process auths fine, but the headless **agent worker runs "over the Grok WebSocket
relay"** (a separate auth path from the shell login), and *that* relay auth fails with
`AuthorizationRequired`. Because the shell login is healthy, neither `grok login` nor
`grok agent --reauth` clears it. This is a grok 0.2.8 headless/relay bug on this machine, not a
stale credential.

Secondary observation (isolation): with tools enabled in the repo cwd, grok went **agentic** —
it tried to use the `qmd` MCP and search `docs/plans/` ("There are many plans in docs/plans/…
qmd__search") instead of reviewing the inline text. Confirms the arm must run from a **clean cwd
with tools disabled** (both to keep it a single-shot reviewer and to prevent ambient-repo egress).

## When grok is fixed: the validated would-be posture

**Re-probe:** `scripts/eval/cross_model_review/validation/grok-smoke.sh` runs the intended posture
against the sentinel and reports `BLOCKED` (relay bug still present) vs `PASS` (relay fixed → arm
can ship). Run it after any grok version bump. Land this in `arms.py` once it passes:

```
grok --cwd <clean-tmp> -p "<lens rubric + doc>" \
--output-format json --disable-web-search --no-subagents \
--tools "" --permission-mode plan --sandbox read-only \
--max-turns <adequate, NOT 1>
```

- `--max-turns 1` is wrong: a single review uses ~6+ internal messages. Use a generous bound (or omit) so a legitimate single-shot review isn't cut off.
- `--sandbox read-only` enforces the FS+network-tool floor at the seatbelt layer (defense-in-depth beyond `--permission-mode plan` + `--tools ""`).
- Pass the doc via stdin or `--prompt-file` (consistent with the harness's isolation model).
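
For when the relay bug is fixed, the posture above could be assembled in the harness roughly like this. It is a sketch only: `grok_command` is a hypothetical helper, and the default `max_turns=30` is an assumed "generous" value, not a validated one. All flags are the ones confirmed via `grok --help` in this doc.

```python
def grok_command(clean_cwd: str, prompt: str, max_turns: int = 30) -> list:
    """Build the would-be grok reviewer argv with the documented posture."""
    if max_turns <= 1:
        # A single review uses several internal messages, so 1 cuts it off.
        raise ValueError("max_turns must be greater than 1")
    return [
        "grok",
        "--cwd", clean_cwd,
        "-p", prompt,
        "--output-format", "json",
        "--disable-web-search",
        "--no-subagents",
        "--tools", "",  # empty allowlist disables tools
        "--permission-mode", "plan",
        "--sandbox", "read-only",
        "--max-turns", str(max_turns),
    ]
```

Building the argv as a list keeps the empty `--tools ""` value intact, which is easy to lose when a command is composed as a shell string.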

## Phase 0 gate consequence

Per the plan's Phase 0 gate ("grok validation fails → drop grok from v1"): **grok is dropped from
v1.** Combined with the agy posture finding (separate doc), v1's cross-model arms are codex + agy
(with agy confined via an OS sandbox — see the agy validation doc). Re-test grok on a version bump.