feat(ce-deep-review): verified cross-model deep-review skill (RU1–RU6) + the eval harness behind it #858
jaybna wants to merge 46 commits into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… README (U1) Schema stub with placeholder pre-registration values and one example entry per subset; concrete corpus and threshold/N values are filled at run time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er tests (U2) Resolves the P1 record-store seam: a single canonical record schema both producers validate against; orchestrator records are file-dropped into a shared run dir and pooled by reading all records; the circuit breaker and per-arm timeout apply only to CLI arms the runner spawns. 8 deterministic-carrier tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e (U3) Resolves the P1 arm-b isolation gap (AD2): argv+stdin assembly (no doc interpolation), arm-b clean-cwd + HOME/CODEX_HOME env overrides, and a positive sentinel leak-detection probe proving the model had no context rather than only checking that isolation flags were set. 15 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline (arm a) and self-critic (arm d) prompts produced by the orchestrator via in-process subagent dispatch and ingested as schema-conformant records; arm d makes no external call and no document egress (AE4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d aggregation (U5, U6) U5: per-finding blinded judge rubric (anchored 0/25/50/75/100), cross-arm dedup, and the blind-integrity verdict (above-chance arm-guessing -> confounded). U6: three-way aggregation (build:<arm> / build_nothing / inconclusive) with known-failure subset as the primary signal, below-N and negative-control-moved forcing inconclusive. 20 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
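The three-way rule described for U6 can be sketched as follows. This is a minimal illustration, not the harness's aggregation code: the `ArmStats` shape, the `decide` name, and the reading of the threshold as a known-failure hit rate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    arm: str
    known_failure_hits: int  # hits on the known-failure subset (primary signal)
    n: int                   # number of scored known-failure docs for this arm


def decide(stats, threshold, min_n, blind_confounded=False, negative_control_moved=False):
    """Return 'build:<arm>', 'build_nothing', or 'inconclusive'.

    Validity failures win over any score: a confounded blind-integrity
    verdict, a moved negative control, or a below-N arm forces 'inconclusive'.
    """
    if blind_confounded or negative_control_moved:
        return "inconclusive"
    if any(s.n < max(min_n, 1) for s in stats):
        return "inconclusive"
    # Hypothetical scoring: hit rate on the known-failure subset.
    passing = [s for s in stats if s.known_failure_hits / s.n >= threshold]
    if not passing:
        return "build_nothing"
    best = max(passing, key=lambda s: s.known_failure_hits / s.n)
    return f"build:{best.arm}"
```

Checking the validity conditions before any scoring keeps a high hit count from overriding a run that the controls have already marked as untrustworthy.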
Three-way decision record (build:<arm> / build nothing / inconclusive) filled from U6 aggregates at eval-run time, with primary known-failure table, validity checks (blind-integrity, negative control, power), and secondary tie-breakers. Framed as cross-model critique per R10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (smoke-found) A live N=1 smoke run surfaced two bugs the stubbed unit tests could not: (1) arm-b's isolation stripped HOME, which killed codex auth, and the clean CWD tripped codex's trusted-dir check; reworked to keep HOME (auth preserved) + clean CWD + --skip-git-repo-check, so the only b-vs-c delta is arm c's injected fixed context. (2) parse_findings only split '-'/'*' bullets, collapsing codex/agy numbered critiques into one finding; now splits numbered lists too. Adds the run-arm entry point that makes the live path invokable. Re-smoke: arm b ok (17 findings, isolated workdir), arm c ok. 21 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
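The numbered-list fix in (2) can be illustrated with a small parser. This is a sketch under assumptions, not the committed `parse_findings`: the regex, the handling of continuation lines, and the decision to drop any preamble before the first item are choices made for the example.

```python
import re

# An item starts with '-', '*', or a number followed by '.' or ')', then whitespace.
_ITEM_START = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def parse_findings(text):
    """Split a model critique into one string per bullet or numbered item."""
    findings = []
    current = None
    for line in text.splitlines():
        if _ITEM_START.match(line):
            if current is not None:
                findings.append(current)
            current = _ITEM_START.sub("", line, count=1).strip()
        elif current is not None and line.strip():
            # Continuation line: fold it into the item being built.
            current += " " + line.strip()
    if current is not None:
        findings.append(current)
    return findings
```

Requiring whitespace after the marker keeps a line that opens with `**bold**` from being mistaken for a `*` bullet.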
When comparing review models/configs, every arm must get identical controlled context; unequal context (e.g., repo access for in-process subagent arms but not isolated CLI arms) confounds the comparison and makes context masquerade as model diversity. Captured from a live run that this confound nearly led to a wrong build/no-build conclusion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el critique Wraps the cross-model arms (codex + agy) into a single command that critiques one document and prints each model's findings. Built-in default rubric, optional context file (switches to the fixed-context arm), CMRE_TIMEOUT override, and graceful skip when a CLI is missing/unauthenticated. Smoke-tested: codex returns findings; a slow/unavailable agy is reported as a timeout rather than crashing the run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…epo name) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9cd048050
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…iew breakpoint
Mines a git repo for changes the project itself later judged wrong, so the cross-model review eval can score arms against validated outcomes (the R7 known-failure subset) ported from plan review to code review.
- scan: Tier-1 reverts (the team's own verdict; trust=high)
- scan-fixes: Tier-3 fix->blame (trust=needs_confirmation, alternates kept for R6)
- attribute-fix: single-fix blame attribution
- pure parsers (revert-SHA, PR numbers, hunk ranges, code-path filter, entry conformance) are unit-tested; the git walk is integration-tested against a constructed repo.
Validated on BlueprintOS: Tier-1 yields 0 usable items (no git revert), Tier-3 yields ~180 unique known-failure candidates from ~200 fix commits, with real surfaced-after latency up to 53 days.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
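Two of the pure parsers named above lend themselves to a short sketch. The function names and return shapes here are hypothetical. The sketch relies only on the default `git revert` message body ("This reverts commit <sha>.") and the standard unified-diff hunk header.

```python
import re

# Default `git revert` bodies contain "This reverts commit <sha>."
_REVERT_RE = re.compile(r"This reverts commit ([0-9a-f]{7,40})\b")
# Unified-diff hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@
_HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@", re.MULTILINE)


def parse_revert_shas(message):
    """Return every commit SHA a revert message says it reverts."""
    return _REVERT_RE.findall(message)


def parse_hunk_ranges(diff_text):
    """Return (old_start, old_len, new_start, new_len) for each hunk header.

    A missing length means a one-line range, per the unified-diff format.
    """
    ranges = []
    for old_start, old_len, new_start, new_len in _HUNK_RE.findall(diff_text):
        ranges.append((
            int(old_start), int(old_len) if old_len else 1,
            int(new_start), int(new_len) if new_len else 1,
        ))
    return ranges
```

Because both functions take plain strings, they can be unit-tested without a repository, which fits the split described above between pure parsers and the integration-tested git walk.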
…kpoint
Where plan review can only forward-rate whether a finding looks decision-changing, a known-bug corpus has a target -- the bug the fix proved mattered -- so the known-failure metric becomes an objective hit/miss (R7 made concrete).
- gt-resolve: join blind per-finding matches_bug verdicts back to per-(arm,doc) gt_hit, preserving blinding (the arm is recovered after the judge, never shown).
- gt-score: per-arm known-failure hit counts, scoped to known_failure docs.
- aggregate: uses gt_hit as the known-failure predicate when present, falling back to decision_changing -- one three-way decision rule serves both breakpoints.
- gt_match_rubric.md: the blind per-finding match rubric (defect mechanism, not surface/file overlap), with R6 human confirmation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds drive_eval.py, the deterministic spine that ties corpus -> arms -> judge -> decision into one runnable flow, with the model-driven steps as explicit handoffs.
- drive_eval.py plan: enumerates arm x doc x trial work, emits the CLI-arm (b, c) commands and the in-process-arm/judge todo, writes run-state.json. Refuses to plan a run whose threshold/N are not pre-registered (R9).
- drive_eval.py finalize: runs gt-resolve -> gt-score -> aggregate over ingested records + judge verdicts and renders the decision artifact; forces inconclusive on confounded blind-integrity, below-N, or negative-control movement.
- build_corpus.py to-manifest: wraps scan/scan-fixes entries into a manifest skeleton with null pre_registration, so the human fills the decision rule and confirms needs_confirmation entries before running.
Smoke-validated on BlueprintOS: scan-fixes -> to-manifest -> plan resolves real culprit-diff paths into runnable arm commands (3 docs -> 36 expected records).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o stop credit bleed A real run surfaced the bug: GT-match verdicts were keyed on (doc_id, finding_id), but finding ids are local to a record, so one matches_bug verdict credited every arm that reused a local id like f1/f2 -- inflating per-arm hits. Fix: gt_pool builds the blind judging pool under globally-unique, arm-opaque uids (content+arm hash, order-independent, does not encode the arm) plus an arm provenance map; gt_hits_from_verdicts resolves uid-keyed verdicts to per-(arm,doc) gt_hit. The judge still never sees the arm. drive_eval finalize uses the pool flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
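A minimal sketch of the uid-keyed pool flow, to make the credit-bleed fix concrete. The record shape (`arm`, `doc_id`, `findings`), the 16-hex-character uid, and the exact hashing recipe are assumptions; only the function names `gt_pool` and `gt_hits_from_verdicts` come from the commit message.

```python
import hashlib


def finding_uid(doc_id, arm, text):
    """Opaque uid: the arm feeds the hash but cannot be read back from it."""
    digest = hashlib.sha256(f"{doc_id}\x00{arm}\x00{text}".encode("utf-8")).hexdigest()
    return digest[:16]


def gt_pool(records):
    """Build the blind judging pool plus a private uid -> (arm, doc) map."""
    pool, provenance = [], {}
    for rec in records:
        for text in rec["findings"]:
            uid = finding_uid(rec["doc_id"], rec["arm"], text)
            provenance[uid] = (rec["arm"], rec["doc_id"])
            # The judge sees uid, doc, and text only; never the arm.
            pool.append({"uid": uid, "doc_id": rec["doc_id"], "text": text})
    pool.sort(key=lambda item: item["uid"])  # order-independent output
    return pool, provenance


def gt_hits_from_verdicts(verdicts, provenance):
    """Resolve {uid: matches_bug} verdicts to per-(arm, doc) gt_hit flags."""
    hits = {key: False for key in set(provenance.values())}
    for uid, matches_bug in verdicts.items():
        if matches_bug:
            hits[provenance[uid]] = True
    return hits
```

In this sketch two arms that report the same text on the same document still receive distinct uids, so a single positive verdict credits exactly one arm. Identical findings repeated within one record would share a uid, which a real implementation may need to handle separately.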
Sanitized decision record (public repo): inconclusive by design, with the durable methodology findings -- decorrelation signal observed, one external CLI arm unusable, the credit-bleed bug found+fixed, and the Tier-3 corpus-quality gate the run showed is needed. No proprietary target details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first run showed Tier-3 blame on a repo that ships large feature commits collapses many fixes onto a few giant culprits (one ~150k lines), producing a non-independent, unreviewable corpus. Gate: drop culprits whose diff exceeds --max-culprit-lines/--max-culprit-files (too large to review, or a foundational/import commit), and dedup fixes that share a culprit so they don't become N non-independent docs. parse_numstat + culprit_within_caps are unit-tested; the gate + dedup are integration-tested on a constructed repo. Defaults 2000 lines / 30 files; tighter = cleaner but smaller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
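The size gate can be sketched against `git diff --numstat`-style output (tab-separated added, deleted, path, with `-` for binary files). Only the names `parse_numstat` and `culprit_within_caps` and the 2000-line / 30-file defaults come from the commit message; counting lines as added plus deleted, and counting binary files as zero lines, are assumptions.

```python
def parse_numstat(output):
    """Return (total_changed_lines, file_count) from numstat-style text."""
    total_lines, files = 0, 0
    for line in output.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, _path = parts
        files += 1
        if added != "-":  # binary files report '-' for both counts
            total_lines += int(added) + int(deleted)
    return total_lines, files


def culprit_within_caps(numstat_output, max_lines=2000, max_files=30):
    """True when a culprit commit is small enough to be reviewable."""
    lines, files = parse_numstat(numstat_output)
    return lines <= max_lines and files <= max_files
```

A culprit that fails this check would be dropped from the corpus before deduplication, so a ~150k-line feature commit never becomes a review document.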
Second run (quality-gated corpus, blind judging, credit-bleed fix): no arm cleared threshold and the run-1 cross-model GT advantage did not reproduce -- it was largely a corpus artifact. Bigger finding: GT-match against historical fixes is too narrow -- every arm found many real, serious bugs but rarely the specific historical one, so GT-match alone undercounts reviewer value and must be paired with a finding-yield metric. Sanitized; no proprietary target details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gated second run showed GT-match against historical fixes is too narrow: every arm found many real, serious bugs but rarely the one a historical fix targeted, so GT-match alone undercounts reviewer value. yield_score tallies per-arm total / unique-actionable / decision-changing findings from the same blind judge verdicts (resolved to arms via pool provenance, blinding preserved). drive_eval finalize takes optional --yield-verdicts and renders a yield table beside the GT-match table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
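A sketch of the per-arm yield tally, assuming each blind verdict is a dict with hypothetical `actionable`, `duplicate`, and `decision_changing` flags and that provenance maps each finding uid to its arm. The real `yield_score` may use different fields.

```python
from collections import defaultdict


def yield_score(verdicts, provenance):
    """Tally total / unique-actionable / decision-changing findings per arm.

    verdicts:   {uid: {"actionable": bool, "duplicate": bool, "decision_changing": bool}}
    provenance: {uid: arm}, consulted only after judging so blinding holds.
    """
    table = defaultdict(lambda: {"total": 0, "unique_actionable": 0, "decision_changing": 0})
    for uid, verdict in verdicts.items():
        row = table[provenance[uid]]
        row["total"] += 1
        if verdict.get("actionable") and not verdict.get("duplicate"):
            row["unique_actionable"] += 1
        if verdict.get("decision_changing"):
            row["decision_changing"] += 1
    return dict(table)
```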
…recate agy Troubleshooting agy (Antigravity 1.0.2) found it unusable as a non-interactive reviewer: prompt-in-argv hangs, doc-on-stdin returns exit 0 empty, no no-tools mode, and its only unblock flag grants workspace tool access that breaks arm isolation. Replace it with gemini 0.43 (validated mechanically: auth + trust gate pass, attempts the review): `gemini -p "<instr>" --approval-mode plan --skip-trust -o text` with the doc on stdin (-p is appended to stdin; plan = read-only; --skip-trust required for a clean headless cwd). Needs GEMINI_API_KEY. arms.py gains a gemini branch + choice, drive_eval defaults arm c to gemini, agy retained but documented as deprecated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… precision hole agy->gemini run with the new finding-yield metric scored. GT-match and yield gave opposite verdicts: GT-match shows no cross-model advantage, but yield shows the cross-model arms found 6-13x more actionable bugs than the Claude baseline -- the value GT-match hides. However the negative control caught gemini fabricating specific defects (line numbers not in the diff), so raw yield rewards confabulation and must be precision-weighted; codex (clean control, ~30 actionable) is the credible standout. Sanitized; no proprietary target details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c86da36ec7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e80383152
… persist full records
The plan-review run exposed two gaps: (1) critique.sh truncated findings to 240 chars and persisted nothing, so cross-model findings couldn't be judged after the run; (2) the cross-model arms got one generic rubric while the Claude panel got six specialized lenses, confounding the comparison.
- critique.sh: persist the full record per model when CMRE_OUT_DIR is set.
- panel-critique.sh: run codex + gemini through the SAME six ce-doc-review lenses (coherence/feasibility/security/scope/product/adversarial), full-record persistence, for a prompt-symmetric Claude-panel-vs-cross-model comparison.
- docs: record the plan-review breakpoint finding -- cross-model converged on the panel's top premise rather than decorrelating; the lever's value is breakpoint-dependent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dfba11610
Review feedback addressed in dba311d
Thanks. These were good catches. Verified each against current HEAD, fixed the legitimate ones, and noted the few I'm deferring with rationale.
Fixed as suggested
Fixed, but differently than suggested
Rather than make the runbook clunkier, I made the code match the documented ergonomics:
Intentionally deferred (correct, but lower priority)
The two deferred items (renamed-file blame path; trial-in-finding-UID methodology) are now tracked in #864.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dba311d6d1
…hase 0 validation harness agy 1.0.3 is a viable cross-model reviewer (1.0.2's empty-output bug is fixed), but its own --sandbox does not confine the filesystem. Wrap the agy arm in a macOS seatbelt deny-write profile applied at run time (deny writes to the repo + home creds/dotfiles; allow-default otherwise -- agy hangs under deny-all-write or any deny-read). Logical argv is unchanged; the sandbox is an execution concern declared via a 'sandbox' spec field and applied in run_invocation (profile generated from validation/agy-readonly.sb.tmpl, cleaned up after). Adds the Phase 0 validation harness (scripts/eval/cross_model_review/validation/): agy-smoke.sh (PASS: floor + viability), grok-smoke.sh (re-probe for the deferred grok arm), sentinels, and the seatbelt template. Floor + viability validated live. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iteration + brainstorm corrections Phase 0 validation outcome (v3 plan): grok 0.2.8 headless is blocked by a relay-auth bug (dropped from v1, re-probe on a version bump); agy 1.0.3 accepted and OS-sandboxed. v1 cross-model arms = codex + agy. Adds the two arm-posture-validation solution docs, the ce-deep-review onboarding doc, and the v1->v2->v3 plan iteration. Corrects the brainstorm's stale agy assumptions: R5 posture (agy --sandbox does not confine the FS; floor enforced via OS seatbelt), and the auth assumption (OAuth at ~/.gemini/oauth_creds.json + refresh_token, not AV_API_KEY env vars; detection must not gate on expiry since agy auto-refreshes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e1226edac0
…+ egress-safe cross-model dispatch Adds the ce-deep-review-beta skill (disable-model-invocation, explicit-only): Phase 0 arm detection (env-detect.sh, codex+gemini, JSON-only/no credential leakage); Phase 1 headless ce-doc-review with a fail-stop UX; Phase 2 single consent gate (multi-select per-model opt-in default-none, ack in the stem, graceful gitleaks degradation via gitleaks-scan.sh + escalated ack); Phase 3 egress-safe dispatch (only consented models) shelling the bundled harness; Phase 4 raw UNVERIFIED thin-slice output to <plan>.deep-review-draft.md (the verified .deep-review.md name is reserved). P0 fix: panel-critique.sh gains a minimal '--models <subset>' guard so a deselected vendor is never run (egress == consent), default preserves the current codex+gemini behavior. Bundles the canonical harness (bundle-harness.sh) into the self-contained skill with a CI-enforced normalized drift test. Adds a minimal contract test. grok deferred; agy's full panel integration is Phase 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the [BETA] ce-deep-review-beta row to the plugin README and docs/skills index, plus a thin user-facing doc (docs/skills/ce-deep-review.md): what it does, how it differs from ce-doc-review, the v1 arms (codex + gemini; grok deferred; agy migrating), consent/safety, and the explicit-invocation + unverified-thin-slice caveats. Links the onboarding doc. The doc grows when verification lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa558334fa
…s classifier (RU1/OD-4)
The dogfood run found the cross-model dispatch (bash panel-critique.sh) blocked
by Claude Code's auto-mode permission classifier even after the in-skill consent
gate was granted — allowed-tools alone is not sufficient. A second-session probe
characterized the classifier: it is consent-SCOPE-keyed, not path-keyed. It reads
what the conversation records the user *chose*, and AskUserQuestion returns the
selected option *labels*, not the gate stem. The gate's bare model-name labels
(codex (OpenAI)) did not register as egress authorization; a top-level probe
phrased as "Send the plan to gemini (Google)" cleared the classifier and ran real
egress.
RU1 (path chosen: legible gate + settings fallback):
- consent-gate.md / SKILL.md Phase 2: option labels now carry the egress verb +
vendor ("Send the plan to codex (OpenAI)") so the recorded consent is legible to
the classifier; new "Egress-gate legibility" section explains why.
- SKILL.md Phase 3 + arm-invocation.md: "If the dispatch is blocked" fallback
ladder (restate consent -> retry once -> !-handoff -> settings rule).
- onboarding doc: new "Egress permission" section + permissions.allow rule for
headless runs (flagged untested for the headless-only path).
- contract test: asserts verb-carrying labels + documented block fallback.
- decision record: docs/solutions/skill-design/2026-05-28-od4-...md.
Verification caveat: that the reworded IN-SKILL gate clears the classifier is a
single-data-point hypothesis (the confirmed clear was a top-level probe). It
cannot be tested in the authoring session (skill caches at session start) — needs
a fresh-session /ce-deep-review-beta run. The fallback ladder is the documented
degradation if it does not.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rk v3 superseded v4 collapses U1-U6 into Current State (Phase 0/1 done on PR EveryInc#858), renumbers the residual units RU1-RU6, records the as-built corrections where v3 drifted from the committed code (env-detect is codex+gemini only; arms.py already has the agy arm; the --models guard already shipped; the bundle-drift bun test is the gate, not the phantom CI step), and folds in the dogfood OD-4 finding. v3 frontmatter flips to status: superseded / superseded_by: 004. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09d7f73b31
…veryInc#2)
Second dogfood run on this v4 plan surfaced that the plan's own OD-4/RU1 narrative was stale: the work it framed as "OPEN -- load-bearing" had already shipped in 766c730, and this run gave the fresh-session confirmation the plan said it was waiting for (the in-skill consent gate's verb-carrying labels cleared the auto-mode egress classifier with no `!`-handoff).
Reconciliation (panel-driven, high-confidence fixes only):
- OD-4: OPEN -> RESOLVED; record root cause + 766c730 fix + residual sub-questions (headless/permissions.allow path still untested).
- RU1: DONE; verification met by dogfood EveryInc#2.
- Risk table: "turnkey blocked" -> resolved; "v1 ships codex + agy" -> "currently codex + gemini; agy post-cutoff via RU2 before 2026-06-18".
- RU2: name the macOS-only platform-gate scope + add the arms.py off-mac hard-guard (security panel).
- RU6: split recommendation (RU6a doc-cleanup / RU6b verifier-rates).
- Add ce-deep-review -> ce-deep-review-beta naming caveat; new .gitignore deferred question for the draft sidecars.
Judgment calls left as residual/Outstanding Questions, not silently decided.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arm (RU2)
RU2 migration: agy becomes the default non-codex arm ahead of the gemini 2026-06-18 HTTP-410 cutoff. gemini stays SELECTABLE via `--models codex,gemini` until the cutoff -- it is NOT removed; it remains the fallback while agy is the sole new non-codex arm.
- panel-critique.sh: default arms codex+agy; export CMRE_REPO_DIR from the reviewed plan's repo root.
- arms.py: _repo_root() honors CMRE_REPO_DIR (protect the user's plan repo, not arms.py's location); run_invocation refuses agy when the seatbelt prefix is empty (off-darwin / missing template) rather than running unfloored (R5).
- env-detect.sh: detect agy (OAuth-file presence, presence-only); macOS platform-gate -> "unavailable" off-darwin so the gate never offers an unfloored arm; emit the agy key.
- SKILL.md / consent-gate.md / arm-invocation.md: agy in the gate (Send the plan to agy (Antigravity)); gemini selectable pre-cutoff.
- docs/skills/ce-deep-review.md: arm table synced.
- Re-bundled (drift green); new tests/skills/ce-deep-review-beta-arms-ru2.test.ts (7 tests); v4 plan marks RU2 DONE.
Verified: env-detect agy ok on macOS / unavailable on simulated Linux; live 1-model agy run produced records for all 6 lenses via the full skill path; agy-smoke floor PASS (repo write blocked) + viable under seatbelt; off-mac guard refuses unfloored agy; drift green; full bun test 1427 pass; release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-06-18)
gemini's CLI hits an HTTP-410 cutoff on 2026-06-18, so shipping it as a
selectable fallback that dies in June added no durable value. Remove it from the
ce-deep-review skill now; agy (Antigravity) is the sole non-codex skill arm.
Skill-only removal -- the shared arms.py gemini arm is KEPT for the cross-model
eval (it is the eval's default arm c, a valid blind-judge family, and is covered
by cross-model-review-eval.test.ts). Ripping it out of arms.py would break the
eval's codex-vs-gemini comparison for no benefit.
- env-detect.sh: drop the gemini arm + key; emit {codex, agy} only (the
~/.gemini OAuth path stays -- it is agy's auth file too).
- SKILL.md / consent-gate.md / arm-invocation.md: gate no longer offers gemini;
verb-label examples use agy (Antigravity); "gemini retired" notes added.
- panel-critique.sh header: arms = codex + agy (no gemini-selectable advertising;
the script still passes any --models through to arms.py).
- docs/skills/ce-deep-review.md + onboarding: arm table / example updated.
- tests: contract asserts the agy verb-label; ru2 test asserts env-detect emits
no gemini key.
- v4 plan: RU2 status / risk table / Phase 2b reconciled (gemini removed, not
retained).
Verified: env-detect emits {codex, agy}; full bun test 1427 pass (incl. the eval
gemini-arm tests, still green); release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…els semantics (RU3)
panel-critique.sh now forks one background subshell PER MODEL (each running the
six lenses sequentially) and waits on all, instead of the sequential lens x model
loop. Parallelizing across models (not across lenses) overlaps the slow arms
while bounding concurrency to one in-flight request per vendor -- the
rate-limit/resource mitigation the feasibility lens flagged. Per-(model,lens)
progress lines stream as each cell completes (R15); they interleave (each is
self-labeled; records key on ${cli}__${lens}.json so parallel writers never
collide).
--models semantics, now defined: default = all available (codex + agy);
unavailable / off-platform arms warn-SKIP per cell, never fatal -- a missing
binary, or agy off-macOS (its read-only floor is a macOS seatbelt) -- and the
rest still run.
- panel-critique.sh: per-model subshell fan-out + wait; agy off-mac SKIP;
default/skip semantics documented.
- tests: 3 new RU3 tests (subshell-per-model structure; bogus arms warn-skip not
fatal; agy off-mac SKIP) in ce-deep-review-beta-arms-ru2.test.ts.
- v4 plan: RU3 marked DONE; Phase 2b complete.
Verified: live --models codex,agy produced all 12 records with interleaved
progress (proves concurrency); bogus arms -> exit 0, no records; agy under a
Linux uname stub -> SKIP, no record; full bun test 1430 pass; drift green;
release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
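The per-model fan-out can be illustrated in Python, with threads standing in for the bash background subshells that panel-critique.sh actually uses. `run_panel`, `run_cell`, and `available` are hypothetical names; the lens list and the `<cli>__<lens>.json` record key follow the commit messages above.

```python
from concurrent.futures import ThreadPoolExecutor

LENSES = ["coherence", "feasibility", "security", "scope", "product", "adversarial"]


def run_panel(models, run_cell, available):
    """Run every lens for every model: parallel across models, sequential within.

    run_cell(model, lens) performs one review; available(model) says whether
    the arm can run here. Unavailable arms are skipped with a warning, never
    fatal. Returns the record file names that were produced.
    """
    def run_model(model):
        if not available(model):
            print(f"SKIP {model}: unavailable on this machine")
            return []
        produced = []
        for lens in LENSES:  # one in-flight request per vendor
            run_cell(model, lens)
            produced.append(f"{model}__{lens}.json")  # unique per cell, so writers never collide
            print(f"done {model}/{lens}")
        return produced

    if not models:
        return []
    # One worker per model bounds concurrency to the number of vendors.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        per_model = list(pool.map(run_model, models))
    return [name for names in per_model for name in names]
```

Keeping the lens loop sequential inside each worker is what limits each vendor to a single outstanding request, while the slow arms still overlap with each other.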
…ackstop (RU4)
Adds the verification step that grounds every raw cross-model finding against the reviewed plan and assigns one verdict: CONFIRMED (a substantial verbatim quote that exists in the plan), NOT-FOUND-IN-DOC (a claimed quote that is absent), or NEEDS-HUMAN (no substantial quote to check). Replaces the thin slice's `verification: none` with `verification: quote-grep-backstop`.
The backstop (scripts/verify-findings.py) is a pure function of (finding text, doc): deterministic, authoritative (a model may not override CONFIRMED / NOT-FOUND-IN-DOC), and blind to the producing model (the verdict never reads the model label).
Scope decision: v1 uses the deterministic quote-grep as the SOLE authoritative gate -- no LLM verifier, which would re-introduce the verifier-contamination failure mode the panel flagged. CONFIRMED certifies the quoted evidence exists, NOT that the finding is correct/important -- that stays a human call. A high NEEDS-HUMAN count is expected and honest (a quote-grep can only adjudicate findings that quote the doc).
- scripts/verify-findings.py: verify-one + verify-records subcommands; quote extraction + normalized (smart-quote/dash/whitespace-folded) substring grep; multi-word grounding quotes only (a lone filename can't trivially confirm).
- references/verification-protocol.md: verdict semantics + brittleness caveats.
- SKILL.md: Phase 3.5 (verification) added; Phase 4 + intro + description reframed from "raw and unverified" to verdict-tagged.
- docs/skills/ce-deep-review.md reframed.
- tests: 7 verify tests + a contract test.
Verification is skill-only (not bundled -- it is skill-specific, not eval-shared). RU5 (reconciliation + the verified .deep-review.md sidecar) is still pending.
Verified: verdict cases pass; model-blindness (same text -> same verdict under different model labels); normalization avoids false NOT-FOUND; full bun test 1438 pass; release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
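The three verdicts can be sketched as a pure function of (finding text, doc). This is an illustration, not scripts/verify-findings.py: the double-quote extraction regex, the three-word minimum for a "substantial" quote, and the rule that any one grounded quote confirms the finding are all assumptions.

```python
import re

_QUOTED = re.compile(r'"([^"]+)"')


def normalize(text):
    """Fold smart quotes, dashes, and whitespace before comparing."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u2013", "-").replace("\u2014", "-")
    return re.sub(r"\s+", " ", text).strip()


def verify_one(finding_text, doc_text, min_words=3):
    """Return CONFIRMED, NOT-FOUND-IN-DOC, or NEEDS-HUMAN.

    The model label is never an input, so the verdict is model-blind.
    """
    doc = normalize(doc_text)
    quotes = [
        q for q in _QUOTED.findall(normalize(finding_text))
        if len(q.split()) >= min_words  # a lone filename cannot confirm
    ]
    if not quotes:
        return "NEEDS-HUMAN"
    if any(q in doc for q in quotes):
        return "CONFIRMED"
    return "NOT-FOUND-IN-DOC"
```

As in the commit message, CONFIRMED here only certifies that the quoted text exists in the document, not that the finding built on it is right.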
…ecar (RU5)
Promotes the verdict-tagged findings into the reserved verified sidecar <plan>.deep-review.md -- the skill's terminal output is now the verified file, not the thin-slice draft. Completes Phase 3 (RU4 verify + RU5 reconcile).
scripts/reconcile.py (skill-only) provides two deterministic helpers:
- rotate: rename an existing <plan>.deep-review.md to .deep-review.<ISO>.md and keep only the 5 newest rotations. DATA-LOSS-SAFE: the prune glob matches rotations only (<plan>.deep-review.<infix>.md) -- never the base (just renamed away) nor the -draft sidecar (a -draft infix, not a .-delimited one). This is the rotation data-loss surface the feasibility review flagged.
- render-cross-model: deterministic by-lens, verdict-tagged Markdown from verify-records output; grounding quote shown on CONFIRMED.
SKILL.md Phase 4 restructured: rotate -> render -> write <plan>.deep-review.md with skill_phase: verified, verification: quote-grep-backstop, verdict counts, coverage; banner precedence is coverage-only (the UNVERIFIED banner is gone) plus a NEEDS-HUMAN triage note; a decision-changing union closes the body. An existing .deep-review-draft.md is left in place (historical dogfood artifact). .gitignore untouched (still an open decision) + committed-leak reminder when content_preview is unavailable. references/reconciliation.md documents the contract. Intro / description / user doc reframed; H1 drops "thin slice".
- tests: 4 reconcile tests (rotate keep-5 + base/draft safety + path guard; render grouping/tagging) + contract test updated to the verified output.
Verified: rotate keeps 5 newest by ISO infix, prunes older, spares base/draft, refuses non-.deep-review.md paths; render groups by lens + orders verdicts; full bun test 1442 pass; release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
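A sketch of the rotate helper's data-loss guard. The function signature, the colon-free UTC timestamp format, and the injectable `now` are assumptions made so the example is testable; the keep-5 rule, the path guard, and the rotations-only prune pattern follow the description above.

```python
import glob
from datetime import datetime, timezone
from pathlib import Path

SUFFIX = ".deep-review.md"


def rotate(sidecar, keep=5, now=None):
    """Rename <plan>.deep-review.md to a timestamped rotation, then prune.

    Returns the rotation paths that were deleted.
    """
    sidecar = Path(sidecar)
    if not sidecar.name.endswith(SUFFIX):
        raise ValueError(f"refusing to rotate non-{SUFFIX} path: {sidecar}")
    stem = sidecar.name[: -len(".md")]  # "<plan>.deep-review"
    if sidecar.exists():
        stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
        sidecar.rename(sidecar.with_name(f"{stem}.{stamp}.md"))
    # "<stem>.*.md" needs a dot-delimited infix, so it cannot match the base
    # file or the "<plan>.deep-review-draft.md" sidecar.
    rotations = sorted(sidecar.parent.glob(f"{glob.escape(stem)}.*.md"))
    pruned = rotations[:-keep] if keep > 0 else rotations
    for old in pruned:
        old.unlink()
    return pruned
```

Sorting by file name works here because the assumed timestamp format orders lexicographically in the same way it orders chronologically.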
RU6 (Phase 4 build). Adds the verifier-rate measurement and finalizes docs.
RU6b -- verifier rates (re-scoped for the deterministic verifier):
- verify-findings.py gains a `measure` subcommand over a labeled corpus (references/calibration/verifier-corpus.json: 11 grounded + 11 confabulated, including format-variant grounded items that exercise normalization). Computes false-CONFIRM (expected NOT-FOUND -> CONFIRMED) and false-NOT-FOUND (expected CONFIRMED -> not confirmed); eligible when both <=5%. Measured 0% / 0%.
- Re-scope (deviation from v3 U12, justified by RU4's deterministic decision): the verifier is a deterministic, model-blind quote-grep, so v3's agy-voiced sampling, synthetic-fallback, calibration_scope, and N=3 trials are MOOT -- verdict is a pure function of (text, doc), no voice, no variance. N=1 eval.
RU6a -- docs/cleanup:
- README row reframed (verified output, not "thin slice unverified").
- Brainstorm: grok -p retention line corrected to OD-3 = CONFIRMED acceptable (the AV_API_KEY -> oauth_creds correction was already in via supersedes notes).
- Drift-gate DECIDED: the committed bun equality test is the gate (runs under `bun test` -> CI's test check), no redundant .github step; v3's phantom "CI step" claim lives only in superseded v3 and v4 already corrects it.
- Full contract test is in place (accreted across RU4/RU5/RU6).
- v4 plan: RU6 DONE; Phase 4 build complete.
Promotion (beta->stable) remains gated on human-run checks, not more code: the manual end-to-end over F1-F5 and the OD-1 dogfood (>=2 devs), which needs the skill shipped first.
Verified: corpus eligible at 0%/0%; 2 RU6b tests; full bun test 1444 pass; release:validate in sync.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
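The two verifier rates can be computed as below. This is a sketch, not the `measure` subcommand itself: the case shape (`expected`, `actual`) and the returned dict are assumptions, while the two rate definitions and the 5% eligibility bound follow the text above.

```python
def measure(cases, max_rate=0.05):
    """Compute false-CONFIRM and false-NOT-FOUND rates over labeled cases.

    Each case is {"expected": <verdict>, "actual": <verdict>}.
    """
    confabulated = [c for c in cases if c["expected"] == "NOT-FOUND-IN-DOC"]
    grounded = [c for c in cases if c["expected"] == "CONFIRMED"]
    # false-CONFIRM: a confabulated quote that the verifier confirmed anyway.
    false_confirm = (
        sum(c["actual"] == "CONFIRMED" for c in confabulated) / len(confabulated)
        if confabulated else 0.0
    )
    # false-NOT-FOUND: a grounded quote that the verifier failed to confirm.
    false_not_found = (
        sum(c["actual"] != "CONFIRMED" for c in grounded) / len(grounded)
        if grounded else 0.0
    )
    return {
        "false_confirm": false_confirm,
        "false_not_found": false_not_found,
        "eligible": false_confirm <= max_rate and false_not_found <= max_rate,
    }
```

With 11 items per class, a single misjudged item already gives a rate of about 9%, so at this corpus size the 5% bound is only met by a perfect score.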
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9cd13a62d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The quote-grep backstop's normalize() folded smart quotes/dashes/whitespace but not markdown emphasis markers (* and _), so a cross-model finding that quoted an emphasized phrase verbatim was verdicted NOT-FOUND-IN-DOC: a doc writing `the order *is* the container` did not match a finding quoting "the order is the container". Surfaced by the first real end-to-end run on a brainstorm doc.

Strip * and _ from both doc and quote in normalize(). Safe against snake_case false-merges: removal inserts no space, so `market_id` -> `marketid` only collapses to a match when both doc and quote carry the underscore (a true verbatim quote); a spaced paraphrase keeps its space and still won't match.

Adds a markdown-emphasis grounded case (g12) to the RU6 calibration corpus (measure stays 0%/0% eligible) and two unit tests.

Verified on the saved run: exactly one finding flipped NOT-FOUND -> CONFIRMED (the clean emphasis artifact); the other genuine paraphrase/quote-char-swap cases correctly remained NOT-FOUND.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
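The normalization change can be sketched as follows. This is an illustrative reconstruction, not the code in `verify-findings.py`: the exact character table and the helper name `quote_in_doc` are assumptions. What it does take from the commit is the order of operations: fold smart quotes and dashes, strip `*` and `_` without inserting a space, then collapse whitespace, applied identically to doc and quote.

```python
import re

# Assumed fold table: curly quotes and en/em dashes mapped to ASCII equivalents.
_FOLD = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalize(text: str) -> str:
    """Fold typography, strip markdown emphasis markers, collapse whitespace."""
    text = text.translate(_FOLD)
    # Remove emphasis markers without inserting a space, so "market_id" becomes
    # "marketid" on both sides while a spaced paraphrase "market id" stays distinct.
    text = text.replace("*", "").replace("_", "")
    return re.sub(r"\s+", " ", text).strip()


def quote_in_doc(quote: str, doc: str) -> bool:
    """True when the normalized quote is a substring of the normalized doc."""
    return normalize(quote) in normalize(doc)
```

For example, `quote_in_doc("the order is the container", "Here the order *is* the container.")` holds, while `quote_in_doc("market id", "keyed by market_id")` does not.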
💡 Codex Review
Reviewed commit: 2cc9c70f0d
Resolves the genuinely-open automated-review threads. Most prior-round findings were already fixed in dba311d but their threads were never resolved on GitHub (handled as replies); these are the items still open against current code.

drive_eval.py:
- coverage() requires exact (arm, doc_id, trial) tuple membership, not cardinality, so stale/extra records can't look complete
- any incomplete run is forced to inconclusive (not just build:<arm>)
- load_records drops failed (timeout/error) records so a failed trial counts as missing, not a clean zero-finding review
- plan() validates go_threshold / minimum_corpus_n / trials_per_arm as positive integers

run_arms.py:
- sanitize doc_id into filename components (_slug); ingest catches OSError
- negative-control judge wording gated on subset==negative_control; forward_rated docs get neutral merit-based judging
- parse_judge_verdicts rejects arrays of non-objects (returns []) so a failed judge run can't feed garbage downstream

build_corpus.py:
- blame only the deleted/changed old-side lines of a fix hunk, not the unchanged context lines (no longer mis-attributes context authors)

verify-findings.py:
- verify-records filters records to the current plan's doc_id, so stale records in a reused CMRE_OUT_DIR aren't published into this sidecar (verify-one and model-blindness untouched)

gitleaks-scan.sh:
- a gitleaks invocation error (or missing/unparseable report) now maps to the escalated "unavailable" path, not a false clean "ran" before egress

SKILL.md:
- Phase 0 guard against silently treating an unresolved CLAUDE_SKILL_DIR as "zero arms"

README.md (eval runbook):
- correct the ingest pipe to invoke run_arms.py via python3

Deferred (tracked in EveryInc#864): renamed-file blame path; trial-in-GT-UID.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
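The coverage fix in `drive_eval.py` is easiest to see in code. The sketch below is an assumption-laden illustration, not the real function: the record field names (`arm`, `doc_id`, `trial`, `status`), zero-based trial numbering, and the `decide` helper are all hypothetical. It shows why set membership over exact (arm, doc_id, trial) tuples is stricter than comparing counts, and how failed records end up counted as missing.

```python
from itertools import product


def coverage(records: list[dict], arms, doc_ids, trials_per_arm: int) -> dict:
    """Check that every expected (arm, doc_id, trial) cell has a successful record."""
    expected = set(product(arms, doc_ids, range(trials_per_arm)))
    # Failed (timeout/error) records are dropped, so a failed trial is missing
    # rather than being read as a clean zero-finding review.
    seen = {
        (r["arm"], r["doc_id"], r["trial"])
        for r in records
        if r.get("status") == "ok"
    }
    missing = expected - seen
    return {
        "complete": not missing,
        "missing": sorted(missing),
        "extra": sorted(seen - expected),  # stale records do not count toward coverage
    }


def decide(verdict: str, cov: dict) -> str:
    """Force any incomplete run to inconclusive, whatever the raw verdict was."""
    return verdict if cov["complete"] else "inconclusive"
```

With a cardinality check, three real records plus one stale record from an old arm would add up to the expected four and look complete; with tuple membership the missing cell is reported explicitly.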
💡 Codex Review
Reviewed commit: becffb998c
…afe rotation

Two follow-up review findings on the latest push:
- allowed-tools only whitelisted the three bash *.sh wrappers, but Phase 3.5 and Phase 4 invoke `python3 verify-findings.py` and `python3 reconcile.py`. Under a restricted permission environment the flow would clear consent and dispatch, then block at the mandatory verification/reconciliation steps before writing the verified sidecar. Added Bash(python3 *verify-findings.py*) and Bash(python3 *reconcile.py*), and normalized all entries to the trailing-* form the onboarding doc already documents (so arg-bearing invocations match).
- reconcile.py rotate() used a per-second stamp; two runs in the same second (or an explicit duplicate --now) produced an identical filename and os.rename silently clobbered the earlier backup, violating the data-loss-safe contract. Now disambiguates with a numeric suffix (<stamp>-1, -2, ...) which still sorts newest-first for keep-N pruning. Regression test added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
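One way to picture the collision fix is the sketch below. It is not the code in `reconcile.py`: both function names are hypothetical, and the sort key assumes hyphen-free stamps such as `20260528T101500Z`. It picks the first free `<stamp>`, `<stamp>-1`, `<stamp>-2`, ... name instead of renaming over an existing backup, and orders rotations by (stamp, numeric suffix) so later same-second backups rank as newer.

```python
from pathlib import Path

SUFFIX = ".deep-review.md"


def unique_rotation_path(base: Path, stamp: str) -> Path:
    """Return a rotation path for `stamp` that does not exist yet."""
    plan = base.name[: -len(SUFFIX)]
    candidate = base.with_name(f"{plan}.deep-review.{stamp}.md")
    n = 0
    while candidate.exists():
        n += 1
        candidate = base.with_name(f"{plan}.deep-review.{stamp}-{n}.md")
    return candidate


def rotation_sort_key(path: Path) -> tuple[str, int]:
    """Order by stamp, then by numeric suffix (assumes stamps contain no hyphen)."""
    infix = path.name.split(".deep-review.", 1)[1][: -len(".md")]
    stamp, sep, tail = infix.rpartition("-")
    if sep and tail.isdigit():
        return (stamp, int(tail))
    return (infix, 0)
```

Sorting candidates with `sorted(paths, key=rotation_sort_key, reverse=True)` then yields newest-first order for keep-N pruning. The existence check followed by a rename is not atomic, so this sketch only guards against sequential runs, not concurrent ones.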
💡 Codex Review
Reviewed commit: 1ae8b3be01
    if rec_doc_id is not None and rec_doc_id != f"{doc_id_base}__{lens}":
        continue  # stale record from another plan (or another lens) in a reused dir
Filter out same-plan stale model records
This still lets stale records from a previous run of the same plan through: the current guard only rejects records whose doc_id differs from <plan>__<lens>, so if one run consented to codex,agy and a later run of the same plan selects only codex in the default /tmp/cmre-panel directory, the old agy__*.json files have matching doc_ids and are verified/published into the new .deep-review.md. Fresh evidence beyond the prior thread is that the implemented filter checks only doc/lens here and has no run id or selected-model check, so same-plan stale records still pass.
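A possible shape for the stricter filter the review asks for is sketched below. It is only an illustration: the `model` field on each record and the `select_records` name are assumptions (the comment mentions `agy__*.json` filenames but does not show the record schema), and a per-run id would be an equally valid way to close the gap.

```python
def select_records(
    records: list[dict],
    doc_id_base: str,
    lenses: list[str],
    selected_models: set[str],
) -> list[dict]:
    """Keep only records for this plan's lenses from models consented in this run."""
    wanted_docs = {f"{doc_id_base}__{lens}" for lens in lenses}
    kept = []
    for rec in records:
        rec_doc_id = rec.get("doc_id")
        if rec_doc_id is not None and rec_doc_id not in wanted_docs:
            continue  # stale record from another plan or lens in a reused dir
        if rec.get("model") not in selected_models:
            continue  # same plan, but a model that was not selected for this run
        kept.append(rec)
    return kept
```

With `selected_models={"codex"}`, an `agy` record left over from an earlier run of the same plan is excluded even though its `doc_id` matches.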
Recasting this PR to fit the contributions policy in the README — I saw that outside PRs aren't merged directly and that you have Claude/Codex review submissions and decide independently. That's completely fair, so:
Take whatever's useful, re-implement it your way, or ignore it; all good. Happy to close this in favor of the issue if a PR isn't the surface you want. Thanks for the plugin; it's what this was built on.
What this PR delivers
Two related things, built in sequence:
- `ce-deep-review-beta`: a turnkey skill (`plugins/compound-engineering/skills/ce-deep-review-beta/`) that runs the Claude `ce-doc-review` panel on a high-stakes plan, then, after one consent gate, fans the plan across non-Claude reviewer CLIs for decorrelated findings, verdict-tags every cross-model finding against the plan with a deterministic quote-grep backstop, and writes a reconciled, verified `<plan>.deep-review.md` sidecar. Phases 0–4 are complete (RU1–RU6).
- The evaluation harness (`scripts/eval/cross_model_review/`) it was built on: the decision tool that measured whether cross-model review is worth shipping, before the skill existed.

The skill is the headline deliverable; the harness is the evidence base behind it. The skill is beta (`-beta` suffix, `disable-model-invocation: true`): invoked explicitly, never auto-triggered.

The `ce-deep-review-beta` skill (Phases 0–4 complete)

A high-stakes plan goes through:
- Phase 0: `env-detect.sh` reports `codex` + `agy` availability (offline: `command -v` + credential-file presence, never an API call, never prints secrets). Zero arms → panel-only sidecar.
- Phase 1: runs `ce-doc-review` headless and parses its envelope; fail-stops if the panel didn't complete (no gate without panel results).
- Phase 2: one consent gate with an option per available model (e.g. `Send the plan to codex (OpenAI)`). Selecting any model = consent for exactly that subset.
- Phase 3: `panel-critique.sh --models <subset>` runs only the consented arms across the same six lenses the Claude panel uses, parallel across models (one in-flight request per vendor, which is the rate-limit bound). A deselected vendor is never sent the plan (the `--models` guard filters before the run, never post-hoc).
- Phase 3.5: `verify-findings.py` assigns each cross-model finding exactly one verdict: CONFIRMED (a substantial verbatim quote exists in the plan), NOT-FOUND-IN-DOC (a claimed quote that's absent; flagged, not dropped), or NEEDS-HUMAN (no verbatim quote to check). The backstop is authoritative and model-blind: the verdict function never sees the producing model, so a model verifier can't inherit the confabulation it's meant to catch. CONFIRMED certifies that the quoted evidence exists, not that the finding is correct or important (a human's call).
- Phase 4: writes `<plan>.deep-review.md` with the trusted panel findings, the verdict-tagged cross-model section, and a decision-changing union (panel + CONFIRMED findings that would change a go/no-go).

Arm set: codex + agy

- agy runs under a `sandbox-exec` deny-write seatbelt (agy's own flags don't confine the FS, and it hangs under a pure deny-read profile), so `env-detect` reports it `unavailable` off-darwin and `arms.py` refuses it when the seatbelt prefix is empty.
- The `arms.py` gemini arm is kept for the eval (a blind-judge family). grok stays deferred (0.2.8 relay-auth bug).

RU1 / OD-4: the auto-mode egress classifier (resolved)
The first dogfood run surfaced OD-4: the cross-model dispatch was blocked by Claude Code's auto-mode permission classifier even after consent; `allowed-tools` alone is not sufficient. Characterization: the classifier is consent-scope-keyed, not path-keyed, and `AskUserQuestion` records the selected option labels, not the gate stem. Fix: option labels now carry the egress verb + vendor so the recorded consent is legible; plus a "dispatch blocked" fallback ladder (restate → retry once → `!`-handoff → `permissions.allow` settings rule) and a decision record (`docs/solutions/skill-design/2026-05-28-od4-egress-classifier-consent-scope.md`). Two independent fresh-session runs (dogfood #2 on the v4 plan, and today's acceptance run on a real foreign-repo plan) confirmed that the in-skill gate clears the classifier with no `!`-handoff. The durable `permissions.allow` / headless path (no interactive consent turn) remains untested.

RU6: verifier-rate validation
`verify-findings.py measure` over a labeled corpus (grounded + confabulated, incl. format variants) gates false-CONFIRM and false-NOT-FOUND each ≤ 5%; measured 0% / 0%. Because the verifier is deterministic and model-blind, the eval is straight N=1 (no trials, no voice sampling).

First end-to-end acceptance run (real plan)
Ran the full pipeline on a real `docs/brainstorms/` requirements doc in a separate repo (the manual end-to-end acceptance gate): `codex` + `agy` both ok → Phase 1 7-persona panel (40 findings) → Phase 2 gate (both arms consented) → Phase 3 dispatch cleared the egress classifier with no `!`-handoff (`CMRE_REPO_DIR` auto-resolved to the plan's repo; agy ran under the seatbelt) → 12/12 cells ok, coverage full → Phase 3.5 verdicts → Phase 4 verified sidecar.

Fix surfaced by the run (this PR)
`normalize()` folded smart quotes/dashes/whitespace but not markdown emphasis (`*`, `_`), so a finding quoting an emphasized phrase verbatim was wrongly NOT-FOUND (`the order *is* the container` ≠ `the order is the container`). It now strips `*`/`_` from both doc and quote, which is safe against snake_case false-merges (removal inserts no space). Adds a markdown-emphasis corpus case (g12) + two unit tests; re-verifying the saved run flipped exactly one finding NOT-FOUND→CONFIRMED (the clean artifact), leaving genuine paraphrases correctly NOT-FOUND.

The evaluation harness (`scripts/eval/cross_model_review/`)

The decision tool behind the skill: it measures whether, and which, review-improvement lever is worth building, across two breakpoints:
Findings from the real runs: GT-match against historical fixes is too narrow (arms find real bugs, rarely the specific historical one); finding-yield showed the cross-model arms surfaced 6–13× more actionable bugs than the Claude baseline; raw yield must be precision-weighted (the negative control caught one model fabricating defects); `codex` was the credible standout. Verdict: inconclusive on a build decision (underpowered). The harness, not a build recommendation, was the deliverable. The skill above is the productized follow-through.

Test plan
- `bun test`: full suite green (1445 tests). Pure logic (parsers, scoring, blind resolution, verifier verdicts, reconciliation, rotation) unit-tested; git walks and arm invocations integration-tested against constructed repos. The skill adds a contract test, a bundle-drift test, verifier tests, and reconcile tests.
- `bun run release:validate`: in sync (43 agents / 39 skills).
- … `claude -p`).

Still open (not blocking this PR)

- Whether a `permissions.allow` rule clears the classifier for unattended runs (no interactive consent turn) is untested.

This is a routine feature PR: no release version bump or changelog entry (release-please owns those).
🤖 Generated with Claude Code