feat(ce-deep-review): verified cross-model deep-review skill (RU1–RU6) + the eval harness behind it #858

Open: jaybna wants to merge 46 commits into EveryInc:main from jaybna:feat/cross-model-review-eval
Conversation

@jaybna jaybna commented May 25, 2026

What this PR delivers

Two related things, built in sequence:

  1. ce-deep-review-beta — a turnkey skill (plugins/compound-engineering/skills/ce-deep-review-beta/) that runs the Claude ce-doc-review panel on a high-stakes plan, then — after one consent gate — fans the plan across non-Claude reviewer CLIs for decorrelated findings, verdict-tags every cross-model finding against the plan with a deterministic quote-grep backstop, and writes a reconciled verified <plan>.deep-review.md sidecar. Phases 0–4 are complete (RU1–RU6).
  2. The cross-model critique evaluation harness (scripts/eval/cross_model_review/) it was built on — the decision tool that measured whether cross-model review is worth shipping, before the skill existed.

The skill is the headline deliverable; the harness is the evidence base behind it. The skill is beta (-beta suffix, disable-model-invocation: true) — invoked explicitly, never auto-triggered.


The ce-deep-review-beta skill (Phases 0–4 complete)

A high-stakes plan goes through:

  • Phase 0 — detect arms. env-detect.sh reports codex + agy availability (offline: command -v + credential-file presence, never an API call, never prints secrets). Zero arms → panel-only sidecar.
  • Phase 1 — Claude panel (no egress). Invokes ce-doc-review headless, parses its envelope; fail-stops if the panel didn't complete (no gate without panel results).
  • Phase 2 — consent gate (single interaction). Runs gitleaks for a content preview, then one blocking question whose option labels carry the egress verb + vendor (Send the plan to codex (OpenAI)). Selecting any model = consent for exactly that subset.
  • Phase 3 — cross-model dispatch (egress = consent). panel-critique.sh --models <subset> runs only the consented arms across the same six lenses the Claude panel uses, parallel across models (one in-flight request per vendor — the rate-limit bound). A deselected vendor is never sent the plan (the --models guard filters before the run, never post-hoc).
  • Phase 3.5 — verify (deterministic quote-grep backstop). verify-findings.py assigns each cross-model finding exactly one verdict — CONFIRMED (a substantial verbatim quote exists in the plan), NOT-FOUND-IN-DOC (a claimed quote that's absent — flagged, not dropped), or NEEDS-HUMAN (no verbatim quote to check). The backstop is authoritative and model-blind: the verdict function never sees the producing model, so a model verifier can't inherit the confabulation it's meant to catch. CONFIRMED certifies the quoted evidence exists — not that the finding is correct or important (a human's call).
  • Phase 4 — reconcile → verified sidecar. Rotates any prior sidecar (data-loss-safe, keeps 5), renders the cross-model section deterministically (by lens, verdict-tagged, grounding quote on CONFIRMED), and writes <plan>.deep-review.md with the trusted panel findings, the verdict-tagged cross-model section, and a decision-changing union (panel + CONFIRMED findings that would change a go/no-go).
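The Phase 3.5 verdict rule described above can be sketched as follows. This is a minimal illustration, not the code in verify-findings.py: the `fold` helper, the `MIN_QUOTE_CHARS` threshold for a "substantial" quote, and the choice to route too-short quotes to NEEDS-HUMAN are all assumptions made for the example.

```python
import re
from typing import Optional

MIN_QUOTE_CHARS = 20  # assumed cutoff for a "substantial" quote (hypothetical)


def fold(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping cannot hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verdict(plan_text: str, quote: Optional[str]) -> str:
    """Return exactly one verdict for a finding.

    The producing model is deliberately not a parameter, so the verdict
    cannot depend on which model wrote the finding.
    """
    if quote is None or len(fold(quote)) < MIN_QUOTE_CHARS:
        return "NEEDS-HUMAN"       # nothing substantial to check verbatim
    if fold(quote) in fold(plan_text):
        return "CONFIRMED"         # the quoted evidence exists in the plan
    return "NOT-FOUND-IN-DOC"      # claimed quote is absent: flag it, keep it
```

Keeping the model out of the function signature is what makes the check model-blind by construction rather than by convention.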

Arm set: codex + agy

  • codex (OpenAI), headless read-only.
  • agy (Antigravity), macOS-only — its read-only floor is a macOS sandbox-exec deny-write seatbelt (agy's own flags don't confine the FS, and it hangs under a pure deny-read profile), so env-detect reports it unavailable off-darwin and arms.py refuses it when the seatbelt prefix is empty.
  • gemini was retired from the skill (it 410s on 2026-06-18); the shared arms.py gemini arm is kept for the eval (a blind-judge family). grok stays deferred (0.2.8 relay-auth bug).
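The agy refusal rule (no seatbelt prefix, no run) might look roughly like the sketch below. The function names and the placeholder agy arguments are hypothetical; only `sandbox-exec -f <profile>` is the real macOS invocation form, and the actual arms.py logic may differ.

```python
import sys
from typing import List


def seatbelt_prefix(profile_path: str, platform: str = sys.platform) -> List[str]:
    """Return the sandbox-exec prefix on macOS, or an empty list elsewhere."""
    if platform != "darwin":
        return []
    return ["sandbox-exec", "-f", profile_path]


def wrap_agy(logical_argv: List[str], prefix: List[str]) -> List[str]:
    """Prepend the seatbelt; refuse to build an unconfined agy invocation."""
    if not prefix:
        raise RuntimeError("agy refused: empty seatbelt prefix, no read-only floor")
    return prefix + logical_argv
```

The logical argv stays untouched, so the sandbox remains an execution concern layered on top of it.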

RU1 / OD-4 — the auto-mode egress classifier (resolved)

The first dogfood run surfaced OD-4: the cross-model dispatch was blocked by Claude Code's auto-mode permission classifier even after consent — allowed-tools is not sufficient alone. Characterization: the classifier is consent-scope-keyed, not path-keyed, and AskUserQuestion records the selected option labels, not the gate stem. Fix: option labels now carry the egress verb + vendor so the recorded consent is legible; plus a "dispatch blocked" fallback ladder (restate → retry once → !-handoff → permissions.allow settings rule) and a decision record (docs/solutions/skill-design/2026-05-28-od4-egress-classifier-consent-scope.md). Confirmed in two independent fresh-session runs (dogfood #2 on the v4 plan, and today's acceptance run on a real foreign-repo plan) that the in-skill gate clears the classifier with no !-handoff. The durable permissions.allow / headless path (no interactive consent turn) remains untested.

RU6 — verifier-rate validation

verify-findings.py measure over a labeled corpus (grounded + confabulated, incl. format variants) gates false-CONFIRM and false-NOT-FOUND each ≤ 5%; measured 0% / 0%. Because the verifier is deterministic and model-blind, the eval is straight N=1 (no trials, no voice sampling).
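A sketch of how the two gated rates could be computed from a labeled corpus. The denominators (false-CONFIRM over confabulated cases, false-NOT-FOUND over grounded cases) and the data shape are assumptions for illustration; the `measure` subcommand may define them differently.

```python
from typing import Iterable, Tuple

GATE = 0.05  # each error rate must be at most 5%


def error_rates(cases: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """cases are (label, verdict) pairs; label is 'grounded' or 'confabulated'."""
    cases = list(cases)
    grounded = [v for label, v in cases if label == "grounded"]
    confab = [v for label, v in cases if label == "confabulated"]
    if not grounded or not confab:
        raise ValueError("corpus needs both grounded and confabulated cases")
    false_confirm = sum(v == "CONFIRMED" for v in confab) / len(confab)
    false_not_found = sum(v == "NOT-FOUND-IN-DOC" for v in grounded) / len(grounded)
    return false_confirm, false_not_found


def passes_gate(cases: Iterable[Tuple[str, str]]) -> bool:
    false_confirm, false_not_found = error_rates(cases)
    return false_confirm <= GATE and false_not_found <= GATE
```

Because the verifier is deterministic, a single pass over the corpus gives the exact rates, which is why no repeated trials are needed.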


First end-to-end acceptance run (real plan)

Ran the full pipeline on a real docs/brainstorms/ requirements doc in a separate repo (the manual end-to-end acceptance gate):

  • Phase 0 codex+agy both ok → Phase 1 7-persona panel (40 findings) → Phase 2 gate (both arms consented) → Phase 3 dispatch cleared the egress classifier with no !-handoff (CMRE_REPO_DIR auto-resolved to the plan's repo; agy ran under the seatbelt) → 12/12 cells ok, coverage full → Phase 3.5 verdicts → Phase 4 verified sidecar.
  • The arms earned their egress: they corroborated the panel's biggest findings (the "verified zero data loss" vs. its own Hive-API-gap contradiction was independently CONFIRMED by multiple cells with grounding quotes) and added decorrelated findings the panel missed (bulk-reschedule as a metric-gaming loophole, one-owner workload-stat distortion, premature-decommission pressure, order-container instability, internal-form abuse controls).

Fix surfaced by the run (this PR)

normalize() folded smart quotes, dashes, and whitespace but not markdown emphasis (*, _), so a finding that quoted an emphasized phrase verbatim was wrongly tagged NOT-FOUND-IN-DOC ("the order *is* the container" vs. "the order is the container"). It now strips * and _ from both the doc and the quote, which is safe against snake_case false-merges because removal inserts no space. Adds a markdown-emphasis corpus case (g12) and two unit tests; re-verifying the saved run flipped exactly one finding from NOT-FOUND-IN-DOC to CONFIRMED (the clean artifact), leaving genuine paraphrases correctly NOT-FOUND-IN-DOC.
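The emphasis-stripping fix can be illustrated with a small stand-in for normalize(). The exact set of characters the real function folds is not shown in this PR, so the translation table below is an assumption; the point of the sketch is that deleting * and _ without inserting a space keeps identifiers distinct from spaced phrases.

```python
import re

# Assumed folding table: curly quotes to straight, en/em dashes to hyphen.
_FOLDS = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalize(text: str) -> str:
    """Fold typography, drop emphasis markers, collapse whitespace."""
    text = text.translate(_FOLDS)
    # Remove emphasis markers and insert nothing in their place, so
    # "snake_case" becomes "snakecase" and never "snake case".
    text = text.replace("*", "").replace("_", "")
    return re.sub(r"\s+", " ", text).strip()
```

Applied to both the plan and the quote, the emphasized and plain forms of a phrase normalize to the same string, while a snake_case identifier still only matches itself.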


The evaluation harness (scripts/eval/cross_model_review/)

The decision tool behind the skill — measures whether/which review-improvement lever is worth building, across two breakpoints:

  • Plan review: four-arm eval (Claude baseline, cross-model isolated, cross-model + fixed context, same-model self-critic) with a blinded judge, dedup, blind-integrity probe, pre-registration, three-way decision.
  • Code review: a known-bug corpus builder (Tier-1 reverts + Tier-3 fix→blame with a quality gate), a GT-match metric, a finding-yield metric, and an end-to-end driver.

Findings from the real runs: GT-match against historical fixes is too narrow (arms find real bugs, rarely the specific historical one); finding-yield showed the cross-model arms surfaced 6–13× more actionable bugs than the Claude baseline; raw yield must be precision-weighted (the negative control caught one model fabricating defects); codex was the credible standout. Verdict: inconclusive on a build decision (underpowered) — the harness, not a build recommendation, was the deliverable. The skill above is the productized follow-through.


Test plan

  • bun test — full suite green (1445 tests). Pure logic (parsers, scoring, blind resolution, verifier verdicts, reconciliation, rotation) unit-tested; git walks and arm invocations integration-tested against constructed repos. The skill adds a contract test, a bundle-drift test, verifier tests, and reconcile tests.
  • bun run release:validate — in sync (43 agents / 39 skills).
  • Model-driven arms/judge run via the orchestrator (no claude -p).

Still open (not blocking this PR)

  • OD-1 multi-dev dogfood — the beta→stable promotion gate: the skill in ≥2 distinct devs' hands over ~1–2 weeks vs. a baseline. Instruments deferred.
  • Durable headless egress path — whether a permissions.allow rule clears the classifier for unattended runs (no interactive consent turn) is untested.

This is a routine feature PR — no release version bump or changelog entry (release-please owns those).

🤖 Generated with Claude Code

jaybna and others added 11 commits May 24, 2026 18:52
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… README (U1)

Schema stub with placeholder pre-registration values and one example entry per subset; concrete corpus and threshold/N values are filled at run time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er tests (U2)

Resolves the P1 record-store seam: a single canonical record schema both producers validate against; orchestrator records are file-dropped into a shared run dir and pooled by reading all records; the circuit breaker and per-arm timeout apply only to CLI arms the runner spawns. 8 deterministic-carrier tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e (U3)

Resolves the P1 arm-b isolation gap (AD2): argv+stdin assembly (no doc interpolation), arm-b clean-cwd + HOME/CODEX_HOME env overrides, and a positive sentinel leak-detection probe proving the model had no context rather than only checking that isolation flags were set. 15 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline (arm a) and self-critic (arm d) prompts produced by the orchestrator via in-process subagent dispatch and ingested as schema-conformant records; arm d makes no external call and no document egress (AE4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d aggregation (U5, U6)

U5: per-finding blinded judge rubric (anchored 0/25/50/75/100), cross-arm dedup, and the blind-integrity verdict (above-chance arm-guessing -> confounded). U6: three-way aggregation (build:<arm> / build_nothing / inconclusive) with known-failure subset as the primary signal, below-N and negative-control-moved forcing inconclusive. 20 tests pass.
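One way to operationalize "above-chance arm-guessing -> confounded" is an exact one-sided binomial test against the uniform-guess rate. This is an illustrative sketch only: the commit does not say which test the harness uses, and the `alpha` level and function names are assumptions.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def blind_integrity(correct: int, guesses: int, n_arms: int, alpha: float = 0.05) -> str:
    """Flag the run as confounded if the judge guesses arms better than chance."""
    p_value = binom_tail(correct, guesses, 1.0 / n_arms)
    return "confounded" if p_value < alpha else "ok"
```

With four arms, chance accuracy is 25%, so a judge that identifies the arm in 12 of 20 findings is far enough above chance to invalidate the blinding.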

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-way decision record (build:<arm> / build nothing / inconclusive) filled from U6 aggregates at eval-run time, with primary known-failure table, validity checks (blind-integrity, negative control, power), and secondary tie-breakers. Framed as cross-model critique per R10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (smoke-found)

A live N=1 smoke run surfaced two bugs the stubbed unit tests could not: (1) arm-b's isolation stripped HOME, which killed codex auth, and the clean CWD tripped codex's trusted-dir check; reworked to keep HOME (auth preserved) + clean CWD + --skip-git-repo-check, so the only b-vs-c delta is arm c's injected fixed context. (2) parse_findings only split '-'/'*' bullets, collapsing codex/agy numbered critiques into one finding; now splits numbered lists too. Adds the run-arm entry point that makes the live path invokable. Re-smoke: arm b ok (17 findings, isolated workdir), arm c ok. 21 tests pass.
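The numbered-list fix to parse_findings can be sketched like this. The regex and the continuation-line handling are assumptions for the example, not the harness's actual implementation.

```python
import re
from typing import List

# An item starts with "-", "*", or a number followed by "." or ")".
_ITEM = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def parse_findings(text: str) -> List[str]:
    """Split a critique into one string per bullet or numbered item."""
    findings: List[str] = []
    for line in text.splitlines():
        if _ITEM.match(line):
            findings.append(_ITEM.sub("", line, count=1).strip())
        elif line.strip() and findings:
            # A non-item line continues the current finding.
            findings[-1] += " " + line.strip()
    return findings
```

Text before the first item is dropped, and a line that merely begins with emphasis (such as `**Note**`) is not mistaken for a bullet because the marker must be followed by whitespace.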

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When comparing review models/configs, every arm must get identical controlled context; unequal context (e.g., repo access for in-process subagent arms but not isolated CLI arms) confounds the comparison and makes context masquerade as model diversity. Captured from a live run that this confound nearly led to a wrong build/no-build conclusion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el critique

Wraps the cross-model arms (codex + agy) into a single command that critiques one document and prints each model's findings. Built-in default rubric, optional context file (switches to the fixed-context arm), CMRE_TIMEOUT override, and graceful skip when a CLI is missing/unauthenticated. Smoke-tested: codex returns findings; a slow/unavailable agy is reported as a timeout rather than crashing the run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…epo name)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaybna jaybna marked this pull request as ready for review May 25, 2026 02:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9cd048050

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/eval/cross_model_review/run_arms.py Outdated
Comment thread scripts/eval/cross_model_review/run_arms.py Outdated
jaybna and others added 10 commits May 25, 2026 12:14
…iew breakpoint

Mines a git repo for changes the project itself later judged wrong, so the
cross-model review eval can score arms against validated outcomes (the R7
known-failure subset) ported from plan review to code review.

- scan: Tier-1 reverts (the team's own verdict; trust=high)
- scan-fixes: Tier-3 fix->blame (trust=needs_confirmation, alternates kept for R6)
- attribute-fix: single-fix blame attribution
- pure parsers (revert-SHA, PR numbers, hunk ranges, code-path filter, entry
  conformance) are unit-tested; the git walk is integration-tested against a
  constructed repo.

Validated on BlueprintOS: Tier-1 yields 0 usable items (no git revert), Tier-3
yields ~180 unique known-failure candidates from ~200 fix commits, with real
surfaced-after latency up to 53 days.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kpoint

Where plan review can only forward-rate whether a finding looks decision-changing,
a known-bug corpus has a target -- the bug the fix proved mattered -- so the
known-failure metric becomes an objective hit/miss (R7 made concrete).

- gt-resolve: join blind per-finding matches_bug verdicts back to per-(arm,doc)
  gt_hit, preserving blinding (the arm is recovered after the judge, never shown).
- gt-score: per-arm known-failure hit counts, scoped to known_failure docs.
- aggregate: uses gt_hit as the known-failure predicate when present, falling
  back to decision_changing -- one three-way decision rule serves both breakpoints.
- gt_match_rubric.md: the blind per-finding match rubric (defect mechanism, not
  surface/file overlap), with R6 human confirmation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds drive_eval.py, the deterministic spine that ties corpus -> arms -> judge ->
decision into one runnable flow, with the model-driven steps as explicit handoffs.

- drive_eval.py plan: enumerates arm x doc x trial work, emits the CLI-arm (b, c)
  commands and the in-process-arm/judge todo, writes run-state.json. Refuses to
  plan a run whose threshold/N are not pre-registered (R9).
- drive_eval.py finalize: runs gt-resolve -> gt-score -> aggregate over ingested
  records + judge verdicts and renders the decision artifact; forces inconclusive
  on confounded blind-integrity, below-N, or negative-control movement.
- build_corpus.py to-manifest: wraps scan/scan-fixes entries into a manifest
  skeleton with null pre_registration, so the human fills the decision rule and
  confirms needs_confirmation entries before running.

Smoke-validated on BlueprintOS: scan-fixes -> to-manifest -> plan resolves real
culprit-diff paths into runnable arm commands (3 docs -> 36 expected records).
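The planning step's arm x doc x trial enumeration, together with the pre-registration refusal, can be sketched as below. The key names `threshold` and `n` and the record-cell shape are hypothetical; 4 arms x 3 trials is simply one combination consistent with "3 docs -> 36 expected records".

```python
from itertools import product
from typing import Iterable, List, Optional, Set, Tuple

Cell = Tuple[str, str, int]  # (arm, doc_id, trial)


def plan_work(docs: List[str], arms: List[str], trials_per_arm: int,
              pre_registration: Optional[dict]) -> List[Cell]:
    """Enumerate every expected record, refusing un-pre-registered runs."""
    if (not pre_registration
            or pre_registration.get("threshold") is None
            or pre_registration.get("n") is None):
        raise ValueError("refusing to plan: threshold/N are not pre-registered")
    trials = range(1, trials_per_arm + 1)
    return [(arm, doc, t) for arm, doc, t in product(arms, docs, trials)]


def coverage(expected: Iterable[Cell], ingested: Set[Cell]) -> float:
    """Fraction of expected cells that have an ingested record."""
    expected = list(expected)
    return sum(cell in ingested for cell in expected) / len(expected)
```

Having the full expected set up front is what lets a later step detect a partial run instead of silently scoring whatever records happen to exist.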

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o stop credit bleed

A real run surfaced the bug: GT-match verdicts were keyed on (doc_id, finding_id),
but finding ids are local to a record, so one matches_bug verdict credited every
arm that reused a local id like f1/f2 -- inflating per-arm hits.

Fix: gt_pool builds the blind judging pool under globally-unique, arm-opaque uids
(content+arm hash, order-independent, does not encode the arm) plus an arm
provenance map; gt_hits_from_verdicts resolves uid-keyed verdicts to per-(arm,doc)
gt_hit. The judge still never sees the arm. drive_eval finalize uses the pool flow.
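A minimal sketch of the uid-keyed pool that stops the credit bleed. The record shape, hash truncation, and separator are assumptions; the real gt_pool and gt_hits_from_verdicts may differ in detail.

```python
import hashlib
from typing import Dict, List, Tuple


def finding_uid(doc_id: str, arm: str, text: str) -> str:
    """Content+arm hash: unique across arms, but the arm is not readable from it."""
    payload = "\x1f".join((doc_id, arm, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def gt_pool(records: List[dict]) -> Tuple[List[dict], Dict[str, Tuple[str, str]]]:
    """Build the blind judging pool plus a private uid -> (arm, doc) map."""
    pool: Dict[str, dict] = {}
    provenance: Dict[str, Tuple[str, str]] = {}
    for rec in records:
        for finding in rec["findings"]:
            uid = finding_uid(rec["doc_id"], rec["arm"], finding["text"])
            pool[uid] = {"uid": uid, "doc_id": rec["doc_id"], "text": finding["text"]}
            provenance[uid] = (rec["arm"], rec["doc_id"])
    blind = [pool[uid] for uid in sorted(pool)]  # sorted: order-independent
    return blind, provenance


def gt_hits_from_verdicts(verdicts: Dict[str, bool],
                          provenance: Dict[str, Tuple[str, str]]) -> Dict[Tuple[str, str], bool]:
    """Resolve uid-keyed matches_bug verdicts to per-(arm, doc) hits."""
    hits: Dict[Tuple[str, str], bool] = {}
    for uid, matches in verdicts.items():
        key = provenance[uid]
        hits[key] = hits.get(key, False) or (matches is True)
    return hits
```

Two arms that both label a finding `f1` now get different uids, so a single positive verdict credits only the arm that produced that finding.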

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sanitized decision record (public repo): inconclusive by design, with the durable
methodology findings -- decorrelation signal observed, one external CLI arm
unusable, the credit-bleed bug found+fixed, and the Tier-3 corpus-quality gate the
run showed is needed. No proprietary target details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first run showed Tier-3 blame on a repo that ships large feature commits
collapses many fixes onto a few giant culprits (one ~150k lines), producing a
non-independent, unreviewable corpus.

Gate: drop culprits whose diff exceeds --max-culprit-lines/--max-culprit-files
(too large to review, or a foundational/import commit), and dedup fixes that
share a culprit so they don't become N non-independent docs. parse_numstat +
culprit_within_caps are unit-tested; the gate + dedup are integration-tested on
a constructed repo. Defaults 2000 lines / 30 files; tighter = cleaner but smaller.
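The size gate can be sketched from the standard `git --numstat` output format (added, deleted, and path separated by tabs, with `-` for binary files). Treating "lines" as added plus deleted and counting binary files as zero lines are assumptions; the harness's own parse_numstat and culprit_within_caps may count differently.

```python
from typing import List, Tuple

Row = Tuple[int, int, str]  # (added, deleted, path)


def parse_numstat(output: str) -> List[Row]:
    """Parse `git show --numstat` style output into (added, deleted, path) rows."""
    rows: List[Row] = []
    for line in output.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or non-numstat lines
        added, deleted, path = parts
        # git prints "-" for binary files; count them as zero changed lines.
        rows.append((int(added) if added.isdigit() else 0,
                     int(deleted) if deleted.isdigit() else 0,
                     path))
    return rows


def culprit_within_caps(rows: List[Row], max_lines: int = 2000, max_files: int = 30) -> bool:
    """True when the culprit diff is small enough to review."""
    total_lines = sum(added + deleted for added, deleted, _ in rows)
    return total_lines <= max_lines and len(rows) <= max_files
```

A culprit that fails either cap is dropped from the corpus rather than truncated, since a partial diff would no longer contain the whole defect.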

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second run (quality-gated corpus, blind judging, credit-bleed fix): no arm cleared
threshold and the run-1 cross-model GT advantage did not reproduce -- it was largely
a corpus artifact. Bigger finding: GT-match against historical fixes is too narrow --
every arm found many real, serious bugs but rarely the specific historical one, so
GT-match alone undercounts reviewer value and must be paired with a finding-yield
metric. Sanitized; no proprietary target details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gated second run showed GT-match against historical fixes is too narrow: every
arm found many real, serious bugs but rarely the one a historical fix targeted, so
GT-match alone undercounts reviewer value. yield_score tallies per-arm total /
unique-actionable / decision-changing findings from the same blind judge verdicts
(resolved to arms via pool provenance, blinding preserved). drive_eval finalize
takes optional --yield-verdicts and renders a yield table beside the GT-match table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…recate agy

Troubleshooting agy (Antigravity 1.0.2) found it unusable as a non-interactive
reviewer: prompt-in-argv hangs, doc-on-stdin returns exit 0 empty, no no-tools mode,
and its only unblock flag grants workspace tool access that breaks arm isolation.

Replace it with gemini 0.43 (validated mechanically: auth + trust gate pass, attempts
the review): `gemini -p "<instr>" --approval-mode plan --skip-trust -o text` with the
doc on stdin (-p is appended to stdin; plan = read-only; --skip-trust required for a
clean headless cwd). Needs GEMINI_API_KEY. arms.py gains a gemini branch + choice,
drive_eval defaults arm c to gemini, agy retained but documented as deprecated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… precision hole

agy->gemini run with the new finding-yield metric scored. GT-match and yield gave
opposite verdicts: GT-match shows no cross-model advantage, but yield shows the
cross-model arms found 6-13x more actionable bugs than the Claude baseline -- the
value GT-match hides. However the negative control caught gemini fabricating specific
defects (line numbers not in the diff), so raw yield rewards confabulation and must be
precision-weighted; codex (clean control, ~30 actionable) is the credible standout.
Sanitized; no proprietary target details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c86da36ec7

Comment thread scripts/eval/cross_model_review/run_arms.py
Comment thread scripts/eval/cross_model_review/drive_eval.py
Comment thread scripts/eval/cross_model_review/critique.sh Outdated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaybna jaybna changed the title from "chore(cross-model-eval): add cross-model review evaluation harness" to "feat(cross-model-eval): cross-model critique evaluation harness (plan + code-review breakpoints)" May 26, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e80383152

Comment thread scripts/eval/cross_model_review/drive_eval.py
Comment thread scripts/eval/cross_model_review/run_arms.py
… persist full records

The plan-review run exposed two gaps: (1) critique.sh truncated findings to 240 chars and
persisted nothing, so cross-model findings couldn't be judged after the run; (2) the
cross-model arms got one generic rubric while the Claude panel got six specialized lenses,
confounding the comparison.

- critique.sh: persist the full record per model when CMRE_OUT_DIR is set.
- panel-critique.sh: run codex + gemini through the SAME six ce-doc-review lenses
  (coherence/feasibility/security/scope/product/adversarial), full-record persistence,
  for a prompt-symmetric Claude-panel-vs-cross-model comparison.
- docs: record the plan-review breakpoint finding -- cross-model converged on the panel's
  top premise rather than decorrelating; the lever's value is breakpoint-dependent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dfba11610

Comment thread scripts/eval/cross_model_review/run_arms.py Outdated
Comment thread scripts/eval/cross_model_review/panel-critique.sh Outdated
Author

jaybna commented May 26, 2026

Review feedback addressed in dba311d

Thanks — these were good catches. Verified each against current HEAD, fixed the legitimate ones, and noted the few I'm deferring with rationale. bun test green (1390 pass, +13 new tests covering each fix).

Fixed as suggested

  • Exclude control arm from build-winner selection + treat ties as inconclusive: aggregate now drops a_baseline from candidates and returns inconclusive on a top-count tie instead of picking by enum order.
  • Strict booleans for judge verdicts: added an _as_bool coercion so a loose "false" string can't be Python-truthy; applied to gt-resolve and yield-score (the same class affected the yield fields).
  • Exclude known_failure class verdicts from fallback scoring: finalize now rejects class verdicts mislabeled subset:known_failure (those must come from --gt-verdicts).
  • Validate trials_per_arm before planning: plan rejects anything that isn't an integer ≥ 1.
  • Reject finalize runs with missing records: finalize computes coverage vs docs × arms × trials and forces inconclusive on a partial run.
  • Abort when judge CLI fails: run-judge.sh now uses pipefail and exits non-zero on CLI failure or zero parseable verdicts.
  • mktemp portability (×3): critique.sh, panel-critique.sh, run-judge.sh now use -XXXXXX templates (the bare -t form fails on GNU/Linux).
  • gt-resolve README example: fixed to pass provenance, not records.
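Two of the fixes above, strict boolean coercion and control-excluding winner selection, can be sketched as follows. Names other than a_baseline are hypothetical, raising on unrecognized values is an assumption about _as_bool, and the winner sketch covers only the candidate-selection step (threshold, below-N, and negative-control checks are omitted).

```python
from typing import Dict


def as_bool(value) -> bool:
    """Accept real booleans or the strings 'true'/'false'; reject anything else."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        return value.strip().lower() == "true"
    raise ValueError(f"not a strict boolean: {value!r}")


def pick_winner(hits: Dict[str, int], control: str = "a_baseline") -> str:
    """Choose the top non-control arm; a tie for first place is inconclusive."""
    candidates = {arm: n for arm, n in hits.items() if arm != control}
    if not candidates:
        return "inconclusive"
    top = max(candidates.values())
    leaders = [arm for arm, n in candidates.items() if n == top]
    return f"build:{leaders[0]}" if len(leaders) == 1 else "inconclusive"
```

The coercion matters because `bool("false")` is `True` in Python: any non-empty string is truthy, so a judge that emits the string instead of a JSON boolean would otherwise be scored as a hit.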

Fixed, but differently than suggested

Rather than make the runbook clunkier, I made the code match the documented ergonomics:

  • stdin ingest: ingest now accepts - to read a record from stdin, so the documented run-arm … | ingest <dir> - one-liner works as written.
  • gt-pool glob: gt-pool now accepts multiple single-record files (nargs="+"), so gt-pool <run>/records/*.json works (matching the records-dir layout) while still accepting a single array file.

Intentionally deferred (correct, but lower priority)

  • Renamed-file blame path (build_corpus.py) — valid, but impact is under-sampling renamed-file fixes in the corpus, not wrong results; lower severity than its P1 badge. Tracked for a follow-up.
  • Trial in the finding UID — a deliberate design choice: the UID is content-based so it dedups identical findings; whether a finding repeated across trials counts as one or three is a methodology decision I'd rather make explicitly than change implicitly.
  • doc_id filename sanitization — defensive only; corpus IDs are slugs today. Will add if we ever accept path-like IDs.

Author

jaybna commented May 26, 2026

The two deferred items (renamed-file blame path; trial-in-finding-UID methodology) are now tracked in #864.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dba311d6d1

Comment thread scripts/eval/cross_model_review/drive_eval.py Outdated
Comment thread scripts/eval/cross_model_review/drive_eval.py Outdated
Comment thread scripts/eval/cross_model_review/drive_eval.py
jaybna and others added 2 commits May 28, 2026 16:21
…hase 0 validation harness

agy 1.0.3 is a viable cross-model reviewer (1.0.2's empty-output bug is fixed), but its own --sandbox does not confine the filesystem. Wrap the agy arm in a macOS seatbelt deny-write profile applied at run time (deny writes to the repo + home creds/dotfiles; allow-default otherwise -- agy hangs under deny-all-write or any deny-read). Logical argv is unchanged; the sandbox is an execution concern declared via a 'sandbox' spec field and applied in run_invocation (profile generated from validation/agy-readonly.sb.tmpl, cleaned up after).

Adds the Phase 0 validation harness (scripts/eval/cross_model_review/validation/): agy-smoke.sh (PASS: floor + viability), grok-smoke.sh (re-probe for the deferred grok arm), sentinels, and the seatbelt template. Floor + viability validated live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iteration + brainstorm corrections

Phase 0 validation outcome (v3 plan): grok 0.2.8 headless is blocked by a relay-auth bug (dropped from v1, re-probe on a version bump); agy 1.0.3 accepted and OS-sandboxed. v1 cross-model arms = codex + agy.

Adds the two arm-posture-validation solution docs, the ce-deep-review onboarding doc, and the v1->v2->v3 plan iteration. Corrects the brainstorm's stale agy assumptions: R5 posture (agy --sandbox does not confine the FS; floor enforced via OS seatbelt), and the auth assumption (OAuth at ~/.gemini/oauth_creds.json + refresh_token, not AV_API_KEY env vars; detection must not gate on expiry since agy auto-refreshes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1226edac0

Comment thread scripts/eval/cross_model_review/drive_eval.py Outdated
Comment thread scripts/eval/cross_model_review/run_arms.py
jaybna and others added 2 commits May 28, 2026 16:47
…+ egress-safe cross-model dispatch

Adds the ce-deep-review-beta skill (disable-model-invocation, explicit-only): Phase 0 arm detection (env-detect.sh, codex+gemini, JSON-only/no credential leakage); Phase 1 headless ce-doc-review with a fail-stop UX; Phase 2 single consent gate (multi-select per-model opt-in default-none, ack in the stem, graceful gitleaks degradation via gitleaks-scan.sh + escalated ack); Phase 3 egress-safe dispatch (only consented models) shelling the bundled harness; Phase 4 raw UNVERIFIED thin-slice output to <plan>.deep-review-draft.md (the verified .deep-review.md name is reserved).

P0 fix: panel-critique.sh gains a minimal '--models <subset>' guard so a deselected vendor is never run (egress == consent), default preserves the current codex+gemini behavior. Bundles the canonical harness (bundle-harness.sh) into the self-contained skill with a CI-enforced normalized drift test. Adds a minimal contract test. grok deferred; agy's full panel integration is Phase 2.
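The egress == consent guard is implemented in shell in panel-critique.sh; the Python sketch below only illustrates the intended logic under assumed names (`arms_to_run`, `dispatch`, and the `send` callback are hypothetical). The key property is that filtering happens before any arm is invoked, so a deselected vendor never receives the plan.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def arms_to_run(consented: Iterable[str], available: Iterable[str],
                known: Sequence[str]) -> List[str]:
    """Return only arms that are known, consented to, and available."""
    consented_set, available_set = set(consented), set(available)
    unknown = consented_set - set(known)
    if unknown:
        raise ValueError(f"unknown model(s): {sorted(unknown)}")
    return [arm for arm in known if arm in consented_set and arm in available_set]


def dispatch(plan_text: str, consented: Iterable[str], available: Iterable[str],
             known: Sequence[str], send: Callable[[str, str], str]) -> Dict[str, str]:
    """Send the plan to the filtered subset only; the filter runs first."""
    results: Dict[str, str] = {}
    for arm in arms_to_run(consented, available, known):
        results[arm] = send(arm, plan_text)
    return results
```

Filtering results after the fact would not satisfy the guarantee, because by then the plan has already left the machine.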

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the [BETA] ce-deep-review-beta row to the plugin README and docs/skills index, plus a thin user-facing doc (docs/skills/ce-deep-review.md): what it does, how it differs from ce-doc-review, the v1 arms (codex + gemini; grok deferred; agy migrating), consent/safety, and the explicit-invocation + unverified-thin-slice caveats. Links the onboarding doc. The doc grows when verification lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: aa558334fa

Comment thread plugins/compound-engineering/skills/ce-deep-review-beta/SKILL.md
Comment thread scripts/eval/cross_model_review/README.md Outdated
jaybna and others added 2 commits May 28, 2026 21:07
…s classifier (RU1/OD-4)

The dogfood run found the cross-model dispatch (bash panel-critique.sh) blocked
by Claude Code's auto-mode permission classifier even after the in-skill consent
gate was granted — allowed-tools alone is not sufficient. A second-session probe
characterized the classifier: it is consent-SCOPE-keyed, not path-keyed. It reads
what the conversation records the user *chose*, and AskUserQuestion returns the
selected option *labels*, not the gate stem. The gate's bare model-name labels
(codex (OpenAI)) did not register as egress authorization; a top-level probe
phrased as "Send the plan to gemini (Google)" cleared the classifier and ran real
egress.

RU1 (path chosen: legible gate + settings fallback):
- consent-gate.md / SKILL.md Phase 2: option labels now carry the egress verb +
  vendor ("Send the plan to codex (OpenAI)") so the recorded consent is legible to
  the classifier; new "Egress-gate legibility" section explains why.
- SKILL.md Phase 3 + arm-invocation.md: "If the dispatch is blocked" fallback
  ladder (restate consent -> retry once -> !-handoff -> settings rule).
- onboarding doc: new "Egress permission" section + permissions.allow rule for
  headless runs (flagged untested for the headless-only path).
- contract test: asserts verb-carrying labels + documented block fallback.
- decision record: docs/solutions/skill-design/2026-05-28-od4-...md.

Verification caveat: that the reworded IN-SKILL gate clears the classifier is a
single-data-point hypothesis (the confirmed clear was a top-level probe). It
cannot be tested in the authoring session (skill caches at session start) — needs
a fresh-session /ce-deep-review-beta run. The fallback ladder is the documented
degradation if it does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
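
The legibility fix above hinges on the option label itself carrying the egress verb and the vendor, because only the selected labels are recorded. A minimal sketch of that idea, assuming hypothetical helper names (`egress_label`, `consented_models`) and a hypothetical vendor map; the real gate is defined in SKILL.md and consent-gate.md, not in Python.

```python
import re

# Assumption: model/vendor pairs taken from the labels quoted in the commit message.
VENDORS = {"codex": "OpenAI", "gemini": "Google"}

_LABEL = re.compile(r"^Send the plan to (\S+) \((.+)\)$")


def egress_label(model: str, vendor: str) -> str:
    """Build an option label that states the egress action and its destination."""
    return f"Send the plan to {model} ({vendor})"


def consented_models(selected_labels):
    """Recover the consented subset from the labels the user actually chose.

    A bare label such as "codex (OpenAI)" carries no egress verb and is ignored.
    """
    models = []
    for label in selected_labels:
        match = _LABEL.match(label)
        if match and VENDORS.get(match.group(1)) == match.group(2):
            models.append(match.group(1))
    return models
```

Because consent is read back from the labels alone, a label without the verb yields an empty subset, and nothing would be dispatched.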
…rk v3 superseded

v4 collapses U1-U6 into Current State (Phase 0/1 done on PR EveryInc#858), renumbers the
residual units RU1-RU6, records the as-built corrections where v3 drifted from the
committed code (env-detect is codex+gemini only; arms.py already has the agy arm;
the --models guard already shipped; the bundle-drift bun test is the gate, not the
phantom CI step), and folds in the dogfood OD-4 finding. v3 frontmatter flips to
status: superseded / superseded_by: 004.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 09d7f73b31

Comment thread scripts/eval/cross_model_review/run_arms.py Outdated
jaybna and others added 7 commits May 28, 2026 22:09
…veryInc#2)

Second dogfood run on this v4 plan surfaced that the plan's own OD-4/RU1
narrative was stale: the work it framed as "OPEN — load-bearing" had already
shipped in 766c730, and this run gave the fresh-session confirmation the plan
said it was waiting for (the in-skill consent gate's verb-carrying labels
cleared the auto-mode egress classifier with no `!`-handoff).

Reconciliation (panel-driven, high-confidence fixes only):
- OD-4: OPEN -> RESOLVED; record root cause + 766c730 fix + residual
  sub-questions (headless/permissions.allow path still untested).
- RU1: DONE; verification met by dogfood EveryInc#2.
- Risk table: "turnkey blocked" -> resolved; "v1 ships codex + agy" ->
  "currently codex + gemini; agy post-cutoff via RU2 before 2026-06-18".
- RU2: name the macOS-only platform-gate scope + add the arms.py off-mac
  hard-guard (security panel).
- RU6: split recommendation (RU6a doc-cleanup / RU6b verifier-rates).
- Add ce-deep-review -> ce-deep-review-beta naming caveat; new .gitignore
  deferred question for the draft sidecars.

Judgment calls left as residual/Outstanding Questions, not silently decided.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arm (RU2)

RU2 migration: agy becomes the default non-codex arm ahead of the gemini
2026-06-18 HTTP-410 cutoff. gemini stays SELECTABLE via `--models codex,gemini`
until the cutoff -- it is NOT removed; it remains the fallback while agy is the
sole new non-codex arm.

- panel-critique.sh: default arms codex+agy; export CMRE_REPO_DIR from the
  reviewed plan's repo root.
- arms.py: _repo_root() honors CMRE_REPO_DIR (protect the user's plan repo, not
  arms.py's location); run_invocation refuses agy when the seatbelt prefix is
  empty (off-darwin / missing template) rather than running unfloored (R5).
- env-detect.sh: detect agy (OAuth-file presence, presence-only); macOS
  platform-gate -> "unavailable" off-darwin so the gate never offers an
  unfloored arm; emit the agy key.
- SKILL.md / consent-gate.md / arm-invocation.md: agy in the gate
  (Send the plan to agy (Antigravity)); gemini selectable pre-cutoff.
- docs/skills/ce-deep-review.md: arm table synced.
- Re-bundled (drift green); new tests/skills/ce-deep-review-beta-arms-ru2.test.ts
  (7 tests); v4 plan marks RU2 DONE.

Verified: env-detect agy ok on macOS / unavailable on simulated Linux; live
1-model agy run produced records for all 6 lenses via the full skill path;
agy-smoke floor PASS (repo write blocked) + viable under seatbelt; off-mac guard
refuses unfloored agy; drift green; full bun test 1427 pass; release:validate
in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
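
The offline arm detection with a macOS platform gate can be sketched as follows. This is not `env-detect.sh` (which is a shell script); `detect_arm` and its parameters are hypothetical, and treating a missing credential file as "unavailable" is an assumption. The status strings "ok" and "unavailable" follow the wording in the commit message.

```python
import os
import platform
import shutil


def detect_arm(binary, credential_path=None, macos_only=False,
               system=None, which=shutil.which, exists=os.path.exists):
    """Report arm availability without any network call.

    Checks are presence-only: the binary on PATH and, optionally, a
    credential file on disk. The file is never opened or printed.
    `system`, `which`, and `exists` are injectable so the gate can be
    exercised on any host.
    """
    system = system or platform.system()
    # Platform gate: an arm whose read-only floor needs macOS is never offered elsewhere.
    if macos_only and system != "Darwin":
        return "unavailable"
    if which(binary) is None:
        return "unavailable"
    if credential_path and not exists(os.path.expanduser(credential_path)):
        return "unavailable"
    return "ok"
```

Passing `system="Linux"` mirrors the "simulated Linux" check in the verification note: the gate reports the arm as unavailable even when the binary is installed.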
…-06-18)

gemini's CLI hits an HTTP-410 cutoff on 2026-06-18, so shipping it as a
selectable fallback that dies in June added no durable value. Remove it from the
ce-deep-review skill now; agy (Antigravity) is the sole non-codex skill arm.

Skill-only removal -- the shared arms.py gemini arm is KEPT for the cross-model
eval (it is the eval's default arm c, a valid blind-judge family, and is covered
by cross-model-review-eval.test.ts). Ripping it out of arms.py would break the
eval's codex-vs-gemini comparison for no benefit.

- env-detect.sh: drop the gemini arm + key; emit {codex, agy} only (the
  ~/.gemini OAuth path stays -- it is agy's auth file too).
- SKILL.md / consent-gate.md / arm-invocation.md: gate no longer offers gemini;
  verb-label examples use agy (Antigravity); "gemini retired" notes added.
- panel-critique.sh header: arms = codex + agy (no gemini-selectable advertising;
  the script still passes any --models through to arms.py).
- docs/skills/ce-deep-review.md + onboarding: arm table / example updated.
- tests: contract asserts the agy verb-label; ru2 test asserts env-detect emits
  no gemini key.
- v4 plan: RU2 status / risk table / Phase 2b reconciled (gemini removed, not
  retained).

Verified: env-detect emits {codex, agy}; full bun test 1427 pass (incl. the eval
gemini-arm tests, still green); release:validate in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…els semantics (RU3)

panel-critique.sh now forks one background subshell PER MODEL (each running the
six lenses sequentially) and waits on all, instead of the sequential lens x model
loop. Parallelizing across models (not across lenses) overlaps the slow arms
while bounding concurrency to one in-flight request per vendor -- the
rate-limit/resource mitigation the feasibility lens flagged. Per-(model,lens)
progress lines stream as each cell completes (R15); they interleave (each is
self-labeled; records key on ${cli}__${lens}.json so parallel writers never
collide).

--models semantics, now defined: default = all available (codex + agy);
unavailable / off-platform arms warn-SKIP per cell, never fatal -- a missing
binary, or agy off-macOS (its read-only floor is a macOS seatbelt) -- and the
rest still run.

- panel-critique.sh: per-model subshell fan-out + wait; agy off-mac SKIP;
  default/skip semantics documented.
- tests: 3 new RU3 tests (subshell-per-model structure; bogus arms warn-skip not
  fatal; agy off-mac SKIP) in ce-deep-review-beta-arms-ru2.test.ts.
- v4 plan: RU3 marked DONE; Phase 2b complete.

Verified: live --models codex,agy produced all 12 records with interleaved
progress (proves concurrency); bogus arms -> exit 0, no records; agy under a
Linux uname stub -> SKIP, no record; full bun test 1430 pass; drift green;
release:validate in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
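
The fan-out shape described above (parallel across models, sequential across lenses within a model) can be sketched in Python with one worker thread per model. The shipped implementation uses background subshells in `panel-critique.sh`; threads stand in for them here, the `review` callable is a hypothetical stub for a real CLI call, and the lens names are placeholders.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_model(model, lenses, out_dir, review):
    """Run all lenses for one model in sequence: one in-flight request per vendor."""
    written = []
    for lens in lenses:
        record = review(model, lens)  # hypothetical stand-in for the CLI call
        # Records key on <model>__<lens>.json, so parallel writers never collide.
        path = Path(out_dir) / f"{model}__{lens}.json"
        path.write_text(json.dumps(record))
        print(f"[{model}/{lens}] done")  # self-labeled progress line; may interleave
        written.append(path.name)
    return written


def fan_out(models, lenses, out_dir, review):
    """Start one worker per model and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {m: pool.submit(run_model, m, lenses, out_dir, review) for m in models}
        return {m: f.result() for m, f in futures.items()}
```

Concurrency is bounded by the number of models rather than models times lenses, which is the rate-limit trade-off the commit message names.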
…ackstop (RU4)

Adds the verification step that grounds every raw cross-model finding against the
reviewed plan and assigns one verdict: CONFIRMED (a substantial verbatim quote
that exists in the plan), NOT-FOUND-IN-DOC (a claimed quote that is absent), or
NEEDS-HUMAN (no substantial quote to check). Replaces the thin slice's
`verification: none` with `verification: quote-grep-backstop`.

The backstop (scripts/verify-findings.py) is a pure function of (finding text,
doc): deterministic, authoritative (a model may not override CONFIRMED /
NOT-FOUND-IN-DOC), and blind to the producing model (the verdict never reads the
model label). Scope decision: v1 uses the deterministic quote-grep as the SOLE
authoritative gate -- no LLM verifier, which would re-introduce the
verifier-contamination failure mode the panel flagged. CONFIRMED certifies the
quoted evidence exists, NOT that the finding is correct/important -- that stays a
human call. A high NEEDS-HUMAN count is expected and honest (a quote-grep can
only adjudicate findings that quote the doc).

- scripts/verify-findings.py: verify-one + verify-records subcommands; quote
  extraction + normalized (smart-quote/dash/whitespace-folded) substring grep;
  multi-word grounding quotes only (a lone filename can't trivially confirm).
- references/verification-protocol.md: verdict semantics + brittleness caveats.
- SKILL.md: Phase 3.5 (verification) added; Phase 4 + intro + description
  reframed from "raw and unverified" to verdict-tagged.
- docs/skills/ce-deep-review.md reframed.
- tests: 7 verify tests + a contract test.

Verification is skill-only (not bundled -- it is skill-specific, not eval-shared).
RU5 (reconciliation + the verified .deep-review.md sidecar) is still pending.

Verified: verdict cases pass; model-blindness (same text -> same verdict under
different model labels); normalization avoids false NOT-FOUND; full bun test 1438
pass; release:validate in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
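
The three-verdict rule can be illustrated with a small pure function. This is a sketch under stated assumptions, not the code in `scripts/verify-findings.py`: the two-word minimum for a "substantial" quote, the rule that any one grounded quote confirms, and the helper names are all hypothetical.

```python
import re

# Fold smart quotes and dashes to ASCII before comparing.
_FOLD = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})


def normalize(text):
    """Fold smart quotes/dashes and collapse runs of whitespace."""
    return " ".join(text.translate(_FOLD).split())


def extract_quotes(finding, min_words=2):
    """Return double-quoted spans with at least `min_words` words (assumed threshold)."""
    quotes = re.findall(r'"([^"]+)"', finding.translate(_FOLD))
    return [q for q in quotes if len(q.split()) >= min_words]


def verdict(finding, doc):
    """Pure function of (finding text, doc); no model label is read."""
    quotes = extract_quotes(finding)
    if not quotes:
        return "NEEDS-HUMAN"
    haystack = normalize(doc)
    # Assumption: one grounded quote is enough to confirm.
    if any(normalize(q) in haystack for q in quotes):
        return "CONFIRMED"
    return "NOT-FOUND-IN-DOC"
```

Because `verdict` takes no model argument, the same finding text yields the same verdict regardless of which arm produced it.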
…ecar (RU5)

Promotes the verdict-tagged findings into the reserved verified sidecar
<plan>.deep-review.md -- the skill's terminal output is now the verified file,
not the thin-slice draft. Completes Phase 3 (RU4 verify + RU5 reconcile).

scripts/reconcile.py (skill-only) provides two deterministic helpers:
- rotate: rename an existing <plan>.deep-review.md to .deep-review.<ISO>.md and
  keep only the 5 newest rotations. DATA-LOSS-SAFE: the prune glob matches
  rotations only (<plan>.deep-review.<infix>.md) -- never the base (just renamed
  away) nor the -draft sidecar (a -draft infix, not a .-delimited one). This is
  the rotation data-loss surface the feasibility review flagged.
- render-cross-model: deterministic by-lens, verdict-tagged Markdown from
  verify-records output; grounding quote shown on CONFIRMED.

SKILL.md Phase 4 restructured: rotate -> render -> write <plan>.deep-review.md
with skill_phase: verified, verification: quote-grep-backstop, verdict counts,
coverage; banner precedence is coverage-only (the UNVERIFIED banner is gone) plus
a NEEDS-HUMAN triage note; a decision-changing union closes the body. An existing
.deep-review-draft.md is left in place (historical dogfood artifact). .gitignore
untouched (still an open decision) + committed-leak reminder when content_preview
is unavailable. references/reconciliation.md documents the contract. Intro /
description / user doc reframed; H1 drops "thin slice".

- tests: 4 reconcile tests (rotate keep-5 + base/draft safety + path guard;
  render grouping/tagging) + contract test updated to the verified output.

Verified: rotate keeps 5 newest by ISO infix, prunes older, spares base/draft,
refuses non-.deep-review.md paths; render groups by lens + orders verdicts; full
bun test 1442 pass; release:validate in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
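
The rotate-and-prune behaviour can be sketched as below. This is an illustration of the contract described above, not the code in `scripts/reconcile.py`: the function signature is hypothetical, the stamp is supplied by the caller, and same-stamp collisions are not handled in this sketch.

```python
from pathlib import Path

SUFFIX = ".deep-review.md"


def rotate(sidecar, stamp, keep=5):
    """Rename <plan>.deep-review.md to <plan>.deep-review.<stamp>.md, keep the newest `keep`."""
    sidecar = Path(sidecar)
    if not sidecar.name.endswith(SUFFIX):
        raise ValueError(f"refusing to rotate a non-sidecar path: {sidecar}")
    stem = sidecar.name[: -len(".md")]  # "<plan>.deep-review"
    if sidecar.exists():
        sidecar.rename(sidecar.with_name(f"{stem}.{stamp}.md"))
    # "<stem>.*.md" matches rotations only: the base has no infix, and the
    # draft sidecar uses "-draft", which does not follow "<stem>.".
    rotations = sorted(sidecar.parent.glob(f"{stem}.*.md"),
                       key=lambda p: p.name, reverse=True)
    for old in rotations[keep:]:
        old.unlink()
    return [p.name for p in rotations[:keep]]
```

Sorting by file name orders rotations newest-first only when the stamp format sorts lexicographically in time order, as ISO timestamps do.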
RU6 (Phase 4 build). Adds the verifier-rate measurement and finalizes docs.

RU6b -- verifier rates (re-scoped for the deterministic verifier):
- verify-findings.py gains a `measure` subcommand over a labeled corpus
  (references/calibration/verifier-corpus.json: 11 grounded + 11 confabulated,
  including format-variant grounded items that exercise normalization). Computes
  false-CONFIRM (expected NOT-FOUND -> CONFIRMED) and false-NOT-FOUND (expected
  CONFIRMED -> not confirmed); eligible when both <=5%. Measured 0% / 0%.
- Re-scope (deviation from v3 U12, justified by RU4's deterministic decision):
  the verifier is a deterministic, model-blind quote-grep, so v3's agy-voiced
  sampling, synthetic-fallback, calibration_scope, and N=3 trials are MOOT --
  verdict is a pure function of (text, doc), no voice, no variance. N=1 eval.

RU6a -- docs/cleanup:
- README row reframed (verified output, not "thin slice unverified").
- Brainstorm: grok -p retention line corrected to OD-3 = CONFIRMED acceptable
  (the AV_API_KEY -> oauth_creds correction was already in via supersedes notes).
- Drift-gate DECIDED: the committed bun equality test is the gate (runs under
  `bun test` -> CI's test check), no redundant .github step; v3's phantom
  "CI step" claim lives only in superseded v3 and v4 already corrects it.
- Full contract test is in place (accreted across RU4/RU5/RU6).
- v4 plan: RU6 DONE; Phase 4 build complete.

Promotion (beta->stable) remains gated on human-run checks, not more code: the
manual end-to-end over F1-F5 and the OD-1 dogfood (>=2 devs), which needs the
skill shipped first.

Verified: corpus eligible at 0%/0%; 2 RU6b tests; full bun test 1444 pass;
release:validate in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
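
The two rates and the eligibility rule in RU6b can be written out as a short function. This is a sketch of the arithmetic only; the `measure` subcommand's real input format is not shown in this thread, so the `(expected, actual)` pair representation and the return shape are assumptions.

```python
def measure(results, threshold=0.05):
    """Compute verifier error rates over labeled (expected, actual) verdict pairs.

    false-CONFIRM:   expected NOT-FOUND-IN-DOC but verdicted CONFIRMED.
    false-NOT-FOUND: expected CONFIRMED but verdicted anything else.
    Eligible when both rates are at or below `threshold`.
    """
    results = list(results)
    expected_absent = [actual for expected, actual in results if expected == "NOT-FOUND-IN-DOC"]
    expected_present = [actual for expected, actual in results if expected == "CONFIRMED"]
    false_confirm = (
        sum(a == "CONFIRMED" for a in expected_absent) / len(expected_absent)
        if expected_absent else 0.0
    )
    false_not_found = (
        sum(a != "CONFIRMED" for a in expected_present) / len(expected_present)
        if expected_present else 0.0
    )
    return {
        "false_confirm": false_confirm,
        "false_not_found": false_not_found,
        "eligible": false_confirm <= threshold and false_not_found <= threshold,
    }
```

With an 11 + 11 corpus, a single misclassified item in either half gives a rate of 1/11 (about 9%), which already exceeds the 5% bar.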

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: b9cd13a62d

The quote-grep backstop's normalize() folded smart quotes/dashes/whitespace
but not markdown emphasis markers (* and _), so a cross-model finding that
quoted an emphasized phrase verbatim was verdicted NOT-FOUND-IN-DOC: a doc
writing `the order *is* the container` did not match a finding quoting
"the order is the container". Surfaced by the first real end-to-end run on a
brainstorm doc.

Strip * and _ from both doc and quote in normalize(). Safe against snake_case
false-merges: removal inserts no space, so `market_id` -> `marketid` only
collapses to a match when both doc and quote carry the underscore (a true
verbatim quote); a spaced paraphrase keeps its space and still won't match.

Adds a markdown-emphasis grounded case (g12) to the RU6 calibration corpus
(measure stays 0%/0% eligible) and two unit tests. Verified on the saved run:
exactly one finding flipped NOT-FOUND -> CONFIRMED (the clean emphasis
artifact); the other genuine paraphrase/quote-char-swap cases correctly
remained NOT-FOUND.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
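
The emphasis-marker fix can be shown with a small `normalize` sketch. It is an approximation of the behaviour described above, not the function in `verify-findings.py`; the exact set of folded characters is an assumption.

```python
# Fold smart quotes and dashes to ASCII (assumed set).
_FOLD = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})


def normalize(text):
    """Fold quotes/dashes, strip markdown emphasis markers, collapse whitespace."""
    folded = text.translate(_FOLD)
    # Removal inserts no space, so "market_id" becomes "marketid", not "market id".
    folded = folded.replace("*", "").replace("_", "")
    return " ".join(folded.split())


def grounded(quote, doc):
    """True when the normalized quote is a substring of the normalized doc."""
    return normalize(quote) in normalize(doc)
```

Applied to both sides, the emphasized doc phrase and the plain quote compare equal, while a spaced paraphrase such as "market id" still fails to match `market_id`.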

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2cc9c70f0d

Comment thread scripts/eval/cross_model_review/build_corpus.py
@jaybna jaybna changed the title from "feat(cross-model-eval): cross-model critique evaluation harness (plan + code-review breakpoints)" to "feat(ce-deep-review): verified cross-model deep-review skill (RU1–RU6) + the eval harness behind it" on May 29, 2026
Resolves the genuinely-open automated-review threads. Most prior-round
findings were already fixed in dba311d but their threads were never
resolved on GitHub (handled as replies); these are the items still open
against current code.

drive_eval.py:
- coverage() requires exact (arm, doc_id, trial) tuple membership, not
  cardinality, so stale/extra records can't look complete
- any incomplete run is forced to inconclusive (not just build:<arm>)
- load_records drops failed (timeout/error) records so a failed trial
  counts as missing, not a clean zero-finding review
- plan() validates go_threshold / minimum_corpus_n / trials_per_arm as
  positive integers

run_arms.py:
- sanitize doc_id into filename components (_slug); ingest catches OSError
- negative-control judge wording gated on subset==negative_control;
  forward_rated docs get neutral merit-based judging
- parse_judge_verdicts rejects arrays of non-objects (returns []) so a
  failed judge run can't feed garbage downstream

build_corpus.py:
- blame only the deleted/changed old-side lines of a fix hunk, not the
  unchanged context lines (no longer mis-attributes context authors)

verify-findings.py:
- verify-records filters records to the current plan's doc_id, so stale
  records in a reused CMRE_OUT_DIR aren't published into this sidecar
  (verify-one and model-blindness untouched)

gitleaks-scan.sh:
- a gitleaks invocation error (or missing/unparseable report) now maps to
  the escalated "unavailable" path, not a false clean "ran" before egress

SKILL.md:
- Phase 0 guard against silently treating an unresolved CLAUDE_SKILL_DIR
  as "zero arms"

README.md (eval runbook):
- correct the ingest pipe to invoke run_arms.py via python3

Deferred (tracked in EveryInc#864): renamed-file blame path; trial-in-GT-UID.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
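
The difference between cardinality-based and membership-based coverage, together with dropping failed records, can be sketched as follows. The record field names (`arm`, `doc_id`, `trial`, `status`) and the helper names are hypothetical; the actual `drive_eval.py` code is not shown in this thread.

```python
from itertools import product


def drop_failed(records):
    """Treat timeout/error records as missing rather than as clean zero-finding reviews."""
    return [r for r in records if r.get("status") not in ("timeout", "error")]


def coverage(records, arms, doc_ids, trials):
    """Require every exact (arm, doc_id, trial) tuple, not merely the right count."""
    expected = set(product(arms, doc_ids, range(1, trials + 1)))
    seen = {(r["arm"], r["doc_id"], r["trial"]) for r in records}
    missing = expected - seen
    return {"complete": not missing, "missing": sorted(missing)}
```

A stale record from another document keeps the record count the same but leaves one expected tuple unfilled, so the run is reported as incomplete.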

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: becffb998c

Comment thread plugins/compound-engineering/skills/ce-deep-review-beta/SKILL.md Outdated
…afe rotation

Two follow-up review findings on the latest push:

- allowed-tools only whitelisted the three bash *.sh wrappers, but Phase 3.5
  and Phase 4 invoke `python3 verify-findings.py` and `python3 reconcile.py`.
  Under a restricted permission environment the flow would clear consent and
  dispatch, then block at the mandatory verification/reconciliation steps
  before writing the verified sidecar. Added Bash(python3 *verify-findings.py*)
  and Bash(python3 *reconcile.py*), and normalized all entries to the trailing-*
  form the onboarding doc already documents (so arg-bearing invocations match).

- reconcile.py rotate() used a per-second stamp; two runs in the same second
  (or an explicit duplicate --now) produced an identical filename and
  os.rename silently clobbered the earlier backup, violating the data-loss-safe
  contract. Now disambiguates with a numeric suffix (<stamp>-1, -2, ...) which
  still sorts newest-first for keep-N pruning. Regression test added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
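
The collision fix for rotation names can be sketched as a small helper that probes for a free name. `unique_rotation_path` is a hypothetical name, and the sketch covers only the naming step; how the real `rotate()` orders suffixed names for pruning is not shown here.

```python
from pathlib import Path


def unique_rotation_path(directory, stem, stamp):
    """Return a rotation path that does not exist yet.

    First choice is <stem>.<stamp>.md; on collision, try <stamp>-1, <stamp>-2, ...
    so a rename never overwrites an earlier backup.
    """
    directory = Path(directory)
    candidate = directory / f"{stem}.{stamp}.md"
    n = 0
    while candidate.exists():
        n += 1
        candidate = directory / f"{stem}.{stamp}-{n}.md"
    return candidate
```

The probe and the later rename are not atomic, so this guards against repeated runs in one process or session rather than against truly concurrent writers.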

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1ae8b3be01

Comment on lines +133 to +134
if rec_doc_id is not None and rec_doc_id != f"{doc_id_base}__{lens}":
    continue  # stale record from another plan (or another lens) in a reused dir

P2: Filter out same-plan stale model records

This still lets stale records from a previous run of the same plan through: the current guard only rejects records whose `doc_id` differs from `<plan>__<lens>`, so if one run consented to `codex,agy` and a later run of the same plan selects only `codex` in the default `/tmp/cmre-panel` directory, the old `agy__*.json` files have matching `doc_id`s and are verified/published into the new `.deep-review.md`. Fresh evidence beyond the prior thread is that the implemented filter checks only doc/lens here and has no run id or selected-model check, so same-plan stale records still pass.
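
One possible shape for the selected-model check the comment asks for is sketched below. It is not a fix from this PR: the `model` record field, the `selected_models` argument, and the function name are hypothetical, and a run id would be an alternative approach.

```python
def current_records(records, doc_id_base, lens, selected_models):
    """Keep only records for this plan/lens that came from a model selected in this run."""
    expected_doc_id = f"{doc_id_base}__{lens}"
    kept = []
    for rec in records:
        rec_doc_id = rec.get("doc_id")
        # Existing guard: drop records from another plan or lens.
        if rec_doc_id is not None and rec_doc_id != expected_doc_id:
            continue
        # Additional guard (assumption): drop records from models not consented this run.
        if rec.get("model") not in selected_models:
            continue
        kept.append(rec)
    return kept
```

Under this filter, an `agy` record left over from an earlier run of the same plan is skipped when the current run selected only `codex`.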


@jaybna
Author

jaybna commented May 29, 2026

Recasting this PR to fit the contributions policy in the README — I saw that outside PRs aren't merged directly and that you have Claude/Codex review submissions and decide independently. That's completely fair, so:

Take whatever's useful, re-implement it your way, or ignore it — all good. Happy to close this in favor of the issue if a PR isn't the surface you want. Thanks for the plugin; it's what this was built on.

