46 commits
5ace511
docs(cross-model-review-eval): add brainstorm and eval plan
jaybna May 24, 2026
759cc57
feat(cross-model-eval): add corpus manifest stub and pre-registration…
jaybna May 24, 2026
1de3c89
feat(cross-model-eval): add runner skeleton, record schema, and carri…
jaybna May 24, 2026
e98bf5a
feat(cross-model-eval): add CLI arms (b, c) with arm-b isolation prob…
jaybna May 24, 2026
4d36a30
feat(cross-model-eval): add in-process arm prompts (U4)
jaybna May 25, 2026
27ffe44
feat(cross-model-eval): add blinded judge rubric, dedup/integrity, an…
jaybna May 25, 2026
251f153
feat(cross-model-eval): add decision-artifact template (U7)
jaybna May 25, 2026
0ca59ed
fix(cross-model-eval): make arm-b runnable and parse numbered finding…
jaybna May 25, 2026
313e8e9
docs(cross-model-eval): capture arm-isolation methodology lesson
jaybna May 25, 2026
b0a5cbc
feat(cross-model-eval): add critique.sh one-liner for quick cross-mod…
jaybna May 25, 2026
c9cd048
docs(cross-model-eval): generalize the run reference (drop internal r…
jaybna May 25, 2026
3ae2683
feat(cross-model-eval): add known-bug corpus builder for the code-rev…
jaybna May 25, 2026
78c4dc3
feat(cross-model-eval): add GT-match scoring for the code-review brea…
jaybna May 25, 2026
4cea60f
feat(cross-model-eval): wire the code-review eval pipeline end-to-end
jaybna May 25, 2026
d73a25d
fix(cross-model-eval): give GT-match findings arm-opaque unique ids t…
jaybna May 25, 2026
3f01a4d
docs(cross-model-eval): record the first code-review run findings
jaybna May 25, 2026
2db9787
feat(cross-model-eval): add corpus quality gate to scan-fixes
jaybna May 25, 2026
a317091
docs(cross-model-eval): record the gated second run + the metric finding
jaybna May 25, 2026
aa5d82f
feat(cross-model-eval): add finding-yield metric alongside GT-match
jaybna May 25, 2026
c5f6700
feat(cross-model-eval): wire gemini CLI as the Gemini-family arm; dep…
jaybna May 25, 2026
c86da36
docs(cross-model-eval): record run 3 (gemini + finding-yield) and the…
jaybna May 25, 2026
8cefdd9
test(cross-model-eval): genericize a code-path test fixture filename
jaybna May 26, 2026
8e80383
fix(cross-model-eval): switch critique.sh from deprecated agy to gemini
jaybna May 26, 2026
1dfba11
feat(cross-model-eval): add panel-critique.sh (fair 6-lens compare) +…
jaybna May 26, 2026
82f4325
docs(cross-model-eval): correct the plan-review finding after the fai…
jaybna May 26, 2026
93e0ef6
fix(cross-model-eval): make finding-count reliable (JSON output) + co…
jaybna May 26, 2026
e1d9ee0
feat(cross-model-eval): add a non-Claude blind judge (the last decisi…
jaybna May 26, 2026
9fa8a7a
docs(cross-model-eval): add the decision-grade run procedure (pre-reg…
jaybna May 26, 2026
1706594
docs(cross-model-eval): decision-grade run record (non-Claude judge, …
jaybna May 26, 2026
dba311d
fix(cross-model-eval): harden the decision spine against invalid/part…
jaybna May 26, 2026
6072ec5
feat(cross-model-eval): seatbelt deny-write floor for the agy arm + P…
jaybna May 28, 2026
e1226ed
docs(ce-deep-review): Phase 0 arm-posture findings, onboarding, plan …
jaybna May 28, 2026
006b709
feat(ce-deep-review-beta): Phase 1 thin slice — panel + consent gate …
jaybna May 28, 2026
aa55833
docs(ce-deep-review): discoverability for the beta thin slice
jaybna May 28, 2026
766c730
fix(ce-deep-review-beta): make consent legible to the auto-mode egres…
jaybna May 29, 2026
09d7f73
docs(ce-deep-review): add v4 plan reconciled to committed reality; ma…
jaybna May 29, 2026
335bf6c
docs(ce-deep-review): reconcile v4 plan to OD-4 resolution (dogfood #2)
jaybna May 29, 2026
cf1b388
feat(ce-deep-review): make agy (Antigravity) the default cross-model …
jaybna May 29, 2026
72ca96b
chore(ce-deep-review): retire the gemini arm from the skill (EOL 2026…
jaybna May 29, 2026
df80731
feat(ce-deep-review): parallelize cross-model dispatch + define --mod…
jaybna May 29, 2026
c3badcf
feat(ce-deep-review): verify cross-model findings with a quote-grep b…
jaybna May 29, 2026
62fa59b
feat(ce-deep-review): reconcile into the verified .deep-review.md sid…
jaybna May 29, 2026
b9cd13a
feat(ce-deep-review): verifier-rate measurement + RU6 validation/cleanup
jaybna May 29, 2026
2cc9c70
fix(ce-deep-review): fold markdown emphasis in verify-findings normalize
jaybna May 29, 2026
becffb9
fix(review): address Codex review feedback on #858
jaybna May 29, 2026
1ae8b3b
fix(ce-deep-review): whitelist verify/reconcile helpers + collision-s…
jaybna May 29, 2026
152 changes: 152 additions & 0 deletions docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md

Large diffs are not rendered by default.

334 changes: 334 additions & 0 deletions docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md

Large diffs are not rendered by default.

@@ -0,0 +1,90 @@
---
title: "Cross-model review eval: isolate all arms to identical context, or context masquerades as model diversity"
date: 2026-05-24
last_updated: 2026-05-24
category: skill-design
module: compound-engineering / cross-model-review-eval
problem_type: design_pattern
component: scripts/eval/cross_model_review
severity: medium
tags:
- eval
- eval-methodology
- cross-model
- blinding
- fairness
- confound
- subagent-dispatch
related_plan: docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
related_brainstorm: docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
---

## Context

The cross-model review eval compares review "arms" on the same document: a Claude
baseline (a), a cross-model CLI with no repo context (b), a cross-model CLI with a fixed
context set (c), and a same-model self-critic (d). The comparison is only meaningful if
every arm receives **identical, controlled context** — the design's whole point is to
isolate *model* and *context* as separate variables.

A live run on two real plans from a separate internal repo violated this and produced a confidently-wrong
conclusion. The cross-model arms (b, c) were correctly isolated — `codex` ran from a clean
CWD with the plan piped via stdin, no repo access. But the in-process Claude arms (a, d)
were dispatched as general-purpose subagents **given a path into the live repo**, so they
explored sibling files and discovered that both plans were already implemented and had
drifted from the plan. That made the Claude arms look impressively decorrelated — they
"found" things the codex arms missed.

It was an artifact, not a result. Re-running with the Claude arms isolated to a
**standalone copy of just the plan in OS temp** (no surrounding repo) plus a hard "read
ONLY this file, do not explore the filesystem or any repo" instruction produced none of
the drift findings. Fairly matched, the models mostly **agreed** on premise-level issues;
the biggest finding-count delta came from **context** (on one plan, codex with context
produced 35 findings versus 10 without), not from model identity. The isolation re-run overturned the
contaminated run's apparent "build cross-model review" conclusion.
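
The isolation step described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `stage_standalone_copy` and `build_arm_prompt` are hypothetical names, and the instruction string paraphrases the one quoted in this section.

```python
import shutil
import tempfile
from pathlib import Path

# Paraphrase of the hard instruction quoted above (wording is an assumption).
ISOLATION_INSTRUCTION = (
    "Read ONLY this file. Do not explore the filesystem or any repo."
)


def stage_standalone_copy(document: Path) -> Path:
    """Copy one document into a fresh OS temp directory and return the copy's path.

    The new directory contains nothing else, so a subagent handed this path
    has no sibling files to explore.
    """
    staging_dir = Path(tempfile.mkdtemp(prefix="arm-input-"))
    staged = staging_dir / document.name
    shutil.copyfile(document, staged)
    return staged


def build_arm_prompt(staged: Path) -> str:
    """Prefix the review request with the isolation instruction."""
    return f"{ISOLATION_INSTRUCTION}\n\nDocument to review: {staged}"
```

The design choice is that the subagent never receives a path inside the live repo, so the isolation does not depend on the instruction alone being obeyed.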

## Guidance

When evaluating multiple review configs (models, with/without context, self-critic),
isolate every arm to the same input shape before comparing:

- **CLI arms:** clean CWD + document via stdin only. For `codex`, add
`--skip-git-repo-check` (it refuses to run from a clean dir without it) and do **not**
strip `HOME` (that kills the CLI's auth — isolate via a clean CWD, not by overriding
`HOME`). `agy --print` is the keyless Gemini path; the `gemini` CLI needs `GEMINI_API_KEY`.
- **In-process subagent arms:** pass a **standalone copy of the document in OS temp**, never
a path into the live repo. A subagent handed a repo path will explore siblings and gain
context the other arms lack. Add an explicit "read ONLY this file; do not read, search,
glob, or list any other file; do not inspect any repository" instruction.
- The **arm-b vs arm-c context delta is the experimental control** — nothing else should
differ between them.
- Run the **blind-integrity probe** (have the judge guess each finding's arm); treat
above-chance accuracy as confounded and the per-arm metric as untrusted.
- **Operational notes from the run:** keep per-arm staging files **outside** the shared run
dir; otherwise `pool`, which globs `*.json`, double-counts them. Count only canonically named records.
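
The CLI-arm guidance above (clean CWD, document on stdin, `HOME` left alone) might look like the sketch below. `run_cli_arm` is a hypothetical helper, and the reviewer command is passed in by the caller rather than hard-coded, because the exact `codex` or `agy` invocation is not shown in this document.

```python
import os
import subprocess
import tempfile


def run_cli_arm(command: list[str], document_text: str, timeout_s: int = 600) -> str:
    """Run a reviewer CLI from an empty temp CWD with the document piped on stdin.

    Isolation comes from the clean working directory only. The environment,
    including HOME, is passed through unchanged so the CLI's auth keeps working.
    """
    with tempfile.TemporaryDirectory(prefix="arm-cwd-") as clean_cwd:
        result = subprocess.run(
            command,
            input=document_text,      # document arrives via stdin only
            cwd=clean_cwd,            # no repo, no sibling files
            env=os.environ.copy(),    # do not strip or override HOME
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,               # a failing CLI raises instead of silently no-opping
        )
    return result.stdout
```

Arms b and c would call the same helper with the same command. Only `document_text` differs (plan alone versus plan plus the fixed context set), which keeps the context delta as the single controlled variable.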

## Why This Matters

Unequal context across arms doesn't just add noise — it can invert the conclusion. The
contaminated run made the expensive cross-model lever look clearly justified (Claude found
drift codex missed). The isolated run showed the opposite: the apparent "model diversity"
was mostly a context difference, which a **cheaper same-model-with-context pass could also
deliver**. An eval whose entire purpose is a build/no-build decision must not let a context
confound decide it. This is the eval-first approach working — it caught the confound the
moment it ran on real inputs.
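
One mechanical guard against this kind of confound is the blind-integrity probe from the guidance. The sketch below is an assumption about how "above-chance accuracy" could be operationalized: it uses a one-sided exact binomial test against a chance rate of `1 / num_arms`, and the `alpha` threshold is illustrative rather than taken from the harness.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def blind_integrity(guesses: list[tuple[str, str]], num_arms: int, alpha: float = 0.05) -> dict:
    """Check whether the judge can identify arms better than chance.

    guesses holds (true_arm, judge_guess) pairs, one per finding.
    """
    if not guesses:
        raise ValueError("need at least one guess to run the probe")
    trials = len(guesses)
    correct = sum(1 for true_arm, guess in guesses if true_arm == guess)
    chance = 1 / num_arms
    p_value = binomial_tail(correct, trials, chance)
    return {
        "accuracy": correct / trials,
        "chance": chance,
        "p_value": p_value,
        # Above-chance accuracy means blinding leaked: distrust the per-arm metric.
        "confounded": p_value < alpha,
    }
```

If `confounded` comes back true, findings carry an arm signature (style, context, or content) that the judge can read, and per-arm scores should be treated as untrusted.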

## When to Apply

Any multi-arm evaluation that compares models or review configurations on the same input —
especially when some arms are subprocess CLIs (no ambient context) and others are in-process
subagents (ambient tool/repo access). The asymmetry is the trap.

## Related

- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md` — N≥3 trials and
variance-as-signal; the same harness-discipline family (single trials and unequal context
both produce confidently-wrong, reversed conclusions).
- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — per-finding
(not batched) blinded judging; the blind-integrity rationale.
- `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md` — the harness this lesson
governs (arm isolation, fair b-vs-c context, blind-integrity check are all requirements there).
122 changes: 122 additions & 0 deletions docs/solutions/skill-design/cross-model-eval-first-run-2026-05-25.md
@@ -0,0 +1,122 @@
---
module: cross-model-review-eval
tags: [evaluation, code-review, cross-model, corpus, blinding]
problem_type: workflow-pattern
---

# Cross-model eval — first code-review run (decision record)

First end-to-end run of the code-review breakpoint (`scripts/eval/cross_model_review/`)
against a private internal codebase (~200 `fix:` commits). Target details are deliberately
omitted — this repo is public. The run is a **mechanics validation**, not a decision-grade
result.

## Outcome: inconclusive (correctly)

A single trial, a non-blind same-family judge, and a loosely built corpus each keep this run
below the bar for a real decision. The pipeline reported `inconclusive` rather than a count-driven
`build_nothing` — the safeguards (pre-registration, blind-integrity gate, human override)
fired as designed. A naive read of the raw counts would have been the exact "measure
activity not outcomes" trap the eval exists to prevent.

Per-arm GT-match hits (10 known-failure docs, 1 trial): baseline 0, cross-model-isolated 1,
cross-model-context 0, self-critic 1.

## What the run actually taught (the value)

1. **Decorrelation showed real signal.** The cross-model arm and the same-model self-critic
each surfaced a *different* known bug that every other arm — including the baseline —
missed. Even same-model "fresh adversarial pass" decorrelates from the baseline. This is
the first concrete evidence the lever might pay off, and it justifies a real (blinded,
multi-trial, clean-corpus) run.

2. **External-CLI arms are not uniformly viable.** One cross-model CLI returned usable,
specific findings; the other returned unusable output (it emitted its own CLI's internal
monologue instead of a review). Per-CLI viability must be smoke-checked before a CLI is
trusted as an arm — a configured arm can silently no-op.

3. **Harness bug found and fixed: cross-arm credit bleed.** GT-match verdicts were keyed on
`(doc_id, finding_id)`, but finding ids are only local to a record, so one `matches_bug`
verdict credited *every* arm that reused a local id like `f1`. Fixed by pooling findings
under arm-opaque, globally-unique uids (`gt_pool` / `gt_hits_from_verdicts`); the judge
still never sees the arm. Regression-tested.

4. **Tier-3 auto-corpus needs a quality gate.** Blame-built corpora collapsed multiple
distinct fixes onto a single large culprit commit (10 docs -> 6 distinct diffs; one
culprit was a ~150k-line foundational commit), and the fix's target bug was frequently
*not* the most salient defect in the (huge) culprit diff — so arms found real bugs that
weren't the GT bug, depressing the hit rate for corpus reasons rather than arm quality.
`build_corpus` should cap culprit-diff size, exclude foundational/import commits, require
blame to a small recent commit, and dedup docs sharing a culprit.
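
The credit-bleed fix in item 3 can be illustrated with a small sketch. It is not the repo's `gt_pool` / `gt_hits_from_verdicts` code; `pool_findings` and `hits_per_arm` are hypothetical names, and the record shape is an assumption made for the example.

```python
import random
import uuid
from collections import Counter


def pool_findings(records: list[dict]) -> tuple[list[dict], dict]:
    """Pool findings under globally unique, arm-opaque uids.

    records: [{"arm": str, "doc_id": str, "findings": [{"id": str, "text": str}]}]
    Returns the judge's view (no arm, no local id) and a private uid -> (arm, doc_id) map.
    """
    judge_view = []
    uid_to_source = {}
    for record in records:
        for finding in record["findings"]:
            uid = uuid.uuid4().hex  # unique across all arms, unlike a local id such as "f1"
            judge_view.append(
                {"uid": uid, "doc_id": record["doc_id"], "text": finding["text"]}
            )
            uid_to_source[uid] = (record["arm"], record["doc_id"])
    random.shuffle(judge_view)  # ordering should not hint at the arm
    return judge_view, uid_to_source


def hits_per_arm(verdicts: dict, uid_to_source: dict) -> Counter:
    """Resolve arms only after judging. verdicts maps uid -> matches_bug (bool)."""
    hit_docs = {uid_to_source[uid] for uid, matches_bug in verdicts.items() if matches_bug}
    return Counter(arm for arm, _doc_id in hit_docs)
```

Because each verdict is keyed on a uid that exists exactly once in the pool, a `matches_bug` verdict can credit only the arm that produced that finding, even when several arms reuse the same local id.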

## Process constraint discovered

The agent harness **hard-blocks** sending a private codebase's diffs to external model CLIs
(data exfiltration) — user authorization in-session does not clear it. Cross-model arms over
proprietary code must be run by the user on their own machine, or the eval must use a public
corpus. The in-process arms (baseline, self-critic) have no egress and run normally.

## Second run — gated corpus + blind judging (same day)

Re-ran with the quality gate (10 distinct, tight-blame logic/data culprits + 1 control),
the credit-bleed fix, and **blind judging** (decided `matches_bug` from finding text + GT
only, arms resolved afterward). Still `trials_per_arm=1` and a Claude-family judge.
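
A gate of this kind might be sketched as below. The actual gate in `scan-fixes` is not shown in this document, so the `Candidate` fields, the `apply_quality_gate` name, and the `max_changed_lines` default are all hypothetical. The sketch covers three of the four criteria listed earlier (size cap, foundational-commit exclusion, culprit dedup) and omits the recency check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    culprit_sha: str
    culprit_changed_lines: int
    culprit_is_foundational: bool  # e.g. an initial import or bulk vendoring commit


def apply_quality_gate(candidates: list[Candidate], max_changed_lines: int = 400) -> list[Candidate]:
    """Keep candidates whose culprit commit is small, non-foundational, and not yet used."""
    kept = []
    seen_culprits = set()
    for candidate in candidates:
        if candidate.culprit_is_foundational:
            continue  # blame landed on an import/foundational commit
        if candidate.culprit_changed_lines > max_changed_lines:
            continue  # GT bug is unlikely to be the most salient defect in a huge diff
        if candidate.culprit_sha in seen_culprits:
            continue  # dedup docs that share a culprit
        seen_culprits.add(candidate.culprit_sha)
        kept.append(candidate)
    return kept
```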

Per-arm GT-match hits: baseline 1, cross-model-isolated **0**, cross-model-context 1
(its single coherent finding; empty/garbage on 9/11 docs), self-critic 1. **No arm cleared
the threshold, and the run-1 cross-model advantage did not reproduce** — on a clean corpus
under blind judging the isolated cross-model arm went 0/10 and the Claude arms matched or
beat both cross-model arms. Run 1's edge was largely a corpus artifact.

**The bigger finding is about the metric.** Every arm produced 5-7 specific, serious
findings per document (injection, TOCTOU, silent data corruption, double-counting, collation
errors) — they simply rarely surfaced the *one* bug the historical fix targeted. GT-match
against historical fixes asks "did you find the bug fixed *then*," but a competent reviewer
finds the bugs that matter *now*. A 0/10 GT-match does not mean a weak reviewer; it means the
metric is narrow. **GT-match alone systematically undercounts reviewer value** and must be
paired with a unique-actionable-finding-yield measure (the forward-rated `decision_changing`
count) before any build/kill decision.

Honest verdict of both runs: **inconclusive / underpowered** (single trial). Directionally,
cross-model critique has not shown a GT-match advantage once the corpus and judging confounds
are removed; the Gemini/agy arm is non-viable as configured.

## Third run — gemini replaces agy, finding-yield added (same day)

Re-ran the gated corpus with `agy` swapped for `gemini` (arm c) and the new finding-yield
metric scored alongside GT-match. The arm that was empty under `agy` became a real reviewer
under `gemini`, which produced 4-11 findings per document.

GT-match and finding-yield told **opposite stories**:

- **GT-match:** baseline 1, codex 0, gemini 1, self-critic 1 — no cross-model advantage, and
removing agy actually *lost* a GT hit (agy had coincidentally nailed one collation bug in
run 2 that neither codex nor gemini caught).
- **Finding-yield (actionable findings):** baseline 5, **codex ~30**, **gemini ~55**,
self-critic 9. The cross-model arms reported roughly 6x (codex) and 11x (gemini) as many
actionable findings as the Claude baseline, which is exactly what GT-match hides. This
supports the run-2 thesis: GT-match against historical fixes measures "did you find the one
bug fixed then," not reviewer value.

**But raw yield has a precision hole, and the negative control exposed it.** gemini reported
two specific, plausible defects on the behavior-preserving negative-control diff, citing line
numbers that were not in the diff (it ran from a clean cwd and could not have read them — it
fabricated them). codex's negative control was clean. So gemini's high volume is partly
confabulation; codex's volume is credible. **Raw yield rewards verbosity and confabulation —
it must be precision-weighted** (verify a sample of each arm's findings; track a per-arm
false-positive rate via the negative control + spot checks). A model that invents plausible
bugs scores high on unweighted yield.
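
Precision weighting could be computed as in the sketch below. The function names are hypothetical, and the numbers in the usage example are illustrative placeholders, not results from any run.

```python
def true_positive_rate(sample_verdicts: list[bool]) -> float:
    """Fraction of a verified sample of findings that held up against the real code."""
    if not sample_verdicts:
        raise ValueError("verify at least one finding before weighting yield")
    return sum(sample_verdicts) / len(sample_verdicts)


def precision_weighted_yield(raw_yield: float, sample_verdicts: list[bool]) -> float:
    """Scale raw finding-yield by the arm's sampled true-positive rate."""
    return raw_yield * true_positive_rate(sample_verdicts)


def negative_control_fp_rate(control_findings: int, control_docs: int) -> float:
    """Findings reported per behavior-preserving control diff (ideally 0)."""
    if control_docs <= 0:
        raise ValueError("need at least one negative-control document")
    return control_findings / control_docs


if __name__ == "__main__":
    # Hypothetical arm: 20 raw findings, 10 of them spot-checked, 7 confirmed.
    verdicts = [True] * 7 + [False] * 3
    print(precision_weighted_yield(20, verdicts))
    print(negative_control_fp_rate(control_findings=2, control_docs=1))
```

Ranking on the weighted figure means an arm that invents plausible findings loses credit in proportion to how often its sampled findings fail verification.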

Verdict across three runs: **inconclusive** on a build decision (single trial, single
Claude-family judge, unverified yield), but the picture is now sharp: cross-model arms add
substantial finding *volume*; whether that is *value* hinges on precision, where codex looks
strong (clean control, ~30 actionable) and gemini looks volume-heavy-but-confabulating.

## Next steps for a decision-grade run

- Score **precision-weighted yield**: verify a random sample of each arm's findings against
the real code and compute a per-arm true-positive rate; multiply yield by it. Track the
negative-control false-positive rate per arm. Do not rank arms on raw yield.
- Run blinded with `trials_per_arm >= 3` and a **non-Claude** judge.
- Prefer `codex` (clean control, high credible yield); treat `gemini` yield as suspect until
precision-verified; `agy` stays dropped.
- For the cross-model arms, use a public repo (or user-run egress).