5ace511
docs(cross-model-review-eval): add brainstorm and eval plan
jaybna May 24, 2026
759cc57
feat(cross-model-eval): add corpus manifest stub and pre-registration…
jaybna May 24, 2026
1de3c89
feat(cross-model-eval): add runner skeleton, record schema, and carri…
jaybna May 24, 2026
e98bf5a
feat(cross-model-eval): add CLI arms (b, c) with arm-b isolation prob…
jaybna May 24, 2026
4d36a30
feat(cross-model-eval): add in-process arm prompts (U4)
jaybna May 25, 2026
27ffe44
feat(cross-model-eval): add blinded judge rubric, dedup/integrity, an…
jaybna May 25, 2026
251f153
feat(cross-model-eval): add decision-artifact template (U7)
jaybna May 25, 2026
0ca59ed
fix(cross-model-eval): make arm-b runnable and parse numbered finding…
jaybna May 25, 2026
313e8e9
docs(cross-model-eval): capture arm-isolation methodology lesson
jaybna May 25, 2026
b0a5cbc
feat(cross-model-eval): add critique.sh one-liner for quick cross-mod…
jaybna May 25, 2026
c9cd048
docs(cross-model-eval): generalize the run reference (drop internal r…
jaybna May 25, 2026
3ae2683
feat(cross-model-eval): add known-bug corpus builder for the code-rev…
jaybna May 25, 2026
78c4dc3
feat(cross-model-eval): add GT-match scoring for the code-review brea…
jaybna May 25, 2026
4cea60f
feat(cross-model-eval): wire the code-review eval pipeline end-to-end
jaybna May 25, 2026
d73a25d
fix(cross-model-eval): give GT-match findings arm-opaque unique ids t…
jaybna May 25, 2026
3f01a4d
docs(cross-model-eval): record the first code-review run findings
jaybna May 25, 2026
2db9787
feat(cross-model-eval): add corpus quality gate to scan-fixes
jaybna May 25, 2026
a317091
docs(cross-model-eval): record the gated second run + the metric finding
jaybna May 25, 2026
aa5d82f
feat(cross-model-eval): add finding-yield metric alongside GT-match
jaybna May 25, 2026
c5f6700
feat(cross-model-eval): wire gemini CLI as the Gemini-family arm; dep…
jaybna May 25, 2026
c86da36
docs(cross-model-eval): record run 3 (gemini + finding-yield) and the…
jaybna May 25, 2026
8cefdd9
test(cross-model-eval): genericize a code-path test fixture filename
jaybna May 26, 2026
8e80383
fix(cross-model-eval): switch critique.sh from deprecated agy to gemini
jaybna May 26, 2026
1dfba11
feat(cross-model-eval): add panel-critique.sh (fair 6-lens compare) +…
jaybna May 26, 2026
82f4325
docs(cross-model-eval): correct the plan-review finding after the fai…
jaybna May 26, 2026
93e0ef6
fix(cross-model-eval): make finding-count reliable (JSON output) + co…
jaybna May 26, 2026
e1d9ee0
feat(cross-model-eval): add a non-Claude blind judge (the last decisi…
jaybna May 26, 2026
9fa8a7a
docs(cross-model-eval): add the decision-grade run procedure (pre-reg…
jaybna May 26, 2026
1706594
docs(cross-model-eval): decision-grade run record (non-Claude judge, …
jaybna May 26, 2026
dba311d
fix(cross-model-eval): harden the decision spine against invalid/part…
jaybna May 26, 2026
6072ec5
feat(cross-model-eval): seatbelt deny-write floor for the agy arm + P…
jaybna May 28, 2026
e1226ed
docs(ce-deep-review): Phase 0 arm-posture findings, onboarding, plan …
jaybna May 28, 2026
006b709
feat(ce-deep-review-beta): Phase 1 thin slice — panel + consent gate …
jaybna May 28, 2026
aa55833
docs(ce-deep-review): discoverability for the beta thin slice
jaybna May 28, 2026
766c730
fix(ce-deep-review-beta): make consent legible to the auto-mode egres…
jaybna May 29, 2026
09d7f73
docs(ce-deep-review): add v4 plan reconciled to committed reality; ma…
jaybna May 29, 2026
335bf6c
docs(ce-deep-review): reconcile v4 plan to OD-4 resolution (dogfood #2)
jaybna May 29, 2026
cf1b388
feat(ce-deep-review): make agy (Antigravity) the default cross-model …
jaybna May 29, 2026
72ca96b
chore(ce-deep-review): retire the gemini arm from the skill (EOL 2026…
jaybna May 29, 2026
df80731
feat(ce-deep-review): parallelize cross-model dispatch + define --mod…
jaybna May 29, 2026
c3badcf
feat(ce-deep-review): verify cross-model findings with a quote-grep b…
jaybna May 29, 2026
62fa59b
feat(ce-deep-review): reconcile into the verified .deep-review.md sid…
jaybna May 29, 2026
b9cd13a
feat(ce-deep-review): verifier-rate measurement + RU6 validation/cleanup
jaybna May 29, 2026
2cc9c70
fix(ce-deep-review): fold markdown emphasis in verify-findings normalize
jaybna May 29, 2026
becffb9
fix(review): address Codex review feedback on #858
jaybna May 29, 2026
1ae8b3b
fix(ce-deep-review): whitelist verify/reconcile helpers + collision-s…
jaybna May 29, 2026
152 changes: 152 additions & 0 deletions docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md

Large diffs are not rendered by default.

334 changes: 334 additions & 0 deletions docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md

Large diffs are not rendered by default.

@@ -0,0 +1,90 @@
---
title: "Cross-model review eval: isolate all arms to identical context, or context masquerades as model diversity"
date: 2026-05-24
last_updated: 2026-05-24
category: skill-design
module: compound-engineering / cross-model-review-eval
problem_type: design_pattern
component: scripts/eval/cross_model_review
severity: medium
tags:
- eval
- eval-methodology
- cross-model
- blinding
- fairness
- confound
- subagent-dispatch
related_plan: docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
related_brainstorm: docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
---

## Context

The cross-model review eval compares review "arms" on the same document: a Claude
baseline (a), a cross-model CLI with no repo context (b), a cross-model CLI with a fixed
context set (c), and a same-model self-critic (d). The comparison is only meaningful if
every arm receives **identical, controlled context** — the design's whole point is to
isolate *model* and *context* as separate variables.

A live run on two real plans from a separate internal repo violated this and produced a confidently-wrong
conclusion. The cross-model arms (b, c) were correctly isolated — `codex` ran from a clean
CWD with the plan piped via stdin, no repo access. But the in-process Claude arms (a, d)
were dispatched as general-purpose subagents **given a path into the live repo**, so they
explored sibling files and discovered that both plans were already implemented and had
drifted from the plan. That made the Claude arms look impressively decorrelated — they
"found" things the codex arms missed.

It was an artifact, not a result. Re-running with the Claude arms isolated to a
**standalone copy of just the plan in OS temp** (no surrounding repo) plus a hard "read
ONLY this file, do not explore the filesystem or any repo" instruction produced none of
the drift findings. Fairly matched, the models mostly **agreed** on premise-level issues;
the biggest finding-count delta came from **context** (on one plan, codex produced 35 findings
with context vs 10 without), not from model identity. The isolation re-run overturned the
contaminated run's apparent "build cross-model review" conclusion.

## Guidance

When evaluating multiple review configs (models, with/without context, self-critic),
isolate every arm to the same input shape before comparing:

- **CLI arms:** clean CWD + document via stdin only. For `codex`, add
`--skip-git-repo-check` (it refuses to run from a clean dir without it) and do **not**
strip `HOME` (that kills the CLI's auth — isolate via a clean CWD, not by overriding
`HOME`). `agy --print` is the keyless Gemini path; the `gemini` CLI needs `GEMINI_API_KEY`.
- **In-process subagent arms:** pass a **standalone copy of the document in OS temp**, never
a path into the live repo. A subagent handed a repo path will explore siblings and gain
context the other arms lack. Add an explicit "read ONLY this file; do not read, search,
glob, or list any other file; do not inspect any repository" instruction.
- The **arm-b vs arm-c context delta is the experimental control** — nothing else should
differ between them.
- Run the **blind-integrity probe** (have the judge guess each finding's arm); treat
above-chance accuracy as confounded and the per-arm metric as untrusted.
- **Operational notes from the run:** keep per-arm staging files **outside** the shared run
dir, because `pool` globs `*.json` and would double-count them; count only canonically-named records.
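
The two isolation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: `stage_standalone_copy` and `run_cli_arm` are hypothetical helper names, and the reviewer command is passed in by the caller rather than assumed.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def stage_standalone_copy(doc: Path) -> Path:
    """Copy only the document into a fresh OS temp dir, so a subagent
    handed this path has no sibling files to explore."""
    staging_dir = Path(tempfile.mkdtemp(prefix="arm-input-"))
    return Path(shutil.copy2(doc, staging_dir / doc.name))


def run_cli_arm(cmd: list[str], doc: Path, timeout_s: int = 600) -> str:
    """Run a reviewer CLI from an empty CWD with the document on stdin only.

    The environment is inherited unchanged: HOME is deliberately NOT
    overridden, because isolation comes from the clean CWD, not from
    stripping the CLI's auth.
    """
    with tempfile.TemporaryDirectory(prefix="arm-cwd-") as clean_cwd:
        result = subprocess.run(
            cmd,
            input=doc.read_text(encoding="utf-8"),
            cwd=clean_cwd,
            env=os.environ.copy(),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    return result.stdout
```

For `codex`, the caller-supplied `cmd` would need to include `--skip-git-repo-check`; the rest of the argv depends on the CLI and is left to the caller. The same staged copy can be handed to the in-process arms, which keeps the input shape identical across all four.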

## Why This Matters

Unequal context across arms doesn't just add noise — it can invert the conclusion. The
contaminated run made the expensive cross-model lever look clearly justified (Claude found
drift codex missed). The isolated run showed the opposite: the apparent "model diversity"
was mostly a context difference, which a **cheaper same-model-with-context pass could also
deliver**. An eval whose entire purpose is a build/no-build decision must not let a context
confound decide it. This is the eval-first approach working — it caught the confound the
moment it ran on real inputs.

## When to Apply

Any multi-arm evaluation that compares models or review configurations on the same input —
especially when some arms are subprocess CLIs (no ambient context) and others are in-process
subagents (ambient tool/repo access). The asymmetry is the trap.

## Related

- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md` — N≥3 trials and
variance-as-signal; the same harness-discipline family (single trials and unequal context
both produce confidently-wrong, reversed conclusions).
- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — per-finding
(not batched) blinded judging; the blind-integrity rationale.
- `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md` — the harness this lesson
governs (arm isolation, fair b-vs-c context, blind-integrity check are all requirements there).
@@ -0,0 +1,74 @@
---
module: cross-model-review-eval
tags: [evaluation, code-review, cross-model, decision-grade, judge]
problem_type: decision-record
---

# Cross-model critique — decision-grade run (decision record)

The first run with all four decision-grade guards on at once: a **non-Claude blind judge**
(codex), **3 trials per arm**, a pre-registered decision rule, and the negative-control +
yield precision checks. Corpus: a private code-review corpus of 10 known-failure culprit
diffs (each paired with the historical fix that proved the bug mattered) + 1
behavior-preserving negative control. Target specifics are omitted because this is a public repo.

## Outcome: inconclusive / underpowered (by design)

Pre-registered `minimum_corpus_n = 20`; the confirmed corpus is 10, so the R9 safeguard
fires and the run reports `inconclusive` rather than a confident build/kill. This is the
guard working, not a failure — a 10-item corpus can't carry a confident verdict, and the
pre-registration prevents an underpowered run from masquerading as one.
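
The power guard described above amounts to a single comparison made before any verdict is reported. A minimal sketch, assuming the floor is inclusive (a corpus at or above `minimum_corpus_n` passes); `gate_verdict` and the `provisional` argument are hypothetical names, not the harness's API:

```python
def gate_verdict(confirmed_corpus_n: int, minimum_corpus_n: int, provisional: str) -> str:
    """Return the provisional verdict only if the corpus meets the
    pre-registered floor; otherwise report 'inconclusive'."""
    if confirmed_corpus_n < minimum_corpus_n:
        return "inconclusive"
    return provisional


# With this run's numbers (10 confirmed vs a floor of 20) the gate fires
# regardless of what the arms found.
print(gate_verdict(confirmed_corpus_n=10, minimum_corpus_n=20, provisional="build"))
```

Fixing the floor before the run means the gate cannot be tuned after the results are in.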

## Primary signal — GT-match (validated), best-of-3-trials per doc

| Arm | GT hits /10 | Caught a bug NO other arm caught |
|-----|-------------|----------------------------------|
| baseline (Claude) | 1 | — |
| cross-model, isolated (codex) | 2 | yes (1) |
| cross-model, +context (gemini) | 4 | yes (1) |
| self-critic (Claude) | 3 | — |

Union across all arms: **5/10** known bugs caught by someone. The decisive result: **each
cross-model arm uniquely surfaced a validated bug that neither Claude arm caught** — grounded
in the actual historical fix, not plausibility. On this corpus, under a fair non-Claude
judge, the cross-model lever **decorrelates and adds GT coverage the Claude panel misses.**
The self-critic (Claude, fresh adversarial pass) also beat the baseline (3 vs 1), but caught
nothing the cross-model+context arm didn't.
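
The union and "caught a bug no other arm caught" columns are plain set arithmetic over per-arm hit sets. In the sketch below, `coverage_report` is a hypothetical helper and the bug IDs are placeholders arranged only to reproduce the counts in the table; the run's real per-bug assignments are not given here.

```python
def coverage_report(hits_by_arm: dict[str, set[str]]) -> dict:
    """Compute the union of GT hits and each arm's uniquely caught bugs."""
    union = set().union(*hits_by_arm.values())
    unique = {}
    for arm, hits in hits_by_arm.items():
        # Everything caught by any arm other than this one.
        others = set().union(*(h for a, h in hits_by_arm.items() if a != arm))
        unique[arm] = hits - others
    return {"union": union, "unique": unique}


# Placeholder IDs (hypothetical), consistent with the table: 1, 2, 4, 3 hits.
hits = {
    "baseline": {"bug-1"},
    "codex": {"bug-1", "bug-5"},
    "gemini": {"bug-1", "bug-2", "bug-3", "bug-4"},
    "self-critic": {"bug-1", "bug-2", "bug-3"},
}
report = coverage_report(hits)
print(len(report["union"]), {arm: sorted(u) for arm, u in report["unique"].items()})
```

An arm whose unique set is empty adds no GT coverage beyond the rest of the panel, which is the sense in which the self-critic beat the baseline without decorrelating from arm c.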

## Finding yield — and its precision caveat

Yield (judge-classified unique-actionable findings, 11 docs × 3 trials): baseline 11,
codex 45, gemini **134**, self-critic 18. The cross-model arms produce far more — but yield
is **judge-plausibility, not code-verified truth**: the code-blind judge can confirm a finding
is *specific and plausible*, not that it is *real*. gemini's volume is the most precision-suspect
(it confabulated on the negative control in an earlier run). The negative control here was
clean for all arms (the judge rejected all control findings, 0 false positives) — but that
only catches blatant confabulation. **Raw yield must still be precision-weighted by
human spot-verification** before it ranks arms; this run did not do that.
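
One way to apply that precision weighting is to scale each arm's raw yield by the true-positive rate a human measures on a sample of its findings. The sketch below is an assumption about how that could be done, not something this run did; `precision_weighted_yield` is a hypothetical name and the example numbers are invented.

```python
def precision_weighted_yield(raw_yield: int, sampled: int, verified_true: int) -> float:
    """Scale raw yield by the human-verified true-positive rate of a sample."""
    if sampled <= 0:
        raise ValueError("need at least one human-verified finding")
    if not 0 <= verified_true <= sampled:
        raise ValueError("verified_true must be between 0 and sampled")
    return (raw_yield * verified_true) / sampled


# Hypothetical spot-check, not data from the run: 40 raw findings,
# 20 sampled for human verification, 15 of those confirmed real.
print(precision_weighted_yield(raw_yield=40, sampled=20, verified_true=15))
```

Ranking arms on this weighted figure rather than on raw yield keeps a high-volume, low-precision arm from winning on volume alone.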

## Validity checks

- **Negative control:** did not move (0 decision-changing findings on the control, all arms).
- **Blind judge:** held by construction (the judge saw finding text + ground-truth bug, never
the arm; arms re-attached afterward via `gt-resolve`).
- **Judge-family overlap (disclosed limitation):** with only codex/gemini as non-Claude CLIs,
any non-Claude judge shares a family with one cross-model arm — codex-judge overlaps the
codex arm (b). No fully-disjoint judge is available; mitigated by the blind pool. A future
run should cross-check with a gemini judge (overlaps c instead) and compare.
- **Power:** corpus_n 10 < pre-registered 20 → inconclusive.
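
A blind that holds by construction can also be probed empirically, in the way the isolation lesson in this PR recommends: have the judge guess each finding's arm and treat above-chance accuracy as a confound. A minimal sketch, assuming equal-sized arms (so chance is `1 / n_arms`) and a one-sided binomial test with an `alpha` of 0.05; both the threshold and the name `blind_probe` are illustrative assumptions.

```python
from math import comb


def blind_probe(guesses: list[str], truths: list[str], n_arms: int, alpha: float = 0.05) -> dict:
    """Check whether the judge identifies arms better than chance."""
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and equal length")
    n = len(truths)
    correct = sum(g == t for g, t in zip(guesses, truths))
    chance = 1 / n_arms
    # One-sided binomial tail: P(X >= correct) if the judge were guessing blindly.
    p_value = sum(
        comb(n, i) * chance**i * (1 - chance) ** (n - i)
        for i in range(correct, n + 1)
    )
    return {
        "accuracy": correct / n,
        "chance": chance,
        "p_value": p_value,
        "confounded": p_value < alpha,
    }
```

If `confounded` comes back true, the per-arm metrics should be treated as untrusted until the leak (arm-specific phrasing, formatting, or ids) is removed from the pooled findings.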

## What this concludes (and doesn't)

- **Directionally, the lever looks worth building:** on validated outcomes, under a fair
non-Claude judge across 3 trials, the cross-model arms catch real bugs the Claude arms miss,
and the +context arm caught the most (4/10) — suggesting context, not just model diversity,
carries weight.
- **It is not a confident build/kill.** It is underpowered (N=10 < 20), gemini's high yield is
not code-verified, and the judge shares a family with one arm.
- **A confident verdict needs:** a larger human-confirmed known-failure corpus (≥ the
pre-registered floor), human precision-verification of a finding sample (true-positive rate
per arm, not judge plausibility), and a judge cross-checked across families.

If a build proceeds on the directional signal, the winning shape is **cross-model + fixed
context** (arm c) — the highest GT coverage — with codex as the higher-precision, lower-volume
alternative and gemini's yield gated behind precision verification.