46 commits
5ace511
docs(cross-model-review-eval): add brainstorm and eval plan
jaybna May 24, 2026
759cc57
feat(cross-model-eval): add corpus manifest stub and pre-registration…
jaybna May 24, 2026
1de3c89
feat(cross-model-eval): add runner skeleton, record schema, and carri…
jaybna May 24, 2026
e98bf5a
feat(cross-model-eval): add CLI arms (b, c) with arm-b isolation prob…
jaybna May 24, 2026
4d36a30
feat(cross-model-eval): add in-process arm prompts (U4)
jaybna May 25, 2026
27ffe44
feat(cross-model-eval): add blinded judge rubric, dedup/integrity, an…
jaybna May 25, 2026
251f153
feat(cross-model-eval): add decision-artifact template (U7)
jaybna May 25, 2026
0ca59ed
fix(cross-model-eval): make arm-b runnable and parse numbered finding…
jaybna May 25, 2026
313e8e9
docs(cross-model-eval): capture arm-isolation methodology lesson
jaybna May 25, 2026
b0a5cbc
feat(cross-model-eval): add critique.sh one-liner for quick cross-mod…
jaybna May 25, 2026
c9cd048
docs(cross-model-eval): generalize the run reference (drop internal r…
jaybna May 25, 2026
3ae2683
feat(cross-model-eval): add known-bug corpus builder for the code-rev…
jaybna May 25, 2026
78c4dc3
feat(cross-model-eval): add GT-match scoring for the code-review brea…
jaybna May 25, 2026
4cea60f
feat(cross-model-eval): wire the code-review eval pipeline end-to-end
jaybna May 25, 2026
d73a25d
fix(cross-model-eval): give GT-match findings arm-opaque unique ids t…
jaybna May 25, 2026
3f01a4d
docs(cross-model-eval): record the first code-review run findings
jaybna May 25, 2026
2db9787
feat(cross-model-eval): add corpus quality gate to scan-fixes
jaybna May 25, 2026
a317091
docs(cross-model-eval): record the gated second run + the metric finding
jaybna May 25, 2026
aa5d82f
feat(cross-model-eval): add finding-yield metric alongside GT-match
jaybna May 25, 2026
c5f6700
feat(cross-model-eval): wire gemini CLI as the Gemini-family arm; dep…
jaybna May 25, 2026
c86da36
docs(cross-model-eval): record run 3 (gemini + finding-yield) and the…
jaybna May 25, 2026
8cefdd9
test(cross-model-eval): genericize a code-path test fixture filename
jaybna May 26, 2026
8e80383
fix(cross-model-eval): switch critique.sh from deprecated agy to gemini
jaybna May 26, 2026
1dfba11
feat(cross-model-eval): add panel-critique.sh (fair 6-lens compare) +…
jaybna May 26, 2026
82f4325
docs(cross-model-eval): correct the plan-review finding after the fai…
jaybna May 26, 2026
93e0ef6
fix(cross-model-eval): make finding-count reliable (JSON output) + co…
jaybna May 26, 2026
e1d9ee0
feat(cross-model-eval): add a non-Claude blind judge (the last decisi…
jaybna May 26, 2026
9fa8a7a
docs(cross-model-eval): add the decision-grade run procedure (pre-reg…
jaybna May 26, 2026
1706594
docs(cross-model-eval): decision-grade run record (non-Claude judge, …
jaybna May 26, 2026
dba311d
fix(cross-model-eval): harden the decision spine against invalid/part…
jaybna May 26, 2026
6072ec5
feat(cross-model-eval): seatbelt deny-write floor for the agy arm + P…
jaybna May 28, 2026
e1226ed
docs(ce-deep-review): Phase 0 arm-posture findings, onboarding, plan …
jaybna May 28, 2026
006b709
feat(ce-deep-review-beta): Phase 1 thin slice — panel + consent gate …
jaybna May 28, 2026
aa55833
docs(ce-deep-review): discoverability for the beta thin slice
jaybna May 28, 2026
766c730
fix(ce-deep-review-beta): make consent legible to the auto-mode egres…
jaybna May 29, 2026
09d7f73
docs(ce-deep-review): add v4 plan reconciled to committed reality; ma…
jaybna May 29, 2026
335bf6c
docs(ce-deep-review): reconcile v4 plan to OD-4 resolution (dogfood #2)
jaybna May 29, 2026
cf1b388
feat(ce-deep-review): make agy (Antigravity) the default cross-model …
jaybna May 29, 2026
72ca96b
chore(ce-deep-review): retire the gemini arm from the skill (EOL 2026…
jaybna May 29, 2026
df80731
feat(ce-deep-review): parallelize cross-model dispatch + define --mod…
jaybna May 29, 2026
c3badcf
feat(ce-deep-review): verify cross-model findings with a quote-grep b…
jaybna May 29, 2026
62fa59b
feat(ce-deep-review): reconcile into the verified .deep-review.md sid…
jaybna May 29, 2026
b9cd13a
feat(ce-deep-review): verifier-rate measurement + RU6 validation/cleanup
jaybna May 29, 2026
2cc9c70
fix(ce-deep-review): fold markdown emphasis in verify-findings normalize
jaybna May 29, 2026
becffb9
fix(review): address Codex review feedback on #858
jaybna May 29, 2026
1ae8b3b
fix(ce-deep-review): whitelist verify/reconcile helpers + collision-s…
jaybna May 29, 2026
152 changes: 152 additions & 0 deletions docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md

Large diffs are not rendered by default.

334 changes: 334 additions & 0 deletions docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md

Large diffs are not rendered by default.

@@ -0,0 +1,90 @@
---
title: "Cross-model review eval: isolate all arms to identical context, or context masquerades as model diversity"
date: 2026-05-24
last_updated: 2026-05-24
category: skill-design
module: compound-engineering / cross-model-review-eval
problem_type: design_pattern
component: scripts/eval/cross_model_review
severity: medium
tags:
- eval
- eval-methodology
- cross-model
- blinding
- fairness
- confound
- subagent-dispatch
related_plan: docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
related_brainstorm: docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
---

## Context

The cross-model review eval compares review "arms" on the same document: a Claude
baseline (a), a cross-model CLI with no repo context (b), a cross-model CLI with a fixed
context set (c), and a same-model self-critic (d). The comparison is only meaningful if
every arm receives **identical, controlled context** — the design's whole point is to
isolate *model* and *context* as separate variables.

A live run on two real plans from a separate internal repo violated this and produced a confidently-wrong
conclusion. The cross-model arms (b, c) were correctly isolated — `codex` ran from a clean
CWD with the plan piped via stdin, no repo access. But the in-process Claude arms (a, d)
were dispatched as general-purpose subagents **given a path into the live repo**, so they
explored sibling files and discovered that both plans were already implemented and had
drifted from the plan. That made the Claude arms look impressively decorrelated — they
"found" things the codex arms missed.

It was an artifact, not a result. Re-running with the Claude arms isolated to a
**standalone copy of just the plan in OS temp** (no surrounding repo) plus a hard "read
ONLY this file, do not explore the filesystem or any repo" instruction produced none of
the drift findings. Fairly matched, the models mostly **agreed** on premise-level issues;
the biggest finding-count delta came from **context** (on one plan, codex with context
produced 35 findings versus 10 without), not from model identity. The isolation re-run overturned the
contaminated run's apparent "build cross-model review" conclusion.

## Guidance

When evaluating multiple review configs (models, with/without context, self-critic),
isolate every arm to the same input shape before comparing:

- **CLI arms:** clean CWD + document via stdin only. For `codex`, add
`--skip-git-repo-check` (it refuses to run from a clean dir without it) and do **not**
strip `HOME` (that kills the CLI's auth — isolate via a clean CWD, not by overriding
`HOME`). `agy --print` is the keyless Gemini path; the `gemini` CLI needs `GEMINI_API_KEY`.
- **In-process subagent arms:** pass a **standalone copy of the document in OS temp**, never
a path into the live repo. A subagent handed a repo path will explore siblings and gain
context the other arms lack. Add an explicit "read ONLY this file; do not read, search,
glob, or list any other file; do not inspect any repository" instruction.
- The **arm-b vs arm-c context delta is the experimental control** — nothing else should
differ between them.
- Run the **blind-integrity probe** (have the judge guess each finding's arm); treat
above-chance accuracy as a sign of confounding and the per-arm metric as untrusted.
- **Operational notes from the run:** keep per-arm staging files **outside** the shared run
dir; `pool` globs `*.json` and would otherwise double-count them. Count only canonically named records.
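
The first two bullets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the harness's actual code: `run_cli_arm_isolated`, `stage_standalone_copy`, and the 600-second default timeout are hypothetical names and values, and the reviewer command is passed in by the caller rather than hard-coded.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_cli_arm_isolated(cmd: list[str], document: str, timeout_s: float = 600.0) -> str:
    """Run a reviewer CLI from an empty working directory, document on stdin only."""
    with tempfile.TemporaryDirectory(prefix="arm-cwd-") as clean_cwd:
        # The environment is inherited unchanged (HOME included) so the CLI's
        # auth keeps working; isolation comes from the empty cwd alone.
        result = subprocess.run(
            cmd,
            input=document,
            cwd=clean_cwd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    return result.stdout


def stage_standalone_copy(doc_path: Path) -> Path:
    """Copy one document into a fresh OS-temp dir so a subagent sees no siblings."""
    staging_dir = Path(tempfile.mkdtemp(prefix="arm-doc-"))
    return Path(shutil.copy2(doc_path, staging_dir / doc_path.name))


if __name__ == "__main__":
    # Stand-in "reviewer" that reports what it can see, so no real CLI is needed.
    probe = [sys.executable, "-c",
             "import os, sys; print(os.listdir('.'), len(sys.stdin.read()))"]
    print(run_cli_arm_isolated(probe, "# Plan\n..."))
```

The path returned by `stage_standalone_copy` is what an in-process subagent would be handed, together with the "read ONLY this file" instruction; the staging directory contains nothing else for it to discover.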

## Why This Matters

Unequal context across arms doesn't just add noise — it can invert the conclusion. The
contaminated run made the expensive cross-model lever look clearly justified (Claude found
drift codex missed). The isolated run showed the opposite: the apparent "model diversity"
was mostly a context difference, which a **cheaper same-model-with-context pass could also
deliver**. An eval whose entire purpose is a build/no-build decision must not let a context
confound decide it. This is the eval-first approach working — it caught the confound the
moment it ran on real inputs.

## When to Apply

Any multi-arm evaluation that compares models or review configurations on the same input —
especially when some arms are subprocess CLIs (no ambient context) and others are in-process
subagents (ambient tool/repo access). The asymmetry is the trap.

## Related

- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md` — N≥3 trials and
variance-as-signal; the same harness-discipline family (single trials and unequal context
both produce confidently-wrong, reversed conclusions).
- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — per-finding
(not batched) blinded judging; the blind-integrity rationale.
- `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md` — the harness this lesson
governs (arm isolation, fair b-vs-c context, blind-integrity check are all requirements there).
111 changes: 111 additions & 0 deletions scripts/eval/cross_model_review/README.md
@@ -0,0 +1,111 @@
# Cross-Model Review Evaluation Harness

A repeatable, four-arm evaluation that decides whether — and which — review-improvement
lever is worth building, **before** any cross-model review machinery ships. It is a
decision tool, not a shipped feature.

Origin requirements: `docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md`
Plan: `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md`

## The four arms

| Arm | What it is | Hypothesis it isolates |
|-----|------------|------------------------|
| `a_baseline` | Claude-only review (current `ce-doc-review`) | the control |
| `b_isolated` | cross-model CLI, run isolated from the repo (no workspace context) | does a different model add unique value with no context? |
| `c_fixed_context` | cross-model CLI + a fixed, documented repo-context set | is context-poverty (not the model) the limiting factor? |
| `d_self_critic` | Claude re-reviews in-process, own failure modes supplied, prior output hidden | is the gain the model, or just a fresh adversarial pass? |

Arms `b`/`c` shell out via the `codex` and `agy` CLIs (run by `run_arms.py`).
Arms `a`/`d` and the judge are produced by the **orchestrator** (the agent running the
eval) via in-process subagent dispatch — there is no `claude -p` and arm `d` performs no
document egress.

## How the two halves cooperate (the record-store seam)

Both producers write **schema-conformant record files** into a single shared run
directory (an OS-temp dir created by `run_arms.py`). Neither half writes into the other's
memory:

- `run_arms.py` spawns the CLI arms (`b`, `c`) directly and writes their records.
- The orchestrator dispatches the in-process arms (`a`, `d`) and the judge, then writes
each result as a record file into the same run directory (or via `run_arms.py ingest`).
- Aggregation (`run_arms.py`) pools by reading **all** record files in the run dir,
regardless of which producer wrote them.

The per-arm timeout and circuit breaker apply **only** to the CLI arms `run_arms.py`
spawns. In-process arm records are ingested as-is.

See `record-schema.json` for the canonical record contract both producers must satisfy.
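
As a rough illustration of the seam, the sketch below writes records under a canonical filename and pools only files that match it. The record fields (`doc_id`, `arm`, `trial`, `findings`) and the filename pattern are assumptions made for this example; the real contract is whatever `record-schema.json` defines.

```python
import json
import re
from pathlib import Path

# Hypothetical canonical filename: <doc_id>__<arm>__t<trial>.json
CANONICAL_NAME = re.compile(r"^.+__[a-d]_[a-z_]+__t\d+\.json$")


def write_record(run_dir: Path, record: dict) -> Path:
    """Write one arm's result into the shared run dir under its canonical name."""
    name = f"{record['doc_id']}__{record['arm']}__t{record['trial']}.json"
    path = run_dir / name
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def pool(run_dir: Path) -> list[dict]:
    """Read every canonically named record, regardless of which producer wrote it."""
    records = []
    for path in sorted(run_dir.glob("*.json")):
        if not CANONICAL_NAME.match(path.name):
            continue  # staging or scratch file: skip it rather than double-count
        records.append(json.loads(path.read_text(encoding="utf-8")))
    return records
```

Because both producers go through the same file naming and the same reader, neither needs to know how the other ran its arms.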

## Pre-registration (required before any arm runs)

Edit `tests/fixtures/cross-model-review/corpus-manifest.json` and commit these values
**before** running, so the decision rule is independent of the observed counts (R9):

- `go_threshold` — the per-arm count of confirmed unique decision-changing findings (on
the known-failure subset) that justifies building a lever.
- `minimum_corpus_n` — the smallest corpus the run is allowed to draw a conclusion from.
A run below this N reports **inconclusive**, never "build nothing".
- `trials_per_arm` — number of trials per (document × arm). Floor is **3**; model arms are
non-deterministic and a single trial produces confidently-wrong, reversed conclusions
(see `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md`).
- `arm_c_context_rule` — the fixed, documented context set arm `c` receives, applied
identically to every document. This is the experimental control for the model-vs-context
comparison; it must be defined before running, not curated per-document.

The corpus list itself (the `docs` array) is also filled at this step. The committed file
in this repo is a **schema stub** with placeholder values and one example entry per
subset; a run replaces them with the real corpus and pre-registered values.
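
A pre-flight check along these lines could catch an incomplete manifest before any arm runs. It is a sketch only: the key names come from the list above, but `check_preregistration` is a hypothetical helper, and the value types (integers, a non-empty rule, a list of docs) are assumptions about the manifest layout.

```python
REQUIRED_KEYS = (
    "go_threshold",
    "minimum_corpus_n",
    "trials_per_arm",
    "arm_c_context_rule",
    "docs",
)
MIN_TRIALS_PER_ARM = 3  # the documented floor


def check_preregistration(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks runnable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    if problems:
        return problems
    if manifest["trials_per_arm"] < MIN_TRIALS_PER_ARM:
        problems.append(
            f"trials_per_arm is {manifest['trials_per_arm']}; the floor is {MIN_TRIALS_PER_ARM}"
        )
    if not manifest["arm_c_context_rule"]:
        problems.append("arm_c_context_rule is empty")
    if len(manifest["docs"]) < manifest["minimum_corpus_n"]:
        problems.append("corpus is below minimum_corpus_n; the run can only report inconclusive")
    return problems
```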

## Running

Per (document × trial), produce one schema-conformant record per arm into a shared run
dir, then pool, judge, and aggregate.

**CLI arms (b, c)** — `arms.py` runs the external model and emits a record on stdout (both
run from a clean CWD with auth preserved; arm b has no context, arm c adds the fixed
`--context` set):

```
python3 arms.py run-arm b_isolated codex <doc> <rubric> --doc-id <id> --trial <n> > rec.json
python3 arms.py run-arm c_fixed_context agy <doc> <rubric> --context <ctx> --doc-id <id> --trial <n> > rec.json
python3 run_arms.py ingest <run_dir> rec.json
```
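
To make the (document × trial) matrix concrete, the sketch below only builds the argument lists for the two `run-arm` invocations shown above; it does not execute them. The arm-to-CLI pairing mirrors the example commands, trial numbering from 1 is an assumption, and `build_cli_arm_commands` is a hypothetical helper.

```python
from itertools import product


def build_cli_arm_commands(
    docs: dict[str, str], rubric: str, context: str, trials: int
) -> list[list[str]]:
    """One arm-b and one arm-c command per (document, trial) pair."""
    commands = []
    for (doc_id, doc_path), trial in product(docs.items(), range(1, trials + 1)):
        ids = ["--doc-id", doc_id, "--trial", str(trial)]
        commands.append(
            ["python3", "arms.py", "run-arm", "b_isolated", "codex", doc_path, rubric, *ids]
        )
        commands.append(
            ["python3", "arms.py", "run-arm", "c_fixed_context", "agy", doc_path, rubric,
             "--context", context, *ids]
        )
    return commands
```

With two documents and three trials this yields twelve commands, each of whose stdout would be saved as a record and ingested into the run dir.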

**In-process arms (a, d) and the judge** — produced by the orchestrator via subagent
dispatch (see `prompts/baseline.md`, `prompts/self-critic.md`, `judge_rubric.md`), each
written as a record and ingested into the same run dir.

**Then** pool and decide:

```
python3 run_arms.py pool <run_dir>
python3 run_arms.py aggregate <scored.json> <manifest.json>
```

The decision artifact is written under `docs/` from `decision-artifact-template.md`.

### Quick one-off critique (turnkey)

To get cross-model critiques of a single document without the full eval, use the wrapper —
it runs `codex` and `agy` as isolated reviewers and prints each model's findings:

```
bash scripts/eval/cross_model_review/critique.sh <plan.md> [rubric.md] [context.md]
```

A built-in rubric is used if none is given; pass a `context.md` to switch the arms to the
fixed-context variant. Override the per-arm timeout with `CMRE_TIMEOUT=<seconds>` (agy can
be slow). A missing or unauthenticated CLI is skipped rather than treated as fatal. Each arm that
runs sends the document to its vendor (codex -> OpenAI, agy -> Google).
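
The skip-and-timeout behaviour described above might look roughly like this in Python. It illustrates the behaviour only and is not a port of `critique.sh`: the `critique` function, the 300-second fallback timeout, and the choice to treat a non-zero exit as "skipped" are all assumptions; only the `CMRE_TIMEOUT` variable comes from the text.

```python
import os
import shutil
import subprocess
from typing import Optional


def critique(
    document: str, reviewers: dict[str, list[str]], default_timeout_s: float = 300.0
) -> dict[str, Optional[str]]:
    """Run each reviewer command on the document; None marks a skipped reviewer."""
    timeout_s = float(os.environ.get("CMRE_TIMEOUT", default_timeout_s))
    results: dict[str, Optional[str]] = {}
    for name, cmd in reviewers.items():
        if shutil.which(cmd[0]) is None:
            results[name] = None  # CLI not installed: skip, do not abort the others
            continue
        try:
            proc = subprocess.run(
                cmd, input=document, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            results[name] = None  # slow reviewer: skip after the per-arm timeout
            continue
        # A non-zero exit (for example, an unauthenticated CLI) is also a skip.
        results[name] = proc.stdout if proc.returncode == 0 else None
    return results
```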

## Outcomes

The decision artifact records exactly one of:

- **build `<arm>`** — a lever cleared the pre-registered threshold; the winning arm shapes
the deferred build.
- **build nothing** — corpus met `minimum_corpus_n` and no arm cleared the threshold.
- **inconclusive / underpowered** — corpus below `minimum_corpus_n`, or the blind-integrity
check came back confounded; re-run with a larger corpus or a different judge.
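
The three outcomes can be sketched as a small decision function. This is one possible reading under explicit assumptions, not the harness's aggregation code: "cleared" is taken to mean a count at or above `go_threshold`, the blind-integrity check is modelled as a one-sided binomial test of the judge's arm guesses against chance (1 in 4 for four arms) with a hypothetical `alpha` of 0.05, and the caller passes only the candidate lever arms.

```python
from math import comb


def guess_p_value(correct: int, total: int, n_arms: int = 4) -> float:
    """P(at least `correct` right guesses out of `total`) if the judge guessed at random."""
    p = 1.0 / n_arms
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(correct, total + 1)
    )


def decide(
    unique_findings_by_arm: dict[str, int],
    go_threshold: int,
    corpus_n: int,
    minimum_corpus_n: int,
    blind_p_value: float,
    alpha: float = 0.05,
) -> str:
    """Map pre-registered values and observed counts to exactly one outcome."""
    # Underpowered corpus or a judge that can tell the arms apart: no conclusion.
    if corpus_n < minimum_corpus_n or blind_p_value < alpha:
        return "inconclusive / underpowered"
    cleared = {arm: n for arm, n in unique_findings_by_arm.items() if n >= go_threshold}
    if not cleared:
        return "build nothing"
    return f"build {max(cleared, key=cleared.get)}"
```

Checking the corpus size and blinding before the threshold keeps an underpowered or confounded run from ever reporting "build nothing".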