EveryInc · jaybna · May 24, 2026 · May 24, 2026 · May 24, 2026 · May 24, 2026
diff --git a/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md b/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
diff --git a/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md b/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
diff --git a/docs/solutions/skill-design/cross-model-eval-arm-isolation-2026-05-24.md b/docs/solutions/skill-design/cross-model-eval-arm-isolation-2026-05-24.md
@@ -0,0 +1,90 @@
+---
+title: "Cross-model review eval: isolate all arms to identical context, or context masquerades as model diversity"
+date: 2026-05-24
+last_updated: 2026-05-24
+category: skill-design
+module: compound-engineering / cross-model-review-eval
+problem_type: design_pattern
+component: scripts/eval/cross_model_review
+severity: medium
+tags:
+  - eval
+  - eval-methodology
+  - cross-model
+  - blinding
+  - fairness
+  - confound
+  - subagent-dispatch
+related_plan: docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
+related_brainstorm: docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
+---
+
+## Context
+
+The cross-model review eval compares review "arms" on the same document: a Claude
+baseline (a), a cross-model CLI with no repo context (b), a cross-model CLI with a fixed
+context set (c), and a same-model self-critic (d). The comparison is only meaningful if
+every arm receives **identical, controlled context** — the design's whole point is to
+isolate *model* and *context* as separate variables.
+
+A live run on two real plans from a separate internal repo violated this and produced a confidently-wrong
+conclusion. The cross-model arms (b, c) were correctly isolated — `codex` ran from a clean
+CWD with the plan piped via stdin, no repo access. But the in-process Claude arms (a, d)
+were dispatched as general-purpose subagents **given a path into the live repo**, so they
+explored sibling files and discovered that both plans were already implemented and had
+drifted from the plan. That made the Claude arms look impressively decorrelated — they
+"found" things the codex arms missed.
+
+It was an artifact, not a result. Re-running with the Claude arms isolated to a
+**standalone copy of just the plan in OS temp** (no surrounding repo) plus a hard "read
+ONLY this file, do not explore the filesystem or any repo" instruction produced none of
+the drift findings. Fairly matched, the models mostly **agreed** on premise-level issues;
+the biggest finding-count delta came from **context** (codex +context produced 35 findings
+vs 10 without on one plan), not from model identity. The isolation re-run overturned the
+contaminated run's apparent "build cross-model review" conclusion.
+
+## Guidance
+
+When evaluating multiple review configs (models, with/without context, self-critic),
+isolate every arm to the same input shape before comparing:
+
+- **CLI arms:** clean CWD + document via stdin only. For `codex`, add
+  `--skip-git-repo-check` (it refuses to run from a clean dir without it) and do **not**
+  strip `HOME` (that kills the CLI's auth — isolate via a clean CWD, not by overriding
+  `HOME`). `agy --print` is the keyless Gemini path; the `gemini` CLI needs `GEMINI_API_KEY`.
+- **In-process subagent arms:** pass a **standalone copy of the document in OS temp**, never
+  a path into the live repo. A subagent handed a repo path will explore siblings and gain
+  context the other arms lack. Add an explicit "read ONLY this file; do not read, search,
+  glob, or list any other file; do not inspect any repository" instruction.
+- The **arm-b vs arm-c context delta is the experimental control** — nothing else should
+  differ between them.
+- Run the **blind-integrity probe** (have the judge guess each finding's arm); treat
+  above-chance accuracy as confounded and the per-arm metric as untrusted.
+- **Operational notes from the run:** keep per-arm staging files **outside** the shared run
+  dir, or `pool` (which globs `*.json`) double-counts; only count canonically-named records.
+
+## Why This Matters
+
+Unequal context across arms doesn't just add noise — it can invert the conclusion. The
+contaminated run made the expensive cross-model lever look clearly justified (Claude found
+drift codex missed). The isolated run showed the opposite: the apparent "model diversity"
+was mostly a context difference, which a **cheaper same-model-with-context pass could also
+deliver**. An eval whose entire purpose is a build/no-build decision must not let a context
+confound decide it. This is the eval-first approach working — it caught the confound the
+moment it ran on real inputs.
+
+## When to Apply
+
+Any multi-arm evaluation that compares models or review configurations on the same input —
+especially when some arms are subprocess CLIs (no ambient context) and others are in-process
+subagents (ambient tool/repo access). The asymmetry is the trap.
+
+## Related
+
+- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md` — N≥3 trials and
+  variance-as-signal; the same harness-discipline family (single trials and unequal context
+  both produce confidently-wrong, reversed conclusions).
+- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — per-finding
+  (not batched) blinded judging; the blind-integrity rationale.
+- `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md` — the harness this lesson
+  governs (arm isolation, fair b-vs-c context, blind-integrity check are all requirements there).
diff --git a/scripts/eval/cross_model_review/README.md b/scripts/eval/cross_model_review/README.md
@@ -0,0 +1,111 @@
+# Cross-Model Review Evaluation Harness
+
+A repeatable, four-arm evaluation that decides whether — and which — review-improvement
+lever is worth building, **before** any cross-model review machinery ships. It is a
+decision tool, not a shipped feature.
+
+Origin requirements: `docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md`
+Plan: `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md`
+
+## The four arms
+
+| Arm | What it is | Hypothesis it isolates |
+|-----|------------|------------------------|
+| `a_baseline` | Claude-only review (current `ce-doc-review`) | the control |
+| `b_isolated` | cross-model CLI, run isolated from the repo (no workspace context) | does a different model add unique value with no context? |
+| `c_fixed_context` | cross-model CLI + a fixed, documented repo-context set | is context-poverty (not the model) the limiting factor? |
+| `d_self_critic` | Claude re-reviews in-process, own failure modes supplied, prior output hidden | is the gain the model, or just a fresh adversarial pass? |
+
+Arms `b`/`c` shell out via the `codex` and `agy` CLIs (run by `run_arms.py`).
+Arms `a`/`d` and the judge are produced by the **orchestrator** (the agent running the
+eval) via in-process subagent dispatch — there is no `claude -p` and arm `d` performs no
+document egress.
+
+## How the two halves cooperate (the record-store seam)
+
+Both producers write **schema-conformant record files** into a single shared run
+directory (an OS-temp dir created by `run_arms.py`). Neither half writes into the other's
+memory:
+
+- `run_arms.py` spawns the CLI arms (`b`, `c`) directly and writes their records.
+- The orchestrator dispatches the in-process arms (`a`, `d`) and the judge, then writes
+  each result as a record file into the same run directory (or via `run_arms.py ingest`).
+- Aggregation (`run_arms.py`) pools by reading **all** record files in the run dir,
+  regardless of which producer wrote them.
+
+The per-arm timeout and circuit breaker apply **only** to the CLI arms `run_arms.py`
+spawns. In-process arm records are ingested as-is.
+
+See `record-schema.json` for the canonical record contract both producers must satisfy.
+
+## Pre-registration (required before any arm runs)
+
+Editing `tests/fixtures/cross-model-review/corpus-manifest.json`, commit these values
+**before** running so the decision rule is independent of the observed counts (R9):
+
+- `go_threshold` — the per-arm count of confirmed unique decision-changing findings (on
+  the known-failure subset) that justifies building a lever.
+- `minimum_corpus_n` — the smallest corpus the run is allowed to draw a conclusion from.
+  A run below this N reports **inconclusive**, never "build nothing".
+- `trials_per_arm` — number of trials per (document × arm). Floor is **3**; model arms are
+  non-deterministic and a single trial produces confidently-wrong, reversed conclusions
+  (see `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md`).
+- `arm_c_context_rule` — the fixed, documented context set arm `c` receives, applied
+  identically to every document. This is the experimental control for the model-vs-context
+  comparison; it must be defined before running, not curated per-document.
+
+The corpus list itself (the `docs` array) is also filled at this step. The committed file
+in this repo is a **schema stub** with placeholder values and one example entry per
+subset; a run replaces them with the real corpus and pre-registered values.
+
+## Running
+
+Per (document × trial), produce one schema-conformant record per arm into a shared run
+dir, then pool, judge, and aggregate.
+
+**CLI arms (b, c)** — `arms.py` runs the external model and emits a record on stdout (both
+run from a clean CWD with auth preserved; arm b has no context, arm c adds the fixed
+`--context` set):
+
+```
+python3 arms.py run-arm b_isolated      codex <doc> <rubric>                 --doc-id <id> --trial <n> > rec.json
+python3 arms.py run-arm c_fixed_context agy   <doc> <rubric> --context <ctx> --doc-id <id> --trial <n> > rec.json
+python3 run_arms.py ingest <run_dir> rec.json
+```
+
+**In-process arms (a, d) and the judge** — produced by the orchestrator via subagent
+dispatch (see `prompts/baseline.md`, `prompts/self-critic.md`, `judge_rubric.md`), each
+written as a record and ingested into the same run dir.
+
+**Then** pool and decide:
+
+```
+python3 run_arms.py pool <run_dir>
+python3 run_arms.py aggregate <scored.json> <manifest.json>
+```
+
+The decision artifact is written under `docs/` from `decision-artifact-template.md`.
+
+### Quick one-off critique (turnkey)
+
+To get cross-model critiques of a single document without the full eval, use the wrapper —
+it runs `codex` and `agy` as isolated reviewers and prints each model's findings:
+
+```
+bash scripts/eval/cross_model_review/critique.sh <plan.md> [rubric.md] [context.md]
+```
+
+A built-in rubric is used if none is given; pass a `context.md` to switch the arms to the
+fixed-context variant. Override the per-arm timeout with `CMRE_TIMEOUT=<seconds>` (agy can
+be slow). A missing/unauthenticated CLI is skipped, not fatal. Each run sends the document
+to that vendor (codex -> OpenAI, agy -> Google).
+
+## Outcomes
+
+The decision artifact records exactly one of:
+
+- **build `<arm>`** — a lever cleared the pre-registered threshold; the winning arm shapes
+  the deferred build.
+- **build nothing** — corpus met `minimum_corpus_n` and no arm cleared the threshold.
+- **inconclusive / underpowered** — corpus below `minimum_corpus_n`, or the blind-integrity
+  check came back confounded; re-run larger / with a different judge.