EveryInc · jaybna · May 24, 2026 · May 24, 2026 · May 24, 2026 · May 24, 2026
diff --git a/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md b/docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
diff --git a/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md b/docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
diff --git a/docs/solutions/skill-design/cross-model-eval-arm-isolation-2026-05-24.md b/docs/solutions/skill-design/cross-model-eval-arm-isolation-2026-05-24.md
@@ -0,0 +1,90 @@
+---
+title: "Cross-model review eval: isolate all arms to identical context, or context masquerades as model diversity"
+date: 2026-05-24
+last_updated: 2026-05-24
+category: skill-design
+module: compound-engineering / cross-model-review-eval
+problem_type: design_pattern
+component: scripts/eval/cross_model_review
+severity: medium
+tags:
+  - eval
+  - eval-methodology
+  - cross-model
+  - blinding
+  - fairness
+  - confound
+  - subagent-dispatch
+related_plan: docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md
+related_brainstorm: docs/brainstorms/2026-05-24-multi-model-plan-review-requirements.md
+---
+
+## Context
+
+The cross-model review eval compares review "arms" on the same document: a Claude
+baseline (a), a cross-model CLI with no repo context (b), a cross-model CLI with a fixed
+context set (c), and a same-model self-critic (d). The comparison is only meaningful if
+every arm receives **identical, controlled context** — the design's whole point is to
+isolate *model* and *context* as separate variables.
+
+A live run on two real plans from a separate internal repo violated this and produced a confidently-wrong
+conclusion. The cross-model arms (b, c) were correctly isolated — `codex` ran from a clean
+CWD with the plan piped via stdin, no repo access. But the in-process Claude arms (a, d)
+were dispatched as general-purpose subagents **given a path into the live repo**, so they
+explored sibling files and discovered that both plans were already implemented and had
+drifted from the plan. That made the Claude arms look impressively decorrelated — they
+"found" things the codex arms missed.
+
+It was an artifact, not a result. Re-running with the Claude arms isolated to a
+**standalone copy of just the plan in OS temp** (no surrounding repo) plus a hard "read
+ONLY this file, do not explore the filesystem or any repo" instruction produced none of
+the drift findings. Fairly matched, the models mostly **agreed** on premise-level issues;
+the biggest finding-count delta came from **context** (codex +context produced 35 findings
+vs 10 without on one plan), not from model identity. The isolation re-run overturned the
+contaminated run's apparent "build cross-model review" conclusion.
+
+## Guidance
+
+When evaluating multiple review configs (models, with/without context, self-critic),
+isolate every arm to the same input shape before comparing:
+
+- **CLI arms:** clean CWD + document via stdin only. For `codex`, add
+  `--skip-git-repo-check` (it refuses to run from a clean dir without it) and do **not**
+  strip `HOME` (that kills the CLI's auth — isolate via a clean CWD, not by overriding
+  `HOME`). `agy --print` is the keyless Gemini path; the `gemini` CLI needs `GEMINI_API_KEY`.
+- **In-process subagent arms:** pass a **standalone copy of the document in OS temp**, never
+  a path into the live repo. A subagent handed a repo path will explore siblings and gain
+  context the other arms lack. Add an explicit "read ONLY this file; do not read, search,
+  glob, or list any other file; do not inspect any repository" instruction.
+- The **arm-b vs arm-c context delta is the experimental control** — nothing else should
+  differ between them.
+- Run the **blind-integrity probe** (have the judge guess each finding's arm); treat
+  above-chance accuracy as confounded and the per-arm metric as untrusted.
+- **Operational notes from the run:** keep per-arm staging files **outside** the shared run
+  dir, or `pool` (which globs `*.json`) double-counts; only count canonically-named records.
+
+## Why This Matters
+
+Unequal context across arms doesn't just add noise — it can invert the conclusion. The
+contaminated run made the expensive cross-model lever look clearly justified (Claude found
+drift codex missed). The isolated run showed the opposite: the apparent "model diversity"
+was mostly a context difference, which a **cheaper same-model-with-context pass could also
+deliver**. An eval whose entire purpose is a build/no-build decision must not let a context
+confound decide it. This is the eval-first approach working — it caught the confound the
+moment it ran on real inputs.
+
+## When to Apply
+
+Any multi-arm evaluation that compares models or review configurations on the same input —
+especially when some arms are subprocess CLIs (no ambient context) and others are in-process
+subagents (ambient tool/repo access). The asymmetry is the trap.
+
+## Related
+
+- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md` — N≥3 trials and
+  variance-as-signal; the same harness-discipline family (single trials and unequal context
+  both produce confidently-wrong, reversed conclusions).
+- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — per-finding
+  (not batched) blinded judging; the blind-integrity rationale.
+- `docs/plans/2026-05-24-001-feat-cross-model-review-eval-plan.md` — the harness this lesson
+  governs (arm isolation, fair b-vs-c context, blind-integrity check are all requirements there).
diff --git a/docs/solutions/skill-design/cross-model-eval-decision-grade-2026-05-26.md b/docs/solutions/skill-design/cross-model-eval-decision-grade-2026-05-26.md
@@ -0,0 +1,74 @@
+---
+module: cross-model-review-eval
+tags: [evaluation, code-review, cross-model, decision-grade, judge]
+problem_type: decision-record
+---
+
+# Cross-model critique — decision-grade run (decision record)
+
+The first run with all four decision-grade guards on at once: a **non-Claude blind judge**
+(codex), **3 trials per arm**, a pre-registered decision rule, and the negative-control +
+yield precision checks. Corpus: a private code-review corpus of 10 known-failure culprit
+diffs (each paired with the historical fix that proved the bug mattered) + 1 behavior-
+preserving negative control. Target specifics omitted — public repo.
+
+## Outcome: inconclusive / underpowered (by design)
+
+Pre-registered `minimum_corpus_n = 20`; the confirmed corpus is 10, so the R9 safeguard
+fires and the run reports `inconclusive` rather than a confident build/kill. This is the
+guard working, not a failure — a 10-item corpus can't carry a confident verdict, and the
+pre-registration prevents an underpowered run from masquerading as one.
+
+## Primary signal — GT-match (validated), best-of-3-trials per doc
+
+| Arm | GT hits /10 | Caught a bug NO other arm caught |
+|-----|-------------|----------------------------------|
+| baseline (Claude) | 1 | — |
+| cross-model, isolated (codex) | 2 | yes (1) |
+| cross-model, +context (gemini) | 4 | yes (1) |
+| self-critic (Claude) | 3 | — |
+
+Union across all arms: **5/10** known bugs caught by someone. The decisive result: **each
+cross-model arm uniquely surfaced a validated bug that neither Claude arm caught** — grounded
+in the actual historical fix, not plausibility. On this corpus, under a fair non-Claude
+judge, the cross-model lever **decorrelates and adds GT coverage the Claude panel misses.**
+The self-critic (Claude, fresh adversarial pass) also beat the baseline (3 vs 1), but caught
+nothing the cross-model+context arm didn't.
+
+## Finding yield — and its precision caveat
+
+Yield (judge-classified unique-actionable findings, 11 docs × 3 trials): baseline 11,
+codex 45, gemini **134**, self-critic 18. The cross-model arms produce far more — but yield
+is **judge-plausibility, not code-verified truth**: the code-blind judge can confirm a finding
+is *specific and plausible*, not that it is *real*. gemini's volume is the precision-suspect
+one (it confabulated on the negative control in an earlier run). The negative control here was
+clean for all arms (the judge rejected all control findings, 0 false positives) — but that
+only catches blatant confabulation. **Raw yield must still be precision-weighted by
+human spot-verification** before it ranks arms; this run did not do that.
+
+## Validity checks
+
+- **Negative control:** did not move (0 decision-changing findings on the control, all arms).
+- **Blind judge:** held by construction (the judge saw finding text + ground-truth bug, never
+  the arm; arms re-attached afterward via `gt-resolve`).
+- **Judge-family overlap (disclosed limitation):** with only codex/gemini as non-Claude CLIs,
+  any non-Claude judge shares a family with one cross-model arm — codex-judge overlaps the
+  codex arm (b). No fully-disjoint judge is available; mitigated by the blind pool. A future
+  run should cross-check with a gemini judge (overlaps c instead) and compare.
+- **Power:** corpus_n 10 < pre-registered 20 → inconclusive.
+
+## What this concludes (and doesn't)
+
+- **Directionally, the lever looks worth building:** on validated outcomes, under a fair
+  non-Claude judge across 3 trials, the cross-model arms catch real bugs the Claude arms miss,
+  and the +context arm caught the most (4/10) — suggesting context, not just model diversity,
+  carries weight.
+- **It is not a confident build/kill.** It is underpowered (N=10 < 20), gemini's high yield is
+  not code-verified, and the judge shares a family with one arm.
+- **A confident verdict needs:** a larger human-confirmed known-failure corpus (≥ the
+  pre-registered floor), human precision-verification of a finding sample (true-positive rate
+  per arm, not judge plausibility), and a judge cross-checked across families.
+
+If a build proceeds on the directional signal, the winning shape is **cross-model + fixed
+context** (arm c) — the highest GT coverage — with codex as the higher-precision, lower-volume
+alternative and gemini's yield gated behind precision verification.