Add cluster-bootstrap 95% CIs for the paper's AUC tables by dangng2004 · Pull Request #98 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-06-05T21:53:51Z

Summary

This PR modifies benchmark code; the paper (https://arxiv.org/abs/2606.19749) gives useful context. It adds 95% confidence intervals to the conference study's pairwise-accuracy (AUC) tables, computed with a cluster bootstrap over papers.

Method

The resampling unit is the paper, not the comment, because comments within a paper are correlated. Each of the 5000 draws resamples papers with replacement, stratified by (proxy, side), and recomputes the pooled statistic; the CI is the 2.5th and 97.5th percentiles of the draws. Point estimates match the committed compute_auc.py, so the CIs sit around numbers that are already trusted.
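The resampling scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ci_auc.py: the names `pair_hits` and `cluster_bootstrap_ci` and the `strata` layout are hypothetical, and each paper is reduced to a single comment count so that resampling array entries is the same as resampling papers.

```python
import numpy as np


def pair_hits(high, low):
    """Count (low, high) paper pairs where the low-quality paper has more
    comments. Ties count as half a hit. Returns (hits, total_pairs)."""
    diff = low[:, None] - high[None, :]
    hits = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(hits), int(diff.size)


def cluster_bootstrap_ci(strata, n_draws=5000, seed=0):
    """Percentile CI for the pooled pairwise accuracy.

    strata (hypothetical layout): proxy_id -> {"high": array, "low": array},
    one comment count per paper. Papers are resampled with replacement
    separately within each (proxy, side) stratum.
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        hits_sum, total_sum = 0.0, 0
        for sides in strata.values():
            high, low = sides["high"], sides["low"]
            if high.size == 0 or low.size == 0:
                continue  # single-sided proxy contributes no pairs
            h = rng.choice(high, size=high.size, replace=True)
            l = rng.choice(low, size=low.size, replace=True)
            hits, total = pair_hits(h, l)
            hits_sum += hits
            total_sum += total
        if total_sum:
            draws.append(hits_sum / total_sum)
    if not draws:  # nothing to take percentiles of
        return (float("nan"), float("nan"))
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return (float(lo), float(hi))


# Toy data, invented for illustration only.
toy = {
    "proxy_a": {"high": np.array([3, 5, 2, 4]), "low": np.array([6, 4, 7, 5])},
    "proxy_b": {"high": np.array([1, 2, 2]), "low": np.array([2, 3, 1])},
}
print(cluster_bootstrap_ci(toy))
```

Stratifying keeps the number of high-quality and low-quality papers per proxy fixed in every draw, so a draw can never lose a side that the original data had.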

Functions → paper tables

Each table kind produces exactly one paper format:

| `kind` | function | paper table | what it shows |
| --- | --- | --- | --- |
| `comment_volume` | `run_comment_volume` | Table 1 | per-model mean comments (high/low), delta, and overall accuracy |
| `by_proxy` | `run_by_proxy` | Table 2 | accuracy broken down by quality proxy, best model per system |
| `by_severity` | `run_by_severity` | Table 9 | accuracy broken down by severity tier, per model |

The full-cohort appendix tables are the same three kinds on the full 240-paper set.

Representative table outputs

These tables are examples of the formats supported by the script (numbers reproduce the paper):

Comment volume and overall accuracy, per model (Table 1)

Accuracy by quality proxy, best model per system (Table 2)

Accuracy by severity tier, per model (Table 9)

Tests

The config, result JSONs, and manifests are large and gitignored, so the runner needs local data and isn't executable in a fresh clone. Correctness is covered without data by tests/test_ci_auc.py, which exercises auc_from, cell_summary, and the bootstrap on tiny in-memory count tables (e.g. two low-quality papers with more comments than two high-quality ones give AUC 1.0).

$ pytest tests/test_ci_auc.py -v

tests/test_ci_auc.py::test_auc_from_perfect_separation PASSED
tests/test_ci_auc.py::test_auc_from_reversed PASSED
tests/test_ci_auc.py::test_auc_from_ties_get_half_credit PASSED
tests/test_ci_auc.py::test_auc_from_empty_side_is_nan PASSED
tests/test_ci_auc.py::test_cell_summary_point_estimates PASSED
tests/test_ci_auc.py::test_bootstrap_tiers_ci_shape PASSED
tests/test_ci_auc.py::test_bootstrap_overall_and_per_proxy PASSED
tests/test_ci_auc.py::test_dispatch_covers_the_three_kinds PASSED

8 passed
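For readers without the repository at hand, the behaviours named by the first four tests can be illustrated with a small stand-in. The function `pairwise_auc` below is hypothetical; the real `auc_from` may take different arguments, and only the properties stated in this PR (perfect separation gives 1.0, ties get half credit, an empty side gives nan) are taken from the source.

```python
import math

import numpy as np


def pairwise_auc(high_counts, low_counts):
    """Share of (low, high) paper pairs in which the low-quality paper
    received more comments; ties count half; nan if either side is empty."""
    high = np.asarray(high_counts, dtype=float)
    low = np.asarray(low_counts, dtype=float)
    if high.size == 0 or low.size == 0:
        return float("nan")
    diff = low[:, None] - high[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Two low-quality papers with more comments than two high-quality ones.
assert pairwise_auc([1, 2], [5, 6]) == 1.0
# Reversed ordering.
assert pairwise_auc([5, 6], [1, 2]) == 0.0
# A tied pair gets half credit.
assert pairwise_auc([3], [3]) == 0.5
# An empty side yields nan rather than raising.
assert math.isnan(pairwise_auc([], [4, 5]))
```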

Files

ci_auc.py — new, config-driven AUC + CI runner (three kinds: comment_volume / by_proxy / by_severity)
compute_auc.py — now exports the shared auc_from; point-estimate logic unchanged
tests/test_ci_auc.py — unit tests on in-memory data

Config-driven runner (ci_auc.py) for the conference-study pairwise-accuracy (AUC) tables with 95% bootstrap CIs over papers. One function per paper format: comment_volume (Table 1), by_proxy (Table 2), by_severity (Table 9). The shared AUC primitive is consolidated into compute_auc.py.

The config and the result JSONs/manifests are kept locally (gitignored, large). Correctness is covered by tests/test_ci_auc.py, which exercises auc_from, cell_summary, and the bootstrap on tiny in-memory count tables. Recall CIs moved to a separate PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

joe32140

minor issue.

joe32140 · 2026-06-25T02:27:08Z

+            per_lists[proxy_id].append(a)
+            H += hits; T += tot
+        overalls.append(H / T if T else np.nan)
+    ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5]))


ci_auc.py:81 (bootstrap) — per-proxy CI crashes when a proxy is single-sided.

The draw loop guards if high.size == 0 or low.size == 0: continue, so for a proxy that has papers on only one side, per_lists[proxy_id] is never appended and stays []. The return then computes ci(per_lists[proxy_id]) → np.percentile([], [2.5, 97.5]), which raises IndexError: index -1 is out of bounds for axis 0 with size 0 (numpy 2.x), aborting the whole by_proxy table.

This is the same partial-data case the point-estimate paths already defend against — cell_summary skips with if not high_total or not low_total: continue, and auc_from returns nan for an empty side — so a (method, model) cell where one side of a proxy is empty (e.g. missing/failed result files for all high-quality papers of that proxy) is anticipated elsewhere but will crash here.

Suggested fix — mirror the point-estimate guard:

ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5])) if xs else (float("nan"), float("nan"))

Good catch, fixed in af3ffad. Guarded the empty case so an empty per_lists[proxy_id] returns (nan, nan) instead of crashing np.percentile, matching the nan the point-estimate paths already return for a single-sided proxy. Added a regression test (test_bootstrap_single_sided_proxy_no_crash).

@joe32140

A proxy with papers on only one side never gets a draw appended, so ci(per_lists[proxy_id]) hit np.percentile([], ...) and crashed the whole by_proxy table. Return (nan, nan) instead, matching the nan the point-estimate paths already produce there. Adds a regression test.

Reported by @joe32140 in review of #98.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
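The guard discussed in this thread can be written as a named helper instead of a lambda. The sketch below is illustrative only: `percentile_ci` and the toy `per_lists` contents are assumptions, not the code committed in af3ffad.

```python
import math

import numpy as np


def percentile_ci(draws, lo=2.5, hi=97.5):
    """Return the (lo, hi) percentiles of the bootstrap draws, or
    (nan, nan) when no draw was recorded (e.g. a single-sided proxy)."""
    if len(draws) == 0:
        return (float("nan"), float("nan"))
    low, high = np.percentile(draws, [lo, hi])
    return (float(low), float(high))


# Toy per-proxy draw lists: the second proxy never had a draw appended.
per_lists = {"proxy_a": [0.6, 0.7, 0.8], "proxy_single_sided": []}
cis = {proxy_id: percentile_ci(xs) for proxy_id, xs in per_lists.items()}

# The empty proxy reports nan bounds and the table is still produced.
assert all(math.isnan(v) for v in cis["proxy_single_sided"])
print(cis)
```

Checking `len(draws) == 0` rather than truthiness keeps the helper usable if the draws are later stored in a NumPy array, where `if xs` on a multi-element array raises a ValueError.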

dangng2004 marked this pull request as draft June 5, 2026 22:27

dangng2004 marked this pull request as ready for review June 6, 2026 02:31

dangng2004 marked this pull request as draft June 24, 2026 02:33

dangng2004 force-pushed the feat/bootstrap-cis branch from 434f99f to b41f749 Compare June 24, 2026 22:40

dangng2004 mentioned this pull request Jun 24, 2026

Add cluster-bootstrap 95% CIs for the perturbation recall tables #103

Open

dangng2004 changed the title ~~Add cluster-bootstrap 95% CIs for the AUC and recall tables~~ Add cluster-bootstrap 95% CIs for the conference AUC tables Jun 24, 2026

dangng2004 force-pushed the feat/bootstrap-cis branch 3 times, most recently from d530555 to 6fd980d Compare June 24, 2026 23:09

dangng2004 force-pushed the feat/bootstrap-cis branch from 6fd980d to a8e3fc7 Compare June 24, 2026 23:13

dangng2004 marked this pull request as ready for review June 24, 2026 23:30

dangng2004 requested a review from joe32140 June 24, 2026 23:30

dangng2004 changed the title ~~Add cluster-bootstrap 95% CIs for the conference AUC tables~~ Add cluster-bootstrap 95% CIs for the paper's AUC tables Jun 24, 2026

joe32140 approved these changes Jun 25, 2026

dangng2004 merged commit 6dcdef1 into main Jun 26, 2026

dangng2004 merged 2 commits into main from feat/bootstrap-cis
