Add cluster-bootstrap 95% CIs for the paper's AUC tables #98
Config-driven runner (ci_auc.py) for the conference-study pairwise-accuracy (AUC) tables with 95% bootstrap CIs over papers. One function per paper format: comment_volume (Table 1), by_proxy (Table 2), by_severity (Table 9). The shared AUC primitive is consolidated into compute_auc.py. The config and the result JSONs/manifests are kept locally (gitignored, large). Correctness is covered by tests/test_ci_auc.py, which exercises auc_from, cell_summary, and the bootstrap on tiny in-memory count tables. Recall CIs moved to a separate PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
per_lists[proxy_id].append(a)
H += hits; T += tot
overalls.append(H / T if T else np.nan)
ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5]))
ci_auc.py:81 (bootstrap) — per-proxy CI crashes when a proxy is single-sided.
The draw loop guards if high.size == 0 or low.size == 0: continue, so for a proxy that has papers on only one side, per_lists[proxy_id] is never appended and stays []. The return then computes ci(per_lists[proxy_id]) → np.percentile([], [2.5, 97.5]), which raises IndexError: index -1 is out of bounds for axis 0 with size 0 (numpy 2.x), aborting the whole by_proxy table.
This is the same partial-data case the point-estimate paths already defend against — cell_summary skips with if not high_total or not low_total: continue, and auc_from returns nan for an empty side — so a (method, model) cell where one side of a proxy is empty (e.g. missing/failed result files for all high-quality papers of that proxy) is anticipated elsewhere but will crash here.
Suggested fix — mirror the point-estimate guard:
ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5])) if xs else (float("nan"), float("nan"))
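For comparison, a slightly more explicit form of the same guard could look like the sketch below. The name percentile_ci is hypothetical and not taken from ci_auc.py; it checks len() instead of truthiness, on the assumption that the draws might someday be passed as a NumPy array, where `if xs` would be ambiguous.

```python
import math
import numpy as np

def percentile_ci(draws, lo=2.5, hi=97.5):
    """Percentile CI over bootstrap draws; (nan, nan) when there are no draws.

    Hypothetical helper for illustration, not the function in ci_auc.py.
    """
    if len(draws) == 0:
        # Single-sided proxy: no draw was ever appended, so there is no CI.
        return (math.nan, math.nan)
    lo_v, hi_v = np.percentile(draws, [lo, hi])
    return (float(lo_v), float(hi_v))

empty_ci = percentile_ci([])            # the case that used to crash
flat_ci = percentile_ci([0.5, 0.5, 0.5])  # degenerate draws, all identical
```

With this shape, an empty per-proxy list yields a pair of nan values, which lines up with the nan the point-estimate paths return for the same cell.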
Good catch, fixed in af3ffad. Guarded the empty case so an empty per_lists[proxy_id] returns (nan, nan) instead of crashing np.percentile, matching the nan the point-estimate paths already return for a single-sided proxy. Added a regression test (test_bootstrap_single_sided_proxy_no_crash).
A proxy with papers on only one side never gets a draw appended, so ci(per_lists[proxy_id]) hit np.percentile([], ...) and crashed the whole by_proxy table. Return (nan, nan) instead, matching the nan the point-estimate paths already produce there. Adds a regression test. Reported by @joe32140 in review of #98. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
This PR modifies benchmark code. It might be helpful to get some context from the paper https://arxiv.org/abs/2606.19749. It adds 95% confidence intervals around the conference study's pairwise-accuracy (AUC) tables, via a cluster bootstrap over papers.
Method
The resampling unit is the paper, not the comment, because comments within a paper are correlated. Each of 5000 draws resamples papers with replacement, stratified by (proxy, side), and recomputes the pooled statistic; the CI is the 2.5th and 97.5th percentiles of the draws. Point estimates match the committed compute_auc.py, so the CIs sit around numbers that are already trusted.
Functions → paper tables
Each table kind produces exactly one paper format:
| kind | function | paper table |
| --- | --- | --- |
| comment_volume | run_comment_volume | Table 1 |
| by_proxy | run_by_proxy | Table 2 |
| by_severity | run_by_severity | Table 9 |
The full-cohort appendix tables are the same three kinds on the full 240-paper set.
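The resampling scheme described under Method can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ci_auc.py: the names pair_hits and cluster_bootstrap_ci are hypothetical, each paper is reduced to a single comment count, a pair counts as a hit when the low-quality paper has more comments, and ties count as half a hit.

```python
import numpy as np

def pair_hits(high, low):
    """Count (low, high) paper pairs where the low-quality paper has more comments.

    Ties count as half a hit; that tie rule is an assumption of this sketch.
    """
    diff = np.asarray(low, dtype=float)[:, None] - np.asarray(high, dtype=float)[None, :]
    hits = float((diff > 0).sum()) + 0.5 * float((diff == 0).sum())
    return hits, diff.size

def cluster_bootstrap_ci(strata, n_draws=5000, seed=0):
    """strata maps proxy_id -> (high_counts, low_counts), one comment count per paper."""
    rng = np.random.default_rng(seed)
    overalls = []
    for _ in range(n_draws):
        H, T = 0.0, 0
        for high, low in strata.values():
            high, low = np.asarray(high), np.asarray(low)
            if high.size == 0 or low.size == 0:
                continue  # single-sided proxy contributes no pairs
            # Resample papers with replacement within each (proxy, side) stratum.
            h = rng.choice(high, size=high.size, replace=True)
            l = rng.choice(low, size=low.size, replace=True)
            hits, tot = pair_hits(h, l)
            H += hits
            T += tot
        if T:
            overalls.append(H / T)  # pooled statistic for this draw
    if not overalls:
        return (float("nan"), float("nan"))
    lo, hi = np.percentile(overalls, [2.5, 97.5])
    return (float(lo), float(hi))

# Toy data: every low-quality paper out-comments every high-quality one.
toy = {"proxy_a": ([1, 2], [5, 6]), "proxy_b": ([0, 3], [4, 4])}
ci_separated = cluster_bootstrap_ci(toy, n_draws=200)
ci_empty = cluster_bootstrap_ci({"proxy_c": ([], [1, 2])}, n_draws=10)
```

Resampling whole papers within each stratum keeps the per-side sample sizes fixed across draws, so the variation in the pooled statistic reflects which papers were drawn rather than how many.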
Representative table outputs
These tables are examples of the formats supported by the script (numbers reproduce the paper):
Comment volume and overall accuracy, per model (Table 1)
Accuracy by quality proxy, best model per system (Table 2)
Accuracy by severity tier, per model (Table 9)
Tests
The config, result JSONs, and manifests are large and gitignored, so the runner needs local data and isn't executable in a fresh clone. Correctness is covered without data by tests/test_ci_auc.py, which exercises auc_from, cell_summary, and the bootstrap on tiny in-memory count tables (e.g. two low-quality papers with more comments than two high-quality ones gives AUC 1.0).
Files
- ci_auc.py: new, config-driven AUC + CI runner (three kinds: comment_volume / by_proxy / by_severity)
- compute_auc.py: now exports the shared auc_from; point-estimate logic unchanged
- tests/test_ci_auc.py: unit tests on in-memory data
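For readers without the local data, the in-memory test style described under Tests might look something like the sketch below. pairwise_auc is a hypothetical stand-in written for this example; it is not the real auc_from, whose signature is not shown in this PR description, and the half-credit tie rule is an assumption.

```python
import math

def pairwise_auc(high_counts, low_counts):
    """Stand-in for the AUC primitive; NOT the real auc_from signature."""
    pairs = [(h, l) for h in high_counts for l in low_counts]
    if not pairs:
        return math.nan  # empty side: no pairs to score
    # A pair is a hit when the low-quality paper has more comments (tie = 0.5).
    score = sum(1.0 if l > h else 0.5 if l == h else 0.0 for h, l in pairs)
    return score / len(pairs)

def test_low_quality_more_comments_gives_auc_one():
    # Two low-quality papers with more comments than two high-quality ones.
    assert pairwise_auc([1, 2], [5, 6]) == 1.0

def test_empty_side_gives_nan():
    assert math.isnan(pairwise_auc([], [5, 6]))
```

Tests of this shape need no config or result files, which is why they can run in a fresh clone.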