Add cluster-bootstrap 95% CIs for the paper's AUC tables#98

Merged
dangng2004 merged 2 commits into main from feat/bootstrap-cis
Jun 26, 2026
@dangng2004 dangng2004 commented Jun 5, 2026

Summary

This PR modifies benchmark code; the paper (https://arxiv.org/abs/2606.19749) gives useful context. It adds 95% confidence intervals to the conference study's pairwise-accuracy (AUC) tables via a cluster bootstrap over papers.

Method

The resampling unit is the paper, not the comment, because comments within a paper are correlated. Each of the 5000 bootstrap draws resamples papers with replacement, stratified by (proxy, side), and recomputes the pooled statistic; the CI is the 2.5th and 97.5th percentiles of those draws. Point estimates match the committed compute_auc.py, so the CIs sit around numbers that are already trusted.
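A minimal sketch of the resampling described above, for a single proxy. It is not the code in ci_auc.py: the names `pairwise_auc` and `cluster_bootstrap_ci` are hypothetical, each paper is assumed to be reduced to one comment count, and the AUC direction (a low-quality paper drawing more comments counts as a hit) is inferred from the test example in this PR.

```python
import numpy as np


def pairwise_auc(high, low):
    """Share of (low, high) paper pairs where the low-quality paper has more
    comments; ties count half. Returns nan if either side is empty."""
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    if high.size == 0 or low.size == 0:
        return float("nan")
    diff = low[:, None] - high[None, :]  # every low paper against every high paper
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def cluster_bootstrap_ci(high, low, n_draws=5000, seed=0):
    """Point estimate plus a percentile CI, resampling whole papers.

    `high` and `low` hold one comment count per paper, so drawing entries
    with replacement is drawing papers. Each side is resampled separately,
    which is the stratification by side for this one proxy.
    """
    rng = np.random.default_rng(seed)
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    draws = np.empty(n_draws)
    for i in range(n_draws):
        high_draw = rng.choice(high, size=high.size, replace=True)
        low_draw = rng.choice(low, size=low.size, replace=True)
        draws[i] = pairwise_auc(high_draw, low_draw)
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return pairwise_auc(high, low), (float(lo), float(hi))


if __name__ == "__main__":
    point, (lo, hi) = cluster_bootstrap_ci([1, 2, 3, 4], [3, 5, 6, 7])
    print(f"AUC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With several proxies, the same per-side resampling would be repeated inside each proxy before pooling hits and totals across them.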

Functions → paper tables

Each table kind produces exactly one paper format:

| kind | function | paper table | what it shows |
| --- | --- | --- | --- |
| comment_volume | run_comment_volume | Table 1 | per-model mean comments (high/low), delta, and overall accuracy |
| by_proxy | run_by_proxy | Table 2 | accuracy broken down by quality proxy, best model per system |
| by_severity | run_by_severity | Table 9 | accuracy broken down by severity tier, per model |

The full-cohort appendix tables are the same three kinds on the full 240-paper set.

Representative table outputs

These tables are examples of the formats supported by the script (numbers reproduce the paper):

Comment volume and overall accuracy, per model (Table 1)

image

Accuracy by quality proxy, best model per system (Table 2)

image

Accuracy by severity tier, per model (Table 9)

image

Tests

The config, result JSONs, and manifests are large and gitignored, so the runner needs local data and cannot be run in a fresh clone. Correctness is covered without that data by tests/test_ci_auc.py, which exercises auc_from, cell_summary, and the bootstrap on tiny in-memory count tables (e.g. two low-quality papers with more comments than two high-quality ones give AUC 1.0).

$ pytest tests/test_ci_auc.py -v

tests/test_ci_auc.py::test_auc_from_perfect_separation PASSED
tests/test_ci_auc.py::test_auc_from_reversed PASSED
tests/test_ci_auc.py::test_auc_from_ties_get_half_credit PASSED
tests/test_ci_auc.py::test_auc_from_empty_side_is_nan PASSED
tests/test_ci_auc.py::test_cell_summary_point_estimates PASSED
tests/test_ci_auc.py::test_bootstrap_tiers_ci_shape PASSED
tests/test_ci_auc.py::test_bootstrap_overall_and_per_proxy PASSED
tests/test_ci_auc.py::test_dispatch_covers_the_three_kinds PASSED

8 passed
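For readers without the repository, the four auc_from cases in the list above might look roughly like this. `auc_sketch` is a hypothetical stand-in, not the shared auc_from in compute_auc.py, and its argument layout (one comment count per paper on each side) is an assumption.

```python
import math


def auc_sketch(high, low):
    """Stand-in for the AUC primitive: share of (low, high) pairs where the
    low-quality paper has more comments, with ties counted as half."""
    if not high or not low:
        return float("nan")  # one side empty: no pairs to compare
    wins = sum((l > h) + 0.5 * (l == h) for l in low for h in high)
    return wins / (len(low) * len(high))


def test_perfect_separation():
    # Every low-quality paper has more comments than every high-quality one.
    assert auc_sketch(high=[1, 2], low=[5, 6]) == 1.0


def test_reversed():
    assert auc_sketch(high=[5, 6], low=[1, 2]) == 0.0


def test_ties_get_half_credit():
    assert auc_sketch(high=[3], low=[3]) == 0.5


def test_empty_side_is_nan():
    assert math.isnan(auc_sketch(high=[], low=[4]))
```

Because the inputs are plain lists, tests of this shape run under pytest without any of the gitignored data.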

Files

  • ci_auc.py — new, config-driven AUC + CI runner (three kinds: comment_volume / by_proxy / by_severity)
  • compute_auc.py — now exports the shared auc_from; point-estimate logic unchanged
  • tests/test_ci_auc.py — unit tests on in-memory data

@dangng2004 dangng2004 marked this pull request as draft June 5, 2026 22:27
@dangng2004 dangng2004 marked this pull request as ready for review June 6, 2026 02:31
@dangng2004 dangng2004 marked this pull request as draft June 24, 2026 02:33
@dangng2004 dangng2004 force-pushed the feat/bootstrap-cis branch from 434f99f to b41f749 June 24, 2026 22:40
@dangng2004 dangng2004 changed the title from "Add cluster-bootstrap 95% CIs for the AUC and recall tables" to "Add cluster-bootstrap 95% CIs for the conference AUC tables" Jun 24, 2026
@dangng2004 dangng2004 force-pushed the feat/bootstrap-cis branch 3 times, most recently from d530555 to 6fd980d June 24, 2026 23:09
Config-driven runner (ci_auc.py) for the conference-study pairwise-accuracy
(AUC) tables with 95% bootstrap CIs over papers. One function per paper format:
comment_volume (Table 1), by_proxy (Table 2), by_severity (Table 9). The shared
AUC primitive is consolidated into compute_auc.py.

The config and the result JSONs/manifests are kept locally (gitignored, large).
Correctness is covered by tests/test_ci_auc.py, which exercises auc_from,
cell_summary, and the bootstrap on tiny in-memory count tables.

Recall CIs moved to a separate PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dangng2004 dangng2004 force-pushed the feat/bootstrap-cis branch from 6fd980d to a8e3fc7 June 24, 2026 23:13
@dangng2004 dangng2004 marked this pull request as ready for review June 24, 2026 23:30
@dangng2004 dangng2004 requested a review from joe32140 June 24, 2026 23:30
@dangng2004 dangng2004 changed the title from "Add cluster-bootstrap 95% CIs for the conference AUC tables" to "Add cluster-bootstrap 95% CIs for the paper's AUC tables" Jun 24, 2026

@joe32140 joe32140 left a comment

minor issue.

per_lists[proxy_id].append(a)
H += hits; T += tot
overalls.append(H / T if T else np.nan)
ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5]))

ci_auc.py:81 (bootstrap) — per-proxy CI crashes when a proxy is single-sided.

The draw loop guards if high.size == 0 or low.size == 0: continue, so for a proxy that has papers on only one side, per_lists[proxy_id] is never appended and stays []. The return then computes ci(per_lists[proxy_id]) → np.percentile([], [2.5, 97.5]), which raises IndexError: index -1 is out of bounds for axis 0 with size 0 (numpy 2.x), aborting the whole by_proxy table.

This is the same partial-data case the point-estimate paths already defend against — cell_summary skips with if not high_total or not low_total: continue, and auc_from returns nan for an empty side — so a (method, model) cell where one side of a proxy is empty (e.g. missing/failed result files for all high-quality papers of that proxy) is anticipated elsewhere but will crash here.

Suggested fix — mirror the point-estimate guard:

ci = lambda xs: tuple(np.percentile(xs, [2.5, 97.5])) if xs else (float("nan"), float("nan"))
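The same guard could also be written as a small named function, sketched below. The name `percentile_ci` is hypothetical. Checking `len(xs) == 0` instead of the truthiness of `xs` keeps it valid if the draws are ever held in a NumPy array, where `if xs` raises for arrays with more than one element.

```python
import numpy as np


def percentile_ci(xs):
    """2.5/97.5 percentile interval over bootstrap draws.

    Returns (nan, nan) when no draws were collected, e.g. for a proxy with
    papers on only one side, instead of handing an empty input to
    np.percentile.
    """
    if len(xs) == 0:
        return (float("nan"), float("nan"))
    lo, hi = np.percentile(xs, [2.5, 97.5])
    return (float(lo), float(hi))


if __name__ == "__main__":
    print(percentile_ci([]))                 # empty case: both bounds are nan
    print(percentile_ci(np.full(10, 0.5)))   # constant draws: both bounds 0.5
```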

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed in af3ffad. Guarded the empty case so an empty per_lists[proxy_id] returns (nan, nan) instead of crashing np.percentile, matching the nan the point-estimate paths already return for a single-sided proxy. Added a regression test (test_bootstrap_single_sided_proxy_no_crash).

A proxy with papers on only one side never gets a draw appended, so
ci(per_lists[proxy_id]) hit np.percentile([], ...) and crashed the whole
by_proxy table. Return (nan, nan) instead, matching the nan the point-estimate
paths already produce there. Adds a regression test.

Reported by @joe32140 in review of #98.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dangng2004 dangng2004 merged commit 6dcdef1 into main Jun 26, 2026