Overlap (venn) analysis of comments from different models/systems used in paper by dangng2004 · Pull Request #91 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-05-21T20:54:22Z

Summary

Refactors the review_analysis venn/cluster plumbing into a shared helper module, then adds two new comparison axes on top of it, and regenerates all paper Venn figures in a consistent main-paper style.

Refactor

utils.py — shared load / para_set / regions_{2,3} / draw_venn{2,3} / save_fig; plots now written to plots/ in both PNG and PDF
analysis.py, analysis_gpt_claude.py — refactored to use utils; old top-level venn PNGs deleted (regenerated under plots/)
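
For readers unfamiliar with the region bookkeeping behind the Venn figures, here is a minimal sketch of what helpers like regions_2 / regions_3 might compute. The function names follow the list above, but the signatures, the dictionary keys, and the jaccard helper are assumptions made for illustration and are not taken from utils.py.

```python
def regions_2(a, b):
    """Count the three disjoint regions of a two-set Venn diagram."""
    a, b = set(a), set(b)
    return {"only_a": len(a - b), "only_b": len(b - a), "both": len(a & b)}


def regions_3(a, b, c):
    """Count the seven disjoint regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a": len(a - b - c),
        "b": len(b - a - c),
        "c": len(c - a - b),
        "ab": len((a & b) - c),
        "ac": len((a & c) - b),
        "bc": len((b & c) - a),
        "abc": len(a & b & c),
    }


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical paragraph indices flagged by two models on one paper.
gpt_paragraphs = {1, 2, 3, 5}
claude_paragraphs = {2, 3, 8}
print(regions_2(gpt_paragraphs, claude_paragraphs))
print(jaccard(gpt_paragraphs, claude_paragraphs))
```

Because the regions are disjoint, their counts sum to the size of the union, which makes a convenient sanity check before plotting.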

New comparisons

analysis_three_systems.py — 3-way paragraph-index overlap of coarse / OpenAIReview / Reviewer 3, computed on the perturbation benchmark (best model per system)
analysis_with_humans.py — overlap between human OpenReview reviewers and the AI-system union; two-pass LLM concern-extraction + paragraph-mapping with on-disk .cache/
cluster_new.py — KMeans clustering for the two new comparisons
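
The on-disk .cache/ used by the human-vs-AI comparison can be pictured with the sketch below. It is an assumption-laden illustration, not the code in analysis_with_humans.py: the function name cached_call, the SHA-256 key scheme, and the one-JSON-file-per-prompt layout are all hypothetical, and compute stands in for whatever function issues the LLM request.

```python
import hashlib
import json
from pathlib import Path


def cached_call(prompt, compute, cache_dir=".cache"):
    """Return compute(prompt), reusing a JSON file on disk when one exists.

    The cache key is the SHA-256 hash of the prompt text, so the same prompt
    always maps to the same file. The result must be JSON-serializable.
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    entry = cache_path / f"{key}.json"
    if entry.exists():
        # Cache hit: skip the expensive call entirely.
        return json.loads(entry.read_text(encoding="utf-8"))
    result = compute(prompt)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Under this design both passes (concern extraction, then paragraph mapping) could be wrapped in the same helper, so a rerun only pays for prompts that changed.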

Figure regeneration (second commit)

All Venns restyled to the single-panel main-paper look: model-name titles, no in-figure Jaccard (reported in table columns instead), larger fonts, consistent palette
analysis_gpt_claude.py retargeted to the perturbation results tree (one cell per domain x paper x error_type)
analysis_claude_gpt_efficient.py / analysis_claude_gpt_efficient_outcomes.py — 3-way Claude vs GPT vs efficient-model union, on the perturbation and quality-proxy papers respectively
analysis_union_models.py — three-system overlap where each system unions over all its backbone models
regen_appendix_venns.py — regenerates the appendix per-model Venn grids from the region averages reported in the appendix tables
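
As a rough illustration of the two aggregation ideas mentioned above (unioning a system's comments over all its backbone models, and averaging region counts over cells), the following sketch may help. Both function names and the input shapes are assumptions for this example and do not come from analysis_union_models.py or regen_appendix_venns.py.

```python
def union_over_models(per_model):
    """Union the flagged paragraph indices across all backbone models of one system.

    per_model maps a model name to an iterable of paragraph indices.
    """
    merged = set()
    for paragraphs in per_model.values():
        merged |= set(paragraphs)
    return merged


def mean_regions(cells):
    """Average region counts over cells, e.g. one cell per domain x paper x error_type.

    cells is a list of dicts that all share the same region keys.
    """
    if not cells:
        return {}
    return {key: sum(cell[key] for cell in cells) / len(cells) for key in cells[0]}


# Hypothetical data: one system with two backbone models.
system_paragraphs = union_over_models({"model_a": [1, 2], "model_b": [2, 3]})
print(sorted(system_paragraphs))
```

A paragraph counts for the system if any of its models flagged it, so the union can only grow as models are added.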

Other

.gitignore — cover plots/, .cache/, generated cluster_*.json / per_paper_*.json, and the local frontier_subset_progressive symlink

Note: _combine_gpt_claude.py (combined GPT-or-Claude recall) was pulled out to #97; it is recall tooling, not overlap analysis.

Test plan

python analysis.py produces plots/venn_cp.{png,pdf} and plots/venn_all.{png,pdf} without errors
python analysis_three_systems.py produces the 3-way overlap plot
python analysis_with_humans.py produces the human-vs-AI venn (uses cached LLM outputs on rerun)
python analysis_gpt_claude.py produces the Claude-vs-GPT venn from the perturbation tree
python analysis_union_models.py and the two analysis_claude_gpt_efficient* scripts produce their venns
python regen_appendix_venns.py regenerates venn_cp / venn_all grids
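
One way to check the "produces plots/... without errors" items after running a script is a small existence check like the sketch below. The helper name missing_outputs is hypothetical; the only details taken from the test plan are the plots/ directory, the venn_cp and venn_all stems, and the PNG/PDF pair.

```python
from pathlib import Path


def missing_outputs(stems, plots_dir="plots", exts=("png", "pdf")):
    """List expected figure files that are absent from plots_dir."""
    base = Path(plots_dir)
    expected = [base / f"{stem}.{ext}" for stem in stems for ext in exts]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Run after `python analysis.py`; an empty list means every file was written.
    print(missing_outputs(["venn_cp", "venn_all"]))
```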

🤖 Generated with Claude Code

* utils.py: extract shared load/para_set/regions/venn helpers; move plots under plots/ in both PNG and PDF
* analysis.py, analysis_gpt_claude.py: refactor to use utils; add docstrings; drop the old top-level venn PNGs (regenerated to plots/)
* analysis_three_systems.py: 3-way paragraph-index overlap of coarse / OpenAIReview / Reviewer 3 on their common 70-paper cohort
* analysis_with_humans.py: overlap between human OpenReview reviewers and the AI-system union; two-pass LLM concern-extraction + paragraph mapping with on-disk .cache/
* cluster_new.py: KMeans clustering for the two new comparisons
* .gitignore: cover plots/, .cache/, generated cluster/per-paper JSONs, and the local frontier_subset_progressive symlink
* _combine_gpt_claude.py: compute combined (GPT-5.5 OR Claude-Opus-4.7) recall on the 24-paper frontier subset for tab:recall-overall

…odel overlap analyses

Restyle all Venn figures to the single-panel main-paper look (model-name titles, no in-figure Jaccard, larger fonts, consistent palette). Retarget the Claude-vs-GPT and three-system overlaps to the perturbation benchmark results.

New scripts: efficient-model union overlaps on both benchmarks, per-system union-over-models Venn, and a driver that regenerates the appendix per-model Venn grids from the reported region averages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dangng2004 marked this pull request as draft May 21, 2026 20:57

dangng2004 and others added 2 commits June 5, 2026 16:26

dangng2004 mentioned this pull request Jun 5, 2026

Add combined GPT-or-Claude recall of injected errors #97 (Open)

dangng2004 force-pushed the feat/venn-analyses branch from c0e3639 to 2864971 Compare June 5, 2026 21:38

dangng2004 changed the title ~~Refactor review_analysis + add 3-system and human-vs-AI overlap analyses~~ Overlap (venn) analysis of comments from different models/systems (used in paper) Jun 5, 2026

dangng2004 changed the title ~~Overlap (venn) analysis of comments from different models/systems (used in paper)~~ Overlap (venn) analysis of comments from different models/systems used in paper Jun 5, 2026

dangng2004 wants to merge 2 commits into main from feat/venn-analyses
