Overlap (venn) analysis of comments from different models/systems used in paper#91
Draft
dangng2004 wants to merge 2 commits into
* utils.py: extract shared load/para_set/regions/venn helpers; move plots under plots/ in both PNG and PDF
* analysis.py, analysis_gpt_claude.py: refactor to use utils; add docstrings; drop the old top-level venn PNGs (regenerated to plots/)
* analysis_three_systems.py: 3-way paragraph-index overlap of coarse / OpenAIReview / Reviewer 3 on their common 70-paper cohort
* analysis_with_humans.py: overlap between human OpenReview reviewers and the AI-system union; two-pass LLM concern-extraction + paragraph mapping with on-disk .cache/
* cluster_new.py: KMeans clustering for the two new comparisons
* .gitignore: cover plots/, .cache/, generated cluster/per-paper JSONs, and the local frontier_subset_progressive symlink
* _combine_gpt_claude.py: compute combined (GPT-5.5 OR Claude-Opus-4.7) recall on the 24-paper frontier subset for tab:recall-overall
…odel overlap analyses
Restyle all Venn figures to the single-panel main-paper look (model-name titles, no in-figure Jaccard, larger fonts, consistent palette). Retarget the Claude-vs-GPT and three-system overlaps to the perturbation benchmark results.
New scripts: efficient-model union overlaps on both benchmarks, per-system union-over-models Venn, and a driver that regenerates the appendix per-model Venn grids from the reported region averages.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c0e3639 to 2864971
Summary
Refactors the review_analysis venn/cluster plumbing into a shared helper module, then adds two new comparison axes on top of it, and regenerates all paper Venn figures in a consistent main-paper style.
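For readers unfamiliar with the region bookkeeping behind these Venn figures, here is a minimal sketch of how two-way and three-way overlap regions can be counted from sets of paragraph indices. The names regions_2 and regions_3 echo the helper names listed in this PR, but the signatures, the returned dictionary layout, and the sample indices are assumptions made for illustration; they are not taken from utils.py.

```python
def regions_2(a: set, b: set) -> dict:
    """Sizes of the three regions of a two-set Venn diagram."""
    return {"a_only": len(a - b), "b_only": len(b - a), "both": len(a & b)}


def regions_3(a: set, b: set, c: set) -> dict:
    """Sizes of the seven disjoint regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "ab_only": len((a & b) - c),
        "ac_only": len((a & c) - b),
        "bc_only": len((b & c) - a),
        "abc": len(a & b & c),
    }


# Hypothetical paragraph indices flagged by three systems on one paper.
coarse = {1, 2, 3, 5}
openaireview = {2, 3, 8}
reviewer3 = {3, 5, 9}

r3 = regions_3(coarse, openaireview, reviewer3)

# The seven regions partition the union, so their sizes must add up to it.
assert sum(r3.values()) == len(coarse | openaireview | reviewer3)
```

Because the regions are disjoint, per-paper counts of this kind can be summed or averaged across papers before plotting.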
Refactor
* utils.py: shared load / para_set / regions_{2,3} / draw_venn{2,3} / save_fig; plots now written to plots/ in both PNG and PDF
* analysis.py, analysis_gpt_claude.py: refactored to use utils; old top-level venn PNGs deleted (regenerated under plots/)
New comparisons
* analysis_three_systems.py: 3-way paragraph-index overlap of coarse / OpenAIReview / Reviewer 3, computed on the perturbation benchmark (best model per system)
* analysis_with_humans.py: overlap between human OpenReview reviewers and the AI-system union; two-pass LLM concern-extraction + paragraph-mapping with on-disk .cache/
* cluster_new.py: KMeans clustering for the two new comparisons
Figure regeneration (second commit)
* analysis_gpt_claude.py retargeted to the perturbation results tree (one cell per domain x paper x error_type)
* analysis_claude_gpt_efficient.py / analysis_claude_gpt_efficient_outcomes.py: 3-way Claude vs GPT vs efficient-model union, on the perturbation and quality-proxy papers respectively
* analysis_union_models.py: three-system overlap where each system unions over all its backbone models
* regen_appendix_venns.py: regenerates the appendix per-model Venn grids from the region averages reported in the appendix tables
Other
* .gitignore: cover plots/, .cache/, generated cluster_*.json / per_paper_*.json, and the local frontier_subset_progressive symlink
Note: _combine_gpt_claude.py (combined GPT-or-Claude recall) was pulled out to #97; it is recall tooling, not overlap analysis.
Test plan
* python analysis.py produces plots/venn_cp.{png,pdf} and plots/venn_all.{png,pdf} without errors
* python analysis_three_systems.py produces the 3-way overlap plot
* python analysis_with_humans.py produces the human-vs-AI venn (uses cached LLM outputs on rerun)
* python analysis_gpt_claude.py produces the Claude-vs-GPT venn from the perturbation tree
* python analysis_union_models.py and the two analysis_claude_gpt_efficient* scripts produce their venns
* python regen_appendix_venns.py regenerates venn_cp / venn_all grids
🤖 Generated with Claude Code
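
The file-existence part of the test plan could be checked mechanically. Below is a minimal sketch, assuming each figure is saved as STEM.png and STEM.pdf directly under plots/. The helper name missing_outputs is hypothetical and is not part of this PR. Only the venn_cp and venn_all stems are named in the test plan; any other stems would have to be filled in by the author.

```python
from pathlib import Path


def missing_outputs(plot_dir, stems, exts=("png", "pdf")):
    """Return the file names expected under plot_dir that do not exist."""
    root = Path(plot_dir)
    expected = [root / f"{stem}.{ext}" for stem in stems for ext in exts]
    return [p.name for p in expected if not p.is_file()]


# After running `python analysis.py`, an empty list means both figures
# were written in both formats (assumed layout: plots/<stem>.<ext>).
print(missing_outputs("plots", ["venn_cp", "venn_all"]))
```

A check like this only confirms that the files were written; the figures themselves still need a visual review.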