
Overlap (venn) analysis of comments from different models/systems used in paper #91

Draft
dangng2004 wants to merge 2 commits into main from feat/venn-analyses


@dangng2004 dangng2004 commented May 21, 2026

Contributor

Summary

Refactors the review_analysis venn/cluster plumbing into a shared helper module, then adds two new comparison axes on top of it, and regenerates all paper Venn figures in a consistent main-paper style.

Refactor

  • utils.py — shared load / para_set / regions_{2,3} / draw_venn{2,3} / save_fig; plots now written to plots/ in both PNG and PDF
  • analysis.py, analysis_gpt_claude.py — refactored to use utils; old top-level venn PNGs deleted (regenerated under plots/)
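
For reviewers who have not opened utils.py, the region helpers can be pictured roughly as below. This is a minimal sketch, not the code in this PR: the names regions_2 and regions_3 come from the list above, but the signatures, the binary-string region keys, and the jaccard helper are assumptions made for illustration.

```python
from typing import Dict, Set


def regions_2(a: Set[int], b: Set[int]) -> Dict[str, int]:
    """Sizes of the three disjoint regions of a 2-set Venn diagram.

    Keys are membership bit strings (assumed convention): "10" = only a,
    "01" = only b, "11" = in both.
    """
    return {"10": len(a - b), "01": len(b - a), "11": len(a & b)}


def regions_3(a: Set[int], b: Set[int], c: Set[int]) -> Dict[str, int]:
    """Sizes of the seven disjoint regions of a 3-set Venn diagram."""
    return {
        "100": len(a - b - c),
        "010": len(b - a - c),
        "110": len((a & b) - c),
        "001": len(c - a - b),
        "101": len((a & c) - b),
        "011": len((b & c) - a),
        "111": len(a & b & c),
    }


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because the regions are disjoint, their sizes sum to the size of the union, which is a cheap sanity check on any paragraph-index sets fed into the Venn plots.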

New comparisons

  • analysis_three_systems.py — 3-way paragraph-index overlap of coarse / OpenAIReview / Reviewer 3, computed on the perturbation benchmark (best model per system)
  • analysis_with_humans.py — overlap between human OpenReview reviewers and the AI-system union; two-pass LLM concern-extraction + paragraph-mapping with on-disk .cache/
  • cluster_new.py — KMeans clustering for the two new comparisons
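
The on-disk .cache/ used by analysis_with_humans.py could work along these lines. This is a hedged sketch only: cached_call, the call_llm callback, and the choice of a SHA-256 hash of the prompt as the cache key are hypothetical, and the actual script may key and store its cache differently.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_call(
    prompt: str,
    call_llm: Callable[[str], str],  # hypothetical: any function mapping prompt -> response
    cache_dir: Path = Path(".cache"),
) -> str:
    """Return the LLM response for `prompt`, reusing a cached copy if one exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed keying scheme: one JSON file per distinct prompt text.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_llm(prompt)
    path.write_text(
        json.dumps({"prompt": prompt, "response": response}), encoding="utf-8"
    )
    return response
```

With a scheme like this, both passes (concern extraction and paragraph mapping) only hit the LLM once per distinct prompt, so a rerun reads everything from disk.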

Figure regeneration (second commit)

  • All Venns restyled to the single-panel main-paper look: model-name titles, no in-figure Jaccard (reported in table columns instead), larger fonts, consistent palette
  • analysis_gpt_claude.py retargeted to the perturbation results tree (one cell per domain x paper x error_type)
  • analysis_claude_gpt_efficient.py / analysis_claude_gpt_efficient_outcomes.py — 3-way Claude vs GPT vs efficient-model union, on the perturbation and quality-proxy papers respectively
  • analysis_union_models.py — three-system overlap where each system unions over all its backbone models
  • regen_appendix_venns.py — regenerates the appendix per-model Venn grids from the region averages reported in the appendix tables
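
Two of the aggregation steps above, unioning a system's flagged paragraphs over its backbone models and averaging region sizes across cells, can be sketched as follows. The function names and input shapes are assumptions for illustration and are not taken from analysis_union_models.py or regen_appendix_venns.py.

```python
from typing import Dict, List, Set


def union_over_models(per_model: Dict[str, Set[int]]) -> Set[int]:
    """Union the paragraph indices flagged by every backbone model of one system."""
    return set().union(*per_model.values())


def mean_regions(per_cell: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each Venn region size over cells (e.g. one cell per paper).

    Assumes every dict uses the same region keys; an empty list yields {}.
    """
    if not per_cell:
        return {}
    n = len(per_cell)
    return {key: sum(cell[key] for cell in per_cell) / n for key in per_cell[0]}
```

Note that averaged region sizes are generally non-integer, so a Venn drawn from them shows typical proportions rather than counts for any single paper.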

Other

  • .gitignore — cover plots/, .cache/, generated cluster_*.json / per_paper_*.json, and the local frontier_subset_progressive symlink

Note: _combine_gpt_claude.py (combined GPT-or-Claude recall) was pulled out to #97; it is recall tooling, not overlap analysis.

Test plan

  • python analysis.py produces plots/venn_cp.{png,pdf} and plots/venn_all.{png,pdf} without errors
  • python analysis_three_systems.py produces the 3-way overlap plot
  • python analysis_with_humans.py produces the human-vs-AI venn (uses cached LLM outputs on rerun)
  • python analysis_gpt_claude.py produces the Claude-vs-GPT venn from the perturbation tree
  • python analysis_union_models.py and the two analysis_claude_gpt_efficient* scripts produce their venns
  • python regen_appendix_venns.py regenerates venn_cp / venn_all grids
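
The first check in the test plan can be automated with a small helper like the one below. It is a sketch under assumptions: missing_outputs is a hypothetical name, and only the venn_cp and venn_all file stems named in the test plan are checked by default.

```python
from pathlib import Path
from typing import List, Sequence


def missing_outputs(
    plots_dir: Path,
    stems: Sequence[str] = ("venn_cp", "venn_all"),
    exts: Sequence[str] = ("png", "pdf"),
) -> List[str]:
    """List the expected figure files that are absent from `plots_dir`."""
    return [
        f"{stem}.{ext}"
        for stem in stems
        for ext in exts
        if not (plots_dir / f"{stem}.{ext}").is_file()
    ]
```

After running python analysis.py, calling missing_outputs(Path("plots")) should return an empty list if both figures were written in both formats.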

🤖 Generated with Claude Code

@dangng2004 dangng2004 marked this pull request as draft May 21, 2026 20:57
dangng2004 and others added 2 commits June 5, 2026 16:26
* utils.py — extract shared load/para_set/regions/venn helpers; move plots
  under plots/ in both PNG and PDF
* analysis.py, analysis_gpt_claude.py — refactor to use utils; add docstrings;
  drop the old top-level venn PNGs (regenerated to plots/)
* analysis_three_systems.py — 3-way paragraph-index overlap of coarse /
  OpenAIReview / Reviewer 3 on their common 70-paper cohort
* analysis_with_humans.py — overlap between human OpenReview reviewers and
  the AI-system union; two-pass LLM concern-extraction + paragraph mapping
  with on-disk .cache/
* cluster_new.py — KMeans clustering for the two new comparisons
* .gitignore — cover plots/, .cache/, generated cluster/per-paper JSONs,
  and the local frontier_subset_progressive symlink
* _combine_gpt_claude.py — compute combined (GPT-5.5 OR Claude-Opus-4.7)
  recall on the 24-paper frontier subset for tab:recall-overall
…odel overlap analyses

Restyle all Venn figures to the single-panel main-paper look (model-name
titles, no in-figure Jaccard, larger fonts, consistent palette). Retarget
the Claude-vs-GPT and three-system overlaps to the perturbation benchmark
results. New scripts: efficient-model union overlaps on both benchmarks,
per-system union-over-models Venn, and a driver that regenerates the
appendix per-model Venn grids from the reported region averages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dangng2004 dangng2004 force-pushed the feat/venn-analyses branch from c0e3639 to 2864971 Compare June 5, 2026 21:38
@dangng2004 dangng2004 changed the title Refactor review_analysis + add 3-system and human-vs-AI overlap analyses Overlap (venn) analysis of comments from different models/systems (used in paper) Jun 5, 2026
@dangng2004 dangng2004 changed the title Overlap (venn) analysis of comments from different models/systems (used in paper) Overlap (venn) analysis of comments from different models/systems used in paper Jun 5, 2026