evo2 SAE eval (1/2): label producers — imported by #1636#1630
evo2 SAE eval (1/2): label producers — imported by #1636#1630polinabinder1 wants to merge 2 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces foundational infrastructure for sparse autoencoder (SAE) interpretability on genomic sequences. It adds dependency support, implements an extensible per-token labeling system with a registration-based architecture, provides generic utilities to load annotation tracks and window sequences, and includes a specialized builder for eukaryotic gene-structure-aware windows. The suite enables researchers to assign biological feature labels to individual DNA sequence tokens for downstream SAE feature probing. ChangesSAE Interpretability Infrastructure
Sequence Diagram(s)sequenceDiagram
participant User
participant main as main CLI
participant ParseGFF as parse_gff
participant SelectTx as representative_tx
participant BuildWin as build_windows
participant LabelWin as _label_window
participant Output as Window + Stats
User->>main: fasta.fa, genes.gff3<br/>--seq_len 1024, --max_tokens 300k
main->>ParseGFF: Parse GFF3
ParseGFF->>SelectTx: gene→transcript structures
SelectTx->>BuildWin: Select longest tx per gene
BuildWin->>BuildWin: Assign instance IDs<br/>(exons, introns, genes)
BuildWin->>BuildWin: Sample exon-centered windows<br/>(filter N%, enforce length)
BuildWin->>BuildWin: Fill remainder with<br/>random intergenic windows
BuildWin->>LabelWin: For each window [w0,w1)
LabelWin->>LabelWin: Vectorized overlap check<br/>vs. gene model intervals
LabelWin->>Output: Per-position labels<br/>(exon/intron/CDS/UTR)<br/>+ instance IDs
Output->>User: Windowed labels + coverage stats
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…rovenance --max-sequences caps the corpus (default 4000): the 7B pass exists only to give the example cards sequence-aligned activations, not to re-derive the full atlas. Labels are joined from --feature-annotations (the label-producer pipeline, #1630), not computed here. Stats stay local to the SAE codes (compute_feature_stats wants raw activations that won't fit for long DNA); UMAP still reuses compute_feature_umap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…small pass) atlas: stats from a RANDOM SAMPLE of an extract.py activation store run through the SAE (no 7B/megatron) + UMAP from the decoder -> features_atlas + feature_metadata. This is the same store the SAE trained on; if absent, errors pointing at extract.py. examples: small --examples-fasta through the full engine -> feature_examples. Fixes the pass-2 OOM (extract per-feature columns + window to --max-example-bp instead of holding [S, n_features] per sequence) and skips dead features. Resolves the earlier design: the atlas no longer re-runs the expensive 7B over the whole corpus. Labels still join from --feature-annotations (#1630). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py (1)
120-122: ⚡ Quick winRemove the
D103suppressions on the public entry points.
build_windowsandmainexplicitly skip public-function docstrings via# noqa: D103, which conflicts with the repository rule requiring Google-style docstrings in Python code. As per coding guidelines,**/*.py: Use Google-style docstrings (pydocstyle convention) in Python code.Also applies to: 211-211
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py` around lines 120 - 122, Remove the "# noqa: D103" suppressions on the public entry points and add proper Google-style docstrings for the functions build_windows and main: include a one-line summary, Args with parameter names and types/defaults, Returns (and type) or None, and Raises if applicable; ensure the docstrings are triple-quoted immediately under each function definition and follow pydocstyle/Google conventions so the D103 warnings are resolved.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/annot_tracks.py`:
- Around line 128-145: The sliding-window loop currently uses range(0, max(1, N
- seq_len + 1), stride) which skips a valid trailing window; change the
iteration to explicitly include the final window start at last_start = max(0, N
- seq_len). For example, build starts = list(range(0, max(1, N - seq_len + 1),
stride)) and if starts is empty or starts[-1] != last_start append last_start,
then iterate for w0 in starts (keeping the existing w1 = min(N, w0 + seq_len)
logic and subsequent label/instance handling), so short final windows ≥60 bp are
considered and duplicates are avoided; preserve tot/max_tokens behavior as-is.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`:
- Around line 46-83: parse_gff currently drops the GFF seqid and flattens
coordinates; modify parse_gff to capture the seqid (f[0]) for each feature and
store genes/tx keyed by chromosome (e.g., include "seqid" or nest genes under a
chrom key) instead of a single global coordinate space; update any data
structures referencing gene_strand/gene_biotype/tx_gene/tx_exon/tx_cds to
include the seqid when assigning gid/tid so genes remain tied to their contig.
Then update build_windows to stop concatenating FASTA records into one "chrom"
string and instead iterate over each FASTA record separately, creating windows
and emitting labels using the per-record coordinates and the seqid-aware gene/tx
structures (use the function names parse_gff and build_windows to locate edits).
Ensure lookups for exons/cds use the chrom-aware keys so labels map to the
correct contig.
- Around line 194-207: The loop that samples intergenic windows can call
rng.randint(0, N - seq_len) which raises ValueError when N < seq_len; add a
guard using the existing variables (N and seq_len) so the intergenic sampling is
skipped when the contig is shorter than the requested window: e.g., before
entering the while tot < max_tokens and tries < 20000 loop check if N < seq_len
and if so skip/break out (do not call rng.randint), thereby leaving windows (and
tot) unchanged and avoiding the exception from rng.randint; update any related
control flow around the while block that builds windows to reflect this
early-skip behavior.
- Around line 69-74: The exon/CDS handling uses a.get("Parent",
"").replace("transcript:", "") as a single key but GFF3 can list multiple
comma-separated Parent IDs; update the exon and CDS branches so you split the
Parent string on commas, strip whitespace and any "transcript:" prefix for each
id, then iterate and append (s, e) to tx_exon[tid] and tx_cds[tid] respectively
(same approach used for tx_gene elsewhere); ensure you handle empty Parent
gracefully.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.py`:
- Around line 225-228: The splice donor labeler `_sd` uses `_starts(ctx.dna,
r"GT[AG]AG")` which misses the terminal T in the documented consensus; update
the pattern in the `_sd` labeler (the function decorated with
`@labeler`("splice_donor")) to require the final 'T' (e.g., change the regex to
include the trailing T) so `_dna_mask(ctx, _starts(ctx.dna, r"GT[AG]AGT"))` is
used and prevents false-positive matches on 5-mers like "GTAAG" or "GTGAG".
---
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`:
- Around line 120-122: Remove the "# noqa: D103" suppressions on the public
entry points and add proper Google-style docstrings for the functions
build_windows and main: include a one-line summary, Args with parameter names
and types/defaults, Returns (and type) or None, and Raises if applicable; ensure
the docstrings are triple-quoted immediately under each function definition and
follow pydocstyle/Google conventions so the D103 warnings are resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e14af7e7-bc7b-449d-afcc-f9f185b31a81
📒 Files selected for processing (6)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.tomlbionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/annot_tracks.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_annot_tracks.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_labelers.py
57837ec to
13a0690
Compare
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
f794c31 to
3557958
Compare
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
… correctness, tests - declare pyrodigal (predict_cds/predict_codons import it lazily; only biopython was declared, so the gene-calling path would ImportError in a clean env) - euk_windows: reject a multi-record FASTA (parse_gff drops seqid + we concatenate, so a multi-chrom GFF would silently mislabel against a blob) and document the single-chromosome assumption - skip windows overlapping a *second* gene's span (central-gene approx mislabels a neighbor's exons as intergenic) and de-dup near-duplicate adjacent-exon windows so they don't eat the token budget - extract the CDS reverse-strand frame math into a unit-tested helper (_frame_and_start) - tests: euk_windows had none — add parse_gff/1-based->0-based labeling/single-chrom guard; cover the previously-untested labelers (gc/homopolymer/dinuc/kozak/splice) + the +/- strand frame anchor #1630 CPU tests: 16 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Summary
The label-producer half of the eval — per-token biological concepts ("what to measure"), mostly CPU-testable. Stacked on #1629.
Contents
labelers.py— motifs (ATG/stop/TATA/RBS/splice/Kozak), single-basebase_A/C/G/T, codon-frame, GC/homopolymer/dinuc, prok/sink, CDS (pyrodigal)euk_windows.py— GFF3 → instance-labeled exon/intron/CDS windowsannot_tracks.py— generic BED/GFF interval tracks → per-token mask + instance ids (RefSeq/Rfam/JASPAR/ENCODE)test_annot_tracks.py(CPU)How to run
These concepts are consumed by the harness (#1636) —
probe.py auroc / annotate / domain-eval.Summary by CodeRabbit
Release Notes
New Features
Dependencies
Tests
Expected output
euk_windows.py --dry-run→ coverage stats: chromosome length, protein-coding genes used, exon/intron counts, per-position coverage %.pytest tests/test_annot_tracks.py tests/test_labelers.py→ 5 + 3 passed (BED/GFF parse + mask/instance; motif/base positions + tag-offset).