evo2 SAE eval (1/2): label producers — imported by #1636 by polinabinder1 · Pull Request #1630 · NVIDIA-BioNeMo/bionemo-recipes

polinabinder1 · 2026-06-11T19:46:53Z

Part 1 of 2 of the eval — not standalone. These are the label-definition modules
(concepts/masks), with no output of their own; they're imported by the probing
harness in #1636, which produces the actual artifacts (buffers, feature_metadata.parquet).
Split from #1636 only to keep each PR reviewable — review and merge the two together.

Summary

The label-producer half of the eval — per-token biological concepts ("what to measure"), mostly CPU-testable. Stacked on #1629.

labelers.py — motifs (ATG/stop/TATA/RBS/splice/Kozak), single-base base_A/C/G/T, codon-frame, GC/homopolymer/dinuc, prok/sink, CDS (pyrodigal)
euk_windows.py — GFF3 → instance-labeled exon/intron/CDS windows
annot_tracks.py — generic BED/GFF interval tracks → per-token mask + instance ids (RefSeq/Rfam/JASPAR/ENCODE)
tests: test_annot_tracks.py (CPU)

How to run

pytest recipes/evo2/tests/test_annot_tracks.py                              # CPU
python scripts/euk_windows.py --fasta chr21.fa --gff chr21.gff3 --dry-run   # gene-structure coverage stats

These concepts are consumed by the harness (#1636) — probe.py auroc / annotate / domain-eval.

Summary by CodeRabbit

Release Notes

New Features
- Added support for loading and processing genomic annotation tracks (BED/GFF/GTF formats)
- Added CLI tool for building eukaryotic genomic windows with gene structure labels (exons, introns, CDS, UTR)
- Added extensible biological feature labelers for sparse autoencoder interpretability analysis
Dependencies
- Added biopython>=1.80
Tests
- Added comprehensive test coverage for annotation track processing and biological labelers

Expected output

euk_windows.py --dry-run → coverage stats: chromosome length, protein-coding genes used, exon/intron counts, per-position coverage %.
pytest tests/test_annot_tracks.py tests/test_labelers.py → 5 + 3 passed (BED/GFF parse + mask/instance; motif/base positions + tag-offset).

copy-pr-bot · 2026-06-11T19:50:32Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-11T19:50:51Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bd86428f-c35f-4fe4-8781-589f966aa883

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR introduces foundational infrastructure for sparse autoencoder (SAE) interpretability on genomic sequences. It adds dependency support, implements an extensible per-token labeling system with a registration-based architecture, provides generic utilities to load annotation tracks and window sequences, and includes a specialized builder for eukaryotic gene-structure-aware windows. The suite enables researchers to assign biological feature labels to individual DNA sequence tokens for downstream SAE feature probing.

Changes

SAE Interpretability Infrastructure

Layer / File(s)	Summary
Build dependency setup `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.toml`	Project dependencies updated to include `biopython>=1.80` for genetic code and translation utilities referenced by labelers.
Labeling infrastructure and SeqContext `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.py`	Introduces the `labeler` decorator, global registries (`LABELERS`, `COMPLEX_LABELERS`), and the `SeqContext` dataclass carrying tokenization metadata and optional gene-structure annotations through labeling pipelines.
Biological labelers and prokaryotic gene prediction `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.py`, `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_labelers.py`	Registers per-token labeler functions for positional, single-base, composition, motif, and gene-structure features; includes `predict_codons` and `predict_cds` helpers using pyrodigal for prokaryotic gene calling. Tests verify motif firing, nucleotide detection, and tag-prefix handling.
Generic annotation track loader and windowing `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/annot_tracks.py`, `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_annot_tracks.py`	Reads multi-record FASTA (gzip-transparent), parses interval tracks (BED 0-based half-open; GFF/GTF 1-based converted to 0-based with optional feature filtering), and tiles sequences into fixed-length windows with per-concept per-token masks and stable instance IDs. Tests validate coordinate conventions, instance-ID consistency, and mask alignment.
Eukaryotic genome window builder `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`	CLI tool that builds instance-aware exon/intron/CDS/UTR/intergenic windows from genome FASTA and protein-coding GFF3. Selects longest transcript per gene, constructs exon-centered windows (filtered for length and `N` content, budget-capped), and fills remainder with random intergenic windows. Reports chromosome coverage and per-label instance statistics.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant main as main CLI
  participant ParseGFF as parse_gff
  participant SelectTx as representative_tx
  participant BuildWin as build_windows
  participant LabelWin as _label_window
  participant Output as Window + Stats

  User->>main: fasta.fa, genes.gff3<br/>--seq_len 1024, --max_tokens 300k
  main->>ParseGFF: Parse GFF3
  ParseGFF->>SelectTx: gene→transcript structures
  SelectTx->>BuildWin: Select longest tx per gene
  BuildWin->>BuildWin: Assign instance IDs<br/>(exons, introns, genes)
  BuildWin->>BuildWin: Sample exon-centered windows<br/>(filter N%, enforce length)
  BuildWin->>BuildWin: Fill remainder with<br/>random intergenic windows
  BuildWin->>LabelWin: For each window [w0,w1)
  LabelWin->>LabelWin: Vectorized overlap check<br/>vs. gene model intervals
  LabelWin->>Output: Per-position labels<br/>(exon/intron/CDS/UTR)<br/>+ instance IDs
  Output->>User: Windowed labels + coverage stats

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA-BioNeMo/bionemo-framework#1621: Both PRs modify the evo2-sae recipe's pyproject.toml to update dependencies for the sparse autoencoder recipe.

Suggested labels

ciflow:all

Suggested reviewers

pstjohn
trvachov
jwilber
cspades

Poem

🐰 A rabbit hops through genes with care,
Labeling tokens everywhere—
Exons, introns, codons bright,
Windows windowed just right!
SAE features now can see,
What the genome's meant to be. 🧬

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description clearly outlines the contents, objectives, and how to run the changes, but does not follow the provided template structure with sections like 'Description', 'Type of changes', or 'Pre-submit Checklist'.	Reformat the PR description to match the repository template: add 'Description', 'Usage' with code snippet, 'Type of changes' (mark New feature), 'CI Pipeline Configuration' labels, and 'Pre-submit Checklist' confirmations.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the PR's main contribution: adding label producer modules (labelers, euk_windows, annot_tracks) for evo2 SAE evaluation, with clear reference to the related harness PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pbinder/evo2-eval-harness

_{Comment @coderabbitai help to get the list of available commands.}

…rovenance --max-sequences caps the corpus (default 4000): the 7B pass exists only to give the example cards sequence-aligned activations, not to re-derive the full atlas. Labels are joined from --feature-annotations (the label-producer pipeline, #1630), not computed here. Stats stay local to the SAE codes (compute_feature_stats wants raw activations that won't fit for long DNA); UMAP still reuses compute_feature_umap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…small pass) atlas: stats from a RANDOM SAMPLE of an extract.py activation store run through the SAE (no 7B/megatron) + UMAP from the decoder -> features_atlas + feature_metadata. This is the same store the SAE trained on; if absent, errors pointing at extract.py. examples: small --examples-fasta through the full engine -> feature_examples. Fixes the pass-2 OOM (extract per-feature columns + window to --max-example-bp instead of holding [S, n_features] per sequence) and skips dead features. Resolves the earlier design: the atlas no longer re-runs the expensive 7B over the whole corpus. Labels still join from --feature-annotations (#1630). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

polinabinder1 · 2026-06-12T04:30:55Z

@coderabbitai review

coderabbitai · 2026-06-12T04:31:00Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

polinabinder1 · 2026-06-12T04:36:43Z

@coderabbitai review

coderabbitai · 2026-06-12T04:36:47Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py (1)
120-122: ⚡ Quick win

Remove the D103 suppressions on the public entry points.

build_windows and main explicitly skip public-function docstrings via # noqa: D103, which conflicts with the repository rule requiring Google-style docstrings in Python code. As per coding guidelines, **/*.py: Use Google-style docstrings (pydocstyle convention) in Python code.

Also applies to: 211-211
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`
around lines 120 - 122, Remove the "# noqa: D103" suppressions on the public
entry points and add proper Google-style docstrings for the functions
build_windows and main: include a one-line summary, Args with parameter names
and types/defaults, Returns (and type) or None, and Raises if applicable; ensure
the docstrings are triple-quoted immediately under each function definition and
follow pydocstyle/Google conventions so the D103 warnings are resolved.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/annot_tracks.py`:
- Around line 128-145: The sliding-window loop currently uses range(0, max(1, N
- seq_len + 1), stride) which skips a valid trailing window; change the
iteration to explicitly include the final window start at last_start = max(0, N
- seq_len). For example, build starts = list(range(0, max(1, N - seq_len + 1),
stride)) and if starts is empty or starts[-1] != last_start append last_start,
then iterate for w0 in starts (keeping the existing w1 = min(N, w0 + seq_len)
logic and subsequent label/instance handling), so short final windows ≥60 bp are
considered and duplicates are avoided; preserve tot/max_tokens behavior as-is.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`:
- Around line 46-83: parse_gff currently drops the GFF seqid and flattens
coordinates; modify parse_gff to capture the seqid (f[0]) for each feature and
store genes/tx keyed by chromosome (e.g., include "seqid" or nest genes under a
chrom key) instead of a single global coordinate space; update any data
structures referencing gene_strand/gene_biotype/tx_gene/tx_exon/tx_cds to
include the seqid when assigning gid/tid so genes remain tied to their contig.
Then update build_windows to stop concatenating FASTA records into one "chrom"
string and instead iterate over each FASTA record separately, creating windows
and emitting labels using the per-record coordinates and the seqid-aware gene/tx
structures (use the function names parse_gff and build_windows to locate edits).
Ensure lookups for exons/cds use the chrom-aware keys so labels map to the
correct contig.
- Around line 194-207: The loop that samples intergenic windows can call
rng.randint(0, N - seq_len) which raises ValueError when N < seq_len; add a
guard using the existing variables (N and seq_len) so the intergenic sampling is
skipped when the contig is shorter than the requested window: e.g., before
entering the while tot < max_tokens and tries < 20000 loop check if N < seq_len
and if so skip/break out (do not call rng.randint), thereby leaving windows (and
tot) unchanged and avoiding the exception from rng.randint; update any related
control flow around the while block that builds windows to reflect this
early-skip behavior.
- Around line 69-74: The exon/CDS handling uses a.get("Parent",
"").replace("transcript:", "") as a single key but GFF3 can list multiple
comma-separated Parent IDs; update the exon and CDS branches so you split the
Parent string on commas, strip whitespace and any "transcript:" prefix for each
id, then iterate and append (s, e) to tx_exon[tid] and tx_cds[tid] respectively
(same approach used for tx_gene elsewhere); ensure you handle empty Parent
gracefully.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.py`:
- Around line 225-228: The splice donor labeler `_sd` uses `_starts(ctx.dna,
r"GT[AG]AG")` which misses the terminal T in the documented consensus; update
the pattern in the `_sd` labeler (the function decorated with
`@labeler`("splice_donor")) to require the final 'T' (e.g., change the regex to
include the trailing T) so `_dna_mask(ctx, _starts(ctx.dna, r"GT[AG]AGT"))` is
used and prevents false-positive matches on 5-mers like "GTAAG" or "GTGAG".

---

Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py`:
- Around line 120-122: Remove the "# noqa: D103" suppressions on the public
entry points and add proper Google-style docstrings for the functions
build_windows and main: include a one-line summary, Args with parameter names
and types/defaults, Returns (and type) or None, and Raises if applicable; ensure
the docstrings are triple-quoted immediately under each function definition and
follow pydocstyle/Google conventions so the D103 warnings are resolved.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e14af7e7-bc7b-449d-afcc-f9f185b31a81

📥 Commits

Reviewing files that changed from the base of the PR and between 79df727 and 0c76d38.

📒 Files selected for processing (6)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.toml
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/annot_tracks.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/euk_windows.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/labelers.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_annot_tracks.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_labelers.py

Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

… correctness, tests - declare pyrodigal (predict_cds/predict_codons import it lazily; only biopython was declared, so the gene-calling path would ImportError in a clean env) - euk_windows: reject a multi-record FASTA (parse_gff drops seqid + we concatenate, so a multi-chrom GFF would silently mislabel against a blob) and document the single-chromosome assumption - skip windows overlapping a *second* gene's span (central-gene approx mislabels a neighbor's exons as intergenic) and de-dup near-duplicate adjacent-exon windows so they don't eat the token budget - extract the CDS reverse-strand frame math into a unit-tested helper (_frame_and_start) - tests: euk_windows had none — add parse_gff/1-based->0-based labeling/single-chrom guard; cover the previously-untested labelers (gc/homopolymer/dinuc/kozak/splice) + the +/- strand frame anchor #1630 CPU tests: 16 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

polinabinder1 mentioned this pull request Jun 11, 2026

sae eval: feature-probing metrics + Evo2 probe CLI #1624

Closed

polinabinder1 mentioned this pull request Jun 11, 2026

evo2 SAE recipe: feature-explorer dashboard (viz) #1623

Open

polinabinder1 mentioned this pull request Jun 12, 2026

evo2 SAE eval (2/2): probing harness + CLI (on #1630) #1636

Open

polinabinder1 changed the title ~~evo2 SAE eval: feature-probing harness + label producers~~ evo2 SAE eval: label producers (labelers + interval tracks) Jun 12, 2026

polinabinder1 mentioned this pull request Jun 12, 2026

sae: shared probing primitives (eval metrics + ActivationBuffer) #1629

Open

polinabinder1 changed the title ~~evo2 SAE eval: label producers (labelers + interval tracks)~~ evo2 SAE eval (1/2): label producers — imported by #1636 Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

polinabinder1 marked this pull request as ready for review June 12, 2026 05:32

polinabinder1 requested review from jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners June 12, 2026 05:32

polinabinder1 requested review from jstjohn and pstjohn and removed request for jstjohn, pstjohn and trvachov June 12, 2026 05:32

polinabinder1 force-pushed the pbinder/sae-interp-primitives branch from 57837ec to 13a0690 Compare June 23, 2026 06:06

polinabinder1 force-pushed the pbinder/evo2-eval-harness branch from f794c31 to 3557958 Compare June 23, 2026 06:18

Conversation

polinabinder1 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Contents

How to run

Summary by CodeRabbit

Release Notes

Expected output

Uh oh!

copy-pr-bot Bot commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

polinabinder1 commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

polinabinder1 commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

polinabinder1 commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading