Skip to content

feat(keepset): k-mer homology keep-set builder + BED12-block --keep-bed support#55

Merged
nh13 merged 4 commits into
mainfrom
feat/z-keepset
Jun 26, 2026
Merged

feat(keepset): k-mer homology keep-set builder + BED12-block --keep-bed support#55
nh13 merged 4 commits into
mainfrom
feat/z-keepset

Conversation

@nh13

@nh13 nh13 commented Jun 22, 2026

Copy link
Copy Markdown

Design-Z stack 4/6 — base: feat/z-gate-ffi.

Keep-set construction: prmi keepset builds a flat-coordinate keep-set BED from a per-contig target BED + reference (targets ∪ k-mer homology copies under a max-occ cap, optional flank), and --keep-bed now honors BED12 blocks for position-precise keep-sets.

Dual-reviewed (coderabbit --agent + local CR); fix applied (saturating_add on the flank upper bound). green.

Summary by CodeRabbit

  • New Features
    • Added a keepset CLI command to generate keep-set BED files from a reference FASTA and per-contig BED3 targets.
    • Supports BED3 or BED12 output, configurable k-mer size and homology controls, optional flanking/merging, BED12 packing, and custom chromosome labels.
    • Can write results to standard output.
  • Bug Fixes
    • Improved BED input handling by adding BED12 block-based interval parsing with stricter validation (malformed fields, zero-length/overflowing blocks, and out-of-range coordinates).
  • Tests
    • Added coverage for BED12 decoding/merge behavior and keep-set generation edge cases.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b683d61-2f93-48a7-af80-9a06d2455d76

📥 Commits

Reviewing files that changed from the base of the PR and between 6a09209 and d40e755.

📒 Files selected for processing (4)
  • prmi/src/cli.rs
  • prmi/src/keepset.rs
  • prmi/src/lib.rs
  • prmi/src/train/mask.rs

Walkthrough

Adds a keepset CLI command that builds BED3 or BED12 keep-set output from a reference FASTA and target BED inputs, writes to stdout or a file, and extends parse_bed to decode BED12 blocks.

Changes

Keepset generation and BED12 parsing

Layer / File(s) Summary
CLI surface and execution
prmi/src/lib.rs, prmi/src/cli.rs
prmi exports the new keepset module, and Cmd::Keepset plus Cli::run add the new command-line entry point, validation, buffering, output selection, and stats reporting.
Keep-set inputs and types
prmi/src/keepset.rs
keepset.rs defines options and stats, flattens FASTA contigs into genome coordinates, parses per-contig BED3 targets, and iterates rolling k-mers with N resets.
Keep-set computation and output
prmi/src/keepset.rs
keepset.rs computes kept runs from targets and k-mer homology, expands and merges runs by flank, emits BED3 or BED12, wires the build entry point, and adds validation-focused tests.
BED12 parsing support
prmi/src/train/mask.rs
parse_bed now expands BED12 blocks into intervals, validates block fields and coordinate bounds, and adds tests for accepted and rejected BED12 inputs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fg-labs/prmi#10 — Adds another Cmd subcommand and Cli::run dispatch branch in prmi/src/cli.rs.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is specific and matches the two main changes: a new keepset builder and BED12 block support for --keep-bed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/z-keepset

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@nh13 nh13 force-pushed the feat/z-gate-ffi branch from eb640f9 to 37ad062 Compare June 23, 2026 20:07
@nh13 nh13 force-pushed the feat/z-keepset branch 3 times, most recently from a3ea72b to f63415b Compare June 23, 2026 21:30
@nh13 nh13 force-pushed the feat/z-gate-ffi branch from c439c79 to 5b8afdd Compare June 24, 2026 17:44
@nh13 nh13 force-pushed the feat/z-keepset branch from f63415b to f5235e6 Compare June 24, 2026 17:44
@nh13 nh13 force-pushed the feat/z-gate-ffi branch 3 times, most recently from 9c11832 to a2a3a29 Compare June 25, 2026 01:07
@nh13 nh13 force-pushed the feat/z-keepset branch from f5235e6 to 1b10e79 Compare June 25, 2026 01:13
@nh13 nh13 force-pushed the feat/z-gate-ffi branch from a2a3a29 to c052094 Compare June 25, 2026 04:08
@nh13 nh13 force-pushed the feat/z-keepset branch from 1b10e79 to 2dd8f76 Compare June 25, 2026 04:08
Base automatically changed from feat/z-gate-ffi to main June 25, 2026 04:16
…set)

parse_bed now treats a 12+-column line as BED12: keep only its blocks
(blockCount/blockSizes/blockStarts, cols 10/11/12), not the whole [start,end)
span. A <12-column line stays BED3 (whole interval), so existing keep/mask BEDs
are unchanged. This lets the keep-set encode position-precise homology (e.g.
length-1 / short contiguous-run k-mer start positions) compactly, one BED line
per region, which the tiered (Design Z) .sa filter consumes unchanged via
keep_doubled_pos.

Block fields are validated (integer, count match, non-zero length, checked
arithmetic against overflow, within chromEnd) with line-numbered errors. Adds
unit tests for BED12 block parsing, trailing-comma tolerance, mixed BED3/BED12
files, length-1 blocks, coordinate-overflow rejection, and the other
malformed-input rejections.
@nh13 nh13 force-pushed the feat/z-keepset branch from 2dd8f76 to 527cdd1 Compare June 25, 2026 04:17
@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@prmi/src/cli.rs`:
- Around line 419-423: The guard checks in the CLI validation logic are
currently using anyhow::ensure! and therefore lose the intended user-input error
type; update the keepset validation in the CLI path so both the k range check
and the blocks_per_line minimum check return crate::Error::InvalidInput instead
of a generic anyhow error. Make the change in the validation block that enforces
the --k and --blocks-per-line constraints, preserving the existing messages
while routing failures through the InvalidInput variant.

In `@prmi/src/keepset.rs`:
- Line 32: The k-mer cap and lookup logic in keepset needs canonicalization:
`for_each_kmer` forward codes and `reverse_complement_key` RC codes are encoded
differently for `k < 32`, so compare and count using a single canonical
representation instead of mixing low-aligned and high-aligned/T-padded keys.
Update the matching flow around the `Keepset` k-mer processing so both forward
and reverse-complement variants are normalized before `max_occ` counting and
lookup, ensuring RC homologs are not missed and inverted repeats cannot bypass
the cap.
- Around line 118-132: The FASTA loading logic in keepset::read_contigs
currently allows duplicate contig names, and later HashMap construction silently
overwrites earlier entries. Update the contig ingestion loop that builds bases
and contigs to track seen names as you read each record, and return
Error::InvalidInput immediately when a duplicate is encountered. Keep the
existing record parsing and Contig creation flow, but guard it before the later
BED mapping path so duplicate FASTA names cannot reach the HashMap collapse.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b8c9c139-6878-4915-94d0-1151da419c6c

📥 Commits

Reviewing files that changed from the base of the PR and between 8f70e9a and 527cdd1.

📒 Files selected for processing (4)
  • prmi/src/cli.rs
  • prmi/src/keepset.rs
  • prmi/src/lib.rs
  • prmi/src/train/mask.rs

Comment thread prmi/src/cli.rs Outdated
Comment thread prmi/src/keepset.rs Outdated
Comment thread prmi/src/keepset.rs
@nh13 nh13 force-pushed the feat/z-keepset branch from 527cdd1 to 6a09209 Compare June 25, 2026 14:43
@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@prmi/src/cli.rs`:
- Around line 454-474: The success path in the keepset output handling can
report success before the buffered output is actually written, because
`crate::keepset::build_keepset` may return `Ok(())` while the final `BufWriter`
flush still fails. Update the `out == Path::new("-")` and file-output branches
in `cli::` to explicitly flush the local `w` after `build_keepset` returns and
before assigning `stats`, so any disk-full or broken-pipe error is propagated
instead of being hidden.

In `@prmi/src/keepset.rs`:
- Around line 167-184: The target parsing in keepset::parse_bed currently
accepts any record with 3+ columns but only consumes start/end, which silently
widens BED12 intervals. Update the parsing path to either reject non-BED3 inputs
explicitly in this branch or reuse the BED12 block-expansion logic already used
by parse_bed before flattening, so multi-block targets are handled consistently
and by the same interval symbols as the existing parser.
- Around line 292-296: `compute_keep_runs` is silently clipping caller-provided
intervals instead of validating them, so update the `targets` handling to reject
malformed ranges before writing to `in_target`. In the loop over `targets`,
check each `(s, e)` for `end <= start` or `end > bases.len()` and return
`Error::InvalidInput` immediately rather than using `min(n)` normalization; keep
the existing bitmap logic only for valid intervals.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5fcd8d63-cf50-49ac-893d-67a78734e17d

📥 Commits

Reviewing files that changed from the base of the PR and between 527cdd1 and 6a09209.

📒 Files selected for processing (4)
  • prmi/src/cli.rs
  • prmi/src/keepset.rs
  • prmi/src/lib.rs
  • prmi/src/train/mask.rs

Comment thread prmi/src/cli.rs
Comment thread prmi/src/keepset.rs
Comment thread prmi/src/keepset.rs
nh13 added 3 commits June 25, 2026 15:16
Add `prmi keepset --ref <fa> --targets <bed3> [-k] [--max-occ] [--flank]
[--bed12]`: build a flat-coordinate Design-Z keep-set from a reference and a
per-contig target BED. The keep-set is the union of (a) the target intervals
kept whole and (b) the START positions of every reference k-mer that is a
forward or reverse-complement copy of a target k-mer whose reference count is
<= max-occ (high-copy repeats are dropped; they fall back regardless). Only the
suffix-start of each homolog is kept, matching the tiered .sa's keyed-at-start
filter -- keeping the whole [p, p+k) span would over-keep on merge.

Memory is bounded by the target k-mer count, not the genome: only target
k-mers are tracked, and counting is skipped entirely when uncapped (counting
every genome k-mer would need ~140 GB on hg38). A shared for_each_kmer helper
drives the three rolling-window passes.

Output is in prmi's flat concatenated-genome coordinate space (the space
`prmi build` and parse_bed operate in); the tool owns the per-contig -> flat
mapping so real per-contig target BEDs work directly. Emits one BED3 line per
maximal kept run, or compact BED12 (--bed12) packing runs as blocks; both
decode to the identical keep-set. k and --blocks-per-line are validated before
the reference is read so an invalid knob fails fast.

Productionizes the build_halo_bed measurement prototype with correct start-only
emission, reverse-complement homology, multi-contig support, and BED3/BED12
output. The pure core (compute_keep_runs) is unit-tested on synthetic
sequences; an end-to-end build (keepset -> build --keep-bed) yields a
byte-identical tiered .sa from both the BED3 and BED12 forms.
…turating_sub)

Address review: the lower flank bound uses saturating_sub but the upper used
`b + flank`, an inconsistent overflow path. Use saturating_add to match.
Newer nightly rustfmt (unpinned CI fmt toolchain) reformats these; apply
it so the fmt check passes. Formatting only.
@nh13 nh13 force-pushed the feat/z-keepset branch from 6a09209 to d40e755 Compare June 25, 2026 22:17
@nh13

nh13 commented Jun 25, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13

nh13 commented Jun 26, 2026

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nh13 nh13 merged commit cd38f10 into main Jun 26, 2026
4 checks passed
@nh13 nh13 deleted the feat/z-keepset branch June 26, 2026 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant