feat(keepset): k-mer homology keep-set builder + BED12-block --keep-bed support#55
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughAdds a ChangesKeepset generation and BED12 parsing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a3ea72b to
f63415b
Compare
9c11832 to
a2a3a29
Compare
…set) parse_bed now treats a 12+-column line as BED12: keep only its blocks (blockCount/blockSizes/blockStarts, cols 10/11/12), not the whole [start,end) span. A <12-column line stays BED3 (whole interval), so existing keep/mask BEDs are unchanged. This lets the keep-set encode position-precise homology (e.g. length-1 / short contiguous-run k-mer start positions) compactly, one BED line per region, which the tiered (Design Z) .sa filter consumes unchanged via keep_doubled_pos. Block fields are validated (integer, count match, non-zero length, checked arithmetic against overflow, within chromEnd) with line-numbered errors. Adds unit tests for BED12 block parsing, trailing-comma tolerance, mixed BED3/BED12 files, length-1 blocks, coordinate-overflow rejection, and the other malformed-input rejections.
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@prmi/src/cli.rs`:
- Around line 419-423: The guard checks in the CLI validation logic are
currently using anyhow::ensure! and therefore lose the intended user-input error
type; update the keepset validation in the CLI path so both the k range check
and the blocks_per_line minimum check return crate::Error::InvalidInput instead
of a generic anyhow error. Make the change in the validation block that enforces
the --k and --blocks-per-line constraints, preserving the existing messages
while routing failures through the InvalidInput variant.
In `@prmi/src/keepset.rs`:
- Line 32: The k-mer cap and lookup logic in keepset needs canonicalization:
`for_each_kmer` forward codes and `reverse_complement_key` RC codes are encoded
differently for `k < 32`, so compare and count using a single canonical
representation instead of mixing low-aligned and high-aligned/T-padded keys.
Update the matching flow around the `Keepset` k-mer processing so both forward
and reverse-complement variants are normalized before `max_occ` counting and
lookup, ensuring RC homologs are not missed and inverted repeats cannot bypass
the cap.
- Around line 118-132: The FASTA loading logic in keepset::read_contigs
currently allows duplicate contig names, and later HashMap construction silently
overwrites earlier entries. Update the contig ingestion loop that builds bases
and contigs to track seen names as you read each record, and return
Error::InvalidInput immediately when a duplicate is encountered. Keep the
existing record parsing and Contig creation flow, but guard it before the later
BED mapping path so duplicate FASTA names cannot reach the HashMap collapse.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b8c9c139-6878-4915-94d0-1151da419c6c
📒 Files selected for processing (4)
prmi/src/cli.rsprmi/src/keepset.rsprmi/src/lib.rsprmi/src/train/mask.rs
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@prmi/src/cli.rs`:
- Around line 454-474: The success path in the keepset output handling can
report success before the buffered output is actually written, because
`crate::keepset::build_keepset` may return `Ok(())` while the final `BufWriter`
flush still fails. Update the `out == Path::new("-")` and file-output branches
in `cli::` to explicitly flush the local `w` after `build_keepset` returns and
before assigning `stats`, so any disk-full or broken-pipe error is propagated
instead of being hidden.
In `@prmi/src/keepset.rs`:
- Around line 167-184: The target parsing in keepset::parse_bed currently
accepts any record with 3+ columns but only consumes start/end, which silently
widens BED12 intervals. Update the parsing path to either reject non-BED3 inputs
explicitly in this branch or reuse the BED12 block-expansion logic already used
by parse_bed before flattening, so multi-block targets are handled consistently
and by the same interval symbols as the existing parser.
- Around line 292-296: `compute_keep_runs` is silently clipping caller-provided
intervals instead of validating them, so update the `targets` handling to reject
malformed ranges before writing to `in_target`. In the loop over `targets`,
check each `(s, e)` for `end <= start` or `end > bases.len()` and return
`Error::InvalidInput` immediately rather than using `min(n)` normalization; keep
the existing bitmap logic only for valid intervals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5fcd8d63-cf50-49ac-893d-67a78734e17d
📒 Files selected for processing (4)
prmi/src/cli.rsprmi/src/keepset.rsprmi/src/lib.rsprmi/src/train/mask.rs
Add `prmi keepset --ref <fa> --targets <bed3> [-k] [--max-occ] [--flank] [--bed12]`: build a flat-coordinate Design-Z keep-set from a reference and a per-contig target BED. The keep-set is the union of (a) the target intervals kept whole and (b) the START positions of every reference k-mer that is a forward or reverse-complement copy of a target k-mer whose reference count is <= max-occ (high-copy repeats are dropped; they fall back regardless). Only the suffix-start of each homolog is kept, matching the tiered .sa's keyed-at-start filter -- keeping the whole [p, p+k) span would over-keep on merge. Memory is bounded by the target k-mer count, not the genome: only target k-mers are tracked, and counting is skipped entirely when uncapped (counting every genome k-mer would need ~140 GB on hg38). A shared for_each_kmer helper drives the three rolling-window passes. Output is in prmi's flat concatenated-genome coordinate space (the space `prmi build` and parse_bed operate in); the tool owns the per-contig -> flat mapping so real per-contig target BEDs work directly. Emits one BED3 line per maximal kept run, or compact BED12 (--bed12) packing runs as blocks; both decode to the identical keep-set. k and --blocks-per-line are validated before the reference is read so an invalid knob fails fast. Productionizes the build_halo_bed measurement prototype with correct start-only emission, reverse-complement homology, multi-contig support, and BED3/BED12 output. The pure core (compute_keep_runs) is unit-tested on synthetic sequences; an end-to-end build (keepset -> build --keep-bed) yields a byte-identical tiered .sa from both the BED3 and BED12 forms.
…turating_sub) Address review: the lower flank bound uses saturating_sub but the upper used `b + flank`, an inconsistent overflow path. Use saturating_add to match.
Newer nightly rustfmt (unpinned CI fmt toolchain) reformats these; apply it so the fmt check passes. Formatting only.
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
Design-Z stack 4/6 — base:
feat/z-gate-ffi.Keep-set construction:
prmi keepsetbuilds a flat-coordinate keep-set BED from a per-contig target BED + reference (targets ∪ k-mer homology copies under a max-occ cap, optional flank), and--keep-bednow honors BED12 blocks for position-precise keep-sets.Dual-reviewed (coderabbit --agent + local CR); fix applied (saturating_add on the flank upper bound). green.
Summary by CodeRabbit
keepsetCLI command to generate keep-set BED files from a reference FASTA and per-contig BED3 targets.