feat: support region (column|row) projection for worksheet reads#661
feat: support region (column|row) projection for worksheet reads#661alexander-beedie wants to merge 1 commit into
Conversation
lukapeschke
left a comment
There was a problem hiding this comment.
Thanks a lot, this will be really helpful 🙏
To answer your question on #660
do you have an idea if there are other meaningful options that would be useful outside of row/col definition
I'd still be in favor of a struct for selection options to avoid an API breakage later.
I see a few things I'd like to add to a SelectOptions struct (we can do it in other PRs to keep everything reviewable):
- I'd like to deprecated the
HeaderRowparam on the reader and move it to the sheet: Right now it mutates the reader, so if we have a first sheet where the header is on row 11 (for example because there is a chart above), we setwith_header_row(HeaderRow::Row(10)). But if we want to read another sheet with the same reader, we need to setwith_header_row(HeaderRow::default())first, otherwise the first 10 rows will be ignored - A
max_rowsparameter would also be nice: it would allow direct support for fastexcel'slimitparameters, and would also be convenient for things such as retrieving only the header row with whenHeaderRowisFirstNonEmptyRow - A
max_cellssafeguard to avoid memory explosions
I guess an API like this would be nice to use:
#[derive(Debug, Clone, Default)]
#[non_exhaustive]
pub struct RangeOptions {
cols: IndexSet,
rows: IndexSet,
}
impl RangeOptions {
pub fn with_cols(self, cols: impl Into<IndexSet>) -> Self;
pub fn with_rows(self, rows: impl Into<IndexSet>) -> Self;
// In later PRs
pub fn with_header_row(self, header_row: HeaderRow) -> Self;
pub fn with_row_limit(self, limit: u32) -> Self;
pub fn with_max_cells(self, limit: u32) -> Self;
}d5b5d63 to
ff8c01a
Compare
ff8c01a to
ba4c0c7
Compare
|
Folks. Let me know when this is complete/ready for merge. |
|
Thanks for the updates @alexander-beedie ! 🙏 Just curious about your opinion on the extensible options struct vs the |
Yup, I think it's worthwhile - just thinking about implementation :) |
Ref: pola-rs/polars#27677
Allows for projecting a region (columns/rows) from the sheet's Range at read time so we can collect/allocate only what is needed, instead of (potentially) allocating the entire Range. The linked issue shows a ~40GB allocation that would only need to be a few MB if the required region could be declared before reserving memory.
IndexSetfor row/col indices that can init from a range, index, list of indices, or list of ranges.ReaderRefwithworksheet_range_region_refandworksheet_range_region.collect_cells_into_range.In use
(This PR supersedes #660, which only supported column projection).