feat: support column-projected worksheet reads#660
Conversation
1f1ebc0 to
8f9705a
Compare
lukapeschke
left a comment
There was a problem hiding this comment.
Thanks for addressing this! I'm thinking we could maybe make the new methods take an non-exhaustive options struct ? This would allow to add support for more options later without having to modify the existing API.
Something like this:
#[derive(Debug, Clone, Default)]
#[non_exhaustive]
pub struct RangeOptions {
column_indices: Option<Vec<u32>>, // or an ordered set. Or (u32, u32) coordinates
// In another PR, allow to select rows as well
row_indices: Option<Vec<u32>>,
}
fn worksheet_range_opts(
&mut self,
name: &str,
opts: RangeOptions,
) -> Result<Range<Data>, Self::Error> {]This would allow us to select only the rows we need , and we could also move the header_row option there eventually, since it's more sheet-specific than reader-specific.
Looks good otherwise, thanks for the extra tests 👍
@lukapeschke: adding support for rows here is definitely a good move; do you have an idea if there are other meaningful options that would be useful outside of row/col definitions? In the meantime I'll drop this PR into Draft and start adding rows support right away. I also realised it would be nice to accept reader.worksheet_range_ref_cols("Sheet1", 0..5)?; // range
reader.worksheet_range_ref_cols("Sheet1", 0..=4)?; // inclusive
reader.worksheet_range_ref_cols("Sheet1", 2..)?; // from col 2 to end
reader.worksheet_range_ref_cols("Sheet1", [1, 3, 5])?; // explicit/disjoint list
reader.worksheet_range_ref_cols("Sheet1", ..)?; // all (no projection)...but ideally we would support declaring disjoint ranges/indices in one go to select a region 🤔 reader.worksheet_range_ref_region("S", 0..5, ..)?; // range of cols, all rows
reader.worksheet_range_ref_region("S", [1,3,5], 0..100)?; // specific cols, range of rows
reader.worksheet_range_ref_region("S", [0..3, 8..10], ..)?; // disjoint col ranges, all rows |
8f9705a to
5005741
Compare
|
Closing here and raising as a fresh PR... New version now handles regions (that can potentially be disjoint) instead of only cols, and is essentially a rewrite 👌 |
Ref: pola-rs/polars#27677
This PR allows for projecting columns at read time (for xlsx/xlsb), so we can omit columns from the streaming reader before collection - this means peak allocation tracks the selected columns, not the entire sheet's Range.
The linked issue shows a ~40GB allocation which would only need to be a few MB if the required columns could be declared (and the remaining columns filtered out) before we allocate.