core page index cache changes #22254
Signed-off-by: G <bharath78910@gmail.com>
| /// request, so cells are shared across paths → cross-path sharing. | ||
| #[derive(Clone, PartialEq, Eq, Hash, Debug)] | ||
| pub(crate) struct CiCellKey { | ||
| pub(crate) path: Arc<str>, |
Is this an absolute path?
It's a store-relative path. The key field is an `Arc<str>`, so it is just a shared reference to the `object_path`.
Will this be an issue when warm nodes refer to the files locally vs. in S3? Or is that abstracted away by the S3 filesystem?
Indexed path:
Java passes file paths via ShardView.object_metas (ObjectMeta[])
↓
build_segments() — reads meta.location for:
- load_parquet_metadata() → metadata cache key
- segment.object_path → stored on SegmentFileInfo
↓
SegmentFileInfo.object_path used by:
- load_scoped_page_index_cols() → page index cache key (Arc<str>)
- CachedMetadataReaderFactory → passed to ParquetSource for actual data reads
- IndexedExec.object_path → passed to RowGroupStreamConfig for parquet IO
Listing path:
DataFusion list_files_cache (pre-populated from shard_view.object_metas)
↓
ListingTable.list_files_for_scan() → PartitionedFile.object_meta.location
↓
Same location used by:
- DFParquetMetadata.fetch_metadata() → metadata cache key
- ScopedPageIndexReader.location → page index cache key (Arc<str>)
- CachedParquetFileReader.location → actual parquet data reads
- ParquetFileMetrics → metrics labeling
So the overall query already uses the file location on both paths; here we are just reusing it through an `Arc` reference.
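To make the `Arc<str>` point concrete, here is a minimal sketch of a path-plus-column cache key. `CellKey`, its `column` field, and `keys_for_columns` are hypothetical names for illustration; the PR only shows that `CiCellKey` carries `path: Arc<str>`. The sketch shows that building many keys for one file shares a single path allocation, while equality and hashing still go by the string contents.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `CiCellKey`.
/// Only `path: Arc<str>` is taken from the diff; `column` is an assumption.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct CellKey {
    pub path: Arc<str>,
    pub column: usize,
}

/// Build one key per column. Every key bumps the refcount of the same
/// path allocation instead of copying the string.
pub fn keys_for_columns(object_path: &Arc<str>, columns: &[usize]) -> Vec<CellKey> {
    columns
        .iter()
        .map(|&column| CellKey {
            path: Arc::clone(object_path),
            column,
        })
        .collect()
}
```

Because `Arc<str>` compares and hashes by its contents, a key built from a separately allocated copy of the same path string still hits the same cache cell. Whether the path is local or S3-backed does not matter to the key, as long as both scan paths pass the same location string.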
| /// read path. Counters are atomics so `stats()` is always lock-free. | ||
| pub(super) struct BoundedCache<K, V> | ||
| where | ||
| K: Eq + Hash + Clone + Display + Send + Sync + 'static, |
Required by dashmap: keys need both `Eq` and `Hash`.
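For readers following the bounds discussion, a rough sketch of the shape being described: a capacity-limited map whose hit/miss counters are atomics, so reading stats never takes the map lock. This is not the PR's `BoundedCache`. `SketchCache` is a hypothetical name, it uses a std `RwLock<HashMap>` in place of dashmap to stay dependency-free, and it simply refuses inserts when full instead of evicting.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical, simplified cache; the real one uses dashmap and evicts.
pub struct SketchCache<K, V> {
    map: RwLock<HashMap<K, Arc<V>>>,
    capacity: usize,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<K: Eq + Hash, V> SketchCache<K, V> {
    pub fn new(capacity: usize) -> Self {
        Self {
            map: RwLock::new(HashMap::new()),
            capacity,
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Look up a key and record a hit or a miss.
    pub fn get(&self, key: &K) -> Option<Arc<V>> {
        let found = self.map.read().unwrap().get(key).cloned();
        match &found {
            Some(_) => self.hits.fetch_add(1, Ordering::Relaxed),
            None => self.misses.fetch_add(1, Ordering::Relaxed),
        };
        found
    }

    /// Store a value unless the cache is full; returns whether it was stored.
    pub fn insert(&self, key: K, value: V) -> bool {
        let mut map = self.map.write().unwrap();
        if map.len() >= self.capacity && !map.contains_key(&key) {
            return false;
        }
        map.insert(key, Arc::new(value));
        true
    }

    /// Counters are plain atomic loads, so this never touches the map lock.
    pub fn stats(&self) -> (u64, u64) {
        (
            self.hits.load(Ordering::Relaxed),
            self.misses.load(Ordering::Relaxed),
        )
    }
}
```

The `Eq + Hash` bound is what any hash-map-backed store needs from its key type; the extra `Clone + Display + Send + Sync + 'static` bounds in the PR come from its own requirements and are not reproduced here.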
| Ok(s) => Arc::new(s), | ||
| // If we can't derive the file schema, fall back to the union schema; the | ||
| // caller still falls back to footer-only on any downstream mismatch. | ||
| Err(_) => return resolve_with_schema(_arrow_schema, metadata, predicate_column_names), |
Why are we resolving with the union schema here? Isn't that wrong?
There is a follow-up for resolve in general; we can handle it there.
| pub fn resolve_predicate_parquet_columns_pair( | ||
| union_schema: &SchemaRef, | ||
| metadata: &ParquetMetaData, | ||
| names_a: &[String], |
nit: these could be named better
| } | ||
| // Same fallback as the single-name path: resolve against the union schema. | ||
| Err(_) => ( | ||
| resolve_with_schema(union_schema, metadata, names_a), |
Not sure again if this is right. Let's at least log a warning?
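A sketch of what the suggested warning could look like, and of why the fallback deserves one. Everything here is hypothetical: schemas are reduced to lists of column names, `resolve_with_schema` is a toy with a different signature from the PR's function, and warnings are pushed into a `Vec` as a stand-in for whatever logging facade the crate uses.

```rust
/// Toy resolver (NOT the PR's signature): positions of `names` within `schema`.
fn resolve_with_schema(schema: &[&str], names: &[&str]) -> Vec<usize> {
    names
        .iter()
        .filter_map(|n| schema.iter().position(|c| c == n))
        .collect()
}

/// Resolve against the file schema when it is available; otherwise fall back
/// to the union schema and record a warning so the fallback is visible.
pub fn resolve_with_fallback(
    file_schema: Result<Vec<&str>, String>,
    union_schema: &[&str],
    names: &[&str],
    warnings: &mut Vec<String>,
) -> Vec<usize> {
    match file_schema {
        Ok(schema) => resolve_with_schema(&schema, names),
        Err(e) => {
            // Stand-in for a real `warn!`-style log call.
            warnings.push(format!(
                "file schema unavailable ({}); resolving against union schema",
                e
            ));
            resolve_with_schema(union_schema, names)
        }
    }
}
```

In this toy model, a file with columns `[a, b]` and a union schema ordered `[b, a]` resolve `b` to different indexes (1 vs 0), which is the kind of silent mismatch the warning would surface until the follow-up lands.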
| //! index (no per-page string stats). Built for **all row groups** (an empty | ||
| //! OffsetIndex on a row group DataFusion scans panics / breaks reads, and | ||
| //! DataFusion chooses the scanned set itself, after our load — see | ||
| //! HANDOFF_step2_rg_scoping.md §1e). |
nit: please add a reference to the HANDOFF_step2 md file here, or update the comment.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #22254      +/-   ##
============================================
+ Coverage     73.35%   73.39%   +0.04%
- Complexity    75937    75946       +9
============================================
  Files          6071     6071
  Lines        344993   344993
  Branches      49638    49638
============================================
+ Hits         253080   253221     +141
+ Misses        71710    71568     -142
- Partials      20203    20204       +1
|
|
|
||
| // Process-global caches | ||
|
|
||
| pub(crate) static COLUMN_INDEX_CACHE: Lazy<BoundedCache<CiCellKey, ColumnIndexMetaData>> = |
Tie them to the global runtime instead?
This is a custom cache, right? The global runtime doesn't support caches other than the list-files, stats, and metadata caches?
| arrow_schema: &SchemaRef, | ||
| predicate: &Arc<dyn datafusion::physical_expr::PhysicalExpr>, | ||
| ) -> Vec<usize> { | ||
| use arrow::array::{ArrayRef, BooleanArray, UInt64Array}; |
nit: can we move these imports up?
| set.into_iter().filter(|&c| c < num_cols).collect() | ||
| } | ||
| }; | ||
| if off_cols.is_empty() { |
Why are we returning here? Don't we have to place the placeholders?
`None` falls back to footer-only. This only happens when the parquet file technically has no columns, which shouldn't happen at all.
| } | ||
|
|
||
| /// Union of `offset_index` byte ranges across the given column chunks. | ||
| fn offset_index_union(chunks: &[ColumnChunkMetaData]) -> Option<Range<u64>> { |
`column_index_union` and this one look the same.
They differ only in `column_index_offset` vs `offset_index_offset`; otherwise they are the same.
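If the two helpers really differ only in which offset/length pair they read, one way to fold them together is a single generic union that takes an accessor. A minimal sketch under that assumption: `ChunkMeta` is a hypothetical stand-in for the parquet column-chunk metadata, reduced to the two `(offset, length)` pairs.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a column chunk's index locations,
/// each stored as an optional `(offset, length)` pair.
pub struct ChunkMeta {
    pub column_index: Option<(u64, u64)>,
    pub offset_index: Option<(u64, u64)>,
}

/// Smallest byte range covering every index location selected by `pick`.
/// Returns `None` when no chunk has that index.
pub fn index_union<F>(chunks: &[ChunkMeta], pick: F) -> Option<Range<u64>>
where
    F: Fn(&ChunkMeta) -> Option<(u64, u64)>,
{
    chunks
        .iter()
        .filter_map(|c| pick(c))
        .fold(None, |acc: Option<Range<u64>>, (off, len)| {
            let r = off..off + len;
            Some(match acc {
                None => r,
                Some(a) => a.start.min(r.start)..a.end.max(r.end),
            })
        })
}

/// Union of column-index byte ranges.
pub fn column_index_union(chunks: &[ChunkMeta]) -> Option<Range<u64>> {
    index_union(chunks, |c| c.column_index)
}

/// Union of offset-index byte ranges.
pub fn offset_index_union(chunks: &[ChunkMeta]) -> Option<Range<u64>> {
    index_union(chunks, |c| c.offset_index)
}
```

The two public wrappers keep the existing call sites unchanged while the range arithmetic lives in one place.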
| fetch_ranges.push(range); | ||
| } | ||
|
|
||
| let buffers = store.get_ranges(location, &fetch_ranges).await.ok()?; |
How are we stitching the query cancellation framework into this? Just want to understand: if the store is lagging or stuck and we cancel the query, how will that propagate to here? Or is that already handled in the task managers?
Here we either insert into the cache or we don't, right? There is no possibility of a leak here.
| /// (fixed-width page offsets) is tiny, so they get separate, separately-tunable | ||
| /// limits rather than sharing one number. | ||
| /// | ||
| /// TODO : configure via settings |
IIUC this is global for the node, which is on par with DocValueSkipList or any other index file stored in the OS file page cache for vanilla OpenSearch. Can you record this explicitly in a comment?
Signed-off-by: G <bharath78910@gmail.com>
This change reduces the dominant native-heap consumer (and the associated compute) on wide-schema Parquet files: the full ColumnIndex and OffsetIndex decoded for every column on every query. For example, on the textbench 1-billion-row dataset, the full Parquet metadata was 1750 MB.
DataFusion's parquet opener decodes the entire page index (per-page string min/max for all columns × all row groups) and then caches the result. That is what adds up to 1750 MB.
So by default we cache only the Parquet metadata footer, which holds the file stats and row-group (RG) stats, and lazily load and cache the offset index and column index.
Two-cache scoped page index (column/offset index cache)
decoded and stored once per file, shared across every query whose predicate touches that column regardless of projection or literal.
all row groups (arrow-rs push decoder indexes directly by file-global RG).
Footer-only level-1 metadata cache
storing so the level-1 LRU only ever holds footer-only entries.
PageIndexPolicy::Optional code path.
Listing-table scan path (scoped_page_index_reader.rs, scoped_index_optimizer.rs)
the match()-only 2.5× bytes_scanned regression).
Indexed-table scan path (indexed_executor.rs)
page index via load_scoped_page_index_cols.
Cache size after these changes:
Level-1 cache: 157 MB (Parquet footer).
Offset index: 158 MB, even after `select *`-style queries.
Column index: 20-30 MB when 20 columns are used as filters across various queries.
So we are down to about 300 MB from 1750 MB, and lazy loading makes evictions less painful, since only part of the offset index and column index (~5 MB) needs to be loaded back.
Follow-ups (not in this PR)
surviving_row_groups() → load_scoped_page_index_scoped(). Currently load_scoped_page_index_cols (no RG scoping) is used.
SegmentFileInfo at build_segments time and thread it through.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.