Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 17 additions & 9 deletions BloomFilter.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,7 @@ boolean block_check(block b, unsigned int32 x) {
for i in [0..7] {
for j in [0..31] {
if (masked.getWord(i).isSet(j)) {
if (not b.getWord(i).setBit(j)) {
if (not b.getWord(i).isSet(j)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is a nice find

return false
}
}
Expand Down Expand Up @@ -266,8 +266,8 @@ false positive rates:
#### File Format

Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
Bloom filter consists of the Bloom filter header followed by the Bloom filter bitset. The Bloom filter
header encodes the size of the Bloom filter bitset in bytes that is used to read the bitset.

Here are the Bloom filter definitions in thrift:

Expand All @@ -282,7 +282,7 @@ union BloomFilterAlgorithm {
}

/** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
* algorithm. It uses 64 bits version of xxHash.
* algorithm. It uses the 64-bit version of xxHash.
**/
struct XxHash {}

Expand All @@ -307,21 +307,29 @@ union BloomFilterCompression {
* Bloom filter header is stored at beginning of Bloom filter data of each column
* and followed by its bitset.
**/
struct BloomFilterPageHeader {
/** The size of bitset in bytes **/
struct BloomFilterHeader {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this matches what is in

struct BloomFilterHeader {

/** The size of bitset in bytes. **/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you can update parquet.thrift to match these changes?

struct BloomFilterHeader {
/** The size of bitset in bytes **/
1: required i32 numBytes;
/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm algorithm;
/** The hash function used for Bloom filter. **/
3: required BloomFilterHash hash;
/** The compression used in the Bloom filter **/
4: required BloomFilterCompression compression;

1: required i32 numBytes;
/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm algorithm;
/** The hash function used for Bloom filter. **/
3: required BloomFilterHash hash;
/** The compression used in the Bloom filter **/
/** The compression used in the Bloom filter. **/
4: required BloomFilterCompression compression;
}

struct ColumnMetaData {
...
/** Byte offset from beginning of file to Bloom filter data. **/
14: optional i64 bloom_filter_offset;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified it is in

/** Size of Bloom filter data including the serialized header, in bytes.
* Added in 2.10 so readers may not read this field from old files and
* it can be obtained after the BloomFilterHeader has been deserialized.
* Writers should write this field so readers can read the bloom filter
* in a single I/O.
*/
15: optional i32 bloom_filter_length;

/** Size of Bloom filter data including the serialized header, in bytes.
* Added in 2.10 so readers may not read this field from old files and
* it can be obtained after the BloomFilterHeader has been deserialized.
* Writers should write this field so readers can read the bloom filter
* in a single I/O.
*/
15: optional i32 bloom_filter_length;
}

```
Expand All @@ -339,8 +347,8 @@ information such as the presence of value. Therefore the Bloom filter of columns
data should be encrypted with the column key, and the Bloom filter of other (not sensitive) columns
do not need to be encrypted.

Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal
fields, including the BloomFilterPageHeader `bloom_filter_page_header`), and the Bitset. The header
Bloom filters have two serializable modules - the Bloom filter header (the BloomFilterHeader thrift
structure and its internal fields), and the Bitset. The header
structure is serialized by Thrift, and written to file output stream; it is followed by the
serialized Bitset.

Expand Down
8 changes: 4 additions & 4 deletions PageIndex.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,13 @@
- under the License.
-->

# Parquet page index: Layout to Support Page Skipping
# Parquet Page Index: Layout to Support Page Skipping

In Parquet, a *page index* is optional metadata for a
ColumnChunk, containing statistics for DataPages that can be used
to skip those pages when scanning in ordered and unordered columns.
The page index is stored using the OffsetIndex and ColumnIndex structures,
defined in [`parquet.thrift`](src/main/thrift/parquet.thrift)
defined in [`parquet.thrift`](src/main/thrift/parquet.thrift).

## Problem Statement
In previous versions of the format, Statistics are stored for ColumnChunks in
Expand All @@ -37,7 +37,7 @@ data from disk.
1. Make both range scans and point lookups I/O efficient by allowing direct
access to pages based on their min and max values. In particular:
* A single-row lookup in a row group based on the sort column of that row group
will only read one data page per the retrieved column.
will only read one data page per retrieved column.
* Range scans on the sort column will only need to read the exact data
pages that contain relevant data.
* Make other selective scans I/O efficient: if we have a very selective
Expand Down Expand Up @@ -81,7 +81,7 @@ Some observations:
* We store lower and upper bounds for the values of each page. These may be the
actual minimum and maximum values found on a page, but can also be (more
compact) values that do not exist on a page. For example, instead of storing
""Blart Versenwald III", a writer may set `min_values[i]="B"`,
`"Blart Versenwald III"`, a writer may set `min_values[i]="B"`,
`max_values[i]="C"`. This allows writers to truncate large values and writers
should use this to enforce some reasonable bound on the size of the index
structures.
Expand Down