diff --git a/BinaryProtocolExtensions.md b/BinaryProtocolExtensions.md index e23d3328..49dba3b5 100644 --- a/BinaryProtocolExtensions.md +++ b/BinaryProtocolExtensions.md @@ -26,11 +26,17 @@ The extension mechanism of the `binary` Thrift field-id `32767` has some desirab * The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift. * Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL). -Because only one field-id is reserved the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers which MUST put enough unique state in their extension bytes for disambiguation. This can be relatively easily achieved by adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the start or end of the extension bytes. The extension does not specify a disambiguation mechanism to allow more flexibility to implementers. +Because only one field-id is reserved the extension bytes themselves require +disambiguation; otherwise readers will not be able to decode extensions safely. +This is left to implementers who MUST put enough unique state in their extension +bytes for disambiguation. This can be relatively easily achieved by adding a +[UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the +start or end of the extension bytes. The extension does not specify a +disambiguation mechanism to allow more flexibility to implementers. Putting everything together in an example, if we would extend `FileMetaData` it would look like this on the wire. 
- N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field) + N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field) 4 bytes | 08 FF FF 01 (long form header for 32767: binary) 1-5 bytes | ULEB128(M) encoded size of the extension M bytes | extension bytes @@ -50,14 +56,14 @@ To illustrate the applicability of the extension mechanism we provide examples o ### Footer -A variant of `FileMetaData` encoded in Flatbuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with. +A variant of `FileMetaData` encoded in FlatBuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with. In its private form the footer of a Parquet file will look like so: - N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field) + N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field) 4 bytes | 08 FF FF 01 (long form header for 32767: binary) 1-5 bytes | ULEB128(K+28) encoded size of the extension - K bytes | Flatbuffers representation (v0) of FileMetaData + K bytes | FlatBuffers representation (v0) of FileMetaData 4 bytes | little-endian crc32(flatbuffer) 4 bytes | little-endian size(flatbuffer) 4 bytes | little-endian crc32(size(flatbuffer)) @@ -67,20 +73,20 @@ In its private form the footer of a Parquet file will look like so: some-UUID is some UUID picked for this extension and it is used throughout (possibly internal) experimentation. It is put at the end to allow detection of the extension when parsed in reverse. The little-endian sizes and crc32s are also to the end to facilitate efficient parsing the footer in reverse without requiring parsing the Thrift compact protocol that precedes it. -At some point the experiments conclude and the extension shared publicly with the community. 
The extension is proposed for inclusion to the standard with a migration plan to replace the existing `FileMetaData`. +At some point the experiments conclude and the extension is shared publicly with the community. The extension is proposed for inclusion to the standard with a migration plan to replace the existing `FileMetaData`. -The community reviews the proposal and (potentially) proposes changes to the Flatbuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must: +The community reviews the proposal and (potentially) proposes changes to the FlatBuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must: 1. have some way of being extended in the future much like what it replaces. Because the extension mechanism only allows for a single extension, without this in place we cannot have footer extensions during the migration. 2. consider its intermediate form where both the **Thrift** `FileMetaData` and the **FlatBuffers** `FileMetaData` will be present. 3. consider its final form where the long form header for `32767: binary` may not be present. -Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the flatbuffer `FileMetaData`. It will look much like its private form except the flatbuffer IDL may be different: +Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the FlatBuffers `FileMetaData`. 
It will look much like its private form except the FlatBuffers IDL may be different: - N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field) + N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field) 4 bytes | 08 FF FF 01 (long form header for 32767: binary) 1-5 bytes | ULEB128(K+28) encoded size of the extension - K bytes | Flatbuffers representation (v1) of FileMetaData + K bytes | FlatBuffers representation (v1) of FileMetaData 4 bytes | little-endian crc32(flatbuffer) 4 bytes | little-endian size(flatbuffer) 4 bytes | little-endian crc32(size(flatbuffer)) @@ -90,7 +96,7 @@ Once the design is ratified the new `FileMetaData` encoding is made final with t After the migration period, the end of the Parquet file may look like this: - K bytes | Flatbuffers representation (v1) of FileMetaData + K bytes | FlatBuffers representation (v1) of FileMetaData 4 bytes | little-endian crc32(flatbuffer) 4 bytes | little-endian size(flatbuffer) 4 bytes | little-endian crc32(size(flatbuffer)) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d6049a88..f9fdf21a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -43,10 +43,10 @@ The general steps for adding features to the format are as follows: 1. Design/scoping: The goal of this phase is to identify design goals of a feature and provide some demonstration that the feature meets those goals. This phase starts with a discussion of changes on the developer mailing list - (dev@parquet.apache.org). Depending on the scope and goals of the feature the - it can be useful to provide additional artifacts as part of a discussion. The - artifacts can include a design docuemnt, a draft pull request to make the - discussion concrete and/or an prototype implementation to demostrate the + (dev@parquet.apache.org). Depending on the scope and goals of the feature, it + can be useful to provide additional artifacts as part of a discussion. 
The + artifacts can include a design document, a draft pull request to make the + discussion concrete and/or a prototype implementation to demonstrate the viability of implementation. This step is complete when there is lazy consensus. Part of the consensus is whether it is sufficient to provide two working implementations as outlined in step 2, or if demonstration of the @@ -58,7 +58,7 @@ The general steps for adding features to the format are as follows: 2. Completeness: The goal of this phase is to ensure the feature is viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two - implementations of the feature demonstrating interopability must also be + implementations of the feature demonstrating interoperability must also be provided. One implementation MUST be [`parquet-java`](http://github.com/apache/parquet-java). It is preferred that the second implementation be @@ -73,21 +73,21 @@ The general steps for adding features to the format are as follows: fit for inclusion (for example, they were submitted as a pull request against the target repository and committers gave positive reviews). Reports on the benefits from closed source implementations are welcome and can help lend - weight to features desirability but are not sufficient for acceptance of a + weight to a feature's desirability but are not sufficient for acceptance of a new feature. Unless otherwise discussed, it is expected the implementations will be developed from their respective main branch (i.e. backporting is not required), to demonstrate that the feature is mergeable to its implementation. -3. Ratification: After the first two steps are complete a formal vote is held on +3. Ratification: After the first two steps are complete, a formal vote is held on dev@parquet.apache.org to officially ratify the feature. 
After the vote - passes the format change is merged into the `parquet-format` repository and + passes, the format change is merged into the `parquet-format` repository and it is expected the changes from step 2 will also be merged soon after (implementations should not be merged until the addition has been merged to `parquet-format`). -#### General guidelines/preferences on additions. +#### General guidelines/preferences on additions 1. To the greatest extent possible changes should have an option for forward compatibility (old readers can still read files). The [compatibility and @@ -95,13 +95,13 @@ demonstrate that the feature is mergeable to its implementation. provides more details on expectations for changes that break compatibility. 2. New encodings should be fully specified in this repository and not - rely on an external dependencies for implementation (i.e. `parquet-format` is + rely on external dependencies for implementation (i.e. `parquet-format` is the source of truth for the encoding). If it does require an external dependency, then the external dependency must have its own specification separate from implementation. 3. New compression mechanisms should have a pure Java implementation that can be - used as a dependency in `parquet-java`, exceptions may be + used as a dependency in `parquet-java`; exceptions may be discussed on the mailing list to see if a non-native Java implementation is acceptable. @@ -154,7 +154,7 @@ recommendations for managing features: 2. Forward compatible features/changes may be enabled and used by default in implementations once the parquet-format containing those changes has been formally released. For features that may pose a significant performance - regression to older format readers, libaries should consider delaying default + regression to older format readers, libraries should consider delaying default enablement until 1 year after the release of the parquet-java implementation that contains the feature implementation. 
@@ -162,7 +162,7 @@ recommendations for managing features: until 2 years after the parquet-java implementation containing the feature is released. It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers (e.g. via - a major version release if using Semantic Versioning, or highlighed in + a major version release if using Semantic Versioning, or highlighted in release notes). For forward compatible changes which have a high chance of performance @@ -174,7 +174,7 @@ the same timelines as `parquet-java`. Parquet-java will wait to enable features by default until the most conservative timelines outlined above have been exceeded. This timeline is an attempt to balance ensuring new features make their way into the ecosystem and avoiding -breaking compatiblity for readers that are slower to adopt new standards. We +breaking compatibility for readers that are slower to adopt new standards. We encourage earlier adoption of new features when an organization using Parquet can guarantee that all readers of the parquet files they produce can read a new feature. diff --git a/Encryption.md b/Encryption.md index 180b9aa6..d3c8c9fa 100644 --- a/Encryption.md +++ b/Encryption.md @@ -79,7 +79,7 @@ in order to verify its integrity. New footer fields keep an information about the file encryption algorithm and the footer signing key. For encrypted columns, the following modules are always encrypted, with the same column key: -pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter +pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter headers and bitsets. If the column key is different from the footer encryption key, the column metadata is serialized separately and encrypted with the column key. In this case, the column metadata is also @@ -101,7 +101,7 @@ other on a combination of GCM and CTR modes. 
AES GCM is an authenticated encryption. Besides the data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data (default), and of the data combined with an optional AAD (“additional authenticated data”). The -authentication allows to make sure the data has not been tampered with. An AAD +authentication makes it possible to verify that the data has not been tampered with. An AAD is a free text to be authenticated, together with the data. The user can, for example, pass the file name with its version (or creation timestamp) as an AAD input, to verify that the file has not been replaced with an older version. The details on how Parquet creates @@ -136,9 +136,10 @@ one IV is ever repeated, then the implementation may be vulnerable"*. *"Complian requirement is crucial to the security of GCM"*. The bulk of modules in a Parquet file are page headers and data pages. Therefore, one encryption -key shall not not be used for more than 2^31 (~2 billion) pages. In Parquet files encrypted with -multiple keys (footer and column keys), the constraint on the number of invocations is applied -to each key separately. +key shall not be used for more than 2^32 total module encryptions, as per the NIST specification. +Since each data page requires two module encryptions (header + data), this means in practice no +more than 2^31 pages per key. In Parquet files encrypted with multiple keys (footer and column +keys), the constraint on the number of invocations is applied to each key separately. When running in the context of a larger system, any particular Parquet writer implementation likely does not have sufficient context to enforce key invocation limits system-wide. Therefore, @@ -161,8 +162,9 @@ tag used to verify the ciphertext and AAD integrity. #### 4.2.2 AES_GCM_CTR_V1 + In this Parquet algorithm, all modules except pages are encrypted with the GCM cipher, as described -above. 
The pages are encrypted by the CTR cipher without padding. This allows to encrypt/decrypt +above. The pages are encrypted by the CTR cipher without padding. This makes it possible to encrypt/decrypt the bulk of the data faster, while still verifying the metadata integrity and making sure the file has not been replaced with a wrong version. However, tampering with the page data might go unnoticed. The AES CTR cipher @@ -208,7 +210,7 @@ it can't prevent replacement of one ciphertext with another (encrypted with the Parquet modular encryption leverages AADs to protect against swapping ciphertext modules (encrypted with AES GCM) inside a file or between files. Parquet can also protect against swapping full files - for example, replacement of a file with an old version, or replacement of one table -partition with another. AADs are built to reflects the identity of a file and of the modules +partition with another. AADs are built to reflect the identity of a file and of the modules inside the file. Parquet constructs a module AAD from two components: an optional AAD prefix - a string provided @@ -221,12 +223,12 @@ group 1. The module AAD is a direct concatenation of the prefix and suffix parts #### 4.4.1 AAD prefix File swapping can be prevented by an AAD prefix string, that uniquely identifies the file and -allows to differentiate it e.g. from older versions of the file or from other partition files in the same +makes it possible to differentiate it e.g. from older versions of the file or from other partition files in the same data set (table). This string is optionally passed by a writer upon file creation. If provided, the AAD prefix is stored in an `aad_prefix` field in the file, and is made available to the readers. This field is not encrypted. If a user is concerned about keeping the file identity inside the file, the writer code can explicitly request Parquet not to store the AAD prefix. 
Then the aad_prefix field -will be empty; AAD prefixes must be fully managed by the caller code and supplied explictly to Parquet +will be empty; AAD prefixes must be fully managed by the caller code and supplied explicitly to Parquet readers for each file. The protection against swapping full files is optional. It is not enabled by default because @@ -246,15 +248,15 @@ of all partition files (prefixes) from 0 to N-1. #### 4.4.2 AAD suffix The suffix part of a module AAD protects against module swapping inside a file. It also protects against -module swapping between files - in situations when an encryption key is re-used in multiple files and the +module swapping between files - in situations when an encryption key is re-used in multiple files and the writer has not provided a unique AAD prefix for each file. Unlike AAD prefix, a suffix is built internally by Parquet, by direct concatenation of the following parts: 1. [All modules] internal file identifier - a random byte array generated for each file (implementation-defined length) 2. [All modules] module type (1 byte) -3. [All modules except footer] row group ordinal (2 byte short, little endian) -4. [All modules except footer] column ordinal (2 byte short, little endian) -5. [Data page and header only] page ordinal (2 byte short, little endian) +3. [All modules except footer] row group ordinal (2-byte short, little-endian) +4. [All modules except footer] column ordinal (2-byte short, little-endian) +5. 
[Data page and header only] page ordinal (2-byte short, little-endian) The following module types are defined: @@ -262,8 +264,8 @@ The following module types are defined: * ColumnMetaData (1) * Data Page (2) * Dictionary Page (3) - * Data PageHeader (4) - * Dictionary PageHeader (5) + * Data Page Header (4) + * Dictionary Page Header (5) * ColumnIndex (6) * OffsetIndex (7) * BloomFilter Header (8) @@ -276,8 +278,8 @@ The following module types are defined: | ColumnMetaData | yes | yes (1) | yes | yes | no | | Data Page | yes | yes (2) | yes | yes | yes | | Dictionary Page | yes | yes (3) | yes | yes | no | -| Data PageHeader | yes | yes (4) | yes | yes | yes | -| Dictionary PageHeader| yes | yes (5) | yes | yes | no | +| Data Page Header | yes | yes (4) | yes | yes | yes | +| Dictionary Page Header| yes | yes (5) | yes | yes | no | | ColumnIndex | yes | yes (6) | yes | yes | no | | OffsetIndex | yes | yes (7) | yes | yes | no | | BloomFilter Header | yes | yes (8) | yes | yes | no | @@ -285,7 +287,7 @@ The following module types are defined: -## 5 File Format +## 5. File Format ### 5.1 Encrypted module serialization All modules, except column pages, are encrypted with the GCM cipher. In the AES_GCM_V1 algorithm, @@ -392,7 +394,7 @@ struct ColumnChunk { ### 5.3 Protection of sensitive metadata The Parquet file footer, and its nested structures, contain sensitive information - ranging -from a secret data (column statistics) to other information that can be exploited by an +from secret data (column statistics) to other information that can be exploited by an attacker (e.g. schema, num_values, key_value_metadata, encoding and crypto_metadata). This information is automatically protected when the footer and secret columns are encrypted with the same key. In other cases - when column(s) and the @@ -408,7 +410,7 @@ field in the `ColumnChunk`. struct ColumnChunk { ... - /** Column metadata for this chunk.. 
**/ + /** Column metadata for this chunk **/ 3: optional ColumnMetaData meta_data .. /** Crypto metadata of encrypted columns **/ @@ -439,7 +441,7 @@ little endian integer, followed by a final magic string, "PARE". The same magic written at the beginning of the file (offset 0). Parquet readers start file parsing by reading and checking the magic string. Therefore, the encrypted footer mode uses a new magic string ("PARE") in order to instruct readers to look for a file crypto metadata -before the footer - and also to immediately inform legacy readers (expecting ‘PAR1’ +before the footer - and also to immediately inform legacy readers (expecting "PAR1" bytes) that they can’t parse this file. ```c @@ -490,14 +492,14 @@ The plaintext footer is signed in order to prevent tampering with the structure with the AES GCM algorithm - using a footer signing key, and an AAD constructed according to the instructions of the section 4.4. Only the nonce and GCM tag are stored in the file – as a 28-byte -fixed-length array, written right after the footer itself. The ciphertext is not stored, +fixed-length array, written right after the footer itself. The ciphertext is not stored, because it is not required for footer integrity verification by readers. | nonce (12 bytes) | tag (16 bytes) | |------------------|-----------------| -The plaintext footer mode sets the following fields in the the FileMetaData structure: +The plaintext footer mode sets the following fields in the FileMetaData structure: ```c struct FileMetaData { @@ -522,7 +524,7 @@ The 28-byte footer signature is written after the plaintext footer, followed by that contains the combined length of the footer and its signature. A final magic string, "PAR1", is written at the end of the file. The same magic string is written at the beginning of the file (offset 0). 
The magic bytes -for plaintext footer mode are ‘PAR1’ to allow legacy readers to read projections of the file +for plaintext footer mode are "PAR1" to allow legacy readers to read projections of the file that do not include encrypted columns. ![File Layout - Encrypted footer](doc/images/FileLayoutEncryptionPF.png)
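
The plaintext footer tail described above (footer, 28-byte signature, combined length, `PAR1` magic) can be read back to front. The following is a minimal sketch, not the reference implementation. It assumes that the combined length is a 4-byte little-endian integer, as in the standard Parquet footer, and that the whole file is available as a byte string. The function name `split_plaintext_footer_tail` and the constants are hypothetical and chosen for illustration only.

```python
import struct

# Layout constants taken from the text: 12-byte nonce + 16-byte GCM tag.
MAGIC_PLAINTEXT = b"PAR1"
NONCE_LEN = 12
TAG_LEN = 16
SIGNATURE_LEN = NONCE_LEN + TAG_LEN  # 28 bytes

# Assumption: the length field is 4 bytes wide, little-endian.
LENGTH_FIELD_LEN = 4


def split_plaintext_footer_tail(file_bytes: bytes):
    """Return (footer, nonce, tag) from a plaintext-footer-mode file.

    Hypothetical helper: it only slices bytes and does not verify the
    signature, which would require the footer signing key and the AAD.
    """
    trailer_len = LENGTH_FIELD_LEN + len(MAGIC_PLAINTEXT)
    # The file must hold the leading magic plus the trailer at minimum.
    if len(file_bytes) < len(MAGIC_PLAINTEXT) + trailer_len:
        raise ValueError("file too short")
    if file_bytes[-len(MAGIC_PLAINTEXT):] != MAGIC_PLAINTEXT:
        raise ValueError("not a plaintext-footer file (magic is not PAR1)")

    # Combined length of the footer and its signature.
    length_start = len(file_bytes) - trailer_len
    (combined_len,) = struct.unpack(
        "<I", file_bytes[length_start:length_start + LENGTH_FIELD_LEN]
    )
    max_len = length_start - len(MAGIC_PLAINTEXT)
    if combined_len < SIGNATURE_LEN or combined_len > max_len:
        raise ValueError("invalid combined footer length")

    signature_start = length_start - SIGNATURE_LEN
    footer_start = length_start - combined_len
    footer = file_bytes[footer_start:signature_start]
    signature = file_bytes[signature_start:length_start]
    return footer, signature[:NONCE_LEN], signature[NONCE_LEN:]
```

A reader that holds the footer signing key would then re-encrypt the footer with the extracted nonce and the module AAD, and compare the resulting tag with the stored one. A legacy reader can ignore the trailing 28 bytes only if its Thrift parser stops at the footer's stop field.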