RFC: Encode parquet-format minor_version in thrift metadata#581

Draft
alamb wants to merge 2 commits into apache:master from alamb:alamb/parquet-versions-option-1
alamb commented Jun 9, 2026

Contributor

Rationale for this change

NOTE: there is an alternate RFC here

As described on the mailing list thread about versions, there is currently no way for a Parquet reader to know which of the many Parquet features it may encounter in a particular file.

The current version field in the thrift metadata is insufficient because:
  1. There is no agreed-upon definition of version and many writers use it incorrectly:

* As of December 2025, there is no agreed upon consensus of what constitutes
* version 2 of the file. For maximum compatibility with readers, writers should
* always populate "1" for version. For maximum compatibility with writers,
* readers should accept "1" and "2" interchangeably. All other versions are
* reserved for potential future use-cases.
*/

  2. Even if we agreed to use the version field, version "2" has several forward-incompatible changes (see Document Parquet Features by Version parquet-site#186), meaning a reader doesn't know which features it may encounter.

What changes are included in this PR?

Add a minor_version field to the thrift metadata to encode the minor version of parquet-format. Readers can use this field to determine which features they may encounter.

This field would be ignored by older readers.
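
As a rough illustration of how a reader might consume the proposed field, here is a minimal sketch. The `MAX_SUPPORTED_MINOR` constant, the `check_minor_version` helper, and the policy of treating a missing field as "unknown" are assumptions made for this example; the RFC itself does not prescribe reader behavior.

```python
from typing import Optional

# Hypothetical: the highest parquet-format minor version this reader implements.
MAX_SUPPORTED_MINOR = 13


def check_minor_version(minor_version: Optional[int]) -> str:
    """Classify a file by its optional minor_version field (illustrative policy)."""
    if minor_version is None:
        # Field absent: the writer predates the field, so nothing can be inferred.
        return "unknown"
    if minor_version <= MAX_SUPPORTED_MINOR:
        # Every feature up to this minor version is assumed to be implemented.
        return "supported"
    # The file may use features this reader has never seen.
    return "too_new"
```

A reader could call `check_minor_version` right after decoding the footer and decide whether to proceed, warn, or fail early.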

Do these changes have PoC implementations?

Not yet

* minor versions. See the documentation[1] for more details on the versioning
* scheme and the features added in each version.
*
* [1]: http://parquet.apache.org/docs/file-format/versions

Contributor Author

This URL would correspond to apache/parquet-site#186 when published.

etseidl commented Jun 9, 2026

Contributor

My issue with this (and #582) is that the version info (here the minor version, there both major and minor) is encoded after the entirety of the metadata. Readers will have to decode the entire footer to obtain adequate versioning info.

The current version is an i32, so why waste that on a single digit? We can encode a year-based version as a decimal integer such as 202606. Or, if we want SemVer, something like 2013000 (for 2.13.0). Or, for just major/minor, we could split the i32 into two i16s (0x2000D).
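
To make the three candidate encodings concrete, here is a small sketch. The function names are hypothetical, and the three-decimal-digits-per-component layout for the SemVer variant is an assumption inferred from the 2013000 example rather than something the comment spells out.

```python
def encode_calendar(year: int, month: int) -> int:
    """Year-based version as a decimal integer, e.g. 2026-06 -> 202606."""
    return year * 100 + month


def encode_semver(major: int, minor: int, patch: int = 0) -> int:
    """SemVer packed in decimal, three digits each for minor and patch (assumed)."""
    return major * 1_000_000 + minor * 1_000 + patch


def decode_semver(value: int) -> tuple:
    """Inverse of encode_semver: returns (major, minor, patch)."""
    major, rest = divmod(value, 1_000_000)
    minor, patch = divmod(rest, 1_000)
    return major, minor, patch


def encode_packed(major: int, minor: int) -> int:
    """Major in the high 16 bits, minor in the low 16 bits of the i32."""
    return (major << 16) | minor


def decode_packed(value: int) -> tuple:
    """Inverse of encode_packed: returns (major, minor)."""
    return value >> 16, value & 0xFFFF
```

All three results for the values discussed here fit comfortably in a signed 32-bit integer, so none of them would require changing the type of the existing field.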

The advantage of keeping the current version field is that it will be the first thing encoded. You don't need a custom thrift decoder to examine a single VLQ-encoded integer. Readers can then check the version before passing the footer bytes on to whatever thrift decoder they're using.
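
The "peek at the first integer" idea can be sketched as follows. This assumes the footer is serialized with the thrift compact protocol and that the writer emits `version` (field id 1, type i32) first; `peek_version` is a hypothetical helper, not an existing API.

```python
def peek_version(footer: bytes) -> int:
    """Read the leading i32 `version` field without a full thrift decoder."""
    # Compact-protocol field header: high nibble is the field-id delta (1),
    # low nibble is the type code (5 means i32), giving the byte 0x15.
    if not footer or footer[0] != 0x15:
        raise ValueError("footer does not start with an i32 field with id 1")
    raw = 0
    shift = 0
    for byte in footer[1:6]:  # a 32-bit varint occupies at most 5 bytes
        raw |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            # Last varint byte: undo zigzag encoding to get the signed value.
            return (raw >> 1) ^ -(raw & 1)
        shift += 7
    raise ValueError("truncated or overlong varint")
```

For example, the bytes `15 02` decode to version 1, and a reader could reject or special-case a file based on that value before handing the same buffer to its regular thrift decoder.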

alamb commented Jun 9, 2026

Contributor Author

The current version is an i32, so why waste that on a single digit? We can encode a year-based version as a decimal integer such as 202606. Or, if we want SemVer, something like 2013000 (for 2.13.0). Or, for just major/minor, we could split the i32 into two i16s (0x2000D).

I agree that is a more clever encoding. I was trying to find something that is backwards compatible and avoids the "what does the version field mean" discussion that no one seems able to resolve.

🤷
