RFC: Encode parquet-format minor_version in thrift metadata #581
 * minor versions. See the documentation[1] for more details on the versioning
 * scheme and the features added in each version.
 *
 * [1]: http://parquet.apache.org/docs/file-format/versions
This URL would correspond to apache/parquet-site#186 when published
My issue with this (and #582) is that the version info (here the minor, there both major and minor) is encoded after the entirety of the metadata. Readers will have to decode the entire footer to obtain adequate versioning info. The current `version` is an integer field. The advantage of keeping the current `version` field is that it will be the first thing encoded. You don't need a custom thrift decoder to examine a single VLQ-encoded integer. Readers can then check the version before passing the footer bytes on to whatever thrift decoder they're using.
I agree that is a more clever encoding. I was trying to find something that was backwards compatible and avoided the "what does the version field mean" discussion, which no one seems able to resolve 🤷
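The "check the version before decoding" idea from the review comment above can be sketched in a few lines. This is a minimal illustration, not code from either RFC. It assumes the footer is serialized with the Thrift compact protocol, that the writer emits `version` (field id 1, type `i32`) as the very first field, and that the footer bytes have already been extracted from the file. The function name `peek_version` is hypothetical.

```python
def peek_version(footer: bytes) -> int:
    """Read the leading i32 field of a Thrift compact-protocol struct.

    Assumption: the struct begins with field id 1 of type i32, which is
    where FileMetaData.version is expected to be written.
    """
    I32_COMPACT_TYPE = 5  # compact-protocol type code for i32

    if not footer:
        raise ValueError("empty footer")

    # Short-form field header: high nibble = field id delta, low nibble = type.
    header = footer[0]
    field_delta, field_type = header >> 4, header & 0x0F
    if field_delta != 1 or field_type != I32_COMPACT_TYPE:
        raise ValueError("footer does not start with field 1 of type i32")

    # The value is a varint (7 data bits per byte, high bit = "more follows").
    # An i32 never needs more than 5 bytes.
    raw = 0
    shift = 0
    for byte in footer[1:6]:
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    else:
        raise ValueError("unterminated or missing varint")

    # Compact protocol zigzag-encodes signed integers; undo that here.
    return (raw >> 1) ^ -(raw & 1)
```

Under these assumptions, `peek_version(bytes([0x15, 0x04]))` returns `2`, so a reader could reject or special-case a footer before handing it to a full thrift decoder.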
Rationale for this change
NOTE: there is an alternate RFC here:
`format_major_version` and `format_minor_version` to thrift metadata #582
As described on the mailing list thread about versions, there is currently no way for a parquet reader to know which of the many parquet features it may encounter in a particular file.
The current `version` field in the thrift metadata is insufficient because:
2. There is no agreed-upon definition of `version` and many writers use it incorrectly:
parquet-format/src/main/thrift/parquet.thrift
Lines 1368 to 1373 in 74001e4
`version` field, version "2" has several forward incompatible changes (see Document Parquet Features by Version parquet-site#186), meaning a reader doesn't know what features it may encounter
What changes are included in this PR?
Add a `minor_version` field to the thrift metadata to encode the minor version of parquet-format. Readers can use this field to determine what features they may encounter.
This field would be ignored by older readers.
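As a rough illustration of how a reader might act on such a field, the sketch below classifies a file by comparing its minor version with the highest one the reader understands. The RFC does not define this policy. The function name, the return labels, and the treatment of a missing field as "unknown" are all assumptions made for the example.

```python
from typing import Optional


def classify_file(file_minor_version: Optional[int],
                  supported_minor_version: int) -> str:
    """Hypothetical reader-side check based on a minor_version field.

    file_minor_version is None when the writer did not set the field,
    for example a file produced before the field existed.
    """
    if file_minor_version is None:
        # Assumption: absence tells the reader nothing about the features used.
        return "unknown"
    if file_minor_version <= supported_minor_version:
        return "supported"
    # The file may use features added after this reader was written.
    return "may-contain-unknown-features"
```

A reader could use the last outcome to fail early with a clear message instead of hitting an unrecognized encoding partway through a column.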
Do these changes have PoC implementations?
Not yet