feat(format): schema evolution for the Java row codec#3714
stevenschlansker wants to merge 11 commits into
Add withSchemaEvolution() to the row/array/map codecs so a current-version codec can read payloads written by older versions of the same bean. Versions are declared with @ForyVersion(since/until) on fields/accessors and @ForySchema(removedFields=...) for deleted fields. Decode dispatches on an 8-byte strict hash at the head of the payload (the row reuses its hash slot with a stricter, name- and nullability-sensitive hash; array/map gain an 8-byte prefix) to a per-version projection codec. The non-evolution path is unchanged: projection state is null when evolution is off, so unversioned codecs are byte-for-byte identical and pay no decode cost. Includes the one-allocation-per-encode path for evolution-enabled array/map codecs and the row-format allocation probe.
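For readers skimming the design, here is a minimal sketch of hash-prefix dispatch as described above. It is not the PR's code: the class HashPrefixDispatch, its register/decode/frame methods, the little-endian byte order, and the String-returning decoders are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: route a payload to a per-version decoder by its leading 8-byte hash. */
final class HashPrefixDispatch {
  // Hypothetical table: strict schema hash -> decoder for that historical layout.
  private final Map<Long, Function<ByteBuffer, String>> decoders = new HashMap<>();

  void register(long strictHash, Function<ByteBuffer, String> decoder) {
    // Fail at registration time rather than silently replacing an entry.
    if (decoders.putIfAbsent(strictHash, decoder) != null) {
      throw new IllegalStateException("strict hash collision: " + Long.toHexString(strictHash));
    }
  }

  String decode(byte[] payload) {
    // Byte order is an assumption of this sketch, not taken from the PR.
    ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
    long hash = buf.getLong(); // the 8-byte prefix written by the producer
    Function<ByteBuffer, String> decoder = decoders.get(hash);
    if (decoder == null) {
      throw new IllegalArgumentException("unknown schema hash: " + Long.toHexString(hash));
    }
    return decoder.apply(buf); // buffer is now positioned just past the prefix
  }

  /** Test helper: an 8-byte hash prefix followed by a single int body. */
  static byte[] frame(long strictHash, int body) {
    return ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(strictHash)
        .putInt(body)
        .array();
  }
}
```

In this sketch an unknown hash fails loudly instead of falling back to the current layout, which mirrors the fail-loud behavior the description asks for on mismatched peers.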
Cross-product over the versions of distinct nested versioned beans so a nested bean evolves to arbitrary depth, dispatched by the inner bean's own strict hash rather than a version number. ProjectionRouting maps each combination to a generated projection class, identified by a class-name suffix carrying the inner simple name, version, and strict-hash low bits. Folds in the projection-path decode hardening and allocation trims, the generation-time logging-unit fix, the isBean gate on the version-history probe, the row-format guide's schema-evolution section, the JMH benchmark suite, and the spotless pass over the new sources.
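To make the cross-product concrete, a small sketch follows. It only enumerates one version per nested bean class; the class VersionCrossProduct, the use of plain integers for versions, and the bean names in the usage note are hypothetical and do not reflect how ProjectionRouting is actually implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: enumerate every assignment of one version per nested bean class. */
final class VersionCrossProduct {
  static List<Map<String, Integer>> combinations(Map<String, List<Integer>> versionsByBean) {
    List<Map<String, Integer>> out = new ArrayList<>();
    out.add(new LinkedHashMap<>()); // start from the single empty assignment
    for (Map.Entry<String, List<Integer>> e : versionsByBean.entrySet()) {
      List<Map<String, Integer>> next = new ArrayList<>();
      for (Map<String, Integer> partial : out) {
        for (Integer version : e.getValue()) {
          // Extend each partial assignment with one choice for this bean class.
          Map<String, Integer> extended = new LinkedHashMap<>(partial);
          extended.put(e.getKey(), version);
          next.add(extended);
        }
      }
      out = next;
    }
    return out;
  }
}
```

With a hypothetical Address bean at three versions and a Detail bean at two, this yields six combinations, which is why the number of projections grows with the product of the per-class version counts.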
Discover versioned beans nested in List/Map field values (arrays use the component type, not the element type) so their older versions are enumerated and projected. Rejects finite @ForyVersion(until) on a live field and drops RECORD_COMPONENT from the @ForyVersion target for Java 11 module safety (FIELD+METHOD covers records). Carries the resolved schema in the RowFactory, precomputed once at build time, instead of mutating the builder.
…ojection Position-scope the bean codec registration so a same-class key and value bean do not share one beanEncoderMap entry; the key is always read at the current schema while the value projects to its historical schema. Covers both the eager-encode and lazy-decode key positions.
…ersioned bean Take the evolution path for a top-level array/map whose element or value is a versioned bean, so the strict-hash prefix is always present and producer and consumer stay wire-compatible. Drops an unreachable live-field since/until check and covers added reference and collection field defaults on row evolution.
…ested Thread the TypeResolutionContext through the core TypeUtils map key/value isBean branch so an interface bean is discovered as a map key/value rather than rejected. Throws for non-accessor methods colliding with an absent projection field, warns on a large projection cross-product, rejects @ForyVersion(since) below the first schema version, and documents the Java-only wire framing in the row format specification.
A versioned bean used as a map key was read at the current schema only, so an older key decoded against the current layout and corrupted silently. Make map keys evolve like values: the map header's single 8-byte hash now identifies the (key-layout, value-layout) combination jointly (FNV-1a mix of the two strict hashes), so the writer still emits one hash per map and the reader dispatches both positions to the matching historical projection codec. MapCodecBuilder builds a key history alongside the value history and enumerates their cross-product, generating one projection map codec per combination keyed by the combined hash; a position with no versioned bean contributes a single current-only layout. Its NON_BEAN_POSITION_HASH sentinel is 0L, which is also a legitimate FNV result for a real schema, so the sentinel's safety rests on the build-time collision guards in buildVersioned rather than on 0L being unreachable. MapEncoderBuilder gained a key codec suffix parallel to the value suffix so the key subtree routes to its historical row codec instead of being pinned to current; the generated map class name namespaces the key suffix with _K so (keyOld, valCurrent) and (keyCurrent, valOld) do not collapse onto one class. Encoding and both codec formats thread the key suffix through. The map header hash derivation changes for all evolution-enabled maps; the feature is unreleased so there is no compatibility constraint. Tests cover key-only evolution, both-sides evolution, the cross-combination class-name collision, and the compact format.
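A rough sketch of the combined map hash and its build-time guard, for illustration. It uses the standard 64-bit FNV-1a constants; the byte order in which each strict hash is mixed, the class CombinedMapHash, and the register helper are assumptions and may differ from SchemaHistory.combineHashes and MapCodecBuilder.buildVersioned.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: one 8-byte header hash identifying (key-layout, value-layout) jointly. */
final class CombinedMapHash {
  private static final long FNV_OFFSET = 0xcbf29ce484222325L; // FNV-1a 64-bit offset basis
  private static final long FNV_PRIME = 0x100000001b3L; // FNV-1a 64-bit prime

  // Sentinel named in the commit message for a position with no versioned bean.
  static final long NON_BEAN_POSITION_HASH = 0L;

  /** Mixes the eight bytes of value into h, low byte first (an assumption of this sketch). */
  static long mix(long h, long value) {
    for (int i = 0; i < 8; i++) {
      h ^= (value >>> (8 * i)) & 0xffL;
      h *= FNV_PRIME;
    }
    return h;
  }

  static long combine(long keyHash, long valueHash) {
    return mix(mix(FNV_OFFSET, keyHash), valueHash);
  }

  /** Build-time guard: reject a combined-hash collision instead of overwriting an entry. */
  static void register(Map<Long, String> table, long keyHash, long valueHash, String codec) {
    long combined = combine(keyHash, valueHash);
    String previous = table.putIfAbsent(combined, codec);
    if (previous != null) {
      throw new IllegalStateException(
          "combined hash collision between " + previous + " and " + codec);
    }
  }
}
```

Because 0L can also be a real schema hash, nothing in combine treats the sentinel specially; safety comes from the guard in register, which is the point the commit message makes.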
Generalize SchemaHistory's per-field nested-bean enumeration from a single key/value dimension to a list of NestedSite entries, each carrying the map-branch path (KEY/VALUE per map crossed) from the field root to the bean leaf. collectNestedSites now descends both the map key (kv.f0) and value (kv.f1), so a versioned bean used as a map key inside a row field becomes its own cross-product dimension and evolves independently of the value, to arbitrary nesting depth. NestedSite.substitute follows the recorded path to substitute each historical struct into exactly its leaf, leaving other leaves intact; projectThroughWrapper (the top-level array/map path) follows a value-only path of the field's map depth, preserving prior behavior. Previously a row field typed Map<VersionedKey, V> with key-version skew between writer and reader threw ClassNotCompatibleException at decode, because the row strict hash includes the key struct but the key was never enumerated. The realistic case (distinct writer/reader key classes) needs no codegen change: distinct classes already route to distinct beanCodecKey/nestedBeanSuffix entries. Add a build-time guard in MapCodecBuilder.buildVersioned that rejects a combined (key, value) schema-hash collision instead of silently overwriting a projection entry, mirroring the row-path strict-hash collision check, and cover the case where a projection hash would shadow the current hash on the decode hot path. Tests: evolvingMapKeyInRowField (directly-typed key) and evolvingKeyOfNestedMap (key of a map nested under another map's value, exercising the multi-step branch path). Update the row-format guide to state keys evolve whether the map is top-level or nested, with the hash semantics for each.
… array/map A top-level array or map codec enumerated only one nested versioned bean and stamped that bean's strict hash, so an element that reached more than one distinct versioned bean class — for example List<Map<KeyBean, ValueBean>> with KeyBean and ValueBean evolving independently — generated a projection codec for only one of them and failed to compile a reference to the other. Enumerate the element/position over the same per-class cross-product the row-field path uses: SchemaHistory.forElement builds the element field's history from collectNestedSites, so the element schema's strict hash identifies every reachable bean's layout jointly and each combination carries its chosen versions in nestedBeanSchemas(). Codegen routing is generalized from a single rowCodecSuffix to a per-class Map<Class,String> (nestedBeanSuffix consults it, absent class meaning current schema), threaded through Encoding, Encoders, the array/map/compact builders, and BaseBinaryEncoderBuilder, so one generated codec embeds the right historical row codec for each nested class. ArrayCodecBuilder and MapCodecBuilder build via forElement; the map keeps its per-position combined-hash dispatch and evolves multi-bean wrappers within each position, preserving key non-nullability through projectedPositionField. The value-only projectThroughWrapper/mapDepth and the single-bean beanPath helper are removed; evolutionBean is reduced to the reachability-and-naming predicate it now is. Reaching a bean also means reaching through a non-versioned wrapper bean: a struct whose own fields are stable but which holds an evolving nested bean (such as a map-key struct holding an evolving detail) must itself become an evolution site, or no projection is generated and an older nested payload fails loud with a map self/peer hash mismatch. 
collectNestedSites now treats any row bean as a site when build()'s history has more than one version — covering both directly annotated beans and these transitive-only wrappers — replacing the direct-annotation-only isBeanWithVersioning probe with a plain isBean gate, since build() already expands the nested cross-product. A SchemaEvolutionStressTest case covers a map struct-key whose nested bean evolves. Per-class enumeration stays correct: a versioned bean is one class written at the latest version, so its key and value occurrences in a map are always the same version; a map with the same logical bean at mixed versions on the two sides is not a shape a real writer produces. The wire format is unchanged (still one strict-hash prefix per array/map), so the row format spec needs no change; the Java row-format guide now states keys and values evolve wherever the map appears, including top-level array/map wrappers. Also fold in review polish on the same files: document the ForyVersion.until no-upper-bound sentinel, import RecordComponent instead of fully qualifying it, share one FNV-1a helper (SchemaHistory.combineHashes) between the schema hash and the map combined hash, note why the row/array projection tables need no builder-side collision guard while the map's combined hash does, and unify RowEncoderBuilder's nested-bean suffix routing onto the inherited nestedClassSuffixes field instead of a parallel private map and override.
SchemaHistory.hashField mixed a StructType's constant name then recursed
into its children with no boundary marker. A struct's arity is variable
(unlike list/map, whose arity is fixed by the type kind), so
{a: struct<x>, b} and {a: struct<x, b>} mixed an identical byte sequence
and produced the same 64-bit strict hash. A non-injective strict hash can
route an older payload to the wrong projection codec or trip the
build-time collision guard on a legitimate evolution. Mix the struct's
child count before recursing so the hash stays injective over nesting.
Also tighten visibility and remove duplicate paths surfaced by review:
hoist the duplicated typeCtx() (the synthesize-interfaces resolution
context) into the shared BaseCodecBuilder owner; drop MapCodecBuilder's
one-line combinedHash forwarder in favor of calling
SchemaHistory.combineHashes directly; and narrow computeStrictSchemaHash
to package-private for the new test.
Add SchemaHistoryTest covering the struct-boundary collision, its minimal
empty-struct form, and that structurally identical schemas still hash
equal.
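The collision is easy to reproduce in isolation. The sketch below is a toy model, not SchemaHistory.hashField: the Field type, the "int" and "struct" kind names, the single-byte arity, and the FNV-1a mixing are assumptions chosen so the two schemas from the message can be compared with and without the child count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: why a struct's child count must be mixed before recursing. */
final class StructBoundaryHash {
  static final class Field {
    final String name;
    final String kind; // "int" or "struct" in this toy model
    final List<Field> children;

    Field(String name, String kind, List<Field> children) {
      this.name = name;
      this.kind = kind;
      this.children = children;
    }

    static Field leaf(String name) {
      return new Field(name, "int", Collections.emptyList());
    }

    static Field struct(String name, Field... children) {
      return new Field(name, "struct", Arrays.asList(children));
    }
  }

  private static final long FNV_OFFSET = 0xcbf29ce484222325L;
  private static final long FNV_PRIME = 0x100000001b3L;

  static long mixBytes(long h, byte[] bytes) {
    for (byte b : bytes) {
      h ^= (b & 0xff);
      h *= FNV_PRIME;
    }
    return h;
  }

  static long hashField(long h, Field f, boolean mixArity) {
    h = mixBytes(h, f.name.getBytes(StandardCharsets.UTF_8));
    h = mixBytes(h, f.kind.getBytes(StandardCharsets.UTF_8));
    if (f.kind.equals("struct")) {
      if (mixArity) {
        // The fix: mark the struct boundary by mixing its child count first.
        h = mixBytes(h, new byte[] {(byte) f.children.size()});
      }
      for (Field child : f.children) {
        h = hashField(h, child, mixArity);
      }
    }
    return h;
  }

  /** Hashes the top-level fields in order; no top-level count is mixed in this sketch. */
  static long hashRow(List<Field> fields, boolean mixArity) {
    long h = FNV_OFFSET;
    for (Field f : fields) {
      h = hashField(h, f, mixArity);
    }
    return h;
  }
}
```

With mixArity off, {a: struct<x>, b} and {a: struct<x, b>} feed the same bytes to the hash and collide; with it on, the two byte streams differ in exactly one byte, and since each FNV-1a step is a bijection on the state for a fixed input byte, the hashes cannot re-converge.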
…ecode Schema-evolution projection codecs were compiled eagerly at builder time: the row/array/map builders enumerated the full historical cross-product and loaded one generated codec class per combination, so a deep nested version history paid its whole class cost up front whether or not those versions ever appeared on the wire. Defer that compilation to decode. Each builder now builds an immutable hash -> ProjectionSource index that holds only the inputs (the VersionedSchema and codegen context); the codec class is compiled the first time a payload with that hash is decoded and cached in a per-encoder LongMap. No new synchronization: encoders are single-threaded, and the class compile is already memoized globally by the shared code generator, so a concurrent first-miss on the same hash compiles the class once. The build-time collision guards stay eager over the full cross-product (row/array via SchemaHistory's strict-hash guard, map via its two combined-hash guards), so a hash clash still fails fast at build rather than on an unlucky decode. Because the build-time class count no longer tracks annotated history, the projection-count warning no longer flags a real cost; remove PROJECTION_COUNT_WARN_THRESHOLD and warnIfManyProjections. Also fix stale Javadoc that claimed map keys are always read at the current schema and that map support is in progress, and update the row-format guide's generation-cost note to the lazy model.
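The lazy model can be sketched as below. This is an approximation for discussion only: LazyProjectionCache is a hypothetical name, a Supplier stands in for ProjectionSource, and java.util.HashMap stands in for the per-encoder LongMap.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: build the hash index eagerly, compile each codec on first decode. */
final class LazyProjectionCache<C> {
  // Immutable index built once at builder time; holds inputs, not compiled codecs.
  private final Map<Long, Supplier<C>> sources;
  // Per-encoder cache; unsynchronized because this sketch assumes single-threaded encoders.
  private final Map<Long, C> compiled = new HashMap<>();

  LazyProjectionCache(Map<Long, Supplier<C>> sources) {
    this.sources = Collections.unmodifiableMap(new HashMap<>(sources));
  }

  C codecFor(long hash) {
    C codec = compiled.get(hash);
    if (codec == null) {
      Supplier<C> source = sources.get(hash);
      if (source == null) {
        throw new IllegalArgumentException("unknown schema hash: " + Long.toHexString(hash));
      }
      codec = source.get(); // first miss: pay the compile cost now
      compiled.put(hash, codec);
    }
    return codec;
  }

  int compiledCount() {
    return compiled.size();
  }
}
```

In this shape, a version that never appears on the wire never triggers its supplier, while any collision check over the full index can still run in the constructor's caller at build time.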
@chaokunyang, this is not 100% done yet, but I think it is getting very close. Would you mind doing a design review to make sure that this feature will be acceptable to commit into the project? It is not a small feature, but I think the complexity is justified by giving the row format a low-cost method of achieving schema evolution. If there is some way to streamline the eventual code review of a large volume of LLM-typed code, please let me know. I intend to test this internally over the coming month or so, and mark the PR as ready once we have some real-world experience running it.
Opt in with .withSchemaEvolution() on any row, array, or map codec builder. Fields carry @ForyVersion(since, until); removed fields are listed on a nested interface referenced from @ForySchema(removedFields = ...). Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats are supported.
Why?
Currently, changing a row format schema definition in any way invalidates all existing records.
What does this PR do?
Proposes a new concept of format versions: each succeeding version may add or remove fields from types, and the deserialization machinery picks the version based on the schema hash.
AI Contribution Checklist
AI Usage Disclosure
yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.
Does this PR introduce any user-facing change?
New codec option: schema evolution. Some small annotations and a builder method.
Existing row format compatibility unchanged
Benchmark
withSchemaEvolution() is an opt-in feature that adds a new row-codec path; it does not modify any existing serialization hot path. There is no apache/main baseline to compare against, because SchemaEvolutionSuite exercises withSchemaEvolution(), which does not exist on main, so the benchmark measures two things directly: the steady-state cost of enabling the flag, and projection-vs-current parity.
Bounded JMH run (JDK 26, 2 forks × 4 iterations, 1s each, -prof gc). B/op is gc.alloc.rate.norm (bytes allocated per operation).
Benchmarks: currentDecode, currentDecodeNoEvolution, encode, encodeNoEvolution, compactCurrentDecode, compactCurrentDecodeNoEvolution, compactEncode, compactEncodeNoEvolution, olderDecode, compactOlderDecode
Findings:
B/op is byte-identical between the evolution-on and *NoEvolution variants on every path (decode 312/312, encode 152/152, compact decode 280/280, compact encode 144/144).
~10% confidence intervals on the no-evolution variants, so the throughput claim is "no measurable difference," not a tight bound;
allocation is exact.
vs 280 compact) because it reads the narrower V1 schema, not because projection is inherently cheaper. Each projection codec holds its
historical schema's precomputed row layout, so there is no per-decode rebuild.
Limitations
withSchemaEvolution() flag; the two framings are not wire-compatible. A flag-mismatched peer fails loudly with ClassNotCompatibleException (except evolution-off reading evolution-on bytes, which is undefined). Adopt by enabling the flag on both sides in a release that changes no schema, then evolve schemas once every peer is on the new build.
Data cannot be upgraded since the original format has no hash slot for array or map.
C++) cannot read them.
version counts of the distinct nested versioned bean classes. Retire entries from a bean's History interface once you no longer need to read payloads from that range to bound the growth.