feat(format): schema evolution for the Java row codec by stevenschlansker · Pull Request #3714 · apache/fory

stevenschlansker · 2026-05-29T00:07:03Z

Opt in with .withSchemaEvolution() on any row, array, or map codec builder. Fields carry @ForyVersion(since, until); removed fields are listed on a nested interface referenced from
@ForySchema(removedFields = ...). Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats supported.

Why?

Currently, changing a row-format schema definition in any way invalidates all existing records.

What does this PR do?

Introduces format versions: each successive version may add or remove fields from a type, and the deserialization machinery picks the version to decode from the payload's schema hash.

  Schema evolution for the Java row codec

  Lets a consumer built from the current bean decode rows written by older versions of that bean.

  Adds opt-in schema evolution to the Java row codec, enabled with
  `withSchemaEvolution()` on the bean/array/map codec builders. A reader can decode
  payloads written against older versions of a bean by dispatching on a per-payload
  strict schema hash to a projection codec that reads the historical layout and
  maps it onto the current type, applying defaults for fields that did not yet
  exist and discarding fields that have been removed.

  Field history is declared with `@ForyVersion(since/until)` on live fields and on
  a `@ForySchema(removedFields = ...)` history interface for removed fields.
  `SchemaHistory` enumerates the version history, including the cross-product over
  nested versioned beans; each historical layout gets a projection codec whose row
  layout is precomputed once so projection decode costs the same per call as
  current-schema decode. The wire format uses an 8-byte strict-hash slot
  (reusing the row's existing hash slot, and new for evolution-enabled array/map elements),
  and producer/consumer must agree on the flag.

  Covers the standard and compact row formats, records (including
  `@ForyVersion` on record components) and interface beans, and nested versioned
  beans to arbitrary depth.


  API
  - `@ForyVersion` — since/until version window on a field, record component, or accessor method, defining when that field is present in the schema history.
  - `@ForySchema` — declares a bean's evolution intent and points at a nested interface describing removed fields.
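
To make the version-window idea concrete, here is a minimal sketch of how a since/until window could select the fields present at a given version. The two annotations below are local stand-ins declared only for this example; their attribute types, their defaults, the `Void.class` placeholder, and the assumption that `until` is exclusive are all illustrative and may not match the PR's real `@ForyVersion` and `@ForySchema`. `Order`, `History`, and `fieldsAt` are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

class VersionWindowSketch {
  // Stand-in for the PR's @ForyVersion (assumed shape, not the real declaration).
  @Retention(RetentionPolicy.RUNTIME)
  @Target({ElementType.FIELD, ElementType.METHOD})
  @interface ForyVersion {
    int since() default 1;
    int until() default Integer.MAX_VALUE; // assumed "no upper bound" sentinel
  }

  // Stand-in for @ForySchema(removedFields = ...) (assumed shape).
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface ForySchema {
    Class<?> removedFields() default Void.class;
  }

  @ForySchema(removedFields = Order.History.class)
  interface Order {
    long id(); // unannotated: treated as present since version 1

    @ForyVersion(since = 2)
    String currency(); // added in version 2

    interface History { // accessors for fields that no longer exist on the bean
      @ForyVersion(since = 1, until = 3)
      int legacyFlags(); // assumed window: versions 1 and 2 (until is exclusive here)
    }
  }

  // Names of the fields present in the bean's schema at the given version.
  static Set<String> fieldsAt(Class<?> bean, int version) {
    Set<String> names = new TreeSet<>();
    collect(bean, version, names);
    ForySchema schema = bean.getAnnotation(ForySchema.class);
    if (schema != null && schema.removedFields() != Void.class) {
      collect(schema.removedFields(), version, names);
    }
    return names;
  }

  private static void collect(Class<?> type, int version, Set<String> out) {
    for (Method m : type.getDeclaredMethods()) {
      if (m.isSynthetic()) {
        continue;
      }
      ForyVersion v = m.getAnnotation(ForyVersion.class);
      int since = v == null ? 1 : v.since();
      int until = v == null ? Integer.MAX_VALUE : v.until();
      if (version >= since && version < until) {
        out.add(m.getName());
      }
    }
  }

  public static void main(String[] args) {
    for (int v = 1; v <= 3; v++) {
      System.out.println("v" + v + " -> " + fieldsAt(Order.class, v));
    }
  }
}
```

Under these assumptions, version 1 contains `id` and `legacyFlags`, version 2 adds `currency`, and version 3 drops `legacyFlags`.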

  How it works

  Row payloads already carry an 8-byte schema hash. Evolution-enabled encoders keep
  a map of historical hashes; on decode, a hash mismatch is resolved against that
  map instead of failing.

  - `SchemaHistory` derives the ordered historical schemas from the version
    annotations and a strict hash per version (hash includes field name and
    nullability).
  - `RowCodecBuilder` generates one projection codec per historical version
    (`_V<n>` classes), each reading its historical schema and producing
    current-bean instances.
  - `BinaryRowEncoder.decode` dispatches on the peer hash: exact match → current
    codec; known historical hash → projection codec; else
    `ClassNotCompatibleException`.
  - Nested versioned beans dispatch recursively by strict hash, routed through
    array and map projection codecs so versioning composes through `List<Bean>` and
    `Map<_, Bean>`.
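
The dispatch steps above can be sketched as a lookup on the peer hash. This is a toy model, not the PR's `BinaryRowEncoder`: `Payload`, `Order`, the two hash constants, and the `null` default for the missing field are hypothetical, and `IllegalStateException` stands in for `ClassNotCompatibleException` so the example has no Fory dependency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class HashDispatchSketch {
  // Toy payload: a schema hash followed by positional field values.
  static final class Payload {
    final long schemaHash;
    final Object[] fields;

    Payload(long schemaHash, Object... fields) {
      this.schemaHash = schemaHash;
      this.fields = fields;
    }
  }

  // Current bean; `currency` is assumed to have been added after V1.
  static final class Order {
    final long id;
    final String currency;

    Order(long id, String currency) {
      this.id = id;
      this.currency = currency;
    }
  }

  static final long CURRENT_HASH = 0x2222L; // illustrative values only
  static final long V1_HASH = 0x1111L;

  private final Map<Long, Function<Payload, Order>> projections = new HashMap<>();

  HashDispatchSketch() {
    // The V1 layout is {id}. Project it onto the current type and default the missing field.
    projections.put(V1_HASH, p -> new Order((Long) p.fields[0], null));
  }

  Order decode(Payload p) {
    if (p.schemaHash == CURRENT_HASH) {
      // Exact match: decode with the current layout {id, currency}.
      return new Order((Long) p.fields[0], (String) p.fields[1]);
    }
    Function<Payload, Order> projection = projections.get(p.schemaHash);
    if (projection == null) {
      // Neither current nor a known historical hash: fail loudly.
      throw new IllegalStateException(
          "unknown schema hash 0x" + Long.toHexString(p.schemaHash));
    }
    return projection.apply(p);
  }

  public static void main(String[] args) {
    HashDispatchSketch codec = new HashDispatchSketch();
    Order current = codec.decode(new Payload(CURRENT_HASH, 7L, "EUR"));
    Order older = codec.decode(new Payload(V1_HASH, 9L));
    System.out.println(current.id + " " + current.currency);
    System.out.println(older.id + " " + older.currency);
  }
}
```

The current-schema check runs first, so a payload written at the current version never touches the projection map.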

  Performance

  Both decode paths reuse the `CompactRowLayout` cache from #3717. The
  current-schema path allocates rows via `writer.newRow()`; the historical path via
  a per-projection `RowFactory` that holds its historical schema's layout, built
  once. Per-decode cost is the same on both — no layout recomputation. The
  evolution-enabled array/map codecs keep a single allocation per encode.

AI Contribution Checklist

AI Usage Disclosure

substantial_ai_assistance: yes
scope: all
affected_files_or_subsystems: row format Java
ai_review: <line-by-line self-review completed; summarize the two-reviewer loop and final no-further-comments result>
ai_review_artifacts:
human_verification: <checks run locally or in CI + pass/fail summary + contributor reviewed results>
performance_verification: ✔️
provenance_license_confirmation: ✔️
If yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
If yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.

Does this PR introduce any user-facing change?

New opt-in codec option: schema evolution, exposed through two annotations (`@ForyVersion`, `@ForySchema`) and a builder method (`withSchemaEvolution()`).
Existing row-format compatibility is unchanged.

Benchmark

withSchemaEvolution() is an opt-in feature that adds a new row-codec path; it does not modify any existing serialization hot path.
There is no apache/main baseline to compare against — SchemaEvolutionSuite exercises withSchemaEvolution(), which does not exist
on main — so the benchmark measures two things directly: the steady-state cost of enabling the flag, and projection-vs-current
parity.

Bounded JMH run (JDK 26, 2 forks × 4 iterations, 1s each, -prof gc). B/op is gc.alloc.rate.norm (bytes allocated per operation).

| Benchmark | Throughput (ops/s) | B/op |
| --- | --- | --- |
| `currentDecode` | 17.6M | 312 |
| `currentDecodeNoEvolution` | 16.6M | 312 |
| `encode` | 15.9M | 152 |
| `encodeNoEvolution` | 15.8M | 152 |
| `compactCurrentDecode` | 16.4M | 280 |
| `compactCurrentDecodeNoEvolution` | 16.3M | 280 |
| `compactEncode` | 16.4M | 144 |
| `compactEncodeNoEvolution` | 15.5M | 144 |
| `olderDecode` | 24.5M | 216 |
| `compactOlderDecode` | 24.9M | 192 |

Findings:

- Enabling evolution adds zero allocation on the current path. B/op is byte-identical between the evolution-on and `*NoEvolution` variants on every path (decode 312/312, encode 152/152, compact decode 280/280, compact encode 144/144).
- Throughput overhead of the flag is within the run's noise band. Every on-vs-off pair overlaps within error. This bounded run has ~10% confidence intervals on the no-evolution variants, so the throughput claim is "no measurable difference," not a tight bound; allocation is exact.
- Projection (older-version) decode is not penalized versus current decode. It allocates less here (216 vs 312 B/op standard; 192 vs 280 compact) because it reads the narrower V1 schema, not because projection is inherently cheaper. Each projection codec holds its historical schema's precomputed row layout, so there is no per-decode rebuild.

Limitations

- Producer and consumer must agree on the `withSchemaEvolution()` flag; the two framings are not wire-compatible. A flag-mismatched peer fails loudly with `ClassNotCompatibleException` (except evolution-off reading evolution-on bytes, which is undefined). Adopt by enabling the flag on both sides in a release that changes no schema, then evolve schemas once every peer is on the new build.
- Existing data cannot be upgraded in place, because the original format has no hash slot for arrays or maps.
- Evolution-enabled payloads are Java-only; cross-language consumers (Python, C++) cannot read them.
- The number of generated projection codec classes grows as the product of the version counts of the distinct nested versioned bean classes. To bound the growth, retire entries from a bean's History interface once you no longer need to read payloads from that range.
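
As a rough illustration of the last limitation, the number of layout combinations is the product of the per-class version counts. The helper below is a hypothetical back-of-the-envelope calculator, not part of the PR, and it counts every combination including the current one.

```java
class ProjectionCountSketch {
  // Product of the version counts of the distinct nested versioned bean classes.
  static long combinations(int... versionCounts) {
    long total = 1;
    for (int count : versionCounts) {
      total = Math.multiplyExact(total, (long) count); // throws on overflow
    }
    return total;
  }

  public static void main(String[] args) {
    // Three nested versioned beans with 3, 4, and 2 versions respectively.
    System.out.println(combinations(3, 4, 2)); // prints 24
  }
}
```

Retiring one version from the middle bean in this example would drop the total from 24 to 18.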

Add withSchemaEvolution() to the row/array/map codecs so a current-version codec can read payloads written by older versions of the same bean. Versions are declared with @ForyVersion(since/until) on fields/accessors and @ForySchema(removedFields=...) for deleted fields. Decode dispatches on an 8-byte strict hash at the head of the payload (the row reuses its hash slot with a stricter, name- and nullability-sensitive hash; array/map gain an 8-byte prefix) to a per-version projection codec. The non-evolution path is unchanged: projection state is null when evolution is off, so unversioned codecs are byte-for-byte identical and pay no decode cost. Includes the one-allocation-per-encode path for evolution-enabled array/map codecs and the row-format allocation probe.
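
The 8-byte prefix for array/map payloads can be pictured with the following sketch. It is an assumption-laden illustration: the little-endian byte order, the `frame`/`peekHash`/`body` helper names, and the opaque `body` bytes are all hypothetical, and the real codecs write into Fory's own buffers rather than `ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class HashPrefixFramingSketch {
  // Writes the strict hash, then the body, into a single allocation.
  static byte[] frame(long strictHash, byte[] body) {
    ByteBuffer buf =
        ByteBuffer.allocate(Long.BYTES + body.length).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(strictHash).put(body);
    return buf.array();
  }

  // Reads the strict hash without consuming the payload.
  static long peekHash(byte[] framed) {
    if (framed.length < Long.BYTES) {
      throw new IllegalArgumentException("payload shorter than the 8-byte hash prefix");
    }
    return ByteBuffer.wrap(framed).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
  }

  // Returns the bytes that follow the prefix.
  static byte[] body(byte[] framed) {
    return Arrays.copyOfRange(framed, Long.BYTES, framed.length);
  }

  public static void main(String[] args) {
    byte[] framed = frame(0xABCDL, new byte[] {1, 2, 3});
    System.out.println(Long.toHexString(peekHash(framed))); // prints abcd
    System.out.println(Arrays.toString(body(framed))); // prints [1, 2, 3]
  }
}
```

A reader that expects the prefix and a writer that omits it would disagree about where the body starts, which is why both sides have to agree on the flag.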

Cross-product over the versions of distinct nested versioned beans so a nested bean evolves to arbitrary depth, dispatched by the inner bean's own strict hash rather than a version number. ProjectionRouting maps each combination to a generated projection class, identified by a class-name suffix carrying the inner simple name, version, and strict-hash low bits. Folds in the projection-path decode hardening and allocation trims, the generation-time logging-unit fix, the isBean gate on the version-history probe, the row-format guide's schema-evolution section, the JMH benchmark suite, and the spotless pass over the new sources.

Discover versioned beans nested in List/Map field values (arrays use the component type, not the element type) so their older versions are enumerated and projected. Rejects finite @ForyVersion(until) on a live field and drops RECORD_COMPONENT from the @ForyVersion target for Java 11 module safety (FIELD+METHOD covers records). Carries the resolved schema in the RowFactory, precomputed once at build time, instead of mutating the builder.

…ojection Position-scope the bean codec registration so a same-class key and value bean do not share one beanEncoderMap entry; the key is always read at the current schema while the value projects to its historical schema. Covers both the eager-encode and lazy-decode key positions.

…ersioned bean Take the evolution path for a top-level array/map whose element or value is a versioned bean, so the strict-hash prefix is always present and producer and consumer stay wire-compatible. Drops an unreachable live-field since/until check and covers added reference and collection field defaults on row evolution.

…ested Thread the TypeResolutionContext through the core TypeUtils map key/value isBean branch so an interface bean is discovered as a map key/value rather than rejected. Throws for non-accessor methods colliding with an absent projection field, warns on a large projection cross-product, rejects @ForyVersion(since) below the first schema version, and documents the Java-only wire framing in the row format specification.

A versioned bean used as a map key was read at the current schema only, so an older key decoded against the current layout and corrupted silently. Make map keys evolve like values: the map header's single 8-byte hash now identifies the (key-layout, value-layout) combination jointly (FNV-1a mix of the two strict hashes), so the writer still emits one hash per map and the reader dispatches both positions to the matching historical projection codec. MapCodecBuilder builds a key history alongside the value history and enumerates their cross-product, generating one projection map codec per combination keyed by the combined hash; a position with no versioned bean contributes a single current-only layout. Its NON_BEAN_POSITION_HASH sentinel is 0L, which is also a legitimate FNV result for a real schema, so the sentinel's safety rests on the build-time collision guards in buildVersioned rather than on 0L being unreachable. MapEncoderBuilder gained a key codec suffix parallel to the value suffix so the key subtree routes to its historical row codec instead of being pinned to current; the generated map class name namespaces the key suffix with _K so (keyOld, valCurrent) and (keyCurrent, valOld) do not collapse onto one class. Encoding and both codec formats thread the key suffix through. The map header hash derivation changes for all evolution-enabled maps; the feature is unreleased so there is no compatibility constraint. Tests cover key-only evolution, both-sides evolution, the cross-combination class-name collision, and the compact format.
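
A minimal sketch of the combined map hash and its build-time collision guard follows. It assumes a byte-wise 64-bit FNV-1a over the key hash and then the value hash in little-endian order; the real `SchemaHistory.combineHashes` may mix differently. `register` and the codec-name strings are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MapCombinedHashSketch {
  private static final long FNV_OFFSET = 0xcbf29ce484222325L; // 64-bit FNV offset basis
  private static final long FNV_PRIME = 0x100000001b3L; // 64-bit FNV prime

  // FNV-1a over the eight bytes of `value`, low byte first (assumed order).
  static long mixLong(long hash, long value) {
    for (int i = 0; i < 8; i++) {
      hash ^= (value >>> (8 * i)) & 0xffL;
      hash *= FNV_PRIME;
    }
    return hash;
  }

  // One hash per map, identifying the (key layout, value layout) pair jointly.
  static long combineHashes(long keyHash, long valueHash) {
    return mixLong(mixLong(FNV_OFFSET, keyHash), valueHash);
  }

  // Build-time guard: reject a clash instead of silently overwriting an entry.
  static void register(Map<Long, String> table, long combined, String codecName) {
    String previous = table.putIfAbsent(combined, codecName);
    if (previous != null) {
      throw new IllegalStateException(
          "combined hash collision between " + previous + " and " + codecName);
    }
  }

  public static void main(String[] args) {
    long[] keyHashes = {0x11L, 0x12L}; // illustrative strict hashes per key version
    long[] valueHashes = {0x21L, 0x22L}; // illustrative strict hashes per value version
    Map<Long, String> table = new HashMap<>();
    for (int k = 0; k < keyHashes.length; k++) {
      for (int v = 0; v < valueHashes.length; v++) {
        register(table, combineHashes(keyHashes[k], valueHashes[v]), "_K" + k + "_V" + v);
      }
    }
    System.out.println(table.size() + " projection entries");
  }
}
```

Because the key is mixed before the value, the pair is ordered, and the guard turns any remaining clash into a build failure rather than a misrouted decode.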

Generalize SchemaHistory's per-field nested-bean enumeration from a single key/value dimension to a list of NestedSite entries, each carrying the map-branch path (KEY/VALUE per map crossed) from the field root to the bean leaf. collectNestedSites now descends both the map key (kv.f0) and value (kv.f1), so a versioned bean used as a map key inside a row field becomes its own cross-product dimension and evolves independently of the value, to arbitrary nesting depth. NestedSite.substitute follows the recorded path to substitute each historical struct into exactly its leaf, leaving other leaves intact; projectThroughWrapper (the top-level array/map path) follows a value-only path of the field's map depth, preserving prior behavior. Previously a row field typed Map<VersionedKey, V> with key-version skew between writer and reader threw ClassNotCompatibleException at decode, because the row strict hash includes the key struct but the key was never enumerated. The realistic case (distinct writer/reader key classes) needs no codegen change: distinct classes already route to distinct beanCodecKey/nestedBeanSuffix entries. Add a build-time guard in MapCodecBuilder.buildVersioned that rejects a combined (key, value) schema-hash collision instead of silently overwriting a projection entry, mirroring the row-path strict-hash collision check, and cover the case where a projection hash would shadow the current hash on the decode hot path. Tests: evolvingMapKeyInRowField (directly-typed key) and evolvingKeyOfNestedMap (key of a map nested under another map's value, exercising the multi-step branch path). Update the row-format guide to state keys evolve whether the map is top-level or nested, with the hash semantics for each.

… array/map A top-level array or map codec enumerated only one nested versioned bean and stamped that bean's strict hash, so an element that reached more than one distinct versioned bean class — for example List<Map<KeyBean, ValueBean>> with KeyBean and ValueBean evolving independently — generated a projection codec for only one of them and failed to compile a reference to the other. Enumerate the element/position over the same per-class cross-product the row-field path uses: SchemaHistory.forElement builds the element field's history from collectNestedSites, so the element schema's strict hash identifies every reachable bean's layout jointly and each combination carries its chosen versions in nestedBeanSchemas(). Codegen routing is generalized from a single rowCodecSuffix to a per-class Map<Class,String> (nestedBeanSuffix consults it, absent class meaning current schema), threaded through Encoding, Encoders, the array/map/compact builders, and BaseBinaryEncoderBuilder, so one generated codec embeds the right historical row codec for each nested class. ArrayCodecBuilder and MapCodecBuilder build via forElement; the map keeps its per-position combined-hash dispatch and evolves multi-bean wrappers within each position, preserving key non-nullability through projectedPositionField. The value-only projectThroughWrapper/mapDepth and the single-bean beanPath helper are removed; evolutionBean is reduced to the reachability-and-naming predicate it now is. Reaching a bean also means reaching through a non-versioned wrapper bean: a struct whose own fields are stable but which holds an evolving nested bean (such as a map-key struct holding an evolving detail) must itself become an evolution site, or no projection is generated and an older nested payload fails loud with a map self/peer hash mismatch. 
collectNestedSites now treats any row bean as a site when build()'s history has more than one version — covering both directly annotated beans and these transitive-only wrappers — replacing the direct-annotation-only isBeanWithVersioning probe with a plain isBean gate, since build() already expands the nested cross-product. A SchemaEvolutionStressTest case covers a map struct-key whose nested bean evolves. Per-class enumeration stays correct: a versioned bean is one class written at the latest version, so its key and value occurrences in a map are always the same version; a map with the same logical bean at mixed versions on the two sides is not a shape a real writer produces. The wire format is unchanged (still one strict-hash prefix per array/map), so the row format spec needs no change; the Java row-format guide now states keys and values evolve wherever the map appears, including top-level array/map wrappers. Also fold in review polish on the same files: document the ForyVersion.until no-upper-bound sentinel, import RecordComponent instead of fully qualifying it, share one FNV-1a helper (SchemaHistory.combineHashes) between the schema hash and the map combined hash, note why the row/array projection tables need no builder-side collision guard while the map's combined hash does, and unify RowEncoderBuilder's nested-bean suffix routing onto the inherited nestedClassSuffixes field instead of a parallel private map and override.

SchemaHistory.hashField mixed a StructType's constant name then recursed into its children with no boundary marker. A struct's arity is variable (unlike list/map, whose arity is fixed by the type kind), so {a: struct<x>, b} and {a: struct<x, b>} mixed an identical byte sequence and produced the same 64-bit strict hash. A non-injective strict hash can route an older payload to the wrong projection codec or trip the build-time collision guard on a legitimate evolution. Mix the struct's child count before recursing so the hash stays injective over nesting. Also tighten visibility and remove duplicate paths surfaced by review: hoist the duplicated typeCtx() (the synthesize-interfaces resolution context) into the shared BaseCodecBuilder owner; drop MapCodecBuilder's one-line combinedHash forwarder in favor of calling SchemaHistory.combineHashes directly; and narrow computeStrictSchemaHash to package-private for the new test. Add SchemaHistoryTest covering the struct-boundary collision, its minimal empty-struct form, and that structurally identical schemas still hash equal.
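
The struct-boundary collision described above can be reproduced with a toy FNV-1a schema hash. This is not the PR's `SchemaHistory.hashField`: the `Field` model, the type-name strings, and the byte layout are hypothetical, and the `mixArity` switch exists only to contrast the two behaviours.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class StructBoundaryHashSketch {
  private static final long FNV_OFFSET = 0xcbf29ce484222325L;
  private static final long FNV_PRIME = 0x100000001b3L;

  static final class Field {
    final String name;
    final String type;
    final boolean nullable;
    final List<Field> children;

    Field(String name, String type, boolean nullable, Field... children) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
      this.children = Arrays.asList(children);
    }
  }

  static long mixByte(long hash, int b) {
    return (hash ^ (b & 0xff)) * FNV_PRIME;
  }

  static long mixString(long hash, String s) {
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      hash = mixByte(hash, b);
    }
    return mixByte(hash, 0); // terminator so adjacent strings stay delimited
  }

  static long hashField(long hash, Field f, boolean mixArity) {
    hash = mixString(hash, f.name);
    hash = mixString(hash, f.type);
    hash = mixByte(hash, f.nullable ? 1 : 0);
    if (f.type.equals("struct")) {
      if (mixArity) {
        hash = mixByte(hash, f.children.size()); // boundary marker: child count
      }
      for (Field child : f.children) {
        hash = hashField(hash, child, mixArity);
      }
    }
    return hash;
  }

  static long hashSchema(List<Field> fields, boolean mixArity) {
    long hash = FNV_OFFSET;
    for (Field f : fields) {
      hash = hashField(hash, f, mixArity);
    }
    return hash;
  }

  public static void main(String[] args) {
    // {a: struct<x>, b} versus {a: struct<x, b>}
    List<Field> flat = Arrays.asList(
        new Field("a", "struct", false, new Field("x", "int", false)),
        new Field("b", "int", false));
    List<Field> nested = Arrays.asList(
        new Field("a", "struct", false,
            new Field("x", "int", false), new Field("b", "int", false)));
    System.out.println(hashSchema(flat, false) == hashSchema(nested, false)); // prints true
    System.out.println(hashSchema(flat, true) == hashSchema(nested, true)); // prints false
  }
}
```

Without the child count both schemas feed the same byte sequence into the hash. With it, the sequences differ in exactly one byte, and since each FNV-1a step is a bijection on the 64-bit state, the two results cannot be equal.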

…ecode Schema-evolution projection codecs were compiled eagerly at builder time: the row/array/map builders enumerated the full historical cross-product and loaded one generated codec class per combination, so a deep nested version history paid its whole class cost up front whether or not those versions ever appeared on the wire. Defer that compilation to decode. Each builder now builds an immutable hash -> ProjectionSource index that holds only the inputs (the VersionedSchema and codegen context); the codec class is compiled the first time a payload with that hash is decoded and cached in a per-encoder LongMap. No new synchronization: encoders are single-threaded, and the class compile is already memoized globally by the shared code generator, so a concurrent first-miss on the same hash compiles the class once. The build-time collision guards stay eager over the full cross-product (row/array via SchemaHistory's strict-hash guard, map via its two combined-hash guards), so a hash clash still fails fast at build rather than on an unlucky decode. Because the build-time class count no longer tracks annotated history, the projection-count warning no longer flags a real cost; remove PROJECTION_COUNT_WARN_THRESHOLD and warnIfManyProjections. Also fix stale Javadoc that claimed map keys are always read at the current schema and that map support is in progress, and update the row-format guide's generation-cost note to the lazy model.
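
The lazy model described above can be sketched as an immutable index of inputs plus a per-encoder cache that is filled on first use. This is a simplified illustration: `Codec`, the string-based "compilation", and `compileCount` are hypothetical, a plain `HashMap` stands in for the PR's `LongMap`, and `ProjectionSource` here holds only a name rather than a `VersionedSchema` and codegen context.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class LazyProjectionSketch {
  interface Codec {
    String decode(String payload);
  }

  // Holds only the inputs needed to build a codec later.
  static final class ProjectionSource {
    final String schemaName;

    ProjectionSource(String schemaName) {
      this.schemaName = schemaName;
    }
  }

  private final Map<Long, ProjectionSource> index; // built eagerly, never mutated
  private final Map<Long, Codec> compiled = new HashMap<>(); // filled on first decode
  int compileCount; // exposed so the example can observe how often "compilation" runs

  LazyProjectionSketch(Map<Long, ProjectionSource> index) {
    this.index = Collections.unmodifiableMap(new HashMap<>(index));
  }

  // Stand-in for generating and loading a codec class.
  private Codec compile(ProjectionSource source) {
    compileCount++;
    return payload -> source.schemaName + ":" + payload;
  }

  // Not thread-safe by design: the sketch assumes a single-threaded encoder.
  Codec projectionFor(long hash) {
    Codec codec = compiled.get(hash);
    if (codec == null) {
      ProjectionSource source = index.get(hash);
      if (source == null) {
        throw new IllegalStateException("unknown schema hash 0x" + Long.toHexString(hash));
      }
      codec = compile(source);
      compiled.put(hash, codec);
    }
    return codec;
  }

  public static void main(String[] args) {
    Map<Long, ProjectionSource> index = new HashMap<>();
    index.put(1L, new ProjectionSource("Order_V1"));
    index.put(2L, new ProjectionSource("Order_V2"));
    LazyProjectionSketch encoder = new LazyProjectionSketch(index);
    System.out.println(encoder.compileCount); // prints 0
    encoder.projectionFor(1L).decode("abc");
    encoder.projectionFor(1L).decode("def");
    System.out.println(encoder.compileCount); // prints 1
  }
}
```

In this sketch a version that never appears on the wire costs one index entry and no compiled codec.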

stevenschlansker · 2026-06-30T21:02:01Z

@chaokunyang, this is not 100% done yet, but I think it is getting very close. Would you mind doing a design review to make sure that this feature will be acceptable to commit into the project? It is not a small feature, but I think the complexity is justified by giving the row format a low-cost method of achieving schema evolution.

If there is some way to streamline the eventual code review of a large volume of LLM-typed code, please let me know.
I have tried to keep the bar for quality high (many, many review iterations, per the AI policy) and the testing thorough.
I considered breaking it up into multiple PRs but there is not a "slice" of the feature that stands on its own.

I intend to test this over the coming month or so internally, and mark the PR as ready once we have some real world experience running it.

stevenschlansker requested review from PragmaTwice, chaokunyang and theweipeng as code owners May 29, 2026 00:07

stevenschlansker marked this pull request as draft May 29, 2026 00:08

stevenschlansker force-pushed the row-codec-schema-versions branch 16 times, most recently from 7823b91 to 8be8335 Compare June 30, 2026 15:56

stevenschlansker changed the title ~~Draft: feat(format): schema evolution for the Java row codec~~ feat(format): schema evolution for the Java row codec Jun 30, 2026

stevenschlansker force-pushed the row-codec-schema-versions branch 2 times, most recently from 4e7b606 to 96f441f Compare June 30, 2026 19:38

Claude (on behalf of Steven Schlansker) added 3 commits June 30, 2026 20:11

Claude (on behalf of Steven Schlansker) added 7 commits June 30, 2026 20:11

stevenschlansker force-pushed the row-codec-schema-versions branch from 96f441f to 0e3480e Compare June 30, 2026 20:13

stevenschlansker force-pushed the row-codec-schema-versions branch from 5d7e5e3 to 4964041 Compare June 30, 2026 21:01

feat(format): schema evolution for the Java row codec #3714
stevenschlansker wants to merge 11 commits into apache:main from stevenschlansker:row-codec-schema-versions
