feat(format): schema evolution for the Java row codec#3714

Draft
stevenschlansker wants to merge 11 commits into apache:main from stevenschlansker:row-codec-schema-versions

@stevenschlansker stevenschlansker commented May 29, 2026


Opt in with .withSchemaEvolution() on any row, array, or map codec builder. Fields carry @ForyVersion(since, until); removed fields are listed on a nested interface referenced from
@ForySchema(removedFields = ...). Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats supported.

Why?

Currently, changing a row format schema definition in any way invalidates all previously written records.

What does this PR do?

This PR proposes a new concept of format versions. Each successive version may add or remove fields from a type, and the deserialization machinery picks the version to read based on the payload's schema hash.

  Schema evolution for the Java row codec

  Lets a consumer built from the current bean decode rows written by older versions of that bean.

  Adds opt-in schema evolution to the Java row codec, enabled with
  `withSchemaEvolution()` on the bean/array/map codec builders. A reader can decode
  payloads written against older versions of a bean by dispatching on a per-payload
  strict schema hash to a projection codec that reads the historical layout and
  maps it onto the current type, applying defaults for fields that did not yet
  exist and discarding fields that have been removed.
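The projection step described above can be pictured with a small conceptual model. This is only a sketch under stated assumptions: rows are modeled as name-to-value maps, and `ProjectionSketch`, `project`, and the field names are hypothetical. The real codec works on binary rows with precomputed layouts and generated classes, not on maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual model only: a "row" is a map from field name to value.
final class ProjectionSketch {
  // currentDefaults lists every field of the current schema together with the
  // default to use when the historical payload predates that field.
  static Map<String, Object> project(
      Map<String, Object> historicalRow, Map<String, Object> currentDefaults) {
    Map<String, Object> current = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : currentDefaults.entrySet()) {
      String name = field.getKey();
      // Field present in the old layout: carry it over. Otherwise: apply the default.
      Object value = historicalRow.containsKey(name) ? historicalRow.get(name) : field.getValue();
      current.put(name, value);
    }
    // Fields that exist only in historicalRow (removed fields) are never copied.
    return current;
  }

  public static void main(String[] args) {
    Map<String, Object> v1Row = new LinkedHashMap<>();
    v1Row.put("id", 7L);
    v1Row.put("legacyFlags", 3); // removed in the current schema
    Map<String, Object> defaults = new LinkedHashMap<>();
    defaults.put("id", 0L);
    defaults.put("currency", "USD"); // added after v1
    System.out.println(project(v1Row, defaults));
  }
}
```

The point of the model is the asymmetry: iteration is driven by the current schema, so added fields get defaults and removed fields are dropped without any special case.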

  Field history is declared with `@ForyVersion(since/until)` on live fields and on
  a `@ForySchema(removedFields = ...)` history interface for removed fields.
  `SchemaHistory` enumerates the version history, including the cross-product over
  nested versioned beans; each historical layout gets a projection codec whose row
  layout is precomputed once so projection decode costs the same per call as
  current-schema decode. The wire format uses an 8-byte strict-hash slot
  (reusing the row's existing hash slot, and new for evolution-enabled array/map elements),
  and producer/consumer must agree on the flag.

  Covers the standard and compact row formats, records (including
  `@ForyVersion` on record components) and interface beans, and nested versioned
  beans to arbitrary depth.


  API
  - `@ForyVersion` — since/until version window on a field, record component, or accessor method, defining when that field is present in the schema history.
  - `@ForySchema` — declares a bean's evolution intent and points at a nested interface describing removed fields.
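To make the annotation model concrete, here is a self-contained sketch. The two annotations are declared locally as hypothetical stand-ins so the example compiles by itself; the attribute types, the defaults, and the exclusive reading of `until` are assumptions, and the PR's actual declarations may differ. `Order`, `History`, and `VersionWindows` are illustrative names only.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for the PR's annotations (assumed shapes, not the real ones).
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface ForyVersion {
  int since() default 1;
  int until() default Integer.MAX_VALUE; // assumed "no upper bound" sentinel
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ForySchema {
  Class<?> removedFields() default void.class;
}

@ForySchema(removedFields = Order.History.class)
class Order {
  long id; // unannotated: assumed present since version 1
  @ForyVersion(since = 2) String currency; // added in version 2

  // Removed fields live on a nested interface, as the description says.
  interface History {
    @ForyVersion(since = 1, until = 2)
    int legacyFlags(); // present in version 1 only (until assumed exclusive)
  }
}

final class VersionWindows {
  // Returns the sorted field names that make up the schema at the given version.
  static List<String> fieldsAt(Class<?> bean, int version) {
    List<String> names = new ArrayList<>();
    for (Field f : bean.getDeclaredFields()) {
      if (f.isSynthetic()) continue;
      ForyVersion v = f.getAnnotation(ForyVersion.class);
      int since = v == null ? 1 : v.since();
      if (since <= version) names.add(f.getName());
    }
    ForySchema schema = bean.getAnnotation(ForySchema.class);
    if (schema != null && schema.removedFields() != void.class) {
      for (Method m : schema.removedFields().getDeclaredMethods()) {
        ForyVersion v = m.getAnnotation(ForyVersion.class);
        if (v != null && v.since() <= version && version < v.until()) names.add(m.getName());
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    System.out.println(fieldsAt(Order.class, 1));
    System.out.println(fieldsAt(Order.class, 2));
  }
}
```

Under these assumptions, version 1 of `Order` consists of `id` and `legacyFlags`, and version 2 consists of `currency` and `id`.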

  How it works

  Row payloads already carry an 8-byte schema hash. Evolution-enabled encoders keep
  a map of historical hashes; on decode, a hash mismatch is resolved against that
  map instead of failing.

  - `SchemaHistory` derives the ordered historical schemas from the version
    annotations and a strict hash per version (hash includes field name and
    nullability).
  - `RowCodecBuilder` generates one projection codec per historical version
    (`_V<n>` classes), each reading its historical schema and producing
    current-bean instances.
  - `BinaryRowEncoder.decode` dispatches on the peer hash: exact match → current
    codec; known historical hash → projection codec; else
    `ClassNotCompatibleException`.
  - Nested versioned beans dispatch recursively by strict hash, routed through
    array and map projection codecs so versioning composes through `List<Bean>` and
    `Map<_, Bean>`.
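The dispatch rule in the list above (exact match, then known historical hash, else failure) can be sketched as follows. This is a minimal illustration, not the PR's `BinaryRowEncoder`: the class name, the use of `Function` as a codec, reading the hash from the first 8 bytes of a `ByteBuffer`, and `IllegalStateException` standing in for `ClassNotCompatibleException` are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class HashDispatchSketch {
  private final long currentHash;
  private final Function<ByteBuffer, String> currentCodec;
  // Known historical strict hashes mapped to their projection codecs.
  private final Map<Long, Function<ByteBuffer, String>> projections = new HashMap<>();

  HashDispatchSketch(long currentHash, Function<ByteBuffer, String> currentCodec) {
    this.currentHash = currentHash;
    this.currentCodec = currentCodec;
  }

  void addProjection(long historicalHash, Function<ByteBuffer, String> codec) {
    projections.put(historicalHash, codec);
  }

  String decode(ByteBuffer payload) {
    long peerHash = payload.getLong(0); // assumed: hash occupies the first 8 bytes
    if (peerHash == currentHash) {
      return currentCodec.apply(payload); // hot path: one long comparison
    }
    Function<ByteBuffer, String> projection = projections.get(peerHash);
    if (projection == null) {
      // Stand-in for ClassNotCompatibleException in the real codec.
      throw new IllegalStateException("unknown schema hash " + Long.toHexString(peerHash));
    }
    return projection.apply(payload);
  }

  public static void main(String[] args) {
    HashDispatchSketch codec = new HashDispatchSketch(2L, p -> "current");
    codec.addProjection(1L, p -> "projected-v1");
    ByteBuffer payload = ByteBuffer.allocate(8);
    payload.putLong(0, 1L);
    System.out.println(codec.decode(payload));
  }
}
```

Checking the current hash first keeps the common case free of any map lookup, which is consistent with the description's claim that the current-schema path is unaffected.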

  Performance

  Both decode paths reuse the `CompactRowLayout` cache from #3717. The
  current-schema path allocates rows via `writer.newRow()`; the historical path via
  a per-projection `RowFactory` that holds its historical schema's layout, built
  once. Per-decode cost is the same on both — no layout recomputation. The
  evolution-enabled array/map codecs keep a single allocation per encode.

AI Contribution Checklist

AI Usage Disclosure

  • substantial_ai_assistance: yes
  • scope: all
  • affected_files_or_subsystems: row format Java
  • ai_review: <line-by-line self-review completed; summarize the two-reviewer loop and final no-further-comments result>
  • ai_review_artifacts:
  • human_verification: <checks run locally or in CI + pass/fail summary + contributor reviewed results>
  • performance_verification: ✔️
  • provenance_license_confirmation: ✔️
  • If yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
  • If yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.

Does this PR introduce any user-facing change?

Yes: a new codec option, schema evolution, exposed through a few small annotations and one builder method.
Existing row format compatibility is unchanged.

Benchmark

withSchemaEvolution() is an opt-in feature that adds a new row-codec path; it does not modify any existing serialization hot path.
There is no apache/main baseline to compare against — SchemaEvolutionSuite exercises withSchemaEvolution(), which does not exist
on main — so the benchmark measures two things directly: the steady-state cost of enabling the flag, and projection-vs-current
parity.

Bounded JMH run (JDK 26, 2 forks × 4 iterations, 1s each, -prof gc). B/op is gc.alloc.rate.norm (bytes allocated per operation).

| Benchmark | Throughput (ops/s) | B/op |
|---|---|---|
| currentDecode | 17.6M | 312 |
| currentDecodeNoEvolution | 16.6M | 312 |
| encode | 15.9M | 152 |
| encodeNoEvolution | 15.8M | 152 |
| compactCurrentDecode | 16.4M | 280 |
| compactCurrentDecodeNoEvolution | 16.3M | 280 |
| compactEncode | 16.4M | 144 |
| compactEncodeNoEvolution | 15.5M | 144 |
| olderDecode | 24.5M | 216 |
| compactOlderDecode | 24.9M | 192 |

Findings:

  • Enabling evolution adds zero allocation on the current path. B/op is byte-identical between the evolution-on and *NoEvolution
    variants on every path (decode 312/312, encode 152/152, compact decode 280/280, compact encode 144/144).
  • Throughput overhead of the flag is within the run's noise band. Every on-vs-off pair overlaps within error. This bounded run has
    ~10% confidence intervals on the no-evolution variants, so the throughput claim is "no measurable difference," not a tight bound;
    allocation is exact.
  • Projection (older-version) decode is not penalized versus current decode. It allocates less here (216 vs 312 B/op standard; 192
    vs 280 compact) because it reads the narrower V1 schema, not because projection is inherently cheaper. Each projection codec holds its
    historical schema's precomputed row layout, so there is no per-decode rebuild.

Limitations

  • Producer and consumer must agree on the withSchemaEvolution() flag; the two
    framings are not wire-compatible. A flag-mismatched peer fails loudly with
    ClassNotCompatibleException (except evolution-off reading evolution-on bytes,
    which is undefined). Adopt by enabling the flag on both sides in a release that
    changes no schema, then evolve schemas once every peer is on the new build.
    Existing data cannot be upgraded in place because the original format has no hash slot for arrays or maps.
  • Evolution-enabled payloads are Java-only; cross-language consumers (Python,
    C++) cannot read them.
  • The number of generated projection codec classes grows as the product of the
    version counts of the distinct nested versioned bean classes. Retire entries
    from a bean's History interface once you no longer need to read payloads from
    that range to bound the growth.
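The growth in the last bullet is plain multiplication, which a short sketch makes tangible. `ProjectionCountSketch` and `combinations` are hypothetical names, and whether the all-current combination is counted among the generated classes is not stated in the description, so the result is best read as an upper bound on enumerated combinations.

```java
final class ProjectionCountSketch {
  // Product of the version counts of the distinct nested versioned bean classes.
  static long combinations(int... versionCounts) {
    long product = 1;
    for (int count : versionCounts) {
      if (count < 1) {
        throw new IllegalArgumentException("each bean has at least one version");
      }
      product = Math.multiplyExact(product, (long) count); // fail loudly on overflow
    }
    return product;
  }

  public static void main(String[] args) {
    // Three nested versioned classes with 3, 4, and 2 versions respectively.
    System.out.println(combinations(3, 4, 2));
    // After retiring two old versions of the second class.
    System.out.println(combinations(3, 2, 2));
  }
}
```

With 3, 4, and 2 versions the product is 24; retiring two versions of the second class halves it to 12, which is why trimming History entries bounds the growth.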

Comment thread docs/guide/java/row-format.md
Comment thread docs/guide/java/row-format.md Outdated
Comment thread docs/guide/java/row-format.md Outdated
Comment thread docs/guide/java/row-format.md
@stevenschlansker stevenschlansker force-pushed the row-codec-schema-versions branch 16 times, most recently from 7823b91 to 8be8335 Compare June 30, 2026 15:56
@stevenschlansker stevenschlansker changed the title Draft: feat(format): schema evolution for the Java row codec feat(format): schema evolution for the Java row codec Jun 30, 2026
@stevenschlansker stevenschlansker force-pushed the row-codec-schema-versions branch 2 times, most recently from 4e7b606 to 96f441f Compare June 30, 2026 19:38
Claude (on behalf of Steven Schlansker) added 3 commits June 30, 2026 20:11
Add withSchemaEvolution() to the row/array/map codecs so a current-version
codec can read payloads written by older versions of the same bean. Versions
are declared with @ForyVersion(since/until) on fields/accessors and
@ForySchema(removedFields=...) for deleted fields.

Decode dispatches on an 8-byte strict hash at the head of the payload (the row
reuses its hash slot with a stricter, name- and nullability-sensitive hash;
array/map gain an 8-byte prefix) to a per-version projection codec. The non-
evolution path is unchanged: projection state is null when evolution is off, so
unversioned codecs are byte-for-byte identical and pay no decode cost.

Includes the one-allocation-per-encode path for evolution-enabled array/map
codecs and the row-format allocation probe.
Cross-product over the versions of distinct nested versioned beans so a nested
bean evolves to arbitrary depth, dispatched by the inner bean's own strict hash
rather than a version number. ProjectionRouting maps each combination to a
generated projection class, identified by a class-name suffix carrying the
inner simple name, version, and strict-hash low bits.

Folds in the projection-path decode hardening and allocation trims, the
generation-time logging-unit fix, the isBean gate on the version-history probe,
the row-format guide's schema-evolution section, the JMH benchmark suite, and
the spotless pass over the new sources.
Discover versioned beans nested in List/Map field values (arrays use the
component type, not the element type) so their older versions are enumerated
and projected. Rejects finite @ForyVersion(until) on a live field and drops
RECORD_COMPONENT from the @ForyVersion target for Java 11 module safety
(FIELD+METHOD covers records). Carries the resolved schema in the RowFactory,
precomputed once at build time, instead of mutating the builder.
Claude (on behalf of Steven Schlansker) added 7 commits June 30, 2026 20:11
…ojection

Position-scope the bean codec registration so a same-class key and value bean
do not share one beanEncoderMap entry; the key is always read at the current
schema while the value projects to its historical schema. Covers both the
eager-encode and lazy-decode key positions.
…ersioned bean

Take the evolution path for a top-level array/map whose element or value is a
versioned bean, so the strict-hash prefix is always present and producer and
consumer stay wire-compatible. Drops an unreachable live-field since/until
check and covers added reference and collection field defaults on row
evolution.
…ested

Thread the TypeResolutionContext through the core TypeUtils map key/value
isBean branch so an interface bean is discovered as a map key/value rather than
rejected. Throws for non-accessor methods colliding with an absent projection
field, warns on a large projection cross-product, rejects @ForyVersion(since)
below the first schema version, and documents the Java-only wire framing in the
row format specification.
A versioned bean used as a map key was read at the current schema only, so an
older key decoded against the current layout and corrupted silently. Make map
keys evolve like values: the map header's single 8-byte hash now identifies the
(key-layout, value-layout) combination jointly (FNV-1a mix of the two strict
hashes), so the writer still emits one hash per map and the reader dispatches
both positions to the matching historical projection codec.

MapCodecBuilder builds a key history alongside the value history and enumerates
their cross-product, generating one projection map codec per combination keyed
by the combined hash; a position with no versioned bean contributes a single
current-only layout. Its NON_BEAN_POSITION_HASH sentinel is 0L, which is also a
legitimate FNV result for a real schema, so the sentinel's safety rests on the
build-time collision guards in buildVersioned rather than on 0L being
unreachable. MapEncoderBuilder gained a key codec suffix parallel to the value
suffix so the key subtree routes to its historical row codec instead of being
pinned to current; the generated map class name namespaces the key suffix with
_K so (keyOld, valCurrent) and (keyCurrent, valOld) do not collapse onto one
class. Encoding and both codec formats thread the key suffix through.

The map header hash derivation changes for all evolution-enabled maps; the
feature is unreleased so there is no compatibility constraint.

Tests cover key-only evolution, both-sides evolution, the cross-combination
class-name collision, and the compact format.
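A joint (key-layout, value-layout) hash of the kind this commit message describes can be sketched with textbook 64-bit FNV-1a. The constants are the standard FNV-1a offset basis and prime; the byte order, the class name, and the exact mixing sequence are assumptions, since the body of `SchemaHistory.combineHashes` is not shown here.

```java
final class CombinedHashSketch {
  private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
  private static final long FNV_PRIME = 0x100000001b3L;

  // Folds the 8 bytes of value into hash, least significant byte first (assumed order).
  static long mix(long hash, long value) {
    for (int i = 0; i < 8; i++) {
      hash ^= (value >>> (8 * i)) & 0xffL;
      hash *= FNV_PRIME;
    }
    return hash;
  }

  // One hash per map identifying the key layout and the value layout jointly.
  static long combine(long keyStrictHash, long valueStrictHash) {
    return mix(mix(FNV_OFFSET_BASIS, keyStrictHash), valueStrictHash);
  }

  public static void main(String[] args) {
    System.out.println(Long.toHexString(combine(1L, 2L)));
    System.out.println(Long.toHexString(combine(1L, 3L)));
  }
}
```

Because every FNV-1a step is a bijection on the 64-bit state, two inputs that differ in exactly one byte always produce different results. Inputs that differ in several bytes can still collide, which is why the commit keeps build-time collision guards rather than relying on the hash alone.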
Generalize SchemaHistory's per-field nested-bean enumeration from a single
key/value dimension to a list of NestedSite entries, each carrying the
map-branch path (KEY/VALUE per map crossed) from the field root to the bean
leaf. collectNestedSites now descends both the map key (kv.f0) and value
(kv.f1), so a versioned bean used as a map key inside a row field becomes its
own cross-product dimension and evolves independently of the value, to
arbitrary nesting depth. NestedSite.substitute follows the recorded path to
substitute each historical struct into exactly its leaf, leaving other leaves
intact; projectThroughWrapper (the top-level array/map path) follows a
value-only path of the field's map depth, preserving prior behavior.

Previously a row field typed Map<VersionedKey, V> with key-version skew between
writer and reader threw ClassNotCompatibleException at decode, because the row
strict hash includes the key struct but the key was never enumerated. The
realistic case (distinct writer/reader key classes) needs no codegen change:
distinct classes already route to distinct beanCodecKey/nestedBeanSuffix
entries.

Add a build-time guard in MapCodecBuilder.buildVersioned that rejects a
combined (key, value) schema-hash collision instead of silently overwriting a
projection entry, mirroring the row-path strict-hash collision check, and cover
the case where a projection hash would shadow the current hash on the decode
hot path.

Tests: evolvingMapKeyInRowField (directly-typed key) and evolvingKeyOfNestedMap
(key of a map nested under another map's value, exercising the multi-step
branch path). Update the row-format guide to state keys evolve whether the map
is top-level or nested, with the hash semantics for each.
… array/map

A top-level array or map codec enumerated only one nested versioned bean and
stamped that bean's strict hash, so an element that reached more than one
distinct versioned bean class — for example List<Map<KeyBean, ValueBean>> with
KeyBean and ValueBean evolving independently — generated a projection codec for
only one of them and failed to compile a reference to the other.

Enumerate the element/position over the same per-class cross-product the
row-field path uses: SchemaHistory.forElement builds the element field's history
from collectNestedSites, so the element schema's strict hash identifies every
reachable bean's layout jointly and each combination carries its chosen versions
in nestedBeanSchemas(). Codegen routing is generalized from a single
rowCodecSuffix to a per-class Map<Class,String> (nestedBeanSuffix consults it,
absent class meaning current schema), threaded through Encoding, Encoders, the
array/map/compact builders, and BaseBinaryEncoderBuilder, so one generated codec
embeds the right historical row codec for each nested class. ArrayCodecBuilder
and MapCodecBuilder build via forElement; the map keeps its per-position
combined-hash dispatch and evolves multi-bean wrappers within each position,
preserving key non-nullability through projectedPositionField. The value-only
projectThroughWrapper/mapDepth and the single-bean beanPath helper are removed;
evolutionBean is reduced to the reachability-and-naming predicate it now is.

Reaching a bean also means reaching through a non-versioned wrapper bean: a
struct whose own fields are stable but which holds an evolving nested bean (such
as a map-key struct holding an evolving detail) must itself become an evolution
site, or no projection is generated and an older nested payload fails loud with
a map self/peer hash mismatch. collectNestedSites now treats any row bean as a
site when build()'s history has more than one version — covering both directly
annotated beans and these transitive-only wrappers — replacing the
direct-annotation-only isBeanWithVersioning probe with a plain isBean gate, since
build() already expands the nested cross-product. A SchemaEvolutionStressTest
case covers a map struct-key whose nested bean evolves.

Per-class enumeration stays correct: a versioned bean is one class written at
the latest version, so its key and value occurrences in a map are always the
same version; a map with the same logical bean at mixed versions on the two
sides is not a shape a real writer produces. The wire format is unchanged (still
one strict-hash prefix per array/map), so the row format spec needs no change;
the Java row-format guide now states keys and values evolve wherever the map
appears, including top-level array/map wrappers.

Also fold in review polish on the same files: document the ForyVersion.until
no-upper-bound sentinel, import RecordComponent instead of fully qualifying it,
share one FNV-1a helper (SchemaHistory.combineHashes) between the schema hash
and the map combined hash, note why the row/array projection tables need no
builder-side collision guard while the map's combined hash does, and unify
RowEncoderBuilder's nested-bean suffix routing onto the inherited
nestedClassSuffixes field instead of a parallel private map and override.
SchemaHistory.hashField mixed a StructType's constant name then recursed
into its children with no boundary marker. A struct's arity is variable
(unlike list/map, whose arity is fixed by the type kind), so
{a: struct<x>, b} and {a: struct<x, b>} mixed an identical byte sequence
and produced the same 64-bit strict hash. A non-injective strict hash can
route an older payload to the wrong projection codec or trip the
build-time collision guard on a legitimate evolution. Mix the struct's
child count before recursing so the hash stays injective over nesting.

Also tighten visibility and remove duplicate paths surfaced by review:
hoist the duplicated typeCtx() (the synthesize-interfaces resolution
context) into the shared BaseCodecBuilder owner; drop MapCodecBuilder's
one-line combinedHash forwarder in favor of calling
SchemaHistory.combineHashes directly; and narrow computeStrictSchemaHash
to package-private for the new test.

Add SchemaHistoryTest covering the struct-boundary collision, its minimal
empty-struct form, and that structurally identical schemas still hash
equal.
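The collision this commit fixes is easy to reproduce in a simplified model. The sketch below is not `SchemaHistory.hashField`: it serializes only field names and a struct marker into a string and hashes that with FNV-1a, whereas the real strict hash also covers types and nullability. `FieldNode` and `StructHashSketch` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class FieldNode {
  final String name;
  final List<FieldNode> children; // null marks a leaf; a struct has a (possibly empty) list

  private FieldNode(String name, List<FieldNode> children) {
    this.name = name;
    this.children = children;
  }

  static FieldNode leaf(String name) {
    return new FieldNode(name, null);
  }

  static FieldNode struct(String name, FieldNode... children) {
    return new FieldNode(name, Arrays.asList(children));
  }
}

final class StructHashSketch {
  // Flattens the schema into the byte stream that gets hashed.
  static String encode(List<FieldNode> fields, boolean mixChildCount) {
    StringBuilder out = new StringBuilder();
    append(out, fields, mixChildCount);
    return out.toString();
  }

  private static void append(StringBuilder out, List<FieldNode> fields, boolean mixChildCount) {
    for (FieldNode field : fields) {
      out.append(field.name).append(';');
      if (field.children != null) {
        out.append("struct;");
        if (mixChildCount) {
          // The fix: record the arity so the struct's extent is unambiguous.
          out.append('#').append(field.children.size()).append(';');
        }
        append(out, field.children, mixChildCount);
      }
    }
  }

  static long hash(List<FieldNode> fields, boolean mixChildCount) {
    long h = 0xcbf29ce484222325L; // FNV-1a 64-bit offset basis
    for (byte b : encode(fields, mixChildCount).getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x100000001b3L; // FNV-1a 64-bit prime
    }
    return h;
  }

  public static void main(String[] args) {
    // {a: struct<x>, b} versus {a: struct<x, b>}
    List<FieldNode> first = Arrays.asList(FieldNode.struct("a", FieldNode.leaf("x")), FieldNode.leaf("b"));
    List<FieldNode> second = Arrays.asList(FieldNode.struct("a", FieldNode.leaf("x"), FieldNode.leaf("b")));
    System.out.println(encode(first, false).equals(encode(second, false)));
    System.out.println(encode(first, true).equals(encode(second, true)));
  }
}
```

Without the child count both schemas flatten to the same stream, so any hash of that stream must collide. With the count mixed in, the streams differ at the arity byte and the two hashes separate.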
@stevenschlansker stevenschlansker force-pushed the row-codec-schema-versions branch from 96f441f to 0e3480e Compare June 30, 2026 20:13
…ecode

Schema-evolution projection codecs were compiled eagerly at builder time:
the row/array/map builders enumerated the full historical cross-product and
loaded one generated codec class per combination, so a deep nested version
history paid its whole class cost up front whether or not those versions ever
appeared on the wire.

Defer that compilation to decode. Each builder now builds an immutable
hash -> ProjectionSource index that holds only the inputs (the VersionedSchema
and codegen context); the codec class is compiled the first time a payload with
that hash is decoded and cached in a per-encoder LongMap. No new synchronization:
encoders are single-threaded, and the class compile is already memoized globally
by the shared code generator, so a concurrent first-miss on the same hash
compiles the class once.

The build-time collision guards stay eager over the full cross-product (row/array
via SchemaHistory's strict-hash guard, map via its two combined-hash guards), so a
hash clash still fails fast at build rather than on an unlucky decode.

Because the build-time class count no longer tracks annotated history, the
projection-count warning no longer flags a real cost; remove
PROJECTION_COUNT_WARN_THRESHOLD and warnIfManyProjections. Also fix stale Javadoc
that claimed map keys are always read at the current schema and that map support
is in progress, and update the row-format guide's generation-cost note to the
lazy model.
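The lazy model in this commit message amounts to a hash-keyed index of inputs plus a per-encoder cache. A minimal sketch, assuming a `Supplier` stands in for the `ProjectionSource` inputs and a `HashMap` stands in for the per-encoder `LongMap`; `LazyProjectionIndex` and its methods are hypothetical names, and like the encoders described above it is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class LazyProjectionIndex<C> {
  // Built eagerly: only the inputs needed to compile each projection codec.
  private final Map<Long, Supplier<C>> sources;
  // Filled on demand: codecs compiled for hashes actually seen on the wire.
  private final Map<Long, C> compiled = new HashMap<>();

  LazyProjectionIndex(Map<Long, Supplier<C>> sources) {
    this.sources = new HashMap<>(sources);
  }

  C codecFor(long hash) {
    C codec = compiled.get(hash);
    if (codec == null) {
      Supplier<C> source = sources.get(hash);
      if (source == null) {
        throw new IllegalStateException("unknown schema hash " + Long.toHexString(hash));
      }
      codec = source.get(); // first miss: pay the compile cost once
      compiled.put(hash, codec);
    }
    return codec;
  }

  int compiledCount() {
    return compiled.size();
  }

  public static void main(String[] args) {
    Map<Long, Supplier<String>> sources = new HashMap<>();
    sources.put(1L, () -> "codecV1");
    sources.put(2L, () -> "codecV2");
    LazyProjectionIndex<String> index = new LazyProjectionIndex<>(sources);
    index.codecFor(1L);
    System.out.println(index.compiledCount());
  }
}
```

Only versions that actually appear in payloads incur a compile, so a long annotated history costs little until old data shows up.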
@stevenschlansker stevenschlansker force-pushed the row-codec-schema-versions branch from 5d7e5e3 to 4964041 Compare June 30, 2026 21:01
stevenschlansker commented Jun 30, 2026


@chaokunyang, this is not 100% done yet, but I think it is getting very close. Would you mind doing a design review to make sure that this feature will be acceptable to commit into the project? It is not a small feature, but I think the complexity is justified by giving the row format a low-cost method of achieving schema evolution.

If there is some way to streamline the eventual code review of a large volume of LLM-typed code, please let me know.
I have tried to keep the bar for quality high (many, many review iterations, per the AI policy) and the testing thorough.
I considered breaking it up into multiple PRs but there is not a "slice" of the feature that stands on its own.

I intend to test this over the coming month or so internally, and mark the PR as ready once we have some real world experience running it.
