
[#11753] feat(catalog-iceberg): support primary key via identifier fields #11754

Open
lzshlzsh wants to merge 1 commit into apache:main from lzshlzsh:iceberg-primary-key-support


@lzshlzsh


What changes were proposed in this pull request?

Map the Gravitino PRIMARY_KEY index to and from Iceberg's schema-level
identifier-field-ids, so that primary keys on Iceberg V2 tables survive a
create/load round trip.

  • ConvertUtil
    • applyIdentifierFields + primaryKeyColumnNames (forward): after
      building the Iceberg Schema, set identifier-field-ids from the field
      ids of the columns referenced by the single PRIMARY_KEY index. This
      mirrors FlinkSchemaUtil.freshIdentifierFieldIds in apache/iceberg.
      Iceberg requires identifier fields to be required (NOT NULL); we rely
      on Iceberg's own Schema constructor to enforce that constraint and
      intentionally do not silently promote nullability.
    • constructIndexesFromIdentifierFields (reverse): reconstruct a
      PRIMARY_KEY index from Schema#identifierFieldIds(), ordered by the
      column position in the schema for deterministic results (Iceberg
      identifier fields are an unordered set).
  • IcebergTable
    • Add ICEBERG_PRIMARY_KEY_INDEX_NAME = "ICEBERG_PRIMARY_KEY_INDEX"
      (analogous to Paimon's PAIMON_PRIMARY_KEY_INDEX).
    • internalBuild now stores indexes on the built table; fromIcebergTable
      back-fills the PRIMARY_KEY index from the loaded Iceberg schema.
  • IcebergCatalogOperations
    • Replace the blanket "Iceberg does not support indexes" rejection with
      validateIcebergIndexes: allow at most one PRIMARY_KEY index over one
      or more non-nested columns; reject multiple indexes, non-PRIMARY_KEY
      types, empty column lists, and nested column references with clear
      messages.
    • Pass the validated indexes through to the Iceberg table builder.
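
For readers who want the shape of the two conversions without opening the diff, here is a minimal, dependency-free sketch. It is not the PR's code and uses no Iceberg or Gravitino classes: `IdentifierFieldSketch`, `Column`, `toIdentifierFieldIds`, and `toPrimaryKeyColumns` are hypothetical names, and the nullable-column rejection only stands in for the check that the description attributes to Iceberg's Schema constructor.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the PK <-> identifier-field-ids mapping; not the PR's code. */
final class IdentifierFieldSketch {

  /** Simplified top-level column: field id, name, and NOT NULL flag. */
  static final class Column {
    final int fieldId;
    final String name;
    final boolean required;

    Column(int fieldId, String name, boolean required) {
      this.fieldId = fieldId;
      this.name = name;
      this.required = required;
    }
  }

  /** Forward: resolve PK column names to the field ids that become identifier fields. */
  static Set<Integer> toIdentifierFieldIds(List<Column> schema, List<String> pkColumns) {
    Set<Integer> ids = new LinkedHashSet<>();
    for (String pk : pkColumns) {
      Column match = null;
      for (Column column : schema) {
        if (column.name.equals(pk)) {
          match = column;
          break;
        }
      }
      if (match == null) {
        throw new IllegalArgumentException("Unknown primary key column: " + pk);
      }
      // Assumption: models the "identifier fields must be required" rule.
      // The column is rejected, never silently promoted to NOT NULL.
      if (!match.required) {
        throw new IllegalArgumentException("Primary key column must be NOT NULL: " + pk);
      }
      ids.add(match.fieldId);
    }
    return ids;
  }

  /** Reverse: list identifier columns in schema order, because the id set is unordered. */
  static List<String> toPrimaryKeyColumns(List<Column> schema, Set<Integer> identifierFieldIds) {
    List<String> names = new ArrayList<>();
    for (Column column : schema) {
      if (identifierFieldIds.contains(column.fieldId)) {
        names.add(column.name);
      }
    }
    return names;
  }
}
```

One consequence of ordering by schema position: a key declared as (region, id) on a schema whose columns are id, name, region comes back as (id, region). The round trip preserves the set of key columns, not the declared order.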

Out of scope: primary-key evolution via TableChange (Iceberg's
UpdateSchema#setIdentifierFields exists, but PK evolution interacts with
NOT NULL promotion and equality deletes and warrants a separate discussion);
no changes to the iceberg-rest-server module.

Why are the changes needed?

Iceberg V2 natively supports a primary key concept via identifier-field-ids,
which Flink CDC and the Iceberg Flink connector use for upsert /
equality-delete writes. Gravitino's Iceberg catalog currently rejects every
Index at create time, so users cannot:

  1. Express a primary key on an Iceberg table through the Gravitino API or any
    engine routed via Gravitino.
  2. Round-trip a primary key created by another engine — even when the
    underlying Iceberg table already carries identifier-field-ids,
    IcebergTable#fromIcebergTable previously dropped them and Table#index()
    always returned empty.

The Paimon catalog already handles the analogous mapping in both directions
(PaimonTable#constructIndexesFromPrimaryKeys,
GravitinoToPaimonTableConverter); this PR aligns the Iceberg catalog with
that behavior.

Fix: #11753

Does this PR introduce any user-facing change?

Yes, in the catalog-lakehouse-iceberg module only:

  • Creating an Iceberg table with a single PRIMARY_KEY index is now
    accepted and produces an Iceberg V2 table with identifier-field-ids
    set on its schema (previously rejected with
    "Iceberg does not support indexes").
  • Loading an Iceberg table whose schema has identifier-field-ids now
    surfaces a PRIMARY_KEY index named ICEBERG_PRIMARY_KEY_INDEX via
    Table#index() (previously always empty for Iceberg).
  • Invalid index shapes (more than one index, non-PRIMARY_KEY type,
    empty column list, nested column reference) are still rejected, now
    with focused error messages.
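
The accepted and rejected index shapes can be summarized in a small standalone sketch. This is an illustration under stated assumptions, not the PR's `validateIcebergIndexes`: `IndexValidationSketch` and `SketchIndex` are hypothetical types, the index type is modeled as a plain string, and the error messages are placeholders rather than the ones in the diff.

```java
import java.util.List;

/** Hypothetical model of the create-time index rules; not the PR's implementation. */
final class IndexValidationSketch {

  /** Simplified index: a type name plus one name path per referenced column. */
  static final class SketchIndex {
    final String type;
    final String[][] fieldNames;

    SketchIndex(String type, String[][] fieldNames) {
      this.type = type;
      this.fieldNames = fieldNames;
    }
  }

  /** Accepts no index, or exactly one PRIMARY_KEY index over top-level columns. */
  static void validate(List<SketchIndex> indexes) {
    if (indexes == null || indexes.isEmpty()) {
      return; // A table without indexes stays valid.
    }
    check(indexes.size() == 1, "At most one index is supported");
    SketchIndex index = indexes.get(0);
    check("PRIMARY_KEY".equals(index.type), "Only PRIMARY_KEY indexes are supported");
    check(
        index.fieldNames != null && index.fieldNames.length > 0,
        "Primary key must reference at least one column");
    for (String[] fieldName : index.fieldNames) {
      // A path with more than one segment (for example a.b) is a nested reference.
      check(fieldName != null && fieldName.length == 1, "Primary key columns must not be nested");
    }
  }

  private static void check(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Failing with IllegalArgumentException matches the exception type the new validation tests in this PR assert on.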

No new property keys, no public API signature changes.

How was this patch tested?

New unit tests cover both directions of the mapping and the validation
rules (run via ./gradlew :catalogs:catalog-lakehouse-iceberg:test):

  • TestConvertUtil
    • forward: PRIMARY_KEY index → Iceberg identifier-field-ids (single &
      composite columns).
    • reverse: identifier-field-ids → PRIMARY_KEY index, asserting the
      reconstructed columns are ordered by schema position.
    • empty cases: no index ⇒ no identifier fields; no identifier fields ⇒
      Indexes.EMPTY_INDEXES.
    • round-trip: forward then reverse yields the original PK column set.
    • Iceberg's NOT NULL constraint on identifier fields is exercised by
      relying on the Schema constructor's own validation.
  • TestIcebergCatalogOperations (validation):
    • more than one index ⇒ IllegalArgumentException.
    • non-PRIMARY_KEY index type ⇒ IllegalArgumentException.
    • empty PK column list ⇒ IllegalArgumentException.
    • nested PK column reference (a.b) ⇒ IllegalArgumentException.

No existing tests required changes; the previous "indexes not supported"
behavior had no positive coverage to remove.

…ier fields

Map a Gravitino PRIMARY_KEY index to Iceberg identifier-field-ids on
create, and reconstruct the PRIMARY_KEY index from identifier-field-ids
on load, forming a create/load round-trip closure.

- ConvertUtil: applyIdentifierFields + primaryKeyColumnNames (forward),
  constructIndexesFromIdentifierFields (reverse, ordered by schema
  column position for determinism)
- IcebergTable: ICEBERG_PRIMARY_KEY_INDEX_NAME constant, store indexes in
  internalBuild, back-fill indexes in fromIcebergTable
- IcebergCatalogOperations: replace the hard "does not support indexes"
  rejection with validateIcebergIndexes (single non-nested PRIMARY_KEY),
  pass indexes through to the table builder

Tests: forward/reverse/round-trip and validation rejection cases.
@lzshlzsh lzshlzsh changed the title [#11753] feat(catalog-iceberg): support primary key via identifier fields [#11754] feat(catalog-iceberg): support primary key via identifier fields Jun 22, 2026
@lzshlzsh lzshlzsh changed the title [#11754] feat(catalog-iceberg): support primary key via identifier fields [#11753] feat(catalog-iceberg): support primary key via identifier fields Jun 22, 2026
@yuqi1129 yuqi1129 requested a review from Copilot June 23, 2026 02:02

Copilot AI left a comment

Contributor


Pull request overview

This PR wires Gravitino PRIMARY_KEY indexes to Iceberg V2 schema-level identifier-field-ids (and back) in the catalog-lakehouse-iceberg module, enabling create/load round-trips of primary keys for Iceberg tables.

Changes:

  • Map a single Gravitino PRIMARY_KEY index to Iceberg Schema identifier field IDs during schema conversion, and reconstruct a synthetic PRIMARY_KEY index when loading an Iceberg table.
  • Relax Iceberg create-time index rejection by validating/allowing a single top-level PRIMARY_KEY index and passing indexes through to the Iceberg table builder.
  • Add unit tests for forward/reverse mapping and basic index validation rejections.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/converter/ConvertUtil.java | Applies identifier-field-ids from PK indexes and reconstructs PK indexes from Iceberg schema identifier fields. |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/IcebergCatalogOperations.java | Replaces blanket “indexes not supported” check with targeted PK-only validation and passes indexes into table builder. |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/IcebergTable.java | Persists indexes on built tables and reconstructs PK indexes when loading from Iceberg metadata. |
| catalogs/catalog-lakehouse-iceberg/src/test/java/org/apache/gravitino/catalog/lakehouse/iceberg/converter/TestConvertUtil.java | Adds unit tests for PK ↔ identifier-field-ids mapping and NOT NULL constraint behavior. |
| catalogs/catalog-lakehouse-iceberg/src/test/java/org/apache/gravitino/catalog/lakehouse/iceberg/TestIcebergCatalogOperations.java | Adds unit tests for create-time index validation failures. |

Comment on lines +620 to +624

    Preconditions.checkArgument(
        indexes.length == 1, "Iceberg only supports no more than one PRIMARY_KEY Index.");
    Index index = indexes[0];
    Preconditions.checkArgument(
        index.type() == Index.IndexType.PRIMARY_KEY, "Iceberg only supports primary key Index.");

Comment on lines +629 to +634

    Arrays.stream(fieldNames)
        .forEach(
            fieldName ->
                Preconditions.checkArgument(
                    fieldName != null && fieldName.length == 1,
                    "The primary key columns should not be nested."));

Comment on lines +110 to +115

    String[][] fieldNames =
        schema.columns().stream()
            .filter(field -> identifierFieldIds.contains(field.fieldId()))
            .map(field -> new String[] {field.name()})
            .toArray(String[][]::new);
    return new Index[] {Indexes.primary(IcebergTable.ICEBERG_PRIMARY_KEY_INDEX_NAME, fieldNames)};

Comment on lines +135 to +138

    Assertions.assertEquals(
        com.google.common.collect.ImmutableSet.of(
            schema.findField("id").fieldId(), schema.findField("region").fieldId()),
        schema.identifierFieldIds());

Comment on lines +75 to +77

  /** The name of the synthetic primary key index reconstructed from Iceberg identifier fields. */
  @VisibleForTesting
  public static final String ICEBERG_PRIMARY_KEY_INDEX_NAME = "ICEBERG_PRIMARY_KEY_INDEX";

Comment on lines +127 to +135

  @Test
  public void testCreateTableRejectsNestedPrimaryKeyColumn() {
    Index[] indexes = new Index[] {Indexes.primary("pk", new String[][] {{"struct", "field"}})};
    IllegalArgumentException exception =
        Assertions.assertThrows(
            IllegalArgumentException.class, () -> createTableWithIndexes(indexes));
    Assertions.assertTrue(exception.getMessage().contains("should not be nested"));
  }

Development

Successfully merging this pull request may close these issues.

[Improvement] Support primary key on Iceberg tables via identifier-field-ids
