[#11753] feat(catalog-iceberg): support primary key via identifier fields by lzshlzsh · Pull Request #11754 · apache/gravitino

lzshlzsh · 2026-06-22T07:30:50Z

What changes were proposed in this pull request?

Map the Gravitino PRIMARY_KEY index to and from Iceberg's schema-level
identifier-field-ids, so that primary keys on Iceberg V2 tables survive a
create/load round trip.

ConvertUtil
- applyIdentifierFields + primaryKeyColumnNames (forward): after
  building the Iceberg Schema, set identifier-field-ids from the field
  ids of the columns referenced by the single PRIMARY_KEY index. This
  mirrors FlinkSchemaUtil.freshIdentifierFieldIds in apache/iceberg.
  Iceberg requires identifier fields to be required (NOT NULL); we rely
  on Iceberg's own Schema constructor to enforce that constraint and
  intentionally do not silently promote nullability.
- constructIndexesFromIdentifierFields (reverse): reconstruct a
  PRIMARY_KEY index from Schema#identifierFieldIds(), ordered by the
  column position in the schema for deterministic results (Iceberg
  identifier fields are an unordered set).
IcebergTable
- Add ICEBERG_PRIMARY_KEY_INDEX_NAME = "ICEBERG_PRIMARY_KEY_INDEX"
  (analogous to Paimon's PAIMON_PRIMARY_KEY_INDEX).
- internalBuild now stores indexes on the built table; fromIcebergTable
  back-fills the PRIMARY_KEY index from the loaded Iceberg schema.
IcebergCatalogOperations
- Replace the blanket "Iceberg does not support indexes" rejection with
  validateIcebergIndexes: allow at most one PRIMARY_KEY index over one
  or more non-nested columns; reject multiple indexes, non-PRIMARY_KEY
  types, empty column lists, and nested column references with clear
  messages.
- Pass the validated indexes through to the Iceberg table builder.
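
To make the forward direction concrete, here is a minimal plain-Java sketch of resolving primary key column names to field ids. It is not the PR's `ConvertUtil` code and does not use the Iceberg API: `SketchColumn` and `IdentifierFieldSketch` are hypothetical names, and the inline NOT NULL check only stands in for the validation that the PR delegates to Iceberg's `Schema` constructor.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a top-level schema column; not an Iceberg or Gravitino type. */
final class SketchColumn {
  final int fieldId;
  final String name;
  final boolean required;

  SketchColumn(int fieldId, String name, boolean required) {
    this.fieldId = fieldId;
    this.name = name;
    this.required = required;
  }
}

/** Hypothetical helper illustrating the name-to-field-id resolution. */
final class IdentifierFieldSketch {
  private IdentifierFieldSketch() {}

  /** Resolves primary key column names to field ids, rejecting unknown or nullable columns. */
  static Set<Integer> identifierFieldIds(List<SketchColumn> columns, List<String> pkColumnNames) {
    Map<String, SketchColumn> byName = new LinkedHashMap<>();
    for (SketchColumn column : columns) {
      byName.put(column.name, column);
    }
    Set<Integer> ids = new LinkedHashSet<>();
    for (String name : pkColumnNames) {
      SketchColumn column = byName.get(name);
      if (column == null) {
        throw new IllegalArgumentException("Primary key column not found: " + name);
      }
      // Assumption for this sketch: a nullable column is rejected here rather than
      // silently promoted to NOT NULL. In the PR this check is left to Iceberg.
      if (!column.required) {
        throw new IllegalArgumentException("Primary key column must be NOT NULL: " + name);
      }
      ids.add(column.fieldId);
    }
    return ids;
  }
}
```

The result is a set because identifier fields carry no ordering of their own; that is why the reverse direction has to pick an order explicitly.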

Out of scope: primary-key evolution via TableChange (Iceberg's
UpdateSchema#setIdentifierFields exists, but PK evolution interacts with
NOT NULL promotion and equality deletes and warrants a separate discussion);
no changes to the iceberg-rest-server module.

Why are the changes needed?

Iceberg V2 natively supports a primary key concept via identifier-field-ids,
which Flink CDC and the Iceberg Flink connector use for upsert /
equality-delete writes. Gravitino's Iceberg catalog currently rejects every
Index at create time, so users cannot:

- Express a primary key on an Iceberg table through the Gravitino API or any
  engine routed via Gravitino.
- Round-trip a primary key created by another engine: even when the
  underlying Iceberg table already carries identifier-field-ids,
  IcebergTable#fromIcebergTable previously dropped them and Table#index()
  always returned empty.

The Paimon catalog already handles the analogous mapping in both directions
(PaimonTable#constructIndexesFromPrimaryKeys,
GravitinoToPaimonTableConverter); this PR aligns the Iceberg catalog with
that behavior.

Fix: #11753

Does this PR introduce any user-facing change?

Yes, in the catalog-lakehouse-iceberg module only:

- Creating an Iceberg table with a single PRIMARY_KEY index is now
  accepted and produces an Iceberg V2 table with identifier-field-ids
  set on its schema (previously rejected with
  "Iceberg does not support indexes").
- Loading an Iceberg table whose schema has identifier-field-ids now
  surfaces a PRIMARY_KEY index named ICEBERG_PRIMARY_KEY_INDEX via
  Table#index() (previously always empty for Iceberg).
- Invalid index shapes (more than one index, non-PRIMARY_KEY type,
  empty column list, nested column reference) are still rejected, now
  with focused error messages.
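
The accepted and rejected index shapes listed above can be summarized in a small plain-Java sketch. This is not the PR's `validateIcebergIndexes`: `SketchIndex`, `SketchIndexType`, and all messages except the nested-column one (which is taken from the diff) are hypothetical, and the early return for an empty index array is an assumption.

```java
/** Plain-Java sketch of the create-time index rules; all types here are hypothetical. */
final class PrimaryKeyIndexValidationSketch {
  private PrimaryKeyIndexValidationSketch() {}

  /** Hypothetical stand-in for the index type enum. */
  enum SketchIndexType {
    PRIMARY_KEY,
    UNIQUE_KEY
  }

  /** Hypothetical stand-in for an index: a type plus one name path per column. */
  static final class SketchIndex {
    final SketchIndexType type;
    final String[][] fieldNames;

    SketchIndex(SketchIndexType type, String[][] fieldNames) {
      this.type = type;
      this.fieldNames = fieldNames;
    }
  }

  /** Accepts no index, or exactly one PRIMARY_KEY index over top-level columns. */
  static void validate(SketchIndex[] indexes) {
    if (indexes == null || indexes.length == 0) {
      return; // Assumption: a table without indexes is always valid.
    }
    check(indexes.length == 1, "At most one PRIMARY_KEY index is supported.");
    SketchIndex index = indexes[0];
    check(index.type == SketchIndexType.PRIMARY_KEY, "Only PRIMARY_KEY indexes are supported.");
    check(
        index.fieldNames != null && index.fieldNames.length > 0,
        "The primary key must reference at least one column.");
    for (String[] fieldName : index.fieldNames) {
      // A path with more than one element (for example {"a", "b"}) is a nested reference.
      check(
          fieldName != null && fieldName.length == 1,
          "The primary key columns should not be nested.");
    }
  }

  private static void check(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Checking the index count first keeps each failure tied to a single rule, which is what allows the error messages to stay focused.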

No new property keys, no public API signature changes.

How was this patch tested?

New unit tests cover both directions of the mapping and the validation
rules (run via ./gradlew :catalogs:catalog-lakehouse-iceberg:test):

TestConvertUtil
- forward: PRIMARY_KEY index → Iceberg identifier-field-ids (single &
  composite columns).
- reverse: identifier-field-ids → PRIMARY_KEY index, asserting the
  reconstructed columns are ordered by schema position.
- empty cases: no index ⇒ no identifier fields; no identifier fields ⇒
  Indexes.EMPTY_INDEXES.
- round-trip: forward then reverse yields the original PK column set.
- Iceberg's NOT NULL constraint on identifier fields is exercised by
  relying on the Schema constructor's own validation.
TestIcebergCatalogOperations (validation):
- more than one index ⇒ IllegalArgumentException.
- non-PRIMARY_KEY index type ⇒ IllegalArgumentException.
- empty PK column list ⇒ IllegalArgumentException.
- nested PK column reference (a.b) ⇒ IllegalArgumentException.
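
The reverse-direction case listed under TestConvertUtil depends on a deterministic order for an unordered id set. A minimal sketch of that idea, using parallel arrays instead of an Iceberg `Schema`, might look like the following; `ReversePrimaryKeySketch` and its method are hypothetical names, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of rebuilding primary key columns in schema order. */
final class ReversePrimaryKeySketch {
  private ReversePrimaryKeySketch() {}

  /**
   * Walks the columns in schema order and keeps those whose field id is an identifier field.
   * columnNames[i] and fieldIds[i] describe the same top-level column.
   */
  static String[][] primaryKeyFieldNames(
      String[] columnNames, int[] fieldIds, Set<Integer> identifierFieldIds) {
    List<String[]> result = new ArrayList<>();
    for (int i = 0; i < columnNames.length; i++) {
      if (identifierFieldIds.contains(fieldIds[i])) {
        // Each primary key column is a single-element path, since nesting is not allowed.
        result.add(new String[] {columnNames[i]});
      }
    }
    return result.toArray(new String[0][]);
  }
}
```

Because the loop iterates over the schema rather than over the id set, the output order depends only on column position, regardless of how the set happens to iterate.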

No existing tests required changes; the previous "indexes not supported"
behavior had no positive coverage to remove.

…ier fields

Map a Gravitino PRIMARY_KEY index to Iceberg identifier-field-ids on create, and reconstruct the PRIMARY_KEY index from identifier-field-ids on load, forming a create/load round-trip closure.

- ConvertUtil: applyIdentifierFields + primaryKeyColumnNames (forward), constructIndexesFromIdentifierFields (reverse, ordered by schema column position for determinism)
- IcebergTable: ICEBERG_PRIMARY_KEY_INDEX_NAME constant, store indexes in internalBuild, back-fill indexes in fromIcebergTable
- IcebergCatalogOperations: replace the hard "does not support indexes" rejection with validateIcebergIndexes (single non-nested PRIMARY_KEY), pass indexes through to the table builder

Tests: forward/reverse/round-trip and validation rejection cases.

Copilot

Pull request overview

This PR wires Gravitino PRIMARY_KEY indexes to Iceberg V2 schema-level identifier-field-ids (and back) in the catalog-lakehouse-iceberg module, enabling create/load round-trips of primary keys for Iceberg tables.

Changes:

- Map a single Gravitino PRIMARY_KEY index to Iceberg Schema identifier field IDs during schema conversion, and reconstruct a synthetic PRIMARY_KEY index when loading an Iceberg table.
- Relax Iceberg create-time index rejection by validating/allowing a single top-level PRIMARY_KEY index and passing indexes through to the Iceberg table builder.
- Add unit tests for forward/reverse mapping and basic index validation rejections.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/converter/ConvertUtil.java | Applies `identifier-field-ids` from PK indexes and reconstructs PK indexes from Iceberg schema identifier fields. |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/IcebergCatalogOperations.java | Replaces blanket “indexes not supported” check with targeted PK-only validation and passes indexes into table builder. |
| catalogs/catalog-lakehouse-iceberg/src/main/java/org/apache/gravitino/catalog/lakehouse/iceberg/IcebergTable.java | Persists indexes on built tables and reconstructs PK indexes when loading from Iceberg metadata. |
| catalogs/catalog-lakehouse-iceberg/src/test/java/org/apache/gravitino/catalog/lakehouse/iceberg/converter/TestConvertUtil.java | Adds unit tests for PK ↔ identifier-field-ids mapping and NOT NULL constraint behavior. |
| catalogs/catalog-lakehouse-iceberg/src/test/java/org/apache/gravitino/catalog/lakehouse/iceberg/TestIcebergCatalogOperations.java | Adds unit tests for create-time index validation failures. |

+    Preconditions.checkArgument(
+        indexes.length == 1, "Iceberg only supports no more than one PRIMARY_KEY Index.");
+    Index index = indexes[0];
+    Preconditions.checkArgument(
+        index.type() == Index.IndexType.PRIMARY_KEY, "Iceberg only supports primary key Index.");


+    Arrays.stream(fieldNames)
+        .forEach(
+            fieldName ->
+                Preconditions.checkArgument(
+                    fieldName != null && fieldName.length == 1,
+                    "The primary key columns should not be nested."));


+    String[][] fieldNames =
+        schema.columns().stream()
+            .filter(field -> identifierFieldIds.contains(field.fieldId()))
+            .map(field -> new String[] {field.name()})
+            .toArray(String[][]::new);
+    return new Index[] {Indexes.primary(IcebergTable.ICEBERG_PRIMARY_KEY_INDEX_NAME, fieldNames)};


+    Assertions.assertEquals(
+        com.google.common.collect.ImmutableSet.of(
+            schema.findField("id").fieldId(), schema.findField("region").fieldId()),
+        schema.identifierFieldIds());


+  /** The name of the synthetic primary key index reconstructed from Iceberg identifier fields. */
+  @VisibleForTesting
+  public static final String ICEBERG_PRIMARY_KEY_INDEX_NAME = "ICEBERG_PRIMARY_KEY_INDEX";


+  @Test
+  public void testCreateTableRejectsNestedPrimaryKeyColumn() {
+    Index[] indexes = new Index[] {Indexes.primary("pk", new String[][] {{"struct", "field"}})};
+    IllegalArgumentException exception =
+        Assertions.assertThrows(
+            IllegalArgumentException.class, () -> createTableWithIndexes(indexes));
+    Assertions.assertTrue(exception.getMessage().contains("should not be nested"));
+  }
+


lzshlzsh changed the title ~~[#11753] feat(catalog-iceberg): support primary key via identifier fields~~ [#11754] feat(catalog-iceberg): support primary key via identifier fields Jun 22, 2026

lzshlzsh changed the title ~~[#11754] feat(catalog-iceberg): support primary key via identifier fields~~ [#11753] feat(catalog-iceberg): support primary key via identifier fields Jun 22, 2026

yuqi1129 requested a review from Copilot June 23, 2026 02:02

Copilot started reviewing on behalf of yuqi1129 June 23, 2026 02:03

Copilot AI reviewed Jun 23, 2026

lzshlzsh wants to merge 1 commit into apache:main from lzshlzsh:iceberg-primary-key-support
