[#11195] docs: add design doc for iceberg remove-orphan-files maintenance job by laserninja · Pull Request #11700 · apache/gravitino

laserninja · 2026-06-16T21:43:44Z

What changes were proposed in this pull request?

Add a design document for the built-in Iceberg maintenance job `builtin-iceberg-remove-orphan-files`, which identifies and removes orphaned data and metadata files from Iceberg table storage locations via Iceberg's `remove_orphan_files` Spark procedure.

The design doc covers:

- Full end-to-end architecture: Policy → Strategy → Adapter → Job
- Policy content class (`IcebergOrphanFileRemovalContent`) with configurable parameters: `older_than` (timestamp), `location` (custom path), `dry_run` (preview mode)
- Strategy handler for time-based trigger evaluation
- Job adapter for context-to-config conversion
- Safety considerations (3-day default, dry-run mode, policy-gated execution)
- Comparison with existing compaction and snapshot expiration flows
- Proposed PR plan (3 incremental PRs)

Why are the changes needed?

Orphan files accumulate from failed writes, incomplete transactions, schema evolution, or concurrent operations. Without periodic cleanup, these files waste significant storage. The existing built-in jobs cover data compaction, statistics, and snapshot expiration but do not address orphan file cleanup.

A design doc is needed before implementation to align on the approach, safety mechanisms, and PR structure.

Fix: #11195

Does this PR introduce any user-facing change?

No. This is a design document only.

How was this patch tested?

N/A — design doc only, no code changes.


github-actions · 2026-06-16T22:56:06Z

Code Coverage Report

| Scope           | Coverage        | Status |
| --------------- | --------------- | ------ |
| Overall Project | 67.15% `+0.08%` | 🟢     |
| Files changed   | 81.74%          | 🟢     |

| Module                    | Coverage        | Status |
| ------------------------- | --------------- | ------ |
| aliyun                    | 1.72%           | 🔴     |
| api                       | 46.82%          | 🟢     |
| authorization-common      | 85.96%          | 🟢     |
| aws                       | 3.66%           | 🔴     |
| azure                     | 2.47%           | 🔴     |
| catalog-common            | 10.4%           | 🔴     |
| catalog-fileset           | 80.23%          | 🟢     |
| catalog-glue              | 66.91%          | 🟢     |
| catalog-hive              | 79.44% `+2.43%` | 🟢     |
| catalog-jdbc-clickhouse   | 80.02%          | 🟢     |
| catalog-jdbc-common       | 44.22%          | 🟢     |
| catalog-jdbc-doris        | 80.28%          | 🟢     |
| catalog-jdbc-hologres     | 54.03%          | 🟢     |
| catalog-jdbc-mysql        | 79.23%          | 🟢     |
| catalog-jdbc-oceanbase    | 78.38%          | 🟢     |
| catalog-jdbc-postgresql   | 82.29%          | 🟢     |
| catalog-jdbc-starrocks    | 78.51%          | 🟢     |
| catalog-kafka             | 77.01%          | 🟢     |
| catalog-lakehouse-generic | 58.53%          | 🟢     |
| catalog-lakehouse-hudi    | 79.1%           | 🟢     |
| catalog-lakehouse-iceberg | 85.86%          | 🟢     |
| catalog-lakehouse-paimon  | 82.14%          | 🟢     |
| catalog-model             | 77.72%          | 🟢     |
| cli                       | 44.51%          | 🟢     |
| client-java               | 78.01%          | 🟢     |
| common                    | 50.17%          | 🟢     |
| core                      | 82.59%          | 🟢     |
| filesystem-hadoop3        | 77.27%          | 🟢     |
| flink                     | 0.0%            | 🔴     |
| flink-common              | 47.12%          | 🟢     |
| flink-runtime             | 0.0%            | 🔴     |
| gcp                       | 14.12%          | 🔴     |
| hadoop-common             | 10.88%          | 🔴     |
| hive-metastore-common     | 53.77%          | 🟢     |
| iceberg-common            | 58.15%          | 🟢     |
| iceberg-rest-server       | 73.9%           | 🟢     |
| idp-basic                 | 86.2%           | 🟢     |
| integration-test-common   | 0.0%            | 🔴     |
| jobs                      | 66.17%          | 🟢     |
| lance-common              | 20.83%          | 🔴     |
| lance-rest-server         | 60.13%          | 🟢     |
| lineage                   | 53.02%          | 🟢     |
| optimizer                 | 82.95%          | 🟢     |
| optimizer-api             | 21.95%          | 🔴     |
| server                    | 85.96%          | 🟢     |
| server-common             | 74.18%          | 🟢     |
| spark                     | 28.57%          | 🔴     |
| spark-common              | 41.66%          | 🟢     |
| trino-connector           | 40.13%          | 🟢     |

Files

| Module       | File                       | Coverage | Status |
| ------------ | -------------------------- | -------- | ------ |
| catalog-hive | HiveCatalogOperations.java | 81.74%   | 🟢     |

roryqi · 2026-06-17T16:29:19Z

+
+| Layer        | Compaction Components                                                      | Purpose                                            |
+| ------------ | -------------------------------------------------------------------------- | -------------------------------------------------- |
+| **Policy**   | `Policy.BuiltInType.ICEBERG_COMPACTION`, `IcebergDataCompactionContent`    | Define configuration, thresholds, expressions      |


Align the table format.

laserninja · 2026-06-18

Fixed in the latest commit: simplified the table column separators.

roryqi · 2026-06-17T16:38:07Z

+
+## 9. Open Questions
+
+1. **Cleanup interval tracking** — How should we track when the last orphan


There are two ways to trigger the jobs:

1. Event trigger
2. Time trigger

Gravitino doesn't limit the users' choice.

laserninja · 2026-06-18

Updated the design doc to document both trigger modes (event-based and time-based). The strategy handler works the same regardless of how it is invoked; the only difference is in scheduling. Added this clarification in Section 5.3.1.
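
To make the "same handler, different scheduling" point concrete, here is a minimal sketch of a time-based evaluation that either trigger mode could call. It is in Python for brevity and is not taken from the design doc: `is_cleanup_due`, `last_run`, and `min_interval` are hypothetical names, and how the last run time is stored is exactly the question left open in Section 9.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cleanup_due(
    last_run: Optional[datetime],
    now: datetime,
    min_interval: timedelta,
) -> bool:
    """Return True when an orphan-file cleanup should be submitted.

    The check depends only on its inputs, so an event callback and a
    scheduler tick can both call it and get the same decision.
    """
    if last_run is None:
        # No recorded run yet: treat the table as due.
        return True
    return now - last_run >= min_interval
```

An event-triggered caller would invoke this when a table event arrives, while a time-triggered caller would invoke it on each scheduler tick; the function itself does not need to know which one is calling.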

roryqi · 2026-06-17T16:39:10Z

+   metadata, (c) a separate statistics entry. The compaction flow uses
+   partition statistics; for orphan removal we may need a different mechanism
+   since it is table-level and time-driven.
+2. **Location parameter** — Should the policy support specifying a custom


Another question:

Do we scan the table's location if the `location` parameter is specified?

laserninja · 2026-06-18

No. When `location` is specified, only that custom location is scanned; the table's default location is not scanned in addition. This follows Iceberg's native `remove_orphan_files` behavior. Updated the field description and resolved Open Question 2 to reflect this.
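
The "replaces, not supplements" rule can be stated in a few lines. The sketch below is illustrative only (Python, with a hypothetical helper name and example paths); it shows that exactly one location is ever selected for the scan.

```python
from typing import Optional


def resolve_scan_location(table_location: str, custom_location: Optional[str]) -> str:
    """Pick the single directory tree that the orphan-file scan will walk."""
    if custom_location:
        # A custom location replaces the table location; it is not added to it.
        return custom_location
    return table_location
```

Because the custom path fully replaces the default, a misconfigured `location` means the table's own directory is never cleaned, which is the risk noted in the open question.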

roryqi · 2026-06-17T16:40:26Z

+   Custom locations add flexibility but also risk if misconfigured.
+3. **Dry-run result persistence** — Should dry-run results be stored
+   somewhere (e.g., job output metadata) for review before actual deletion?
+4. **PR granularity** — Single PR for all remaining layers or split


It depends on the complexity and code size. It is good if the code size is less than 1000 lines.

laserninja · 2026-06-18

Yes, that makes sense. Consolidated the PR plan from 3 PRs down to 2: PR 1 for the job layer, PR 2 for policy + strategy + adapter combined. Total code for the remaining layers should be well under 1000 lines.

- Fix table format alignment in Section 4.1
- Document both event-trigger and time-trigger modes
- Clarify location parameter behavior (replaces, not supplements, table location)
- Consolidate PR plan into 2 PRs per reviewer guidance (code < 1000 lines)
- Resolve open questions #2 and #4

apache#11195

docs: add design doc for iceberg remove-orphan-files maintenance job

014d694

apache#11195

roryqi reviewed Jun 17, 2026


laserninja added 2 commits June 17, 2026 21:41

Merge branch 'main' into docs/11195-iceberg-remove-orphan-files-design

e8d0abd

laserninja wants to merge 3 commits into apache:main from laserninja:docs/11195-iceberg-remove-orphan-files-design


2 participants

