[#11195] docs: add design doc for iceberg remove-orphan-files maintenance job #11700

Open
laserninja wants to merge 3 commits into apache:main from laserninja:docs/11195-iceberg-remove-orphan-files-design

Conversation

@laserninja

Collaborator

What changes were proposed in this pull request?

Add a design document for the built-in Iceberg maintenance job builtin-iceberg-remove-orphan-files that identifies and removes orphaned data and metadata files from Iceberg table storage locations via Spark's remove_orphan_files procedure.

The design doc covers:

  • Full end-to-end architecture: Policy → Strategy → Adapter → Job
  • Policy content class (IcebergOrphanFileRemovalContent) with configurable parameters: older_than (timestamp), location (custom path), dry_run (preview mode)
  • Strategy handler for time-based trigger evaluation
  • Job adapter for context-to-config conversion
  • Safety considerations (3-day default, dry-run mode, policy-gated execution)
  • Comparison with existing compaction and snapshot expiration flows
  • Proposed PR plan (3 incremental PRs)
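As a rough illustration of how the three policy parameters listed above could map onto Iceberg's Spark `remove_orphan_files` procedure, here is a minimal sketch. The class `OrphanFileCallBuilder`, its method names, and the idea of building the `CALL` statement as a string are assumptions made for illustration, not the implementation proposed in the design doc. The procedure arguments `table`, `older_than`, `location`, and `dry_run` are the ones documented by Iceberg. The timestamp is formatted in UTC, which assumes the Spark session time zone is UTC.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; names are illustrative and not taken from the design doc. */
final class OrphanFileCallBuilder {
  // Assumed to mirror the 3-day safety default mentioned in the PR description.
  static final Duration DEFAULT_RETENTION = Duration.ofDays(3);

  // Assumes the Spark session time zone is UTC.
  private static final DateTimeFormatter TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").withZone(ZoneOffset.UTC);

  /** Builds the Spark SQL CALL statement for Iceberg's remove_orphan_files procedure. */
  static String buildCall(
      String catalog,
      String table,
      Optional<Instant> olderThan,
      Optional<String> location,
      boolean dryRun,
      Instant now) {
    // Fall back to "now minus 3 days" when the policy does not set older_than.
    Instant cutoff = olderThan.orElse(now.minus(DEFAULT_RETENTION));

    List<String> args = new ArrayList<>();
    args.add("table => '" + escape(table) + "'");
    args.add("older_than => TIMESTAMP '" + TS.format(cutoff) + "'");
    // Only pass location when a custom path is configured.
    location.ifPresent(loc -> args.add("location => '" + escape(loc) + "'"));
    if (dryRun) {
      args.add("dry_run => true");
    }
    return "CALL " + catalog + ".system.remove_orphan_files(" + String.join(", ", args) + ")";
  }

  // Doubles single quotes so values cannot terminate the SQL string literal early.
  private static String escape(String value) {
    return value.replace("'", "''");
  }
}
```

With `now` set to 2026-06-18T00:00:00Z, no `older_than`, no `location`, and `dry_run` enabled, the sketch yields `CALL cat.system.remove_orphan_files(table => 'db.t', older_than => TIMESTAMP '2026-06-15 00:00:00.000', dry_run => true)`.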

Why are the changes needed?

Orphan files accumulate from failed writes, incomplete transactions, schema evolution, or concurrent operations. Without periodic cleanup, these files waste significant storage. The existing built-in jobs cover data compaction, statistics, and snapshot expiration but do not address orphan file cleanup.

A design doc is needed before implementation to align on the approach, safety mechanisms, and PR structure.

Fix: #11195

Does this PR introduce any user-facing change?

No. This is a design document only.

How was this patch tested?

N/A — design doc only, no code changes.

@github-actions

github-actions Bot commented Jun 16, 2026


Code Coverage Report

| Metric | Coverage |
| --- | --- |
| Overall Project | 67.15% +0.08% 🟢 |
| Files changed | 81.74% 🟢 |

| Module | Coverage |
| --- | --- |
| aliyun | 1.72% 🔴 |
| api | 46.82% 🟢 |
| authorization-common | 85.96% 🟢 |
| aws | 3.66% 🔴 |
| azure | 2.47% 🔴 |
| catalog-common | 10.4% 🔴 |
| catalog-fileset | 80.23% 🟢 |
| catalog-glue | 66.91% 🟢 |
| catalog-hive | 79.44% +2.43% 🟢 |
| catalog-jdbc-clickhouse | 80.02% 🟢 |
| catalog-jdbc-common | 44.22% 🟢 |
| catalog-jdbc-doris | 80.28% 🟢 |
| catalog-jdbc-hologres | 54.03% 🟢 |
| catalog-jdbc-mysql | 79.23% 🟢 |
| catalog-jdbc-oceanbase | 78.38% 🟢 |
| catalog-jdbc-postgresql | 82.29% 🟢 |
| catalog-jdbc-starrocks | 78.51% 🟢 |
| catalog-kafka | 77.01% 🟢 |
| catalog-lakehouse-generic | 58.53% 🟢 |
| catalog-lakehouse-hudi | 79.1% 🟢 |
| catalog-lakehouse-iceberg | 85.86% 🟢 |
| catalog-lakehouse-paimon | 82.14% 🟢 |
| catalog-model | 77.72% 🟢 |
| cli | 44.51% 🟢 |
| client-java | 78.01% 🟢 |
| common | 50.17% 🟢 |
| core | 82.59% 🟢 |
| filesystem-hadoop3 | 77.27% 🟢 |
| flink | 0.0% 🔴 |
| flink-common | 47.12% 🟢 |
| flink-runtime | 0.0% 🔴 |
| gcp | 14.12% 🔴 |
| hadoop-common | 10.88% 🔴 |
| hive-metastore-common | 53.77% 🟢 |
| iceberg-common | 58.15% 🟢 |
| iceberg-rest-server | 73.9% 🟢 |
| idp-basic | 86.2% 🟢 |
| integration-test-common | 0.0% 🔴 |
| jobs | 66.17% 🟢 |
| lance-common | 20.83% 🔴 |
| lance-rest-server | 60.13% 🟢 |
| lineage | 53.02% 🟢 |
| optimizer | 82.95% 🟢 |
| optimizer-api | 21.95% 🔴 |
| server | 85.96% 🟢 |
| server-common | 74.18% 🟢 |
| spark | 28.57% 🔴 |
| spark-common | 41.66% 🟢 |
| trino-connector | 40.13% 🟢 |

Files

| Module | File | Coverage |
| --- | --- | --- |
| catalog-hive | HiveCatalogOperations.java | 81.74% 🟢 |


| Layer | Compaction Components | Purpose |
| ------------ | -------------------------------------------------------------------------- | -------------------------------------------------- |
| **Policy** | `Policy.BuiltInType.ICEBERG_COMPACTION`, `IcebergDataCompactionContent` | Define configuration, thresholds, expressions |

Contributor

Align the table format.

Collaborator Author

Fixed in the latest commit. Simplified the table column separators.


## 9. Open Questions

1. **Cleanup interval tracking** — How should we track when the last orphan

Contributor

There are two ways to trigger the jobs:

  1. Event trigger
  2. Time trigger

Gravitino doesn't limit which one users choose.

Collaborator Author

Updated the design doc to document both trigger modes (event-based and time-based). The strategy handler works the same regardless of how it's invoked; the difference is only in scheduling. Added this clarification in Section 5.3.1.
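To make the "same handler, different scheduling" point concrete, here is a minimal sketch of a trigger-agnostic check. The class `OrphanCleanupTrigger`, the `minInterval` parameter, and the `lastRun` input are all hypothetical: how the last cleanup time is tracked is still Open Question 1, so this only assumes that some timestamp can be supplied by the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical evaluation logic; not the strategy handler from the design doc. */
final class OrphanCleanupTrigger {

  /**
   * Returns true when a cleanup should be submitted. The same method can be called
   * from an event callback or from a scheduler tick; only the caller differs.
   *
   * @param lastRun time of the previous cleanup, empty if none is recorded (assumption)
   * @param minInterval minimum spacing between cleanups (assumed policy setting)
   * @param now evaluation time, passed in so the check stays deterministic
   */
  static boolean shouldRun(Optional<Instant> lastRun, Duration minInterval, Instant now) {
    // No recorded run means the table has never been cleaned, so run now.
    // Otherwise run once at least minInterval has elapsed since the last run.
    return lastRun.map(previous -> !now.isBefore(previous.plus(minInterval))).orElse(true);
  }
}
```

Passing `now` explicitly keeps the check independent of how it was invoked, which is the property the reply above describes.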

metadata, (c) a separate statistics entry. The compaction flow uses
partition statistics; for orphan removal we may need a different mechanism
since it is table-level and time-driven.
2. **Location parameter** — Should the policy support specifying a custom

Contributor

Another question:

  1. Do we still scan the table's location if the location parameter is specified?

@laserninja laserninja Jun 18, 2026

Collaborator Author

No, when location is specified, only that custom location is scanned. The table's default location is not scanned in addition. This follows Iceberg's native remove_orphan_files behavior. Updated the field description and resolved Open Question 2 to reflect this.
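The "replaces, not supplements" rule can be sketched as a single resolution step. `ScanLocationResolver` is a hypothetical name used only for illustration, and treating a blank value as "not specified" is an assumption of this sketch rather than something the design doc states.

```java
import java.util.Optional;

/** Hypothetical helper showing that a custom location replaces the table location. */
final class ScanLocationResolver {

  /**
   * Returns the single location to scan for orphan files.
   *
   * @param tableLocation the table's default storage location
   * @param customLocation the policy's location parameter, if set
   */
  static String resolve(String tableLocation, Optional<String> customLocation) {
    // Assumption: a blank custom location is treated the same as an absent one.
    return customLocation
        .filter(location -> !location.trim().isEmpty())
        .orElse(tableLocation);
  }
}
```

The method returns exactly one path, so the table's default location is never scanned in addition to a custom one.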

Custom locations add flexibility but also risk if misconfigured.
3. **Dry-run result persistence** — Should dry-run results be stored
somewhere (e.g., job output metadata) for review before actual deletion?
4. **PR granularity** — Single PR for all remaining layers or split

Contributor

It depends on the complexity and code size. It is fine if the code size is less than 1000 lines.

Collaborator Author

Yes, that makes sense. I consolidated the PR plan from 3 PRs down to 2: PR 1 for the job layer, PR 2 for policy + strategy + adapter combined. Total code for the remaining layers should be well under 1000 lines.

- Fix table format alignment in Section 4.1
- Document both event-trigger and time-trigger modes
- Clarify location parameter behavior (replaces, not supplements, table location)
- Consolidate PR plan into 2 PRs per reviewer guidance (code < 1000 lines)
- Resolve open questions 2 and 4

apache#11195

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add builtin-iceberg-remove-orphan-files maintenance job

2 participants