[#11195] docs: add design doc for iceberg remove-orphan-files maintenance job #11700
laserninja wants to merge 3 commits into
| Layer | Compaction Components | Purpose |
| ------------ | -------------------------------------------------------------------------- | -------------------------------------------------- |
| **Policy** | `Policy.BuiltInType.ICEBERG_COMPACTION`, `IcebergDataCompactionContent` | Define configuration, thresholds, expressions |
Fixed in the latest commit. Simplified the table column separators.
## 9. Open Questions
1. **Cleanup interval tracking**: How should we track when the last orphan
There are two ways to trigger the jobs.
- Event trigger
- Time trigger
Gravitino doesn't limit the users' choice.
Updated the design doc to document both trigger modes (event-based and time-based). The strategy handler works the same regardless of how it's invoked; the difference is only in scheduling. Added this clarification in Section 5.3.1.
   metadata, (c) a separate statistics entry. The compaction flow uses
   partition statistics; for orphan removal we may need a different mechanism
   since it is table-level and time-driven.
2. **Location parameter**: Should the policy support specifying a custom
Another question:
- Do we still scan the table's location if the location parameter is specified?
No, when location is specified, only that custom location is scanned. The table's default location is not scanned in addition. This follows Iceberg's native remove_orphan_files behavior. Updated the field description and resolved Open Question 2 to reflect this.
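To make the parameter semantics in this reply concrete, here is a minimal sketch of how the three policy fields could be rendered into a call to Iceberg's Spark `remove_orphan_files` procedure. The helper name `build_remove_orphan_files_call` and the catalog and table names are hypothetical and not taken from the design doc; the argument names (`table`, `older_than`, `location`, `dry_run`) are those of the Iceberg procedure. Inputs are assumed to be trusted, so no quoting or escaping is attempted.

```python
from datetime import datetime
from typing import Optional


def build_remove_orphan_files_call(
    catalog: str,
    table: str,
    older_than: Optional[datetime] = None,
    location: Optional[str] = None,
    dry_run: bool = False,
) -> str:
    """Render a Spark SQL CALL statement for Iceberg's remove_orphan_files.

    Hypothetical helper for illustration only; inputs are assumed trusted.
    """
    args = [f"table => '{table}'"]
    if older_than is not None:
        # Only files older than this timestamp are candidates for removal.
        ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
        args.append(f"older_than => TIMESTAMP '{ts}'")
    if location is not None:
        # When set, only this path is scanned; the table's default
        # location is not scanned in addition.
        args.append(f"location => '{location}'")
    if dry_run:
        # Preview mode: the procedure lists orphan files without deleting them.
        args.append("dry_run => true")
    return f"CALL {catalog}.system.remove_orphan_files({', '.join(args)})"


if __name__ == "__main__":
    print(
        build_remove_orphan_files_call(
            "my_catalog",
            "db.events",
            older_than=datetime(2024, 1, 1),
            location="s3://bucket/custom/path",
            dry_run=True,
        )
    )
```

Running a dry run first and reviewing the listed files before repeating the call without `dry_run` is one way to limit the risk of a misconfigured custom location.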
   Custom locations add flexibility but also risk if misconfigured.
3. **Dry-run result persistence**: Should dry-run results be stored
   somewhere (e.g., job output metadata) for review before actual deletion?
4. **PR granularity**: Single PR for all remaining layers or split
It depends on the complexity and code size. It is good if the code size is less than 1000 lines.
Yes, makes sense. Consolidated the PR plan from 3 PRs down to 2: PR 1 for the job layer, PR 2 for policy + strategy + adapter combined. Total code for the remaining layers should be well under 1000 lines.
- Fix table format alignment in Section 4.1
- Document both event-trigger and time-trigger modes
- Clarify location parameter behavior (replaces, not supplements, table location)
- Consolidate PR plan into 2 PRs per reviewer guidance (code < 1000 lines)
- Resolve open questions #2 and #4
apache#11195
What changes were proposed in this pull request?
Add a design document for the built-in Iceberg maintenance job `builtin-iceberg-remove-orphan-files`, which identifies and removes orphaned data and metadata files from Iceberg table storage locations via Spark's `remove_orphan_files` procedure.
The design doc covers:
- `IcebergOrphanFileRemovalContent` with configurable parameters: `older_than` (timestamp), `location` (custom path), `dry_run` (preview mode)
Why are the changes needed?
Orphan files accumulate from failed writes, incomplete transactions, schema evolution, or concurrent operations. Without periodic cleanup, these files waste significant storage. The existing built-in jobs cover data compaction, statistics, and snapshot expiration but do not address orphan file cleanup.
A design doc is needed before implementation to align on the approach, safety mechanisms, and PR structure.
Fix: #11195
Does this PR introduce any user-facing change?
No. This is a design document only.
How was this patch tested?
N/A — design doc only, no code changes.