[flink-action][server][client] add orphan files cleanup action for remote storage#3404

Open
platinumhamburg wants to merge 16 commits into apache:main from platinumhamburg:feat/orphan-files-cleanup

@platinumhamburg

Contributor

Purpose

Linked issue: close #3403

Brief change log

Tests

API and Format

Documentation

platinumhamburg force-pushed the feat/orphan-files-cleanup branch 3 times, most recently from 9b95b89 to 9bd732b, May 31, 2026 04:59
platinumhamburg force-pushed the feat/orphan-files-cleanup branch from 9bd732b to 901dc41, May 31, 2026 14:24
luoyuxia requested a review from Copilot, June 1, 2026 02:44

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a Flink-based orphan_files_clean action to identify and delete orphaned remote-storage artifacts (log segments/manifests and KV snapshot files) by adding new coordinator read-only RPCs to enumerate the active reference set and wiring client/server support for those RPCs.

Changes:

  • Add new RPCs LIST_REMOTE_LOG_MANIFESTS and LIST_KV_SNAPSHOTS (proto + api keys), implement them in CoordinatorService, and expose them via the client Admin.
  • Add new Flink action module + SPI loader/entrypoint and implement the orphan cleanup DAG (scope enumeration → scan/clean → stats aggregation + empty-dir sweep) with rule-based file classification and audit logging.
  • Extend filesystem metadata support (modification time) to enable safe age-based deletion, and add unit/integration tests for the new behavior.

Reviewed changes

Copilot reviewed 82 out of 82 changed files in this pull request and generated 6 comments.

File Description
fluss-test-coverage/pom.xml Excludes newly introduced Flink action entry/SPI classes from coverage instrumentation.
fluss-server/src/test/java/org/apache/fluss/server/tablet/TestTabletServerGateway.java Adds stub methods for the new list RPCs in the tablet gateway test implementation.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/TestCoordinatorGateway.java Adds stub methods for the new list RPCs in the coordinator gateway test implementation.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorServiceOrphanRpcsITCase.java Adds IT coverage for coordinator orphan-cleanup RPCs (manifests + snapshot listing).
fluss-server/src/main/java/org/apache/fluss/server/zk/ZooKeeperClient.java Adds ZK helper methods to list remote log manifest handles and bucket snapshot IDs.
fluss-server/src/main/java/org/apache/fluss/server/tablet/TabletService.java Rejects orphan-cleanup RPCs on tablet servers (coordinator-only RPCs).
fluss-server/src/main/java/org/apache/fluss/server/kv/snapshot/CompletedSnapshotStore.java Adds API to expose active snapshot IDs (retained ∪ still-in-use) for listing RPCs.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/lease/KvSnapshotLeaseManager.java Exposes lease-pinned snapshot IDs to support “still-in-use” snapshot reporting.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorService.java Implements listRemoteLogManifests and listKvSnapshots coordinator RPC handlers.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java Adjusts visibility/testing annotation around snapshot store manager accessor.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/CompletedSnapshotStoreManager.java Adds per-bucket active snapshot ID computation with ZK fallback when store isn’t in-memory.
fluss-rpc/src/test/java/org/apache/fluss/rpc/TestingTabletGatewayService.java Adds stub methods for the new list RPCs in RPC test scaffolding.
fluss-rpc/src/main/proto/FlussApi.proto Defines new request/response messages for manifest/snapshot listing.
fluss-rpc/src/main/java/org/apache/fluss/rpc/protocol/ApiKeys.java Registers new API keys for the orphan-cleanup list RPCs.
fluss-rpc/src/main/java/org/apache/fluss/rpc/gateway/AdminReadOnlyGateway.java Adds gateway methods for the two new read-only list RPCs.
fluss-flink/pom.xml Adds the new fluss-flink-action module.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/adapter/FlinkMultipleParameterToolTest.java Extends adapter tests for new convenience accessors.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/RuleDispatcherTest.java Adds rule dispatch coverage for orphan-cleanup file classification.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/OrphanDirDetectorTest.java Adds tests for orphan table/partition directory detection by ID guards.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/LogSegmentRuleTest.java Adds log segment rule tests (active-set + cutoff semantics).
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/LogManifestRuleTest.java Adds manifest rule tests (default conservative + opt-in deletion).
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/KvSnapshotFileRuleTest.java Adds KV snapshot file rule tests (active snap dirs + cutoff semantics).
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/rule/KvSharedSstRuleTest.java Adds tests ensuring shared SSTs are never deleted.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/RpcErrorClassifierTest.java Adds tests for stable RPC error categorization used in audit logs.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/job/EmptyDirSweeperTest.java Adds tests for post-clean empty directory sweeping (dry-run + bottom-up).
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/fs/SafeDeleterTest.java Adds tests for safe deletion behavior (dry-run, non-empty dir no-op, etc.).
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/config/OrphanCleanConfigTest.java Adds CLI config parsing/validation tests for orphan cleanup action.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/build/MaxKnownIdsTrackerTest.java Adds tests for max-known ID tracking used for orphan directory guard logic.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/action/orphan/build/ActiveRefsFetcherTest.java Adds tests for manifest/snapshot active-set fetching (retries + per-bucket failures).
fluss-flink/fluss-flink-common/src/main/resources/META-INF/services/org.apache.fluss.flink.action.ActionFactory Registers OrphanFilesCleanActionFactory via ServiceLoader SPI.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/adapter/MultipleParameterToolAdapter.java Adds convenience accessors (has/get/getMultiParameter) for CLI parsing.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/RuleId.java Introduces stable rule identifiers for audit tagging.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/RuleDispatcher.java Implements rule dispatch based on path patterns (log/kv/manifest/shared/unknown).
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/OrphanDirDetector.java Implements orphan table/partition dir detection via parsed ID + max-known guards.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/LogSegmentRule.java Implements log segment deletion decisions using active refs + cutoff + orphan-dir mode.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/LogManifestRule.java Implements conservative manifest handling with opt-in deletion behavior.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/KvSnapshotFileRule.java Implements KV snapshot file classification using active snapshot dir names.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/KvSharedSstRule.java Implements “never delete” policy for shared SST files.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/FileRule.java Defines the rule interface for single-file decisions.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/FileMeta.java Adds immutable file metadata container for rule evaluation.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/Decision.java Adds decision vocabulary for cleanup (DELETE/KEEP/DEFER/SKIP_UNKNOWN).
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/rule/BucketActiveRefs.java Adds immutable bucket-scoped active reference sets for log+kv+manifest paths.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/RpcErrorClassifier.java Adds stable classification of RPC failures for audit/reporting logic.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/OrphanFilesCleanActionFactory.java Adds factory for the orphan_files_clean action and CLI help text.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/OrphanFilesCleanAction.java Adds the action runner that executes the Flink job and logs final stats.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/OrphanCleanUtils.java Adds shared utilities (paths, remote dir resolution, safe listing helpers).
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/StatsAggregateOperator.java Adds custom bounded operator to aggregate stats and run the empty-dir sweep.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/ScanAndCleanFunction.java Implements stage-2 FS scan & cleanup with per-subtask rate limiting.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/OrphanFilesCleanJob.java Builds and executes the 3-stage Flink batch DAG and returns final stats.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/OrphanDirCleanTask.java Adds task type for cleaning an orphan table/partition directory.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/EmptyDirSweeper.java Adds end-of-run empty directory reclamation logic (dry-run aware).
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/CleanTask.java Adds marker interface for stage-1 emitted work items.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/CleanStats.java Adds aggregatable stats object including “touched dirs” for sweeping.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/BucketCleanTask.java Adds task type carrying bucket dirs + active refs for file-level cleanup.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/job/BucketCleaner.java Adds per-bucket directory walker applying rules and safe deletion.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/fs/SafeDeleter.java Centralizes deletion operations with dry-run + rate limiting + audit logging.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/config/OrphanCleanConfig.java Adds CLI configuration parsing and validation for the orphan cleanup action.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/build/RpcListStatus.java Adds shared per-target RPC list status representation.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/build/MaxKnownIdsTracker.java Adds per-run max-known ID tracking used for orphan dir guards.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/build/LogActiveRefsFetchResult.java Adds detailed per-target/per-bucket log manifest read results.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/build/KvActiveRefsFetchResult.java Adds per-target KV active snapshot dir fetch result representation.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/build/ActiveRefsFetcher.java Implements coordinator-RPC-driven active ref fetching with retries and parsing.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/orphan/audit/AuditLogger.java Adds structured audit logger for the cleanup action.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/ActionLoader.java Adds ServiceLoader-based action discovery and CLI dispatch.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/ActionFactory.java Adds SPI interface for action factories.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/action/Action.java Adds base action interface (build + run).
fluss-flink/fluss-flink-action/src/main/java/org/apache/fluss/flink/action/FlussFlinkActionEntrypoint.java Adds the main entrypoint for the Flink action shaded jar.
fluss-flink/fluss-flink-action/pom.xml Introduces new shaded Flink action jar module.
fluss-flink/fluss-flink-2.2/src/test/java/org/apache/fluss/flink/action/orphan/Flink22OrphanFilesCleanITCase.java Adds Flink 2.2 orphan cleanup IT case class.
fluss-flink/fluss-flink-2.2/src/main/java/org/apache/fluss/flink/adapter/MultipleParameterToolAdapter.java Mirrors CLI adapter enhancements for Flink 2.2 variant.
fluss-flink/fluss-flink-1.20/src/test/java/org/apache/fluss/flink/action/orphan/Flink20OrphanFilesCleanITCase.java Adds Flink 1.20 orphan cleanup IT case class.
fluss-flink/fluss-flink-1.19/src/test/java/org/apache/fluss/flink/action/orphan/Flink19OrphanFilesCleanITCase.java Adds Flink 1.19 orphan cleanup IT case class.
fluss-filesystems/fluss-fs-hadoop/src/main/java/org/apache/fluss/fs/hdfs/HadoopFileStatus.java Exposes HDFS modification time via FileStatus.
fluss-common/src/test/java/org/apache/fluss/fs/FileStatusTest.java Adds test locking down fail-safe default for modification time.
fluss-common/src/main/java/org/apache/fluss/utils/FlussPaths.java Makes remote log metadata dir name constant public for reuse.
fluss-common/src/main/java/org/apache/fluss/fs/local/LocalFileStatus.java Exposes local FS modification time via FileStatus.
fluss-common/src/main/java/org/apache/fluss/fs/FileStatus.java Adds default getModificationTime() (fail-safe MAX_VALUE) to FileStatus interface.
fluss-client/src/main/java/org/apache/fluss/client/utils/ClientRpcMessageUtils.java Adds helper to convert TableBucket to PbTableBucket.
fluss-client/src/main/java/org/apache/fluss/client/admin/FlussAdmin.java Implements client-side admin methods for the new list RPCs.
fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java Extends Admin API with internal default methods for the new list RPCs.
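
The file summary above describes rules that decide per file using an active reference set plus an age cutoff, with `FileStatus.getModificationTime()` defaulting to a fail-safe `MAX_VALUE`. The following is a minimal sketch of how those pieces could interact, not the PR's `LogSegmentRule`. The class `SegmentRuleSketch`, its `decide` signature, and the choice to return `KEEP` (rather than `DEFER`) for young files are assumptions made for illustration.

```java
import java.util.Set;

/** Decision vocabulary named in the PR summary. */
enum Decision { DELETE, KEEP, DEFER, SKIP_UNKNOWN }

/** Hypothetical illustration only; the real rule classes may order or name checks differently. */
final class SegmentRuleSketch {
    /** Assumed stand-in for the fail-safe default of FileStatus#getModificationTime(). */
    static final long UNKNOWN_MTIME = Long.MAX_VALUE;

    static Decision decide(String path, long mtimeMillis, Set<String> activeRefs, long cutoffMillis) {
        if (activeRefs.contains(path)) {
            return Decision.KEEP; // still referenced by the active set
        }
        if (mtimeMillis >= cutoffMillis) {
            // Too young to judge, or mtime unknown: MAX_VALUE can never fall below the cutoff.
            return Decision.KEEP;
        }
        return Decision.DELETE; // unreferenced and older than the cutoff
    }
}
```

With this shape, a filesystem that cannot report a modification time can never cause a deletion, which is the point of a fail-safe default.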

platinumhamburg force-pushed the feat/orphan-files-cleanup branch from 901dc41 to 0a34c82, June 1, 2026 03:04

luoyuxia left a comment

Contributor

@platinumhamburg Thanks for the PR. Left a minor comment. PTAL.

Comment thread fluss-server/src/main/java/org/apache/fluss/server/tablet/TabletService.java Outdated

swuferhong left a comment

Contributor

Hi @platinumhamburg, thanks for your great work. I left some comments.

Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/FlussAdmin.java Outdated
Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/FlussAdmin.java Outdated
Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java Outdated
Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java Outdated
Comment thread fluss-server/src/main/java/org/apache/fluss/server/zk/ZooKeeperClient.java Outdated
Comment thread fluss-flink/fluss-flink-action/pom.xml Outdated
platinumhamburg added a commit to platinumhamburg/fluss that referenced this pull request Jun 2, 2026
- Move orphan RPCs from AdminReadOnlyGateway to AdminGateway
- Fix CoordinatorContext thread safety via AccessContextEvent
- Add partition ownership validation in orphan RPCs
- Decouple Admin API from PB types with domain models
- Catch IOException in SafeDeleter to prevent batch job failure
- Skip shared SST directory listing in BucketCleaner
- Pass extraConfigs to StatsAggregateOperator for FS init
- Set maxParallelism(1) on single-parallelism operators
- Rename FlussFlinkActionEntrypoint to FlussActionEntrypoint
- Rename Flink19/Flink20 ITCase to Flink119/Flink120
- Remove unused methods and fix thread leak in test
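
The "Catch IOException in SafeDeleter" item above, together with the dry-run and non-empty-directory behavior listed for `SafeDeleterTest`, could look roughly like the sketch below. It uses `java.nio.file` on a local path rather than the Fluss `FileSystem` abstraction, and `SafeDeleteSketch` and its `Outcome` enum are hypothetical names, not the PR's API.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of a delete helper that never fails the surrounding batch job. */
final class SafeDeleteSketch {
    enum Outcome { DELETED, DRY_RUN, SKIPPED, FAILED }

    static Outcome delete(Path path, boolean dryRun) {
        if (dryRun) {
            return Outcome.DRY_RUN; // report only, touch nothing
        }
        try {
            // deleteIfExists returns false when the path is already gone.
            return Files.deleteIfExists(path) ? Outcome.DELETED : Outcome.SKIPPED;
        } catch (DirectoryNotEmptyException e) {
            return Outcome.SKIPPED; // non-empty directory: treated as a no-op
        } catch (IOException e) {
            return Outcome.FAILED; // record the failure and let the caller continue
        }
    }
}
```

Returning an outcome instead of throwing lets the caller count failures in the run statistics while the remaining files are still processed.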
platinumhamburg added a commit to platinumhamburg/fluss that referenced this pull request Jun 2, 2026
platinumhamburg force-pushed the feat/orphan-files-cleanup branch 2 times, most recently from 71892a7 to 1c61b9c, June 2, 2026 14:13
@platinumhamburg

Contributor Author

@swuferhong @luoyuxia Thanks for the review. All issues resolved; please re-review when you have time.

luoyuxia left a comment

Contributor

Thanks. LGTM!

platinumhamburg added a commit to platinumhamburg/fluss that referenced this pull request Jun 11, 2026
platinumhamburg force-pushed the feat/orphan-files-cleanup branch 3 times, most recently from cb4d8ec to 43db84b, June 12, 2026 03:34

wuchong left a comment

Member

The RateLimiter in the job only throttles deletes. The job also issues many read-side FS calls — especially listStatus over log tablet dirs that can hold tens of thousands of segment dirs. Unthrottled, this can saturate the underlying store and impact cluster stability. Out of scope for this PR. Could you open a tracking issue for this and replace this TODO with a one-line pointer to it?

Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java Outdated
Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java Outdated
wuchong commented Jun 13, 2026

Member

I also pushed a commit to improve the code a bit for some minor issues. Please take a look.

…rigger

- ScanAndCleanFunction: change open(Configuration) to open(OpenContext)
  for Flink 2.x compatibility (Flink 2.x removed the Configuration overload)
- FlussClusterExtension: triggerSnapshot() returns null on no-op instead of
  failing hard when snapshot ID does not advance (initSnapshot skips when
  logOffset <= lastSnapshotOffset)
- triggerAndWaitSnapshots() silently skips null buckets (original behavior)
platinumhamburg force-pushed the feat/orphan-files-cleanup branch from 3430f94 to 16c5d17, June 15, 2026 07:57
platinumhamburg force-pushed the feat/orphan-files-cleanup branch from 16c5d17 to 7a26c4e, June 15, 2026 08:16
Move RemoteLogManifest and RemoteLogManifestJsonSerde from fluss-server
to fluss-common (org.apache.fluss.remote) so the orphan-files cleanup
client can reuse the shared serde via RemoteLogManifest.fromJsonBytes()
instead of maintaining a parallel hand-written JSON parser.

Reuse Flink minicluster across tests in the same class via Flink's
AbstractTestBase. Cuts Flink120OrphanFilesCleanITCase wall time from
~78s to ~66s and prevents the package-private parent from being picked
up as a standalone test in fluss-flink-common.

The action depends on ListRemoteLogManifests and ListKvSnapshots, both
new server RPCs. The action jar may be deployed against an older
cluster, so probe both RPCs once at the start of stage 1 and abort with
a clear error if the server raises UnsupportedVersionException. Without
this guard the job would silently emit skip_log_target / skip_kv_target
audit events for every bucket and exit with deleted=0, masking the
incompatibility.

Route the final scanned/deleted/failures/bytes-reclaimed counters through
the dedicated fluss.orphan.audit logger so the run-level outcome lands in
the same sink as per-file action= events and can be queried alongside
deletes and skips. The application-side LOG.info line is kept for local
debugging.

The application-side completion line duplicated the structured summary
written by AuditLogger.logSummary. Remove it; the audit logger inherits
root appenders so local runs still surface the result.
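
The compatibility guard described in the commit messages above (probe ListRemoteLogManifests and ListKvSnapshots once, abort if the server raises UnsupportedVersionException) can be sketched as follows. This is an illustration under assumptions: `StubUnsupportedVersionException` is a local stand-in for the Fluss exception, `RpcProbeSketch.requireSupported` is a hypothetical helper, and a real client call would likely surface the error wrapped in a future's `ExecutionException` that needs unwrapping first.

```java
/** Local stand-in for Fluss's UnsupportedVersionException; assumed unchecked for this sketch. */
class StubUnsupportedVersionException extends RuntimeException {
    StubUnsupportedVersionException(String message) {
        super(message);
    }
}

/** Hypothetical fail-fast probe run once before any scan work is scheduled. */
final class RpcProbeSketch {
    static void requireSupported(String rpcName, Runnable probe) {
        try {
            probe.run(); // one cheap call against the coordinator
        } catch (StubUnsupportedVersionException e) {
            // Abort loudly instead of letting every bucket be skipped silently.
            throw new IllegalStateException(
                    "Cluster does not support " + rpcName
                            + "; the orphan cleanup action needs a newer server", e);
        }
    }
}
```

Failing before stage 1 emits any work keeps an incompatible deployment from looking like a clean run with `deleted=0`.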
@platinumhamburg

Contributor Author

The RateLimiter in the job only throttles deletes. The job also issues many read-side FS calls — especially listStatus over log tablet dirs that can hold tens of thousands of segment dirs. Unthrottled, this can saturate the underlying store and impact cluster stability. Out of scope for this PR. Could you open a tracking issue for this and replace this TODO with a one-line pointer to it?

@wuchong Thanks for catching this. I agree this should be fixed in this PR because unthrottled remote filesystem reads/lists can be a production stability risk on object stores with QPS limits.

I updated the action to use a single shared remote filesystem operation limiter instead of limiting deletes only. The new option is --remote-fs-op-rate-limit-per-second, and the budget covers remote FS metadata reads, manifest reads, listStatus, and delete operations.

For the distributed ScanAndClean stage, the configured rate is split across subtasks based on the runtime operator parallelism. This is a best-effort job-level target because Flink does not provide a cross-JVM global RateLimiter for this action. Scope enumeration uses the same option as well.
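
A minimal sketch of the split described above, assuming an even division of the configured budget by operator parallelism. `SubtaskRateSketch` and `PacedLimiter` are hypothetical names; the PR's actual limiter implementation is not shown in this conversation and may differ.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: splits a job-level ops/sec budget evenly across subtasks. */
final class SubtaskRateSketch {
    static double perSubtaskRate(double jobRatePerSecond, int parallelism) {
        if (jobRatePerSecond <= 0 || parallelism <= 0) {
            throw new IllegalArgumentException("rate and parallelism must be positive");
        }
        return jobRatePerSecond / parallelism;
    }
}

/** Minimal pacing limiter local to one subtask (one JVM); not the limiter used by the PR. */
final class PacedLimiter {
    private final long intervalNanos;
    private long nextFreeNanos = System.nanoTime();

    PacedLimiter(double permitsPerSecond) {
        this.intervalNanos = (long) (1_000_000_000d / permitsPerSecond);
    }

    /** Blocks until the next permit is due; call before every list, read, or delete. */
    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextFreeNanos - now;
        nextFreeNanos = Math.max(nextFreeNanos, now) + intervalNanos;
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

Because each subtask only paces itself, the job-level rate is a target rather than a guarantee: idle subtasks do not lend their unused share to busy ones.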

…rtion

BucketCleaner only sweeps empty directories whose mtime is older than the
cutoff. createOldSegmentFile was making the file old but leaving the parent
directory at 'now', so the post-order sweep skipped it. Add makeOld on the
directory to match production reality.
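
The fixture fix above can be illustrated with plain `java.nio.file` on a local temp directory. The names `OldFixtureSketch`, `createOldSegmentDir`, and `olderThan` are hypothetical; only `makeOld` echoes the helper named in the commit message, and its real signature is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical test-fixture sketch: back-date both the file and its parent directory. */
final class OldFixtureSketch {
    static void makeOld(Path path, long mtimeMillis) {
        try {
            Files.setLastModifiedTime(path, FileTime.fromMillis(mtimeMillis));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean olderThan(Path path, long cutoffMillis) {
        try {
            return Files.getLastModifiedTime(path).toMillis() < cutoffMillis;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Path createOldSegmentDir(long mtimeMillis) {
        try {
            Path dir = Files.createTempDirectory("segment");
            Path file = Files.createFile(dir.resolve("segment.log"));
            makeOld(file, mtimeMillis);
            // Creating the file set the directory mtime to "now", so back-date it as well.
            makeOld(dir, mtimeMillis);
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the second `makeOld` call, a sweep that checks the directory's own mtime against the cutoff would see a fresh directory and skip it.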

5 participants