Summary
tracebloc dataset rm <name> drops the table but fails to delete the dataset's staging files on the shared PVC. The error is:
teardown incomplete — the table <schema>.<dataset> was dropped, but removing its files failed;
re-run `tracebloc dataset rm <dataset>` to remove the leftover files: removing PVC paths:
exec stream against tracebloc/<jobs-manager-pod>: command terminated with exit code 1
(rm: cannot remove '/data/shared/.tracebloc-staging/<dataset>/labels.csv': Permission denied)
The suggested "re-run" never succeeds — it fails on the same permission error every time. Orphaned staging files accumulate on the shared PVC while the dataset appears removed (the table is gone), masking the leak.
Reported during dataset ingestion/removal testing.
Root cause (verified)
A UID mismatch with no fsGroup bridge:
- The ingestor Job writes staging files as uid 65534: tracebloc/data-ingestors Dockerfile:55 → USER 65534.
- The teardown rm -rf is exec'd inside the jobs-manager pod, which runs as its image UID (not 65534): tracebloc/cli internal/push/teardown.go:91 builds the rm (append([]string{"rm","-rf"}, plan.PVCPaths...)), error wrap :93; exec stream internal/push/stream.go:100; user-facing wrap internal/cli/dataset_rm.go:190-191.
- The jobs-manager pod sets runAsNonRoot: true but no runAsUser and no fsGroup: client chart client/templates/jobs-manager-deployment.yaml:30-33.
Deleting a file requires write permission on its parent directory, so a non-root UID that is not 65534 cannot delete uid-65534-owned files in a directory that is not group-writable → Permission denied. The cli comment at internal/cli/dataset_rm.go:187 ("idempotent, so re-running completes the cleanup") assumes a transient failure; that assumption does not hold for a permission error, so the retry advice is a dead end.
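The permission check behind this failure can be sketched as a small pure function. This is an illustration only, not code from any tracebloc repo: canUnlink is a hypothetical helper, uid 1000 and gid 2000 in the usage note are placeholders (the issue does not state the jobs-manager image UID), and the model assumes classic POSIX mode bits with no ACLs, no extra capabilities, and no sticky bit.

```go
package main

// canUnlink reports whether a process with the given uid and group set may
// remove entries from a directory, judged only by the directory's mode bits.
// Assumptions (hypothetical model): no ACLs, no sticky bit, and uid 0 holds
// the usual override capability.
func canUnlink(dirPerm uint32, dirUID, dirGID, uid uint32, gids []uint32) bool {
	const wx = 0o3 // write + search (execute) bits within one permission class

	if uid == 0 {
		return true // root bypasses the mode check
	}
	if uid == dirUID {
		// Only the owner class applies, even if group or other would allow more.
		return (dirPerm>>6)&wx == wx
	}
	for _, g := range gids {
		if g == dirGID {
			return (dirPerm>>3)&wx == wx
		}
	}
	return dirPerm&wx == wx
}
```

Under these assumptions, canUnlink(0o755, 65534, 65534, 1000, []uint32{1000}) is false, which matches the reported failure, while canUnlink(0o775, 65534, 2000, 1000, []uint32{1000, 2000}) is true, which is the situation a shared group on a group-writable staging directory would create.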
Caveat that makes this non-trivial
fsGroup is not applied to hostPath volumes (kubernetes/kubernetes#138411 — already noted in this chart for the bare-metal mysql init). On bare-metal / hostPath clusters, adding fsGroup alone will not fix it.
Options (design decision needed before coding)
- Shared fsGroup on both pods + group-writable staging: clean on CSI / dynamic PVs, no-op on hostPath.
- Ingestor creates staging dirs group-writable / setgid so any group member can clean up.
- Ingestor owns cleanup of its own staging (delete from a uid-65534 context); cli only drops the table.
- Run the teardown rm as uid 65534 (dedicated pod / initContainer).
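As a rough illustration of the second option, the sketch below creates a staging directory that is group-writable and setgid. It is written in Go only to match the cli; the ingestor's actual language and directory-creation code are not shown in this issue, and createStagingDir, its parameters, and the 0775 mode are assumptions made for the example.

```go
package main

import (
	"os"
	"path/filepath"
)

// createStagingDir (hypothetical helper) creates <root>/<dataset> as a
// group-writable, setgid directory and returns the path together with the
// mode found on disk afterwards, so the caller can log or verify it.
func createStagingDir(root, dataset string) (string, os.FileMode, error) {
	dir := filepath.Join(root, dataset)
	if err := os.MkdirAll(dir, 0o775); err != nil {
		return "", 0, err
	}
	// MkdirAll is subject to the process umask, so set the final bits
	// explicitly. os.ModeSetgid makes new entries inherit the directory's
	// group. Only the leaf directory is changed here, not its parents.
	if err := os.Chmod(dir, os.ModeSetgid|0o775); err != nil {
		return "", 0, err
	}
	info, err := os.Stat(dir)
	if err != nil {
		return "", 0, err
	}
	return dir, info.Mode(), nil
}
```

This only helps if the jobs-manager process is actually a member of the directory's group, and any subdirectories the ingestor creates later would need group write permission as well (for example through a permissive umask) for rm -rf to succeed.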
Affected repos: client (chart securityContext — this issue's home), client-runtime (jobs-manager image / uid), data-ingestors (staging dir perms), cli (teardown path + the misleading retry message).
Secondary fix (cli)
tracebloc/cli internal/cli/dataset_rm.go:187-191: do not advise "re-run … to remove the leftover files" when the failure is a permission error — re-running cannot help. Detect EACCES and give accurate guidance (or an operator-side privileged cleanup path).
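One possible shape for that detection is sketched below. Because the rm runs inside the pod, the cli sees only an exit code and the captured stderr text rather than an errno, so this sketch matches on the message. isPermissionDenied and teardownHint are hypothetical names, the guidance wording is a placeholder, and the string match assumes rm reports errors in English as in the output quoted above.

```go
package main

import "strings"

// isPermissionDenied reports whether stderr captured from the remote rm looks
// like EACCES or EPERM. Assumption: messages are in the C/English locale,
// as in the error quoted in this issue.
func isPermissionDenied(stderr string) bool {
	s := strings.ToLower(stderr)
	return strings.Contains(s, "permission denied") ||
		strings.Contains(s, "operation not permitted")
}

// teardownHint picks the user-facing advice for a failed file cleanup.
// The retry advice is kept only for failures that a retry could fix.
func teardownHint(dataset, stderr string) string {
	if isPermissionDenied(stderr) {
		return "removing staging files for " + dataset +
			" failed with a permission error; re-running will not help," +
			" ask a cluster operator to remove them"
	}
	return "re-run `tracebloc dataset rm " + dataset +
		"` to remove the leftover files"
}
```

Calling teardownHint with the stderr from the Summary would return the operator guidance, while any other failure text keeps the existing re-run message.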
Acceptance criteria
- tracebloc dataset rm <name> removes both the table and all staging files on supported volume types; hostPath behavior documented explicitly.
Refs