
feat(installer,chart): place datasets on a network mount while MySQL stays local#262

Open
saadqbal wants to merge 3 commits into develop from feat/743-nfs-host-uid-storage-split

saadqbal (Contributor) commented Jun 16, 2026
What

Storage work for tracebloc/backend#743 — let VMs whose real storage is a network (NFS/CIFS) mount keep the database on local disk while the large dataset volume lives on the network mount.

Stacked on #261 (the preflight guard). Until #261 merges, this PR's diff also includes #261's commit — review/merge #261 first, then this (or review this against fix/743-preflight-network-fs).

Why

MySQL/InnoDB corrupts on NFS and the chart's root chown init-container is blocked by NFS root_squash, so the DB must stay local. But the dataset volume (usually the bulk of storage) can live on the customer's network mount — provided the pod that writes it runs as the uid that owns the export.

Changes

Chart (client/):

  • Parameterize the dataset PV hostPath base via hostPath.datasetPath (helper tracebloc.clientDataHostPath). The default, /tracebloc, keeps the rendered output byte-identical. The mysql and logs PV paths are untouched. values.yaml and the schema are updated, with a default dict nil-guard so --reuse-values upgrades still render.

Installer (bash + PowerShell):

  • New HOST_DATASET_DIR: validated (must exist + be writable; may live outside $HOME, unlike HOST_DATA_DIR; system paths barred), bind-mounted into k3d at a distinct /tracebloc-data path; the dataset dir is created there while mysql + logs stay local.
  • When set, the generated values set hostPath.datasetPath=/tracebloc-data and (Linux) pass HOST_UID/HOST_GID env to jobs-manager — consumed by the client-runtime ingestor change (separate PR) so spawned ingestion pods write the host-owned NFS export as the owning uid.
  • Preflight notes the dataset dir is exempt from the network-FS hard-fail.
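
The HOST_DATASET_DIR validation rules above (must exist, must be writable, may live outside $HOME, system paths barred) could look roughly like the sketch below. This is not the installer's code: the function name validate_dataset_dir, the absolute-path requirement, and the exact list of barred system paths are assumptions made for illustration.

```shell
# Hypothetical helper; the name and the barred-path list are assumptions.
validate_dataset_dir() {
  local dir="$1"
  if [ -z "$dir" ]; then
    echo "HOST_DATASET_DIR is empty" >&2
    return 1
  fi
  # Assumed rule: require an absolute path so the bind mount is unambiguous.
  case "$dir" in
    /*) ;;
    *) echo "HOST_DATASET_DIR must be an absolute path: $dir" >&2; return 1 ;;
  esac
  # Strip trailing slashes so "/etc/" is treated like "/etc".
  local norm="$dir"
  while [ "$norm" != "/" ] && [ "$norm" != "${norm%/}" ]; do
    norm="${norm%/}"
  done
  if [ -z "$norm" ]; then
    norm="/"
  fi
  # Bar system paths (illustrative list). Note there is no $HOME check here:
  # unlike HOST_DATA_DIR, the dataset dir may live outside $HOME.
  case "$norm" in
    /|/etc|/etc/*|/usr|/usr/*|/bin|/bin/*|/sbin|/sbin/*|/boot|/boot/*|/proc|/proc/*|/sys|/sys/*|/dev|/dev/*)
      echo "refusing system path for HOST_DATASET_DIR: $dir" >&2
      return 1
      ;;
  esac
  if [ ! -d "$norm" ]; then
    echo "HOST_DATASET_DIR does not exist: $dir" >&2
    return 1
  fi
  if [ ! -w "$norm" ]; then
    echo "HOST_DATASET_DIR is not writable: $dir" >&2
    return 1
  fi
  return 0
}
```

A caller would run `validate_dataset_dir "$HOST_DATASET_DIR"` before creating the cluster and abort on a non-zero status, so a bad path is rejected before any bind mount is configured.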

Docs: docs/INSTALL.md checklist + docs/SECURITY.md §5.4.

Tests

  • helm-unittest: new shared_images_pvc_test.yaml + mysql/logs split-only guards — 259 pass.
  • bats: HOST_DATASET_DIR validation, second-mount, dir-split, values-generation.
  • helm lint --strict, shellcheck --severity=error, PowerShell parse — all clean.

Cross-repo

The end-to-end NFS write path also needs the client-runtime ingestor-uid PR (run the ingestion pod as HOST_UID). This PR + that one together complete backend#743's "datasets on NFS".

🤖 Generated with Claude Code


Note

Medium Risk
Changes persistent volume paths, k3d bind mounts, and install preflight behavior; misconfiguration could misroute datasets or block installs, but MySQL remains local by design and guards target that failure mode.

Overview
backend#743 splits storage so MySQL and logs stay on local disk while the shared dataset volume can live on a separate network (NFS) mount.

The Helm chart adds hostPath.datasetPath (helper tracebloc.clientDataHostPath) so only the shared-images / dataset PV base path changes; default /tracebloc keeps renders unchanged. MySQL and logs hostPath paths are unchanged.

Installers gain HOST_DATASET_DIR: optional host dir (may be outside $HOME, must exist and be writable), bind-mounted into k3d at /tracebloc-data, with hostPath.datasetPath=/tracebloc-data and (Linux) HOST_UID/HOST_GID in generated values for NFS root_squash ingestion writes. Reusing an existing cluster without that bind mount fails fast so datasets are not placed on ephemeral node storage.
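
As a rough illustration of the values generation described here, the sketch below emits the extra values only when a dataset dir is configured. Only hostPath.datasetPath and the HOST_UID/HOST_GID names come from this PR. The function name emit_dataset_values and the jobsManager.env key layout are hypothetical, and the sketch assumes the user running the installer owns the export.

```shell
# Hypothetical sketch; emit_dataset_values and the jobsManager.env layout
# are illustrative, not the chart's confirmed values structure.
emit_dataset_values() {
  local dataset_dir="${1:-}"
  # No dataset dir configured: emit nothing so the chart default (/tracebloc) applies.
  if [ -z "$dataset_dir" ]; then
    return 0
  fi
  # The chart sees the in-node mount point, not the host path.
  printf 'hostPath:\n  datasetPath: /tracebloc-data\n'
  # HOST_UID/HOST_GID are only passed on Linux.
  # Assumption: the invoking user is the uid/gid that owns the export.
  if [ "$(uname -s)" = "Linux" ]; then
    printf 'jobsManager:\n  env:\n    HOST_UID: "%s"\n    HOST_GID: "%s"\n' \
      "$(id -u)" "$(id -g)"
  fi
}
```

Calling `emit_dataset_values "$HOST_DATASET_DIR"` with the variable unset prints nothing, which matches the stated goal that default renders stay unchanged.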

Preflight now hard-fails when HOST_DATA_DIR is on NFS/CIFS/SMB (override TRACEBLOC_ALLOW_NETWORK_FS); HOST_DATASET_DIR is noted as allowed on network FS. Docs (INSTALL.md, SECURITY.md §5.4) and helm/bats/Pester tests cover the split and guards.

Reviewed by Cursor Bugbot for commit c31d866. Bugbot is set up for automated code reviews on this repo.

saadqbal and others added 2 commits June 16, 2026 17:30
Detect NFS/CIFS/SMB for HOST_DATA_DIR in preflight (bash + PowerShell) and
fail fast with an actionable message instead of a cryptic MySQL
CrashLoopBackOff ~20 min into install: MySQL/InnoDB corrupts on network
storage and the chart root chown init-container is blocked by NFS root_squash.

- preflight.sh: _pf_fstype reader (findmnt, then GNU stat, then df+mount;
  portable incl. macOS) + _pf_storage_type wired into run_preflight. Allowlists
  network fstypes so local FSes including overlay/tmpfs (CI) pass.
- install-k8s.ps1: Get-PfFsType (UNC / network drive) + Test-Preflight check.
- TRACEBLOC_ALLOW_NETWORK_FS=1 overrides (mirrors TRACEBLOC_ALLOW_ARM64).
- Tests: 10 bats cases + Pester cases (network -> fail, override, undetermined,
  Windows-only Get-PfFsType reader).
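
A minimal sketch of the classification and override logic described in this commit, under stated assumptions: the function names fs_type_of, is_network_fstype and check_storage_type are hypothetical (the commit's own helpers are _pf_fstype and _pf_storage_type), the list of network fstypes is a guess, and only the findmnt reader is shown, with the GNU stat and df+mount fallbacks omitted.

```shell
# Illustrative only; names and the fstype list are assumptions.

# Print the filesystem type backing a path, or nothing if undetermined.
fs_type_of() {
  local path="$1" t=""
  if command -v findmnt >/dev/null 2>&1; then
    t="$(findmnt -n -o FSTYPE --target "$path" 2>/dev/null | head -n 1)"
  fi
  printf '%s\n' "$t"
}

# Match only known network fstypes, so overlay, tmpfs and other local
# filesystems (and an undetermined, empty type) pass.
is_network_fstype() {
  case "$1" in
    nfs|nfs3|nfs4|cifs|smb|smb2|smb3|smbfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Fail on a network fstype unless TRACEBLOC_ALLOW_NETWORK_FS=1 is set.
check_storage_type() {
  local fstype="$1"
  if is_network_fstype "$fstype" && [ "${TRACEBLOC_ALLOW_NETWORK_FS:-0}" != "1" ]; then
    echo "HOST_DATA_DIR is on network storage ($fstype); MySQL must stay on local disk." >&2
    echo "Set TRACEBLOC_ALLOW_NETWORK_FS=1 to override." >&2
    return 1
  fi
  return 0
}
```

In use, the two pieces compose as `check_storage_type "$(fs_type_of "$HOST_DATA_DIR")"`. Matching a list of network types, rather than a list of local ones, means an unknown filesystem passes instead of blocking the install.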

Part 1 of 3 for tracebloc/backend#743.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stays local

Storage split for VMs whose real storage is an NFS/CIFS mount (backend#743):
the database must stay on local disk (InnoDB over NFS is unsafe) but the large
dataset volume can live on the network mount.

Chart:
- Parameterize the dataset PV hostPath base via hostPath.datasetPath (helper
  tracebloc.clientDataHostPath). Default /tracebloc keeps it byte-identical;
  mysql + logs PV paths are unchanged. values.yaml + schema + nil-guard for
  --reuse-values upgrades.

Installer (bash + PowerShell):
- New HOST_DATASET_DIR: validated (must exist + be writable; MAY live outside
  $HOME unlike HOST_DATA_DIR; system paths barred), bind-mounted into k3d at a
  distinct /tracebloc-data path; the dataset dir is created there while mysql +
  logs stay local. When set, the generated values set
  hostPath.datasetPath=/tracebloc-data and (Linux) pass HOST_UID/HOST_GID env to
  jobs-manager so spawned ingestion pods write the host-owned NFS export as the
  owning uid. Preflight notes the dataset dir is exempt from the network-FS block.

Tests: new shared_images_pvc_test.yaml + mysql/logs split-only guards
(helm-unittest, 259 pass); HOST_DATASET_DIR validation, second-mount, dir-split
and values-generation cases (bats). Docs: INSTALL.md checklist + SECURITY.md 5.4.

Part 2/3 of backend#743. The end-to-end NFS write path also needs the
client-runtime ingestor-uid change (separate PR) so jobs-manager reads HOST_UID.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 096c4bb.

Comment thread scripts/lib/cluster.sh
@LukasWodka (Contributor)

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…mount (Bugbot)

The HOST_DATASET_DIR -> /tracebloc-data bind mount is baked into the k3d nodes at
create time (_create_new_cluster / the PS1 equivalent). k3d cannot add a mount to a
RUNNING cluster, but install-client-helm.sh still wrote `datasetPath: /tracebloc-data`
into the generated values whenever HOST_DATASET_DIR was merely set — so an
existing-cluster re-run pointed the chart's dataset PV at ephemeral in-node storage,
silently putting datasets on disposable storage instead of the network export (lost
on a restart).

Add _check_existing_cluster_dataset_mount (cluster.sh) + the PowerShell equivalent,
mirroring the existing _check_existing_cluster_proxy/bind drift checks: on an existing
cluster with HOST_DATASET_DIR set, inspect the server node for the /tracebloc-data
mount and FAIL FAST with the recreate remedy if it is absent — rather than installing
a quietly misrouted dataset volume. Fail-fast (not warn) because this is silent data
loss, consistent with the network-FS fail-fast guard. Values generation needs no
change: the install now stops in Step 2, before helm runs.
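
The four behaviours this check is meant to have (unset is a no-op, mount present passes, mount absent fails fast, a failed inspect is a no-op) can be sketched as follows. The function names are hypothetical stand-ins for _check_existing_cluster_dataset_mount, and reading mount destinations through `docker inspect` on the server node container is an assumption about how the inspection is done.

```shell
# Hypothetical sketch; names and the docker-based inspection are assumptions.

# Reads one mount destination per line on stdin; succeeds only on an
# exact /tracebloc-data line.
has_dataset_mount() {
  grep -qx '/tracebloc-data'
}

check_existing_cluster_dataset_mount() {
  local node="$1" mounts
  # HOST_DATASET_DIR unset: nothing to verify.
  if [ -z "${HOST_DATASET_DIR:-}" ]; then
    return 0
  fi
  # If the node cannot be inspected, do not block the install.
  if ! mounts="$(docker inspect \
      --format '{{range .Mounts}}{{println .Destination}}{{end}}' \
      "$node" 2>/dev/null)"; then
    return 0
  fi
  if printf '%s\n' "$mounts" | has_dataset_mount; then
    return 0
  fi
  echo "Existing cluster node $node has no /tracebloc-data mount." >&2
  echo "Recreate the cluster so HOST_DATASET_DIR is bind-mounted." >&2
  return 1
}
```

The node argument would be the k3d server node's container name. Keeping the match in a separate stdin-driven function makes the present and absent cases testable without a running cluster.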

+4 bats (cluster.bats): unset -> no-op, mount present -> pass, mount ABSENT -> fail
fast, inspect fails -> no-op. bash + shellcheck clean; pwsh parses the .ps1; full
cluster suite 27/27.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>