Move your data into the tracebloc training environment: validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.
Your raw data
         │
         ▼
┌──────────────────┐      ┌────────────────────────────────────┐
│  Data ingestor   │─────►│  Your Kubernetes cluster           │
│                  │      │                                    │
│  Validates       │      │  Validated dataset                 │
│  Preprocesses    │      │  (ready for training)              │
│  Transfers       │      │                                    │
└──────────────────┘      └──────────────────┬─────────────────┘
                                             │
                                       Metadata only
                                             │
                                             ▼
                               ┌──────────────────────────┐
                               │  tracebloc web app       │
                               │  (dataset management UI) │
                               └──────────────────────────┘
Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
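To make "metadata only" concrete, here is a minimal sketch of deriving schema and summary statistics from a label file without carrying any raw rows along. It is illustrative only: the function name `summarize_csv` and the shape of the returned dictionary are assumptions, not the payload the ingestor actually sends.

```python
import csv
import io
from collections import Counter


def summarize_csv(text: str, label_column: str) -> dict:
    """Return column names, row count, and label counts; no raw rows are kept."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        "columns": list(reader.fieldnames or []),
        "row_count": len(rows),
        "label_distribution": dict(Counter(row[label_column] for row in rows)),
    }


# Hypothetical sample data standing in for a real labels.csv.
SAMPLE = "filename,label\ncat1.jpg,cat\ndog1.jpg,dog\ncat2.jpg,cat\n"

metadata = summarize_csv(SAMPLE, label_column="label")
print(metadata)
```

The summary contains counts and column names but none of the file names or row values from the input.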
| Type | Categories |
|---|---|
| Image | image_classification, object_detection, keypoint_detection, semantic_segmentation |
| Text / NLP | text_classification, masked_language_modeling |
| Tabular | tabular_classification, tabular_regression |
| Time series | time_series_forecasting, time_to_event_prediction |
Each template ships a sample dataset and an example ingest.yaml you can copy as a starting point.
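If you script around several datasets, the table above can be kept as a small lookup. This is a sketch for your own tooling, assuming the category names listed above; `SUPPORTED_CATEGORIES` and `data_type_for` are hypothetical names and are not exported by the package.

```python
# Category names copied from the table above.
SUPPORTED_CATEGORIES = {
    "Image": [
        "image_classification",
        "object_detection",
        "keypoint_detection",
        "semantic_segmentation",
    ],
    "Text / NLP": ["text_classification", "masked_language_modeling"],
    "Tabular": ["tabular_classification", "tabular_regression"],
    "Time series": ["time_series_forecasting", "time_to_event_prediction"],
}


def data_type_for(category: str) -> str:
    """Return the data type a category belongs to, or raise if it is unknown."""
    for data_type, categories in SUPPORTED_CATEGORIES.items():
        if category in categories:
            return data_type
    raise ValueError(f"unsupported category: {category!r}")


print(data_type_for("tabular_regression"))
```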
Describe your dataset in ~8 lines of YAML, then helm install. The official ingestor image (this package, signed + SBOM-attested, published as ghcr.io/tracebloc/ingestor) runs it. No Dockerfile, no Python script.
1. One-time: add the chart repo on your workstation.
helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
The tracebloc/client parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The tracebloc/ingestor subchart submits per-dataset ingestion runs against it.
Already installed the client via the one-liner (bash <(curl -fsSL https://tracebloc.io/i.sh))? Use --reset-then-reuse-values so the helm upgrade doesn't drop the values the installer applied:
helm upgrade <workspace> tracebloc/client -n <namespace> --reset-then-reuse-values
Append --version <version-number> to pin a specific chart version.
2. Stage your data on the cluster's shared PVC.
The chart doesn't transport data into the cluster; it points at data already accessible on the cluster's shared PVC (client-pvc by default, mounted at /data/shared/ inside the ingestor Pod). Before installing, get your raw files there. The simplest pattern for a small dataset is a throwaway Pod that mounts the PVC and that you fill with kubectl cp; for production you'd typically use an init container with cloud-storage sync. Full staging recipe + manifests: tracebloc/client/ingestor/README.md#stage-your-data-on-the-shared-pvc.
3. Write your ingest.yaml.
The example below is for image_classification. Other categories require different fields: for example, tabular_classification has no images: and instead needs a typed schema: block. Don't copy this one blindly; grab the matching file from examples/yaml/ (one per category) and edit from there. Per-category sample data and READMEs live under templates/.
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
The top-level shape (apiVersion, kind, category, table, intent, label) is the same for every category. The category field picks the validator set, file-extension defaults, and column conventions; the data-source fields (csv:, images:, schema:, …) vary per category. All paths are paths inside the ingestor Pod, under the PVC mount you populated in step 2.
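Before running helm install, a quick local check that the shared top-level keys are present can catch typos early. The sketch below is an assumption-laden convenience, not the ingestor's own validation: `parse_flat_yaml` and `missing_keys` are hypothetical helpers, and the parser only understands top-level `key: value` lines, so use a real YAML library for anything with nested blocks such as schema:.

```python
# Keys the text above describes as common to every category.
REQUIRED_KEYS = ("apiVersion", "kind", "category", "table", "intent", "label")


def parse_flat_yaml(text: str) -> dict:
    """Parse top-level 'key: value' lines only; this is not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        key, sep, value = stripped.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def missing_keys(config: dict) -> list:
    """Return the required top-level keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


EXAMPLE = """\
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
"""

config = parse_flat_yaml(EXAMPLE)
print(missing_keys(config))
```

An empty list means every shared key is present; category-specific fields still need to be checked against the matching example file.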
4. Install once per dataset.
helm install my-cats-dogs tracebloc/ingestor \
--namespace tracebloc \
--set-file ingestConfig=./ingest.yaml
The ingestor runs once: it validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset. Customers never build an image, never write a Dockerfile, and never track digest versions; the cluster's auto-upgrade flow keeps the official image current.
Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs): tracebloc/client/ingestor/README.md.
Use this when the declarative schema can't express what your data needs, typically when you have non-trivial preprocessing logic, a custom validator, or a BaseProcessor subclass.
1. Install the package.
pip install tracebloc-ingestor
2. Pick a template + adapt the script.
cp templates/image_classification/image_classification.py .
The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator, etc.) and the Database / APIClient helpers. See examples/ for working scripts.
3. Build + deploy as a Kubernetes Job.
The legacy Dockerfile and ingestor-job.yaml remain the canonical pattern for custom-processor flows:
docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml
The Job needs these environment variables (set in ingestor-job.yaml):
| Variable | What it is |
|---|---|
| CLIENT_ID, CLIENT_PASSWORD | Tracebloc client credentials |
| CLIENT_PVC | PVC name shared with the client (must match values.yaml) |
| MYSQL_HOST | Hostname of the client's MySQL service |
| SRC_PATH | Where your raw data is mounted in the ingestor pod |
| LABEL_FILE | Path to labels (e.g. Xy_train.csv) |
| TABLE_NAME | Destination table name in the client database |
| TITLE | (optional) Human-readable dataset name |
| LOG_LEVEL | (optional) INFO, WARNING, ERROR |
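A custom script can fail fast when one of these variables is missing instead of erroring midway through a run. The sketch below assumes that every variable not marked optional in the table is required, and that LOG_LEVEL falls back to INFO; both are assumptions, and `load_job_env` is a hypothetical helper, not part of the package.

```python
import os

# Assumed required: everything the table does not mark as optional.
REQUIRED_VARS = (
    "CLIENT_ID",
    "CLIENT_PASSWORD",
    "CLIENT_PVC",
    "MYSQL_HOST",
    "SRC_PATH",
    "LABEL_FILE",
    "TABLE_NAME",
)
# Assumed defaults for the optional variables.
OPTIONAL_DEFAULTS = {"TITLE": "", "LOG_LEVEL": "INFO"}


def load_job_env(environ=None) -> dict:
    """Collect the Job's settings, raising if a required variable is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED_VARS}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = environ.get(name, default)
    return settings


# Placeholder values for illustration; a real Job reads them from os.environ.
example_env = {name: "placeholder" for name in REQUIRED_VARS}
settings = load_job_env(example_env)
print(settings["LOG_LEVEL"])
```

Calling `load_job_env()` with no argument reads from the process environment, which is what the Job would do inside the Pod.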
If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), the stock Dockerfile and ingestor-job.yaml won't admit. (The declarative path's image is already PSA-restricted-compatible; this section only applies to custom Dockerfiles built from this repo.) Two changes are needed.
Check first:
kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq
Look for pod-security.kubernetes.io/enforce: restricted. If absent, the stock files admit fine and you can skip this section.
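The same check can be automated in a deployment script. This is a minimal sketch, assuming the kubectl command above prints the namespace labels as a JSON object; `enforces_restricted` is a hypothetical helper name.

```python
import json

# Standard Kubernetes Pod Security Admission label for the enforce mode.
ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforces_restricted(labels_json: str) -> bool:
    """Return True if the namespace labels enforce the restricted standard."""
    labels = json.loads(labels_json) if labels_json.strip() else {}
    return labels.get(ENFORCE_LABEL) == "restricted"


# Hypothetical label output for a hardened namespace.
sample = '{"kubernetes.io/metadata.name": "tracebloc", "pod-security.kubernetes.io/enforce": "restricted"}'
print(enforces_restricted(sample))
```

Feed it the captured output of the kubectl command; a `False` result means the stock Dockerfile and ingestor-job.yaml should admit unchanged.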
1. Dockerfile: drop root. Append before ENTRYPOINT:
# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001
2. ingestor-job.yaml: add a hardened securityContext. Both pod-level and container-level:
spec:
  template:
    spec:
      securityContext:          # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          # ... existing container spec ...
          securityContext:      # container-level
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
For data that doesn't fit any of the existing templates, subclass BaseIngestor:
from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()

- Python 3.8+
- A tracebloc account
- A running tracebloc client on your infrastructure
Platform · Docs · Data preparation guide · Discord
Maintainers: see RELEASING.md for the release procedure.
Apache 2.0; see LICENSE.
Questions? support@tracebloc.io or open an issue.