Purpose
Define the integration-test and end-to-end-test architecture for the tracebloc platform: what gets tested, at which boundary, and what the trigger is. The deliverable of this epic is the playbook: a per-repo unit/integration inventory plus three concrete ring-shaped gate designs (in-process integration → service-level integration on docker-compose → full-platform E2E on ephemeral Kubernetes). Once a ring is picked from the playbook, the resulting work will be filed as a child epic containing that ring's implementation tickets.
This epic is intentionally dense. It is the reference artifact the team can reach for once strategy is chosen.
How this relates to tracebloc/backend#699 and tracebloc/backend#665
Parallel companion to #699, built on top of #665. Both epics stay open simultaneously.
- #665 (Test coverage, visibility-first) delivered (a) cross-repo coverage measurement and (b) Layer 1 / Layer 2 CI enforcement. We now know what each repo's test suite contains, and CI runs everywhere with required-status gating.
- #699 (Quality system, incident-driven safety net) says the next gate should be decided by a Phase 0 incident review, not by speculation. It lists ~7 candidate gates; "cross-service integration bugs" and "customer-platform incompatibility" are two of them. tracebloc/backend#699 is the meta-question "which gate next?"
- This epic is the answer-in-advance for two of tracebloc/backend#699's candidates. If Phase 0 of tracebloc/backend#699 surfaces integration / cross-service / customer-platform as the dominant incident class, the rings below are the ready-to-execute specs. If Phase 0 picks a different gate (deps CVEs, SDK contract drift, frontend regressions, etc.), this epic stays on the shelf as reference — the per-repo inventory is still useful for any future test investment, regardless of which gate is built first.
We are not asking the team to choose between tracebloc/backend#699 and this epic. We're asking the team to have both in hand when strategy time comes.
Inline [#699 contrast] annotations appear under each ring noting where tracebloc/backend#699's framing would push back or refine. The full summary of agreements / divergences is at the bottom.
State of the test suite today (2026-05-28, pulled from latest CI runs)
Numbers refreshed from the most recent default-branch CI run of each repo (not from tracebloc/backend#665's 2026-05-16 snapshot). Test counts shown are what CI actually executes, which differs from "what's in the repo" for repos whose workflow scope is narrower than their suite (notably backend).

| Repo | Tests run in CI | Coverage % | Last successful run | Workflow | Notes |
|---|---|---|---|---|---|
| data-ingestors | 599 | 98.51% | 2026-05-27 | Tests (py3.11+3.12 matrix) | 2691 stmts / 40 missing. fail-under=95 enforced. Best in fleet. |
| tracebloc-py-package | 876 | 95.42% | 2026-05-27 | Test Suite (py3.11+3.12) | fail-under=95 enforced. At target. |
| design-system | 94 (4 test files) | 94.93% lines / 91.24% branches | 2026-05-26 | Tests | Vitest. Scope still narrow (Button, Input, Switches atoms only). |
| tracebloc-client | 2,082 (3 xfail) | 87.58% | 2026-05-28 | Run Tests | 8,727 stmts / 892 missing. Just below 90%. Largest test suite in the fleet. |
| averaging-service | 51 | 68% | 2026-05-28 | CI | 625 stmts / 197 missing. Grew from 6 tests (May 21) to 51 since tracebloc/backend#665's snapshot. |
| backend | 378 (295 Django + 83 Bandit) | 62.33% | 2026-05-28 | Tests | 11,360 stmts / 3,751 missing / 2,368 branches. First CI-measured baseline; coverage instrumentation just landed via tracebloc/backend#710. Now the lowest in the fleet; substantial Django app surface not yet exercised. |
| client-runtime | 125 | not measured in CI | 2026-05-22 | Tests | pytest only, no coverage flag in workflow. Up from 53 since the 2026-05-16 snapshot. |
| model-zoo | contract tests across 5 jobs (sklearn / tensorflow / pytorch / survival / ruff) | not measured in CI | 2026-05-22 | CI | Contract test by design; coverage % isn't a meaningful number for a template library. |
| tracebloc-website | 209 unit + 82 E2E (last green) | not measured in CI | 2026-05-21 | Test | Vitest + Playwright. No --coverage flag. Workflow red on develop since 2026-05-21 (contentlayer migration PR tracebloc/backend#302 + GTM env guard). |
| frontend-app | 0 on default branch (workflow just added) | never produced | none (never green on develop) | Tests | Tests workflow added via #499 but has never passed on develop. 1,057 tests exist locally; CI scope yet to validate them. |
| client (Helm) | 142 helm-unit + 4 schema + 4 template-render + lint | N/A | 2026-05-26 | Helm Chart CI | helm-unittest doesn't produce coverage. |

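
For readers who want to sanity-check the table, the line-only coverage figure can be recomputed from the "stmts / missing" counts quoted in the Notes column. The sketch below is illustrative arithmetic only; the helper name is hypothetical and it assumes the quoted counts are plain statement counts from the coverage report.

```python
def line_coverage_pct(statements: int, missing: int) -> float:
    """Line-only coverage: executed statements over total statements, in percent."""
    if statements <= 0:
        raise ValueError("statements must be positive")
    return round(100 * (statements - missing) / statements, 2)


# (statements, missing) exactly as quoted in the Notes column of the table.
REPORTED = {
    "data-ingestors": (2691, 40),
    "tracebloc-client": (8727, 892),
    "averaging-service": (625, 197),
    "backend": (11360, 3751),
}

for repo, (stmts, miss) in REPORTED.items():
    print(f"{repo}: {line_coverage_pct(stmts, miss)}% line-only")
```

The data-ingestors and averaging-service results agree with the Coverage % column. The tracebloc-client and backend results come out higher than the reported 87.58% and 62.33%; that would be consistent with those two reports folding branch coverage into the total (the backend row does list 2,368 branches), but that reading is an assumption and worth confirming against the CI output.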
What the refreshed data shows
- The "converging to 90%+" picture is true where coverage is measured, with caveats. Three repos clear 90% (data-ingestors 98.51%, tracebloc-py-package 95.42%, design-system 94.93%); tracebloc-client is at 87.58% on a much larger suite and climbing. But:
- Three repos have no coverage instrumentation wired in CI: client-runtime, model-zoo, tracebloc-website. For model-zoo that's a deliberate design choice (contract-only tests). For the other two it's a measurement gap. (tracebloc/backend#710 closed the backend's instrumentation gap on 2026-05-28.)
- Two repos are now genuinely below target: backend at 62.33% (first measurement; a baseline on the lower end is expected of a large Django app that just got instrumentation) and averaging-service at 68% (FL averaging math is still the largest coverage gap relative to its risk profile, aligned with the gap the tracebloc/averaging-service#116 evolution epic targets). Backend is the lowest, but the runway is now visible; the previous "not measured" state hid this.
- Two repos have workflows that aren't usefully green on default branch: tracebloc-website (red on develop since contentlayer migration) and frontend-app (Tests workflow added but never green on develop). Neither produces coverage today. These are real visibility gaps independent of the integration-test question.
- Test counts grew substantially since the 2026-05-16 tracebloc/backend#665 snapshot: averaging-service 6→51, data-ingestors (effective) ~10→599, client-runtime 53→125, tracebloc-client ~1,102→2,082, tracebloc-py-package 529→876, tracebloc-website 35+22→209+82. The visibility-first + organic-CI-additions strategy in tracebloc/backend#665 produced real movement in two weeks.
What this means for this epic
Updates relative to the original ring spec, none requiring a structural rewrite:
- Ring 1, averaging-service row (golden-value test): still the highest-leverage Ring 1 item. Coverage at 68% with 51 tests confirms the suite is shallow rather than wide; golden-value validation is the missing kind, not "more of the same."
- Ring 1, tracebloc-py-package row (contract tests against backend): keep as drafted. 95.42% line coverage doesn't tell us whether the SDK actually still talks to a real backend; 100% mocked HTTP means contract drift is invisible regardless of coverage.
- Ring 1, tracebloc-client row (per-task integration loops): keep as drafted. 87.58% on 8,727 statements means most missing lines are likely framework-specific paths, exactly where the asymmetric integration coverage gap lives.
- Ring 1, backend row (cross-app APITestCase flows): prerequisite satisfied by tracebloc/backend#710 landing on 2026-05-28. The new 62.33% baseline means any cross-app flow tests added under Ring 1 will produce a measurable coverage delta. The 3,751 missing statements are an obvious next-target signal for where the cross-app flow tests should bite hardest.
- Ring 1, frontend-app Cypress promotion: gains a prerequisite. The Tests workflow needs to be green on develop first. Until then promoting Cypress to live-backend is premature.
- Ring 1, tracebloc-website / client-runtime / model-zoo: none of these have CI coverage measurement; that's a separate "wire --coverage" task that doesn't belong in this epic. Mention as a prerequisite, defer the actual wiring to tracebloc/backend#665 or a dedicated visibility ticket.
- Cross-repo integration coverage is still zero. That's the bit only this epic addresses; coverage % per repo doesn't move it.
The platform workflow this epic is about
┌─────────────────────────────────────────────────────────────────────┐
│ Customer / data scientist │
│ • tracebloc-py-package (SDK) • frontend-app (Web UI) │
│ │ │ │ │
│ └──┬──┴──────────────┘ │
└──────────────────────────┼──────────────────────────────────────────┘
▼ HTTPS / DRF tokens
┌─────────────────────────────────────────────────────────────────────┐
│ backend (Django) — usermetadata (MSSQL/MySQL), MongoDB, S3, Redis │
│ │ │ │
│ └─ Azure Service Bus ──────────┘ │
│ (training queue, FLOPs queue, averaging queue) │
└─────────────────────────────────────────────────────────────────────┘
▼ AMQP-over-WS
┌─────────────────────────────────────────────────────────────────────┐
│ Customer Kubernetes cluster — deployed by client (Helm chart) │
│ │
│ client-runtime/jobs-manager ←─ polls SB, spawns training pods │
│ │ │
│ ├─► tracebloc-client (training pod) │
│ │ reads /data/shared (read-only) │
│ │ POSTs weights+FLOPs to jobs-manager proxy │
│ │ │
│ client-runtime/pods-monitor ←─ lifecycle events back │
│ client-runtime/resource-monitor (DaemonSet) ← CPU/GPU │
│ │
│ data-ingestors (Job) ─ writes /data/shared from CSV/Parq │
│ averaging-service ─ polls SB averaging queue, │
│ averages weights → S3 │
│ MySQL (in-cluster) ─ per-cluster state store │
└─────────────────────────────────────────────────────────────────────┘
An honest end-to-end walk = bash <(curl -fsSL tracebloc.io/i.sh) on a fresh box → data-ingestors loads a test dataset → tracebloc-py-package submits a use case + experiment → SB delivers → jobs-manager spawns a training pod → tracebloc-client trains → weights flow back → averaging-service averages → backend leaderboard reflects the cycle. Six repos crossed in one test.
The three-ring playbook
Each ring is a self-contained gate. The rings are not a roadmap (don't deploy Ring 1 then Ring 2 then Ring 3 speculatively — see tracebloc/backend#699 contrast notes). They are three different gates at three different cost levels; pick one based on what Phase 0 (or any other prioritization signal) surfaces.
Ring 1 — in-process integration (4–6 weeks, per-repo authors)
What it builds. Extend existing test suites to chain multiple modules inside one repo. No new infrastructure; existing CI workflows run them.

| Repo | Concrete additions | Why |
|---|---|---|
| backend | 3–5 cross-app flow tests in new system_tests/: (a) signup → activate → publish use-case → invite → join → submit-experiment, (b) flops exhaustion auto-stop, (c) edge heartbeat → cycle metric → leaderboard update, (d) team-merge re-allocates flops, (e) inference submission lifecycle. All APITestCase + ORM + model_bakery. | These flows exist as manual playbook steps but have no automated chain test today. The largest gap in the most-tested repo. |
| tracebloc-py-package | New tests/integration/ gated by @pytest.mark.integration and a TRACEBLOC_INTEGRATION_BACKEND env var. Phase A: responses cassettes recorded from a live staging backend. Phase B: embedded Django test server fixture. Scenarios: login → upload model → link dataset → start experiment → poll → download. | The whole SDK is 100% mock. A 5-test recorded-cassette suite would catch any backend API shape change. |
| tracebloc-client | Extend the *_integration.py pattern (already used in time_to_event_strategies/ and computer_vision_strategies/) to tabular, text, segmentation, MLM, time-series, keypoint, object-detection. One full-loop test per task: build real small model on synthetic data, run N=1 epoch, assert metric is reasonable. | Asymmetric integration coverage across task types is the single largest framework-upgrade regression surface. |
| client-runtime | 2–3 tests using the kubernetes Python client against kind/k3d in CI. jobs-manager spawns a Job with the right securityContext; pods-monitor sees its lifecycle. Mock Azure SB and MySQL still; only Kubernetes is real. | The security invariants in CLAUDE.md ("do NOT regress") currently have no real-cluster verification. |
| averaging-service | Golden-value test: 3 real PyTorch state-dicts on disk, run Average.convert_weights(), assert against hand-computed averaged weights. No SB, no Azure. | The FL math has 4 unit tests but no real-data validation today. Aligns with tracebloc/averaging-service#116 evolution epic. |
| data-ingestors | One per-template test that runs the ingestor against a tiny real CSV / Parquet / image folder and asserts the output schema + file layout. | The current 16 files are mostly schema validators; nothing actually runs an ingestor on real data. |
| frontend-app | Promote the 10 existing Cypress specs to run against the deployed staging backend (gate behind CYPRESS_LIVE=1). | Cypress workflow already exists in frontend-app#499; upgrading it is a single-line env-var change. |

Cost: 4–6 weeks calendar, ~25% of one engineer per repo, parallelizable across the existing test authors. No new CI infrastructure.
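
To make the tracebloc-py-package row's "would catch any backend API shape change" claim concrete, here is a minimal sketch of a shape comparison between a recorded response and a live one. The function names and the example payloads are hypothetical; nothing here reflects the actual SDK or backend schema.

```python
from typing import Any


def shape_of(value: Any) -> Any:
    """Reduce a JSON value to its structure: keys and type names, no data."""
    if isinstance(value, dict):
        return {key: shape_of(val) for key, val in value.items()}
    if isinstance(value, list):
        # A list is represented by the shape of its first element.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def diff_shapes(recorded: Any, live: Any, path: str = "$") -> list:
    """Return human-readable differences between two shapes."""
    if isinstance(recorded, dict) and isinstance(live, dict):
        problems = []
        for key in sorted(recorded.keys() | live.keys()):
            if key not in live:
                problems.append(f"{path}.{key}: removed")
            elif key not in recorded:
                problems.append(f"{path}.{key}: added")
            else:
                problems.extend(diff_shapes(recorded[key], live[key], f"{path}.{key}"))
        return problems
    if isinstance(recorded, list) and isinstance(live, list):
        if recorded and live:
            return diff_shapes(recorded[0], live[0], f"{path}[]")
        return []
    if recorded != live:
        return [f"{path}: {recorded} -> {live}"]
    return []


# Hypothetical payloads: one from a recorded cassette, one from a live backend.
recorded_response = {"id": 7, "status": "queued", "metrics": [{"name": "acc", "value": 0.9}]}
live_response = {"id": "7", "status": "queued", "metrics": [{"name": "acc"}], "owner": "x"}

drift = diff_shapes(shape_of(recorded_response), shape_of(live_response))
print(drift)
```

A contract test built this way fails on type changes and on added or removed fields while ignoring the values themselves, which keeps recorded cassettes stable across runs.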
[#699 contrast] This is the cheapest ring and the one most likely to be picked under any plausible Phase 0 outcome that isn't pure "deps CVEs" or "customer-platform incompatibility." But tracebloc/backend#699 would warn: don't ship a green CI on shallow tests. Each Ring 1 test should be reviewed for whether it would actually catch a recent incident, not just whether it passes. The "averaging-service golden-value" and "client-runtime kind test" rows are the highest-quality items in this ring; the per-use-case real-framework loops in tracebloc-client are the ones at risk of being shape-tests.
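As an illustration of what the averaging-service golden-value row is asking for, the sketch below averages three tiny hand-written "state dicts" and checks the result against values computed by hand. It uses plain Python lists as a stand-in for PyTorch tensors, and an unweighted mean as a stand-in for the real averaging rule; `average_state_dicts` is hypothetical and is not the signature of Average.convert_weights().

```python
import math


def average_state_dicts(state_dicts):
    """Unweighted element-wise mean across participants (assumed averaging rule)."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("participants disagree on parameter names")
    averaged = {}
    for name in keys:
        columns = [sd[name] for sd in state_dicts]
        if len({len(col) for col in columns}) != 1:
            raise ValueError(f"shape mismatch for {name}")
        averaged[name] = [sum(vals) / len(columns) for vals in zip(*columns)]
    return averaged


def assert_matches_golden(actual, golden, rel_tol=1e-6):
    """Compare against hand-computed values with a float tolerance."""
    assert actual.keys() == golden.keys()
    for name, expected in golden.items():
        assert len(actual[name]) == len(expected), name
        for got, want in zip(actual[name], expected):
            assert math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9), (name, got, want)


# Three participants with values small enough to average by hand.
participants = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [3.0]},
    {"layer.weight": [5.0, 9.0], "layer.bias": [6.0]},
]
# Hand-computed: (1+3+5)/3 = 3, (2+4+9)/3 = 5, (0+3+6)/3 = 3.
GOLDEN = {"layer.weight": [3.0, 5.0], "layer.bias": [3.0]}

assert_matches_golden(average_state_dicts(participants), GOLDEN)
```

The point of the golden file is that the expected values come from outside the code under test, so a regression in the averaging math cannot silently update its own oracle.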
Ring 2 — service-level integration on docker-compose (6–10 weeks, single owner)
What it builds. A new integration-tests/ directory (in tracebloc/backend per the private-catch-all convention, or tracebloc/.github if we want it org-neutral) that brings up multiple services in containers and tests cross-service contracts.
Stack (docker-compose.integration.yml):
- backend (Django + MSSQL + Redis)
- tracebloc-py-package (editable install)
- averaging-service container
- MongoDB
- Fake Azure Service Bus (emulator container; see Open Questions)
- Excludes: Kubernetes, jobs-manager / pods-monitor, real tracebloc-client training pod. Training pod replaced by a stub that POSTs canned weights to the requests-proxy.
Test scope (~8–12 scenarios, ~5–10 min runtime):
- Real SDK against real backend container: signup → use-case → experiment-submit
- Simulated cycle completion: stub POSTs weights → backend marks cycle done
- Averaging-service consumes from SB, averages 2 participants' weights, writes to S3
- backend leaderboard reflects the cycle
- Failure paths: SB unreachable, stub sends malformed weights, auth token expires mid-flow
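The "simulated cycle completion" scenario relies on a stub standing in for the training pod. A minimal sketch of such a stub follows; the `/weights` path, the payload field names, and the function names are all assumptions made for illustration, not the real requests-proxy contract.

```python
import base64
import json
import urllib.request
from typing import Callable


def build_cycle_payload(experiment_id: str, cycle: int, weights: bytes) -> dict:
    """Canned cycle-completion payload (field names are hypothetical)."""
    return {
        "experiment_id": experiment_id,
        "cycle": cycle,
        "weights_b64": base64.b64encode(weights).decode("ascii"),
    }


def http_post(url: str, body: bytes) -> int:
    """POST a JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def run_stub(
    post: Callable[[str, bytes], int],
    proxy_url: str,
    experiment_id: str,
    cycle: int,
    weights: bytes = b"\x00" * 16,
) -> bool:
    """Send canned weights through `post`; True when the proxy answers 2xx."""
    body = json.dumps(build_cycle_payload(experiment_id, cycle, weights)).encode("utf-8")
    status = post(f"{proxy_url}/weights", body)  # "/weights" is an assumed path
    return 200 <= status < 300
```

Passing the sender in as a callable lets the same stub run against the compose stack (with `http_post`) and in a unit test (with a fake), which also makes the malformed-weights and unreachable-service failure paths easy to simulate.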
Triggered on: PR to backend and tracebloc-py-package; nightly elsewhere.
Cost: 1–2 weeks scaffolding + 4–6 weeks scenario writing. Real cross-repo coupling — needs one owner.
[#699 contrast] This ring is the most direct answer to tracebloc/backend#699's "cross-service integration bugs" candidate gate. tracebloc/backend#699's data-driven framing would ask: how many of the last 6 months of incidents were caught at the SDK ↔ backend boundary or backend ↔ averaging-service boundary? If the answer is ≥3, this ring is justified. If 0, it's speculation and the dependency-CVE or backend-logic-bug gates from tracebloc/backend#699's table should win.
Ring 3 — full-platform E2E via the installer (8–12 weeks, single owner)
Trigger: the actual customer-facing one-line installer — bash <(curl -fsSL tracebloc.io/i.sh) for macOS/Linux, irm tracebloc.io/i.ps1 | iex for Windows. The bootstrap script is a thin wrapper around github.com/tracebloc/client@${BRANCH}/scripts/install-k8s.sh. Do not bypass with a direct helm install — that skips the bootstrap, GPU detection, k3s bring-up, and Helm-install layers that real customers exercise. Use the script's existing BRANCH env var to pin against a Helm-chart PR, and CLIENT_ENV to target a dedicated test-e2e backend.
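A test driver could invoke the installer roughly as sketched below. BRANCH and CLIENT_ENV are the env vars named above; using the literal value "test-e2e" for CLIENT_ENV, the helper names, and the timeout are assumptions for illustration.

```python
import os
import subprocess

# Same one-liner customers run; process substitution needs bash, hence `bash -c`.
INSTALLER_CMD = ["bash", "-c", "bash <(curl -fsSL tracebloc.io/i.sh)"]


def installer_env(branch: str, client_env: str = "test-e2e", base=None) -> dict:
    """Copy the environment and pin the installer to a chart branch and backend."""
    env = dict(os.environ if base is None else base)
    env["BRANCH"] = branch          # pins the Helm-chart PR under test
    env["CLIENT_ENV"] = client_env  # assumed value for the dedicated E2E backend
    return env


def run_installer(branch: str, client_env: str = "test-e2e", timeout_s: int = 1800):
    """Run the real installer; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        INSTALLER_CMD,
        env=installer_env(branch, client_env),
        check=True,
        timeout=timeout_s,
    )
```

Keeping the environment construction separate from the subprocess call means the pinning logic can be unit-tested without touching the network or a cluster.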
Stack:
- Ephemeral cluster: kind for local + cheap CI runs, EKS for release-candidate runs (~$1/run)
- Real Azure Service Bus namespace per CI run with e2e-test-* prefix, torn down after
- Real tracebloc-client training image pinned to a 2-layer CNN, 1 epoch, ~100-row toy dataset
- Real data-ingestors run against the toy dataset
- Real SDK driver script in backend/e2e/ (or wherever the epic ends up rooted)
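The "namespace per CI run, torn down after" requirement is easiest to honor if teardown is tied to a context manager, so a failed scenario cannot leak a namespace. The sketch below shows that pattern only; `create` and `delete` are hypothetical callables that would wrap whatever provisioning tooling is chosen, and the name format beyond the e2e-test- prefix is an assumption.

```python
import contextlib
import uuid


def namespace_name(run_id=None) -> str:
    """Build a per-run name carrying the e2e-test- prefix."""
    suffix = run_id or uuid.uuid4().hex[:8]
    return f"e2e-test-{suffix}"


@contextlib.contextmanager
def ephemeral_namespace(create, delete, run_id=None):
    """Create a namespace for one run and always delete it afterwards."""
    name = namespace_name(run_id)
    create(name)
    try:
        yield name
    finally:
        # Runs on success, on test failure, and on interruption.
        delete(name)
```

A scheduled sweep that deletes any leftover e2e-test-* namespaces would still be a sensible backstop for runs that are killed before the `finally` block executes.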
Scenario ladder (do in order within Ring 3):
- Single-cycle, single-participant — proves the wiring (~3 weeks alone)
- Single-cycle, two-participant federated — proves the averaging path
- Multi-cycle with mid-cycle pod kill recovery — proves pods-monitor reporting
- Domain-scoped private use-case + cross-domain user denial — proves authorization
- Inference submission lifecycle — separate path from training cycles
Hard prerequisites before Ring 3 starts:
- A values.e2e.yaml profile in the client Helm chart with tiny resource requests, no GPU, externalized SB connection strings as test secrets
- The QA_TESTING_PLAYBOOK.md "End-to-End Smoke Test" (steps 1–14) is the source of truth; translate that file directly into a pytest suite
Cost: 8–12 weeks, single owner, ~30–60 min runtime per E2E pass.
[#699 contrast] This ring is the most direct answer to tracebloc/backend#699's "customer-platform incompatibility" candidate gate, and it partners explicitly with #690 (bare-metal test farm) — the bare-metal farm provides the heterogeneous hardware substrate; this ring provides the test driver that runs on it. tracebloc/backend#699's data-driven framing would ask: have we shipped install failures that customers hit, or k8s-version incompatibilities, or RBAC drift? If yes, this ring earns its complexity. If no, defer — the manual playbook + bare-metal farm cover the realistic risk at current customer count.
Where this epic and tracebloc/backend#699 align / diverge — summary

| Topic | tracebloc/backend#699 says | This epic says | Reconciliation |
|---|---|---|---|
| Order of test investment | Phase 0 incident review picks the gate | Three rings are described, ordering deferred | Same answer: don't pre-commit to Ring 1 → 2 → 3. Pick based on Phase 0 (or any equivalent signal). |
| Predetermined plans | "Not a 5-layer blueprint to deploy speculatively" | Ring playbooks exist on the shelf, conditional on selection | Same answer: drafts ≠ commitments. Having the draft makes ticket-filing fast when the gate is picked. |
| Test quality | Green CI on shallow tests is the failure mode | Each ring section names which sub-items are most at risk of being shape-tests | Same answer. Reviewer checklist for Ring 1: "would this test have caught a known incident?" |
| Tooling-vs-culture | Equal weight; pairing + ADRs + onboarding doc are first-class | Tooling-only focus | This epic doesn't replicate tracebloc/backend#699's culture work. Culture interventions stay in tracebloc/backend#699. |
| Customer-impact signal | "from:customer" kanban view, regression-rate metric | Silent | Defer to tracebloc/backend#699. Adding regression-rate tracking is not in this epic's scope. |
| Claude Code guardrails | Audit trail, branch-target enforcement, destructive-op restrictions | Silent | Defer to tracebloc/backend#699. |
| Success criterion | Measurable drop in "% of PRs causing follow-up fix within 7 days" | "Integration coverage exists where it didn't before" | #699's metric is better. If/when a ring is picked, the success criterion should be regression-rate, not coverage delta. |
| What's deferred | Canary, feature flags, perf regression, customer × code-path telemetry | Same list | Aligned. |

Net: this epic is a deeper drill into one of tracebloc/backend#699's candidate gate families. It does not contradict tracebloc/backend#699; it pre-fills the answer for two of its candidates. The four areas where this epic is silent (culture, signal, Claude guardrails, regression metric) are owned by tracebloc/backend#699 and should not be re-litigated here.
Open questions / decisions needed before committing to a ring
These need group sign-off, not unilateral decisions:
- Selection trigger. Do we wait for tracebloc/backend#699 Phase 0 to pick a gate, or can this epic be triggered independently (e.g., by a customer-reported integration bug)? Recommended: wait for Phase 0 unless a specific incident creates the urgency.
- Owner if a ring is picked. Ring 1 fits per-repo authors (consistent with tracebloc/backend#665's Phase 2a pattern). Ring 2 + Ring 3 need a single owner — likely Asad (cross-cutting tech-lead, already named on tracebloc/backend#699) with this epic's authors driving repo-specific implementation.
- Azure Service Bus in CI (relevant to Ring 2 + Ring 3):
- Ring 2: emulator container (free, contract-drift risk) vs. a dedicated SB namespace per run (~$1, faithful). Recommended: emulator with periodic drift checks.
- Ring 3: dedicated SB namespace per run is non-negotiable; emulator can't validate the actual customer connection path.
- Bare-metal farm dependency (Ring 3): how much of Ring 3 is blocked on #690 being ready? Recommended: scenarios 1–3 can run on kind without the bare-metal farm. Scenarios 4–5 should wait until the farm has at least one bare-metal node available.
- Where does integration-tests/ live? Options: (a) tracebloc/backend/integration-tests/ (matches the private-catch-all convention), (b) a new tracebloc/integration-tests repo (clean separation, but creates an 11th repo to maintain), (c) tracebloc/.github/integration-tests/ (org-level neutral). Recommended: (a) for Ring 2, decide later for Ring 3.
- averaging-service refactor question carried over from tracebloc/backend#665: Lukas originally flagged "refactor service/averaging.py to extract pure math vs accept the risk." Ring 1's golden-value test + Ring 2's federated-through-SB scenarios make this incremental rather than blocking. Confirm we're OK with the incremental path.
- Frontend ring scope. Ring 1 currently only covers frontend-app Cypress promotion. Should design-system get an analogous integration ring (e.g., Chromatic visual regression)? Recommended: out of scope here; visual-regression has different failure modes and belongs in its own epic if we want it.
Definition of done (if a ring is selected)
This epic's DoD is that the playbook is ready to execute, not "all three rings shipped." Concrete:
If a ring is executed under this epic, that ring's DoD:
- Ring 1: every row in the table has at least one passing integration test in CI, gated as a required status check, and each test has documented "incident-it-would-catch" justification
- Ring 2: docker-compose suite green on backend + tracebloc-py-package PRs, runtime ≤10 min, regression-rate measured before/after for at least 4 weeks
- Ring 3: scenarios 1–3 running nightly, scenario 4 + 5 running pre-release, full E2E completion in ≤60 min, regression-rate measured before/after for at least 4 weeks
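The regression-rate measurement referenced in the Ring 2 and Ring 3 DoD follows tracebloc/backend#699's "% of PRs causing follow-up fix within 7 days". A minimal sketch of that calculation is below; how a follow-up fix gets linked back to the PR that caused it is not defined here, so the `follow_up_fix_at` field and the class name are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class MergedPR:
    number: int
    merged_at: datetime
    # When the first fix attributed to this PR merged, if any (assumed linkage).
    follow_up_fix_at: Optional[datetime] = None


def regression_rate(prs: Sequence[MergedPR], window: timedelta = timedelta(days=7)) -> float:
    """Percent of PRs that needed a follow-up fix within `window` of merging."""
    if not prs:
        return 0.0
    regressed = sum(
        1
        for pr in prs
        if pr.follow_up_fix_at is not None
        and timedelta(0) <= pr.follow_up_fix_at - pr.merged_at <= window
    )
    return 100 * regressed / len(prs)


merged = datetime(2026, 5, 1)
sample = [
    MergedPR(1, merged, merged + timedelta(days=2)),   # fixed inside the window
    MergedPR(2, merged, merged + timedelta(days=10)),  # fixed, but too late to count
    MergedPR(3, merged),
    MergedPR(4, merged),
]
print(f"{regression_rate(sample):.1f}%")
```

Running the same function over the four weeks before and the four weeks after a ring lands gives the before/after comparison the DoD asks for.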
Related
Discussion thread
This epic does not commit the team to any work. It commits the team to having a ready answer when tracebloc/backend#699 Phase 0 picks a gate, when a customer reports an integration regression, or when leadership decides the regression-rate-target window has come.
Suggested next step: review this body together (30 min), confirm the per-repo classification matches lived reality, agree on Open Questions 1–5, then leave the epic on the shelf until a triggering signal arrives.