Epic: Integration + E2E test architecture — repo-by-repo audit and three-ring playbook

## Purpose

Define the integration-test and end-to-end-test architecture for the tracebloc platform: what gets tested, at which boundary, and on what trigger. The deliverable of *this* epic is the **playbook**: a per-repo unit/integration inventory plus three concrete ring-shaped gate designs (in-process integration → service-level integration on docker-compose → full-platform E2E on ephemeral Kubernetes). When a ring from the playbook is *picked*, its deliverable will be filed as a child epic containing that ring's implementation tickets.

This epic is intentionally dense. It is the reference artifact the team can reach for once a strategy is chosen.

## How this relates to tracebloc/backend#699 and tracebloc/backend#665

**Parallel companion to [#699](https://github.com/tracebloc/backend/issues/699), built on top of [#665](https://github.com/tracebloc/backend/issues/665).** Both epics stay open simultaneously.

- **#665 — Test coverage, visibility-first** delivered (a) cross-repo coverage measurement and (b) Layer 1 / Layer 2 CI enforcement. We now know what each repo's test suite contains and CI is running everywhere with required-status gating.
- **#699 — Quality system, incident-driven safety net** says the *next* gate should be decided by a Phase 0 incident review, not by speculation. It lists ~7 candidate gates; "cross-service integration bugs" and "customer-platform incompatibility" are two of them. tracebloc/backend#699 is the meta-question "which gate next?"
- **This epic** is the **answer-in-advance** for two of tracebloc/backend#699's candidates. If Phase 0 of tracebloc/backend#699 surfaces integration / cross-service / customer-platform as the dominant incident class, the rings below are the ready-to-execute specs. If Phase 0 picks a different gate (deps CVEs, SDK contract drift, frontend regressions, etc.), this epic stays on the shelf as reference — the per-repo inventory is still useful for any future test investment, regardless of which gate is built first.

**We are not asking the team to choose between tracebloc/backend#699 and this epic. We're asking the team to have both in hand when strategy time comes.**

Inline `[#699 contrast]` annotations appear under each ring noting where tracebloc/backend#699's framing would push back or refine. The full summary of agreements / divergences is at the bottom.

## State of the test suite today (2026-05-28, pulled from latest CI runs)

Numbers refreshed from the most recent default-branch CI run of each repo (not from tracebloc/backend#665's 2026-05-16 snapshot). Test counts shown are **what CI actually executes**, which differs from "what's in the repo" for repos whose workflow scope is narrower than their suite (notably `backend`).

| Repo | Tests run in CI | Coverage % | Last successful run | Workflow | Notes |
|---|---:|---:|---|---|---|
| **data-ingestors** | 599 | **98.51%** | 2026-05-27 | `Tests` (py3.11+3.12 matrix) | 2691 stmts / 40 missing. `fail-under=95` enforced. Best in fleet. |
| **tracebloc-py-package** | 876 | **95.42%** | 2026-05-27 | `Test Suite` (py3.11+3.12) | `fail-under=95` enforced. At target. |
| **design-system** | 94 (4 test files) | **94.93%** lines / 91.24% branches | 2026-05-26 | `Tests` | Vitest. Scope still narrow (Button, Input, Switches atoms only). |
| **tracebloc-client** | 2,082 (3 xfail) | **87.58%** | 2026-05-28 | `Run Tests` | 8,727 stmts / 892 missing. Just below 90%. Largest test suite in the fleet. |
| **averaging-service** | 51 | **68%** | 2026-05-28 | `CI` | 625 stmts / 197 missing. Grew from 6 tests (May 21) to 51 since tracebloc/backend#665's snapshot. |
| **backend** | 378 (295 Django + 83 Bandit) | **62.33%** | 2026-05-28 | `Tests` | 11,360 stmts / 3,751 missing / 2,368 branches. **First CI-measured baseline** — coverage instrumentation just landed via [tracebloc/backend#710](https://github.com/tracebloc/backend/pull/710). Now the lowest in the fleet; substantial Django app surface not yet exercised. |
| **client-runtime** | 125 | **not measured in CI** | 2026-05-22 | `Tests` | pytest only, no coverage flag in workflow. Up from 53 since the 2026-05-16 snapshot. |
| **model-zoo** | contract tests across 5 jobs (sklearn / tensorflow / pytorch / survival / ruff) | **not measured in CI** | 2026-05-22 | `CI` | Contract test by design — coverage % isn't a meaningful number for a template library. |
| **tracebloc-website** | 209 unit + 82 E2E (last green) | **not measured in CI** | 2026-05-21 | `Test` | Vitest + Playwright. No `--coverage` flag. Workflow **red on develop since 2026-05-21** (contentlayer migration PR tracebloc/backend#302 + GTM env guard). |
| **frontend-app** | 0 on default branch (workflow just added) | **never produced** | — (never green on develop) | `Tests` | Tests workflow added via [#499](https://github.com/tracebloc/frontend-app/pull/499) but has never passed on develop. 1,057 tests exist locally; CI scope yet to validate them. |
| **client** (Helm) | 142 helm-unit + 4 schema + 4 template-render + lint | **N/A** | 2026-05-26 | `Helm Chart CI` | helm-unittest doesn't produce coverage. |

### What the refreshed data shows

1. **The "converging to 90%+" picture is true *where coverage is measured***, with caveats. Three repos clear 90% (`data-ingestors` 98.51%, `tracebloc-py-package` 95.42%, `design-system` 94.93%); `tracebloc-client` is at 87.58% on a much larger suite and climbing. But:
2. **Three repos have no coverage instrumentation wired in CI** — `client-runtime`, `model-zoo`, `tracebloc-website`. For `model-zoo` that's a deliberate design choice (contract-only tests). For the other two it's a measurement gap. ([tracebloc/backend#710](https://github.com/tracebloc/backend/pull/710) closed the backend's instrumentation gap on 2026-05-28.)
3. **Two repos are now genuinely below target**: `backend` at 62.33% (a first measurement; a low baseline is expected for a large Django app that was only just instrumented) and `averaging-service` at 68% (the FL averaging math remains the largest coverage gap relative to its risk profile, the same gap the tracebloc/averaging-service#116 evolution epic targets). Backend is the lowest, but the runway is now visible; the previous "not measured" state hid this.
4. **Two repos have workflows that aren't usefully green on default branch** — `tracebloc-website` (red on develop since contentlayer migration) and `frontend-app` (Tests workflow added but never green on develop). Neither produces coverage today. These are real visibility gaps independent of the integration-test question.
5. **Test counts grew substantially since the 2026-05-16 tracebloc/backend#665 snapshot** — `averaging-service` 6→51, `data-ingestors` (effective) ~10→599, `client-runtime` 53→125, `tracebloc-client` ~1,102→2,082, `tracebloc-py-package` 529→876, `tracebloc-website` 35+22→209+82. The visibility-first + organic-CI-additions strategy in tracebloc/backend#665 produced real movement in two weeks.
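
As a sanity check on the table, the `stmts / missing` pairs can be converted back into a percentage. The sketch below assumes plain statement coverage with no branch weighting, and `line_coverage` is a hypothetical helper, not something taken from any of the repos. Rows whose workflows also measure branches (the `backend` row lists 2,368) would not be expected to match this simple ratio.

```python
def line_coverage(statements: int, missing: int) -> float:
    """Percent of statements executed, rounded to two decimals.

    Assumes statement-only coverage; branch coverage is ignored.
    """
    if statements <= 0:
        raise ValueError("statements must be positive")
    if not 0 <= missing <= statements:
        raise ValueError("missing must be between 0 and statements")
    return round(100 * (statements - missing) / statements, 2)


# Figures copied from the table above.
print(line_coverage(2691, 40))   # data-ingestors    -> 98.51
print(line_coverage(625, 197))   # averaging-service -> 68.48
```

Both results agree with the reported 98.51% and 68% once rounding is taken into account.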

### What this means for this epic

**Updates relative to the original ring spec, none requiring a structural rewrite:**

- **Ring 1 `averaging-service` row** (golden-value test) — still the highest-leverage Ring 1 item. Coverage at 68% with 51 tests confirms the suite is shallow rather than wide; golden-value validation is the missing kind, not "more of the same."
- **Ring 1 `tracebloc-py-package` row** (contract tests against backend) — keep as drafted. 95.42% line coverage doesn't tell us whether the SDK actually still talks to a real backend; 100% mocked HTTP means contract drift is invisible regardless of coverage.
- **Ring 1 `tracebloc-client` row** (per-task integration loops): keep as drafted. With 87.58% on 8,727 statements, most of the missing lines are likely framework-specific paths, which is exactly where the asymmetric integration coverage gap lives.
- **Ring 1 `backend` row** (cross-app `APITestCase` flows) — prerequisite **satisfied** by [tracebloc/backend#710](https://github.com/tracebloc/backend/pull/710) landing on 2026-05-28. The new 62.33% baseline means any cross-app flow tests added under Ring 1 will produce a measurable coverage delta. The 3,751 missing statements are an obvious next-target signal for where the cross-app flow tests should bite hardest.
- **Ring 1 `frontend-app` Cypress promotion** — gains a prerequisite: the Tests workflow needs to be green on develop first. Until then promoting Cypress to live-backend is premature.
- **Ring 1 `tracebloc-website` / `client-runtime` / `model-zoo`** — none of these have CI coverage measurement; that's a separate "wire `--coverage`" task that doesn't belong in this epic. Mention as a prerequisite, defer the actual wiring to tracebloc/backend#665 or a dedicated visibility ticket.
- **Cross-repo integration coverage is still zero.** That's the bit only this epic addresses; coverage % per repo doesn't move it.

## The platform workflow this epic is about

```
┌─────────────────────────────────────────────────────────────────────┐
│  Customer / data scientist                                          │
│  • tracebloc-py-package  (SDK)        • frontend-app (Web UI)       │
│                       │     │              │                        │
│                       └──┬──┴──────────────┘                        │
└──────────────────────────┼──────────────────────────────────────────┘
                           ▼ HTTPS / DRF tokens
┌─────────────────────────────────────────────────────────────────────┐
│  backend (Django) — usermetadata (MSSQL/MySQL), MongoDB, S3, Redis  │
│      │                              │                               │
│      └─ Azure Service Bus ──────────┘                               │
│           (training queue, FLOPs queue, averaging queue)            │
└─────────────────────────────────────────────────────────────────────┘
                           ▼ AMQP-over-WS
┌─────────────────────────────────────────────────────────────────────┐
│  Customer Kubernetes cluster — deployed by client (Helm chart)      │
│                                                                     │
│   client-runtime/jobs-manager   ←─ polls SB, spawns training pods   │
│        │                                                            │
│        ├─► tracebloc-client (training pod)                          │
│        │      reads /data/shared (read-only)                        │
│        │      POSTs weights+FLOPs to jobs-manager proxy             │
│        │                                                            │
│   client-runtime/pods-monitor   ←─ lifecycle events back            │
│   client-runtime/resource-monitor (DaemonSet) ← CPU/GPU             │
│                                                                     │
│   data-ingestors (Job)         ─ writes /data/shared from CSV/Parq  │
│   averaging-service            ─ polls SB averaging queue,          │
│                                  averages weights → S3              │
│   MySQL (in-cluster)           ─ per-cluster state store            │
└─────────────────────────────────────────────────────────────────────┘
```

An honest end-to-end walk = `bash <(curl -fsSL tracebloc.io/i.sh)` on a fresh box → data-ingestors loads a test dataset → tracebloc-py-package submits a use case + experiment → SB delivers → jobs-manager spawns a training pod → tracebloc-client trains → weights flow back → averaging-service averages → backend leaderboard reflects the cycle. **Seven repos crossed in one test** (`client`, `data-ingestors`, `tracebloc-py-package`, `backend`, `client-runtime`, `tracebloc-client`, `averaging-service`).

## The three-ring playbook

Each ring is a self-contained gate. The rings are **not** a roadmap (don't deploy Ring 1 then Ring 2 then Ring 3 speculatively — see tracebloc/backend#699 contrast notes). They are three different gates at three different cost levels; pick one based on what Phase 0 (or any other prioritization signal) surfaces.

### Ring 1 — in-process integration (4–6 weeks, per-repo authors)

**What it builds.** Extend existing test suites to chain multiple modules inside one repo. No new infrastructure; existing CI workflows run them.

| Repo | Concrete additions | Why |
|---|---|---|
| **backend** | 3–5 cross-app flow tests in new `system_tests/`: (a) signup → activate → publish use-case → invite → join → submit-experiment, (b) flops exhaustion auto-stop, (c) edge heartbeat → cycle metric → leaderboard update, (d) team-merge re-allocates flops, (e) inference submission lifecycle. All `APITestCase` + ORM + `model_bakery`. | These flows exist as manual playbook steps but have no automated chain test today. This is the largest gap in the repo with the lowest measured coverage in the fleet (62.33%). |
| **tracebloc-py-package** | New `tests/integration/` gated by `@pytest.mark.integration` and a `TRACEBLOC_INTEGRATION_BACKEND` env var. Phase A: `responses` cassettes recorded from a live staging backend. Phase B: embedded Django test server fixture. Scenarios: login → upload model → link dataset → start experiment → poll → download. | The SDK's tests mock 100% of HTTP traffic. A 5-test recorded-cassette suite would surface backend API shape changes on the covered endpoints whenever the cassettes are re-recorded. |
| **tracebloc-client** | Extend the `*_integration.py` pattern (already used in `time_to_event_strategies/` and `computer_vision_strategies/`) to tabular, text, segmentation, MLM, time-series, keypoint, object-detection. One full-loop test per task: build real small model on synthetic data, run N=1 epoch, assert metric is reasonable. | Asymmetric integration coverage across task types is the single largest framework-upgrade regression surface. |
| **client-runtime** | 2–3 tests using the `kubernetes` Python client against `kind`/`k3d` in CI. jobs-manager spawns a Job with the right `securityContext`; pods-monitor sees its lifecycle. Mock Azure SB and MySQL still — only Kubernetes is real. | The security invariants in `CLAUDE.md` ("do NOT regress") currently have no real-cluster verification. |
| **averaging-service** | Golden-value test: 3 real PyTorch state-dicts on disk, run `Average.convert_weights()`, assert against hand-computed averaged weights. No SB, no Azure. | The FL math has 4 unit tests but no real-data validation today. Aligns with tracebloc/averaging-service#116 evolution epic. |
| **data-ingestors** | One per-template test that runs the ingestor against a tiny real CSV / Parquet / image folder and asserts the output schema + file layout. | The current 16 files are mostly schema validators; nothing actually runs an ingestor on real data. |
| **frontend-app** | Promote the 10 existing Cypress specs to run against the deployed staging backend (gate behind `CYPRESS_LIVE=1`). | The workflow was added in [frontend-app#499](https://github.com/tracebloc/frontend-app/pull/499); once it is green on develop, switching it to the live backend is a single-line env-var change. |
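
The `tracebloc-py-package` row's gating (marker plus env var) could be wired in a `conftest.py` roughly as follows. This is a minimal sketch, not the SDK's actual configuration: `integration_backend_url` is a hypothetical helper, and the choice to skip rather than fail when the variable is unset is an assumption. `pytest_collection_modifyitems`, `item.keywords`, and `item.add_marker` are standard pytest hooks and attributes.

```python
import os

ENV_VAR = "TRACEBLOC_INTEGRATION_BACKEND"


def integration_backend_url(environ=None):
    """Return the configured backend URL without a trailing slash, or None."""
    environ = os.environ if environ is None else environ
    url = environ.get(ENV_VAR, "").strip()
    return url.rstrip("/") or None


def pytest_collection_modifyitems(config, items):
    """Skip tests marked `integration` unless a backend URL is configured."""
    if integration_backend_url() is not None:
        return
    import pytest  # imported lazily so the helper above stays stdlib-only

    skip = pytest.mark.skip(reason=f"{ENV_VAR} is not set")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

With this in place, a plain `pytest` run keeps the existing mocked suite untouched, while setting the env var in a dedicated CI job opts into the integration scenarios. The `integration` marker would still need to be registered in the pytest configuration to avoid unknown-marker warnings.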

**Cost:** 4–6 weeks calendar, ~25% of one engineer per repo, parallelizable across the existing test authors. **No new CI infrastructure.**

> `[#699 contrast]` This is the *cheapest* ring and the one most likely to be picked under any plausible Phase 0 outcome that isn't pure "deps CVEs" or "customer-platform incompatibility." But tracebloc/backend#699 would warn: don't ship a green CI on shallow tests. Each Ring 1 test should be reviewed for whether it would actually catch a recent incident, not just whether it passes. The "averaging-service golden-value" and "client-runtime kind test" rows are the highest-quality items in this ring; the per-use-case real-framework loops in tracebloc-client are the ones at risk of being shape-tests.
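
To make the golden-value idea concrete, here is a minimal sketch of what such a test asserts. The signature of `Average.convert_weights()` is not shown in this epic, so the sketch uses a stand-in function, `average_state_dicts`, and plain Python lists instead of PyTorch tensors; both are assumptions for illustration. The point is the shape of the test: fixed inputs, an expected result computed by hand, and a tolerance-based comparison.

```python
import math


def average_state_dicts(state_dicts):
    """Element-wise mean over state dicts with identical parameter names."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = set(state_dicts[0])
    if any(set(sd) != keys for sd in state_dicts):
        raise ValueError("state dicts disagree on parameter names")
    n = len(state_dicts)
    return {
        key: [sum(column) / n for column in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }


# Three tiny "participants" (hypothetical values).
PARTICIPANTS = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [3.0, 4.0], "fc.bias": [3.0]},
    {"fc.weight": [5.0, 6.0], "fc.bias": [6.0]},
]
# Hand-computed: (1+3+5)/3, (2+4+6)/3, (0+3+6)/3.
EXPECTED = {"fc.weight": [3.0, 4.0], "fc.bias": [3.0]}


def test_average_matches_golden_values():
    result = average_state_dicts(PARTICIPANTS)
    assert result.keys() == EXPECTED.keys()
    for key, expected in EXPECTED.items():
        assert len(result[key]) == len(expected)
        for got, want in zip(result[key], expected):
            assert math.isclose(got, want, rel_tol=1e-9, abs_tol=1e-12)
```

In the real test the stand-in would be replaced by the service's own averaging entry point, and the inputs by the three state-dicts stored on disk.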

### Ring 2 — service-level integration on docker-compose (6–10 weeks, single owner)

**What it builds.** A new `integration-tests/` directory (in `tracebloc/backend` per the private-catch-all convention, or `tracebloc/.github` if we want it org-neutral) that brings up multiple services in containers and tests cross-service contracts.

**Stack (`docker-compose.integration.yml`):**
- backend (Django + MSSQL + Redis)
- tracebloc-py-package (editable install)
- averaging-service container
- MongoDB
- Fake Azure Service Bus (emulator container; see Open Questions)
- **Excludes:** Kubernetes, jobs-manager / pods-monitor, real tracebloc-client training pod. Training pod replaced by a stub that POSTs canned weights to the requests-proxy.

**Test scope (~8–12 scenarios, ~5–10 min runtime):**
- Real SDK against real backend container: signup → use-case → experiment-submit
- Simulated cycle completion: stub POSTs weights → backend marks cycle done
- Averaging-service consumes from SB, averages 2 participants' weights, writes to S3
- backend leaderboard reflects the cycle
- Failure paths: SB unreachable, stub sends malformed weights, auth token expires mid-flow
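
Several of these scenarios are asynchronous (for example, "backend marks cycle done" happens some time after the stub POSTs). A docker-compose suite typically needs a polling helper rather than fixed sleeps. The following is a generic sketch; `wait_until` and its parameters are hypothetical names, and the timeout values are placeholders rather than measured figures.

```python
import time


def wait_until(predicate, timeout_s=60.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or the timeout elapses.

    Returns True on success and False on timeout. `clock` and `sleep`
    are injectable so the helper itself can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A scenario would then read along the lines of `assert wait_until(lambda: cycle_is_done(cycle_id), timeout_s=120)`, where `cycle_is_done` is whatever check the suite defines against the backend container.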

**Triggered on:** PR to `backend` and `tracebloc-py-package`; nightly elsewhere.

**Cost:** 1–2 weeks scaffolding + 4–6 weeks scenario writing. Real cross-repo coupling — needs one owner.

> `[#699 contrast]` This ring is the most direct answer to tracebloc/backend#699's "cross-service integration bugs" candidate gate. tracebloc/backend#699's data-driven framing would ask: how many incidents in the last 6 months originated at the SDK ↔ backend boundary or the backend ↔ averaging-service boundary? If the answer is ≥3, this ring is justified. If it is 0, the ring is speculation, and the dependency-CVE or backend-logic-bug gates from tracebloc/backend#699's table should win.
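
The threshold in the note above can be written down as a small decision rule. This is a sketch under assumptions: the boundary labels are hypothetical tags an incident review might assign, and the source does not say what to do with 1 or 2 matching incidents, so that case is returned as an explicit "judgment call".

```python
# Hypothetical labels for the two boundaries Ring 2 exercises.
RING2_BOUNDARIES = {"sdk-backend", "backend-averaging"}


def ring2_verdict(incident_boundaries):
    """Classify Ring 2 from the boundary label of each reviewed incident."""
    hits = sum(1 for label in incident_boundaries if label in RING2_BOUNDARIES)
    if hits >= 3:
        return "justified"
    if hits == 0:
        return "speculation"
    return "judgment call"  # 1-2 hits: not specified in this epic
```

For example, a review tagged `["sdk-backend", "deps-cve", "backend-averaging", "sdk-backend"]` has three matching incidents and would come out as "justified".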

### Ring 3 — full-platform E2E via the installer (8–12 weeks, single owner)

**Trigger:** the actual customer-facing one-line installer — `bash <(curl -fsSL tracebloc.io/i.sh)` for macOS/Linux, `irm tracebloc.io/i.ps1 | iex` for Windows. The bootstrap script is a thin wrapper around `github.com/tracebloc/client@${BRANCH}/scripts/install-k8s.sh`. **Do not** bypass with a direct `helm install` — that skips the bootstrap, GPU detection, k3s bring-up, and Helm-install layers that real customers exercise. Use the script's existing `BRANCH` env var to pin against a Helm-chart PR, and `CLIENT_ENV` to target a dedicated `test-e2e` backend.
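
A test driver would invoke the installer the same way a customer does, only with the two env vars set. The sketch below covers the macOS/Linux path and is illustrative only: `installer_invocation` is a hypothetical helper, and passing `test-e2e` as the literal `CLIENT_ENV` value is an assumption about how the dedicated backend is selected.

```python
import os

# The one-liner from the docs, wrapped in `bash -c` because process
# substitution (`<(...)`) is a bash feature.
INSTALLER_CMD = ["bash", "-c", "bash <(curl -fsSL tracebloc.io/i.sh)"]


def installer_invocation(branch, client_env="test-e2e", base_env=None):
    """Return (argv, env) for running the installer against a chart branch."""
    env = dict(os.environ if base_env is None else base_env)
    env["BRANCH"] = branch          # pins the Helm-chart PR under test
    env["CLIENT_ENV"] = client_env  # assumed value for the test backend
    return list(INSTALLER_CMD), env


# In a real driver (not executed here):
#   import subprocess
#   argv, env = installer_invocation("my-chart-pr-branch")
#   subprocess.run(argv, env=env, check=True, timeout=1800)
```

Keeping the command identical to the documented one-liner is the design point: the test exercises the bootstrap, GPU detection, and k3s bring-up rather than only the Helm install.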

**Stack:**
- Ephemeral cluster: `kind` for local + cheap CI runs, EKS for release-candidate runs (~$1/run)
- Real Azure Service Bus namespace per CI run with `e2e-test-*` prefix, torn down after
- Real `tracebloc-client` training image pinned to a 2-layer CNN, 1 epoch, ~100-row toy dataset
- Real data-ingestors run against the toy dataset
- Real SDK driver script in `backend/e2e/` (or wherever the epic ends up rooted)
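
The "per CI run, torn down after" requirement for the Service Bus namespace is easiest to guarantee with a context manager, so teardown runs even when a scenario fails. A minimal sketch: `create` and `delete` are hypothetical callables wrapping whatever Azure tooling the suite ends up using, and the name format beyond the `e2e-test-` prefix is an assumption.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_namespace(run_id, create, delete):
    """Create a per-run namespace and always delete it afterwards."""
    name = f"e2e-test-{run_id}"
    create(name)
    try:
        yield name
    finally:
        # Runs on success, test failure, and unexpected exceptions alike.
        delete(name)
```

A pytest session-scoped fixture could wrap this so every scenario in a run shares one namespace. A separate scheduled cleanup of stale `e2e-test-*` namespaces would still be worth considering for runs that are killed before the `finally` block executes.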

**Scenario ladder (do in order within Ring 3):**
1. Single-cycle, single-participant — proves the wiring (~3 weeks alone)
2. Single-cycle, two-participant federated — proves the averaging path
3. Multi-cycle with mid-cycle pod kill recovery — proves pods-monitor reporting
4. Domain-scoped private use-case + cross-domain user denial — proves authorization
5. Inference submission lifecycle — separate path from training cycles

**Hard prerequisites before Ring 3 starts:**
1. A `values.e2e.yaml` profile in the `client` Helm chart with tiny resource requests, no GPU, externalized SB connection strings as test secrets
2. The [QA_TESTING_PLAYBOOK.md "End-to-End Smoke Test"](https://github.com/tracebloc/backend/blob/develop/QA_TESTING_PLAYBOOK.md#end-to-end-smoke-test-quick-path) (steps 1–14) is the source of truth — translate that file directly into a pytest suite

**Cost:** 8–12 weeks, single owner, ~30–60 min runtime per E2E pass.

> `[#699 contrast]` This ring is the most direct answer to tracebloc/backend#699's "customer-platform incompatibility" candidate gate, and it partners explicitly with [#690 (bare-metal test farm)](https://github.com/tracebloc/backend/issues/690) — the bare-metal farm provides the heterogeneous hardware substrate; this ring provides the test driver that runs on it. tracebloc/backend#699's data-driven framing would ask: have we shipped install failures that customers hit, or k8s-version incompatibilities, or RBAC drift? If yes, this ring earns its complexity. If no, defer — the manual playbook + bare-metal farm cover the realistic risk at current customer count.

## Where this epic and tracebloc/backend#699 align / diverge — summary

| Topic | tracebloc/backend#699 says | This epic says | Reconciliation |
|---|---|---|---|
| Order of test investment | Phase 0 incident review picks the gate | Three rings are described, ordering deferred | Same answer: don't pre-commit to Ring 1 → 2 → 3. Pick based on Phase 0 (or any equivalent signal). |
| Predetermined plans | "Not a 5-layer blueprint to deploy speculatively" | Ring playbooks exist on the shelf, conditional on selection | Same answer: drafts ≠ commitments. Having the draft makes ticket-filing fast when the gate is picked. |
| Test quality | Green CI on shallow tests is the failure mode | Each ring section names which sub-items are most at risk of being shape-tests | Same answer. Reviewer checklist for Ring 1: "would this test have caught a known incident?" |
| Tooling-vs-culture | Equal weight; pairing + ADRs + onboarding doc are first-class | Tooling-only focus | **This epic doesn't replicate tracebloc/backend#699's culture work.** Culture interventions stay in tracebloc/backend#699. |
| Customer-impact signal | "from:customer" kanban view, regression-rate metric | Silent | **Defer to tracebloc/backend#699.** Adding regression-rate tracking is not in this epic's scope. |
| Claude Code guardrails | Audit trail, branch-target enforcement, destructive-op restrictions | Silent | **Defer to tracebloc/backend#699.** |
| Success criterion | Measurable drop in "% of PRs causing follow-up fix within 7 days" | "Integration coverage exists where it didn't before" | **#699's metric is better.** If/when a ring is picked, the success criterion should be regression-rate, not coverage delta. |
| What's deferred | Canary, feature flags, perf regression, customer × code-path telemetry | Same list | Aligned. |

**Net:** this epic is a *deeper drill into one of tracebloc/backend#699's candidate gate families*. It does not contradict tracebloc/backend#699; it pre-fills the answer for two of its candidates. The four areas where this epic is silent (culture, signal, Claude guardrails, regression metric) are owned by tracebloc/backend#699 and should not be re-litigated here.
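
Since the regression-rate metric is the recommended success criterion, it helps to pin down what would be computed. The sketch below is one possible reading of "% of PRs causing follow-up fix within 7 days"; how a follow-up fix gets linked to its originating PR is not defined in this epic, so the input shape (a merge timestamp plus a list of fix timestamps) is an assumption.

```python
from datetime import datetime, timedelta


def regression_rate(prs, window=timedelta(days=7)):
    """Percent of PRs with at least one follow-up fix inside `window`.

    `prs` is an iterable of (merged_at, [fix_merged_at, ...]) pairs.
    """
    total = 0
    regressed = 0
    for merged_at, fixes in prs:
        total += 1
        if any(timedelta(0) <= fix - merged_at <= window for fix in fixes):
            regressed += 1
    return 100.0 * regressed / total if total else 0.0


# Hypothetical sample: one of four PRs needed a fix within a week.
d = datetime(2026, 5, 1)
sample = [
    (d, [d + timedelta(days=3)]),   # counts
    (d, [d + timedelta(days=10)]),  # fix arrived outside the window
    (d, []),
    (d, []),
]
print(regression_rate(sample))  # 25.0
```

Measuring this over the same window before and after a ring ships gives the before/after comparison the ring-level DoD asks for.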

## Open questions / decisions needed before committing to a ring

These need group sign-off, not unilateral decisions:

1. **Selection trigger.** Do we wait for tracebloc/backend#699 Phase 0 to pick a gate, or can this epic be triggered independently (e.g., by a customer-reported integration bug)? Recommended: wait for Phase 0 unless a specific incident creates the urgency.
2. **Owner if a ring is picked.** Ring 1 fits per-repo authors (consistent with tracebloc/backend#665's Phase 2a pattern). Ring 2 + Ring 3 need a single owner — likely Asad (cross-cutting tech-lead, already named on tracebloc/backend#699) with this epic's authors driving repo-specific implementation.
3. **Azure Service Bus in CI** (relevant to Ring 2 + Ring 3):
   - Ring 2: emulator container (free, contract-drift risk) vs. a dedicated SB namespace per run (~$1, faithful). Recommended: emulator with periodic drift checks.
   - Ring 3: dedicated SB namespace per run is non-negotiable; emulator can't validate the actual customer connection path.
4. **Bare-metal farm dependency** (Ring 3): how much of Ring 3 is blocked on [#690](https://github.com/tracebloc/backend/issues/690) being ready? Recommended: scenarios 1–3 can run on `kind` without the bare-metal farm. Scenarios 4–5 should wait until the farm has at least one bare-metal node available.
5. **Where does `integration-tests/` live?** Options: (a) `tracebloc/backend/integration-tests/` (matches the private-catch-all convention), (b) a new `tracebloc/integration-tests` repo (clean separation, but creates a 12th repo to maintain on top of the 11 audited above), (c) `tracebloc/.github/integration-tests/` (org-level neutral). Recommended: (a) for Ring 2, decide later for Ring 3.
6. **averaging-service refactor question** carried over from tracebloc/backend#665: Lukas originally flagged "refactor `service/averaging.py` to extract pure math vs accept the risk." Ring 1's golden-value test + Ring 2's federated-through-SB scenarios make this incremental rather than blocking. Confirm we're OK with the incremental path.
7. **Frontend ring scope.** Ring 1 currently only covers `frontend-app` Cypress promotion. Should `design-system` get an analogous integration ring (e.g., Chromatic visual regression)? Recommended: out of scope here; visual-regression has different failure modes and belongs in its own epic if we want it.

## Definition of done (if a ring is selected)

This epic's DoD is **the playbook is ready to execute**, not "all three rings shipped." Concrete:

- [ ] Per-repo test classification table maintained and updated quarterly (next refresh: end of Q3 2026)
- [ ] When a ring is selected for execution, a child epic is filed with the ring's row-level tickets pre-drafted from this epic's content
- [ ] Cross-references to tracebloc/backend#699 / tracebloc/backend#665 / tracebloc/backend#690 / tracebloc/averaging-service#116 stay current

If a ring is **executed** under this epic, that ring's DoD:
- **Ring 1:** every row in the table has at least one passing integration test in CI, gated as a required status check, and each test has documented "incident-it-would-catch" justification
- **Ring 2:** docker-compose suite green on `backend` + `tracebloc-py-package` PRs, runtime ≤10 min, regression-rate measured before/after for at least 4 weeks
- **Ring 3:** scenarios 1–3 running nightly, scenario 4 + 5 running pre-release, full E2E completion in ≤60 min, regression-rate measured before/after for at least 4 weeks

## Related

- [#699 Quality system North Star](https://github.com/tracebloc/backend/issues/699) — meta-epic; this epic is one of its candidate gate playbooks
- [#665 Test coverage epic](https://github.com/tracebloc/backend/issues/665) — predecessor; Layer 1 + Layer 2 enforcement done; this epic extends beyond into cross-module / cross-repo
- [#690 Bare-metal test farm](https://github.com/tracebloc/backend/issues/690) — hardware substrate for Ring 3 scenarios 4–5
- [averaging-service#116 Averaging service evolution](https://github.com/tracebloc/averaging-service/issues/116): overlaps on Ring 1 averaging-service row + Ring 2 federated-through-SB scenarios
- [#667 GHAS decision](https://github.com/tracebloc/backend/issues/667) — different gate family (deps CVEs), not addressed by this epic

## Discussion thread

This epic does not commit the team to any work. It commits the team to **having a ready answer** when tracebloc/backend#699 Phase 0 picks a gate, when a customer reports an integration regression, or when leadership decides the regression-rate-target window has come.

Suggested next step: review this body together (30 min), confirm the per-repo classification matches lived reality, agree on Open Questions 1–5, then leave the epic on the shelf until a triggering signal arrives.




