Epic: Integration + E2E test architecture — repo-by-repo audit and three-ring playbook #54

@aptracebloc

Purpose

Define the integration-test and end-to-end-test architecture for the tracebloc platform: what gets tested, at which boundary, and on which trigger. The deliverable of this epic is the playbook: a per-repo unit/integration inventory plus three concrete gate designs, one per ring (in-process integration → service-level integration on docker-compose → full-platform E2E on ephemeral Kubernetes). When a ring is picked, a child epic will be filed with that ring's implementation tickets.

This epic is dense intentionally. It is the reference artifact the team can reach for once strategy is chosen.

How this relates to tracebloc/backend#699 and tracebloc/backend#665

Parallel companion to #699, built on top of #665. Both epics stay open simultaneously.

  • #665 (Test coverage, visibility-first) delivered (a) cross-repo coverage measurement and (b) Layer 1 / Layer 2 CI enforcement. We now know what each repo's test suite contains, and CI is running everywhere with required-status gating.
  • #699 (Quality system, incident-driven safety net) says the next gate should be decided by a Phase 0 incident review, not by speculation. It lists ~7 candidate gates; "cross-service integration bugs" and "customer-platform incompatibility" are two of them. tracebloc/backend#699 is the meta-question "which gate next?"
  • This epic is the answer-in-advance for two of tracebloc/backend#699's candidates. If Phase 0 of tracebloc/backend#699 surfaces integration / cross-service / customer-platform as the dominant incident class, the rings below are the ready-to-execute specs. If Phase 0 picks a different gate (deps CVEs, SDK contract drift, frontend regressions, etc.), this epic stays on the shelf as reference — the per-repo inventory is still useful for any future test investment, regardless of which gate is built first.

We are not asking the team to choose between tracebloc/backend#699 and this epic. We're asking the team to have both in hand when strategy time comes.

Inline [#699 contrast] annotations appear under each ring noting where tracebloc/backend#699's framing would push back or refine. The full summary of agreements / divergences is at the bottom.

State of the test suite today (2026-05-28, pulled from latest CI runs)

Numbers refreshed from the most recent default-branch CI run of each repo (not from tracebloc/backend#665's 2026-05-16 snapshot). Test counts shown are what CI actually executes, which differs from "what's in the repo" for repos whose workflow scope is narrower than their suite (notably backend).

| Repo | Tests run in CI | Coverage % | Last successful run | Workflow | Notes |
|---|---|---|---|---|---|
| data-ingestors | 599 | 98.51% | 2026-05-27 | Tests (py3.11+3.12 matrix) | 2691 stmts / 40 missing. fail-under=95 enforced. Best in fleet. |
| tracebloc-py-package | 876 | 95.42% | 2026-05-27 | Test Suite (py3.11+3.12) | fail-under=95 enforced. At target. |
| design-system | 94 (4 test files) | 94.93% lines / 91.24% branches | 2026-05-26 | Tests | Vitest. Scope still narrow (Button, Input, Switches atoms only). |
| tracebloc-client | 2,082 (3 xfail) | 87.58% | 2026-05-28 | Run Tests | 8,727 stmts / 892 missing. Just below 90%. Largest test suite in the fleet. |
| averaging-service | 51 | 68% | 2026-05-28 | CI | 625 stmts / 197 missing. Grew from 6 tests (May 21) to 51 since tracebloc/backend#665's snapshot. |
| backend | 378 (295 Django + 83 Bandit) | 62.33% | 2026-05-28 | Tests | 11,360 stmts / 3,751 missing / 2,368 branches. First CI-measured baseline: coverage instrumentation just landed via tracebloc/backend#710. Now the lowest in the fleet; substantial Django app surface not yet exercised. |
| client-runtime | 125 | not measured in CI | 2026-05-22 | Tests | pytest only, no coverage flag in workflow. Up from 53 since the 2026-05-16 snapshot. |
| model-zoo | contract tests across 5 jobs (sklearn / tensorflow / pytorch / survival / ruff) | not measured in CI | 2026-05-22 | CI | Contract test by design; coverage % isn't a meaningful number for a template library. |
| tracebloc-website | 209 unit + 82 E2E (last green) | not measured in CI | 2026-05-21 | Test | Vitest + Playwright. No --coverage flag. Workflow red on develop since 2026-05-21 (contentlayer migration PR tracebloc/backend#302 + GTM env guard). |
| frontend-app | 0 on default branch (workflow just added) | never produced | none (never green on develop) | Tests | Tests workflow added via #499 but has never passed on develop. 1,057 tests exist locally; CI scope yet to validate them. |
| client (Helm) | 142 helm-unit + 4 schema + 4 template-render + lint | N/A | 2026-05-26 | Helm Chart CI | helm-unittest doesn't produce coverage. |
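
As a quick sanity check on the table, statement coverage can be recomputed from the "stmts / missing" figures in the Notes column. The sketch below is illustrative only, and `statement_coverage` is a hypothetical helper name. For data-ingestors and averaging-service the result matches the reported percentage. For tracebloc-client and backend it comes out higher than the reported figure, which would be consistent with those reports also counting branches; that is an assumption, since the CI coverage configuration is not shown in this epic.

```python
def statement_coverage(statements: int, missing: int) -> float:
    """Percentage of statements executed, ignoring branch coverage."""
    if statements <= 0:
        raise ValueError("statements must be positive")
    return 100.0 * (statements - missing) / statements


# (statements, missing) copied from the Notes column of the table above.
REPORTED = {
    "data-ingestors": (2691, 40),
    "tracebloc-client": (8727, 892),
    "averaging-service": (625, 197),
    "backend": (11360, 3751),
}

for repo, (statements, missing) in REPORTED.items():
    print(f"{repo}: {statement_coverage(statements, missing):.2f}%")
```

Statement-only coverage works out to 98.51%, 89.78%, 68.48% and 66.98% respectively, against reported values of 98.51%, 87.58%, 68% and 62.33%.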

What the refreshed data shows

  1. The "converging to 90%+" picture is true where coverage is measured, with caveats. Three repos clear 90% (data-ingestors 98.51%, tracebloc-py-package 95.42%, design-system 94.93%); tracebloc-client is at 87.58% on a much larger suite and climbing. But:
  2. Three repos have no coverage instrumentation wired in CI: client-runtime, model-zoo, tracebloc-website. For model-zoo that's a deliberate design choice (contract-only tests). For the other two it's a measurement gap. (tracebloc/backend#710 closed the backend's instrumentation gap on 2026-05-28.)
  3. Two repos are now genuinely below target: backend at 62.33% (first measurement, baselines on the lower end as expected of a large Django app that just got instrumentation) and averaging-service at 68% (FL averaging math still the largest coverage gap relative to its risk profile, aligned with the gap the tracebloc/averaging-service#116 evolution epic targets). Backend is the lowest, but the runway is now visible; the previous "not measured" state hid this.
  4. Two repos have workflows that aren't usefully green on the default branch: tracebloc-website (red on develop since contentlayer migration) and frontend-app (Tests workflow added but never green on develop). Neither produces coverage today. These are real visibility gaps independent of the integration-test question.
  5. Test counts grew substantially since the 2026-05-16 tracebloc/backend#665 snapshot: averaging-service 6→51, data-ingestors (effective) ~10→599, client-runtime 53→125, tracebloc-client ~1,102→2,082, tracebloc-py-package 529→876, tracebloc-website 35+22→209+82. The visibility-first + organic-CI-additions strategy in tracebloc/backend#665 produced real movement in two weeks.

What this means for this epic

Updates relative to the original ring spec, none requiring a structural rewrite:

  • Ring 1 averaging-service row (golden-value test) — still the highest-leverage Ring 1 item. Coverage at 68% with 51 tests confirms the suite is shallow rather than wide; golden-value validation is the missing kind, not "more of the same."
  • Ring 1 tracebloc-py-package row (contract tests against backend) — keep as drafted. 95.42% line coverage doesn't tell us whether the SDK actually still talks to a real backend; 100% mocked HTTP means contract drift is invisible regardless of coverage.
  • Ring 1 tracebloc-client row (per-task integration loops) — keep as drafted. 87.58% on 8,727 statements means most missing lines are likely framework-specific paths exactly where the asymmetric integration coverage gap lives.
  • Ring 1 backend row (cross-app APITestCase flows) — prerequisite satisfied by tracebloc/backend#710 landing on 2026-05-28. The new 62.33% baseline means any cross-app flow tests added under Ring 1 will produce a measurable coverage delta. The 3,751 missing statements are an obvious next-target signal for where the cross-app flow tests should bite hardest.
  • Ring 1 frontend-app Cypress promotion — gains a prerequisite: the Tests workflow needs to be green on develop first. Until then promoting Cypress to live-backend is premature.
  • Ring 1 tracebloc-website / client-runtime / model-zoo — none of these have CI coverage measurement; that's a separate "wire --coverage" task that doesn't belong in this epic. Mention as a prerequisite, defer the actual wiring to tracebloc/backend#665 or a dedicated visibility ticket.
  • Cross-repo integration coverage is still zero. That's the bit only this epic addresses; coverage % per repo doesn't move it.

The platform workflow this epic is about

┌─────────────────────────────────────────────────────────────────────┐
│  Customer / data scientist                                          │
│  • tracebloc-py-package  (SDK)        • frontend-app (Web UI)       │
│                       │     │              │                        │
│                       └──┬──┴──────────────┘                        │
└──────────────────────────┼──────────────────────────────────────────┘
                           ▼ HTTPS / DRF tokens
┌─────────────────────────────────────────────────────────────────────┐
│  backend (Django) — usermetadata (MSSQL/MySQL), MongoDB, S3, Redis  │
│      │                              │                               │
│      └─ Azure Service Bus ──────────┘                               │
│           (training queue, FLOPs queue, averaging queue)            │
└─────────────────────────────────────────────────────────────────────┘
                           ▼ AMQP-over-WS
┌─────────────────────────────────────────────────────────────────────┐
│  Customer Kubernetes cluster — deployed by client (Helm chart)      │
│                                                                     │
│   client-runtime/jobs-manager   ←─ polls SB, spawns training pods   │
│        │                                                            │
│        ├─► tracebloc-client (training pod)                          │
│        │      reads /data/shared (read-only)                        │
│        │      POSTs weights+FLOPs to jobs-manager proxy             │
│        │                                                            │
│   client-runtime/pods-monitor   ←─ lifecycle events back            │
│   client-runtime/resource-monitor (DaemonSet) ← CPU/GPU             │
│                                                                     │
│   data-ingestors (Job)         ─ writes /data/shared from CSV/Parq  │
│   averaging-service            ─ polls SB averaging queue,          │
│                                  averages weights → S3              │
│   MySQL (in-cluster)           ─ per-cluster state store            │
└─────────────────────────────────────────────────────────────────────┘

An honest end-to-end walk = bash <(curl -fsSL tracebloc.io/i.sh) on a fresh box → data-ingestors loads a test dataset → tracebloc-py-package submits a use case + experiment → SB delivers → jobs-manager spawns a training pod → tracebloc-client trains → weights flow back → averaging-service averages → backend leaderboard reflects the cycle. Six repos crossed in one test.

The three-ring playbook

Each ring is a self-contained gate. The rings are not a roadmap (don't deploy Ring 1 then Ring 2 then Ring 3 speculatively — see tracebloc/backend#699 contrast notes). They are three different gates at three different cost levels; pick one based on what Phase 0 (or any other prioritization signal) surfaces.

Ring 1 — in-process integration (4–6 weeks, per-repo authors)

What it builds. Extend existing test suites to chain multiple modules inside one repo. No new infrastructure; existing CI workflows run them.

| Repo | Concrete additions | Why |
|---|---|---|
| backend | 3–5 cross-app flow tests in new system_tests/: (a) signup → activate → publish use-case → invite → join → submit-experiment, (b) flops exhaustion auto-stop, (c) edge heartbeat → cycle metric → leaderboard update, (d) team-merge re-allocates flops, (e) inference submission lifecycle. All APITestCase + ORM + model_bakery. | These flows exist as manual playbook steps but have no automated chain test today. The largest gap in the most-tested repo. |
| tracebloc-py-package | New tests/integration/ gated by @pytest.mark.integration and a TRACEBLOC_INTEGRATION_BACKEND env var. Phase A: responses cassettes recorded from a live staging backend. Phase B: embedded Django test server fixture. Scenarios: login → upload model → link dataset → start experiment → poll → download. | The whole SDK is 100% mock. A 5-test recorded-cassette suite would catch any backend API shape change. |
| tracebloc-client | Extend the *_integration.py pattern (already used in time_to_event_strategies/ and computer_vision_strategies/) to tabular, text, segmentation, MLM, time-series, keypoint, object-detection. One full-loop test per task: build real small model on synthetic data, run N=1 epoch, assert metric is reasonable. | Asymmetric integration coverage across task types is the single largest framework-upgrade regression surface. |
| client-runtime | 2–3 tests using the kubernetes Python client against kind/k3d in CI. jobs-manager spawns a Job with the right securityContext; pods-monitor sees its lifecycle. Mock Azure SB and MySQL still; only Kubernetes is real. | The security invariants in CLAUDE.md ("do NOT regress") currently have no real-cluster verification. |
| averaging-service | Golden-value test: 3 real PyTorch state-dicts on disk, run Average.convert_weights(), assert against hand-computed averaged weights. No SB, no Azure. | The FL math has 4 unit tests but no real-data validation today. Aligns with tracebloc/averaging-service#116 evolution epic. |
| data-ingestors | One per-template test that runs the ingestor against a tiny real CSV / Parquet / image folder and asserts the output schema + file layout. | The current 16 files are mostly schema validators; nothing actually runs an ingestor on real data. |
| frontend-app | Promote the 10 existing Cypress specs to run against the deployed staging backend (gate behind CYPRESS_LIVE=1). | Cypress workflow already exists in frontend-app#499; upgrading it is a single-line env-var change. |
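
The gating described in the tracebloc-py-package row could be wired roughly as follows. This is a minimal sketch of a conftest.py, assuming the marker is named `integration` and that `TRACEBLOC_INTEGRATION_BACKEND` holds the backend base URL (the epic names both but does not define their exact semantics). `integration_backend` is a hypothetical helper, not an existing SDK function.

```python
import os

ENV_VAR = "TRACEBLOC_INTEGRATION_BACKEND"


def integration_backend(environ=None):
    """Return the configured backend base URL, or None when unset or blank."""
    environ = os.environ if environ is None else environ
    url = environ.get(ENV_VAR, "").strip().rstrip("/")
    return url or None


def pytest_configure(config):
    # Register the marker so `--strict-markers` does not reject it.
    config.addinivalue_line(
        "markers", f"integration: needs a live backend (set {ENV_VAR})"
    )


def pytest_collection_modifyitems(config, items):
    # Leave the collection untouched when a backend URL was provided.
    if integration_backend() is not None:
        return
    import pytest  # imported lazily so the helper above works without pytest

    skip = pytest.mark.skip(reason=f"{ENV_VAR} is not set")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

With something like this in place, a plain `pytest` run without the variable skips the integration tests, while `pytest -m integration` with the variable set runs only them.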

Cost: 4–6 weeks calendar, ~25% of one engineer per repo, parallelizable across the existing test authors. No new CI infrastructure.

[#699 contrast] This is the cheapest ring and the one most likely to be picked under any plausible Phase 0 outcome that isn't pure "deps CVEs" or "customer-platform incompatibility." But tracebloc/backend#699 would warn: don't ship a green CI on shallow tests. Each Ring 1 test should be reviewed for whether it would actually catch a recent incident, not just whether it passes. The "averaging-service golden-value" and "client-runtime kind test" rows are the highest-quality items in this ring; the per-use-case real-framework loops in tracebloc-client are the ones at risk of being shape-tests.
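
To make the averaging-service golden-value row concrete, here is a minimal sketch of the shape such a test could take. It makes several assumptions: plain Python lists stand in for PyTorch tensors so the example runs without torch, `average_state_dicts` is a hypothetical stand-in for `Average.convert_weights()` (whose signature is not given in this epic), and the averaging is an unweighted mean, which may not match the service's actual weighting.

```python
import math


def average_state_dicts(state_dicts):
    """Stand-in for the service's averaging step: element-wise unweighted mean."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("state dicts must share the same parameter names")
    n = len(state_dicts)
    return {
        key: [sum(values) / n for values in zip(*(sd[key] for sd in state_dicts))]
        for key in keys
    }


def test_average_matches_hand_computed_golden_values():
    # Three participants, mirroring the "3 state-dicts" in the Ring 1 row.
    participants = [
        {"fc.weight": [1.0, 2.0, 3.0], "fc.bias": [0.0]},
        {"fc.weight": [3.0, 2.0, 1.0], "fc.bias": [3.0]},
        {"fc.weight": [2.0, 5.0, 2.0], "fc.bias": [6.0]},
    ]
    # Golden values computed by hand: (1+3+2)/3, (2+2+5)/3, (3+1+2)/3, (0+3+6)/3.
    golden = {"fc.weight": [2.0, 3.0, 2.0], "fc.bias": [3.0]}

    averaged = average_state_dicts(participants)

    assert averaged.keys() == golden.keys()
    for key, expected in golden.items():
        assert len(averaged[key]) == len(expected)
        for got, want in zip(averaged[key], expected):
            assert math.isclose(got, want, rel_tol=1e-9, abs_tol=1e-12)
```

The point of the golden file is that the expected numbers are derived independently of the code under test, which is what separates this from a shape-test.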

Ring 2 — service-level integration on docker-compose (6–10 weeks, single owner)

What it builds. A new integration-tests/ directory (in tracebloc/backend per the private-catch-all convention, or tracebloc/.github if we want it org-neutral) that brings up multiple services in containers and tests cross-service contracts.

Stack (docker-compose.integration.yml):

  • backend (Django + MSSQL + Redis)
  • tracebloc-py-package (editable install)
  • averaging-service container
  • MongoDB
  • Fake Azure Service Bus (emulator container; see Open Questions)
  • Excludes: Kubernetes, jobs-manager / pods-monitor, real tracebloc-client training pod. Training pod replaced by a stub that POSTs canned weights to the requests-proxy.
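
The stub that replaces the training pod could be as small as the sketch below. The payload fields, the canned values, and the proxy URL are all hypothetical: this epic does not specify the weight format or the requests-proxy endpoint, so treat every name here as a placeholder.

```python
import json
import urllib.request

# Hypothetical canned payload; the real weight format is not specified in this epic.
CANNED_WEIGHTS = {"fc.weight": [0.1, 0.2, 0.3], "fc.bias": [0.0]}


def build_weights_request(proxy_url, experiment_id, weights=None):
    """Build (but do not send) the POST a stub training pod would make."""
    payload = {
        "experiment_id": experiment_id,
        "weights": CANNED_WEIGHTS if weights is None else weights,
    }
    return urllib.request.Request(
        proxy_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_canned_weights(proxy_url, experiment_id, timeout=10.0):
    """Send the canned weights and return the HTTP status code."""
    request = build_weights_request(proxy_url, experiment_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the malformed-weights failure path easy to script: pass a deliberately broken `weights` value to `build_weights_request`. Note that `urlopen` raises `urllib.error.HTTPError` on 4xx/5xx responses, so a failure-path scenario would assert on that exception rather than on a return value.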

Test scope (~8–12 scenarios, ~5–10 min runtime):

  • Real SDK against real backend container: signup → use-case → experiment-submit
  • Simulated cycle completion: stub POSTs weights → backend marks cycle done
  • Averaging-service consumes from SB, averages 2 participants' weights, writes to S3
  • backend leaderboard reflects the cycle
  • Failure paths: SB unreachable, stub sends malformed weights, auth token expires mid-flow

Triggered on: PR to backend and tracebloc-py-package; nightly elsewhere.

Cost: 1–2 weeks scaffolding + 4–6 weeks scenario writing. Real cross-repo coupling — needs one owner.

[#699 contrast] This ring is the most direct answer to tracebloc/backend#699's "cross-service integration bugs" candidate gate. tracebloc/backend#699's data-driven framing would ask: how many of the last 6 months of incidents were caught at the SDK ↔ backend boundary or backend ↔ averaging-service boundary? If the answer is ≥3, this ring is justified. If 0, it's speculation and the dependency-CVE or backend-logic-bug gates from tracebloc/backend#699's table should win.

Ring 3 — full-platform E2E via the installer (8–12 weeks, single owner)

Trigger: the actual customer-facing one-line installer — bash <(curl -fsSL tracebloc.io/i.sh) for macOS/Linux, irm tracebloc.io/i.ps1 | iex for Windows. The bootstrap script is a thin wrapper around github.com/tracebloc/client@${BRANCH}/scripts/install-k8s.sh. Do not bypass with a direct helm install — that skips the bootstrap, GPU detection, k3s bring-up, and Helm-install layers that real customers exercise. Use the script's existing BRANCH env var to pin against a Helm-chart PR, and CLIENT_ENV to target a dedicated test-e2e backend.
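
A test driver could invoke the installer along these lines. This is a sketch under assumptions: it only uses the `BRANCH` and `CLIENT_ENV` variables named above, the value `test-e2e` for `CLIENT_ENV` is a guess at how the dedicated backend would be selected, and `installer_invocation` is a hypothetical helper.

```python
import os

INSTALLER_URL = "tracebloc.io/i.sh"  # macOS/Linux bootstrap URL from this epic


def installer_invocation(branch, client_env, base_env=None):
    """Return (argv, env) for running the one-line installer under bash."""
    env = dict(os.environ if base_env is None else base_env)
    env["BRANCH"] = branch          # pins the client repo ref the bootstrap fetches
    env["CLIENT_ENV"] = client_env  # selects the backend environment (value assumed)
    # Process substitution `<(...)` is a bash feature, so the command runs via `bash -c`.
    argv = ["bash", "-c", f"bash <(curl -fsSL {INSTALLER_URL})"]
    return argv, env
```

A driver would then call `subprocess.run(argv, env=env, check=True)` with the returned values, so the E2E run goes through the same bootstrap path a customer does instead of a direct helm install.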

Stack:

  • Ephemeral cluster: kind for local + cheap CI runs, EKS for release-candidate runs (~$1/run)
  • Real Azure Service Bus namespace per CI run with e2e-test-* prefix, torn down after
  • Real tracebloc-client training image pinned to a 2-layer CNN, 1 epoch, ~100-row toy dataset
  • Real data-ingestors run against the toy dataset
  • Real SDK driver script in backend/e2e/ (or wherever the epic ends up rooted)
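
The "namespace per CI run, torn down after" requirement maps naturally onto a context manager. The sketch below assumes the caller supplies `create` and `delete` callables (for example thin wrappers around whatever Azure tooling the team picks); both are hypothetical, and the exact namespace naming beyond the `e2e-test-` prefix is an assumption.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_namespace(run_id, create, delete, prefix="e2e-test-"):
    """Create a per-run namespace and always delete it afterwards."""
    name = f"{prefix}{run_id}"
    create(name)
    try:
        yield name
    finally:
        # Runs even when the scenario inside the `with` block fails,
        # so a red E2E run does not leak a billable namespace.
        delete(name)
```

A scenario would wrap its body in `with ephemeral_namespace(ci_run_id, create_ns, delete_ns) as ns:` and pass `ns` into the test secrets. If `create` itself raises, `delete` is not called, which is the intended behavior since nothing was provisioned.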

Scenario ladder (do in order within Ring 3):

  1. Single-cycle, single-participant — proves the wiring (~3 weeks alone)
  2. Single-cycle, two-participant federated — proves the averaging path
  3. Multi-cycle with mid-cycle pod kill recovery — proves pods-monitor reporting
  4. Domain-scoped private use-case + cross-domain user denial — proves authorization
  5. Inference submission lifecycle — separate path from training cycles

Hard prerequisites before Ring 3 starts:

  1. A values.e2e.yaml profile in the client Helm chart with tiny resource requests, no GPU, externalized SB connection strings as test secrets
  2. The QA_TESTING_PLAYBOOK.md "End-to-End Smoke Test" (steps 1–14) is the source of truth — translate that file directly into a pytest suite

Cost: 8–12 weeks, single owner, ~30–60 min runtime per E2E pass.

[#699 contrast] This ring is the most direct answer to tracebloc/backend#699's "customer-platform incompatibility" candidate gate, and it partners explicitly with #690 (bare-metal test farm) — the bare-metal farm provides the heterogeneous hardware substrate; this ring provides the test driver that runs on it. tracebloc/backend#699's data-driven framing would ask: have we shipped install failures that customers hit, or k8s-version incompatibilities, or RBAC drift? If yes, this ring earns its complexity. If no, defer — the manual playbook + bare-metal farm cover the realistic risk at current customer count.

Where this epic and tracebloc/backend#699 align / diverge — summary

| Topic | tracebloc/backend#699 says | This epic says | Reconciliation |
|---|---|---|---|
| Order of test investment | Phase 0 incident review picks the gate | Three rings are described, ordering deferred | Same answer: don't pre-commit to Ring 1 → 2 → 3. Pick based on Phase 0 (or any equivalent signal). |
| Predetermined plans | "Not a 5-layer blueprint to deploy speculatively" | Ring playbooks exist on the shelf, conditional on selection | Same answer: drafts ≠ commitments. Having the draft makes ticket-filing fast when the gate is picked. |
| Test quality | Green CI on shallow tests is the failure mode | Each ring section names which sub-items are most at risk of being shape-tests | Same answer. Reviewer checklist for Ring 1: "would this test have caught a known incident?" |
| Tooling-vs-culture | Equal weight; pairing + ADRs + onboarding doc are first-class | Tooling-only focus | This epic doesn't replicate tracebloc/backend#699's culture work. Culture interventions stay in tracebloc/backend#699. |
| Customer-impact signal | "from:customer" kanban view, regression-rate metric | Silent | Defer to tracebloc/backend#699. Adding regression-rate tracking is not in this epic's scope. |
| Claude Code guardrails | Audit trail, branch-target enforcement, destructive-op restrictions | Silent | Defer to tracebloc/backend#699. |
| Success criterion | Measurable drop in "% of PRs causing follow-up fix within 7 days" | "Integration coverage exists where it didn't before" | #699's metric is better. If/when a ring is picked, the success criterion should be regression-rate, not coverage delta. |
| What's deferred | Canary, feature flags, perf regression, customer × code-path telemetry | Same list | Aligned. |

Net: this epic is a deeper drill into one of tracebloc/backend#699's candidate gate families. It does not contradict tracebloc/backend#699; it pre-fills the answer for two of its candidates. The four areas where this epic is silent (culture, signal, Claude guardrails, regression metric) are owned by tracebloc/backend#699 and should not be re-litigated here.

Open questions / decisions needed before committing to a ring

These need group sign-off, not unilateral decisions:

  1. Selection trigger. Do we wait for tracebloc/backend#699 Phase 0 to pick a gate, or can this epic be triggered independently (e.g., by a customer-reported integration bug)? Recommended: wait for Phase 0 unless a specific incident creates the urgency.
  2. Owner if a ring is picked. Ring 1 fits per-repo authors (consistent with tracebloc/backend#665's Phase 2a pattern). Ring 2 + Ring 3 need a single owner — likely Asad (cross-cutting tech-lead, already named on tracebloc/backend#699) with this epic's authors driving repo-specific implementation.
  3. Azure Service Bus in CI (relevant to Ring 2 + Ring 3):
    • Ring 2: emulator container (free, contract-drift risk) vs. a dedicated SB namespace per run (~$1, faithful). Recommended: emulator with periodic drift checks.
    • Ring 3: dedicated SB namespace per run is non-negotiable; emulator can't validate the actual customer connection path.
  4. Bare-metal farm dependency (Ring 3): how much of Ring 3 is blocked on #690 being ready? Recommended: scenarios 1–3 can run on kind without the bare-metal farm. Scenarios 4–5 should wait until the farm has at least one bare-metal node available.
  5. Where does integration-tests/ live? Options: (a) tracebloc/backend/integration-tests/ (matches the private-catch-all convention), (b) a new tracebloc/integration-tests repo (clean separation, but creates an 11th repo to maintain), (c) tracebloc/.github/integration-tests/ (org-level neutral). Recommended: (a) for Ring 2, decide later for Ring 3.
  6. averaging-service refactor question carried over from tracebloc/backend#665: Lukas originally flagged "refactor service/averaging.py to extract pure math vs accept the risk." Ring 1's golden-value test + Ring 2's federated-through-SB scenarios make this incremental rather than blocking. Confirm we're OK with the incremental path.
  7. Frontend ring scope. Ring 1 currently only covers frontend-app Cypress promotion. Should design-system get an analogous integration ring (e.g., Chromatic visual regression)? Recommended: out of scope here; visual-regression has different failure modes and belongs in its own epic if we want it.

Definition of done (if a ring is selected)

This epic's DoD is "the playbook is ready to execute," not "all three rings shipped." Concretely:

  • Per-repo test classification table maintained and updated quarterly (next refresh: end of Q3 2026)
  • When a ring is selected for execution, a child epic is filed with the ring's row-level tickets pre-drafted from this epic's content
  • Cross-references to tracebloc/backend#699 / tracebloc/backend#665 / tracebloc/backend#690 / tracebloc/averaging-service#116 stay current

If a ring is executed under this epic, that ring's DoD:

  • Ring 1: every row in the table has at least one passing integration test in CI, gated as a required status check, and each test has documented "incident-it-would-catch" justification
  • Ring 2: docker-compose suite green on backend + tracebloc-py-package PRs, runtime ≤10 min, regression-rate measured before/after for at least 4 weeks
  • Ring 3: scenarios 1–3 running nightly, scenario 4 + 5 running pre-release, full E2E completion in ≤60 min, regression-rate measured before/after for at least 4 weeks
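
The regression-rate measurement in the Ring 2 and Ring 3 criteria refers to the metric quoted in the summary table ("% of PRs causing follow-up fix within 7 days"). One way it could be computed is sketched below. The data model is an assumption: how a follow-up fix gets linked back to the PR it repairs (a label, a "fixes #N" convention, manual triage) is not defined in this epic, and `regression_rate` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def regression_rate(merged_prs, follow_up_fixes, window_days=7):
    """Percent of merged PRs with at least one follow-up fix inside the window.

    merged_prs: dict mapping PR id -> merge datetime.
    follow_up_fixes: iterable of (fixed PR id, fix merge datetime) pairs.
    """
    if not merged_prs:
        return 0.0
    window = timedelta(days=window_days)
    regressed = set()
    for pr_id, fixed_at in follow_up_fixes:
        merged_at = merged_prs.get(pr_id)
        # Ignore fixes for unknown PRs and fixes outside the 0..window range.
        if merged_at is not None and timedelta(0) <= fixed_at - merged_at <= window:
            regressed.add(pr_id)
    return 100.0 * len(regressed) / len(merged_prs)
```

Running this over the four weeks before and the four weeks after a ring lands gives the before/after comparison. A PR with several follow-up fixes counts once, since the metric is about PRs, not fixes.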

Related

Discussion thread

This epic does not commit the team to any work. It commits the team to having a ready answer when tracebloc/backend#699 Phase 0 picks a gate, when a customer reports an integration regression, or when leadership decides the regression-rate-target window has come.

Suggested next step: review this body together (30 min), confirm the per-repo classification matches lived reality, agree on Open Questions 1–5, then leave the epic on the shelf until a triggering signal arrives.
