test: share testcontainers per service via pkg/testutil by Joey0538 · Pull Request #208 · hmchangw/chat

Joey0538 · 2026-05-20T06:25:55Z

Summary

Consolidates testcontainer management for every integration test in the repo. pkg/testutil is now the single source of truth for Elasticsearch, NATS, Valkey, Mongo, Cassandra and MinIO startup; each service uses shared containers via the testutil helpers instead of rolling its own.

Originally scoped to search-service (per-test ES startup was the worst offender at ~30-60s × 4 Rooms tests + 2 CCS pairs). Expanded after audit to fix the same pattern across every integration package.

Why merge? Tests used to leak containers — even after a green run, MongoDB / NATS / Cassandra / Valkey containers piled up on the dev server eating resources. That happened because the shared helpers never terminated their containers and we have to disable Ryuk on our CI runner. This PR adds explicit TerminateAll via TestMain so containers go away on every clean test exit.

What changed

`pkg/testutil`

Helper	Before	After
`MongoDB(t, prefix)`	shared via `sync.Once`, no Terminate	shared, + `TerminateMongo`
`CassandraKeyspace(t, prefix)`	shared via `sync.Once`, no Terminate	shared, + `TerminateCassandra`
`MinIO(t, prefix)`	shared via `sync.Once`, no Terminate	shared, + `TerminateMinIO`
`Elasticsearch(t)`	n/a	new, shared single-node ES + `TerminateElasticsearch`
`NATS(t)`	n/a	new, shared NATS w/ JetStream + `TerminateNATS`
`StartValkeyCluster(t)`	per-test (from main #199)	unchanged
`SharedValkeyCluster(t)`	n/a	new, process-shared cluster-mode Valkey + `FlushValkey` + `TerminateValkey`
`TerminateAll()`	n/a	new, umbrella cleanup
`RunTests(m)`	n/a	new, one-liner TestMain helper
`init()`	n/a	new, disables Ryuk repo-wide (our CI runner can't run the reaper sidecar)

Each shared helper has a matching no-t EnsureXxx() variant for TestMain pre-warming.

Valkey: two variants

Reconciled with main's PR #199 (cluster-mode Valkey support). Two helpers, choose by use case:

SharedValkeyCluster(t) — default. Process-shared cluster (started via sync.Once, reaped via TerminateAll). Callers pair with t.Cleanup(testutil.FlushValkey) for keyspace isolation. Used by search-service tests, pkg/roomsubcache, pkg/valkeyutil.
StartValkeyCluster(t) — per-test cluster. Use only when the test asserts on cluster-routing state (pkg/roomkeystore's CLUSTER KEYSLOT checks) or owns a store wrapper that Close()s the underlying client (room-service, room-worker).

Per-service migrations

Inline container starts removed (replaced by testutil helpers): search-service (ES/NATS), search-sync-worker (ES/NATS), room-service (NATS), inbox-worker (NATS), tools/loadgen (NATS), pkg/natsrouter, pkg/roomkeystore, pkg/roomsubcache, pkg/valkeyutil.
TestMain added for explicit cleanup: 18 packages, each calling testutil.TerminateAll() on exit via testutil.RunTests(m).
Kept local: search-service CCS tests (need 2 ES on a shared docker network), pkg/roomkeysender (NATS with WebSocket transport for a JS bridge), pkg/roomcrypto (Node container). Each still imports testutil so the Ryuk-disable init runs.

Container reference / cleanup contract

Every testcontainer in this PR has:

A stored container reference (not just URL/addr) so it can be terminated later.
Best-effort Terminate on init-failure paths inside ensureXxx helpers.
Final teardown driven by either TestMain → TerminateAll (shared containers) or t.Cleanup (per-test local containers).

This holds regardless of test pass / fail / panic: t.Cleanup runs on failure + panic, and TestMain's code := m.Run(); TerminateAll(); os.Exit(code) wrap ensures cleanup runs before process exit. All Terminate* functions are idempotent and nil their refs after cleanup, so duplicate calls are safe.

Isolation strategy

Elasticsearch: per-test unique index name (uniqueESIndex, t.Name()-hashed), DELETEd on cleanup.
Valkey (shared): FLUSHALL across every master on cleanup via testutil.FlushValkey(t).
Valkey (per-test): automatic — each test gets its own cluster.
NATS: per-test *nats.Conn; subscriptions removed via router.Shutdown + nc.Drain before the next test.
Mongo/Cassandra/MinIO: per-test DB / keyspace / bucket name (already in testutil).

Other changes

Package consistency: every integration test file uses internal package <pkg> / package main, never external *_test. Required by CLAUDE.md and gives access to unexported identifiers.
history-service/docker-local/ removed — redundant per-service dev compose stack; the canonical local stack lives at repo root.
CLAUDE.md updated — Section 4's Integration Tests rewritten to match this PR's conventions.

Performance impact (per `go test` invocation, in-process)

Container	Before (search-service)	After
Elasticsearch (single-node)	4 (Rooms ×4)	1
Elasticsearch (CCS pair)	4 (2 tests × 2 nodes)	4 (unchanged)
NATS	11	1
Valkey	5	1
Mongo	already shared	already shared

TestMain pre-warms ES + NATS + Valkey + Mongo in parallel goroutines so cold-start wall-clock drops to max(ES, NATS, Valkey) ≈ ES alone instead of the sum. ES heap also dropped from 512m to 256m for tests (seeds are <10 docs/test).

Note: container sharing is per-process, so it applies within a single go test ./<service>/... invocation. Across services (separate test binaries), each gets its own pool — which is what the CI matrix already does anyway.

Pattern for new integration tests

//go:build integration

package mypackage

import (
    "testing"

    "github.com/hmchangw/chat/pkg/testutil"
)

func TestMain(m *testing.M) { testutil.RunTests(m) }

func TestSomething(t *testing.T) {
    natsURL := testutil.NATS(t)
    db      := testutil.MongoDB(t, "my_service")
    // ...
}

Verification

go vet (default and -tags=integration) clean across the whole repo
make lint: 0 issues
make test (unit) passes
Integration tests can only be exercised end-to-end in CI (no Docker locally) — relying on the CI matrix run on this PR

Test plan

All test-integration (<service>) matrix jobs pass
Wall-clock for the search-service integration job drops noticeably (it was dominated by ES cold starts)
Other services' integration jobs still pass at parity or better
No leaked containers on the dev server after a green test run

https://claude.ai/code/session_01MzdTDAGknG36FYtqd5MBTw

coderabbitai · 2026-05-20T06:26:08Z

Warning

Rate limit exceeded

@Joey0538 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 11 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7457c8f-d7d6-44b2-b595-43b9455f370e

📥 Commits

Reviewing files that changed from the base of the PR and between cda13a2 and e757c42.

📒 Files selected for processing (30)

CLAUDE.md
history-service/docker-local/.env.example
history-service/docker-local/docker-compose.yml
history-service/internal/service/integration_test.go
history-service/internal/service/main_test.go
pkg/natsrouter/integration_test.go
pkg/natsrouter/main_test.go
pkg/roomcrypto/main_test.go
pkg/roomkeysender/integration_test.go
pkg/roomkeysender/main_test.go
pkg/roomkeystore/main_test.go
pkg/roomsubcache/integration_test.go
pkg/roomsubcache/main_test.go
pkg/testutil/elasticsearch.go
pkg/testutil/init.go
pkg/testutil/terminate.go
pkg/testutil/testmain.go
pkg/testutil/valkey.go
pkg/valkeyutil/integration_test.go
pkg/valkeyutil/main_test.go
room-service/integration_test.go
search-service/integration_apps_test.go
search-service/integration_ccs_test.go
search-service/integration_messages_test.go
search-service/integration_rooms_test.go
search-service/integration_users_test.go
search-service/setup_shared_test.go
search-sync-worker/integration_test.go
tools/loadgen/integration_test.go
tools/loadgen/main_test.go

📝 Walkthrough

Walkthrough

This PR establishes a centralized integration test infrastructure in pkg/testutil, consolidates process-shared containers (Elasticsearch, NATS, Valkey, MinIO, Mongo, Cassandra) with concurrent pre-warming and synchronized teardown, adds search-service integration tests for apps, messages, rooms, and users with per-test fixtures, introduces CCS (cross-cluster search) tests, and migrates all existing integration tests to use the shared container helpers.

Changes

Shared Integration Test Infrastructure

Layer / File(s)	Summary
Shared container lifecycle and readiness `pkg/testutil/elasticsearch.go`, `pkg/testutil/nats.go`, `pkg/testutil/valkey.go`, `pkg/testutil/minio.go`, `pkg/testutil/mongo.go`, `pkg/testutil/cassandra.go`, `pkg/testutil/init.go`, `pkg/testutil/terminate.go`, `pkg/testutil/testmain.go`	Process-shared singletons for Elasticsearch, NATS, Valkey, MinIO, Mongo, and Cassandra, each with sync.Once guards, startup helpers, test accessors that fail tests on error, pre-warm helpers for TestMain, and best-effort idempotent termination with stderr logging. Disable Ryuk reaper by default via init(). TerminateAll() sequences termination. RunTests(m) runs tests and calls TerminateAll() before exit.
Search-service test harness and per-test utilities `search-service/setup_shared_test.go`	TestMain concurrently pre-warms ES/NATS/Valkey/Mongo and delegates to testutil.RunTests(m). seedDoc PUTs JSON documents with refresh=true to Elasticsearch. uniqueESIndex creates deterministic per-test index names from fnv-64a(t.Name()) and registers DELETE cleanup. valkeyClient connects per-test Valkey client and registers FlushDB cleanup.
Search-service feature tests: apps, messages, rooms, users `search-service/integration_apps_test.go`, `search-service/integration_messages_test.go`, `search-service/integration_rooms_test.go`, `search-service/integration_users_test.go`	Per-feature NATS-backed fixtures and integration tests. Apps tests verify prototype pipeline matching and assistant-enabled filtering using Mongo-seeded documents. Messages v2 tests use Elasticsearch HTTP stub and assert hit projections and empty-query errors. Rooms tests seed spotlight documents, verify happy-path room search with account boundaries, apply roomType filters, and assert invalid roomType errors. Users tests stub third-party HTTP endpoints and verify response mapping, empty-query rejection, and error code conversion.
Cross-cluster search (CCS) integration tests `search-service/integration_ccs_test.go`	CCS fixture creates Docker network, two Elasticsearch nodes (local/remote) with transport aliases, configures cluster.remote in proxy mode, polls for remote connectivity, and installs deterministic index templates. Two e2e tests: unrestricted CCS seeds local user-room doc listing both local and remote rooms as unrestricted, indexes one message in each cluster, and asserts merged response with correct roomId→siteId attribution; restricted CCS seeds HSS-based room restriction, indexes remote messages covering pre-HSS and post-HSS parent/reply combinations with tshow-dependent thread reply inclusion/exclusion, and asserts exactly three hits with precise message ID filtering.
TestMain entry points across services `broadcast-worker/main_test.go`, `message-worker/main_test.go`, `notification-worker/main_test.go`, `inbox-worker/main_test.go`, `room-service/main_test.go`, `room-worker/main_test.go`, `pkg/minioutil/main_test.go`, `pkg/mongoutil/main_test.go`, `pkg/natsrouter/main_test.go`, `pkg/roomcrypto/main_test.go`, `pkg/roomkeysender/main_test.go`, `pkg/userstore/main_test.go`, `history-service/internal/cassrepo/main_test.go`, `history-service/internal/mongorepo/main_test.go`, `history-service/internal/service/main_test.go`	Integration-build-tagged TestMain functions that delegate test execution to testutil.RunTests(m), wiring all services into the centralized test harness.
Migrate existing integration tests to shared infrastructure `inbox-worker/integration_test.go`, `pkg/natsrouter/integration_test.go`, `pkg/roomkeystore/integration_test.go`, `pkg/roomsubcache/integration_test.go`, `room-service/integration_test.go`, `room-worker/integration_test.go`, `search-sync-worker/integration_test.go`, `tools/loadgen/integration_test.go`, `tools/loadgen/members_integration_test.go`	Update setupNATS, setupValkey, setupElasticsearch, and MongoDB helpers to use testutil.NATS(t), testutil.Valkey(t), testutil.Elasticsearch(t), and testutil.MongoDB(t) instead of starting dedicated testcontainers per-test or per-package. Remove local container startup/termination logic and testcontainers-specific imports. Register per-test cleanup via t.Cleanup() to flush Valkey and close resources.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

hmchangw/chat#157: Shared MinIO container lifecycle and helper functions in pkg/testutil/minio.go overlap with prior test utility work, though the bulk of this PR (search-service CCS/integration tests and container orchestration) is independent.

Suggested reviewers

hmchangw
GITMateuszCharczuk
mliu33

Poem

🐰 Tests now share one docker dream,
No more containers—one gleaming team,
Elasticsearch, NATS, Valkey aligned,
Cross-cluster search—borders refined!
Happy paths dance, bad queries decline,
The harness stands tall, the infra's divine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 48.98% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'test: share testcontainers per service via pkg/testutil' accurately describes the main change: centralizing testcontainer management through a shared utility package.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/check-test-containers-V1EMq

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@search-service/setup_shared_test.go`:
- Around line 224-226: The FLUSHDB error is only logged so test suite can
proceed with leftover keys; change the cleanup to fail the test on error by
replacing the t.Logf call with a test-failing assertion (e.g., call
t.Fatalf("flush valkey at %s: %v", addr, err) or use require.NoError(t, err)) so
that the rc.FlushDB(ctx).Err() failure in the setup (the statement calling
rc.FlushDB) causes the test to stop and surface the isolation breach
immediately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7bc2262b-cf97-40d2-b014-49aa3bae44da

📥 Commits

Reviewing files that changed from the base of the PR and between 7b5747b and 35212f0.

📒 Files selected for processing (2)

search-service/integration_test.go
search-service/setup_shared_test.go

…rrors A silent log on FLUSHDB failure could let state leak into the next sibling test, surfacing far from the real root cause. t.Errorf marks the offending test failed while still allowing the rest of its cleanup chain (valkeyutil.Disconnect, etc.) to run. Per CodeRabbit review on PR #208.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@search-service/integration_ccs_test.go`:
- Around line 132-144: The test fixture that creates the router via
natsrouter.New("search-service-test") registers middleware and handlers but
never calls router.Shutdown, causing NATS subscriptions to leak; update the
fixture to register a t.Cleanup (or defer) that calls router.Shutdown() (or
router.Shutdown(ctx) if it requires a context) after handler.Register and Flush,
ensuring the router created by natsrouter.New is properly shut down when the
test ends.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81248fd1-e0d2-4a24-9b4c-b28ac23851a1

📥 Commits

Reviewing files that changed from the base of the PR and between 35212f0 and e7718a4.

📒 Files selected for processing (3)

search-service/integration_ccs_test.go
search-service/integration_test.go
search-service/setup_shared_test.go

- Add t.Cleanup(router.Shutdown) to setupCCSFixture to match every other fixture in the package (per CodeRabbit review on PR #208). - Trim narration / what-comments across the integration_*_test.go files and setup_shared_test.go. Preserve genuine WHY comments: X-Elastic-Product product-check, mallory access boundary, FNV hash rationale, FLUSHDB isolation note, Flush before subscribe-race.

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (8)

pkg/testutil/minio.go (1)
72-74: ⚡ Quick win

Replace stderr printf logging with structured slog.

Line 73 should use log/slog JSON logging with key-value fields rather than fmt.Fprintf.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/minio.go` around lines 72 - 74, Replace the stderr printf in the
minio container termination block with structured slog JSON logging: when
minioContainer.Terminate(ctx) returns an error, call the slog logger (e.g.,
slog.Error or slog.Log with Level "error") and include key-value fields such as
"msg" ("terminate shared minio failed"), "err" (the error), and any context like
"component":"minio" or "action":"terminate" instead of using
fmt.Fprintf(os.Stderr,...); update the import to use log/slog if not already
present.
pkg/testutil/valkey.go (1)
44-51: ⚡ Quick win

Align cleanup paths with logging/error-handling guidelines.

Lines 44/50 silently drop cleanup errors, and Line 85 logs with fmt.Fprintf. Please switch these to structured slog logging with key-value fields.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers" and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 84-86
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/valkey.go` around lines 44 - 51, The cleanup paths in
pkg/testutil/valkey.go currently ignore errors from container.Terminate(ctx) and
use fmt.Fprintf for logging; change these to capture and handle termination
errors instead of discarding them (do not use blank identifier), assign or log
the terminate error if termination fails, and replace any fmt.Fprintf calls with
structured slog calls (slog.Error/slog.Info) using key-value fields; update the
branches around container.Terminate(ctx), valkeyInitErr assignments, and the
fmt.Fprintf usage to call slog.* with context like ("msg", "failed to terminate
container", "err", err) or include both errors when setting valkeyInitErr so
cleanup failures are visible.
pkg/testutil/mongo.go (2)
61-71: ⚡ Quick win

Make TerminateMongo fully idempotent by clearing shared references.

After teardown, mongoClient and mongoContainer stay non-nil, so repeated calls can re-run disconnect/terminate on stale state. Reset both references after attempting teardown.
Proposed diff
 func TerminateMongo() {
 	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 	defer cancel()
 	if mongoClient != nil {
 		if err := mongoClient.Disconnect(ctx); err != nil {
 			fmt.Fprintf(os.Stderr, "disconnect shared mongo client: %v\n", err)
 		}
+		mongoClient = nil
 	}
 	if mongoContainer != nil {
 		if err := mongoContainer.Terminate(ctx); err != nil {
 			fmt.Fprintf(os.Stderr, "terminate shared mongo: %v\n", err)
 		}
+		mongoContainer = nil
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/mongo.go` around lines 61 - 71, The TerminateMongo function
currently disconnects the shared mongoClient and terminates mongoContainer but
leaves the package-level variables set; update TerminateMongo so that after
attempting mongoClient.Disconnect(ctx) and mongoContainer.Terminate(ctx) it sets
mongoClient = nil and mongoContainer = nil (clearing the shared references) to
make teardown idempotent; ensure you still log errors from Disconnect and
Terminate (using the existing fmt.Fprintf calls) before nil-ing the variables.
39-46: ⚡ Quick win

Fix Mongo cleanup logging and silent error-discard paths.

Lines 39/45 silently discard terminate errors, and Lines 63/68 use fmt.Fprintf logging. Please move these to structured slog logging and keep cleanup failures explicit.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers" and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 62-69
pkg/testutil/cassandra.go (1)
48-55: ⚡ Quick win

Use structured logging and explicit handling in Cassandra cleanup paths.

Lines 48/54/69 discard cleanup errors silently, and Line 93 uses fmt.Fprintf. Please switch to structured slog logs and avoid silent cleanup-error drops.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers" and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 69-70, 92-94
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/cassandra.go` around lines 48 - 55, The cleanup paths in
pkg/testutil/cassandra.go currently discard errors (e.g., "_ =
container.Terminate(ctx)") and use fmt.Fprintf for output; update the error
handling in the Cassandra initialization/cleanup blocks (around the
container.Terminate calls that follow get cassandra host/port errors and the
final cleanup) to capture and handle termination errors instead of ignoring
them, set/return cassInitErr as appropriate, and emit structured JSON logs using
log/slog (not fmt) with clear context (e.g., include the container ID, operation
like "terminate container", and both the original and termination errors).
Replace any fmt.Fprintf usage with slog calls, and if a termination error is
intentionally ignored, add a brief comment explaining why; reference
cassInitErr, container.MappedPort, and the container.Terminate calls to locate
the changes.
pkg/testutil/elasticsearch.go (1)
54-61: ⚡ Quick win

Use structured slog logging and avoid silent cleanup-error drops.

Lines 54/60 silently discard Terminate errors, and Line 93 uses unstructured fmt.Fprintf logging. Please log cleanup failures with structured slog fields instead of stderr formatting.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers" and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 92-94
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/elasticsearch.go` around lines 54 - 61, The code silently
ignores errors from container.Terminate and uses unstructured fmt.Fprintf
logging; update the cleanup and logging to use log/slog with structured fields:
capture and check the error returned by container.Terminate where currently
discarded and log it via slog.Error (include context fields like
"action":"terminate", "container":container.Name or container.ID and the error),
and replace the fmt.Fprintf usages with equivalent slog calls (slog.Error or
slog.Info) including structured keys; if any Terminate error is intentionally
ignored, add an explicit comment explaining why. Ensure esInitErr assignment
logic (esInitErr = fmt.Errorf(...)) remains but uses slog for diagnostic output
instead of fmt.
pkg/testutil/nats.go (1)
44-45: ⚡ Quick win

Standardize NATS teardown/error paths to structured logging.

Line 44 drops a cleanup error silently, and Line 77 uses fmt.Fprintf for logging. Use slog JSON logs with fields for both init-cleanup and terminate failures.

As per coding guidelines: "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers" and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 76-78
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/nats.go` around lines 44 - 45, The teardown and init error
handling in pkg/testutil/nats.go should use log/slog structured JSON logs and
must not silently discard errors: replace the ignored assignment "_ =
c.Terminate(ctx)" with proper error capture (e.g., errTerm := c.Terminate(ctx))
and when terminating on failure set or augment natsInitErr and log both init and
terminate failures via slog (use slog.Error with key/value fields referencing
the operation, error, and any context such as nats container id); also replace
the fmt.Fprintf usage around lines 76-78 with slog.Error/slog.Info calls that
include fields for the original init error and the terminate error instead of
printing plain text. Ensure you update references to natsInitErr and c.Terminate
to propagate or wrap errors consistently rather than discarding them.
pkg/testutil/init.go (1)
17-18: ⚡ Quick win

Handle the discarded os.Setenv error explicitly.

Line 18 ignores the error silently. Please either handle it or add an inline note explaining why discarding it is intentional here.

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/init.go` around lines 17 - 18, The call to
os.Setenv("TESTCONTAINERS_RYUK_DISABLED", "true") currently discards its error;
update the init logic to handle the returned error from os.Setenv rather than
ignoring it — either check the error and call log.Fatalf or t.Fatalf (or return
the error) so failures are surfaced, or if you truly intend to ignore it add an
explicit comment explaining why and suppress only after documenting; reference
the env var name TESTCONTAINERS_RYUK_DISABLED and the os.Setenv call when making
the change so the handling is applied in the same initialization block.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@search-service/integration_apps_test.go`:
- Line 40: Teardown calls currently discard errors (serverNATS.Drain() and
router.Shutdown(...)); change them to capture the returned error and report it
to the test instead of ignoring it — e.g., replace the discarded calls in the
t.Cleanup and shutdown block with error checks like "if err :=
serverNATS.Drain(); err != nil { t.Errorf("serverNATS.Drain failed: %v", err) }"
and similarly check the error returned by router.Shutdown(ctx) and report it via
t.Errorf or t.Fatalf so teardown failures are visible.

In `@search-service/integration_ccs_test.go`:
- Line 138: The cleanup currently discards router.Shutdown errors; update the
t.Cleanup closure to capture the returned error from
router.Shutdown(context.Background()) and surface it (e.g., call t.Fatalf or
t.Errorf with the error) instead of using `_ =`, so failing shutdowns fail the
test and prevent leaked subscriptions. Refer to the t.Cleanup closure and the
router.Shutdown call to locate and modify the code.

In `@search-service/integration_messages_test.go`:
- Line 38: The test fixture currently discards errors from io.Copy (reading
r.Body) and other calls (w.Write, Drain, Shutdown); update the test in
integration_messages_test.go to either check and assert these errors (e.g.,
require.NoError/if err != nil t.Fatalf) or explicitly comment why an error may
be safely ignored. Locate the uses of io.Copy(r.Body, io.Discard) and subsequent
w.Write, Drain, and Shutdown calls and replace the blank-error assignments with
proper error handling/assertions referencing those symbols so failures surface
during test runs.

In `@search-service/integration_rooms_test.go`:
- Line 47: Replace the silent error discards with proper error handling: in the
t.Cleanup calls that currently do `_ = serverNC.Drain()` (and any other `_ =
...` teardown calls) check the returned error and call t.Fatalf/t.Errorf (or
require.NoError) so teardown failures are reported instead of ignored, and in
the index setup where json.Marshal and ReadAll (or similar marshal/read calls)
currently drop errors, capture the error and fail the test or return it so the
test harness sees the failure; if any discard is intentional, add an explicit
comment explaining why and keep the explicit check/comment next to
serverNC.Drain(), json.Marshal, ReadAll (and any os.Remove/cleanup calls) to
satisfy the "never ignore errors silently" guideline.

In `@search-service/integration_users_test.go`:
- Line 41: Tests currently discard errors returned from cleanup/write calls
(e.g., the call to serverNC.Drain()), which hides fixture failures; update each
site (the serverNC.Drain() call and the other two cleanup/write calls
referenced) to capture the returned error and fail the test if non-nil (for
example: err := serverNC.Drain(); if err != nil { t.Fatalf("serverNC.Drain
failed: %v", err) }) so errors are surfaced with clear context.

In `@search-service/setup_shared_test.go`:
- Line 47: In seedDoc, don’t discard the error return from io.ReadAll; update
the call inside function seedDoc to capture both variables (e.g., body, err :=
io.ReadAll(resp.Body)), check err and fail the test or return a descriptive
error (using t.Fatalf/t.Errorf or returning the error) so response-body read
failures are surfaced in test output, and include the response status and any
relevant context in the error message for easier debugging.
- Around line 64-67: TestMain is currently discarding errors returned by the
Ensure* bootstrap functions inside the goroutine (the anonymous func wrapping f
func() error) so failures are hidden; change the bootstrapping to capture and
fail-fast on any non-nil error returned by the Ensure* calls (e.g., collect
errors via a buffered errCh or a synced slice while still calling wg.Done, then
after wg.Wait check for any error and call processLogger.Fatal/os.Exit(1) or
t.Fatalf) before calling m.Run; locate the anonymous goroutine and the TestMain
orchestration (wg, m.Run, the func(f func() error) { ... }(fn)) and modify it to
propagate errors instead of discarding them.

---

Nitpick comments:
In `@pkg/testutil/cassandra.go`:
- Around line 48-55: The cleanup paths in pkg/testutil/cassandra.go currently
discard errors (e.g., "_ = container.Terminate(ctx)") and use fmt.Fprintf for
output; update the error handling in the Cassandra initialization/cleanup blocks
(around the container.Terminate calls that follow get cassandra host/port errors
and the final cleanup) to capture and handle termination errors instead of
ignoring them, set/return cassInitErr as appropriate, and emit structured JSON
logs using log/slog (not fmt) with clear context (e.g., include the container
ID, operation like "terminate container", and both the original and termination
errors). Replace any fmt.Fprintf usage with slog calls, and if a termination
error is intentionally ignored, add a brief comment explaining why; reference
cassInitErr, container.MappedPort, and the container.Terminate calls to locate
the changes.

In `@pkg/testutil/elasticsearch.go`:
- Around line 54-61: The code silently ignores errors from container.Terminate
and uses unstructured fmt.Fprintf logging; update the cleanup and logging to use
log/slog with structured fields: capture and check the error returned by
container.Terminate where currently discarded and log it via slog.Error (include
context fields like "action":"terminate", "container":container.Name or
container.ID and the error), and replace the fmt.Fprintf usages with equivalent
slog calls (slog.Error or slog.Info) including structured keys; if any Terminate
error is intentionally ignored, add an explicit comment explaining why. Ensure
esInitErr assignment logic (esInitErr = fmt.Errorf(...)) remains but uses slog
for diagnostic output instead of fmt.

In `@pkg/testutil/init.go`:
- Around line 17-18: The call to os.Setenv("TESTCONTAINERS_RYUK_DISABLED",
"true") currently discards its error; update the init logic to handle the
returned error from os.Setenv rather than ignoring it — either check the error
and call log.Fatalf or t.Fatalf (or return the error) so failures are surfaced,
or if you truly intend to ignore it add an explicit comment explaining why and
suppress only after documenting; reference the env var name
TESTCONTAINERS_RYUK_DISABLED and the os.Setenv call when making the change so
the handling is applied in the same initialization block.

In `@pkg/testutil/minio.go`:
- Around line 72-74: Replace the stderr printf in the minio container
termination block with structured slog JSON logging: when
minioContainer.Terminate(ctx) returns an error, call the slog logger (e.g.,
slog.Error or slog.Log with Level "error") and include key-value fields such as
"msg" ("terminate shared minio failed"), "err" (the error), and any context like
"component":"minio" or "action":"terminate" instead of using
fmt.Fprintf(os.Stderr,...); update the import to use log/slog if not already
present.

In `@pkg/testutil/mongo.go`:
- Around line 61-71: The TerminateMongo function currently disconnects the
shared mongoClient and terminates mongoContainer but leaves the package-level
variables set; update TerminateMongo so that after attempting
mongoClient.Disconnect(ctx) and mongoContainer.Terminate(ctx) it sets
mongoClient = nil and mongoContainer = nil (clearing the shared references) to
make teardown idempotent; ensure you still log errors from Disconnect and
Terminate (using the existing fmt.Fprintf calls) before nil-ing the variables.

In `@pkg/testutil/nats.go`:
- Around line 44-45: The teardown and init error handling in
pkg/testutil/nats.go should use log/slog structured JSON logs and must not
silently discard errors: replace the ignored assignment "_ = c.Terminate(ctx)"
with proper error capture (e.g., errTerm := c.Terminate(ctx)) and when
terminating on failure set or augment natsInitErr and log both init and
terminate failures via slog (use slog.Error with key/value fields referencing
the operation, error, and any context such as nats container id); also replace
the fmt.Fprintf usage around lines 76-78 with slog.Error/slog.Info calls that
include fields for the original init error and the terminate error instead of
printing plain text. Ensure you update references to natsInitErr and c.Terminate
to propagate or wrap errors consistently rather than discarding them.

In `@pkg/testutil/valkey.go`:
- Around line 44-51: The cleanup paths in pkg/testutil/valkey.go currently
ignore errors from container.Terminate(ctx) and use fmt.Fprintf for logging;
change these to capture and handle termination errors instead of discarding them
(do not use blank identifier), assign or log the terminate error if termination
fails, and replace any fmt.Fprintf calls with structured slog calls
(slog.Error/slog.Info) using key-value fields; update the branches around
container.Terminate(ctx), valkeyInitErr assignments, and the fmt.Fprintf usage
to call slog.* with context like ("msg", "failed to terminate container", "err",
err) or include both errors when setting valkeyInitErr so cleanup failures are
visible.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9e8a5a40-0a62-4f6d-80d1-c9c4b049b15b

📥 Commits

Reviewing files that changed from the base of the PR and between e7718a4 and 7b695cd.

📒 Files selected for processing (16)

.github/workflows/ci.yml
pkg/testutil/cassandra.go
pkg/testutil/elasticsearch.go
pkg/testutil/init.go
pkg/testutil/minio.go
pkg/testutil/mongo.go
pkg/testutil/nats.go
pkg/testutil/terminate.go
pkg/testutil/valkey.go
search-service/integration_apps_test.go
search-service/integration_ccs_test.go
search-service/integration_messages_test.go
search-service/integration_rooms_test.go
search-service/integration_test.go
search-service/integration_users_test.go
search-service/setup_shared_test.go

💤 Files with no reviewable changes (1)

search-service/integration_test.go

✅ Files skipped from review due to trivial changes (1)

.github/workflows/ci.yml

…gration tests The integration_test.go fixtures used to start fresh containers per test — ~14 fixture invocations across the file, with ES startup (~30-60s) being the worst offender. The four search.rooms tests each waited on a brand-new ES container. This refactor moves shared container startup into setup_shared_test.go, using the same sync.Once pattern already used by pkg/testutil/mongo.go. Each fixture isolates per-test state instead of using a fresh container: - Elasticsearch: unique index name per test via t.Name() hash; the index is DELETEd on cleanup. - Valkey: keyspace is FLUSHDB'd on cleanup so the next test starts clean. Uses a raw go-redis client to avoid exposing FLUSHDB on the production valkeyutil.Client interface. - NATS: each test creates its own *nats.Conn pair; subscriptions are removed via router.Shutdown + nc.Drain before the next test runs. CCS tests are the one exception: they need two networked ES nodes with shared docker-network aliases, which doesn't fit a process-shared pool. They keep their per-test ES pair but now piggyback on the shared Valkey and NATS. Expected impact on a clean run: ES container starts drop from ~6 (4 Rooms + 2 CCS) to 3 (1 shared + 2 CCS); NATS starts drop from ~11 to 1; Valkey starts drop from ~5 to 1.

…rrors A silent log on FLUSHDB failure could let state leak into the next sibling test, surfacing far from the real root cause. t.Errorf marks the offending test failed while still allowing the rest of its cleanup chain (valkeyutil.Disconnect, etc.) to run. Per CodeRabbit review on PR #208.

CCS is the one fixture in the file that owns special infrastructure — a pair of ES nodes on a shared docker network with transport-port aliases — so it doesn't fit the process-shared container pattern the other fixtures now use. Moving it (plus its CCS-only helpers: putClusterSetting, waitForRemoteConnected, install/build/messageTestTemplate, userRoomTestTemplate) into integration_ccs_test.go leaves integration_test.go homogeneous: all remaining tests follow the same "NATS request → assert response" pattern. testHTTPClient and seedDoc stay in integration_test.go because they're shared with the non-CCS path (uniqueESIndex cleanup, putTestSpotlightIndex, the Rooms tests' seeds).

One file per search endpoint, matching the CCS split pattern: integration_apps_test.go — search.apps (Mongo + NATS) integration_users_test.go — search.users (NATS + httptest stub) integration_rooms_test.go — search.rooms (shared ES + Valkey + NATS) integration_messages_test.go — search.messages v2 (ES httptest stub + NATS) testUserRoomIndex, testHTTPClient, and seedDoc move into setup_shared_test.go alongside the shared-container helpers — they're test infrastructure used by multiple endpoints, so colocating them keeps the per-endpoint files focused. integration_test.go is removed; the package no longer has a catch-all integration file. CI auto-discovery and path filtering both still match because every new file lives under search-service/ and carries the //go:build integration tag.

- Add t.Cleanup(router.Shutdown) to setupCCSFixture to match every other fixture in the package (per CodeRabbit review on PR #208). - Trim narration / what-comments across the integration_*_test.go files and setup_shared_test.go. Preserve genuine WHY comments: X-Elastic-Product product-check, mallory access boundary, FNV hash rationale, FLUSHDB isolation note, Flush before subscribe-race.

…S heap - TestMain pre-warms shared ES, NATS, and Valkey concurrently via goroutines. First-test wall-clock for cold start drops to max(ES, NATS, Valkey) ≈ ES alone (~30-60s) instead of the sum. Refactored each sharedXxx(t) into a no-t ensureSharedXxx() so TestMain can drive them without a *testing.T. - Drop ES heap from -Xms512m -Xmx512m to -Xms256m -Xmx256m for both the shared single-node ES and the two CCS ES nodes. The test seed sizes (<10 docs per test) fit comfortably in 256m; saves ~5-10s startup per container and ~250MB RAM per node.

… in CI CI sets TESTCONTAINERS_RYUK_DISABLED=true on the integration job because Ryuk doesn't run reliably on this runner. Without Ryuk the shared containers (started via sync.Once with no t.Cleanup) would leak per test job. Wrap m.Run() so TestMain calls terminateShared() on clean exits and explicitly Terminates each shared container (ES, Valkey, NATS). Locally Ryuk stays enabled as a safety net for SIGKILL / Ctrl+C where m.Run never returns.

pkg/testutil/mongo.go now exports TerminateMongo, which disconnects the shared client and stops the shared container. search-service's terminateShared() calls it so the Mongo container is reaped on clean exits — same need as ES/Valkey/NATS now that CI runs with TESTCONTAINERS_RYUK_DISABLED=true. Other services that use testutil.MongoDB can opt in by calling TerminateMongo from their own TestMain when they need explicit cleanup.

pkg/testutil now mirrors the Mongo/Cassandra/MinIO pattern for ES, NATS, and Valkey: each exposes Xxx(t) for tests, EnsureXxx() for TestMain pre-warming, and TerminateXxx() for explicit cleanup. testutil.TerminateAll fans out to every Terminate (each is idempotent if the container was never started) so any service's TestMain becomes a one-liner. NATS starts with --jetstream so consumers that need streams just work against the shared container; core pub/sub consumers pay nothing extra. search-service drops its local sync.Once helpers and uses the testutil helpers throughout. setup_shared_test.go shrinks to just the test-wide testHTTPClient / seedDoc / uniqueESIndex / freshValkeyClient helpers that are specific to the search-* tests. CCS keeps its own networked ES pair (still local) but uses shared NATS/Valkey via testutil. Next: migrate other services that use shared containers (search-sync-worker, room-service, room-worker, etc.) to the testutil helpers + add TestMain calls.

Ryuk fails to start on our CI runner (can't pull / run the sidecar image), so set TESTCONTAINERS_RYUK_DISABLED=true once in pkg/testutil's init(). Every integration test that imports testutil inherits this automatically — no per-job env config needed. Removes the TESTCONTAINERS_RYUK_DISABLED env var from the CI workflow's integration step; the init() makes it redundant. Developers can still override locally with TESTCONTAINERS_RYUK_DISABLED=false to re-enable Ryuk for debugging container leaks. Cleanup contract: testutil.TerminateAll (called from each service's TestMain) is the only mechanism that reaps the shared containers. SIGKILL / Ctrl+C leaks containers — acceptable trade-off for the simplification, and on CI it doesn't matter since the runner VM is torn down anyway.

…ainers Per consumer: - search-sync-worker, room-service, room-worker, inbox-worker, tools/loadgen, pkg/natsrouter, pkg/roomkeystore, pkg/roomsubcache: drop inline testcontainers.GenericContainer / natsmod.Run starts and use the testutil helpers (Elasticsearch, NATS, Valkey) instead. Sibling-test isolation: Valkey gets FLUSHDB on cleanup via the new testutil.FlushValkey; ES gets per-test unique indices where the suite needs them; NATS gets unique stream / queue-group names where it already did. - pkg/roomkeysender, pkg/roomcrypto: keep their per-test containers (NATS-with-WebSocket and Node have special configs); just add TestMain so they import testutil. - broadcast-worker, message-worker, notification-worker, history-service/internal/{mongorepo,cassrepo,service}, pkg/userstore, pkg/mongoutil, pkg/minioutil: add main_test.go with TestMain calling testutil.TerminateAll. These only consume Mongo/Cassandra/MinIO helpers from testutil — no inline container starts to remove. search-service TestMain now fails fast on pre-warm errors (CodeRabbit). TerminateMongo / TerminateCassandra nil their refs after cleanup so they are idempotent (CodeRabbit). Ryuk works on our CI runner after all, so init() and the CI env var are removed — Ryuk stays enabled as the safety net for SIGKILL / Ctrl+C; TerminateAll runs first on clean exits and the explicit cleanup is what the tests actually rely on.

Merge from main pulled in tools/loadgen/members_integration_test.go which referenced the local setupNATS helper that this PR deleted. Swap it to testutil.NATS(t) like the sibling integration test.

CI reliably needs TESTCONTAINERS_RYUK_DISABLED=true after all. Set it via pkg/testutil's init() so every test process inherits it without per-job env config. Developers can override locally with TESTCONTAINERS_RYUK_DISABLED=false to debug container leaks. Cleanup audit — every test container in this PR has: - A stored container reference (not just URL/addr). - Best-effort Terminate on init-failure paths inside ensureXxx helpers. - Final teardown driven by TestMain → testutil.TerminateAll OR by t.Cleanup for per-test local containers. This holds regardless of test pass/fail: t.Cleanup runs on failure + panic, and TestMain's `code := m.Run(); TerminateAll(); os.Exit(code)` wrap ensures cleanup runs before process exit.

…ants, naming - Add testutil.RunTests(m) one-liner helper. Collapses 17 identical per-service TestMains into `func TestMain(m *testing.M) { testutil.RunTests(m) }`. - Drop search-service's local flushValkey + freshValkeyClient — the former duplicates testutil.FlushValkey exactly; the latter is renamed to valkeyClient (the underlying container is shared, "fresh" was misleading). - Promote search-service NATS queue-group literals to testQueueGroup / testQueueGroupSubs / testQueueGroupV2 constants. - Add testutil.EnsureMongo to search-service's prewarm slice — apps tests used to pay Mongo startup serially against the first call. - Trim the over-long docstrings on init.go and terminate.go. - Replace the cross-file "// Flush — see setupAppsFixture for the rationale" comment with the inline one-liner everywhere; no more grep-to-find chain. - Fix stale "Prewarm" reference in elasticsearch.go EnsureXxx comment.

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (6)

search-service/setup_shared_test.go (1)

52-52: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle response-body read failures in seedDoc.

Line 52 drops the io.ReadAll error, which can hide useful diagnostics when indexing fails.
Suggested patch
-	body, _ := io.ReadAll(resp.Body)
+	body, err := io.ReadAll(resp.Body)
+	require.NoError(t, err)
As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/setup_shared_test.go` at line 52, In the seedDoc helper, don't
ignore the io.ReadAll error; change the read to capture the error (e.g., body,
err := io.ReadAll(resp.Body)) and handle it by failing the test or returning the
error from seedDoc (use t.Fatalf or return fmt.Errorf) with a message that
includes the error and useful context (response status and partial body) so
failures are diagnosable; update any callers of seedDoc if needed to propagate
the error.

search-service/integration_rooms_test.go (1)

47-47: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make index fixture setup/teardown fail loudly on errors.

Line 47, Line 72, Line 101, and Line 108 currently suppress errors, which can hide index setup failures and cleanup leaks.

Suggested patch

-	t.Cleanup(func() { _ = serverNC.Drain() })
+	t.Cleanup(func() { assert.NoError(t, serverNC.Drain()) })
...
-	t.Cleanup(func() { _ = router.Shutdown(context.Background()) })
+	t.Cleanup(func() { assert.NoError(t, router.Shutdown(context.Background())) })
...
-	data, _ := json.Marshal(body)
+	data, err := json.Marshal(body)
+	require.NoError(t, err)
...
-	b, _ := io.ReadAll(resp.Body)
+	b, err := io.ReadAll(resp.Body)
+	require.NoError(t, err)

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 72-72, 101-101, 108-108

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/integration_rooms_test.go` at line 47, The cleanup and
teardown currently discard errors (e.g., in the t.Cleanup closures calling
serverNC.Drain() and other index fixture setup/teardown calls), which can hide
failures; change each discarded-error site to check the returned error and fail
the test loudly (use t.Fatalf or testing/require's NoError/NoErrorf) so that any
Drain()/Close()/teardown error surfaces immediately (refer to the exact calls to
serverNC.Drain(), the index fixture setup/teardown closures, and any other
places where errors are assigned to `_` in the test harness).

search-service/integration_ccs_test.go (1)

74-74: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Surface CCS fixture cleanup/read failures explicitly.

Several paths still discard errors (network/container/router cleanup and HTTP body reads), which can mask CCS test instability and resource leaks.

Suggested patch

-	t.Cleanup(func() { _ = nw.Remove(ctx) })
+	t.Cleanup(func() { assert.NoError(t, nw.Remove(ctx)) })
...
-	t.Cleanup(func() { _ = serverNC.Drain() })
+	t.Cleanup(func() { assert.NoError(t, serverNC.Drain()) })
...
-	t.Cleanup(func() { _ = router.Shutdown(context.Background()) })
+	t.Cleanup(func() { assert.NoError(t, router.Shutdown(context.Background())) })
...
-	t.Cleanup(func() { _ = container.Terminate(ctx) })
+	t.Cleanup(func() { assert.NoError(t, container.Terminate(ctx)) })
...
-	respBody, _ := io.ReadAll(resp.Body)
+	respBody, err := io.ReadAll(resp.Body)
+	require.NoError(t, err)
...
-			body, _ := io.ReadAll(resp.Body)
+			body, readErr := io.ReadAll(resp.Body)
+			require.NoError(t, readErr)

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 114-114, 138-138, 190-190, 275-275, 292-293

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/integration_ccs_test.go` at line 74, The test currently
discards cleanup and read errors (e.g., the t.Cleanup call that calls
nw.Remove(ctx) and other network/container/router cleanup and HTTP body reads),
which hides failures and leaks; update each silent discard to capture the error
and report it via the test (for cleanup use t.Cleanup(func(){ if err :=
nw.Remove(ctx); err != nil { t.Errorf("nw.Remove failed: %v", err) } }) or
t.Logf/t.Fatalf as appropriate) and for HTTP responses ensure resp.Body is
closed and check errors from io.ReadAll and resp.Body.Close (e.g., if b, err :=
io.ReadAll(resp.Body); err != nil { t.Fatalf("read body: %v", err) }). Locate
and fix the occurrences around the symbols nw.Remove, container/router cleanup
calls, and HTTP response body reads to explicitly handle and report errors
rather than ignoring them.

search-service/integration_messages_test.go (1)

38-38: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently drop fixture and teardown errors.

The handler and cleanup paths currently ignore multiple errors (Line 38, Line 45, Line 58, Line 83), which can hide integration flakiness.

Suggested patch

-		_, _ = io.Copy(io.Discard, r.Body)
+		if _, err := io.Copy(io.Discard, r.Body); err != nil {
+			t.Errorf("drain ES stub request body: %v", err)
+			return
+		}
...
-		_, _ = w.Write([]byte(`{"hits":{"total":{"value":1},"hits":[{"_source":{` +
+		if _, err := w.Write([]byte(`{"hits":{"total":{"value":1},"hits":[{"_source":{` +
 			`"messageId":"m1","roomId":"r1","siteId":"site-a","userId":"u1",` +
-			`"userAccount":"alice","content":"hello","createdAt":"2026-04-01T12:00:00Z"}}]}}`))
+			`"userAccount":"alice","content":"hello","createdAt":"2026-04-01T12:00:00Z"}}]}}`)); err != nil {
+			t.Errorf("write ES stub response: %v", err)
+		}
...
-	t.Cleanup(func() { _ = serverNATS.Drain() })
+	t.Cleanup(func() { assert.NoError(t, serverNATS.Drain()) })
...
-	t.Cleanup(func() { _ = router.Shutdown(context.Background()) })
+	t.Cleanup(func() { assert.NoError(t, router.Shutdown(context.Background())) })

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 45-47, 58-58, 83-83

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/integration_messages_test.go` at line 38, Fix the silent error
drops in the test by checking and handling the return values instead of
discarding them: replace calls like io.Copy(io.Discard, r.Body) and other
ignored returns (e.g., r.Body.Close(), resp.Body.Close(), os.RemoveAll(...))
with error checks (e.g., if err := io.Copy(io.Discard, r.Body); err != nil {
t.Fatalf("copying body: %v", err) } or use require.NoError(t, err) as
appropriate), or if you truly intend to ignore an error, add an explicit comment
explaining why; update all occurrences referenced (io.Copy with r.Body, any
r.Body.Close()/resp.Body.Close(), and os.RemoveAll) so no error is silently
dropped.

search-service/integration_users_test.go (1)

41-41: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t suppress fixture cleanup and stub write errors.

Line 41, Line 61, and Line 72 ignore returned errors; this can hide teardown failures or incomplete stub responses.

Suggested patch

-	t.Cleanup(func() { _ = serverNC.Drain() })
+	t.Cleanup(func() { assert.NoError(t, serverNC.Drain()) })
...
-	t.Cleanup(func() { _ = router.Shutdown(context.Background()) })
+	t.Cleanup(func() { assert.NoError(t, router.Shutdown(context.Background())) })
...
-		_, _ = w.Write([]byte(stubResp))
+		_, err := w.Write([]byte(stubResp))
+		assert.NoError(t, err)

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 61-61, 72-72

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/integration_users_test.go` at line 41, Several cleanup and
stub calls (e.g., the t.Cleanup wrapper that currently does "_ =
serverNC.Drain()" and the two other places that discard errors) silently ignore
returned errors; update each site to check the error and fail or log it instead.
Replace discarded-error patterns like "_ = serverNC.Drain()" (and other
discarded calls such as stub.Write/Close/etc.) with an explicit check: if err :=
serverNC.Drain(); err != nil { t.Fatalf("serverNC.Drain failed: %v", err) } (or
use require.NoError(t, err) if testify is used) so teardown and stub-write
failures are surfaced rather than suppressed.

search-service/integration_apps_test.go (1)

40-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Surface fixture teardown failures instead of discarding them.

Line 40 and Line 64 ignore cleanup errors; if teardown fails, this test can pass while leaving leaked subscriptions/connections.

Suggested patch

-	t.Cleanup(func() { _ = serverNATS.Drain() })
+	t.Cleanup(func() { assert.NoError(t, serverNATS.Drain()) })
...
-	t.Cleanup(func() {
-		_ = router.Shutdown(context.Background())
-	})
+	t.Cleanup(func() {
+		assert.NoError(t, router.Shutdown(context.Background()))
+	})

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 64-65

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search-service/integration_apps_test.go` at line 40, The test currently
discards errors returned by serverNATS.Drain in the t.Cleanup closures; update
those cleanup functions (the t.Cleanup closures that call serverNATS.Drain on
lines around the two occurrences) to capture the returned error and fail the
test if teardown fails (e.g., if err := serverNATS.Drain(); err != nil {
t.Fatalf("serverNATS.Drain failed: %v", err) }) so teardown failures surface
instead of being ignored.

🧹 Nitpick comments (7)

pkg/testutil/elasticsearch.go (1)

54-60: ⚡ Quick win

Use structured slog logging for termination paths and avoid silent drops.

Line 54/60 currently discard terminate failures, and Line 93 uses fmt.Fprintf. Please log these errors with structured slog fields.

Proposed patch

 import (
 	"context"
 	"fmt"
+	"log/slog"
 	"os"
 	"sync"
 	"testing"
 	"time"
@@
 var (
 	esOnce      sync.Once
 	esContainer testcontainers.Container
 	esURL       string
 	esInitErr   error
+	esLogger    = slog.New(slog.NewJSONHandler(os.Stderr, nil))
 )
@@
 		host, err := container.Host(ctx)
 		if err != nil {
-			_ = container.Terminate(ctx)
+			if termErr := container.Terminate(ctx); termErr != nil {
+				esLogger.Warn("cleanup failed after elasticsearch host lookup error", "error", termErr)
+			}
 			esInitErr = fmt.Errorf("get es host: %w", err)
 			return
 		}
 		port, err := container.MappedPort(ctx, "9200")
 		if err != nil {
-			_ = container.Terminate(ctx)
+			if termErr := container.Terminate(ctx); termErr != nil {
+				esLogger.Warn("cleanup failed after elasticsearch port lookup error", "error", termErr)
+			}
 			esInitErr = fmt.Errorf("get es port: %w", err)
 			return
 		}
@@
 	if err := esContainer.Terminate(ctx); err != nil {
-		fmt.Fprintf(os.Stderr, "terminate shared elasticsearch: %v\n", err)
+		esLogger.Error("terminate shared elasticsearch", "error", err)
 	}

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers." and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 92-94

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/elasticsearch.go` around lines 54 - 60, The termination error
from container.Terminate(ctx) is being ignored and fmt.Fprintf is used for error
output; update the termination paths in the function that sets esInitErr to log
termination failures with structured log/slog (e.g., call slog.Error with fields
like "err" and contextual keys) instead of silently discarding the error, and
replace the fmt.Fprintf usage with slog.Error/slog.Log JSON-formatted logs;
ensure esInitErr still wraps the original error (esInitErr = fmt.Errorf("get es
host: %w", err)) but also log the terminate error (container.Terminate(ctx))
with slog including intent if discarding is intentional.

pkg/testutil/mongo.go (1)

61-63: ⚡ Quick win

Switch Mongo termination logs to slog JSON.

Line 62 and Line 68 should use structured slog fields instead of plain stderr formatting.

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers."

Also applies to: 67-69

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/mongo.go` around lines 61 - 63, Replace the plain stderr prints
around mongoClient.Disconnect and the other termination messages with structured
slog calls: instead of fmt.Fprintf(os.Stderr, "...%v...", err) use slog.Error
(or the package logger) with a clear message like "disconnect shared mongo
client" and include the error as a field (err) and any contextual fields (e.g.,
"phase":"shutdown" or "client":"sharedMongo"). Update the other nearby
fmt.Fprintf/Println termination lines the same way so all shutdown logs use
log/slog JSON structured fields rather than text formatting, targeting the
mongoClient.Disconnect call and the adjacent shutdown messages in the same
function.

pkg/testutil/nats.go (1)

44-45: ⚡ Quick win

Align NATS termination logging/error handling with repo logging rules.

Line 44 silently drops cleanup failure, and Line 77 uses plain stderr formatting. Switch to JSON slog logging with key-value fields for both paths.

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers." and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 76-78
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/nats.go` around lines 44 - 45, Replace the silent discard of
c.Terminate(ctx) and the plain-text stderr output with structured JSON slog
logging: capture the error returned by c.Terminate(ctx) instead of assigning it
to underscore and log it via slog with key-value fields (e.g.,
"action":"nats.terminate", "error":err) near where natsInitErr is set, and
replace the fmt.Fprintf/os.Stderr usage around lines 76-78 with slog JSON
logging (e.g., "action":"nats.init", "msg":..., "error":err) so both cleanup and
init failures use log/slog structured JSON and no errors are ignored.

pkg/testutil/valkey.go (1)

45-52: ⚡ Quick win

Use slog and stop silently swallowing cleanup errors in Valkey lifecycle.

Please replace the silent _ = container.Terminate(ctx) calls and stderr formatting with structured slog logging.

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers." and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 102-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/valkey.go` around lines 45 - 52, Replace the silent error
discards around container.Terminate(ctx) with structured slog logging and avoid
swallowing the terminate error: when container.Terminate(ctx) returns an error,
call slog.Error with a descriptive message and include the error and context
(e.g., slog.String("step","terminate"), slog.Error(err),
slog.String("component","valkey"), possibly the host/port variables) instead of
assigning `_ = container.Terminate(ctx)`; keep the existing fmt.Errorf wrapping
for valkeyInitErr (e.g., valkeyInitErr = fmt.Errorf("get valkey host: %w", err))
but log the cleanup failure immediately after the terminate call. Apply the same
replacement for the other occurrences mentioned (the container.Terminate(ctx) at
lines 102-104) so no Terminate error is silently ignored and all logs use
log/slog structured fields.

pkg/testutil/minio.go (1)

72-74: ⚡ Quick win

Use structured slog logging for MinIO termination errors.

Line 73 should use JSON slog key-value logging instead of fmt.Fprintf.

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/minio.go` around lines 72 - 74, Replace the plain-text error
print when terminating MinIO with structured slog JSON logging: locate the call
to minioContainer.Terminate(ctx) and the subsequent fmt.Fprintf(os.Stderr,
"terminate shared minio: %v\n", err) and change it to use the package-level slog
logger (e.g., slog.Error or slog.With(...).Error) to emit a key/value log
containing a clear message ("terminate shared minio failed") and the error under
a key like "err" (and include any context keys such as "operation":"terminate"
if desired); remove the fmt and os.Stderr usage and ensure the import for
log/slog is present.

pkg/testutil/cassandra.go (1)

48-55: ⚡ Quick win

Use structured logging and avoid silent cleanup-error drops in Cassandra helper.

Please replace silent terminate-error discards and stderr formatting with JSON slog key-value logs.

As per coding guidelines, "Use log/slog with JSON format for all logging — never use fmt.Println, log.Println, or text-format loggers." and "Never ignore errors silently — comment if intentionally discarded."

Also applies to: 69-70, 90-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/testutil/cassandra.go` around lines 48 - 55, Replace the silent discard
of container.Terminate(ctx) and plain-text error handling with structured slog
JSON logging: after any call to container.Terminate(ctx) capture its returned
error (e.g., termErr := container.Terminate(ctx)); if termErr != nil call
slog.Error or slog.Log with key/value pairs (e.g., slog.Error("terminate
container failed", "terminate_error", termErr, "context_err", err, "step", "get
cassandra host/port")) instead of discarding; keep setting cassInitErr (e.g.,
cassInitErr = fmt.Errorf("get cassandra host: %w", err)) but log both the
original err and any terminate error via slog; apply the same pattern for the
other occurrences around the cassInitErr paths and any places that used
fmt.Println/ignored Terminate errors (references: cassInitErr variable and
container.Terminate(ctx) and the container.MappedPort call).

tools/loadgen/members_integration_test.go (1)

120-125: ⚡ Quick win

Replace fixed post-run sleep with condition-based waiting.

time.Sleep(500ms) here is timing-sensitive; under load it can flake. Wait on the actual counters instead.

Suggested fix

-	// Allow trailing events to settle before checking counters.
-	time.Sleep(500 * time.Millisecond)
-
-	// Both simulated stages must have seen traffic.
-	assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")
-	assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")
+	require.Eventually(t, func() bool {
+		return rsCalls.Load() > 0 && rwCalls.Load() > 0
+	}, 5*time.Second, 50*time.Millisecond, "expected both simulated stages to see traffic")

As per coding guidelines: "Never use time.Sleep() for goroutine synchronization — use proper sync primitives."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/members_integration_test.go` around lines 120 - 125, Replace
the fixed time.Sleep(500 * time.Millisecond) in members_integration_test.go with
a condition-based wait that polls the actual counters rsCalls.Load() and
rwCalls.Load() until both are > 0 or a reasonable timeout elapses; implement a
small loop using time.NewTicker and time.After (or testing's deadline) that
checks rsCalls.Load() and rwCalls.Load() each tick and returns when both are
positive, and call t.Fatalf or t.Fatal if the timeout is reached to fail the
test, ensuring no goroutine synchronization relies on a fixed sleep.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@history-service/internal/service/main_test.go`:
- Line 3: All tests in this directory must share the same package so TestMain in
main_test.go controls the full test lifecycle; change the package declaration in
integration_test.go, main_test.go, messages_test.go, threads_test.go,
room_times_test.go, and utils_test.go from "package service_test" or "package
service" as needed so they all read "package service". Ensure TestMain (in
main_test.go) remains in the same package and that any references to unexported
types still compile after switching the four files that were "service_test" to
"service".

In `@pkg/natsrouter/main_test.go`:
- Line 3: The test file declares package natsrouter_test but must use the same
package as the code under test; change the package declaration from "package
natsrouter_test" to "package natsrouter" so the tests run in-package and can
access unexported types and methods used by this test file (update the
top-of-file package line accordingly).

In `@pkg/roomkeysender/main_test.go`:
- Line 3: The test file currently declares package "roomkeysender_test" which
violates the repo rule requiring test files to use the same package as the code;
change the package declaration in main_test.go from "package roomkeysender_test"
to "package roomkeysender" so the test runs in-package and can access unexported
types (no other changes required).

In `@search-sync-worker/integration_test.go`:
- Around line 44-50: The prewarm goroutines in TestMain are silently discarding
errors from testutil.EnsureElasticsearch and testutil.EnsureNATS; modify
TestMain so it collects and checks errors before calling m.Run (e.g., replace
the current fire-and-forget goroutine pattern using wg with an error channel or
an errgroup) and fail fast if any call to EnsureElasticsearch or EnsureNATS
returns a non-nil error; ensure the fix references the existing wg, the list of
functions (testutil.EnsureElasticsearch, testutil.EnsureNATS) and that TestMain
exits with a non-zero status (or calls t.Fatalf equivalent) when any prewarm
error is observed.

---

Duplicate comments:
In `@search-service/integration_apps_test.go`:
- Line 40: The test currently discards errors returned by serverNATS.Drain in
the t.Cleanup closures; update those cleanup functions (the t.Cleanup closures
that call serverNATS.Drain on lines around the two occurrences) to capture the
returned error and fail the test if teardown fails (e.g., if err :=
serverNATS.Drain(); err != nil { t.Fatalf("serverNATS.Drain failed: %v", err) })
so teardown failures surface instead of being ignored.

In `@search-service/integration_ccs_test.go`:
- Line 74: The test currently discards cleanup and read errors (e.g., the
t.Cleanup call that calls nw.Remove(ctx) and other network/container/router
cleanup and HTTP body reads), which hides failures and leaks; update each silent
discard to capture the error and report it via the test (for cleanup use
t.Cleanup(func(){ if err := nw.Remove(ctx); err != nil { t.Errorf("nw.Remove
failed: %v", err) } }) or t.Logf/t.Fatalf as appropriate) and for HTTP responses
ensure resp.Body is closed and check errors from io.ReadAll and resp.Body.Close
(e.g., if b, err := io.ReadAll(resp.Body); err != nil { t.Fatalf("read body:
%v", err) }). Locate and fix the occurrences around the symbols nw.Remove,
container/router cleanup calls, and HTTP response body reads to explicitly
handle and report errors rather than ignoring them.

In `@search-service/integration_messages_test.go`:
- Line 38: Fix the silent error drops in the test by checking and handling the
return values instead of discarding them: replace calls like io.Copy(io.Discard,
r.Body) and other ignored returns (e.g., r.Body.Close(), resp.Body.Close(),
os.RemoveAll(...)) with error checks (e.g., if err := io.Copy(io.Discard,
r.Body); err != nil { t.Fatalf("copying body: %v", err) } or use
require.NoError(t, err) as appropriate), or if you truly intend to ignore an
error, add an explicit comment explaining why; update all occurrences referenced
(io.Copy with r.Body, any r.Body.Close()/resp.Body.Close(), and os.RemoveAll) so
no error is silently dropped.

In `@search-service/integration_rooms_test.go`:
- Line 47: The cleanup and teardown currently discard errors (e.g., in the
t.Cleanup closures calling serverNC.Drain() and other index fixture
setup/teardown calls), which can hide failures; change each discarded-error site
to check the returned error and fail the test loudly (use t.Fatalf or
testing/require's NoError/NoErrorf) so that any Drain()/Close()/teardown error
surfaces immediately (refer to the exact calls to serverNC.Drain(), the index
fixture setup/teardown closures, and any other places where errors are assigned
to `_` in the test harness).

In `@search-service/integration_users_test.go`:
- Line 41: Several cleanup and stub calls (e.g., the t.Cleanup wrapper that
currently does "_ = serverNC.Drain()" and the two other places that discard
errors) silently ignore returned errors; update each site to check the error and
fail or log it instead. Replace discarded-error patterns like "_ =
serverNC.Drain()" (and other discarded calls such as stub.Write/Close/etc.) with
an explicit check: if err := serverNC.Drain(); err != nil {
t.Fatalf("serverNC.Drain failed: %v", err) } (or use require.NoError(t, err) if
testify is used) so teardown and stub-write failures are surfaced rather than
suppressed.

In `@search-service/setup_shared_test.go`:
- Line 52: In the seedDoc helper, don't ignore the io.ReadAll error; change the
read to capture the error (e.g., body, err := io.ReadAll(resp.Body)) and handle
it by failing the test or returning the error from seedDoc (use t.Fatalf or
return fmt.Errorf) with a message that includes the error and useful context
(response status and partial body) so failures are diagnosable; update any
callers of seedDoc if needed to propagate the error.

---

Nitpick comments:
In `@pkg/testutil/cassandra.go`:
- Around line 48-55: Replace the silent discard of container.Terminate(ctx) and
plain-text error handling with structured slog JSON logging: after any call to
container.Terminate(ctx) capture its returned error (e.g., termErr :=
container.Terminate(ctx)); if termErr != nil call slog.Error or slog.Log with
key/value pairs (e.g., slog.Error("terminate container failed",
"terminate_error", termErr, "context_err", err, "step", "get cassandra
host/port")) instead of discarding; keep setting cassInitErr (e.g., cassInitErr
= fmt.Errorf("get cassandra host: %w", err)) but log both the original err and
any terminate error via slog; apply the same pattern for the other occurrences
around the cassInitErr paths and any places that used fmt.Println/ignored
Terminate errors (references: cassInitErr variable and container.Terminate(ctx)
and the container.MappedPort call).

In `@pkg/testutil/elasticsearch.go`:
- Around line 54-60: The termination error from container.Terminate(ctx) is
being ignored and fmt.Fprintf is used for error output; update the termination
paths in the function that sets esInitErr to log termination failures with
structured log/slog (e.g., call slog.Error with fields like "err" and contextual
keys) instead of silently discarding the error, and replace the fmt.Fprintf
usage with slog.Error/slog.Log JSON-formatted logs; ensure esInitErr still wraps
the original error (esInitErr = fmt.Errorf("get es host: %w", err)) but also log
the terminate error (container.Terminate(ctx)) with slog including intent if
discarding is intentional.

In `@pkg/testutil/minio.go`:
- Around line 72-74: Replace the plain-text error print when terminating MinIO
with structured slog JSON logging: locate the call to
minioContainer.Terminate(ctx) and the subsequent fmt.Fprintf(os.Stderr,
"terminate shared minio: %v\n", err) and change it to use the package-level slog
logger (e.g., slog.Error or slog.With(...).Error) to emit a key/value log
containing a clear message ("terminate shared minio failed") and the error under
a key like "err" (and include any context keys such as "operation":"terminate"
if desired); remove the fmt and os.Stderr usage and ensure the import for
log/slog is present.

In `@pkg/testutil/mongo.go`:
- Around line 61-63: Replace the plain stderr prints around
mongoClient.Disconnect and the other termination messages with structured slog
calls: instead of fmt.Fprintf(os.Stderr, "...%v...", err) use slog.Error (or the
package logger) with a clear message like "disconnect shared mongo client" and
include the error as a field (err) and any contextual fields (e.g.,
"phase":"shutdown" or "client":"sharedMongo"). Update the other nearby
fmt.Fprintf/Println termination lines the same way so all shutdown logs use
log/slog JSON structured fields rather than text formatting, targeting the
mongoClient.Disconnect call and the adjacent shutdown messages in the same
function.

In `@pkg/testutil/nats.go`:
- Around line 44-45: Replace the silent discard of c.Terminate(ctx) and the
plain-text stderr output with structured JSON slog logging: capture the error
returned by c.Terminate(ctx) instead of assigning it to underscore and log it
via slog with key-value fields (e.g., "action":"nats.terminate", "error":err)
near where natsInitErr is set, and replace the fmt.Fprintf/os.Stderr usage
around lines 76-78 with slog JSON logging (e.g., "action":"nats.init",
"msg":..., "error":err) so both cleanup and init failures use log/slog
structured JSON and no errors are ignored.

In `@pkg/testutil/valkey.go`:
- Around line 45-52: Replace the silent error discards around
container.Terminate(ctx) with structured slog logging and avoid swallowing the
terminate error: when container.Terminate(ctx) returns an error, call slog.Error
with a descriptive message and include the error and context (e.g.,
slog.String("step","terminate"), slog.Error(err),
slog.String("component","valkey"), possibly the host/port variables) instead of
assigning `_ = container.Terminate(ctx)`; keep the existing fmt.Errorf wrapping
for valkeyInitErr (e.g., valkeyInitErr = fmt.Errorf("get valkey host: %w", err))
but log the cleanup failure immediately after the terminate call. Apply the same
replacement for the other occurrences mentioned (the container.Terminate(ctx) at
lines 102-104) so no Terminate error is silently ignored and all logs use
log/slog structured fields.

In `@tools/loadgen/members_integration_test.go`:
- Around line 120-125: Replace the fixed time.Sleep(500 * time.Millisecond) in
members_integration_test.go with a condition-based wait that polls the actual
counters rsCalls.Load() and rwCalls.Load() until both are > 0 or a reasonable
timeout elapses; implement a small loop using time.NewTicker and time.After (or
testing's deadline) that checks rsCalls.Load() and rwCalls.Load() each tick and
returns when both are positive, and call t.Fatalf or t.Fatal if the timeout is
reached to fail the test, ensuring no goroutine synchronization relies on a
fixed sleep.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c86391a-3e00-4016-bb2f-28e445f06c8b

📥 Commits

Reviewing files that changed from the base of the PR and between 7b695cd and cda13a2.

📒 Files selected for processing (40)

broadcast-worker/main_test.go
history-service/internal/cassrepo/main_test.go
history-service/internal/mongorepo/main_test.go
history-service/internal/service/main_test.go
inbox-worker/integration_test.go
inbox-worker/main_test.go
message-worker/main_test.go
notification-worker/main_test.go
pkg/minioutil/main_test.go
pkg/mongoutil/main_test.go
pkg/natsrouter/integration_test.go
pkg/natsrouter/main_test.go
pkg/roomcrypto/main_test.go
pkg/roomkeysender/main_test.go
pkg/roomkeystore/integration_test.go
pkg/roomsubcache/integration_test.go
pkg/testutil/cassandra.go
pkg/testutil/elasticsearch.go
pkg/testutil/init.go
pkg/testutil/minio.go
pkg/testutil/mongo.go
pkg/testutil/nats.go
pkg/testutil/terminate.go
pkg/testutil/testmain.go
pkg/testutil/valkey.go
pkg/userstore/main_test.go
room-service/integration_test.go
room-service/main_test.go
room-worker/integration_test.go
room-worker/main_test.go
search-service/integration_apps_test.go
search-service/integration_ccs_test.go
search-service/integration_messages_test.go
search-service/integration_rooms_test.go
search-service/integration_test.go
search-service/integration_users_test.go
search-service/setup_shared_test.go
search-sync-worker/integration_test.go
tools/loadgen/integration_test.go
tools/loadgen/members_integration_test.go

💤 Files with no reviewable changes (1)

search-service/integration_test.go

✅ Files skipped from review due to trivial changes (1)

notification-worker/main_test.go

Match the pattern already used by search-service's TestMain — collect prewarm errors in a channel and surface them before m.Run rather than letting individual tests fail with confusing 'couldn't connect' errors. Per CodeRabbit review on PR #208.

Repo convention (CLAUDE.md): test files live in the same package as the code under test so they can reach unexported identifiers. Seven files were using external _test packages — switch them all to internal: pkg/natsrouter/{integration_test,main_test}.go pkg/roomsubcache/integration_test.go (+ new main_test.go) pkg/roomkeysender/{integration_test,main_test}.go history-service/internal/service/{integration_test,main_test}.go Mechanical changes: strip the self-import, de-prefix package-qualified identifiers (natsrouter.New → New, service.New → New, etc.), change the package declaration. No behaviour change. Added pkg/roomsubcache/main_test.go that was missing — it uses testutil.Valkey but had no TestMain to call testutil.TerminateAll. Resolves CodeRabbit "package _test" review threads.

pkg/roomkeystore was using testutil.Valkey + testutil.FlushValkey but had no TestMain to call testutil.TerminateAll. Same pattern as pkg/roomsubcache (also caught and fixed in earlier commit). Consistency audit verified — every integration test package now has a TestMain that calls testutil.TerminateAll (via testutil.RunTests for the common case, or a custom prewarm wrapper for search-service and search-sync-worker). The only exception is auth-service, which uses no testcontainers (pure httptest).

…ablished The old Integration Tests section instructed contributors to "write setup<Dep>(t) helpers that start a container" — exactly the pattern this PR consolidated away. Rewrite the section to capture what we actually expect now: - Internal package, never external *_test - Containers come from pkg/testutil helpers, not inline starts - Every package has a TestMain calling testutil.TerminateAll - Ryuk is disabled repo-wide; explicit cleanup is the contract - Per-test isolation responsibility per container type - When inline GenericContainer is acceptable (CCS, roomcrypto, roomkeysender) - Where new shared container helpers belong (pkg/testutil with the Xxx + EnsureXxx + TerminateXxx triple)

Main introduced Valkey cluster-mode (PR #199 - 'feat: Valkey cluster support with room key ensure RPC'). Reconciliation: - Adopt main's per-test testutil.StartValkeyCluster instead of my shared Valkey helpers. valkey is now per-test (each test gets its own cluster-mode container), not process-shared. - Drop my testutil.Valkey / EnsureValkey / TerminateValkey / FlushValkey — replaced by StartValkeyCluster. - Remove TerminateValkey from TerminateAll (nothing process-shared to clean up). - Update search-service per-endpoint files (integration_ccs_test.go, integration_rooms_test.go) to use valkeyutil.WrapClusterClient(testutil.StartValkeyCluster(t)). - Drop search-service's local valkeyClient/flushValkey helpers. - Keep main's roomkeystore/roomsubcache integration tests but switch roomsubcache back to internal package (consistency). - Add main_test.go for pkg/valkeyutil (newly has integration tests on main). - Re-migrate room-service setupNATS to testutil.NATS (main reverted this when it touched the file). - Update tools/loadgen unit tests for the new ValkeyAddrs []string config field. - Update CLAUDE.md: Valkey is now per-test, not shared.

- pkg/testutil/terminate.go: drop TerminateValkey from TerminateAll (no process-shared Valkey to clean up; StartValkeyCluster is per-test). - search-service/setup_shared_test.go: drop local valkeyClient helper and EnsureValkey prewarm. - search-service/integration_ccs_test.go, search-service/integration_rooms_test.go: use valkeyutil.WrapClusterClient(testutil.StartValkeyCluster(t)). - room-service/integration_test.go: re-apply NATS migration to testutil.NATS (was reverted when merge took main's version). - pkg/roomsubcache/integration_test.go: switch back to internal package (roomsubcache, not roomsubcache_test) for consistency. - pkg/valkeyutil/main_test.go: add missing TestMain. - tools/loadgen/main_test.go: update unit tests for ValkeyAddrs []string config field rename. - CLAUDE.md: document Valkey as per-test, not shared.

testutil/valkey.go now exposes two variants: - SharedValkeyCluster(t) — process-shared cluster via sync.Once. Fast (cluster boots once per go test invocation). Callers register t.Cleanup(func() { testutil.FlushValkey(t) }) for keyspace isolation. - StartValkeyCluster(t) — per-test cluster (unchanged). Use only for tests asserting on cluster-routing state (pkg/roomkeystore) or that own a store wrapper that closes the underlying client. Wired TerminateValkey back into TerminateAll so the shared cluster is reaped at process exit. Migrated safe callers to the shared variant: - search-service/integration_ccs_test.go, integration_rooms_test.go - pkg/roomsubcache/integration_test.go - pkg/valkeyutil/integration_test.go - search-service TestMain prewarm now includes EnsureValkey Kept on per-test (deliberate): - pkg/roomkeystore (CLUSTER KEYSLOT routing assertions) - room-service, room-worker (use NewValkeyClusterStoreFromClient whose store.Close() releases the underlying client — would kill the shared cluster mid-run if a handler ever called Close) Expected wins: ~15-25s per go test ./search-service/... and similar per-package savings elsewhere on packages that hit Valkey heavily. CLAUDE.md updated to document both variants + when to use each.

history-service/docker-local was a per-service dev compose stack (NATS + Mongo + Cassandra + the service itself). The canonical local dev stack lives in the repo-root docker-local/ — no need for the per-service duplicate.

ngangwar962

LGTM

Net -120 lines of comments / narration across the PR. Targeted only WHAT-comments and progress narration; preserved load-bearing WHY: - pkg/testutil/{init,terminate,testmain,valkey}.go: docstrings cut to one or two lines each; deleted "intended for ..." sentences that restate function names. Kept the newValkeyClusterClient ClusterSlots-override explanation (load-bearing WHY). - search-service/setup_shared_test.go: removed redundant "pre-warms ..." multi-paragraph TestMain comment; same for uniqueESIndex. - search-service/integration_ccs_test.go: deleted the 14 t.Logf progress lines (failures from require.NoError already pinpoint the phase), trimmed setupCCSFixture / installTemplates / startESForCCS / buildTestTemplate docstrings, removed in-test narration like "Local message in local room" / "Round-trips through the real natsrouter ...". Kept the PROXY-mode rationale (real WHY) and the Clause-A/B1/B2 inline scenario comments (load-bearing for restricted query semantics). - search-service/integration_apps_test.go: dropped the "Prototype pipeline" follow-up note (belongs in a TODO, not a comment). - search-service/integration_rooms_test.go: trimmed roomsFixture and putTestSpotlightIndex docstrings. - search-service/integration_users_test.go: removed inline "controls the stub response" comment on a self-explanatory field. - search-sync-worker/integration_test.go: TestMain comment cut to one line, matching search-service. - tools/loadgen/integration_test.go: removed "Allow trailing events to flow" / "Assert canonical stream drained" / "Assert seed data visible" — code is self-explanatory. - pkg/roomcrypto/main_test.go, pkg/roomkeysender/main_test.go: removed 2-line "Import testutil for the Ryuk-disable init() side effect" preamble; the import alone now conveys it. CodeRabbit suggestions to extract a setupRouter helper or a PrewarmFailFast helper deferred — out of scope for a comment-trim pass.

…searchIndex helpers Three new testutil helpers consolidate patterns the PR was already duplicating across 5+ files: - testutil.PrewarmFailFast(fns...): runs EnsureXxx funcs concurrently, returns the first error. - testutil.RunTestsWithPrewarm(m, fns...): the standard TestMain wrap with pre-warm fail-fast. On prewarm failure, exits 1 after TerminateAll. On success, hands off to RunTests. - testutil.ElasticsearchIndex(t, prefix): per-test ES index name (fnv hash of t.Name()) + DELETE on cleanup. Completes the per-test isolation triad alongside MongoDB / CassandraKeyspace / MinIO. Migrations: - search-service/setup_shared_test.go: TestMain collapses from 30 lines (manual wg+errCh+goroutines) to a single RunTestsWithPrewarm call. Drops the local uniqueESIndex helper. Adds a setupRouter helper that every per-endpoint fixture uses to wire NATS server+client conns + router with the given queue group + Flush + cleanups. - search-service/integration_{apps,users,rooms,messages,ccs}_test.go: each fixture's setup function now ends with `clientNATS := setupRouter(t, queueGroup, h.Register)`. Drops ~9 lines of plumbing per fixture (45 lines saved). - search-sync-worker/integration_test.go: replaces the manual prewarm block with testutil.PrewarmFailFast. Keeps its custom m.Run wrap because it needs to close the lazy-init JetStream conn between m.Run and TerminateAll. CLAUDE.md updated to point at the new helpers.

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread search-service/setup_shared_test.go Outdated

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread search-service/integration_ccs_test.go Outdated

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

claude added 12 commits May 20, 2026 09:35

test(loadgen): migrate members_integration_test.go to testutil.NATS

66cecfe

Merge from main pulled in tools/loadgen/members_integration_test.go which referenced the local setupNATS helper that this PR deleted. Swap it to testutil.NATS(t) like the sibling integration test.

Joey0538 force-pushed the claude/check-test-containers-V1EMq branch from 293cb24 to 66cecfe Compare May 20, 2026 09:37

Joey0538 changed the title ~~test(search-service): share NATS / ES / Valkey containers across integration tests~~ test: share testcontainers per service via pkg/testutil May 20, 2026

claude added 2 commits May 20, 2026 10:31

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread history-service/internal/service/main_test.go Outdated

Comment thread pkg/natsrouter/main_test.go Outdated

Comment thread pkg/roomkeysender/main_test.go Outdated

Comment thread search-sync-worker/integration_test.go Outdated

claude added 4 commits May 20, 2026 10:46

ngangwar962 reviewed May 21, 2026

View reviewed changes

Comment thread pkg/testutil/valkey.go

claude added 2 commits May 21, 2026 02:10

claude added 2 commits May 21, 2026 02:25

chore(history-service): remove local docker-compose folder

c6424e3

history-service/docker-local was a per-service dev compose stack (NATS + Mongo + Cassandra + the service itself). The canonical local dev stack lives in the repo-root docker-local/ — no need for the per-service duplicate.

ngangwar962 self-requested a review May 21, 2026 02:36

ngangwar962 approved these changes May 21, 2026

View reviewed changes

claude added 2 commits May 21, 2026 02:48

Joey0538 merged commit 62bb85b into main May 21, 2026
24 checks passed

coderabbitai Bot mentioned this pull request May 28, 2026

feat(loadgen): max-rps SLO finder for message-send & history #240

Merged

7 tasks

Conversation

Joey0538 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

pkg/testutil

Valkey: two variants

Per-service migrations

Container reference / cleanup contract

Isolation strategy

Other changes

Performance impact (per go test invocation, in-process)

Pattern for new integration tests

Verification

Test plan

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ngangwar962 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Joey0538 commented May 20, 2026 •

edited

Loading

`pkg/testutil`

Performance impact (per `go test` invocation, in-process)

coderabbitai Bot commented May 20, 2026 •

edited

Loading