feat(loadgen): max-rps SLO finder for message-send & history by hmchangw · Pull Request #240 · hmchangw/chat

hmchangw · 2026-05-28T23:02:33Z

Summary

Adds a loadgen max-rps --workload=messages|history subcommand that ramps target RPS through an explicit --steps list, holds at each step, evaluates an SLO verdict, and reports the largest RPS that held:

ANSWER: max RPS = 2000 (workload=messages, preset=medium)
        Next limit: E2 p95=143ms > 100ms

This applies the same step-up-and-hold-under-SLO control loop as PR #234 ("daily-IM, find sustainable N"), but the ramped axis is RPS and the load is the existing open-loop generators. It reuses the message-send (run) and history (history-sustained) workloads rather than introducing new ones.

How it works

Generic engine (ramp.go, verdict.go, maxrps_report.go): parseRPSSteps (k-suffix), a step lifecycle (warmup → reset → hold → measure → cooldown), evaluateRPSStep, and a console + CSV report. All identifiers are rps-prefixed so there is no package main symbol collision with feat(loadgen): daily-IM load scenario to find sustainable N #234's daily_* code regardless of merge order, and convergence later is mechanical.
Per-workload adapters behind a small rpsWorkload interface: maxrps_messages.go (E1/E2 latency + JetStream consumer-pending deltas, via a reset-per-step Collector and Prometheus counter diffs) and maxrps_history.go (per-endpoint latency via two sequential fresh-collector runs; no consumer queue).
Verdict precedence (differs from feat(loadgen): daily-IM load scenario to find sustainable N #234, deliberately): explicit measurement-failure → INCONCLUSIVE; else TRIP if any gated latency p95/p99, error rate, or consumer-pending-growth exceeds threshold; else INCONCLUSIVE if achieved RPS fell below (1−tolerance)×target while healthy (load box was the limit, not the service); else PASS. TRIP is checked before the shortfall guard so server-induced backpressure isn't misread as a harness limit.
Defaults (all flag-tunable): p95≤100ms, p99≤250ms, error-rate≤0.1%, pending-growth≤+1000, rate-tolerance 5%, warmup 10s / hold 30s / cooldown 5s, stop-on-trip on. Default steps: messages 500,1k,2k,5k,10k, history 200,500,1k,2k,5k.

Design spec: docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md
Implementation plan: docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md

Test plan

make lint — 0 issues
make test SERVICE=tools/loadgen — green under -race (pure logic at/near 100%: verdict.go/ramp.go 100%, report ~96/86%)
make fmt — no changes
gosec — clean (note: project gosec config excludes tools/)
make test-integration SERVICE=tools/loadgen — compiles & links under the integration tag; runs in CI (Docker/testcontainers unavailable in the dev sandbox). Covers a 2-step messages ramp end-to-end.
govulncheck / semgrep — run in CI (network/pipx unavailable locally; no new dependencies were added)
Manual: make -C tools/loadgen/deploy up && make -C tools/loadgen/deploy run-max-rps WORKLOAD=messages PRESET=medium

Notes

Tooling-only change — no client-facing handler touched, so docs/client-api.md is unchanged.
Out of scope (per spec §9): binary-search refinement, auto-geometric ramp, members workload, gopsutil box-health.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

Generated by Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added max-rps load testing subcommand that automatically discovers maximum achievable RPS for supported workloads while evaluating SLO compliance at each step.
- Configurable workload targets (messages, history) with step-based RPS ramping.
- Verdict reporting (PASS/TRIP/INCONCLUSIVE) with CSV export capability.
Documentation
- Added comprehensive usage guide with examples and CLI flag reference.
- Included interpretation guidance for test results.
Tests
- Added unit and integration tests for RPS evaluation and reporting logic.

Mirrors PR #234's step-up-and-hold-under-SLO control loop, ramping target RPS instead of user count N, to auto-find the max sustainable RPS for the messages and history loadgen workloads. Standalone on main with rps-prefixed identifiers so it neither depends on nor collides with #234. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

Task-by-task TDD plan: generic ramp engine + verdict + report, reusing the existing message-send and history open-loop generators, exposed as a max-rps subcommand. rps-prefixed identifiers avoid collision with PR #234. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

… and precedence tests https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

…w guard, cleanups https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

…tor error, log gather error - Add shutdownSrv helper and call it on all three subscription-error return paths so the metrics HTTP server is not leaked when construction fails. - Log the Gather() error in snapshotCounters instead of silently discarding it. - Strengthen TestDiffCounters to assert zero-delta for marshal/gatekeeper/bad_reply. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

Correct the README "Reading the output" section (INCONCLUSIVE does not stop the ramp and is not a pass; fix the no-pass ANSWER string), validate --page-limit/--request-timeout > 0, and document the deliberate straggler-exclusion in the messages hold-window counter snapshot. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

coderabbitai · 2026-05-28T23:02:46Z

📝 Walkthrough

Walkthrough

This pull request introduces loadgen max-rps, a complete tooling feature that discovers the maximum RPS achievable for messages and history workloads under SLO constraints. The implementation includes a workload-agnostic ramp control engine, per-workload adapters, SLO-based verdict classification (PASS/TRIP/INCONCLUSIVE), and both tabular and CSV reporting.

Changes

loadgen max-rps feature

Layer / File(s)	Summary
Design & specification documentation `docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md`, `docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md`	Plans and formal specification documents defining the max-RPS feature: workload-agnostic ramp engine architecture, per-workload adapter seam, CLI surface, verdict precedence rules (latency/error/pending thresholds → TRIP, then achieved-rate shortfall → INCONCLUSIVE), per-step lifecycle with warmup/hold/cooldown, reporting outputs (console + CSV), test coverage requirements (unit/integration), and non-code deliverables (README and Makefile).
RPS verdict evaluation engine `tools/loadgen/verdict.go`, `tools/loadgen/verdict_test.go`	Core classification logic that evaluates each RPS step into PASS/TRIP/INCONCLUSIVE by checking latency percentiles (p95/p99), error rate, pending-growth, and achieved-rate-vs-target thresholds in strict precedence order; includes comprehensive table-driven tests covering boundary conditions, latency overrides, and verdict reasons.
Ramp control engine & orchestration `tools/loadgen/ramp.go`, `tools/loadgen/ramp_test.go`	Orchestrates per-step RPS ramp execution: parses step sequences with k-suffix scaling and validates ascending order, manages warmup/hold/cooldown timing, evaluates each step via verdict logic, supports stop-on-trip or run-all modes, and handles context cancellation; includes tests for parsing constraints, ramp semantics, and exit code mapping.
Report generation & output formatting `tools/loadgen/maxrps_report.go`, `tools/loadgen/maxrps_report_test.go`	Generates human-readable per-step tables (target/achieved RPS, p95/p99 per latency series, error rate, worst pending, verdict) and CSV rows; emits "ANSWER" line with the highest passing RPS, "Next limit" guidance when a trip occurs, and handles multi-series alignment with zero-filling for missing data.
Collector.Reset() for per-step measurement isolation `tools/loadgen/collector.go`, `tools/loadgen/collector_test.go`	Adds a `Reset()` method that clears correlation state and latency samples while preserving active subscriptions, enabling clean per-step hold-window measurements without subscription recreation.
Messages workload adapter `tools/loadgen/maxrps_messages.go`, `tools/loadgen/maxrps_messages_test.go`	Implements messages-specific RPS step execution: snapshots metric counters (published/error reasons), queries JetStream consumer pending counts, manages per-step generator lifecycle with collector reset at hold boundaries, and normalizes deltas/samples into `rpsStepInputs`; includes pure-logic tests for counter delta computation and input normalization.
History workload adapter `tools/loadgen/maxrps_history.go`, `tools/loadgen/maxrps_history_test.go`	Implements history-specific RPS step execution: runs sequential warmup/hold generator phases, extracts latency samples into named series (history/thread), computes attempted/failed/saturation counts, returns normalized inputs; includes test validating sample counts and structure.
CLI control flow & main dispatcher `tools/loadgen/maxrps.go`, `tools/loadgen/maxrps_test.go`, `tools/loadgen/main.go`	`runMaxRPS` command implementation: parses flags, applies workload-dependent default steps, validates workload-specific parameters, constructs and initializes the appropriate workload adapter, runs ramp orchestration, renders console report and optional CSV, returns exit code; dispatcher wires `max-rps` subcommand to CLI entry point.
Deployment wiring & documentation `tools/loadgen/README.md`, `tools/loadgen/deploy/Makefile`	Makefile adds configurable `WORKLOAD` and `STEPS` variables plus `run-max-rps` target to invoke max-rps with parameterized flags; README documents the workflow, SLO signal gating, verdict interpretation, and full CLI flag reference.
End-to-end integration test `tools/loadgen/integration_test.go`	Sets up JetStream stream with ack-only durable consumers, creates a fake gatekeeper for message injection, runs a two-step RPS ramp via messages workload, asserts non-trip verdicts and positive achieved metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hmchangw/chat#117: Extends the existing tools/loadgen/Collector by adding Collector.Reset() to support per-step ramp hold-window measurement logic.
hmchangw/chat#230: Implements the history workload adapter using the HistoryCollector/HistoryGenerator machinery introduced in that PR.
hmchangw/chat#208: The new TestMaxRPS_Messages_TwoStepRamp integration test is built on top of the loadgen integration test harness refactored in that PR.

Suggested reviewers

mliu33
ngangwar962

Poem

🐰 A ramp climbs high toward the RPS sky,
With verdicts passed and failures nigh,
Each step holds steady, SLOs guide,
Till max is found—the truth inside!
Reports emerge with answers clear,
The load has spoken, crystal-sheer. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.81% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat(loadgen): max-rps SLO finder for message-send & history' clearly and concisely summarizes the main change: adding a new max-rps subcommand that discovers maximum achievable RPS under SLO constraints for two specific workloads.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/max-rps-slo-loadgen-ajKRi

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

tools/loadgen/integration_test.go (1)

142-142: ⚡ Quick win

Use a bounded context for runRamp to avoid hanging integration runs.

Running the ramp on context.Background() can hang CI if a subscriber/workload path stalls. Wrap ramp execution in context.WithTimeout.

Proposed fix

-	ctx := context.Background()
+	ctx := context.Background()
@@
-	results := runRamp(ctx, w, &rampConfig{
+	rampCtx, cancel := context.WithTimeout(ctx, 20*time.Second)
+	defer cancel()
+	results := runRamp(rampCtx, w, &rampConfig{

Also applies to: 194-202

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/integration_test.go` at line 142, Replace uses of
context.Background() created for ramp runs with a bounded context via
context.WithTimeout to prevent hanging tests: create ctx, cancel :=
context.WithTimeout(context.Background(), 2*time.Minute) and defer cancel()
before calling runRamp (and any subscriber/workload invocations), pass that ctx
into runRamp and related functions, and repeat this change for the other
occurrences around lines where ctx is defined (also update imports to include
time if needed).

tools/loadgen/ramp.go (1)

32-64: 💤 Low value

Optional: Add explicit empty-steps check for consistency with plan.

The plan (line 597-599) includes a final check for len(out) == 0, but the implementation returns directly without it. While unreachable due to the "empty step" error at line 38-40, adding the explicit check would match the plan and serve as defensive documentation.
📋 Optional consistency fix
 		out = append(out, n)
 	}
+	if len(out) == 0 {
+		return nil, fmt.Errorf("no steps parsed from %q", s)
+	}
 	return out, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/ramp.go` around lines 32 - 64, In parseRPSSteps, after the loop
and before returning, add an explicit defensive check for len(out) == 0 and
return a clear error (e.g., fmt.Errorf("no steps provided") or similar) to match
the plan; this should reference the out slice and be placed just before the
final return out, nil in the parseRPSSteps function.

tools/loadgen/README.md (1)

244-262: 💤 Low value

Flags table omits workload-specific flags.

runMaxRPS also registers --inject (messages) and the history-only tunables (--mix, --before-mode, --scrollback-pages, --page-limit, --request-timeout). Consider documenting them so the reference is complete.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/README.md` around lines 244 - 262, Update the Flags section to
include workload-specific flags registered by runMaxRPS: add a "messages-only"
subsection row for --inject (and note default and meaning) and a "history-only"
subsection with rows for --mix, --before-mode, --scrollback-pages, --page-limit,
and --request-timeout; mark which flags apply only to history or messages,
provide their defaults and short notes (e.g., `--inject` = messages-only,
default empty; `--mix`, `--before-mode`, `--scrollback-pages`, `--page-limit`,
`--request-timeout` = history-only), and place them near the existing Flags
table so the README documents all tunables runMaxRPS registers.

tools/loadgen/maxrps_messages.go (1)

237-252: 💤 Low value

Return early on holdErr to skip the unnecessary drain/snapshot on cancellation.

When ctx is cancelled mid-hold, holdErr is non-nil but the code still snapshots end counters/pending (against the dead ctx) and incurs the fixed 2s drain before discarding the result. Checking holdErr right after the hold avoids the wasted work and shutdown delay.

♻️ Proposed reorder

 	holdErr := waitOrCancel(ctx, hold)
+	if holdErr != nil {
+		cancel()
+		wg.Wait()
+		return rpsStepInputs{}, holdErr
+	}
 
 	// Counters are snapshotted at hold-end, before the drain: gatekeeper/bad_reply
 	// errors whose reply lands during the drain are deliberately excluded (see the
 	// straggler-exclusion rationale on buildMessagesInputs). The drain only lets
 	// trailing E1/E2 latency samples settle for the percentile signals.
 	endCounts := w.snapshotCounters()
 	endPending, perr2 := w.snapshotPending(ctx)
 	cancel()
 	wg.Wait()
 	time.Sleep(2 * time.Second) // drain trailing replies/broadcasts
 	w.collector.DiscardBefore(holdStart)
-
-	if holdErr != nil {
-		return rpsStepInputs{}, holdErr
-	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_messages.go` around lines 237 - 252, After calling
waitOrCancel(ctx, hold) capture holdErr and immediately return if holdErr != nil
to avoid running w.snapshotCounters(), w.snapshotPending(ctx), cancel(),
wg.Wait(), time.Sleep(2 * time.Second) and w.collector.DiscardBefore(holdStart)
when the context was cancelled; move the holdErr check to just after holdErr :=
waitOrCancel(ctx, hold) so the expensive snapshot/drain/cleanup only runs on
successful holds.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/loadgen/integration_test.go`:
- Around line 165-166: The callback passed to cons.Consume currently discards
errors (e.g., func(msg jetstream.Msg) { _ = msg.Ack() }); change it to surface
failures instead of ignoring them: inside the callback check the returned error
from msg.Ack() (and any marshal/publish calls) and propagate it to the test
(either call t.Fatalf/require.NoError from the callback context or send the
error on a channel and assert it in the main test goroutine). Update both the
Consume callback at cons.Consume(...) and the similar callback at the later
occurrence (the ack/marshal/publish call on lines 181-182) to fail the test when
an error occurs so no ack/marshal/publish errors are silently dropped.

In `@tools/loadgen/README.md`:
- Around line 268-271: The fenced code block showing sample output in README.md
should be annotated with a language to satisfy markdownlint MD040; update the
block delimiter from ``` to ```text so the snippet starting with "ANSWER: max
RPS = 2000 (workload=messages, preset=medium)" is fenced as plain text.

---

Nitpick comments:
In `@tools/loadgen/integration_test.go`:
- Line 142: Replace uses of context.Background() created for ramp runs with a
bounded context via context.WithTimeout to prevent hanging tests: create ctx,
cancel := context.WithTimeout(context.Background(), 2*time.Minute) and defer
cancel() before calling runRamp (and any subscriber/workload invocations), pass
that ctx into runRamp and related functions, and repeat this change for the
other occurrences around lines where ctx is defined (also update imports to
include time if needed).

In `@tools/loadgen/maxrps_messages.go`:
- Around line 237-252: After calling waitOrCancel(ctx, hold) capture holdErr and
immediately return if holdErr != nil to avoid running w.snapshotCounters(),
w.snapshotPending(ctx), cancel(), wg.Wait(), time.Sleep(2 * time.Second) and
w.collector.DiscardBefore(holdStart) when the context was cancelled; move the
holdErr check to just after holdErr := waitOrCancel(ctx, hold) so the expensive
snapshot/drain/cleanup only runs on successful holds.

In `@tools/loadgen/ramp.go`:
- Around line 32-64: In parseRPSSteps, after the loop and before returning, add
an explicit defensive check for len(out) == 0 and return a clear error (e.g.,
fmt.Errorf("no steps provided") or similar) to match the plan; this should
reference the out slice and be placed just before the final return out, nil in
the parseRPSSteps function.

In `@tools/loadgen/README.md`:
- Around line 244-262: Update the Flags section to include workload-specific
flags registered by runMaxRPS: add a "messages-only" subsection row for --inject
(and note default and meaning) and a "history-only" subsection with rows for
--mix, --before-mode, --scrollback-pages, --page-limit, and --request-timeout;
mark which flags apply only to history or messages, provide their defaults and
short notes (e.g., `--inject` = messages-only, default empty; `--mix`,
`--before-mode`, `--scrollback-pages`, `--page-limit`, `--request-timeout` =
history-only), and place them near the existing Flags table so the README
documents all tunables runMaxRPS registers.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a7e8914-4979-46de-8426-64e9d2523ed6

📥 Commits

Reviewing files that changed from the base of the PR and between ad10203 and bd2a863.

📒 Files selected for processing (20)

docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md
docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md
tools/loadgen/README.md
tools/loadgen/collector.go
tools/loadgen/collector_test.go
tools/loadgen/deploy/Makefile
tools/loadgen/integration_test.go
tools/loadgen/main.go
tools/loadgen/maxrps.go
tools/loadgen/maxrps_history.go
tools/loadgen/maxrps_history_test.go
tools/loadgen/maxrps_messages.go
tools/loadgen/maxrps_messages_test.go
tools/loadgen/maxrps_report.go
tools/loadgen/maxrps_report_test.go
tools/loadgen/maxrps_test.go
tools/loadgen/ramp.go
tools/loadgen/ramp_test.go
tools/loadgen/verdict.go
tools/loadgen/verdict_test.go

coderabbitai · 2026-05-28T23:11:20Z

+		cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() })
+		require.NoError(t, err)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop silently discarding callback errors in the test path.

Ack/marshal/publish failures are currently ignored, which can make this integration test pass while message flow is broken. Capture and fail on these errors.

Proposed fix

- cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() }) + cc, err := cons.Consume(func(msg jetstream.Msg) { + if err := msg.Ack(); err != nil { + t.Errorf("ack failed: %v", err) + } + }) @@ - data, _ := json.Marshal(evt) - _, _ = js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data) + data, err := json.Marshal(evt) + if err != nil { + t.Errorf("marshal message event failed: %v", err) + return + } + if _, err := js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data); err != nil { + t.Errorf("publish canonical event failed: %v", err) + }

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded".

Also applies to: 181-182

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/integration_test.go` around lines 165 - 166, The callback passed to cons.Consume currently discards errors (e.g., func(msg jetstream.Msg) { _ = msg.Ack() }); change it to surface failures instead of ignoring them: inside the callback check the returned error from msg.Ack() (and any marshal/publish calls) and propagate it to the test (either call t.Fatalf/require.NoError from the callback context or send the error on a channel and assert it in the main test goroutine). Update both the Consume callback at cons.Consume(...) and the similar callback at the later occurrence (the ack/marshal/publish call on lines 181-182) to fail the test when an error occurs so no ack/marshal/publish errors are silently dropped.

coderabbitai · 2026-05-28T23:11:20Z

+```
+ANSWER: max RPS = 2000 (workload=messages, preset=medium)
+        Next limit: E2 p95=143ms > 100ms
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language to the fenced code block.

markdownlint flags this block (MD040). Use a plain text fence for the sample output.

📝 Proposed fix

-``` +```text ANSWER: max RPS = 2000 (workload=messages, preset=medium) Next limit: E2 p95=143ms > 100ms

</details>  <details> <summary>📝 Committable suggestion</summary> > ‼️ **IMPORTANT** > Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements. ```suggestion

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 268-268: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/README.md` around lines 268 - 271, The fenced code block showing sample output in README.md should be annotated with a language to satisfy markdownlint MD040; update the block delimiter from ``` to ```text so the snippet starting with "ANSWER: max RPS = 2000 (workload=messages, preset=medium)" is fenced as plain text.

mliu33

Thanks

…set) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation.

* docs: daily-IM load scenario design spec * docs: daily-IM load scenario implementation plan * loadgen: add DailyBands field and daily-light/heavy/power presets * loadgen: banded fixture build for daily-IM presets * loadgen: diurnal envelope for daily-IM scenario * loadgen: user state machine + action picker for daily-IM * loadgen: send/read-receipt/room-list action handlers * loadgen: history/mute/room-create/member-add/thread action handlers * loadgen: direct receiver pool for daily-IM scenario Add directPool which, for each simulated user, opens a dedicated nats.Conn and Subscribes to every room's broadcast subject. Each delivery is timestamped and correlated against the shared Collector via RecordBroadcast so end-to-end (E2) latency can be measured. Wired into the daily-IM runtime by the next task; symbols carry //nolint:unused until that wiring lands so this commit stands alone. https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: multiplex receiver pool with drop counting https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: SLO verdict evaluator for daily-IM steps * loadgen: pending poller + service scraper + self-metrics * loadgen: parseDailyConfig CLI flags + validation * loadgen: per-step lifecycle (warmup/hold/cooldown) https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: per-user emitter goroutines + runDaily control loop * loadgen: console + CSV report for daily-IM scenario * loadgen: production envFactory and runDaily entrypoint Wires the daily-IM runtime end-to-end: prodEnvFactory builds the real NATS direct/multiplex pools, pending poller, service scraper, and JWT minting; runDaily is the production entrypoint dispatched from main.go. Drops //nolint:unused suppressions on symbols now genuinely consumed via the production path (directPool/multiplexPool + constructors, pollPending, serviceScraper + Scrape, runStep/activateUsers/stepEnv). startEmitter and doAction remain suppressed pending the per-user emitter wiring task. https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: dispatch test + usage line for daily subcommand * loadgen: integration test for daily-IM scenario * loadgen: deploy/run-daily target for daily-IM scenario * docs(loadgen): document daily-IM scenario * loadgen: httptest unit coverage for pending poller + service scraper * loadgen: fixes from code review (recall pass) Apply correctness and cleanup fixes surfaced by /simplify high-effort review: Correctness: - runStep: return Inconclusive on ctx cancel and on pollPending error instead of zero-valued StepResult that renderConsole treats as PASS - multiplexPool.Add: subscribe BEFORE mutating dispatch/refcount, so partial-subscribe failures leave the pool consistent - multiplexPool.Close: stop closing per-user channels — race against in-flight route() could panic on send-on-closed; rely on Drain+GC - multiplexPool.route: hold lock through inbox send so Close can't race; drop counter incremented outside lock to avoid contention - newMultiplexPool: return error instead of panic on connect failure - runDailyForTest: defer Close on direct+multiplex pools to drain connections cleanly between/after runs - directPool/multiplexPool: also subscribe to UserRoomEvent(account) so DM broadcasts (the dominant receive traffic for daily presets) are observed - prodEnvFactory: wire publish/request fns through stepEnv; emitters now actually emit actions in production (previously startEmitter was dead code with //nolint:unused, making the verdict trivially pass) - doAction: use env.siteID instead of hardcoded "site-local" - runDailyForTest: propagate baseCfg.SiteID into fixtures and stepEnv - prodEnvFactory: drop the hardcoded :9100/metrics URLs — backend services don't expose HTTP endpoints, so the scrape was a permanent no-op masquerading as a real signal Cleanup: - 5s tick in runStep hold loop → single context-aware timer - maxInt helper → slices.Max - joinReasons body → strings.Join - scrapeErrorCounter manual read loop → io.ReadAll Known limitation documented in preset.go: large-band rooms have UserCount > LargeRoomThreshold in production fixtures, which causes message-gatekeeper to block non-thread sends from member-role users. Daily presets work around this by funneling sends to smaller rooms; large-band rooms still exercise fan-out via receive-side subscriptions. * loadgen: fix integration test call site after newMultiplexPool sig change * loadgen(daily): add CLI usage documentation with examples `loadgen daily -h` now prints a description of the scenario, SLO thresholds, receiver topology, preset summary, flag list with units, and three example invocations. Also exit cleanly on -h/--help instead of logging a 'parse daily config error=flag: help requested' line. * loadgen(daily): log fixture build start + elapsed time 100k-user fixtures take minutes to build (O(N*perUser*R) weighted picker in preset.go); without a log line up front, the silence looked like a hang. Now you see 'building fixtures users=N' immediately and 'fixtures built rooms=X subscriptions=Y elapsed=Zs' when the picker finishes, so it's clear whether the process is in fixture build, in the per-step warmup, or stuck. The picker itself remains the original weighted-by-remaining-capacity algorithm; an earlier attempt to swap it for a shuffled slot-bag preserved the asymptotic complexity win but broke the DM band's perfect-fit tail (any duplicate-skip burns a slot the last user needs) and was reverted. A proper fix (stub-pairing for DM band only) is deferred to a follow-up. * loadgen(daily): tolerate flaky /jsz; drop pending-growth signal vs aborting A misbehaving NATS monitoring endpoint was marking every step Inconclusive — too strict. Now: - pollPending retries up to 3x with linear backoff (200ms, 400ms) so transient failures are absorbed. - If the poll still fails after retries, runStep logs a warning and drops the pending-growth signal for that step (passes nil ConsumerPending to evaluateStep). The other four signals (latency, error rate, service errors, self-metrics) still produce a real verdict. - Only ctx.Canceled / DeadlineExceeded from the poller still maps to Inconclusive, since that indicates the run is shutting down, not a flaky server. * loadgen(daily): NatsMonitoringURL env var for the /jsz poller Adds NATS_MONITORING_URL (default http://nats:8222/jsz, the docker-compose default) to the base config. prodEnvFactory.Build reads it for the daily-IM pending-growth poller. Lets users running against non-default infra (host-port-mapped NATS, monitoring on a different port, etc.) point the poller at the right endpoint without recompiling. * loadgen(preset): stub-pair DM band for O(N*perUser) fixture build The DM band's weighted-by-capacity picker was the dominant cost in BuildFixtures: at N users with bands.DMs=perUser the band has R = N*perUser/2 rooms, making the per-pick O(R) scan quadratic in N. At N=100000 (the default --steps tip) the DM band alone was ~3e12 inner-loop iterations, taking hours. Stub-pairing (configuration model) generates a perUser-regular bipartite graph in O(N*perUser): 1. Each user contributes perUser stubs to a flat list. 2. Shuffle the list once. 3. Pair consecutive stubs into DM rooms. 4. Fix self-pairs (stubs[2k]==stubs[2k+1]) by swapping the second stub with a later position whose neighbours don't create a new self-loop. Self-loops are rare (~perUser expected over the whole list). By construction every user appears in exactly perUser DMs and every DM has 2 distinct users. The Small/Medium/Large bands keep the existing weighted picker (they're tractable at typical N -- O(R) per pick is fine when R is in the thousands). Locked in by TestBuildFixtures_DailyHeavy_FastAtScale: at N=2000 the whole BuildFixtures call now completes in ~2s on a developer box; the test fails the build if it ever regresses past 30s. Existing TestBuildFixtures_DailyBands (which already asserts the per-user membership count invariant at N=200) still passes. * loadgen(daily): correctness + operational fixes for core-tool use Three perf+correctness+ops reviews surfaced 14 findings. Applies the verdict-affecting and operationally-critical ones; flags the performance/memory ones as follow-ups in commit messages and code. Correctness: - **Deterministic per-user RNG seed.** startEmitter was seeded from time.Now().UnixNano() XOR len(u.ID) — len is constant for all users ("u-NNNNNN"), so same-nanosecond activations got identical seeds and emitted perfectly correlated action streams. Replaced with a splitmix mix of env.runSeed and the user index, so two runs at the same run-seed produce identical action streams (the whole point of a reproducible verdict). - **Per-step envelope re-anchor.** Emitters used to capture holdStart at activation time, so users from step N kept emitting against step N's curve all the way through step N+1 — by step 2 the early half of the user population was firing at envelope baseline (0.4×rate) instead of following the new step's hold curve, making later steps' "PASS" verdicts wrong. Now stepEnv carries holdStart and holdDuration as atomic.Int64 fields; emitters re-read them on every tick; runStep calls env.setHold(now, hold) right before Reset() at the start of each hold. - **Post-activation hold anchor.** holdStart was computed *before* activateUsers, which is rate-limited at 500 users/sec. At a +50k delta this meant 100s of activation, mostly during what was supposed to be warmup, with fresh users emitting at the envelope's clamped edge instead of ramping. The re-anchor above also fixes this — hold begins after warmup actually elapsed. - **percentile uses ceil-based nearest-rank.** Old floor-based indexing (int(p*(len-1))) returned p98 for p99 on ≤100-sample windows, so tight smoke tests false-PASSed. Now uses ceil(p*len)-1, matching the standard. - **diffPending walks both sides.** A consumer that disappears mid-hold (crashed/deleted) used to be silently dropped from the verdict. Now surfaces as End=0/negative-Delta and trips with reason "<durable> disappeared mid-hold". - **HTTP timeout on /jsz.** pollPendingOnce used http.DefaultClient with no timeout; a hung monitoring endpoint would wedge the whole run until the operator killed the process. Dedicated 5s-timeout client. - **ctx-aware ramp loop.** runDailyForTest now checks ctx.Err() between steps, so SIGINT mid-cooldown doesn't produce a junk trail of INCONCLUSIVE rows for steps that never started. Operational guards: - **Zero-traffic → INCONCLUSIVE.** Publisher conn failure or unwired emitters used to produce a silent PASS at every step. Now evaluateStep returns INCONCLUSIVE with "zero actions attempted" when AttemptedOps==0. - **Effective-N tracking + <95% threshold.** activateUsers increments env.activatedCount/skippedCount atomically; if EffectiveN/N < 95% the step returns INCONCLUSIVE with the shortfall in the reason. StepResult.EffectiveN is surfaced in console (e.g. "20000(10000)") and added as a CSV column. Stops "N=20000 PASS" silently meaning "10000 active". - **Goroutine-as-CPU proxy disabled.** readCPUPercent's heuristic (NumGoroutine/5000 × 100) tripped INCONCLUSIVE at any scale ≥4k users — exactly the regime this tool is built for. Now returns 0; INCONCLUSIVE relies on the GC pause signal alone until a real CPU sample (gopsutil/proc-self-stat) is wired. Follow-ups (not in this commit): - **Perf.** Collector single-mutex contention (~520k locks/sec at N=100k), unbounded byReqID/byMsgID growth (spec promised TTL+1M cap), per-user inbox channels that nobody reads, small/medium band O(N²) picker, one-goroutine-per-user (100k goroutines), per-tick actionCtx allocation. Each is a real win but a non-trivial refactor; queued for the next pass. - **Ergonomics.** Configurable SLO thresholds + seed flags, heartbeat between warmup and verdict, ABORTED marker on ctx-cancel, full per-durable CSV columns, --validate dry-run. * loadgen(perf): slot-bag picker for non-DM bands + shard Collector Two perf fixes flagged by the post-review pass. preset.go: small/medium/large band fixture picker switched from the O(N x perUser x R) weighted-scan to a configuration-model slot-bag walk with retry on per-user duplicates. Same shape as the DM band's stub-pairing but generalised for heterogeneous room capacities. N=10000 daily-heavy fixture build drops from minutes to ~1s; N=100k drops from ~30+ min to seconds. TestBuildFixtures_DailyHeavy_FastAtScale bumped to N=10000 to lock the speedup in. collector.go: byReqID/byMsgID/e1/e2 sharded by FNV-1a hash of the correlation key across 64 shards (reqShard for requestID-keyed, msgShard for messageID-keyed; each has its own mutex). At N=100k a busy daily-IM run produces ~150k+ Record* calls/sec, all of which previously serialised on a single sync.Mutex — loadgen became the bottleneck and inflated latency samples before the backend tripped. Per-shard contention drops by ~64x. Determinism preserved: emit-as-you-pick instead of range-over-map (Go map iteration order is randomised), and the slot-bag uses the same seeded r.Intn sequence both runs. TestBuildFixtures_DailyBands still passes its same-seed -> equal Fixtures invariant. * loadgen: reconcile with main after rebase (drop dup waitOrCancel + Reset) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation. * loadgen(daily): skip integration test that needs full backend stack TestRunDaily_Integration_TinyPresetPasses passed vacuously before the recall-review fix wired emitters into prodEnvFactory (no actions = no errors = trivial PASS). With emitters live, the test now actually publishes to subjects like `chat.user.{acct}.room.{room}.{site}.msg.send` and request-replies on history/room-list/etc. — all of which time out in the CI integration job because the testutil NATS container has no gatekeeper, no room-service, and no broadcast-worker subscribed. Result: error_rate=7.69% > 0.1% threshold, step TRIPs, test FAILs. The wiring is correct (proven by unit + race tests); what's wrong is that this scenario fundamentally needs the full docker-compose stack to be meaningful. Skipping with a comment pointing operators at `make -C tools/loadgen/deploy run-daily` for real end-to-end coverage. * loadgen(daily): flush after Subscribe so server registers interest before Add returns CI hit: TestMultiplexPool_RoutesBroadcastToInbox 'Condition never satisfied'. Root cause: nc.Subscribe queues the SUB command client- side; it's not sent to the broker until the next Flush. A caller (or test) that publishes immediately after Add() can see the broadcast dropped because the server hasn't registered the subscription interest yet. Production emitters tick on a 1s schedule so they never publish during the registration race window, but the test does — explaining why the test was reliably failing only under CI's timing. Fix in both pools (direct + multiplex): call nc.Flush() after the Subscribe pass so Add only returns once the server knows about every new interest. One round-trip per pool conn at activation time, dominated by the Subscribe overhead already incurred. * loadgen(daily): pass NATS_CREDS_FILE to pool + publisher conns Bug: daily.go's prodEnvFactory and daily_pool.go's pool constructors called bare nats.Connect(url, nats.Name(...)) with no credentials. On operator-mode NATS servers (the chat-system standard, including the local docker-local stack) anonymous connections either fail outright or get default-deny perms — manifesting as 'permissions violation' on subscribe and the silent-zero-latency symptom reported here. Every other loadgen subcommand correctly used natsutil.Connect(cfg.NatsURL, cfg.NatsCredsFile); only the daily scenario was missing the second arg. Fix: - New connectWithCreds(url, name, credsFile) helper applies nats.UserCredentials when credsFile is non-empty, else falls back to anonymous (testcontainer NATS in unit/integration tests). - newDirectPool, newMultiplexPool, and the prodEnvFactory publisher conn all route through it. - baseCfg.NatsCredsFile (existing env var NATS_CREDS_FILE) flows in through prodEnvFactory; the docker-compose overlay already wires NATS_CREDS_FILE=/etc/nats/backend.creds. After this fix: 'make -C tools/loadgen/deploy run-daily PRESET=...' authenticates as the backend user, subscribe perms work, and latency samples populate — P50/P95/P99 should be non-zero. * loadgen(daily): use Request for readReceipt; room-service expects a reply room-service registers MessageReadWildcard via nc.QueueSubscribe and unconditionally calls m.Msg.Respond(resp) (room-service/handler.go:83). Respond requires msg.Reply to be set, which nc.Publish does not set — only nc.Request does. The daily readReceipt action was using a.Publish, so every read-receipt action triggered 'nats: message does not have a reply' in room-service's log. Switched to a.Request with the same defaultRequestTimeout the other request-shaped actions use, and renamed/updated the unit test to lock in the contract (must land in c.reqs, not c.pubs). Verified the other six action handlers (scrollHistory, refreshRoomList, muteToggle, roomCreate, memberAdd, threadReply) — they all use Request against the right routes, so readReceipt was the sole mismatch. * loadgen(daily): pre-flight check for un-seeded mongo + readReceipt uses Request Two unrelated bugs that bit during real runs: 1. readReceipt was using a.Publish for subject.MessageRead, but room-service registers that subject via QueueSubscribe and calls msg.Respond on every received message — which fails with 'nats: message does not have a reply' when the sender used fire-and-forget Publish. Switched to a.Request (matches the pattern of every other request-side action handler). Updated the test. 2. runDaily started straight into BuildFixtures + a full ramp even when the Mongo subscriptions collection was empty for the configured siteID. The gatekeeper then rejected every send with 'user X is not subscribed to room Y' and the operator burned ~30+ minutes before seeing the error_rate trip. Added a preflight that opens a short- lived Mongo connection, counts subscriptions for cfg.SiteID, and bails with an actionable message: ERROR daily pre-flight error="no subscriptions found in mongo for siteID="site-local"; run \`loadgen seed --workload=messages --preset=<your daily preset>\` first (or \`make -C tools/loadgen/deploy seed PRESET=<your preset>\`)" Exit code 2 (config error). Uses a 10s context independent of the run-level ctx so a transient Mongo blip doesn't burn the window before failing. * loadgen(daily): stop overriding preset.Users — caused fixture ID mismatch with seed Symptom: gatekeeper rejects every send with "user X is not subscribed to room Y" even though `loadgen seed` was just run successfully. Root cause: BuildFixtures is deterministic in (preset, seed, siteID). `loadgen seed --workload=messages --preset=daily-heavy` uses the preset's default Users (10000). But runDailyForTest was overriding preset.Users = slices.Max(cfg.Steps) before calling BuildFixtures. If max(steps) != preset.Users, the per-band stub shuffle produces DIFFERENT user/room/sub IDs at run time vs seed time — same prefixes ("u-000123", "room-small-000045"), different totals → different permutation → no overlap. Gatekeeper's GetSubscription misses every lookup and the whole ramp fails on error_rate. Fix: drop the override. runDailyForTest now uses preset.Users as-is, matching `loadgen seed`. activateUsers already caps at len(env.users), and the EffectiveN-shortfall guard surfaces "only X/Y users activated" as INCONCLUSIVE when --steps exceeds preset.Users — much clearer than silent ID drift. Side effect: --steps cannot exceed preset.Users for that preset. For daily-light/heavy/power that's 10000. A startup log line warns when max(steps) > preset.Users so the operator sees the cap before the run starts. No re-seed required after deploying the fix. * docs(loadgen): comprehensive Daily-IM operator guide Replace the 53-line stub section with a full operator guide that captures every lesson from real runs: - Quick start + prerequisites (seed step, NATS perms, env vars) - Preset comparison with rooms-per-user breakdown - Complete CLI flag reference (units, defaults, when to change) - Environment variables (NATS_CREDS_FILE, NATS_MONITORING_URL, SITE_ID, plus the mongo/valkey vars) - SLO signals with thresholds; PASS/TRIP/INCONCLUSIVE semantics - Sample console + CSV output, how to read 'ANSWER' and N(EffectiveN) annotations - Troubleshooting matrix covering every failure mode encountered: not-subscribed, siteID mismatch, large-room block, 'no reply' from room-service, NATS permissions, all-zero latency, EffectiveN cap, fixture build silence, flaky /jsz - Known limitations explicitly listed (large-room block, no-op auth stub, dormant service-error scraper, disabled CPU proxy) - Pointers to spec, plan, and source files Aims to make the tool self-serviceable: a new operator who has never used it should be able to get to a verdict by following the doc, and an experienced operator hitting an error should find the fix in the troubleshooting matrix without escalating. * loadgen(daily): per-action latency breakdown in console + CSV Adds observational per-handler latency stats so operators can see what each daily action (send / read_receipt / scroll_history / refresh_room_list / member_add / room_create / mute_toggle) actually costs, in addition to the system-wide publish→broadcast p50/p95/p99 the verdict already uses. - Collector gains per-action latency sample tracking (map[int][]Duration under its own mutex; cleared by Reset alongside the existing counters). - doAction wraps each handler call with time.Now/Since and records the elapsed via RecordActionLatency. - actionKind grows a String() helper for stable lowercase names and an allActionKinds canonical-order slice for the report. - stepInputs/StepResult carry ActionLatencies; evaluateStep computes Count + P50/P95/P99 per action via the existing percentile helper. - Console prints 'actions: send n=8920 p50=12 p95=180 p99=320 | ...' on the line after each step's verdict. - CSV adds 28 columns (<action>_count, _p50_ms, _p95_ms, _p99_ms × 7 actions) in stable allActionKinds order so downstream tools can column-index reliably. Verdict criteria unchanged: these stats are observational only — PASS / TRIP / INCONCLUSIVE still depends on the existing latency, error-rate, pending-growth, and self-metrics signals. Per the operator's request (too easy to confuse a single 'all actions sluggish' p95 with a real broadcast-pipeline regression if we trip on per-action numbers). * loadgen(daily,seed): add --users flag to override preset.Users The daily presets hard-code Users=10000. Operators wanting to ramp above that previously had to edit preset.go and rebuild. Now both `loadgen seed --workload=messages` and `loadgen daily` accept `--users=N` to override at invocation time. The two MUST be passed the same value — BuildFixtures is deterministic in (preset, seed, siteID), and the per-band stub shuffle uses totalUsers as length, so a mismatch produces entirely different room/sub IDs even though the user IDs (u-000000, u-000001, ...) overlap. The gatekeeper then rejects every send with "user not subscribed". To catch that misuse fast, the daily preflight now also counts users in Mongo and compares to preset.Users (or the --users override). On mismatch it errors with the exact teardown+re-seed commands needed. Usage at, say, N=50000: loadgen seed --workload=messages --preset=daily-heavy --users=50000 loadgen daily --preset=daily-heavy --users=50000 --steps=5000,10000,20000,50000 `--users=0` (default) means "use preset default", the safe path for normal runs and existing scripts. * loadgen(daily): set X-Request-ID header on every NATS publish/request room-service extracts X-Request-ID from inbound NATS headers via natsutil.ContextWithRequestIDFromHeaders (room-service/handler.go:60). The ID flows through ctx into the canonical event room-service publishes to room-worker's ROOMS stream. room-worker REQUIRES X-Request-ID on canonical events for idempotency (it's the dedup seed) and rejects with 'missing X-Request-ID' otherwise (room-worker/handler.go:752,1219). Before this fix, prodEnvFactory's publish/request closures used pubConn.Publish(subj,data) and pubConn.RequestWithContext(ctx,subj,data) — neither of which sets headers. Empty inbound header at room-service → empty ctx → empty canonical header → room-worker logs the error and drops the event. Visible in room-worker logs as 'missing-X-request-id' on subject 'chat.room.canonical.<site>.member.add'. Now both closures build *nats.Msg with a fresh natsutil.RequestIDHeader (X-Request-ID) per call and use PublishMsg / RequestMsgWithContext. Every emitter action now produces a properly traceable request through the canonical pipeline. The header ID is independent of any payload-level request ID (e.g. SendMessageRequest.RequestID for the gatekeeper) — both serve different layers. In production a real client typically uses the same value at both layers; we could plumb that through actionCtx as a follow-up if request-tracing across layers needs it. * loadgen(daily): memberAdd picks from channel rooms only room-service rejects member-add on DM rooms with 'cannot add members to a non-channel room' (DMs have a fixed 2-member shape). For daily-heavy ~25 DMs out of 56 rooms per user, picking uniformly from u.Rooms means ~45% of memberAdd actions error out — inflating error_rate noise and spamming the room-service log. Pre-filter at activation time: userState.ChannelRooms is the subset of Rooms that aren't DMs, computed once in newUserState by ID-prefix ('room-dm-' = DM band, everything else = channel band). memberAdd now picks from u.ChannelRooms instead of u.Rooms. Users with no channel rooms (DM-only) no-op the action rather than erroring. The other actions (sendMessage, readReceipt, scrollHistory, refreshRoomList, muteToggle, roomCreate, threadReply) work fine on DMs and continue using u.Rooms. Only memberAdd has the channel-only constraint at the room-service layer. Tests: - TestNewUserState_ChannelRoomsExcludesDMs locks the prefix-filter - TestMemberAdd_Requests now seeds ChannelRooms explicitly - TestMemberAdd_SkipsWhenNoChannelRooms covers the DM-only no-op path * loadgen(daily): roomCreate includes neighbor user; memberAdd uses real account Two related fixes for actions that hit room-service: roomCreate sent only {name, type}. classifyAndValidate at room-service/handler.go:193 lets that through (name is non-empty), but the channel branch at line 290 then rejects it: allUsers := stripAccount(...) if len(allUsers) == 0 && len(allOrgs) == 0 { return nil, errEmptyCreateRequest } A channel must have at least one member besides the creator. Now we include u.Neighbor in the 'users' array. errEmptyCreateRequest gone. memberAdd was passing the literal 'user-stub' as the target account. room-service validates accounts exist (errUserNotFound) — 'user-stub' isn't in Mongo, so every memberAdd that reached a channel was failing silently after the channel-only filter. Now uses u.Neighbor, which is guaranteed-seeded. userState.Neighbor is computed once in newUserState by shifting the requester's index by 1 ('user-N' → 'user-(N-1)' for N≥1, 'user-0' → 'user-1'). Always != Account; always within the seeded range for any preset with Users ≥ 2 (i.e. all daily-* presets). Tests: - TestNeighborOf covers shift / wrap / fallback - TestRoomCreate_Requests asserts the payload now has a users array - TestMemberAdd_Requests / SkipsWhenNoChannelRooms unchanged shape * loadgen(daily): apply per-request timeout in the request closure The 'timeout' argument to actionCtx.Request was silently ignored: RequestMsgWithContext uses the context's deadline, and the emitter's ctx is the run-level ctx with no deadline. So defaultRequestTimeout (5s) never bound anything; a slow room-service handler could hang arbitrarily long, showing up as huge per-action p50 (e.g. 25s for read_receipt) instead of cleanly timing out and contributing to error_rate. Wrap the incoming ctx with context.WithTimeout(timeout) so each call gets a real deadline. Action handlers continue to pass defaultRequestTimeout=5s; future per-action overrides can vary the value without code changes. This makes the per-action latency numbers a true latency signal: a healthy handler shows actual round-trip ms; a sick one shows a wall at the timeout value plus a matching jump in failed_ops / error_rate. * loadgen(daily): per-action latency gates the verdict Previously the verdict only gated on the publish→broadcast latency (the messaging pipeline). Operations like read_receipt, scroll_history, member_add, refresh_room_list — all of which exercise different parts of the system (room-service, history-service, room-worker) — only contributed to error_rate when they timed out, never to latency. A slow handler that stayed under the request timeout was invisible. Now Thresholds carries per-action p95/p99 maps; evaluateStep iterates ActionLatencies and trips if any gated action exceeds its cap. Default thresholds reflect typical chat-backend handler latencies: read_receipt: p95=100ms p99=250ms (mongo update + reply) refresh_room_list: p95=200ms p99=500ms (mongo query per user) scroll_history: p95=500ms p99=1500ms (cassandra bucket walk) member_add: p95=200ms p99=500ms (mongo + canonical pub) mute_toggle: p95=100ms p99=250ms (mongo update) room_create: p95=500ms p99=1500ms (mongo + canonical pub) sendMessage/threadReply are not gated here — they're already gated by the broadcast-pipeline P95LatencyMs / P99LatencyMs at the top of the verdict (the latency the existing p50/p95/p99 columns reflect). Operator can tune via --action-p95-ms / --action-p99-ms flags: loadgen daily --preset=daily-heavy \ --action-p95-ms=read_receipt:80,scroll_history:300 \ --action-p99-ms=read_receipt:200,scroll_history:800 Format: comma-separated name:N pairs. Unset actions keep defaults; unknown names are rejected at parse time. Set N to a very large number to effectively disable an action's gate. Trip reasons name the action explicitly: scroll_history p95=612ms > 500 read_receipt p99=340ms > 250 * loadgen(daily): rename read_receipt action to mark_read The action marks the user's subscription as read (msg.read handler). "read_receipt" invited confusion with the separate MessageReadReceipt notification path; "mark_read" names what the action actually does. Breaking for existing CSV consumers and CLI users: the per-action column names change from read_receipt_* to mark_read_*, and --action-p95-ms / --action-p99-ms must use "mark_read" as the key. * loadgen(history): stream per-room seed plan to avoid OOM on history-large history-large materialized a single 65M-row plannedMessage slice (~50 GB of strings), so seeding got SIGKILLed before reaching Cassandra. BuildHistoryFixtures now keeps only the Fixtures + ThreadParents map and hands the seeder an IterateRoomMessages iterator that builds one room's plan at a time (~50 MB peak). FullPlan stays available for tests and small/medium presets where materialization is bounded. Per-room randomness is split into two streams — structural (sender, jitter, parent-selection, reply offsets) and content (message body bytes) — so a cheap metadata-only walk can derive Room.LastMsgAt and ThreadParents without paying the O(MessagesPerRoom × ContentBytes) cost of regenerating content. CreatedAt and content bytes for a given (seed, preset) are different from before — no test asserts on either, but anyone diffing a re-seed against an old Cassandra dump will see the change. --------- Co-authored-by: Claude <noreply@anthropic.com>

claude added 16 commits May 28, 2026 15:50

feat(loadgen): add rps step verdict types and evaluateRPSStep

9d5de5a

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

fix(loadgen): address Task 1 review — pointer verdict param, String()…

59680c7

… and precedence tests https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

feat(loadgen): add rps ramp engine (parseRPSSteps, runRamp)

a7f35a2

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

fix(loadgen): address Task 2 review — cover runRamp branches, overflo…

fb5c768

…w guard, cleanups https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

feat(loadgen): add max-rps report renderer and CSV writer

1b1408c

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

test(loadgen): cover no-Next-limit path and multi-series CSV alignment

717e3fb

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

feat(loadgen): add Collector.Reset for per-step ramp windows

fa3d70e

feat(loadgen): add messages workload adapter for max-rps

a9da8d2

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

feat(loadgen): add history workload adapter for max-rps

6ff0fc3

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

feat(loadgen): wire max-rps subcommand into dispatch

38bd876

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

test(loadgen): integration coverage for max-rps messages ramp

867ff1e

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

docs(loadgen): document max-rps subcommand and add run-max-rps target

830cacf

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

hmchangw added the ready label Jun 1, 2026

mliu33 approved these changes Jun 1, 2026

View reviewed changes

mliu33 merged commit 6268a14 into main Jun 1, 2026
6 checks passed

coderabbitai Bot mentioned this pull request Jun 1, 2026

feat(loadgen): daily-IM load scenario to find sustainable N #234

Merged

3 tasks

This was referenced Jun 2, 2026

feat(loadgen): bottleneck attribution for max-rps(messages) #262

Merged

feat(loadgen): add read-receipt workload to max-rps #264

Open

coderabbitai Bot mentioned this pull request Jun 4, 2026

loadgen: batched pacer to remove the RPS ramp ceiling + actionable shortfall diagnosis #270

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(loadgen): max-rps SLO finder for message-send & history#240

feat(loadgen): max-rps SLO finder for message-send & history#240
mliu33 merged 16 commits into
mainfrom
claude/max-rps-slo-loadgen-ajKRi

hmchangw commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

mliu33 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() })
		require.NoError(t, err)

Conversation

hmchangw commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Test plan

Notes

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

mliu33 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hmchangw commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading