feat(loadgen): max-rps SLO finder for message-send & history#240
Conversation
Mirrors PR #234's step-up-and-hold-under-SLO control loop, ramping target RPS instead of user count N, to auto-find the max sustainable RPS for the messages and history loadgen workloads. Standalone on main with rps-prefixed identifiers so it neither depends on nor collides with #234. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
Task-by-task TDD plan: generic ramp engine + verdict + report, reusing the existing message-send and history open-loop generators, exposed as a max-rps subcommand. rps-prefixed identifiers avoid collision with PR #234. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
…tor error, log gather error - Add shutdownSrv helper and call it on all three subscription-error return paths so the metrics HTTP server is not leaked when construction fails. - Log the Gather() error in snapshotCounters instead of silently discarding it. - Strengthen TestDiffCounters to assert zero-delta for marshal/gatekeeper/bad_reply. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
Correct the README "Reading the output" section (INCONCLUSIVE does not stop the ramp and is not a pass; fix the no-pass ANSWER string), validate --page-limit/--request-timeout > 0, and document the deliberate straggler-exclusion in the messages hold-window counter snapshot. https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
📝 WalkthroughWalkthroughThis pull request introduces Changesloadgen max-rps feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
tools/loadgen/integration_test.go (1)
142-142: ⚡ Quick winUse a bounded context for
runRampto avoid hanging integration runs.Running the ramp on
context.Background()can hang CI if a subscriber/workload path stalls. Wrap ramp execution incontext.WithTimeout.Proposed fix
- ctx := context.Background() + ctx := context.Background() @@ - results := runRamp(ctx, w, &rampConfig{ + rampCtx, cancel := context.WithTimeout(ctx, 20*time.Second) + defer cancel() + results := runRamp(rampCtx, w, &rampConfig{Also applies to: 194-202
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/integration_test.go` at line 142, Replace uses of context.Background() created for ramp runs with a bounded context via context.WithTimeout to prevent hanging tests: create ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) and defer cancel() before calling runRamp (and any subscriber/workload invocations), pass that ctx into runRamp and related functions, and repeat this change for the other occurrences around lines where ctx is defined (also update imports to include time if needed).tools/loadgen/ramp.go (1)
32-64: 💤 Low valueOptional: Add explicit empty-steps check for consistency with plan.
The plan (line 597-599) includes a final check for
len(out) == 0, but the implementation returns directly without it. While unreachable due to the "empty step" error at line 38-40, adding the explicit check would match the plan and serve as defensive documentation.📋 Optional consistency fix
out = append(out, n) } + if len(out) == 0 { + return nil, fmt.Errorf("no steps parsed from %q", s) + } return out, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/ramp.go` around lines 32 - 64, In parseRPSSteps, after the loop and before returning, add an explicit defensive check for len(out) == 0 and return a clear error (e.g., fmt.Errorf("no steps provided") or similar) to match the plan; this should reference the out slice and be placed just before the final return out, nil in the parseRPSSteps function.tools/loadgen/README.md (1)
244-262: 💤 Low valueFlags table omits workload-specific flags.
runMaxRPSalso registers--inject(messages) and the history-only tunables (--mix,--before-mode,--scrollback-pages,--page-limit,--request-timeout). Consider documenting them so the reference is complete.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/README.md` around lines 244 - 262, Update the Flags section to include workload-specific flags registered by runMaxRPS: add a "messages-only" subsection row for --inject (and note default and meaning) and a "history-only" subsection with rows for --mix, --before-mode, --scrollback-pages, --page-limit, and --request-timeout; mark which flags apply only to history or messages, provide their defaults and short notes (e.g., `--inject` = messages-only, default empty; `--mix`, `--before-mode`, `--scrollback-pages`, `--page-limit`, `--request-timeout` = history-only), and place them near the existing Flags table so the README documents all tunables runMaxRPS registers.tools/loadgen/maxrps_messages.go (1)
237-252: 💤 Low valueReturn early on
holdErrto skip the unnecessary drain/snapshot on cancellation.When
ctxis cancelled mid-hold,holdErris non-nil but the code still snapshots end counters/pending (against the deadctx) and incurs the fixed 2s drain before discarding the result. CheckingholdErrright after the hold avoids the wasted work and shutdown delay.♻️ Proposed reorder
holdErr := waitOrCancel(ctx, hold) + if holdErr != nil { + cancel() + wg.Wait() + return rpsStepInputs{}, holdErr + } // Counters are snapshotted at hold-end, before the drain: gatekeeper/bad_reply // errors whose reply lands during the drain are deliberately excluded (see the // straggler-exclusion rationale on buildMessagesInputs). The drain only lets // trailing E1/E2 latency samples settle for the percentile signals. endCounts := w.snapshotCounters() endPending, perr2 := w.snapshotPending(ctx) cancel() wg.Wait() time.Sleep(2 * time.Second) // drain trailing replies/broadcasts w.collector.DiscardBefore(holdStart) - - if holdErr != nil { - return rpsStepInputs{}, holdErr - }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/maxrps_messages.go` around lines 237 - 252, After calling waitOrCancel(ctx, hold) capture holdErr and immediately return if holdErr != nil to avoid running w.snapshotCounters(), w.snapshotPending(ctx), cancel(), wg.Wait(), time.Sleep(2 * time.Second) and w.collector.DiscardBefore(holdStart) when the context was cancelled; move the holdErr check to just after holdErr := waitOrCancel(ctx, hold) so the expensive snapshot/drain/cleanup only runs on successful holds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/loadgen/integration_test.go`:
- Around line 165-166: The callback passed to cons.Consume currently discards
errors (e.g., func(msg jetstream.Msg) { _ = msg.Ack() }); change it to surface
failures instead of ignoring them: inside the callback check the returned error
from msg.Ack() (and any marshal/publish calls) and propagate it to the test
(either call t.Fatalf/require.NoError from the callback context or send the
error on a channel and assert it in the main test goroutine). Update both the
Consume callback at cons.Consume(...) and the similar callback at the later
occurrence (the ack/marshal/publish call on lines 181-182) to fail the test when
an error occurs so no ack/marshal/publish errors are silently dropped.
In `@tools/loadgen/README.md`:
- Around line 268-271: The fenced code block showing sample output in README.md
should be annotated with a language to satisfy markdownlint MD040; update the
block delimiter from ``` to ```text so the snippet starting with "ANSWER: max
RPS = 2000 (workload=messages, preset=medium)" is fenced as plain text.
---
Nitpick comments:
In `@tools/loadgen/integration_test.go`:
- Line 142: Replace uses of context.Background() created for ramp runs with a
bounded context via context.WithTimeout to prevent hanging tests: create ctx,
cancel := context.WithTimeout(context.Background(), 2*time.Minute) and defer
cancel() before calling runRamp (and any subscriber/workload invocations), pass
that ctx into runRamp and related functions, and repeat this change for the
other occurrences around lines where ctx is defined (also update imports to
include time if needed).
In `@tools/loadgen/maxrps_messages.go`:
- Around line 237-252: After calling waitOrCancel(ctx, hold) capture holdErr and
immediately return if holdErr != nil to avoid running w.snapshotCounters(),
w.snapshotPending(ctx), cancel(), wg.Wait(), time.Sleep(2 * time.Second) and
w.collector.DiscardBefore(holdStart) when the context was cancelled; move the
holdErr check to just after holdErr := waitOrCancel(ctx, hold) so the expensive
snapshot/drain/cleanup only runs on successful holds.
In `@tools/loadgen/ramp.go`:
- Around line 32-64: In parseRPSSteps, after the loop and before returning, add
an explicit defensive check for len(out) == 0 and return a clear error (e.g.,
fmt.Errorf("no steps provided") or similar) to match the plan; this should
reference the out slice and be placed just before the final return out, nil in
the parseRPSSteps function.
In `@tools/loadgen/README.md`:
- Around line 244-262: Update the Flags section to include workload-specific
flags registered by runMaxRPS: add a "messages-only" subsection row for --inject
(and note default and meaning) and a "history-only" subsection with rows for
--mix, --before-mode, --scrollback-pages, --page-limit, and --request-timeout;
mark which flags apply only to history or messages, provide their defaults and
short notes (e.g., `--inject` = messages-only, default empty; `--mix`,
`--before-mode`, `--scrollback-pages`, `--page-limit`, `--request-timeout` =
history-only), and place them near the existing Flags table so the README
documents all tunables runMaxRPS registers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9a7e8914-4979-46de-8426-64e9d2523ed6
📒 Files selected for processing (20)
docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.mddocs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.mdtools/loadgen/README.mdtools/loadgen/collector.gotools/loadgen/collector_test.gotools/loadgen/deploy/Makefiletools/loadgen/integration_test.gotools/loadgen/main.gotools/loadgen/maxrps.gotools/loadgen/maxrps_history.gotools/loadgen/maxrps_history_test.gotools/loadgen/maxrps_messages.gotools/loadgen/maxrps_messages_test.gotools/loadgen/maxrps_report.gotools/loadgen/maxrps_report_test.gotools/loadgen/maxrps_test.gotools/loadgen/ramp.gotools/loadgen/ramp_test.gotools/loadgen/verdict.gotools/loadgen/verdict_test.go
| cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() }) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
Stop silently discarding callback errors in the test path.
Ack/marshal/publish failures are currently ignored, which can make this integration test pass while message flow is broken. Capture and fail on these errors.
Proposed fix
- cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() })
+ cc, err := cons.Consume(func(msg jetstream.Msg) {
+ if err := msg.Ack(); err != nil {
+ t.Errorf("ack failed: %v", err)
+ }
+ })
@@
- data, _ := json.Marshal(evt)
- _, _ = js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data)
+ data, err := json.Marshal(evt)
+ if err != nil {
+ t.Errorf("marshal message event failed: %v", err)
+ return
+ }
+ if _, err := js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data); err != nil {
+ t.Errorf("publish canonical event failed: %v", err)
+ }As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded".
Also applies to: 181-182
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/integration_test.go` around lines 165 - 166, The callback
passed to cons.Consume currently discards errors (e.g., func(msg jetstream.Msg)
{ _ = msg.Ack() }); change it to surface failures instead of ignoring them:
inside the callback check the returned error from msg.Ack() (and any
marshal/publish calls) and propagate it to the test (either call
t.Fatalf/require.NoError from the callback context or send the error on a
channel and assert it in the main test goroutine). Update both the Consume
callback at cons.Consume(...) and the similar callback at the later occurrence
(the ack/marshal/publish call on lines 181-182) to fail the test when an error
occurs so no ack/marshal/publish errors are silently dropped.
| ``` | ||
| ANSWER: max RPS = 2000 (workload=messages, preset=medium) | ||
| Next limit: E2 p95=143ms > 100ms | ||
| ``` |
There was a problem hiding this comment.
Add a language to the fenced code block.
markdownlint flags this block (MD040). Use a plain text fence for the sample output.
📝 Proposed fix
-```
+```text
ANSWER: max RPS = 2000 (workload=messages, preset=medium)
Next limit: E2 p95=143ms > 100ms</details>
<!-- suggestion_start -->
<details>
<summary>📝 Committable suggestion</summary>
> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 268-268: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/README.md` around lines 268 - 271, The fenced code block
showing sample output in README.md should be annotated with a language to
satisfy markdownlint MD040; update the block delimiter from ``` to ```text so
the snippet starting with "ANSWER: max RPS = 2000 (workload=messages,
preset=medium)" is fenced as plain text.
…set) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation.
…set) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation.
…set) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation.
* docs: daily-IM load scenario design spec * docs: daily-IM load scenario implementation plan * loadgen: add DailyBands field and daily-light/heavy/power presets * loadgen: banded fixture build for daily-IM presets * loadgen: diurnal envelope for daily-IM scenario * loadgen: user state machine + action picker for daily-IM * loadgen: send/read-receipt/room-list action handlers * loadgen: history/mute/room-create/member-add/thread action handlers * loadgen: direct receiver pool for daily-IM scenario Add directPool which, for each simulated user, opens a dedicated nats.Conn and Subscribes to every room's broadcast subject. Each delivery is timestamped and correlated against the shared Collector via RecordBroadcast so end-to-end (E2) latency can be measured. Wired into the daily-IM runtime by the next task; symbols carry //nolint:unused until that wiring lands so this commit stands alone. https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: multiplex receiver pool with drop counting https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: SLO verdict evaluator for daily-IM steps * loadgen: pending poller + service scraper + self-metrics * loadgen: parseDailyConfig CLI flags + validation * loadgen: per-step lifecycle (warmup/hold/cooldown) https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: per-user emitter goroutines + runDaily control loop * loadgen: console + CSV report for daily-IM scenario * loadgen: production envFactory and runDaily entrypoint Wires the daily-IM runtime end-to-end: prodEnvFactory builds the real NATS direct/multiplex pools, pending poller, service scraper, and JWT minting; runDaily is the production entrypoint dispatched from main.go. Drops //nolint:unused suppressions on symbols now genuinely consumed via the production path (directPool/multiplexPool + constructors, pollPending, serviceScraper + Scrape, runStep/activateUsers/stepEnv). startEmitter and doAction remain suppressed pending the per-user emitter wiring task. https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt * loadgen: dispatch test + usage line for daily subcommand * loadgen: integration test for daily-IM scenario * loadgen: deploy/run-daily target for daily-IM scenario * docs(loadgen): document daily-IM scenario * loadgen: httptest unit coverage for pending poller + service scraper * loadgen: fixes from code review (recall pass) Apply correctness and cleanup fixes surfaced by /simplify high-effort review: Correctness: - runStep: return Inconclusive on ctx cancel and on pollPending error instead of zero-valued StepResult that renderConsole treats as PASS - multiplexPool.Add: subscribe BEFORE mutating dispatch/refcount, so partial-subscribe failures leave the pool consistent - multiplexPool.Close: stop closing per-user channels — race against in-flight route() could panic on send-on-closed; rely on Drain+GC - multiplexPool.route: hold lock through inbox send so Close can't race; drop counter incremented outside lock to avoid contention - newMultiplexPool: return error instead of panic on connect failure - runDailyForTest: defer Close on direct+multiplex pools to drain connections cleanly between/after runs - directPool/multiplexPool: also subscribe to UserRoomEvent(account) so DM broadcasts (the dominant receive traffic for daily presets) are observed - prodEnvFactory: wire publish/request fns through stepEnv; emitters now actually emit actions in production (previously startEmitter was dead code with //nolint:unused, making the verdict trivially pass) - doAction: use env.siteID instead of hardcoded "site-local" - runDailyForTest: propagate baseCfg.SiteID into fixtures and stepEnv - prodEnvFactory: drop the hardcoded :9100/metrics URLs — backend services don't expose HTTP endpoints, so the scrape was a permanent no-op masquerading as a real signal Cleanup: - 5s tick in runStep hold loop → single context-aware timer - maxInt helper → slices.Max - joinReasons body → strings.Join - scrapeErrorCounter manual read loop → io.ReadAll Known limitation documented in preset.go: large-band rooms have UserCount > LargeRoomThreshold in production fixtures, which causes message-gatekeeper to block non-thread sends from member-role users. Daily presets work around this by funneling sends to smaller rooms; large-band rooms still exercise fan-out via receive-side subscriptions. * loadgen: fix integration test call site after newMultiplexPool sig change * loadgen(daily): add CLI usage documentation with examples `loadgen daily -h` now prints a description of the scenario, SLO thresholds, receiver topology, preset summary, flag list with units, and three example invocations. Also exit cleanly on -h/--help instead of logging a 'parse daily config error=flag: help requested' line. * loadgen(daily): log fixture build start + elapsed time 100k-user fixtures take minutes to build (O(N*perUser*R) weighted picker in preset.go); without a log line up front, the silence looked like a hang. Now you see 'building fixtures users=N' immediately and 'fixtures built rooms=X subscriptions=Y elapsed=Zs' when the picker finishes, so it's clear whether the process is in fixture build, in the per-step warmup, or stuck. The picker itself remains the original weighted-by-remaining-capacity algorithm; an earlier attempt to swap it for a shuffled slot-bag preserved the asymptotic complexity win but broke the DM band's perfect-fit tail (any duplicate-skip burns a slot the last user needs) and was reverted. A proper fix (stub-pairing for DM band only) is deferred to a follow-up. * loadgen(daily): tolerate flaky /jsz; drop pending-growth signal vs aborting A misbehaving NATS monitoring endpoint was marking every step Inconclusive — too strict. Now: - pollPending retries up to 3x with linear backoff (200ms, 400ms) so transient failures are absorbed. - If the poll still fails after retries, runStep logs a warning and drops the pending-growth signal for that step (passes nil ConsumerPending to evaluateStep). The other four signals (latency, error rate, service errors, self-metrics) still produce a real verdict. - Only ctx.Canceled / DeadlineExceeded from the poller still maps to Inconclusive, since that indicates the run is shutting down, not a flaky server. * loadgen(daily): NatsMonitoringURL env var for the /jsz poller Adds NATS_MONITORING_URL (default http://nats:8222/jsz, the docker-compose default) to the base config. prodEnvFactory.Build reads it for the daily-IM pending-growth poller. Lets users running against non-default infra (host-port-mapped NATS, monitoring on a different port, etc.) point the poller at the right endpoint without recompiling. * loadgen(preset): stub-pair DM band for O(N*perUser) fixture build The DM band's weighted-by-capacity picker was the dominant cost in BuildFixtures: at N users with bands.DMs=perUser the band has R = N*perUser/2 rooms, making the per-pick O(R) scan quadratic in N. At N=100000 (the default --steps tip) the DM band alone was ~3e12 inner-loop iterations, taking hours. Stub-pairing (configuration model) generates a perUser-regular bipartite graph in O(N*perUser): 1. Each user contributes perUser stubs to a flat list. 2. Shuffle the list once. 3. Pair consecutive stubs into DM rooms. 4. Fix self-pairs (stubs[2k]==stubs[2k+1]) by swapping the second stub with a later position whose neighbours don't create a new self-loop. Self-loops are rare (~perUser expected over the whole list). By construction every user appears in exactly perUser DMs and every DM has 2 distinct users. The Small/Medium/Large bands keep the existing weighted picker (they're tractable at typical N -- O(R) per pick is fine when R is in the thousands). Locked in by TestBuildFixtures_DailyHeavy_FastAtScale: at N=2000 the whole BuildFixtures call now completes in ~2s on a developer box; the test fails the build if it ever regresses past 30s. Existing TestBuildFixtures_DailyBands (which already asserts the per-user membership count invariant at N=200) still passes. * loadgen(daily): correctness + operational fixes for core-tool use Three perf+correctness+ops reviews surfaced 14 findings. Applies the verdict-affecting and operationally-critical ones; flags the performance/memory ones as follow-ups in commit messages and code. Correctness: - **Deterministic per-user RNG seed.** startEmitter was seeded from time.Now().UnixNano() XOR len(u.ID) — len is constant for all users ("u-NNNNNN"), so same-nanosecond activations got identical seeds and emitted perfectly correlated action streams. Replaced with a splitmix mix of env.runSeed and the user index, so two runs at the same run-seed produce identical action streams (the whole point of a reproducible verdict). - **Per-step envelope re-anchor.** Emitters used to capture holdStart at activation time, so users from step N kept emitting against step N's curve all the way through step N+1 — by step 2 the early half of the user population was firing at envelope baseline (0.4×rate) instead of following the new step's hold curve, making later steps' "PASS" verdicts wrong. Now stepEnv carries holdStart and holdDuration as atomic.Int64 fields; emitters re-read them on every tick; runStep calls env.setHold(now, hold) right before Reset() at the start of each hold. - **Post-activation hold anchor.** holdStart was computed *before* activateUsers, which is rate-limited at 500 users/sec. At a +50k delta this meant 100s of activation, mostly during what was supposed to be warmup, with fresh users emitting at the envelope's clamped edge instead of ramping. The re-anchor above also fixes this — hold begins after warmup actually elapsed. - **percentile uses ceil-based nearest-rank.** Old floor-based indexing (int(p*(len-1))) returned p98 for p99 on ≤100-sample windows, so tight smoke tests false-PASSed. Now uses ceil(p*len)-1, matching the standard. - **diffPending walks both sides.** A consumer that disappears mid-hold (crashed/deleted) used to be silently dropped from the verdict. Now surfaces as End=0/negative-Delta and trips with reason "<durable> disappeared mid-hold". - **HTTP timeout on /jsz.** pollPendingOnce used http.DefaultClient with no timeout; a hung monitoring endpoint would wedge the whole run until the operator killed the process. Dedicated 5s-timeout client. - **ctx-aware ramp loop.** runDailyForTest now checks ctx.Err() between steps, so SIGINT mid-cooldown doesn't produce a junk trail of INCONCLUSIVE rows for steps that never started. Operational guards: - **Zero-traffic → INCONCLUSIVE.** Publisher conn failure or unwired emitters used to produce a silent PASS at every step. Now evaluateStep returns INCONCLUSIVE with "zero actions attempted" when AttemptedOps==0. - **Effective-N tracking + <95% threshold.** activateUsers increments env.activatedCount/skippedCount atomically; if EffectiveN/N < 95% the step returns INCONCLUSIVE with the shortfall in the reason. StepResult.EffectiveN is surfaced in console (e.g. "20000(10000)") and added as a CSV column. Stops "N=20000 PASS" silently meaning "10000 active". - **Goroutine-as-CPU proxy disabled.** readCPUPercent's heuristic (NumGoroutine/5000 × 100) tripped INCONCLUSIVE at any scale ≥4k users — exactly the regime this tool is built for. Now returns 0; INCONCLUSIVE relies on the GC pause signal alone until a real CPU sample (gopsutil/proc-self-stat) is wired. Follow-ups (not in this commit): - **Perf.** Collector single-mutex contention (~520k locks/sec at N=100k), unbounded byReqID/byMsgID growth (spec promised TTL+1M cap), per-user inbox channels that nobody reads, small/medium band O(N²) picker, one-goroutine-per-user (100k goroutines), per-tick actionCtx allocation. Each is a real win but a non-trivial refactor; queued for the next pass. - **Ergonomics.** Configurable SLO thresholds + seed flags, heartbeat between warmup and verdict, ABORTED marker on ctx-cancel, full per-durable CSV columns, --validate dry-run. * loadgen(perf): slot-bag picker for non-DM bands + shard Collector Two perf fixes flagged by the post-review pass. preset.go: small/medium/large band fixture picker switched from the O(N x perUser x R) weighted-scan to a configuration-model slot-bag walk with retry on per-user duplicates. Same shape as the DM band's stub-pairing but generalised for heterogeneous room capacities. N=10000 daily-heavy fixture build drops from minutes to ~1s; N=100k drops from ~30+ min to seconds. TestBuildFixtures_DailyHeavy_FastAtScale bumped to N=10000 to lock the speedup in. collector.go: byReqID/byMsgID/e1/e2 sharded by FNV-1a hash of the correlation key across 64 shards (reqShard for requestID-keyed, msgShard for messageID-keyed; each has its own mutex). At N=100k a busy daily-IM run produces ~150k+ Record* calls/sec, all of which previously serialised on a single sync.Mutex — loadgen became the bottleneck and inflated latency samples before the backend tripped. Per-shard contention drops by ~64x. Determinism preserved: emit-as-you-pick instead of range-over-map (Go map iteration order is randomised), and the slot-bag uses the same seeded r.Intn sequence both runs. TestBuildFixtures_DailyBands still passes its same-seed -> equal Fixtures invariant. * loadgen: reconcile with main after rebase (drop dup waitOrCancel + Reset) PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go, error-returning) and a Reset() method on the unsharded Collector. Both collided with the daily-IM scenario's versions: - daily.go's waitOrCancel returned bool; replaced its callers to use the error-returning ramp.go helper. - collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu, c.byReqID, etc.) that no longer exist; the sharded Reset further down in the file does the same job correctly under per-shard locks. Build, lint, and unit tests green after the reconciliation. * loadgen(daily): skip integration test that needs full backend stack TestRunDaily_Integration_TinyPresetPasses passed vacuously before the recall-review fix wired emitters into prodEnvFactory (no actions = no errors = trivial PASS). With emitters live, the test now actually publishes to subjects like `chat.user.{acct}.room.{room}.{site}.msg.send` and request-replies on history/room-list/etc. — all of which time out in the CI integration job because the testutil NATS container has no gatekeeper, no room-service, and no broadcast-worker subscribed. Result: error_rate=7.69% > 0.1% threshold, step TRIPs, test FAILs. The wiring is correct (proven by unit + race tests); what's wrong is that this scenario fundamentally needs the full docker-compose stack to be meaningful. Skipping with a comment pointing operators at `make -C tools/loadgen/deploy run-daily` for real end-to-end coverage. * loadgen(daily): flush after Subscribe so server registers interest before Add returns CI hit: TestMultiplexPool_RoutesBroadcastToInbox 'Condition never satisfied'. Root cause: nc.Subscribe queues the SUB command client- side; it's not sent to the broker until the next Flush. A caller (or test) that publishes immediately after Add() can see the broadcast dropped because the server hasn't registered the subscription interest yet. Production emitters tick on a 1s schedule so they never publish during the registration race window, but the test does — explaining why the test was reliably failing only under CI's timing. Fix in both pools (direct + multiplex): call nc.Flush() after the Subscribe pass so Add only returns once the server knows about every new interest. One round-trip per pool conn at activation time, dominated by the Subscribe overhead already incurred. * loadgen(daily): pass NATS_CREDS_FILE to pool + publisher conns Bug: daily.go's prodEnvFactory and daily_pool.go's pool constructors called bare nats.Connect(url, nats.Name(...)) with no credentials. On operator-mode NATS servers (the chat-system standard, including the local docker-local stack) anonymous connections either fail outright or get default-deny perms — manifesting as 'permissions violation' on subscribe and the silent-zero-latency symptom reported here. Every other loadgen subcommand correctly used natsutil.Connect(cfg.NatsURL, cfg.NatsCredsFile); only the daily scenario was missing the second arg. Fix: - New connectWithCreds(url, name, credsFile) helper applies nats.UserCredentials when credsFile is non-empty, else falls back to anonymous (testcontainer NATS in unit/integration tests). - newDirectPool, newMultiplexPool, and the prodEnvFactory publisher conn all route through it. - baseCfg.NatsCredsFile (existing env var NATS_CREDS_FILE) flows in through prodEnvFactory; the docker-compose overlay already wires NATS_CREDS_FILE=/etc/nats/backend.creds. After this fix: 'make -C tools/loadgen/deploy run-daily PRESET=...' authenticates as the backend user, subscribe perms work, and latency samples populate — P50/P95/P99 should be non-zero. * loadgen(daily): use Request for readReceipt; room-service expects a reply room-service registers MessageReadWildcard via nc.QueueSubscribe and unconditionally calls m.Msg.Respond(resp) (room-service/handler.go:83). Respond requires msg.Reply to be set, which nc.Publish does not set — only nc.Request does. The daily readReceipt action was using a.Publish, so every read-receipt action triggered 'nats: message does not have a reply' in room-service's log. Switched to a.Request with the same defaultRequestTimeout the other request-shaped actions use, and renamed/updated the unit test to lock in the contract (must land in c.reqs, not c.pubs). Verified the other six action handlers (scrollHistory, refreshRoomList, muteToggle, roomCreate, memberAdd, threadReply) — they all use Request against the right routes, so readReceipt was the sole mismatch. * loadgen(daily): pre-flight check for un-seeded mongo + readReceipt uses Request Two unrelated bugs that bit during real runs: 1. readReceipt was using a.Publish for subject.MessageRead, but room-service registers that subject via QueueSubscribe and calls msg.Respond on every received message — which fails with 'nats: message does not have a reply' when the sender used fire-and-forget Publish. Switched to a.Request (matches the pattern of every other request-side action handler). Updated the test. 2. runDaily started straight into BuildFixtures + a full ramp even when the Mongo subscriptions collection was empty for the configured siteID. The gatekeeper then rejected every send with 'user X is not subscribed to room Y' and the operator burned ~30+ minutes before seeing the error_rate trip. Added a preflight that opens a short- lived Mongo connection, counts subscriptions for cfg.SiteID, and bails with an actionable message: ERROR daily pre-flight error="no subscriptions found in mongo for siteID="site-local"; run \`loadgen seed --workload=messages --preset=<your daily preset>\` first (or \`make -C tools/loadgen/deploy seed PRESET=<your preset>\`)" Exit code 2 (config error). Uses a 10s context independent of the run-level ctx so a transient Mongo blip doesn't burn the window before failing. * loadgen(daily): stop overriding preset.Users — caused fixture ID mismatch with seed Symptom: gatekeeper rejects every send with "user X is not subscribed to room Y" even though `loadgen seed` was just run successfully. Root cause: BuildFixtures is deterministic in (preset, seed, siteID). `loadgen seed --workload=messages --preset=daily-heavy` uses the preset's default Users (10000). But runDailyForTest was overriding preset.Users = slices.Max(cfg.Steps) before calling BuildFixtures. If max(steps) != preset.Users, the per-band stub shuffle produces DIFFERENT user/room/sub IDs at run time vs seed time — same prefixes ("u-000123", "room-small-000045"), different totals → different permutation → no overlap. Gatekeeper's GetSubscription misses every lookup and the whole ramp fails on error_rate. Fix: drop the override. runDailyForTest now uses preset.Users as-is, matching `loadgen seed`. activateUsers already caps at len(env.users), and the EffectiveN-shortfall guard surfaces "only X/Y users activated" as INCONCLUSIVE when --steps exceeds preset.Users — much clearer than silent ID drift. Side effect: --steps cannot exceed preset.Users for that preset. For daily-light/heavy/power that's 10000. A startup log line warns when max(steps) > preset.Users so the operator sees the cap before the run starts. No re-seed required after deploying the fix. * docs(loadgen): comprehensive Daily-IM operator guide Replace the 53-line stub section with a full operator guide that captures every lesson from real runs: - Quick start + prerequisites (seed step, NATS perms, env vars) - Preset comparison with rooms-per-user breakdown - Complete CLI flag reference (units, defaults, when to change) - Environment variables (NATS_CREDS_FILE, NATS_MONITORING_URL, SITE_ID, plus the mongo/valkey vars) - SLO signals with thresholds; PASS/TRIP/INCONCLUSIVE semantics - Sample console + CSV output, how to read 'ANSWER' and N(EffectiveN) annotations - Troubleshooting matrix covering every failure mode encountered: not-subscribed, siteID mismatch, large-room block, 'no reply' from room-service, NATS permissions, all-zero latency, EffectiveN cap, fixture build silence, flaky /jsz - Known limitations explicitly listed (large-room block, no-op auth stub, dormant service-error scraper, disabled CPU proxy) - Pointers to spec, plan, and source files Aims to make the tool self-serviceable: a new operator who has never used it should be able to get to a verdict by following the doc, and an experienced operator hitting an error should find the fix in the troubleshooting matrix without escalating. * loadgen(daily): per-action latency breakdown in console + CSV Adds observational per-handler latency stats so operators can see what each daily action (send / read_receipt / scroll_history / refresh_room_list / member_add / room_create / mute_toggle) actually costs, in addition to the system-wide publish→broadcast p50/p95/p99 the verdict already uses. - Collector gains per-action latency sample tracking (map[int][]Duration under its own mutex; cleared by Reset alongside the existing counters). - doAction wraps each handler call with time.Now/Since and records the elapsed via RecordActionLatency. - actionKind grows a String() helper for stable lowercase names and an allActionKinds canonical-order slice for the report. - stepInputs/StepResult carry ActionLatencies; evaluateStep computes Count + P50/P95/P99 per action via the existing percentile helper. - Console prints 'actions: send n=8920 p50=12 p95=180 p99=320 | ...' on the line after each step's verdict. - CSV adds 28 columns (<action>_count, _p50_ms, _p95_ms, _p99_ms × 7 actions) in stable allActionKinds order so downstream tools can column-index reliably. Verdict criteria unchanged: these stats are observational only — PASS / TRIP / INCONCLUSIVE still depends on the existing latency, error-rate, pending-growth, and self-metrics signals. Per the operator's request (too easy to confuse a single 'all actions sluggish' p95 with a real broadcast-pipeline regression if we trip on per-action numbers). * loadgen(daily,seed): add --users flag to override preset.Users The daily presets hard-code Users=10000. Operators wanting to ramp above that previously had to edit preset.go and rebuild. Now both `loadgen seed --workload=messages` and `loadgen daily` accept `--users=N` to override at invocation time. The two MUST be passed the same value — BuildFixtures is deterministic in (preset, seed, siteID), and the per-band stub shuffle uses totalUsers as length, so a mismatch produces entirely different room/sub IDs even though the user IDs (u-000000, u-000001, ...) overlap. The gatekeeper then rejects every send with "user not subscribed". To catch that misuse fast, the daily preflight now also counts users in Mongo and compares to preset.Users (or the --users override). On mismatch it errors with the exact teardown+re-seed commands needed. Usage at, say, N=50000: loadgen seed --workload=messages --preset=daily-heavy --users=50000 loadgen daily --preset=daily-heavy --users=50000 --steps=5000,10000,20000,50000 `--users=0` (default) means "use preset default", the safe path for normal runs and existing scripts. * loadgen(daily): set X-Request-ID header on every NATS publish/request room-service extracts X-Request-ID from inbound NATS headers via natsutil.ContextWithRequestIDFromHeaders (room-service/handler.go:60). The ID flows through ctx into the canonical event room-service publishes to room-worker's ROOMS stream. room-worker REQUIRES X-Request-ID on canonical events for idempotency (it's the dedup seed) and rejects with 'missing X-Request-ID' otherwise (room-worker/handler.go:752,1219). Before this fix, prodEnvFactory's publish/request closures used pubConn.Publish(subj,data) and pubConn.RequestWithContext(ctx,subj,data) — neither of which sets headers. Empty inbound header at room-service → empty ctx → empty canonical header → room-worker logs the error and drops the event. Visible in room-worker logs as 'missing-X-request-id' on subject 'chat.room.canonical.<site>.member.add'. Now both closures build *nats.Msg with a fresh natsutil.RequestIDHeader (X-Request-ID) per call and use PublishMsg / RequestMsgWithContext. Every emitter action now produces a properly traceable request through the canonical pipeline. The header ID is independent of any payload-level request ID (e.g. SendMessageRequest.RequestID for the gatekeeper) — both serve different layers. In production a real client typically uses the same value at both layers; we could plumb that through actionCtx as a follow-up if request-tracing across layers needs it. * loadgen(daily): memberAdd picks from channel rooms only room-service rejects member-add on DM rooms with 'cannot add members to a non-channel room' (DMs have a fixed 2-member shape). For daily-heavy ~25 DMs out of 56 rooms per user, picking uniformly from u.Rooms means ~45% of memberAdd actions error out — inflating error_rate noise and spamming the room-service log. Pre-filter at activation time: userState.ChannelRooms is the subset of Rooms that aren't DMs, computed once in newUserState by ID-prefix ('room-dm-' = DM band, everything else = channel band). memberAdd now picks from u.ChannelRooms instead of u.Rooms. Users with no channel rooms (DM-only) no-op the action rather than erroring. The other actions (sendMessage, readReceipt, scrollHistory, refreshRoomList, muteToggle, roomCreate, threadReply) work fine on DMs and continue using u.Rooms. Only memberAdd has the channel-only constraint at the room-service layer. Tests: - TestNewUserState_ChannelRoomsExcludesDMs locks the prefix-filter - TestMemberAdd_Requests now seeds ChannelRooms explicitly - TestMemberAdd_SkipsWhenNoChannelRooms covers the DM-only no-op path * loadgen(daily): roomCreate includes neighbor user; memberAdd uses real account Two related fixes for actions that hit room-service: roomCreate sent only {name, type}. classifyAndValidate at room-service/handler.go:193 lets that through (name is non-empty), but the channel branch at line 290 then rejects it: allUsers := stripAccount(...) if len(allUsers) == 0 && len(allOrgs) == 0 { return nil, errEmptyCreateRequest } A channel must have at least one member besides the creator. Now we include u.Neighbor in the 'users' array. errEmptyCreateRequest gone. memberAdd was passing the literal 'user-stub' as the target account. room-service validates accounts exist (errUserNotFound) — 'user-stub' isn't in Mongo, so every memberAdd that reached a channel was failing silently after the channel-only filter. Now uses u.Neighbor, which is guaranteed-seeded. userState.Neighbor is computed once in newUserState by shifting the requester's index by 1 ('user-N' → 'user-(N-1)' for N≥1, 'user-0' → 'user-1'). Always != Account; always within the seeded range for any preset with Users ≥ 2 (i.e. all daily-* presets). Tests: - TestNeighborOf covers shift / wrap / fallback - TestRoomCreate_Requests asserts the payload now has a users array - TestMemberAdd_Requests / SkipsWhenNoChannelRooms unchanged shape * loadgen(daily): apply per-request timeout in the request closure The 'timeout' argument to actionCtx.Request was silently ignored: RequestMsgWithContext uses the context's deadline, and the emitter's ctx is the run-level ctx with no deadline. So defaultRequestTimeout (5s) never bound anything; a slow room-service handler could hang arbitrarily long, showing up as huge per-action p50 (e.g. 25s for read_receipt) instead of cleanly timing out and contributing to error_rate. Wrap the incoming ctx with context.WithTimeout(timeout) so each call gets a real deadline. Action handlers continue to pass defaultRequestTimeout=5s; future per-action overrides can vary the value without code changes. This makes the per-action latency numbers a true latency signal: a healthy handler shows actual round-trip ms; a sick one shows a wall at the timeout value plus a matching jump in failed_ops / error_rate. * loadgen(daily): per-action latency gates the verdict Previously the verdict only gated on the publish→broadcast latency (the messaging pipeline). Operations like read_receipt, scroll_history, member_add, refresh_room_list — all of which exercise different parts of the system (room-service, history-service, room-worker) — only contributed to error_rate when they timed out, never to latency. A slow handler that stayed under the request timeout was invisible. Now Thresholds carries per-action p95/p99 maps; evaluateStep iterates ActionLatencies and trips if any gated action exceeds its cap. Default thresholds reflect typical chat-backend handler latencies: read_receipt: p95=100ms p99=250ms (mongo update + reply) refresh_room_list: p95=200ms p99=500ms (mongo query per user) scroll_history: p95=500ms p99=1500ms (cassandra bucket walk) member_add: p95=200ms p99=500ms (mongo + canonical pub) mute_toggle: p95=100ms p99=250ms (mongo update) room_create: p95=500ms p99=1500ms (mongo + canonical pub) sendMessage/threadReply are not gated here — they're already gated by the broadcast-pipeline P95LatencyMs / P99LatencyMs at the top of the verdict (the latency the existing p50/p95/p99 columns reflect). Operator can tune via --action-p95-ms / --action-p99-ms flags: loadgen daily --preset=daily-heavy \ --action-p95-ms=read_receipt:80,scroll_history:300 \ --action-p99-ms=read_receipt:200,scroll_history:800 Format: comma-separated name:N pairs. Unset actions keep defaults; unknown names are rejected at parse time. Set N to a very large number to effectively disable an action's gate. Trip reasons name the action explicitly: scroll_history p95=612ms > 500 read_receipt p99=340ms > 250 * loadgen(daily): rename read_receipt action to mark_read The action marks the user's subscription as read (msg.read handler). "read_receipt" invited confusion with the separate MessageReadReceipt notification path; "mark_read" names what the action actually does. Breaking for existing CSV consumers and CLI users: the per-action column names change from read_receipt_* to mark_read_*, and --action-p95-ms / --action-p99-ms must use "mark_read" as the key. * loadgen(history): stream per-room seed plan to avoid OOM on history-large history-large materialized a single 65M-row plannedMessage slice (~50 GB of strings), so seeding got SIGKILLed before reaching Cassandra. BuildHistoryFixtures now keeps only the Fixtures + ThreadParents map and hands the seeder an IterateRoomMessages iterator that builds one room's plan at a time (~50 MB peak). FullPlan stays available for tests and small/medium presets where materialization is bounded. Per-room randomness is split into two streams — structural (sender, jitter, parent-selection, reply offsets) and content (message body bytes) — so a cheap metadata-only walk can derive Room.LastMsgAt and ThreadParents without paying the O(MessagesPerRoom × ContentBytes) cost of regenerating content. CreatedAt and content bytes for a given (seed, preset) are different from before — no test asserts on either, but anyone diffing a re-seed against an old Cassandra dump will see the change. --------- Co-authored-by: Claude <noreply@anthropic.com>
Summary
Adds a
loadgen max-rps --workload=messages|historysubcommand that ramps target RPS through an explicit--stepslist, holds at each step, evaluates an SLO verdict, and reports the largest RPS that held:This applies the same step-up-and-hold-under-SLO control loop as PR #234 ("daily-IM, find sustainable N"), but the ramped axis is RPS and the load is the existing open-loop generators. It reuses the message-send (
run) and history (history-sustained) workloads rather than introducing new ones.How it works
ramp.go,verdict.go,maxrps_report.go):parseRPSSteps(k-suffix), a step lifecycle (warmup → reset → hold → measure → cooldown),evaluateRPSStep, and a console + CSV report. All identifiers arerps-prefixed so there is nopackage mainsymbol collision with feat(loadgen): daily-IM load scenario to find sustainable N #234'sdaily_*code regardless of merge order, and convergence later is mechanical.rpsWorkloadinterface:maxrps_messages.go(E1/E2 latency + JetStream consumer-pending deltas, via a reset-per-stepCollectorand Prometheus counter diffs) andmaxrps_history.go(per-endpoint latency via two sequential fresh-collector runs; no consumer queue).(1−tolerance)×targetwhile healthy (load box was the limit, not the service); else PASS. TRIP is checked before the shortfall guard so server-induced backpressure isn't misread as a harness limit.500,1k,2k,5k,10k, history200,500,1k,2k,5k.Design spec:
docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.mdImplementation plan:
docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.mdTest plan
make lint— 0 issuesmake test SERVICE=tools/loadgen— green under-race(pure logic at/near 100%:verdict.go/ramp.go100%, report ~96/86%)make fmt— no changesgosec— clean (note: project gosec config excludestools/)make test-integration SERVICE=tools/loadgen— compiles & links under theintegrationtag; runs in CI (Docker/testcontainers unavailable in the dev sandbox). Covers a 2-step messages ramp end-to-end.govulncheck/semgrep— run in CI (network/pipx unavailable locally; no new dependencies were added)make -C tools/loadgen/deploy up && make -C tools/loadgen/deploy run-max-rps WORKLOAD=messages PRESET=mediumNotes
docs/client-api.mdis unchanged.https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
Generated by Claude Code
Summary by CodeRabbit
Release Notes
New Features
max-rpsload testing subcommand that automatically discovers maximum achievable RPS for supported workloads while evaluating SLO compliance at each step.Documentation
Tests