Skip to content

feat(loadgen): max-rps SLO finder for message-send & history#240

Merged
mliu33 merged 16 commits into
mainfrom
claude/max-rps-slo-loadgen-ajKRi
Jun 1, 2026
Merged

feat(loadgen): max-rps SLO finder for message-send & history#240
mliu33 merged 16 commits into
mainfrom
claude/max-rps-slo-loadgen-ajKRi

Conversation

@hmchangw
Copy link
Copy Markdown
Owner

@hmchangw hmchangw commented May 28, 2026

Summary

Adds a loadgen max-rps --workload=messages|history subcommand that ramps target RPS through an explicit --steps list, holds at each step, evaluates an SLO verdict, and reports the largest RPS that held:

ANSWER: max RPS = 2000 (workload=messages, preset=medium)
        Next limit: E2 p95=143ms > 100ms

This applies the same step-up-and-hold-under-SLO control loop as PR #234 ("daily-IM, find sustainable N"), but the ramped axis is RPS and the load is the existing open-loop generators. It reuses the message-send (run) and history (history-sustained) workloads rather than introducing new ones.

How it works

  • Generic engine (ramp.go, verdict.go, maxrps_report.go): parseRPSSteps (k-suffix), a step lifecycle (warmup → reset → hold → measure → cooldown), evaluateRPSStep, and a console + CSV report. All identifiers are rps-prefixed so there is no package main symbol collision with feat(loadgen): daily-IM load scenario to find sustainable N #234's daily_* code regardless of merge order, and convergence later is mechanical.
  • Per-workload adapters behind a small rpsWorkload interface: maxrps_messages.go (E1/E2 latency + JetStream consumer-pending deltas, via a reset-per-step Collector and Prometheus counter diffs) and maxrps_history.go (per-endpoint latency via two sequential fresh-collector runs; no consumer queue).
  • Verdict precedence (differs from feat(loadgen): daily-IM load scenario to find sustainable N #234, deliberately): explicit measurement-failure → INCONCLUSIVE; else TRIP if any gated latency p95/p99, error rate, or consumer-pending-growth exceeds threshold; else INCONCLUSIVE if achieved RPS fell below (1−tolerance)×target while healthy (load box was the limit, not the service); else PASS. TRIP is checked before the shortfall guard so server-induced backpressure isn't misread as a harness limit.
  • Defaults (all flag-tunable): p95≤100ms, p99≤250ms, error-rate≤0.1%, pending-growth≤+1000, rate-tolerance 5%, warmup 10s / hold 30s / cooldown 5s, stop-on-trip on. Default steps: messages 500,1k,2k,5k,10k, history 200,500,1k,2k,5k.

Design spec: docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md
Implementation plan: docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md

Test plan

  • make lint — 0 issues
  • make test SERVICE=tools/loadgen — green under -race (pure logic at/near 100%: verdict.go/ramp.go 100%, report ~96/86%)
  • make fmt — no changes
  • gosec — clean (note: project gosec config excludes tools/)
  • make test-integration SERVICE=tools/loadgen — compiles & links under the integration tag; runs in CI (Docker/testcontainers unavailable in the dev sandbox). Covers a 2-step messages ramp end-to-end.
  • govulncheck / semgreprun in CI (network/pipx unavailable locally; no new dependencies were added)
  • Manual: make -C tools/loadgen/deploy up && make -C tools/loadgen/deploy run-max-rps WORKLOAD=messages PRESET=medium

Notes

  • Tooling-only change — no client-facing handler touched, so docs/client-api.md is unchanged.
  • Out of scope (per spec §9): binary-search refinement, auto-geometric ramp, members workload, gopsutil box-health.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha


Generated by Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added max-rps load testing subcommand that automatically discovers maximum achievable RPS for supported workloads while evaluating SLO compliance at each step.
    • Configurable workload targets (messages, history) with step-based RPS ramping.
    • Verdict reporting (PASS/TRIP/INCONCLUSIVE) with CSV export capability.
  • Documentation

    • Added comprehensive usage guide with examples and CLI flag reference.
    • Included interpretation guidance for test results.
  • Tests

    • Added unit and integration tests for RPS evaluation and reporting logic.

Review Change Stack

claude added 16 commits May 28, 2026 15:50
Mirrors PR #234's step-up-and-hold-under-SLO control loop, ramping target
RPS instead of user count N, to auto-find the max sustainable RPS for the
messages and history loadgen workloads. Standalone on main with
rps-prefixed identifiers so it neither depends on nor collides with #234.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
Task-by-task TDD plan: generic ramp engine + verdict + report, reusing
the existing message-send and history open-loop generators, exposed as a
max-rps subcommand. rps-prefixed identifiers avoid collision with PR #234.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
…tor error, log gather error

- Add shutdownSrv helper and call it on all three subscription-error return
  paths so the metrics HTTP server is not leaked when construction fails.
- Log the Gather() error in snapshotCounters instead of silently discarding it.
- Strengthen TestDiffCounters to assert zero-delta for marshal/gatekeeper/bad_reply.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
Correct the README "Reading the output" section (INCONCLUSIVE does not
stop the ramp and is not a pass; fix the no-pass ANSWER string), validate
--page-limit/--request-timeout > 0, and document the deliberate
straggler-exclusion in the messages hold-window counter snapshot.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

This pull request introduces loadgen max-rps, a complete tooling feature that discovers the maximum RPS achievable for messages and history workloads under SLO constraints. The implementation includes a workload-agnostic ramp control engine, per-workload adapters, SLO-based verdict classification (PASS/TRIP/INCONCLUSIVE), and both tabular and CSV reporting.

Changes

loadgen max-rps feature

Layer / File(s) Summary
Design & specification documentation
docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md, docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md
Plans and formal specification documents defining the max-RPS feature: workload-agnostic ramp engine architecture, per-workload adapter seam, CLI surface, verdict precedence rules (latency/error/pending thresholds → TRIP, then achieved-rate shortfall → INCONCLUSIVE), per-step lifecycle with warmup/hold/cooldown, reporting outputs (console + CSV), test coverage requirements (unit/integration), and non-code deliverables (README and Makefile).
RPS verdict evaluation engine
tools/loadgen/verdict.go, tools/loadgen/verdict_test.go
Core classification logic that evaluates each RPS step into PASS/TRIP/INCONCLUSIVE by checking latency percentiles (p95/p99), error rate, pending-growth, and achieved-rate-vs-target thresholds in strict precedence order; includes comprehensive table-driven tests covering boundary conditions, latency overrides, and verdict reasons.
Ramp control engine & orchestration
tools/loadgen/ramp.go, tools/loadgen/ramp_test.go
Orchestrates per-step RPS ramp execution: parses step sequences with k-suffix scaling and validates ascending order, manages warmup/hold/cooldown timing, evaluates each step via verdict logic, supports stop-on-trip or run-all modes, and handles context cancellation; includes tests for parsing constraints, ramp semantics, and exit code mapping.
Report generation & output formatting
tools/loadgen/maxrps_report.go, tools/loadgen/maxrps_report_test.go
Generates human-readable per-step tables (target/achieved RPS, p95/p99 per latency series, error rate, worst pending, verdict) and CSV rows; emits "ANSWER" line with the highest passing RPS, "Next limit" guidance when a trip occurs, and handles multi-series alignment with zero-filling for missing data.
Collector.Reset() for per-step measurement isolation
tools/loadgen/collector.go, tools/loadgen/collector_test.go
Adds a Reset() method that clears correlation state and latency samples while preserving active subscriptions, enabling clean per-step hold-window measurements without subscription recreation.
Messages workload adapter
tools/loadgen/maxrps_messages.go, tools/loadgen/maxrps_messages_test.go
Implements messages-specific RPS step execution: snapshots metric counters (published/error reasons), queries JetStream consumer pending counts, manages per-step generator lifecycle with collector reset at hold boundaries, and normalizes deltas/samples into rpsStepInputs; includes pure-logic tests for counter delta computation and input normalization.
History workload adapter
tools/loadgen/maxrps_history.go, tools/loadgen/maxrps_history_test.go
Implements history-specific RPS step execution: runs sequential warmup/hold generator phases, extracts latency samples into named series (history/thread), computes attempted/failed/saturation counts, returns normalized inputs; includes test validating sample counts and structure.
CLI control flow & main dispatcher
tools/loadgen/maxrps.go, tools/loadgen/maxrps_test.go, tools/loadgen/main.go
runMaxRPS command implementation: parses flags, applies workload-dependent default steps, validates workload-specific parameters, constructs and initializes the appropriate workload adapter, runs ramp orchestration, renders console report and optional CSV, returns exit code; dispatcher wires max-rps subcommand to CLI entry point.
Deployment wiring & documentation
tools/loadgen/README.md, tools/loadgen/deploy/Makefile
Makefile adds configurable WORKLOAD and STEPS variables plus run-max-rps target to invoke max-rps with parameterized flags; README documents the workflow, SLO signal gating, verdict interpretation, and full CLI flag reference.
End-to-end integration test
tools/loadgen/integration_test.go
Sets up JetStream stream with ack-only durable consumers, creates a fake gatekeeper for message injection, runs a two-step RPS ramp via messages workload, asserts non-trip verdicts and positive achieved metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hmchangw/chat#117: Extends the existing tools/loadgen/Collector by adding Collector.Reset() to support per-step ramp hold-window measurement logic.
  • hmchangw/chat#230: Implements the history workload adapter using the HistoryCollector/HistoryGenerator machinery introduced in that PR.
  • hmchangw/chat#208: The new TestMaxRPS_Messages_TwoStepRamp integration test is built on top of the loadgen integration test harness refactored in that PR.

Suggested reviewers

  • mliu33
  • ngangwar962

Poem

🐰 A ramp climbs high toward the RPS sky,
With verdicts passed and failures nigh,
Each step holds steady, SLOs guide,
Till max is found—the truth inside!
Reports emerge with answers clear,
The load has spoken, crystal-sheer. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat(loadgen): max-rps SLO finder for message-send & history' clearly and concisely summarizes the main change: adding a new max-rps subcommand that discovers maximum achievable RPS under SLO constraints for two specific workloads.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/max-rps-slo-loadgen-ajKRi

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
tools/loadgen/integration_test.go (1)

142-142: ⚡ Quick win

Use a bounded context for runRamp to avoid hanging integration runs.

Running the ramp on context.Background() can hang CI if a subscriber/workload path stalls. Wrap ramp execution in context.WithTimeout.

Proposed fix
-	ctx := context.Background()
+	ctx := context.Background()
@@
-	results := runRamp(ctx, w, &rampConfig{
+	rampCtx, cancel := context.WithTimeout(ctx, 20*time.Second)
+	defer cancel()
+	results := runRamp(rampCtx, w, &rampConfig{

Also applies to: 194-202

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/integration_test.go` at line 142, Replace uses of
context.Background() created for ramp runs with a bounded context via
context.WithTimeout to prevent hanging tests: create ctx, cancel :=
context.WithTimeout(context.Background(), 2*time.Minute) and defer cancel()
before calling runRamp (and any subscriber/workload invocations), pass that ctx
into runRamp and related functions, and repeat this change for the other
occurrences around lines where ctx is defined (also update imports to include
time if needed).
tools/loadgen/ramp.go (1)

32-64: 💤 Low value

Optional: Add explicit empty-steps check for consistency with plan.

The plan (line 597-599) includes a final check for len(out) == 0, but the implementation returns directly without it. While unreachable due to the "empty step" error at line 38-40, adding the explicit check would match the plan and serve as defensive documentation.

📋 Optional consistency fix
 		out = append(out, n)
 	}
+	if len(out) == 0 {
+		return nil, fmt.Errorf("no steps parsed from %q", s)
+	}
 	return out, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/ramp.go` around lines 32 - 64, In parseRPSSteps, after the loop
and before returning, add an explicit defensive check for len(out) == 0 and
return a clear error (e.g., fmt.Errorf("no steps provided") or similar) to match
the plan; this should reference the out slice and be placed just before the
final return out, nil in the parseRPSSteps function.
tools/loadgen/README.md (1)

244-262: 💤 Low value

Flags table omits workload-specific flags.

runMaxRPS also registers --inject (messages) and the history-only tunables (--mix, --before-mode, --scrollback-pages, --page-limit, --request-timeout). Consider documenting them so the reference is complete.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/README.md` around lines 244 - 262, Update the Flags section to
include workload-specific flags registered by runMaxRPS: add a "messages-only"
subsection row for --inject (and note default and meaning) and a "history-only"
subsection with rows for --mix, --before-mode, --scrollback-pages, --page-limit,
and --request-timeout; mark which flags apply only to history or messages,
provide their defaults and short notes (e.g., `--inject` = messages-only,
default empty; `--mix`, `--before-mode`, `--scrollback-pages`, `--page-limit`,
`--request-timeout` = history-only), and place them near the existing Flags
table so the README documents all tunables runMaxRPS registers.
tools/loadgen/maxrps_messages.go (1)

237-252: 💤 Low value

Return early on holdErr to skip the unnecessary drain/snapshot on cancellation.

When ctx is cancelled mid-hold, holdErr is non-nil but the code still snapshots end counters/pending (against the dead ctx) and incurs the fixed 2s drain before discarding the result. Checking holdErr right after the hold avoids the wasted work and shutdown delay.

♻️ Proposed reorder
 	holdErr := waitOrCancel(ctx, hold)
+	if holdErr != nil {
+		cancel()
+		wg.Wait()
+		return rpsStepInputs{}, holdErr
+	}
 
 	// Counters are snapshotted at hold-end, before the drain: gatekeeper/bad_reply
 	// errors whose reply lands during the drain are deliberately excluded (see the
 	// straggler-exclusion rationale on buildMessagesInputs). The drain only lets
 	// trailing E1/E2 latency samples settle for the percentile signals.
 	endCounts := w.snapshotCounters()
 	endPending, perr2 := w.snapshotPending(ctx)
 	cancel()
 	wg.Wait()
 	time.Sleep(2 * time.Second) // drain trailing replies/broadcasts
 	w.collector.DiscardBefore(holdStart)
-
-	if holdErr != nil {
-		return rpsStepInputs{}, holdErr
-	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_messages.go` around lines 237 - 252, After calling
waitOrCancel(ctx, hold) capture holdErr and immediately return if holdErr != nil
to avoid running w.snapshotCounters(), w.snapshotPending(ctx), cancel(),
wg.Wait(), time.Sleep(2 * time.Second) and w.collector.DiscardBefore(holdStart)
when the context was cancelled; move the holdErr check to just after holdErr :=
waitOrCancel(ctx, hold) so the expensive snapshot/drain/cleanup only runs on
successful holds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/loadgen/integration_test.go`:
- Around line 165-166: The callback passed to cons.Consume currently discards
errors (e.g., func(msg jetstream.Msg) { _ = msg.Ack() }); change it to surface
failures instead of ignoring them: inside the callback check the returned error
from msg.Ack() (and any marshal/publish calls) and propagate it to the test
(either call t.Fatalf/require.NoError from the callback context or send the
error on a channel and assert it in the main test goroutine). Update both the
Consume callback at cons.Consume(...) and the similar callback at the later
occurrence (the ack/marshal/publish call on lines 181-182) to fail the test when
an error occurs so no ack/marshal/publish errors are silently dropped.

In `@tools/loadgen/README.md`:
- Around line 268-271: The fenced code block showing sample output in README.md
should be annotated with a language to satisfy markdownlint MD040; update the
block delimiter from ``` to ```text so the snippet starting with "ANSWER: max
RPS = 2000 (workload=messages, preset=medium)" is fenced as plain text.

---

Nitpick comments:
In `@tools/loadgen/integration_test.go`:
- Line 142: Replace uses of context.Background() created for ramp runs with a
bounded context via context.WithTimeout to prevent hanging tests: create ctx,
cancel := context.WithTimeout(context.Background(), 2*time.Minute) and defer
cancel() before calling runRamp (and any subscriber/workload invocations), pass
that ctx into runRamp and related functions, and repeat this change for the
other occurrences around lines where ctx is defined (also update imports to
include time if needed).

In `@tools/loadgen/maxrps_messages.go`:
- Around line 237-252: After calling waitOrCancel(ctx, hold) capture holdErr and
immediately return if holdErr != nil to avoid running w.snapshotCounters(),
w.snapshotPending(ctx), cancel(), wg.Wait(), time.Sleep(2 * time.Second) and
w.collector.DiscardBefore(holdStart) when the context was cancelled; move the
holdErr check to just after holdErr := waitOrCancel(ctx, hold) so the expensive
snapshot/drain/cleanup only runs on successful holds.

In `@tools/loadgen/ramp.go`:
- Around line 32-64: In parseRPSSteps, after the loop and before returning, add
an explicit defensive check for len(out) == 0 and return a clear error (e.g.,
fmt.Errorf("no steps provided") or similar) to match the plan; this should
reference the out slice and be placed just before the final return out, nil in
the parseRPSSteps function.

In `@tools/loadgen/README.md`:
- Around line 244-262: Update the Flags section to include workload-specific
flags registered by runMaxRPS: add a "messages-only" subsection row for --inject
(and note default and meaning) and a "history-only" subsection with rows for
--mix, --before-mode, --scrollback-pages, --page-limit, and --request-timeout;
mark which flags apply only to history or messages, provide their defaults and
short notes (e.g., `--inject` = messages-only, default empty; `--mix`,
`--before-mode`, `--scrollback-pages`, `--page-limit`, `--request-timeout` =
history-only), and place them near the existing Flags table so the README
documents all tunables runMaxRPS registers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a7e8914-4979-46de-8426-64e9d2523ed6

📥 Commits

Reviewing files that changed from the base of the PR and between ad10203 and bd2a863.

📒 Files selected for processing (20)
  • docs/superpowers/plans/2026-05-28-max-rps-slo-loadgen.md
  • docs/superpowers/specs/2026-05-28-max-rps-slo-loadgen-design.md
  • tools/loadgen/README.md
  • tools/loadgen/collector.go
  • tools/loadgen/collector_test.go
  • tools/loadgen/deploy/Makefile
  • tools/loadgen/integration_test.go
  • tools/loadgen/main.go
  • tools/loadgen/maxrps.go
  • tools/loadgen/maxrps_history.go
  • tools/loadgen/maxrps_history_test.go
  • tools/loadgen/maxrps_messages.go
  • tools/loadgen/maxrps_messages_test.go
  • tools/loadgen/maxrps_report.go
  • tools/loadgen/maxrps_report_test.go
  • tools/loadgen/maxrps_test.go
  • tools/loadgen/ramp.go
  • tools/loadgen/ramp_test.go
  • tools/loadgen/verdict.go
  • tools/loadgen/verdict_test.go

Comment on lines +165 to +166
cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() })
require.NoError(t, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop silently discarding callback errors in the test path.

Ack/marshal/publish failures are currently ignored, which can make this integration test pass while message flow is broken. Capture and fail on these errors.

Proposed fix
-		cc, err := cons.Consume(func(msg jetstream.Msg) { _ = msg.Ack() })
+		cc, err := cons.Consume(func(msg jetstream.Msg) {
+			if err := msg.Ack(); err != nil {
+				t.Errorf("ack failed: %v", err)
+			}
+		})
@@
-		data, _ := json.Marshal(evt)
-		_, _ = js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data)
+		data, err := json.Marshal(evt)
+		if err != nil {
+			t.Errorf("marshal message event failed: %v", err)
+			return
+		}
+		if _, err := js.Publish(ctx, subject.MsgCanonicalCreated(siteID), data); err != nil {
+			t.Errorf("publish canonical event failed: %v", err)
+		}

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded".

Also applies to: 181-182

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/integration_test.go` around lines 165 - 166, The callback
passed to cons.Consume currently discards errors (e.g., func(msg jetstream.Msg)
{ _ = msg.Ack() }); change it to surface failures instead of ignoring them:
inside the callback check the returned error from msg.Ack() (and any
marshal/publish calls) and propagate it to the test (either call
t.Fatalf/require.NoError from the callback context or send the error on a
channel and assert it in the main test goroutine). Update both the Consume
callback at cons.Consume(...) and the similar callback at the later occurrence
(the ack/marshal/publish call on lines 181-182) to fail the test when an error
occurs so no ack/marshal/publish errors are silently dropped.

Comment thread tools/loadgen/README.md
Comment on lines +268 to +271
```
ANSWER: max RPS = 2000 (workload=messages, preset=medium)
Next limit: E2 p95=143ms > 100ms
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language to the fenced code block.

markdownlint flags this block (MD040). Use a plain text fence for the sample output.

📝 Proposed fix
-```
+```text
 ANSWER: max RPS = 2000 (workload=messages, preset=medium)
         Next limit: E2 p95=143ms > 100ms
</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 268-268: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/README.md` around lines 268 - 271, The fenced code block
showing sample output in README.md should be annotated with a language to
satisfy markdownlint MD040; update the block delimiter from ``` to ```text so
the snippet starting with "ANSWER: max RPS = 2000 (workload=messages,
preset=medium)" is fenced as plain text.

@hmchangw hmchangw added the ready label Jun 1, 2026
Copy link
Copy Markdown
Collaborator

@mliu33 mliu33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@mliu33 mliu33 merged commit 6268a14 into main Jun 1, 2026
6 checks passed
hmchangw pushed a commit that referenced this pull request Jun 2, 2026
…set)

PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go,
error-returning) and a Reset() method on the unsharded Collector. Both
collided with the daily-IM scenario's versions:

- daily.go's waitOrCancel returned bool; replaced its callers to use the
  error-returning ramp.go helper.
- collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu,
  c.byReqID, etc.) that no longer exist; the sharded Reset further down
  in the file does the same job correctly under per-shard locks.

Build, lint, and unit tests green after the reconciliation.
hmchangw pushed a commit that referenced this pull request Jun 2, 2026
…set)

PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go,
error-returning) and a Reset() method on the unsharded Collector. Both
collided with the daily-IM scenario's versions:

- daily.go's waitOrCancel returned bool; replaced its callers to use the
  error-returning ramp.go helper.
- collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu,
  c.byReqID, etc.) that no longer exist; the sharded Reset further down
  in the file does the same job correctly under per-shard locks.

Build, lint, and unit tests green after the reconciliation.
hmchangw pushed a commit that referenced this pull request Jun 3, 2026
…set)

PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go,
error-returning) and a Reset() method on the unsharded Collector. Both
collided with the daily-IM scenario's versions:

- daily.go's waitOrCancel returned bool; replaced its callers to use the
  error-returning ramp.go helper.
- collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu,
  c.byReqID, etc.) that no longer exist; the sharded Reset further down
  in the file does the same job correctly under per-shard locks.

Build, lint, and unit tests green after the reconciliation.
mliu33 pushed a commit that referenced this pull request Jun 3, 2026
* docs: daily-IM load scenario design spec

* docs: daily-IM load scenario implementation plan

* loadgen: add DailyBands field and daily-light/heavy/power presets

* loadgen: banded fixture build for daily-IM presets

* loadgen: diurnal envelope for daily-IM scenario

* loadgen: user state machine + action picker for daily-IM

* loadgen: send/read-receipt/room-list action handlers

* loadgen: history/mute/room-create/member-add/thread action handlers

* loadgen: direct receiver pool for daily-IM scenario

Add directPool which, for each simulated user, opens a dedicated
nats.Conn and Subscribes to every room's broadcast subject. Each
delivery is timestamped and correlated against the shared Collector
via RecordBroadcast so end-to-end (E2) latency can be measured.

Wired into the daily-IM runtime by the next task; symbols carry
//nolint:unused until that wiring lands so this commit stands alone.

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt

* loadgen: multiplex receiver pool with drop counting

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt

* loadgen: SLO verdict evaluator for daily-IM steps

* loadgen: pending poller + service scraper + self-metrics

* loadgen: parseDailyConfig CLI flags + validation

* loadgen: per-step lifecycle (warmup/hold/cooldown)

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt

* loadgen: per-user emitter goroutines + runDaily control loop

* loadgen: console + CSV report for daily-IM scenario

* loadgen: production envFactory and runDaily entrypoint

Wires the daily-IM runtime end-to-end: prodEnvFactory builds the real
NATS direct/multiplex pools, pending poller, service scraper, and JWT
minting; runDaily is the production entrypoint dispatched from main.go.

Drops //nolint:unused suppressions on symbols now genuinely consumed via
the production path (directPool/multiplexPool + constructors, pollPending,
serviceScraper + Scrape, runStep/activateUsers/stepEnv). startEmitter and
doAction remain suppressed pending the per-user emitter wiring task.

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt

* loadgen: dispatch test + usage line for daily subcommand

* loadgen: integration test for daily-IM scenario

* loadgen: deploy/run-daily target for daily-IM scenario

* docs(loadgen): document daily-IM scenario

* loadgen: httptest unit coverage for pending poller + service scraper

* loadgen: fixes from code review (recall pass)

Apply correctness and cleanup fixes surfaced by /simplify high-effort review:

Correctness:
- runStep: return Inconclusive on ctx cancel and on pollPending error
  instead of zero-valued StepResult that renderConsole treats as PASS
- multiplexPool.Add: subscribe BEFORE mutating dispatch/refcount, so
  partial-subscribe failures leave the pool consistent
- multiplexPool.Close: stop closing per-user channels — race against
  in-flight route() could panic on send-on-closed; rely on Drain+GC
- multiplexPool.route: hold lock through inbox send so Close can't
  race; drop counter incremented outside lock to avoid contention
- newMultiplexPool: return error instead of panic on connect failure
- runDailyForTest: defer Close on direct+multiplex pools to drain
  connections cleanly between/after runs
- directPool/multiplexPool: also subscribe to UserRoomEvent(account)
  so DM broadcasts (the dominant receive traffic for daily presets)
  are observed
- prodEnvFactory: wire publish/request fns through stepEnv; emitters
  now actually emit actions in production (previously startEmitter was
  dead code with //nolint:unused, making the verdict trivially pass)
- doAction: use env.siteID instead of hardcoded "site-local"
- runDailyForTest: propagate baseCfg.SiteID into fixtures and stepEnv
- prodEnvFactory: drop the hardcoded :9100/metrics URLs — backend
  services don't expose HTTP endpoints, so the scrape was a permanent
  no-op masquerading as a real signal

Cleanup:
- 5s tick in runStep hold loop → single context-aware timer
- maxInt helper → slices.Max
- joinReasons body → strings.Join
- scrapeErrorCounter manual read loop → io.ReadAll

Known limitation documented in preset.go: large-band rooms have
UserCount > LargeRoomThreshold in production fixtures, which causes
message-gatekeeper to block non-thread sends from member-role users.
Daily presets work around this by funneling sends to smaller rooms;
large-band rooms still exercise fan-out via receive-side subscriptions.

* loadgen: fix integration test call site after newMultiplexPool sig change

* loadgen(daily): add CLI usage documentation with examples

`loadgen daily -h` now prints a description of the scenario, SLO
thresholds, receiver topology, preset summary, flag list with units,
and three example invocations.

Also exit cleanly on -h/--help instead of logging a 'parse daily
config error=flag: help requested' line.

* loadgen(daily): log fixture build start + elapsed time

100k-user fixtures take minutes to build (O(N*perUser*R) weighted
picker in preset.go); without a log line up front, the silence looked
like a hang. Now you see 'building fixtures users=N' immediately and
'fixtures built rooms=X subscriptions=Y elapsed=Zs' when the picker
finishes, so it's clear whether the process is in fixture build, in
the per-step warmup, or stuck.

The picker itself remains the original weighted-by-remaining-capacity
algorithm; an earlier attempt to swap it for a shuffled slot-bag
preserved the asymptotic complexity win but broke the DM band's
perfect-fit tail (any duplicate-skip burns a slot the last user needs)
and was reverted. A proper fix (stub-pairing for DM band only) is
deferred to a follow-up.

* loadgen(daily): tolerate flaky /jsz; drop pending-growth signal vs aborting

A misbehaving NATS monitoring endpoint was marking every step
Inconclusive — too strict. Now:

- pollPending retries up to 3x with linear backoff (200ms, 400ms)
  so transient failures are absorbed.
- If the poll still fails after retries, runStep logs a warning and
  drops the pending-growth signal for that step (passes nil
  ConsumerPending to evaluateStep). The other four signals
  (latency, error rate, service errors, self-metrics) still produce
  a real verdict.
- Only ctx.Canceled / DeadlineExceeded from the poller still maps
  to Inconclusive, since that indicates the run is shutting down,
  not a flaky server.

* loadgen(daily): NatsMonitoringURL env var for the /jsz poller

Adds NATS_MONITORING_URL (default http://nats:8222/jsz, the
docker-compose default) to the base config. prodEnvFactory.Build
reads it for the daily-IM pending-growth poller. Lets users running
against non-default infra (host-port-mapped NATS, monitoring on a
different port, etc.) point the poller at the right endpoint without
recompiling.

* loadgen(preset): stub-pair DM band for O(N*perUser) fixture build

The DM band's weighted-by-capacity picker was the dominant cost in
BuildFixtures: at N users with bands.DMs=perUser the band has
R = N*perUser/2 rooms, making the per-pick O(R) scan quadratic in N.
At N=100000 (the default --steps tip) the DM band alone was ~3e12
inner-loop iterations, taking hours.

Stub-pairing (configuration model) generates a perUser-regular
bipartite graph in O(N*perUser):

  1. Each user contributes perUser stubs to a flat list.
  2. Shuffle the list once.
  3. Pair consecutive stubs into DM rooms.
  4. Fix self-pairs (stubs[2k]==stubs[2k+1]) by swapping the second
     stub with a later position whose neighbours don't create a new
     self-loop. Self-loops are rare (~perUser expected over the
     whole list).

By construction every user appears in exactly perUser DMs and every
DM has 2 distinct users. The Small/Medium/Large bands keep the
existing weighted picker (they're tractable at typical N -- O(R) per
pick is fine when R is in the thousands).

Locked in by TestBuildFixtures_DailyHeavy_FastAtScale: at N=2000 the
whole BuildFixtures call now completes in ~2s on a developer box; the
test fails the build if it ever regresses past 30s. Existing
TestBuildFixtures_DailyBands (which already asserts the per-user
membership count invariant at N=200) still passes.

* loadgen(daily): correctness + operational fixes for core-tool use

Three perf+correctness+ops reviews surfaced 14 findings. Applies the
verdict-affecting and operationally-critical ones; flags the
performance/memory ones as follow-ups in commit messages and code.

Correctness:

- **Deterministic per-user RNG seed.** startEmitter was seeded from
  time.Now().UnixNano() XOR len(u.ID) — len is constant for all users
  ("u-NNNNNN"), so same-nanosecond activations got identical seeds
  and emitted perfectly correlated action streams. Replaced with a
  splitmix mix of env.runSeed and the user index, so two runs at the
  same run-seed produce identical action streams (the whole point of
  a reproducible verdict).

- **Per-step envelope re-anchor.** Emitters used to capture holdStart
  at activation time, so users from step N kept emitting against
  step N's curve all the way through step N+1 — by step 2 the early
  half of the user population was firing at envelope baseline
  (0.4×rate) instead of following the new step's hold curve, making
  later steps' "PASS" verdicts wrong. Now stepEnv carries holdStart
  and holdDuration as atomic.Int64 fields; emitters re-read them on
  every tick; runStep calls env.setHold(now, hold) right before
  Reset() at the start of each hold.

- **Post-activation hold anchor.** holdStart was computed *before*
  activateUsers, which is rate-limited at 500 users/sec. At a +50k
  delta this meant 100s of activation, mostly during what was
  supposed to be warmup, with fresh users emitting at the envelope's
  clamped edge instead of ramping. The re-anchor above also fixes
  this — hold begins after warmup actually elapsed.

- **percentile uses ceil-based nearest-rank.** Old floor-based
  indexing (int(p*(len-1))) returned p98 for p99 on ≤100-sample
  windows, so tight smoke tests false-PASSed. Now uses
  ceil(p*len)-1, matching the standard.

- **diffPending walks both sides.** A consumer that disappears
  mid-hold (crashed/deleted) used to be silently dropped from the
  verdict. Now surfaces as End=0/negative-Delta and trips with
  reason "&lt;durable&gt; disappeared mid-hold".

- **HTTP timeout on /jsz.** pollPendingOnce used http.DefaultClient
  with no timeout; a hung monitoring endpoint would wedge the whole
  run until the operator killed the process. Dedicated 5s-timeout
  client.

- **ctx-aware ramp loop.** runDailyForTest now checks ctx.Err()
  between steps, so SIGINT mid-cooldown doesn't produce a junk trail
  of INCONCLUSIVE rows for steps that never started.

Operational guards:

- **Zero-traffic → INCONCLUSIVE.** Publisher conn failure or
  unwired emitters used to produce a silent PASS at every step. Now
  evaluateStep returns INCONCLUSIVE with "zero actions attempted"
  when AttemptedOps==0.

- **Effective-N tracking + &lt;95% threshold.** activateUsers
  increments env.activatedCount/skippedCount atomically; if
  EffectiveN/N &lt; 95% the step returns INCONCLUSIVE with the
  shortfall in the reason. StepResult.EffectiveN is surfaced in
  console (e.g. "20000(10000)") and added as a CSV column. Stops
  "N=20000 PASS" silently meaning "10000 active".

- **Goroutine-as-CPU proxy disabled.** readCPUPercent's heuristic
  (NumGoroutine/5000 × 100) tripped INCONCLUSIVE at any scale ≥4k
  users — exactly the regime this tool is built for. Now returns 0;
  INCONCLUSIVE relies on the GC pause signal alone until a real CPU
  sample (gopsutil/proc-self-stat) is wired.

Follow-ups (not in this commit):

- **Perf.** Collector single-mutex contention (~520k locks/sec at
  N=100k), unbounded byReqID/byMsgID growth (spec promised TTL+1M
  cap), per-user inbox channels that nobody reads, small/medium band
  O(N²) picker, one-goroutine-per-user (100k goroutines), per-tick
  actionCtx allocation. Each is a real win but a non-trivial
  refactor; queued for the next pass.

- **Ergonomics.** Configurable SLO thresholds + seed flags,
  heartbeat between warmup and verdict, ABORTED marker on ctx-cancel,
  full per-durable CSV columns, --validate dry-run.

* loadgen(perf): slot-bag picker for non-DM bands + shard Collector

Two perf fixes flagged by the post-review pass.

preset.go: small/medium/large band fixture picker switched from the
O(N x perUser x R) weighted-scan to a configuration-model slot-bag
walk with retry on per-user duplicates. Same shape as the DM band's
stub-pairing but generalised for heterogeneous room capacities.
N=10000 daily-heavy fixture build drops from minutes to ~1s; N=100k
drops from ~30+ min to seconds. TestBuildFixtures_DailyHeavy_FastAtScale
bumped to N=10000 to lock the speedup in.

collector.go: byReqID/byMsgID/e1/e2 sharded by FNV-1a hash of the
correlation key across 64 shards (reqShard for requestID-keyed,
msgShard for messageID-keyed; each has its own mutex). At N=100k a
busy daily-IM run produces ~150k+ Record* calls/sec, all of which
previously serialised on a single sync.Mutex — loadgen became the
bottleneck and inflated latency samples before the backend tripped.
Per-shard contention drops by ~64x.

Determinism preserved: emit-as-you-pick instead of range-over-map
(Go map iteration order is randomised), and the slot-bag uses the
same seeded r.Intn sequence both runs. TestBuildFixtures_DailyBands
still passes its same-seed -> equal Fixtures invariant.

* loadgen: reconcile with main after rebase (drop dup waitOrCancel + Reset)

PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go,
error-returning) and a Reset() method on the unsharded Collector. Both
collided with the daily-IM scenario's versions:

- daily.go's waitOrCancel returned bool; replaced its callers to use the
  error-returning ramp.go helper.
- collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu,
  c.byReqID, etc.) that no longer exist; the sharded Reset further down
  in the file does the same job correctly under per-shard locks.

Build, lint, and unit tests green after the reconciliation.

* loadgen(daily): skip integration test that needs full backend stack

TestRunDaily_Integration_TinyPresetPasses passed vacuously before the
recall-review fix wired emitters into prodEnvFactory (no actions = no
errors = trivial PASS). With emitters live, the test now actually
publishes to subjects like `chat.user.{acct}.room.{room}.{site}.msg.send`
and request-replies on history/room-list/etc. — all of which time out
in the CI integration job because the testutil NATS container has no
gatekeeper, no room-service, and no broadcast-worker subscribed.
Result: error_rate=7.69% > 0.1% threshold, step TRIPs, test FAILs.

The wiring is correct (proven by unit + race tests); what's wrong is
that this scenario fundamentally needs the full docker-compose stack
to be meaningful. Skipping with a comment pointing operators at
`make -C tools/loadgen/deploy run-daily` for real end-to-end coverage.

* loadgen(daily): flush after Subscribe so server registers interest before Add returns

CI hit: TestMultiplexPool_RoutesBroadcastToInbox 'Condition never
satisfied'. Root cause: nc.Subscribe queues the SUB command client-
side; it's not sent to the broker until the next Flush. A caller (or
test) that publishes immediately after Add() can see the broadcast
dropped because the server hasn't registered the subscription
interest yet.

Production emitters tick on a 1s schedule so they never publish during
the registration race window, but the test does — explaining why the
test was reliably failing only under CI's timing.

Fix in both pools (direct + multiplex): call nc.Flush() after the
Subscribe pass so Add only returns once the server knows about every
new interest. One round-trip per pool conn at activation time,
dominated by the Subscribe overhead already incurred.

* loadgen(daily): pass NATS_CREDS_FILE to pool + publisher conns

Bug: daily.go's prodEnvFactory and daily_pool.go's pool constructors
called bare nats.Connect(url, nats.Name(...)) with no credentials. On
operator-mode NATS servers (the chat-system standard, including the
local docker-local stack) anonymous connections either fail outright
or get default-deny perms — manifesting as 'permissions violation' on
subscribe and the silent-zero-latency symptom reported here.

Every other loadgen subcommand correctly used
natsutil.Connect(cfg.NatsURL, cfg.NatsCredsFile); only the daily
scenario was missing the second arg.

Fix:
- New connectWithCreds(url, name, credsFile) helper applies
  nats.UserCredentials when credsFile is non-empty, else falls back
  to anonymous (testcontainer NATS in unit/integration tests).
- newDirectPool, newMultiplexPool, and the prodEnvFactory publisher
  conn all route through it.
- baseCfg.NatsCredsFile (existing env var NATS_CREDS_FILE) flows in
  through prodEnvFactory; the docker-compose overlay already wires
  NATS_CREDS_FILE=/etc/nats/backend.creds.

After this fix: 'make -C tools/loadgen/deploy run-daily PRESET=...'
authenticates as the backend user, subscribe perms work, and latency
samples populate — P50/P95/P99 should be non-zero.

* loadgen(daily): use Request for readReceipt; room-service expects a reply

room-service registers MessageReadWildcard via nc.QueueSubscribe and
unconditionally calls m.Msg.Respond(resp) (room-service/handler.go:83).
Respond requires msg.Reply to be set, which nc.Publish does not set —
only nc.Request does.

The daily readReceipt action was using a.Publish, so every read-receipt
action triggered 'nats: message does not have a reply' in room-service's
log. Switched to a.Request with the same defaultRequestTimeout the other
request-shaped actions use, and renamed/updated the unit test to lock
in the contract (must land in c.reqs, not c.pubs).

Verified the other six action handlers (scrollHistory, refreshRoomList,
muteToggle, roomCreate, memberAdd, threadReply) — they all use Request
against the right routes, so readReceipt was the sole mismatch.

* loadgen(daily): pre-flight check for un-seeded mongo + readReceipt uses Request

Two unrelated bugs that bit during real runs:

1. readReceipt was using a.Publish for subject.MessageRead, but
   room-service registers that subject via QueueSubscribe and calls
   msg.Respond on every received message — which fails with 'nats:
   message does not have a reply' when the sender used fire-and-forget
   Publish. Switched to a.Request (matches the pattern of every other
   request-side action handler). Updated the test.

2. runDaily started straight into BuildFixtures + a full ramp even
   when the Mongo subscriptions collection was empty for the configured
   siteID. The gatekeeper then rejected every send with 'user X is not
   subscribed to room Y' and the operator burned ~30+ minutes before
   seeing the error_rate trip. Added a preflight that opens a short-
   lived Mongo connection, counts subscriptions for cfg.SiteID, and
   bails with an actionable message:

     ERROR daily pre-flight error="no subscriptions found in mongo
     for siteID="site-local"; run \`loadgen seed --workload=messages
     --preset=<your daily preset>\` first (or \`make -C
     tools/loadgen/deploy seed PRESET=<your preset>\`)"

   Exit code 2 (config error). Uses a 10s context independent of the
   run-level ctx so a transient Mongo blip doesn't burn the window
   before failing.

* loadgen(daily): stop overriding preset.Users — caused fixture ID mismatch with seed

Symptom: gatekeeper rejects every send with "user X is not subscribed
to room Y" even though `loadgen seed` was just run successfully.

Root cause: BuildFixtures is deterministic in (preset, seed, siteID).
`loadgen seed --workload=messages --preset=daily-heavy` uses the
preset's default Users (10000). But runDailyForTest was overriding
preset.Users = slices.Max(cfg.Steps) before calling BuildFixtures.
If max(steps) != preset.Users, the per-band stub shuffle produces
DIFFERENT user/room/sub IDs at run time vs seed time — same prefixes
("u-000123", "room-small-000045"), different totals → different
permutation → no overlap. Gatekeeper's GetSubscription misses every
lookup and the whole ramp fails on error_rate.

Fix: drop the override. runDailyForTest now uses preset.Users as-is,
matching `loadgen seed`. activateUsers already caps at
len(env.users), and the EffectiveN-shortfall guard surfaces
"only X/Y users activated" as INCONCLUSIVE when --steps exceeds
preset.Users — much clearer than silent ID drift.

Side effect: --steps cannot exceed preset.Users for that preset.
For daily-light/heavy/power that's 10000. A startup log line
warns when max(steps) > preset.Users so the operator sees the cap
before the run starts.

No re-seed required after deploying the fix.

* docs(loadgen): comprehensive Daily-IM operator guide

Replace the 53-line stub section with a full operator guide that
captures every lesson from real runs:

- Quick start + prerequisites (seed step, NATS perms, env vars)
- Preset comparison with rooms-per-user breakdown
- Complete CLI flag reference (units, defaults, when to change)
- Environment variables (NATS_CREDS_FILE, NATS_MONITORING_URL,
  SITE_ID, plus the mongo/valkey vars)
- SLO signals with thresholds; PASS/TRIP/INCONCLUSIVE semantics
- Sample console + CSV output, how to read 'ANSWER' and
  N(EffectiveN) annotations
- Troubleshooting matrix covering every failure mode encountered:
  not-subscribed, siteID mismatch, large-room block,
  'no reply' from room-service, NATS permissions, all-zero
  latency, EffectiveN cap, fixture build silence, flaky /jsz
- Known limitations explicitly listed (large-room block, no-op
  auth stub, dormant service-error scraper, disabled CPU proxy)
- Pointers to spec, plan, and source files

Aims to make the tool self-serviceable: a new operator who has
never used it should be able to get to a verdict by following
the doc, and an experienced operator hitting an error should
find the fix in the troubleshooting matrix without escalating.

* loadgen(daily): per-action latency breakdown in console + CSV

Adds observational per-handler latency stats so operators can see what
each daily action (send / read_receipt / scroll_history / refresh_room_list
/ member_add / room_create / mute_toggle) actually costs, in addition to
the system-wide publish→broadcast p50/p95/p99 the verdict already uses.

- Collector gains per-action latency sample tracking (map[int][]Duration
  under its own mutex; cleared by Reset alongside the existing counters).
- doAction wraps each handler call with time.Now/Since and records the
  elapsed via RecordActionLatency.
- actionKind grows a String() helper for stable lowercase names and
  an allActionKinds canonical-order slice for the report.
- stepInputs/StepResult carry ActionLatencies; evaluateStep computes
  Count + P50/P95/P99 per action via the existing percentile helper.
- Console prints 'actions: send n=8920 p50=12 p95=180 p99=320 | ...'
  on the line after each step's verdict.
- CSV adds 28 columns (<action>_count, _p50_ms, _p95_ms, _p99_ms × 7
  actions) in stable allActionKinds order so downstream tools can
  column-index reliably.

Verdict criteria unchanged: these stats are observational only — PASS
/ TRIP / INCONCLUSIVE still depends on the existing latency, error-rate,
pending-growth, and self-metrics signals. Per the operator's request
(too easy to confuse a single 'all actions sluggish' p95 with a real
broadcast-pipeline regression if we trip on per-action numbers).

* loadgen(daily,seed): add --users flag to override preset.Users

The daily presets hard-code Users=10000. Operators wanting to ramp
above that previously had to edit preset.go and rebuild. Now both
`loadgen seed --workload=messages` and `loadgen daily` accept
`--users=N` to override at invocation time.

The two MUST be passed the same value — BuildFixtures is deterministic
in (preset, seed, siteID), and the per-band stub shuffle uses totalUsers
as length, so a mismatch produces entirely different room/sub IDs even
though the user IDs (u-000000, u-000001, ...) overlap. The gatekeeper
then rejects every send with "user not subscribed".

To catch that misuse fast, the daily preflight now also counts users
in Mongo and compares to preset.Users (or the --users override). On
mismatch it errors with the exact teardown+re-seed commands needed.

Usage at, say, N=50000:

  loadgen seed  --workload=messages --preset=daily-heavy --users=50000
  loadgen daily --preset=daily-heavy --users=50000 --steps=5000,10000,20000,50000

`--users=0` (default) means "use preset default", the safe path for
normal runs and existing scripts.

* loadgen(daily): set X-Request-ID header on every NATS publish/request

room-service extracts X-Request-ID from inbound NATS headers via
natsutil.ContextWithRequestIDFromHeaders (room-service/handler.go:60).
The ID flows through ctx into the canonical event room-service publishes
to room-worker's ROOMS stream. room-worker REQUIRES X-Request-ID on
canonical events for idempotency (it's the dedup seed) and rejects with
'missing X-Request-ID' otherwise (room-worker/handler.go:752,1219).

Before this fix, prodEnvFactory's publish/request closures used
pubConn.Publish(subj,data) and pubConn.RequestWithContext(ctx,subj,data) —
neither of which sets headers. Empty inbound header at room-service →
empty ctx → empty canonical header → room-worker logs the error and
drops the event. Visible in room-worker logs as 'missing-X-request-id'
on subject 'chat.room.canonical.<site>.member.add'.

Now both closures build *nats.Msg with a fresh
natsutil.RequestIDHeader (X-Request-ID) per call and use PublishMsg /
RequestMsgWithContext. Every emitter action now produces a properly
traceable request through the canonical pipeline.

The header ID is independent of any payload-level request ID
(e.g. SendMessageRequest.RequestID for the gatekeeper) — both serve
different layers. In production a real client typically uses the same
value at both layers; we could plumb that through actionCtx as a
follow-up if request-tracing across layers needs it.

* loadgen(daily): memberAdd picks from channel rooms only

room-service rejects member-add on DM rooms with 'cannot add members to
a non-channel room' (DMs have a fixed 2-member shape). For daily-heavy
~25 DMs out of 56 rooms per user, picking uniformly from u.Rooms means
~45% of memberAdd actions error out — inflating error_rate noise and
spamming the room-service log.

Pre-filter at activation time: userState.ChannelRooms is the subset of
Rooms that aren't DMs, computed once in newUserState by ID-prefix
('room-dm-' = DM band, everything else = channel band). memberAdd now
picks from u.ChannelRooms instead of u.Rooms. Users with no channel
rooms (DM-only) no-op the action rather than erroring.

The other actions (sendMessage, readReceipt, scrollHistory,
refreshRoomList, muteToggle, roomCreate, threadReply) work fine on DMs
and continue using u.Rooms. Only memberAdd has the channel-only
constraint at the room-service layer.

Tests:
- TestNewUserState_ChannelRoomsExcludesDMs locks the prefix-filter
- TestMemberAdd_Requests now seeds ChannelRooms explicitly
- TestMemberAdd_SkipsWhenNoChannelRooms covers the DM-only no-op path

* loadgen(daily): roomCreate includes neighbor user; memberAdd uses real account

Two related fixes for actions that hit room-service:

roomCreate sent only {name, type}. classifyAndValidate at
room-service/handler.go:193 lets that through (name is non-empty), but
the channel branch at line 290 then rejects it:
    allUsers := stripAccount(...)
    if len(allUsers) == 0 && len(allOrgs) == 0 {
        return nil, errEmptyCreateRequest
    }
A channel must have at least one member besides the creator. Now we
include u.Neighbor in the 'users' array. errEmptyCreateRequest gone.

memberAdd was passing the literal 'user-stub' as the target account.
room-service validates accounts exist (errUserNotFound) — 'user-stub'
isn't in Mongo, so every memberAdd that reached a channel was failing
silently after the channel-only filter. Now uses u.Neighbor, which is
guaranteed-seeded.

userState.Neighbor is computed once in newUserState by shifting the
requester's index by 1 ('user-N' → 'user-(N-1)' for N≥1, 'user-0' →
'user-1'). Always != Account; always within the seeded range for any
preset with Users ≥ 2 (i.e. all daily-* presets).

Tests:
- TestNeighborOf covers shift / wrap / fallback
- TestRoomCreate_Requests asserts the payload now has a users array
- TestMemberAdd_Requests / SkipsWhenNoChannelRooms unchanged shape

* loadgen(daily): apply per-request timeout in the request closure

The 'timeout' argument to actionCtx.Request was silently ignored:
RequestMsgWithContext uses the context's deadline, and the emitter's
ctx is the run-level ctx with no deadline. So defaultRequestTimeout
(5s) never bound anything; a slow room-service handler could hang
arbitrarily long, showing up as huge per-action p50 (e.g. 25s for
read_receipt) instead of cleanly timing out and contributing to
error_rate.

Wrap the incoming ctx with context.WithTimeout(timeout) so each call
gets a real deadline. Action handlers continue to pass
defaultRequestTimeout=5s; future per-action overrides can vary the
value without code changes.

This makes the per-action latency numbers a true latency signal: a
healthy handler shows actual round-trip ms; a sick one shows a wall
at the timeout value plus a matching jump in failed_ops / error_rate.

* loadgen(daily): per-action latency gates the verdict

Previously the verdict only gated on the publish→broadcast latency (the
messaging pipeline). Operations like read_receipt, scroll_history,
member_add, refresh_room_list — all of which exercise different parts
of the system (room-service, history-service, room-worker) — only
contributed to error_rate when they timed out, never to latency.
A slow handler that stayed under the request timeout was invisible.

Now Thresholds carries per-action p95/p99 maps; evaluateStep iterates
ActionLatencies and trips if any gated action exceeds its cap. Default
thresholds reflect typical chat-backend handler latencies:

  read_receipt:      p95=100ms  p99=250ms   (mongo update + reply)
  refresh_room_list: p95=200ms  p99=500ms   (mongo query per user)
  scroll_history:    p95=500ms  p99=1500ms  (cassandra bucket walk)
  member_add:        p95=200ms  p99=500ms   (mongo + canonical pub)
  mute_toggle:       p95=100ms  p99=250ms   (mongo update)
  room_create:       p95=500ms  p99=1500ms  (mongo + canonical pub)

sendMessage/threadReply are not gated here — they're already gated by
the broadcast-pipeline P95LatencyMs / P99LatencyMs at the top of the
verdict (the latency the existing p50/p95/p99 columns reflect).

Operator can tune via --action-p95-ms / --action-p99-ms flags:
  loadgen daily --preset=daily-heavy \
    --action-p95-ms=read_receipt:80,scroll_history:300 \
    --action-p99-ms=read_receipt:200,scroll_history:800
Format: comma-separated name:N pairs. Unset actions keep defaults;
unknown names are rejected at parse time. Set N to a very large
number to effectively disable an action's gate.

Trip reasons name the action explicitly:
  scroll_history p95=612ms > 500
  read_receipt p99=340ms > 250

* loadgen(daily): rename read_receipt action to mark_read

The action marks the user's subscription as read (msg.read handler).
"read_receipt" invited confusion with the separate MessageReadReceipt
notification path; "mark_read" names what the action actually does.

Breaking for existing CSV consumers and CLI users: the per-action
column names change from read_receipt_* to mark_read_*, and
--action-p95-ms / --action-p99-ms must use "mark_read" as the key.

* loadgen(history): stream per-room seed plan to avoid OOM on history-large

history-large materialized a single 65M-row plannedMessage slice (~50 GB
of strings), so seeding got SIGKILLed before reaching Cassandra.
BuildHistoryFixtures now keeps only the Fixtures + ThreadParents map and
hands the seeder an IterateRoomMessages iterator that builds one room's
plan at a time (~50 MB peak). FullPlan stays available for tests and
small/medium presets where materialization is bounded.

Per-room randomness is split into two streams — structural (sender,
jitter, parent-selection, reply offsets) and content (message body
bytes) — so a cheap metadata-only walk can derive Room.LastMsgAt and
ThreadParents without paying the O(MessagesPerRoom × ContentBytes) cost
of regenerating content.

CreatedAt and content bytes for a given (seed, preset) are different
from before — no test asserts on either, but anyone diffing a re-seed
against an old Cassandra dump will see the change.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants