Skip to content

feat(loadgen): daily-IM load scenario to find sustainable N#234

Merged
mliu33 merged 48 commits into
mainfrom
claude/gifted-rubin-ry8HI
Jun 3, 2026
Merged

feat(loadgen): daily-IM load scenario to find sustainable N#234
mliu33 merged 48 commits into
mainfrom
claude/gifted-rubin-ry8HI

Conversation

@hmchangw
Copy link
Copy Markdown
Owner

@hmchangw hmchangw commented May 27, 2026

Summary

Adds a loadgen daily subcommand that simulates N users using the chat system as their primary IM over a workday, ramps N geometrically, and reports the largest N at which all SLO signals held over a steady-state hold window. Single-site only.

Design spec: docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md
Implementation plan: docs/superpowers/plans/2026-05-27-daily-im-load-scenario.md

What's implemented

  • Presets (preset.go): daily-light / daily-heavy / daily-power with banded per-user room membership (DMs + small + medium + large rooms, Zipf sizing). Headline daily-heavy ≈ 56 rooms/user.
  • User behavior (daily_user.go, daily_envelope.go): two-state idle/active Markov chain, weighted action picker, and a diurnal envelope (two Gaussians at 1/3 and 2/3 of the hold) scaling the per-user Poisson rate.
  • Action handlers (daily_actions.go): send, read-receipt, room-list refresh, history scroll, mute toggle, room create, member add, thread reply — all going through the frontdoor (message-gatekeeper).
  • Hybrid receiver (daily_pool.go): one nats.Conn per user up to --max-direct-users, then a multiplexed pool of shared conns. Both subscribe to channel (RoomEvent) and DM (UserRoomEvent) broadcasts.
  • SLO verdict (daily_verdict.go): p95/p99 latency, error rate, JetStream consumer pending-growth, plus loadgen self-metrics that mark a step INCONCLUSIVE when the load box itself is saturated.
  • Ramp + report (daily.go, daily_report.go): step-up-and-hold control loop, console table + CSV output, "ANSWER: N = …" line naming the first limit hit.
  • Wiring: loadgen daily subcommand, make -C tools/loadgen/deploy run-daily target, README section.

Status

  • lint, test, sast CI jobs green.
  • test-integration (tools) fixed (the newMultiplexPool signature change broke a call site; corrected in c9187d0).
  • Unit coverage for pure logic (envelope, verdict, picker, fixtures, report, pollers via httptest). Pool code is exercised by integration tests behind //go:build integration.

Known limitations / follow-ups

  • Large-room sends are gatekeeper-blocked. Large-band rooms have UserCount > message-gatekeeper's LargeRoomThreshold (default 500), so member-role top-level sends to them are rejected. Daily presets funnel sends to smaller rooms; large rooms are exercised for receive-side fan-out only. A clean fix (thread-routing or role escalation in fixtures) is deferred.
  • Auth is a placeholder. mintJWT is a no-op stub — per-user auth-service login is not yet wired (spec §10 deferred it as a light touch).
  • Service-error signal is dormant. Backend services don't expose /metrics HTTP endpoints, so the slog_errors_total scraper passes an empty URL map and that SLO signal is currently a no-op. The other four signals are live.
  • Reconnect/presence storms and cross-site federation are explicitly out of scope (separate PRs).

Test plan

  • make -C tools/loadgen/deploy up && make -C tools/loadgen/deploy seed PRESET=daily-heavy && make -C tools/loadgen/deploy run-daily PRESET=daily-heavy STEPS=1000,2000 HOLD=30s produces a CSV + console summary against the local stack.
  • make lint, make test SERVICE=tools/loadgen pass.
  • make test-integration SERVICE=tools/loadgen passes (Docker required; verified compiling under the integration tag).

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt

Summary by CodeRabbit

  • New Features

    • Added loadgen daily: ramps concurrent “daily‑IM” users with warmup/hold/cooldown, SLO evaluation, stop‑on‑trip, and CSV output.
    • Presets: daily‑light / daily‑heavy / daily‑power; CLI controls for steps, warmup/hold/cooldown, connection sizing, and CSV.
    • Console and CSV reports now include latency percentiles (p50/p95/p99), error rate, pending deltas, multiplex-drop and action attempt/failure metrics, and an ANSWER line.
  • Documentation

    • New design spec and quick‑start guide covering scenario, presets, SLOs, and outputs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough
📝 Walkthrough
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(loadgen): daily-IM load scenario to find sustainable N' clearly and concisely summarizes the primary change—a new loadgen scenario for simulating daily IM workload and finding the maximum sustainable user count (N).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/gifted-rubin-ry8HI

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🧹 Nitpick comments (1)
tools/loadgen/daily_user_test.go (1)

38-40: ⚡ Quick win

Strengthen the distribution assertion for rare actions.

The current check (mute < roomCreate+memberAdd+10) is very permissive and can still pass if weights regress. Consider asserting sampled ratios against expected probabilities (with tolerance) so this test actually guards weight drift.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily_user_test.go` around lines 38 - 40, The current
permissive assertion using require.Less for actionMuteToggle should be replaced
by asserting the empirical probability of the rare action is within a tolerance
of its expected probability: compute totalSamples (sum of counts map), compute
empiricalProb := float64(counts[actionMuteToggle]) / float64(totalSamples) and
expectedProb based on the configured weights for actionMuteToggle relative to
sum of weights (or derive from actionRoomCreate+actionMemberAdd if that's the
intended baseline), then use a numeric comparison (e.g., require.InDelta or
require.True with math.Abs(empiricalProb-expectedProb) < tolerance) to enforce
that empiricalProb is within an acceptable tolerance (e.g., 2–5%) so weight
regressions are caught; update the assertion that currently references
actionMuteToggle, actionRoomCreate, and actionMemberAdd accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md`:
- Around line 212-222: The fenced output block containing the performance table
and ANSWER (the block that starts with "N        p50    p95..." and ends with
"Next limit: broadcast-worker consumer (pending growth)") needs a language tag
to satisfy MD040; change the opening fence from ``` to ```text so the block is
labeled (e.g., ```text ... ```). Ensure only the opening fence is updated and
keep the block content unchanged.

In `@tools/loadgen/collector.go`:
- Around line 170-181: Reset currently clears per-step counters but leaves
multiplexDrops untouched; update Collector.Reset to clear multiplexDrops as well
(under the existing c.mu lock) by setting it to its zero value—e.g., if
multiplexDrops is an atomic counter call c.multiplexDrops.Store(0), or if it’s a
plain integer or map assign the appropriate zero/nil (c.multiplexDrops = 0 or
c.multiplexDrops = nil/make(...)) so per-step drop counts are not cumulative;
place this change alongside the other resets in Reset().

In `@tools/loadgen/daily_actions_test.go`:
- Around line 43-125: Add negative and boundary tests for the action handlers
(sendMessage, readReceipt, refreshRoomList, scrollHistory, muteToggle,
roomCreate, memberAdd, threadReply) by extending captured and actionCtx to
simulate Publish and Request failures and marshal errors, plus empty-room no-op
cases: write tests that (1) substitute captured.publish/captured.request to
return errors and assert handlers return wrapped errors, (2) force JSON
marshal/unmarshal failures for sendMessage/threadReply (e.g., use a type or hook
that causes json.Marshal to fail) and assert error propagation, and (3) pass
userState with empty Rooms to handlers that should be no-ops and assert no
publishes/requests occur; ensure each new test checks error messages are
preserved (wrapped) and that failure counts/lengths on
captured.pubs/captured.reqs are as expected.

In `@tools/loadgen/daily_actions.go`:
- Line 77: Replace the wrong publish subject in the read-receipt publish call:
in the a.Publish invocation that currently uses subject.MessageRead(u.Account,
roomID, a.SiteID), change the subject to the read-receipt subject (e.g.,
subject.ReadReceipt(u.Account, roomID, a.SiteID)) so the event is sent on the
correct RPC path; update the single a.Publish call accordingly in
daily_actions.go.

In `@tools/loadgen/daily_pool.go`:
- Around line 108-120: newMultiplexPool currently panics on nats.Connect
failure; change its signature from newMultiplexPool(natsURL string, c
*Collector, size int) *multiplexPool to return (*multiplexPool, error) so
bootstrap errors can be returned; inside newMultiplexPool, on connection error
call p.Close() (as already done) and return nil and a wrapped error (e.g.,
fmt.Errorf("multiplex conn %d: %w", i, err)); update all callers of
newMultiplexPool to handle the returned error and propagate a controlled CLI
failure instead of letting the process crash.
- Around line 129-143: The code currently inserts into p.userInbox and appends
to p.dispatch before calling nc.Subscribe, which leaks state if nc.Subscribe
fails; change the logic in the Add method so you either (a) perform Subscribe
for each room first (while holding p.mu) and only after all subscribes succeed
mutate p.userInbox and p.dispatch, or (b) keep the current ordering but on
Subscribe error roll back the partial mutations: remove p.userInbox[u.ID],
remove the recently appended inbox entries from p.dispatch[roomID], and
decrement p.roomRefs[roomID] accordingly before returning the error; ensure you
still unlock p.mu in all paths and use the existing symbols (p.mu, p.userInbox,
p.dispatch, p.roomRefs, p.nextConn, nc.Subscribe, p.route) to locate where to
apply the fix.

In `@tools/loadgen/daily_report.go`:
- Around line 84-86: writeDailyCSV currently defers w.Flush() which discards
w.Error(); change the function to use a named error return (e.g., err error) and
replace defer w.Flush() with a defer func() { w.Flush(); if ferr := w.Error();
ferr != nil && err == nil { err = ferr } } so any flush/write errors from the
csv.Writer (w) are surfaced as the function's returned error; ensure the
function signature and any callers still work with the named error return.

In `@tools/loadgen/daily_verdict.go`:
- Around line 193-199: The P99 target calculation can be zero for total<100,
causing the loop in the code that iterates h.Counts to return h.Buckets[0] even
when that bucket has zero count; change the target computation (the variable
target used alongside total, h.Counts and h.Buckets) to ensure it is at least 1
(e.g., compute target := total*99/100 and if target == 0 { target = 1 } or use a
math-based ceil to guarantee at least one sample) so the accumulator logic (acc
+= c; if acc >= target { return h.Buckets[i] * 1000 }) never returns due to a
zero target and correctly reflects GCPauseP99Ms and step conclusiveness.
- Around line 239-267: The pollPending and scrapeErrorCounter HTTP calls need
robust error handling: don't ignore errors from http.NewRequestWithContext in
pollPending (and the analogous call in scrapeErrorCounter) — return the error;
use a configured http.Client with a sensible timeout instead of
http.DefaultClient; after Do(), check resp.StatusCode and return an error for
non-2xx responses before attempting to decode/read the body; ensure JSON
decoding in pollPending returns any decode error; and in scrapeErrorCounter,
propagate any io.Read or scan errors instead of breaking the loop and returning
a potentially underreported sum. Update the functions pollPending and
scrapeErrorCounter to perform these checks and return errors when they occur.

In `@tools/loadgen/daily.go`:
- Around line 378-387: The auth-service login call uses http.DefaultClient.Do
which has no timeout; change it to use a client with a deadline (or derive a
context with timeout) so the call can't block indefinitely: create an
http.Client with a sensible Timeout (e.g., client := &http.Client{Timeout:
10*time.Second}) and replace http.DefaultClient.Do(req) with client.Do(req), or
wrap ctx with context.WithTimeout before calling http.NewRequestWithContext and
cancel after use; update imports to include time (or context) as needed and
ensure errors/cancel are handled in the same block where req and resp are
created.
- Around line 143-149: The hold loop in the routine that waits until holdEnd
currently always sleeps a fixed 5s (the for/select block) and can overshoot
short holds; change the logic in that loop (the goroutine using ctx.Done() and
time.After(5 * time.Second)) to compute the remaining duration (d :=
time.Until(holdEnd)) and wait only for min(d, 5*time.Second) using a timer
(time.NewTimer / timer.Reset) or time.After with that computed duration so the
loop exits promptly when holdEnd is reached and still polls at most every 5s.

In `@tools/loadgen/README.md`:
- Around line 163-167: Add missing fenced code block language identifiers by
changing the two command examples that start with "make -C tools/loadgen/deploy
up" and "loadgen daily \\" to use bash fences (```bash) instead of plain ``` so
the blocks are explicitly marked as Bash shells; update both occurrences in the
README so the first block (make -C tools/loadgen/deploy ...) and the second
block (loadgen daily ...) have ```bash opening fences.

---

Nitpick comments:
In `@tools/loadgen/daily_user_test.go`:
- Around line 38-40: The current permissive assertion using require.Less for
actionMuteToggle should be replaced by asserting the empirical probability of
the rare action is within a tolerance of its expected probability: compute
totalSamples (sum of counts map), compute empiricalProb :=
float64(counts[actionMuteToggle]) / float64(totalSamples) and expectedProb based
on the configured weights for actionMuteToggle relative to sum of weights (or
derive from actionRoomCreate+actionMemberAdd if that's the intended baseline),
then use a numeric comparison (e.g., require.InDelta or require.True with
math.Abs(empiricalProb-expectedProb) < tolerance) to enforce that empiricalProb
is within an acceptable tolerance (e.g., 2–5%) so weight regressions are caught;
update the assertion that currently references actionMuteToggle,
actionRoomCreate, and actionMemberAdd accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01513166-7f5b-44d1-b9c3-33088365baff

📥 Commits

Reviewing files that changed from the base of the PR and between 416285d and 700a83b.

📒 Files selected for processing (24)
  • docs/superpowers/plans/2026-05-27-daily-im-load-scenario.md
  • docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md
  • tools/loadgen/README.md
  • tools/loadgen/collector.go
  • tools/loadgen/daily.go
  • tools/loadgen/daily_actions.go
  • tools/loadgen/daily_actions_test.go
  • tools/loadgen/daily_envelope.go
  • tools/loadgen/daily_envelope_test.go
  • tools/loadgen/daily_integration_test.go
  • tools/loadgen/daily_pool.go
  • tools/loadgen/daily_pool_test.go
  • tools/loadgen/daily_report.go
  • tools/loadgen/daily_report_test.go
  • tools/loadgen/daily_test.go
  • tools/loadgen/daily_user.go
  • tools/loadgen/daily_user_test.go
  • tools/loadgen/daily_verdict.go
  • tools/loadgen/daily_verdict_test.go
  • tools/loadgen/deploy/Makefile
  • tools/loadgen/main.go
  • tools/loadgen/main_test.go
  • tools/loadgen/preset.go
  • tools/loadgen/preset_test.go

Comment on lines +212 to +222
```
N p50 p95 p99 err% worst-pending-delta verdict
1000 12 45 89 0.00% broadcast-worker +12 PASS
2000 14 58 112 0.00% broadcast-worker +34 PASS
5000 22 94 180 0.01% broadcast-worker +180 PASS
10000 38 210 430 0.02% broadcast-worker +890 PASS
20000 71 480 980 0.04% broadcast-worker +1240 TRIP

ANSWER: N = 10000 (last passing step)
Next limit: broadcast-worker consumer (pending growth)
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced output block.

The fenced block starting at Line 212 is missing a language identifier (MD040). Please label it (e.g., text) to keep markdown lint clean.

Proposed fix
-```
+```text
 N        p50    p95    p99    err%    worst-pending-delta             verdict
 1000     12     45     89     0.00%   broadcast-worker +12             PASS
 2000     14     58     112    0.00%   broadcast-worker +34             PASS
@@
 ANSWER: N = 10000 (last passing step)
         Next limit: broadcast-worker consumer (pending growth)
</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 212-212: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md` around
lines 212 - 222, The fenced output block containing the performance table and
ANSWER (the block that starts with "N        p50    p95..." and ends with "Next
limit: broadcast-worker consumer (pending growth)") needs a language tag to
satisfy MD040; change the opening fence from ``` to ```text so the block is
labeled (e.g., ```text ... ```). Ensure only the opening fence is updated and
keep the block content unchanged.

Comment on lines +170 to +181
// Reset clears all per-step counters and sample slices.
// Called at the end of warmup so the hold window starts fresh.
func (c *Collector) Reset() {
c.mu.Lock()
defer c.mu.Unlock()
c.e1 = c.e1[:0]
c.e2 = c.e2[:0]
c.byReqID = make(map[string]publishEntry)
c.byMsgID = make(map[string]publishEntry)
c.attempted.Store(0)
c.failed.Store(0)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reset should also clear multiplexDrops.

Reset() is documented as clearing per-step counters, but multiplexDrops is left untouched. This makes drop counts cumulative across steps and skews per-step metrics after the first hold window.

[suggested fix]

Patch
 func (c *Collector) Reset() {
 	c.mu.Lock()
 	defer c.mu.Unlock()
 	c.e1 = c.e1[:0]
 	c.e2 = c.e2[:0]
 	c.byReqID = make(map[string]publishEntry)
 	c.byMsgID = make(map[string]publishEntry)
+	c.multiplexDrops.Store(0)
 	c.attempted.Store(0)
 	c.failed.Store(0)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Reset clears all per-step counters and sample slices.
// Called at the end of warmup so the hold window starts fresh.
func (c *Collector) Reset() {
c.mu.Lock()
defer c.mu.Unlock()
c.e1 = c.e1[:0]
c.e2 = c.e2[:0]
c.byReqID = make(map[string]publishEntry)
c.byMsgID = make(map[string]publishEntry)
c.attempted.Store(0)
c.failed.Store(0)
}
// Reset clears all per-step counters and sample slices.
// Called at the end of warmup so the hold window starts fresh.
func (c *Collector) Reset() {
c.mu.Lock()
defer c.mu.Unlock()
c.e1 = c.e1[:0]
c.e2 = c.e2[:0]
c.byReqID = make(map[string]publishEntry)
c.byMsgID = make(map[string]publishEntry)
c.multiplexDrops.Store(0)
c.attempted.Store(0)
c.failed.Store(0)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/collector.go` around lines 170 - 181, Reset currently clears
per-step counters but leaves multiplexDrops untouched; update Collector.Reset to
clear multiplexDrops as well (under the existing c.mu lock) by setting it to its
zero value—e.g., if multiplexDrops is an atomic counter call
c.multiplexDrops.Store(0), or if it’s a plain integer or map assign the
appropriate zero/nil (c.multiplexDrops = 0 or c.multiplexDrops = nil/make(...))
so per-step drop counts are not cumulative; place this change alongside the
other resets in Reset().

Comment on lines +43 to +125
func TestSendMessage_PublishesToFrontdoor(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a", "room-b"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
err := sendMessage(ctx, u, "hello")
require.NoError(t, err)
require.Len(t, c.pubs, 1)
got := c.pubs[0]
require.True(t, got.Subj == subject.MsgSend("user-1", "room-a", "site-test") ||
got.Subj == subject.MsgSend("user-1", "room-b", "site-test"))
var req model.SendMessageRequest
require.NoError(t, json.Unmarshal(got.Data, &req))
require.Equal(t, "hello", req.Content)
}

func TestReadReceipt_Publishes(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
err := readReceipt(ctx, u, "msg-1")
require.NoError(t, err)
require.Len(t, c.pubs, 1)
require.Equal(t, subject.MessageRead("user-1", "room-a", "site-test"), c.pubs[0].Subj)
}

func TestRefreshRoomList_Requests(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1"}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
err := refreshRoomList(ctx, u)
require.NoError(t, err)
require.Len(t, c.reqs, 1)
require.Equal(t, subject.UserSubscriptionGetRooms("user-1", "site-test"), c.reqs[0].Subj)
}

func TestScrollHistory_Requests(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
require.NoError(t, scrollHistory(ctx, u))
require.Len(t, c.reqs, 1)
// History fetch goes through MsgGet-style subject — check it includes the roomID.
require.Contains(t, c.reqs[0].Subj, "room-a")
}

func TestMuteToggle_Publishes(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
require.NoError(t, muteToggle(ctx, u))
require.Len(t, c.reqs, 1)
require.Equal(t, subject.MuteToggle("user-1", "room-a", "site-test"), c.reqs[0].Subj)
}

func TestRoomCreate_Requests(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1"}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
require.NoError(t, roomCreate(ctx, u))
require.Len(t, c.reqs, 1)
require.Equal(t, subject.RoomCreate("user-1", "site-test"), c.reqs[0].Subj)
}

func TestMemberAdd_Requests(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
require.NoError(t, memberAdd(ctx, u, "user-2"))
require.Len(t, c.reqs, 1)
require.Equal(t, subject.MemberAdd("user-1", "room-a", "site-test"), c.reqs[0].Subj)
}

func TestThreadReply_Publishes(t *testing.T) {
c := &captured{}
u := &userState{ID: "u-1", Account: "user-1", Rooms: []string{"room-a"}}
ctx := actionCtx{Ctx: context.Background(), Publish: c.publish, Request: c.request, SiteID: "site-test"}
require.NoError(t, threadReply(ctx, u, "parent-msg-1", "reply text"))
require.Len(t, c.pubs, 1)
require.Equal(t, subject.MsgSend("user-1", "room-a", "site-test"), c.pubs[0].Subj)
var req model.SendMessageRequest
require.NoError(t, json.Unmarshal(c.pubs[0].Data, &req))
require.Equal(t, "parent-msg-1", req.ThreadParentMessageID)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Add error-path and boundary-case tests for action handlers.

Current tests are mostly happy path. Please add cases for publish/request failures, marshal failures (where applicable), and empty-room no-op behavior so failure accounting and wrapped errors stay protected.

As per coding guidelines, “Tests must cover: happy path, error paths, edge cases (empty collections, boundary conditions), and invalid input”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily_actions_test.go` around lines 43 - 125, Add negative and
boundary tests for the action handlers (sendMessage, readReceipt,
refreshRoomList, scrollHistory, muteToggle, roomCreate, memberAdd, threadReply)
by extending captured and actionCtx to simulate Publish and Request failures and
marshal errors, plus empty-room no-op cases: write tests that (1) substitute
captured.publish/captured.request to return errors and assert handlers return
wrapped errors, (2) force JSON marshal/unmarshal failures for
sendMessage/threadReply (e.g., use a type or hook that causes json.Marshal to
fail) and assert error propagation, and (3) pass userState with empty Rooms to
handlers that should be no-ops and assert no publishes/requests occur; ensure
each new test checks error messages are preserved (wrapped) and that failure
counts/lengths on captured.pubs/captured.reqs are as expected.

return nil
}
roomID := u.Rooms[a.rand().Intn(len(u.Rooms))]
payload, err := json.Marshal(map[string]string{"messageId": lastMsgID})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Replace map-based JSON payloads with typed pkg/model structs.

These request/publish payloads are built from map[...]..., which makes schema drift easy and bypasses compile-time checks. Please switch them to typed model payloads for read-receipt, room-create, and member-add requests.

As per coding guidelines, “All NATS payloads are JSON — use encoding/json with typed structs from pkg/model, never map[string]interface{}”.

Also applies to: 123-126, 143-143

Comment thread tools/loadgen/daily_actions.go Outdated
if err != nil {
return fmt.Errorf("marshal read-receipt: %w", err)
}
if err := a.Publish(a.Ctx, subject.MessageRead(u.Account, roomID, a.SiteID), payload); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the read-receipt subject, not the message-read subject.

readReceipt currently publishes to subject.MessageRead(...), which is a different RPC path than read-receipt. This can send the event to the wrong consumer contract.

Suggested fix
-	if err := a.Publish(a.Ctx, subject.MessageRead(u.Account, roomID, a.SiteID), payload); err != nil {
+	if err := a.Publish(a.Ctx, subject.MessageReadReceipt(u.Account, roomID, a.SiteID), payload); err != nil {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err := a.Publish(a.Ctx, subject.MessageRead(u.Account, roomID, a.SiteID), payload); err != nil {
if err := a.Publish(a.Ctx, subject.MessageReadReceipt(u.Account, roomID, a.SiteID), payload); err != nil {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily_actions.go` at line 77, Replace the wrong publish subject
in the read-receipt publish call: in the a.Publish invocation that currently
uses subject.MessageRead(u.Account, roomID, a.SiteID), change the subject to the
read-receipt subject (e.g., subject.ReadReceipt(u.Account, roomID, a.SiteID)) so
the event is sent on the correct RPC path; update the single a.Publish call
accordingly in daily_actions.go.

Comment on lines +193 to +199
target := total * 99 / 100
var acc uint64
for i, c := range h.Counts {
acc += c
if acc >= target {
return h.Buckets[i] * 1000
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

P99 target calculation can select an empty bucket for small totals.

target := total * 99 / 100 becomes 0 when total < 100, so the loop may return the first bucket even if its count is zero. That can distort GCPauseP99Ms and incorrectly mark steps inconclusive.

Patch
-	target := total * 99 / 100
+	target := (total*99 + 99) / 100 // ceil(0.99 * total), minimum 1 when total > 0
 	var acc uint64
 	for i, c := range h.Counts {
 		acc += c
 		if acc >= target {
 			return h.Buckets[i] * 1000
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
target := total * 99 / 100
var acc uint64
for i, c := range h.Counts {
acc += c
if acc >= target {
return h.Buckets[i] * 1000
}
target := (total*99 + 99) / 100 // ceil(0.99 * total), minimum 1 when total > 0
var acc uint64
for i, c := range h.Counts {
acc += c
if acc >= target {
return h.Buckets[i] * 1000
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily_verdict.go` around lines 193 - 199, The P99 target
calculation can be zero for total<100, causing the loop in the code that
iterates h.Counts to return h.Buckets[0] even when that bucket has zero count;
change the target computation (the variable target used alongside total,
h.Counts and h.Buckets) to ensure it is at least 1 (e.g., compute target :=
total*99/100 and if target == 0 { target = 1 } or use a math-based ceil to
guarantee at least one sample) so the accumulator logic (acc += c; if acc >=
target { return h.Buckets[i] * 1000 }) never returns due to a zero target and
correctly reflects GCPauseP99Ms and step conclusiveness.

Comment thread tools/loadgen/daily_verdict.go
Comment thread tools/loadgen/daily.go Outdated
Comment thread tools/loadgen/daily.go Outdated
Comment thread tools/loadgen/README.md
@hmchangw hmchangw changed the title docs: daily-IM load scenario design spec feat(loadgen): daily-IM load scenario to find sustainable N May 28, 2026
@hmchangw hmchangw force-pushed the claude/gifted-rubin-ry8HI branch 2 times, most recently from 6669ca2 to 07c9239 Compare May 28, 2026 14:30
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md (1)

212-222: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the console-output fenced block.

Use ```text for this block to satisfy markdown lint (MD040).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md` around
lines 212 - 222, The fenced console-output block showing the table (the
triple-backtick block containing "N p50 p95 p99 err% worst-pending-delta verdict
..." and the ANSWER lines) needs a language tag to satisfy MD040; change the
opening fence from ``` to ```text so the block starts with ```text and leave the
rest unchanged.
tools/loadgen/README.md (1)

224-228: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mark both daily command fences as Bash.

Both new command examples should use ```bash to satisfy MD040.

Also applies to: 240-248

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/README.md` around lines 224 - 228, Update the markdown command
fences in the README examples so they are explicitly Bash: change the two
triple-backtick blocks that contain the make commands (the blocks shown with
make -C tools/loadgen/deploy up / seed PRESET=daily-heavy / run-daily
PRESET=daily-heavy and the other block at lines 240-248) from plain ``` to
```bash so they satisfy MD040; locate the command examples in
tools/loadgen/README.md and replace their opening fences accordingly.
tools/loadgen/daily.go (1)

406-413: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fail the run when multiplex pool initialization fails.

Continuing without multiplex means higher configured N steps may run far below target load, which can produce a false “sustainable N” answer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily.go` around lines 406 - 413, The current code swallows
newMultiplexPool errors (when cfg.MultiplexPoolSize > 0) by logging and
continuing, which must instead cause the run to fail; replace the slog.Error +
mux=nil branch with a hard failure: if this code is inside main, call slog.Error
then os.Exit(1) or use slog.Fatal to terminate; if inside a function that
returns error, return a wrapped error (e.g., fmt.Errorf("multiplex pool init
failed: %w", err)) so the caller stops execution. Ensure you update the branch
that handles newMultiplexPool (the mux variable and error handling) to stop the
run rather than proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/loadgen/daily_pool.go`:
- Around line 148-173: The Add flow can leave provisional subscriptions active
if a later Subscribe fails; update Add to track the provisional subscriptions
created in the first pass (store the nc and subject strings created when calling
nc.Subscribe for subject.RoomEvent(roomID) and for
subject.UserRoomEvent(u.Account)), and if any subsequent Subscribe returns an
error, iterate that tracking slice and call Unsubscribe (or Unsubscribe/Drain as
appropriate on nc) for each stored subscription before returning the error; only
after the second pass completes successfully should you increment p.roomRefs and
add entries to p.dispatch so there is no gap between creating a subscription and
tracking it (refer to Add, p.roomRefs, p.conns, p.nextConn, subject.RoomEvent,
subject.UserRoomEvent, and p.route).

In `@tools/loadgen/daily.go`:
- Around line 81-100: In parseStepList ensure each parsed step is >0 and
strictly increasing by validating (n*mult) > 0 and comparing each computed value
against the previous appended value, returning a descriptive error if a
non-positive or non-increasing value is encountered; update the same validation
in the duplicate parsing block referenced elsewhere so any code that parses
--steps enforces positive, strictly increasing steps and returns an error
instead of silently accepting invalid sequences.
- Around line 433-435: The bare return of err (the snippet with "if err != nil {
return nil, err }") must be replaced with an error wrap that adds descriptive
operation context; change the return to wrap err using fmt.Errorf with a clear
message about the failing operation (for example: fmt.Errorf("failed to
<describe request/operation>: %w", err)), keeping the existing nil return value
and preserving the original err via %w so callers can inspect it.

---

Duplicate comments:
In `@docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md`:
- Around line 212-222: The fenced console-output block showing the table (the
triple-backtick block containing "N p50 p95 p99 err% worst-pending-delta verdict
..." and the ANSWER lines) needs a language tag to satisfy MD040; change the
opening fence from ``` to ```text so the block starts with ```text and leave the
rest unchanged.

In `@tools/loadgen/daily.go`:
- Around line 406-413: The current code swallows newMultiplexPool errors (when
cfg.MultiplexPoolSize > 0) by logging and continuing, which must instead cause
the run to fail; replace the slog.Error + mux=nil branch with a hard failure: if
this code is inside main, call slog.Error then os.Exit(1) or use slog.Fatal to
terminate; if inside a function that returns error, return a wrapped error
(e.g., fmt.Errorf("multiplex pool init failed: %w", err)) so the caller stops
execution. Ensure you update the branch that handles newMultiplexPool (the mux
variable and error handling) to stop the run rather than proceed.

In `@tools/loadgen/README.md`:
- Around line 224-228: Update the markdown command fences in the README examples
so they are explicitly Bash: change the two triple-backtick blocks that contain
the make commands (the blocks shown with make -C tools/loadgen/deploy up / seed
PRESET=daily-heavy / run-daily PRESET=daily-heavy and the other block at lines
240-248) from plain ``` to ```bash so they satisfy MD040; locate the command
examples in tools/loadgen/README.md and replace their opening fences
accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8241471-7084-40cc-b1cf-d2d8b6ac364c

📥 Commits

Reviewing files that changed from the base of the PR and between 700a83b and 07c9239.

📒 Files selected for processing (25)
  • .gitignore
  • docs/superpowers/plans/2026-05-27-daily-im-load-scenario.md
  • docs/superpowers/specs/2026-05-27-daily-im-load-scenario-design.md
  • tools/loadgen/README.md
  • tools/loadgen/collector.go
  • tools/loadgen/daily.go
  • tools/loadgen/daily_actions.go
  • tools/loadgen/daily_actions_test.go
  • tools/loadgen/daily_envelope.go
  • tools/loadgen/daily_envelope_test.go
  • tools/loadgen/daily_integration_test.go
  • tools/loadgen/daily_pool.go
  • tools/loadgen/daily_pool_test.go
  • tools/loadgen/daily_report.go
  • tools/loadgen/daily_report_test.go
  • tools/loadgen/daily_test.go
  • tools/loadgen/daily_user.go
  • tools/loadgen/daily_user_test.go
  • tools/loadgen/daily_verdict.go
  • tools/loadgen/daily_verdict_test.go
  • tools/loadgen/deploy/Makefile
  • tools/loadgen/main.go
  • tools/loadgen/main_test.go
  • tools/loadgen/preset.go
  • tools/loadgen/preset_test.go
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (18)
  • tools/loadgen/daily_integration_test.go
  • tools/loadgen/daily_report_test.go
  • tools/loadgen/main_test.go
  • tools/loadgen/daily_envelope.go
  • tools/loadgen/main.go
  • tools/loadgen/daily_user_test.go
  • tools/loadgen/preset_test.go
  • tools/loadgen/daily_test.go
  • tools/loadgen/daily_pool_test.go
  • tools/loadgen/daily_report.go
  • tools/loadgen/daily_envelope_test.go
  • tools/loadgen/daily_user.go
  • tools/loadgen/daily_actions_test.go
  • tools/loadgen/preset.go
  • tools/loadgen/daily_actions.go
  • tools/loadgen/collector.go
  • tools/loadgen/daily_verdict.go
  • tools/loadgen/daily_verdict_test.go

Comment on lines +148 to +173
// First pass: subscribe to any new room subjects via round-robin conn.
// Track which rooms we subscribed *in this Add* so partial failures can
// be undone. (roomRefs already > 0 means an earlier user already
// subscribed — no new sub needed.)
for _, roomID := range u.Rooms {
if p.roomRefs[roomID] > 0 || len(p.conns) == 0 {
continue
}
nc := p.conns[p.nextConn%len(p.conns)]
p.nextConn++
if _, err := nc.Subscribe(subject.RoomEvent(roomID), p.route); err != nil {
return fmt.Errorf("multiplex subscribe %s: %w", roomID, err)
}
// Mark provisionally with refcount 0 — the second pass below will
// increment it. We don't increment here so a subsequent Subscribe
// failure doesn't leave a dangling subscription.
}

// User-scoped subject for DM broadcasts. Subscribed per-user (no
// refcount needed since UserRoomEvent is scoped to the account).
if len(p.conns) > 0 {
nc := p.conns[p.nextConn%len(p.conns)]
p.nextConn++
if _, err := nc.Subscribe(subject.UserRoomEvent(u.Account), p.route); err != nil {
return fmt.Errorf("multiplex subscribe user %s: %w", u.ID, err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback successful subscriptions when Add fails mid-flight.

If any later Subscribe fails, earlier successful subscriptions from this same call remain active but are never tracked in roomRefs/dispatch. Retries can then double-subscribe the same room and over-route messages.

Proposed fix
 func (p *multiplexPool) Add(u *userState) error {
 	inbox := make(chan *nats.Msg, 128)
 	p.mu.Lock()
 	defer p.mu.Unlock()
+	createdSubs := make([]*nats.Subscription, 0, len(u.Rooms)+1)
+	rollback := func() {
+		for _, s := range createdSubs {
+			_ = s.Unsubscribe()
+		}
+	}

 	for _, roomID := range u.Rooms {
 		if p.roomRefs[roomID] > 0 || len(p.conns) == 0 {
 			continue
 		}
 		nc := p.conns[p.nextConn%len(p.conns)]
 		p.nextConn++
-		if _, err := nc.Subscribe(subject.RoomEvent(roomID), p.route); err != nil {
+		sub, err := nc.Subscribe(subject.RoomEvent(roomID), p.route)
+		if err != nil {
+			rollback()
 			return fmt.Errorf("multiplex subscribe %s: %w", roomID, err)
 		}
+		createdSubs = append(createdSubs, sub)
 	}

 	if len(p.conns) > 0 {
 		nc := p.conns[p.nextConn%len(p.conns)]
 		p.nextConn++
-		if _, err := nc.Subscribe(subject.UserRoomEvent(u.Account), p.route); err != nil {
+		sub, err := nc.Subscribe(subject.UserRoomEvent(u.Account), p.route)
+		if err != nil {
+			rollback()
 			return fmt.Errorf("multiplex subscribe user %s: %w", u.ID, err)
 		}
+		createdSubs = append(createdSubs, sub)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily_pool.go` around lines 148 - 173, The Add flow can leave
provisional subscriptions active if a later Subscribe fails; update Add to track
the provisional subscriptions created in the first pass (store the nc and
subject strings created when calling nc.Subscribe for subject.RoomEvent(roomID)
and for subject.UserRoomEvent(u.Account)), and if any subsequent Subscribe
returns an error, iterate that tracking slice and call Unsubscribe (or
Unsubscribe/Drain as appropriate on nc) for each stored subscription before
returning the error; only after the second pass completes successfully should
you increment p.roomRefs and add entries to p.dispatch so there is no gap
between creating a subscription and tracking it (refer to Add, p.roomRefs,
p.conns, p.nextConn, subject.RoomEvent, subject.UserRoomEvent, and p.route).

Comment thread tools/loadgen/daily.go
Comment thread tools/loadgen/daily.go
Comment on lines +433 to +435
if err != nil {
return nil, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Wrap request errors with operation context.

return nil, err drops call-site context and violates the repo’s error-wrapping rule.

Proposed fix
 		msg, err := pubConn.RequestWithContext(ctx, subj, data)
 		if err != nil {
-			return nil, err
+			return nil, fmt.Errorf("nats request %s: %w", subj, err)
 		}

As per coding guidelines: "Never return bare err or fmt.Errorf("error: %w", err) — always provide descriptive context".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err != nil {
return nil, err
}
msg, err := pubConn.RequestWithContext(ctx, subj, data)
if err != nil {
return nil, fmt.Errorf("nats request %s: %w", subj, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/daily.go` around lines 433 - 435, The bare return of err (the
snippet with "if err != nil { return nil, err }") must be replaced with an error
wrap that adds descriptive operation context; change the return to wrap err
using fmt.Errorf with a clear message about the failing operation (for example:
fmt.Errorf("failed to <describe request/operation>: %w", err)), keeping the
existing nil return value and preserving the original err via %w so callers can
inspect it.

hmchangw pushed a commit that referenced this pull request May 28, 2026
Task-by-task TDD plan: generic ramp engine + verdict + report, reusing
the existing message-send and history open-loop generators, exposed as a
max-rps subcommand. rps-prefixed identifiers avoid collision with PR #234.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha
mliu33 pushed a commit that referenced this pull request Jun 1, 2026
* docs: design spec for loadgen max-rps SLO finder

Mirrors PR #234's step-up-and-hold-under-SLO control loop, ramping target
RPS instead of user count N, to auto-find the max sustainable RPS for the
messages and history loadgen workloads. Standalone on main with
rps-prefixed identifiers so it neither depends on nor collides with #234.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* docs: implementation plan for loadgen max-rps SLO finder

Task-by-task TDD plan: generic ramp engine + verdict + report, reusing
the existing message-send and history open-loop generators, exposed as a
max-rps subcommand. rps-prefixed identifiers avoid collision with PR #234.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): add rps step verdict types and evaluateRPSStep

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* fix(loadgen): address Task 1 review — pointer verdict param, String() and precedence tests

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): add rps ramp engine (parseRPSSteps, runRamp)

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* fix(loadgen): address Task 2 review — cover runRamp branches, overflow guard, cleanups

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): add max-rps report renderer and CSV writer

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* test(loadgen): cover no-Next-limit path and multi-series CSV alignment

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): add Collector.Reset for per-step ramp windows

* feat(loadgen): add messages workload adapter for max-rps

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* fix(loadgen): address Task 5 review — shut metrics server on constructor error, log gather error

- Add shutdownSrv helper and call it on all three subscription-error return
  paths so the metrics HTTP server is not leaked when construction fails.
- Log the Gather() error in snapshotCounters instead of silently discarding it.
- Strengthen TestDiffCounters to assert zero-delta for marshal/gatekeeper/bad_reply.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): add history workload adapter for max-rps

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* feat(loadgen): wire max-rps subcommand into dispatch

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* test(loadgen): integration coverage for max-rps messages ramp

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* docs(loadgen): document max-rps subcommand and add run-max-rps target

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

* docs(loadgen): fix max-rps README + validate history flags

Correct the README "Reading the output" section (INCONCLUSIVE does not
stop the ramp and is not a pass; fix the no-pass ANSWER string), validate
--page-limit/--request-timeout > 0, and document the deliberate
straggler-exclusion in the messages hold-window counter snapshot.

https://claude.ai/code/session_01EdwhSB725x7E4SMLPg4Dha

---------

Co-authored-by: Claude <noreply@anthropic.com>
@hmchangw hmchangw force-pushed the claude/gifted-rubin-ry8HI branch 2 times, most recently from 354715f to a76f537 Compare June 2, 2026 19:15
claude added 19 commits June 3, 2026 01:28
Add directPool which, for each simulated user, opens a dedicated
nats.Conn and Subscribes to every room's broadcast subject. Each
delivery is timestamped and correlated against the shared Collector
via RecordBroadcast so end-to-end (E2) latency can be measured.

Wired into the daily-IM runtime by the next task; symbols carry
//nolint:unused until that wiring lands so this commit stands alone.

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt
Wires the daily-IM runtime end-to-end: prodEnvFactory builds the real
NATS direct/multiplex pools, pending poller, service scraper, and JWT
minting; runDaily is the production entrypoint dispatched from main.go.

Drops //nolint:unused suppressions on symbols now genuinely consumed via
the production path (directPool/multiplexPool + constructors, pollPending,
serviceScraper + Scrape, runStep/activateUsers/stepEnv). startEmitter and
doAction remain suppressed pending the per-user emitter wiring task.

https://claude.ai/code/session_014KJfz9ZA7cCe4rAmfSHHmt
claude added 27 commits June 3, 2026 01:28
Apply correctness and cleanup fixes surfaced by /simplify high-effort review:

Correctness:
- runStep: return Inconclusive on ctx cancel and on pollPending error
  instead of zero-valued StepResult that renderConsole treats as PASS
- multiplexPool.Add: subscribe BEFORE mutating dispatch/refcount, so
  partial-subscribe failures leave the pool consistent
- multiplexPool.Close: stop closing per-user channels — race against
  in-flight route() could panic on send-on-closed; rely on Drain+GC
- multiplexPool.route: hold lock through inbox send so Close can't
  race; drop counter incremented outside lock to avoid contention
- newMultiplexPool: return error instead of panic on connect failure
- runDailyForTest: defer Close on direct+multiplex pools to drain
  connections cleanly between/after runs
- directPool/multiplexPool: also subscribe to UserRoomEvent(account)
  so DM broadcasts (the dominant receive traffic for daily presets)
  are observed
- prodEnvFactory: wire publish/request fns through stepEnv; emitters
  now actually emit actions in production (previously startEmitter was
  dead code with //nolint:unused, making the verdict trivially pass)
- doAction: use env.siteID instead of hardcoded "site-local"
- runDailyForTest: propagate baseCfg.SiteID into fixtures and stepEnv
- prodEnvFactory: drop the hardcoded :9100/metrics URLs — backend
  services don't expose HTTP endpoints, so the scrape was a permanent
  no-op masquerading as a real signal

Cleanup:
- 5s tick in runStep hold loop → single context-aware timer
- maxInt helper → slices.Max
- joinReasons body → strings.Join
- scrapeErrorCounter manual read loop → io.ReadAll

Known limitation documented in preset.go: large-band rooms have
UserCount > LargeRoomThreshold in production fixtures, which causes
message-gatekeeper to block non-thread sends from member-role users.
Daily presets work around this by funneling sends to smaller rooms;
large-band rooms still exercise fan-out via receive-side subscriptions.
`loadgen daily -h` now prints a description of the scenario, SLO
thresholds, receiver topology, preset summary, flag list with units,
and three example invocations.

Also exit cleanly on -h/--help instead of logging a 'parse daily
config error=flag: help requested' line.
100k-user fixtures take minutes to build (O(N*perUser*R) weighted
picker in preset.go); without a log line up front, the silence looked
like a hang. Now you see 'building fixtures users=N' immediately and
'fixtures built rooms=X subscriptions=Y elapsed=Zs' when the picker
finishes, so it's clear whether the process is in fixture build, in
the per-step warmup, or stuck.

The picker itself remains the original weighted-by-remaining-capacity
algorithm; an earlier attempt to swap it for a shuffled slot-bag
preserved the asymptotic complexity win but broke the DM band's
perfect-fit tail (any duplicate-skip burns a slot the last user needs)
and was reverted. A proper fix (stub-pairing for DM band only) is
deferred to a follow-up.
…orting

A misbehaving NATS monitoring endpoint was marking every step
Inconclusive — too strict. Now:

- pollPending retries up to 3x with linear backoff (200ms, 400ms)
  so transient failures are absorbed.
- If the poll still fails after retries, runStep logs a warning and
  drops the pending-growth signal for that step (passes nil
  ConsumerPending to evaluateStep). The other four signals
  (latency, error rate, service errors, self-metrics) still produce
  a real verdict.
- Only ctx.Canceled / DeadlineExceeded from the poller still maps
  to Inconclusive, since that indicates the run is shutting down,
  not a flaky server.
Adds NATS_MONITORING_URL (default http://nats:8222/jsz, the
docker-compose default) to the base config. prodEnvFactory.Build
reads it for the daily-IM pending-growth poller. Lets users running
against non-default infra (host-port-mapped NATS, monitoring on a
different port, etc.) point the poller at the right endpoint without
recompiling.
The DM band's weighted-by-capacity picker was the dominant cost in
BuildFixtures: at N users with bands.DMs=perUser the band has
R = N*perUser/2 rooms, making the per-pick O(R) scan quadratic in N.
At N=100000 (the default --steps tip) the DM band alone was ~3e12
inner-loop iterations, taking hours.

Stub-pairing (configuration model) generates a perUser-regular
bipartite graph in O(N*perUser):

  1. Each user contributes perUser stubs to a flat list.
  2. Shuffle the list once.
  3. Pair consecutive stubs into DM rooms.
  4. Fix self-pairs (stubs[2k]==stubs[2k+1]) by swapping the second
     stub with a later position whose neighbours don't create a new
     self-loop. Self-loops are rare (~perUser expected over the
     whole list).

By construction every user appears in exactly perUser DMs and every
DM has 2 distinct users. The Small/Medium/Large bands keep the
existing weighted picker (they're tractable at typical N -- O(R) per
pick is fine when R is in the thousands).

Locked in by TestBuildFixtures_DailyHeavy_FastAtScale: at N=2000 the
whole BuildFixtures call now completes in ~2s on a developer box; the
test fails the build if it ever regresses past 30s. Existing
TestBuildFixtures_DailyBands (which already asserts the per-user
membership count invariant at N=200) still passes.
Three perf+correctness+ops reviews surfaced 14 findings. Applies the
verdict-affecting and operationally-critical ones; flags the
performance/memory ones as follow-ups in commit messages and code.

Correctness:

- **Deterministic per-user RNG seed.** startEmitter was seeded from
  time.Now().UnixNano() XOR len(u.ID) — len is constant for all users
  ("u-NNNNNN"), so same-nanosecond activations got identical seeds
  and emitted perfectly correlated action streams. Replaced with a
  splitmix mix of env.runSeed and the user index, so two runs at the
  same run-seed produce identical action streams (the whole point of
  a reproducible verdict).

- **Per-step envelope re-anchor.** Emitters used to capture holdStart
  at activation time, so users from step N kept emitting against
  step N's curve all the way through step N+1 — by step 2 the early
  half of the user population was firing at envelope baseline
  (0.4×rate) instead of following the new step's hold curve, making
  later steps' "PASS" verdicts wrong. Now stepEnv carries holdStart
  and holdDuration as atomic.Int64 fields; emitters re-read them on
  every tick; runStep calls env.setHold(now, hold) right before
  Reset() at the start of each hold.

- **Post-activation hold anchor.** holdStart was computed *before*
  activateUsers, which is rate-limited at 500 users/sec. At a +50k
  delta this meant 100s of activation, mostly during what was
  supposed to be warmup, with fresh users emitting at the envelope's
  clamped edge instead of ramping. The re-anchor above also fixes
  this — hold begins after warmup actually elapsed.

- **percentile uses ceil-based nearest-rank.** Old floor-based
  indexing (int(p*(len-1))) returned p98 for p99 on ≤100-sample
  windows, so tight smoke tests false-PASSed. Now uses
  ceil(p*len)-1, matching the standard.

- **diffPending walks both sides.** A consumer that disappears
  mid-hold (crashed/deleted) used to be silently dropped from the
  verdict. Now surfaces as End=0/negative-Delta and trips with
  reason "&lt;durable&gt; disappeared mid-hold".

- **HTTP timeout on /jsz.** pollPendingOnce used http.DefaultClient
  with no timeout; a hung monitoring endpoint would wedge the whole
  run until the operator killed the process. Dedicated 5s-timeout
  client.

- **ctx-aware ramp loop.** runDailyForTest now checks ctx.Err()
  between steps, so SIGINT mid-cooldown doesn't produce a junk trail
  of INCONCLUSIVE rows for steps that never started.

Operational guards:

- **Zero-traffic → INCONCLUSIVE.** Publisher conn failure or
  unwired emitters used to produce a silent PASS at every step. Now
  evaluateStep returns INCONCLUSIVE with "zero actions attempted"
  when AttemptedOps==0.

- **Effective-N tracking + &lt;95% threshold.** activateUsers
  increments env.activatedCount/skippedCount atomically; if
  EffectiveN/N &lt; 95% the step returns INCONCLUSIVE with the
  shortfall in the reason. StepResult.EffectiveN is surfaced in
  console (e.g. "20000(10000)") and added as a CSV column. Stops
  "N=20000 PASS" silently meaning "10000 active".

- **Goroutine-as-CPU proxy disabled.** readCPUPercent's heuristic
  (NumGoroutine/5000 × 100) tripped INCONCLUSIVE at any scale ≥4k
  users — exactly the regime this tool is built for. Now returns 0;
  INCONCLUSIVE relies on the GC pause signal alone until a real CPU
  sample (gopsutil/proc-self-stat) is wired.

Follow-ups (not in this commit):

- **Perf.** Collector single-mutex contention (~520k locks/sec at
  N=100k), unbounded byReqID/byMsgID growth (spec promised TTL+1M
  cap), per-user inbox channels that nobody reads, small/medium band
  O(N²) picker, one-goroutine-per-user (100k goroutines), per-tick
  actionCtx allocation. Each is a real win but a non-trivial
  refactor; queued for the next pass.

- **Ergonomics.** Configurable SLO thresholds + seed flags,
  heartbeat between warmup and verdict, ABORTED marker on ctx-cancel,
  full per-durable CSV columns, --validate dry-run.
Two perf fixes flagged by the post-review pass.

preset.go: small/medium/large band fixture picker switched from the
O(N x perUser x R) weighted-scan to a configuration-model slot-bag
walk with retry on per-user duplicates. Same shape as the DM band's
stub-pairing but generalised for heterogeneous room capacities.
N=10000 daily-heavy fixture build drops from minutes to ~1s; N=100k
drops from ~30+ min to seconds. TestBuildFixtures_DailyHeavy_FastAtScale
bumped to N=10000 to lock the speedup in.

collector.go: byReqID/byMsgID/e1/e2 sharded by FNV-1a hash of the
correlation key across 64 shards (reqShard for requestID-keyed,
msgShard for messageID-keyed; each has its own mutex). At N=100k a
busy daily-IM run produces ~150k+ Record* calls/sec, all of which
previously serialised on a single sync.Mutex — loadgen became the
bottleneck and inflated latency samples before the backend tripped.
Per-shard contention drops by ~64x.

Determinism preserved: emit-as-you-pick instead of range-over-map
(Go map iteration order is randomised), and the slot-bag uses the
same seeded r.Intn sequence both runs. TestBuildFixtures_DailyBands
still passes its same-seed -> equal Fixtures invariant.
…set)

PR #240 (max-rps) landed on main with its own waitOrCancel (in ramp.go,
error-returning) and a Reset() method on the unsharded Collector. Both
collided with the daily-IM scenario's versions:

- daily.go's waitOrCancel returned bool; replaced its callers to use the
  error-returning ramp.go helper.
- collector.go's PR #240 Reset referenced the pre-sharded fields (c.mu,
  c.byReqID, etc.) that no longer exist; the sharded Reset further down
  in the file does the same job correctly under per-shard locks.

Build, lint, and unit tests green after the reconciliation.
TestRunDaily_Integration_TinyPresetPasses passed vacuously before the
recall-review fix wired emitters into prodEnvFactory (no actions = no
errors = trivial PASS). With emitters live, the test now actually
publishes to subjects like `chat.user.{acct}.room.{room}.{site}.msg.send`
and request-replies on history/room-list/etc. — all of which time out
in the CI integration job because the testutil NATS container has no
gatekeeper, no room-service, and no broadcast-worker subscribed.
Result: error_rate=7.69% > 0.1% threshold, step TRIPs, test FAILs.

The wiring is correct (proven by unit + race tests); what's wrong is
that this scenario fundamentally needs the full docker-compose stack
to be meaningful. Skipping with a comment pointing operators at
`make -C tools/loadgen/deploy run-daily` for real end-to-end coverage.
…fore Add returns

CI hit: TestMultiplexPool_RoutesBroadcastToInbox 'Condition never
satisfied'. Root cause: nc.Subscribe queues the SUB command client-
side; it's not sent to the broker until the next Flush. A caller (or
test) that publishes immediately after Add() can see the broadcast
dropped because the server hasn't registered the subscription
interest yet.

Production emitters tick on a 1s schedule so they never publish during
the registration race window, but the test does — explaining why the
test was reliably failing only under CI's timing.

Fix in both pools (direct + multiplex): call nc.Flush() after the
Subscribe pass so Add only returns once the server knows about every
new interest. One round-trip per pool conn at activation time,
dominated by the Subscribe overhead already incurred.
Bug: daily.go's prodEnvFactory and daily_pool.go's pool constructors
called bare nats.Connect(url, nats.Name(...)) with no credentials. On
operator-mode NATS servers (the chat-system standard, including the
local docker-local stack) anonymous connections either fail outright
or get default-deny perms — manifesting as 'permissions violation' on
subscribe and the silent-zero-latency symptom reported here.

Every other loadgen subcommand correctly used
natsutil.Connect(cfg.NatsURL, cfg.NatsCredsFile); only the daily
scenario was missing the second arg.

Fix:
- New connectWithCreds(url, name, credsFile) helper applies
  nats.UserCredentials when credsFile is non-empty, else falls back
  to anonymous (testcontainer NATS in unit/integration tests).
- newDirectPool, newMultiplexPool, and the prodEnvFactory publisher
  conn all route through it.
- baseCfg.NatsCredsFile (existing env var NATS_CREDS_FILE) flows in
  through prodEnvFactory; the docker-compose overlay already wires
  NATS_CREDS_FILE=/etc/nats/backend.creds.

After this fix: 'make -C tools/loadgen/deploy run-daily PRESET=...'
authenticates as the backend user, subscribe perms work, and latency
samples populate — P50/P95/P99 should be non-zero.
…eply

room-service registers MessageReadWildcard via nc.QueueSubscribe and
unconditionally calls m.Msg.Respond(resp) (room-service/handler.go:83).
Respond requires msg.Reply to be set, which nc.Publish does not set —
only nc.Request does.

The daily readReceipt action was using a.Publish, so every read-receipt
action triggered 'nats: message does not have a reply' in room-service's
log. Switched to a.Request with the same defaultRequestTimeout the other
request-shaped actions use, and renamed/updated the unit test to lock
in the contract (must land in c.reqs, not c.pubs).

Verified the other six action handlers (scrollHistory, refreshRoomList,
muteToggle, roomCreate, memberAdd, threadReply) — they all use Request
against the right routes, so readReceipt was the sole mismatch.
…es Request

Two unrelated bugs that bit during real runs:

1. readReceipt was using a.Publish for subject.MessageRead, but
   room-service registers that subject via QueueSubscribe and calls
   msg.Respond on every received message — which fails with 'nats:
   message does not have a reply' when the sender used fire-and-forget
   Publish. Switched to a.Request (matches the pattern of every other
   request-side action handler). Updated the test.

2. runDaily started straight into BuildFixtures + a full ramp even
   when the Mongo subscriptions collection was empty for the configured
   siteID. The gatekeeper then rejected every send with 'user X is not
   subscribed to room Y' and the operator burned ~30+ minutes before
   seeing the error_rate trip. Added a preflight that opens a short-
   lived Mongo connection, counts subscriptions for cfg.SiteID, and
   bails with an actionable message:

     ERROR daily pre-flight error="no subscriptions found in mongo
     for siteID="site-local"; run \`loadgen seed --workload=messages
     --preset=<your daily preset>\` first (or \`make -C
     tools/loadgen/deploy seed PRESET=<your preset>\`)"

   Exit code 2 (config error). Uses a 10s context independent of the
   run-level ctx so a transient Mongo blip doesn't burn the window
   before failing.
…atch with seed

Symptom: gatekeeper rejects every send with "user X is not subscribed
to room Y" even though `loadgen seed` was just run successfully.

Root cause: BuildFixtures is deterministic in (preset, seed, siteID).
`loadgen seed --workload=messages --preset=daily-heavy` uses the
preset's default Users (10000). But runDailyForTest was overriding
preset.Users = slices.Max(cfg.Steps) before calling BuildFixtures.
If max(steps) != preset.Users, the per-band stub shuffle produces
DIFFERENT user/room/sub IDs at run time vs seed time — same prefixes
("u-000123", "room-small-000045"), different totals → different
permutation → no overlap. Gatekeeper's GetSubscription misses every
lookup and the whole ramp fails on error_rate.

Fix: drop the override. runDailyForTest now uses preset.Users as-is,
matching `loadgen seed`. activateUsers already caps at
len(env.users), and the EffectiveN-shortfall guard surfaces
"only X/Y users activated" as INCONCLUSIVE when --steps exceeds
preset.Users — much clearer than silent ID drift.

Side effect: --steps cannot exceed preset.Users for that preset.
For daily-light/heavy/power that's 10000. A startup log line
warns when max(steps) > preset.Users so the operator sees the cap
before the run starts.

No re-seed required after deploying the fix.
Replace the 53-line stub section with a full operator guide that
captures every lesson from real runs:

- Quick start + prerequisites (seed step, NATS perms, env vars)
- Preset comparison with rooms-per-user breakdown
- Complete CLI flag reference (units, defaults, when to change)
- Environment variables (NATS_CREDS_FILE, NATS_MONITORING_URL,
  SITE_ID, plus the mongo/valkey vars)
- SLO signals with thresholds; PASS/TRIP/INCONCLUSIVE semantics
- Sample console + CSV output, how to read 'ANSWER' and
  N(EffectiveN) annotations
- Troubleshooting matrix covering every failure mode encountered:
  not-subscribed, siteID mismatch, large-room block,
  'no reply' from room-service, NATS permissions, all-zero
  latency, EffectiveN cap, fixture build silence, flaky /jsz
- Known limitations explicitly listed (large-room block, no-op
  auth stub, dormant service-error scraper, disabled CPU proxy)
- Pointers to spec, plan, and source files

Aims to make the tool self-serviceable: a new operator who has
never used it should be able to get to a verdict by following
the doc, and an experienced operator hitting an error should
find the fix in the troubleshooting matrix without escalating.
Adds observational per-handler latency stats so operators can see what
each daily action (send / read_receipt / scroll_history / refresh_room_list
/ member_add / room_create / mute_toggle) actually costs, in addition to
the system-wide publish→broadcast p50/p95/p99 the verdict already uses.

- Collector gains per-action latency sample tracking (map[int][]Duration
  under its own mutex; cleared by Reset alongside the existing counters).
- doAction wraps each handler call with time.Now/Since and records the
  elapsed via RecordActionLatency.
- actionKind grows a String() helper for stable lowercase names and
  an allActionKinds canonical-order slice for the report.
- stepInputs/StepResult carry ActionLatencies; evaluateStep computes
  Count + P50/P95/P99 per action via the existing percentile helper.
- Console prints 'actions: send n=8920 p50=12 p95=180 p99=320 | ...'
  on the line after each step's verdict.
- CSV adds 28 columns (<action>_count, _p50_ms, _p95_ms, _p99_ms × 7
  actions) in stable allActionKinds order so downstream tools can
  column-index reliably.

Verdict criteria unchanged: these stats are observational only — PASS
/ TRIP / INCONCLUSIVE still depends on the existing latency, error-rate,
pending-growth, and self-metrics signals. Per the operator's request
(too easy to confuse a single 'all actions sluggish' p95 with a real
broadcast-pipeline regression if we trip on per-action numbers).
The daily presets hard-code Users=10000. Operators wanting to ramp
above that previously had to edit preset.go and rebuild. Now both
`loadgen seed --workload=messages` and `loadgen daily` accept
`--users=N` to override at invocation time.

The two MUST be passed the same value — BuildFixtures is deterministic
in (preset, seed, siteID), and the per-band stub shuffle uses totalUsers
as length, so a mismatch produces entirely different room/sub IDs even
though the user IDs (u-000000, u-000001, ...) overlap. The gatekeeper
then rejects every send with "user not subscribed".

To catch that misuse fast, the daily preflight now also counts users
in Mongo and compares to preset.Users (or the --users override). On
mismatch it errors with the exact teardown+re-seed commands needed.

Usage at, say, N=50000:

  loadgen seed  --workload=messages --preset=daily-heavy --users=50000
  loadgen daily --preset=daily-heavy --users=50000 --steps=5000,10000,20000,50000

`--users=0` (default) means "use preset default", the safe path for
normal runs and existing scripts.
room-service extracts X-Request-ID from inbound NATS headers via
natsutil.ContextWithRequestIDFromHeaders (room-service/handler.go:60).
The ID flows through ctx into the canonical event room-service publishes
to room-worker's ROOMS stream. room-worker REQUIRES X-Request-ID on
canonical events for idempotency (it's the dedup seed) and rejects with
'missing X-Request-ID' otherwise (room-worker/handler.go:752,1219).

Before this fix, prodEnvFactory's publish/request closures used
pubConn.Publish(subj,data) and pubConn.RequestWithContext(ctx,subj,data) —
neither of which sets headers. Empty inbound header at room-service →
empty ctx → empty canonical header → room-worker logs the error and
drops the event. Visible in room-worker logs as 'missing-X-request-id'
on subject 'chat.room.canonical.<site>.member.add'.

Now both closures build *nats.Msg with a fresh
natsutil.RequestIDHeader (X-Request-ID) per call and use PublishMsg /
RequestMsgWithContext. Every emitter action now produces a properly
traceable request through the canonical pipeline.

The header ID is independent of any payload-level request ID
(e.g. SendMessageRequest.RequestID for the gatekeeper) — both serve
different layers. In production a real client typically uses the same
value at both layers; we could plumb that through actionCtx as a
follow-up if request-tracing across layers needs it.
room-service rejects member-add on DM rooms with 'cannot add members to
a non-channel room' (DMs have a fixed 2-member shape). For daily-heavy
~25 DMs out of 56 rooms per user, picking uniformly from u.Rooms means
~45% of memberAdd actions error out — inflating error_rate noise and
spamming the room-service log.

Pre-filter at activation time: userState.ChannelRooms is the subset of
Rooms that aren't DMs, computed once in newUserState by ID-prefix
('room-dm-' = DM band, everything else = channel band). memberAdd now
picks from u.ChannelRooms instead of u.Rooms. Users with no channel
rooms (DM-only) no-op the action rather than erroring.

The other actions (sendMessage, readReceipt, scrollHistory,
refreshRoomList, muteToggle, roomCreate, threadReply) work fine on DMs
and continue using u.Rooms. Only memberAdd has the channel-only
constraint at the room-service layer.

Tests:
- TestNewUserState_ChannelRoomsExcludesDMs locks the prefix-filter
- TestMemberAdd_Requests now seeds ChannelRooms explicitly
- TestMemberAdd_SkipsWhenNoChannelRooms covers the DM-only no-op path
…l account

Two related fixes for actions that hit room-service:

roomCreate sent only {name, type}. classifyAndValidate at
room-service/handler.go:193 lets that through (name is non-empty), but
the channel branch at line 290 then rejects it:
    allUsers := stripAccount(...)
    if len(allUsers) == 0 && len(allOrgs) == 0 {
        return nil, errEmptyCreateRequest
    }
A channel must have at least one member besides the creator. Now we
include u.Neighbor in the 'users' array. errEmptyCreateRequest gone.

memberAdd was passing the literal 'user-stub' as the target account.
room-service validates accounts exist (errUserNotFound) — 'user-stub'
isn't in Mongo, so every memberAdd that reached a channel was failing
silently after the channel-only filter. Now uses u.Neighbor, which is
guaranteed-seeded.

userState.Neighbor is computed once in newUserState by shifting the
requester's index by 1 ('user-N' → 'user-(N-1)' for N≥1, 'user-0' →
'user-1'). Always != Account; always within the seeded range for any
preset with Users ≥ 2 (i.e. all daily-* presets).

Tests:
- TestNeighborOf covers shift / wrap / fallback
- TestRoomCreate_Requests asserts the payload now has a users array
- TestMemberAdd_Requests / SkipsWhenNoChannelRooms unchanged shape
The 'timeout' argument to actionCtx.Request was silently ignored:
RequestMsgWithContext uses the context's deadline, and the emitter's
ctx is the run-level ctx with no deadline. So defaultRequestTimeout
(5s) never bound anything; a slow room-service handler could hang
arbitrarily long, showing up as huge per-action p50 (e.g. 25s for
read_receipt) instead of cleanly timing out and contributing to
error_rate.

Wrap the incoming ctx with context.WithTimeout(timeout) so each call
gets a real deadline. Action handlers continue to pass
defaultRequestTimeout=5s; future per-action overrides can vary the
value without code changes.

This makes the per-action latency numbers a true latency signal: a
healthy handler shows actual round-trip ms; a sick one shows a wall
at the timeout value plus a matching jump in failed_ops / error_rate.
Previously the verdict only gated on the publish→broadcast latency (the
messaging pipeline). Operations like read_receipt, scroll_history,
member_add, refresh_room_list — all of which exercise different parts
of the system (room-service, history-service, room-worker) — only
contributed to error_rate when they timed out, never to latency.
A slow handler that stayed under the request timeout was invisible.

Now Thresholds carries per-action p95/p99 maps; evaluateStep iterates
ActionLatencies and trips if any gated action exceeds its cap. Default
thresholds reflect typical chat-backend handler latencies:

  read_receipt:      p95=100ms  p99=250ms   (mongo update + reply)
  refresh_room_list: p95=200ms  p99=500ms   (mongo query per user)
  scroll_history:    p95=500ms  p99=1500ms  (cassandra bucket walk)
  member_add:        p95=200ms  p99=500ms   (mongo + canonical pub)
  mute_toggle:       p95=100ms  p99=250ms   (mongo update)
  room_create:       p95=500ms  p99=1500ms  (mongo + canonical pub)

sendMessage/threadReply are not gated here — they're already gated by
the broadcast-pipeline P95LatencyMs / P99LatencyMs at the top of the
verdict (the latency the existing p50/p95/p99 columns reflect).

Operator can tune via --action-p95-ms / --action-p99-ms flags:
  loadgen daily --preset=daily-heavy \
    --action-p95-ms=read_receipt:80,scroll_history:300 \
    --action-p99-ms=read_receipt:200,scroll_history:800
Format: comma-separated name:N pairs. Unset actions keep defaults;
unknown names are rejected at parse time. Set N to a very large
number to effectively disable an action's gate.

Trip reasons name the action explicitly:
  scroll_history p95=612ms > 500
  read_receipt p99=340ms > 250
The action marks the user's subscription as read (msg.read handler).
"read_receipt" invited confusion with the separate MessageReadReceipt
notification path; "mark_read" names what the action actually does.

Breaking for existing CSV consumers and CLI users: the per-action
column names change from read_receipt_* to mark_read_*, and
--action-p95-ms / --action-p99-ms must use "mark_read" as the key.
…arge

history-large materialized a single 65M-row plannedMessage slice (~50 GB
of strings), so seeding got SIGKILLed before reaching Cassandra.
BuildHistoryFixtures now keeps only the Fixtures + ThreadParents map and
hands the seeder an IterateRoomMessages iterator that builds one room's
plan at a time (~50 MB peak). FullPlan stays available for tests and
small/medium presets where materialization is bounded.

Per-room randomness is split into two streams — structural (sender,
jitter, parent-selection, reply offsets) and content (message body
bytes) — so a cheap metadata-only walk can derive Room.LastMsgAt and
ThreadParents without paying the O(MessagesPerRoom × ContentBytes) cost
of regenerating content.

CreatedAt and content bytes for a given (seed, preset) are different
from before — no test asserts on either, but anyone diffing a re-seed
against an old Cassandra dump will see the change.
@hmchangw hmchangw force-pushed the claude/gifted-rubin-ry8HI branch from af5b086 to 7829660 Compare June 3, 2026 01:30
@mliu33 mliu33 merged commit 806f593 into main Jun 3, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants