Merged
22 commits
fbb77ca
docs: design spec for room-member add load test
claude May 19, 2026
70a788f
docs(loadgen): add implementation plan for room-member load test
claude May 19, 2026
5d8766c
feat(subject): add RoomMemberEventWildcard helper
claude May 19, 2026
0f389b5
feat(loadgen): add member-add Prometheus collectors
claude May 19, 2026
884a167
feat(loadgen): add Shape enum + inject/shape validation
claude May 19, 2026
c0d0929
feat(loadgen): add MembersPreset + builtin members presets
claude May 19, 2026
60c295f
feat(loadgen): add BuildMembersFixtures with per-room candidate pools
claude May 19, 2026
98c418e
feat(loadgen): add OwnersByRoom helper
claude May 19, 2026
d38269f
feat(loadgen): add canonical member publisher
claude May 19, 2026
da3da91
feat(loadgen): add frontdoor member publisher with reply correlation
claude May 19, 2026
66af9f4
feat(loadgen): add member-add E1/E2 collector
claude May 19, 2026
fe50675
feat(loadgen): add SustainedMembersGenerator
claude May 19, 2026
a51b758
feat(loadgen): add CapacityMembersGenerator with per-room ack signal
claude May 19, 2026
8be678e
feat(loadgen): add MembersSummary + CapacitySummary printers
claude May 19, 2026
7e48457
feat(loadgen): add --workload flag to seed and teardown
claude May 19, 2026
6f2ac8b
feat(loadgen): add members-sustained subcommand
claude May 19, 2026
53bf08c
feat(loadgen): add members-capacity subcommand
claude May 19, 2026
eebb317
docs(loadgen): document members workload + Makefile targets
claude May 19, 2026
866a956
test(loadgen): add members-sustained end-to-end integration test
claude May 19, 2026
c83b461
test(loadgen): use errors.Is for ErrPoolsExhausted assertion
claude May 19, 2026
074992d
refactor(loadgen): apply simplify-pass cleanups
claude May 19, 2026
eee926b
refactor(loadgen): cache presetLabel + doc shared snapshotLatencies
claude May 20, 2026
3,456 changes: 3,456 additions & 0 deletions docs/superpowers/plans/2026-05-19-load-test-room-members.md

Large diffs are not rendered by default.

320 changes: 320 additions & 0 deletions docs/superpowers/specs/2026-05-19-load-test-room-members-design.md
@@ -0,0 +1,320 @@
# Load test: adding members to rooms

## Goal

Benchmark the add-member path end-to-end on a single site, covering both
sustained throughput and large-room capacity behavior. Reuses the existing
`tools/loadgen` binary, fixtures, metrics server, and Prometheus/Grafana
overlay.

The pipeline under test:

```
client
→ chat.user.{account}.request.room.{roomID}.{siteID}.member.add
→ room-service.handleAddMembers (auth, capacity, dedup, channel expansion)
→ chat.room.canonical.{siteID}.member.add (ROOMS stream)
→ room-worker (resolve, write subscriptions, emit broadcast + outbox)
→ chat.room.{roomID}.event (member_added RoomEvent)
```
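The first two hops of the pipeline above are plain subject templates. As a minimal sketch of how the load generator could fill them in (the helper names `memberAddSubject` and `canonicalMemberAddSubject` are hypothetical; the real builders are expected to live in `pkg/subject`):

```go
package main

import "fmt"

// memberAddSubject is a hypothetical helper that fills the frontdoor request
// subject template from the pipeline above.
func memberAddSubject(account, roomID, siteID string) string {
	return fmt.Sprintf("chat.user.%s.request.room.%s.%s.member.add", account, roomID, siteID)
}

// canonicalMemberAddSubject is a hypothetical helper for the ROOMS-stream
// subject that room-service publishes to after validating the request.
func canonicalMemberAddSubject(siteID string) string {
	return fmt.Sprintf("chat.room.canonical.%s.member.add", siteID)
}

func main() {
	fmt.Println(memberAddSubject("alice", "r1", "site-a"))
	fmt.Println(canonicalMemberAddSubject("site-a"))
}
```

The frontdoor subject carries the requesting account, while the canonical subject is keyed only by site, which is why the canonical inject mode has to carry the requester inside the payload.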
Comment on lines +12 to +19

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

The fences in these sections are missing language hints. Please annotate them (for example text, bash, or go) to satisfy markdownlint and keep docs tooling consistent.

Also applies to: 50-55, 238-251, 299-315

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 12-12: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 12 - 19, Several fenced code blocks (for example the block that begins
with "client → chat.user.{account}.request.room.{roomID}.{siteID}.member.add →
room-service.handleAddMembers ..." and the other blocks at ranges 50-55,
238-251, 299-315) are missing language identifiers; edit those fenced code
blocks in docs/superpowers/specs/2026-05-19-load-test-room-members-design.md and
add an appropriate language tag (e.g., `text`, `bash`, or `go`) after the
opening backticks so each block is annotated for markdownlint/MD040 and tooling.


## Non-goals

- Cross-site channel expansion. Single site only.
- CI regression gate. Invoked manually.
- Auth benchmark. Uses shared `backend.creds` like existing loadgen.
- Cross-machine absolute-number comparisons. Within-machine A/B only.

## v1 scope

The first cut implements `--shape=users` only. The `Shape` enum, flag,
and validation are in place from the start so adding `orgs` / `channels` /
`mixed` is a small follow-up plan, but the v1 plan does not seed an org
pool or a source-channel pool, and the request builder only emits
user-account lists.
Comment on lines +30 to +34

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify the v1 shape contract to avoid unsupported runtime paths.

Line [30] says v1 is shape=users only, but Lines [140-142] and [279-282] describe active multi-shape runtime behavior/tests. Please make the contract explicit: either reject non-users shapes at parse time in v1, or mark those sections as post-v1 only.

Also applies to: 140-142, 279-282

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 30 - 34, The doc claims v1 implements only shape=users but later sections
describe multi-shape behavior; update the spec to make the v1 contract explicit
by either (A) adding a clear parse-time rejection for non-users shapes
(referencing the Shape enum and the --shape flag and validation logic) and
calling out that the request builder (which currently only emits user-account
lists) will error on orgs/channels/mixed, or (B) marking the multi-shape
behavior in sections referencing runtime/tests (lines describing multi-shape) as
"post-v1" only; pick one approach and apply it consistently to the Shape enum,
--shape flag/validation, and the request builder description so there are no
unsupported runtime paths implied.


Rationale: E2 correlation by `(roomID, sortedSentAccounts)` works
trivially for `shape=users` because room-worker's `actualAccounts` matches
what the tool sent (the candidate pool guarantees no overlap with existing
members). For `shape=orgs` and `shape=channels`, room-worker re-resolves
the expansion on its own, so the tool would have to pre-resolve and track
the expected accounts per request. That is workable, but it doubles the
fixture-builder surface area. Defer both shapes until the users-shape
numbers raise a question that needs them.

## Architecture

Two new subcommands on `tools/loadgen`, sharing the existing NATS connect,
Mongo/Valkey wiring, metrics server, percentile/CSV writers, signal
handling, and consumer-lag sampler:

```
loadgen members-sustained --preset --rate --duration --warmup \
    --inject --users-per-add --shape [--csv] [--e2-timeout]
loadgen members-capacity --preset --target-size \
    --inject --users-per-add --shape [--max-rate] [--csv] [--e2-timeout]
```

Flag defaults: `--users-per-add=10`, `--shape=users`, `--inject=frontdoor`,
`--warmup=10s`, `--duration=60s`, `--rate=100`, `--e2-timeout=30s`. No
default for `--target-size` (required on `members-capacity`).

The existing `seed` and `teardown` subcommands gain `--workload=messages|members`
(default `messages`, for back-compat) so the same binary can stage either
workload's fixtures. Re-seeding between runs (the chosen sustained-mode
strategy for handling permanent room growth) is `teardown` + `seed` wrapped
in a Makefile target.

## Inject modes

### Frontdoor (`--inject=frontdoor`)

- `nc.PublishRequest(memberAddSubject, replyInbox, payload)` with a single
subscription on `_INBOX.loadgen.members.>` for reply correlation. Member-add
uses standard `m.Msg.Respond`, not the bespoke
`chat.user.{account}.response.{reqID}` pattern that msg-send uses, so the
reply path differs from the existing loadgen collector.
- Hits `room-service.handleAddMembers`: subscription lookup, room lookup,
ChannelRef expansion (local only), `CountNewMembers`, capacity check.
- Reply `{"status":"accepted"}` → **E1**.
- Publishes canonical event → room-worker resolves + writes + broadcasts
→ **E2**.

### Canonical (`--inject=canonical`)

- `js.Publish(chat.room.canonical.{siteID}.member.add, payload)` into the
ROOMS stream. Skips room-service entirely.
- Room-worker still does its own resolution (`ListNewMembers` re-resolves
orgs against subscriptions) and the write + broadcast.
- **No E1** — JetStream publish has no reply. Only E2 is recorded. Same
blind spot as the existing loadgen's canonical mode.

### `shape=channels` incompatibility

Channel expansion lives in `room-service.handleAddMembers`. Once the
canonical event is published, `ChannelRefs` are decorative (preserved
for the sys-message payload only); room-worker uses the flat
`Users` / `Orgs` lists. So `--shape=channels --inject=canonical` would
silently add zero members.

**Decision:** reject this combination at flag-parse time with an explicit
error rather than re-implementing channel expansion in the loadgen.
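The parse-time rejection could look roughly like the sketch below. `validateInjectShape` is a hypothetical name, and the set of accepted values is taken from the flags described in this spec; how `mixed` should interact with `canonical` is not stated, so the sketch leaves it permitted.

```go
package main

import "fmt"

// validateInjectShape rejects unknown values and the one combination the spec
// rules out: channel expansion lives in room-service, so a canonical publish
// with shape=channels would silently add zero members.
func validateInjectShape(inject, shape string) error {
	switch inject {
	case "frontdoor", "canonical":
	default:
		return fmt.Errorf("unknown --inject %q (want frontdoor or canonical)", inject)
	}
	switch shape {
	case "users", "orgs", "channels", "mixed":
	default:
		return fmt.Errorf("unknown --shape %q", shape)
	}
	if shape == "channels" && inject == "canonical" {
		return fmt.Errorf("--shape=channels requires --inject=frontdoor: channel expansion happens in room-service")
	}
	return nil
}

func main() {
	fmt.Println(validateInjectShape("canonical", "channels"))
	fmt.Println(validateInjectShape("frontdoor", "channels"))
}
```

Because `both` is not in the accepted inject set, it is rejected as an unknown value, which keeps the flag scalar.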

### `--inject=both`

Not a real mode. Document in the README as "run the same workload twice,
diff the summaries". The flag stays scalar.

## Fixtures & presets

New `MembersPreset` struct, distinct from the message presets:

```go
type MembersPreset struct {
	Name           string
	Users          int      // global user pool
	Rooms          int      // rooms to seed
	BaselineSize   int      // members per room at seed time (incl. owner)
	CandidatePool  int      // unused-but-eligible users tagged per room
	Shape          ShapeMix // {usersFrac, orgsFrac, channelsFrac}
	Orgs           int      // org pool (zero if shape excludes orgs)
	OrgSize        int      // avg members per org
	SourceChannels int      // pool of channel-refs (zero if shape excludes channels)
}
```

Per seeded room: one owner subscription + `BaselineSize-1` regular member
subscriptions, all drawn deterministically from the global user pool.
The remaining `CandidatePool` users per room are *not* subscribed —
they are the pool the generator draws from. `roomID → []candidateAccount`
lives in the in-memory `MembersFixtures` struct, derived deterministically
from `--seed`, never persisted.
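One way to derive the per-room member and candidate lists deterministically from a seed is sketched below. This is an illustration under assumptions, not the `BuildMembersFixtures` implementation: `roomFixture`, `buildFixtures`, and the `user-N` / `room-N` naming are hypothetical, and the sketch always seeds an owner even when the baseline is zero, since the spec does not say how a zero baseline interacts with the owner subscription.

```go
package main

import (
	"fmt"
	"math/rand"
)

// roomFixture is a hypothetical in-memory view of one seeded room.
type roomFixture struct {
	RoomID     string
	Owner      string
	Members    []string // subscribed at seed time, owner first
	Candidates []string // not subscribed; the generator draws from these
}

// buildFixtures shuffles the global user pool once per room with a seeded
// PRNG, so the same seed always yields the same members and candidates.
func buildFixtures(seed int64, users, rooms, baseline, candidatePool int) ([]roomFixture, error) {
	seeded := baseline
	if seeded < 1 {
		seeded = 1 // assumption: an owner is always seeded
	}
	if seeded+candidatePool > users {
		return nil, fmt.Errorf("user pool %d too small for %d members + %d candidates",
			users, seeded, candidatePool)
	}
	rng := rand.New(rand.NewSource(seed))
	out := make([]roomFixture, 0, rooms)
	for r := 0; r < rooms; r++ {
		perm := rng.Perm(users)
		accounts := make([]string, seeded+candidatePool)
		for i := range accounts {
			accounts[i] = fmt.Sprintf("user-%d", perm[i])
		}
		out = append(out, roomFixture{
			RoomID:     fmt.Sprintf("room-%d", r),
			Owner:      accounts[0],
			Members:    accounts[:seeded:seeded], // capped so appends cannot spill into Candidates
			Candidates: accounts[seeded:],
		})
	}
	return out, nil
}

func main() {
	fx, err := buildFixtures(42, 100, 2, 10, 50)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fx[0].RoomID, "owner:", fx[0].Owner, "candidates:", len(fx[0].Candidates))
}
```

Taking members and candidates from disjoint ranges of the same permutation is what guarantees that no candidate is already subscribed to that room.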

Three new presets:

| preset | rooms | baseline | candidate pool | use case |
|--------------------|-------|----------|----------------|-----------------------------------------|
| `members-small` | 5 | 10 | 50 | smoke / dev |
| `members-medium` | 100 | 100 | 500 | sustained-throughput default |
| `members-capacity` | 5 | 0 | 10000 | capacity-growth, fills to MAX_ROOM_SIZE |

Shape mix defaults to `{users:1.0}` for all three presets, overridable
per-run via `--shape=users|orgs|channels|mixed`.

## Generators

### SustainedGenerator

Open-loop ticker at `--rate` req/sec for `--duration`, bounded by the
existing `MaxInFlight` semaphore. Each tick:

1. Pick a room round-robin across the fixture set.
2. Pop K candidates from that room's in-memory pool. If the pool would
drop below K, mark the room "exhausted" and skip; if all rooms
exhausted, abort early with `"preset's CandidatePool too small for
rate × duration × K — re-seed with larger pool"`.
3. Build the `AddMembersRequest` (shape selector picks users / orgs /
channels). For canonical inject, set `RequesterID` + `RequesterAccount`
from the seeded owner so the canonical payload matches what room-service
would have produced.
4. Hand to publisher, tag with correlation ID for E1/E2 join.
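Steps 1 and 2 above can be sketched as a round-robin cursor over per-room pools. The `roomPools` type and `nextBatch` method are hypothetical; `ErrPoolsExhausted` is named after the sentinel mentioned in the commit list, but its message here is a placeholder. One deliberate variation from the text: when a room is exhausted, this sketch moves on to the next room within the same tick rather than skipping the tick.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPoolsExhausted signals that no room can supply another batch of K.
var ErrPoolsExhausted = errors.New("all candidate pools exhausted; re-seed with a larger pool")

// roomPools is a hypothetical round-robin view over per-room candidate pools.
// It is not safe for concurrent use; a single ticker goroutine is assumed.
type roomPools struct {
	rooms     []string
	pools     map[string][]string
	exhausted map[string]bool
	next      int
}

// nextBatch picks the next non-exhausted room and pops k candidates from it.
func (p *roomPools) nextBatch(k int) (string, []string, error) {
	for tried := 0; tried < len(p.rooms); tried++ {
		room := p.rooms[p.next]
		p.next = (p.next + 1) % len(p.rooms)
		if p.exhausted[room] {
			continue
		}
		pool := p.pools[room]
		if len(pool) < k {
			p.exhausted[room] = true // fewer than k left: retire the room
			continue
		}
		p.pools[room] = pool[k:]
		return room, pool[:k:k], nil
	}
	return "", nil, ErrPoolsExhausted
}

func main() {
	p := &roomPools{
		rooms:     []string{"r1", "r2"},
		pools:     map[string][]string{"r1": {"a", "b", "c"}, "r2": {"d", "e"}},
		exhausted: map[string]bool{},
	}
	for {
		room, batch, err := p.nextBatch(2)
		if err != nil {
			fmt.Println("stop:", err)
			return
		}
		fmt.Println(room, batch)
	}
}
```

Because each batch is removed from the pool before it is returned, two in-flight requests to the same room can never share an account, which the E2 correlation key relies on.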

### CapacityGenerator

Adds are sequential within each room, while rooms run concurrently:

1. Iterate the preset's room set (5 rooms for `members-capacity`),
running rooms concurrently so a slow room doesn't gate the others.
2. Pop K candidates, build request, publish, **wait for E2 on this room**
before the next iteration on the same room. Within-room sequencing is
what makes "current size" a clean x-axis.
3. Stop a room when it reaches `--target-size` or its pool exhausts.
4. Each completed add records `(roomSize, e1Latency, e2Latency)` into a
size-bucketed histogram.

`--max-rate` is a safety valve: if set, cap each room's per-second adds
to avoid drowning room-worker during the smaller-room phase. Default
unset (sequential pacing alone).

## Publisher

```go
type MemberPublisher interface {
	Publish(ctx context.Context, requesterAccount, roomID string,
		req *model.AddMembersRequest, corrID string) error
}
```

Two implementations:

- `frontdoorPublisher` — `nc.PublishRequest` with reply inbox.
- `canonicalPublisher` — `js.Publish` to ROOMS stream subject.

The owner subscription is seeded per room regardless of inject mode
(frontdoor requires it for the auth check; canonical doesn't, but
keeping it lets you swap `--inject` between runs against the same
fixture without re-seeding).

## Collector

Mirrors the existing `Collector` shape but with `RecordMemberEvent` that
filters E2 to `member_added` `RoomEvent`s only (broadcast-worker emits
multiple event types per room). Reuses `ComputePercentiles` + CSV
writer.

Capacity mode swaps the single-percentile summary for a size-bucketed
table: `[size_bucket, count, e1_p50, e1_p99, e2_p50, e2_p99]`.

E1/E2 correlation:

- **E1** (frontdoor only) — NATS request/reply already gives a unique
reply inbox per request; the publisher records the send time when it
calls `PublishRequest` and the reply handler computes the delta. No
payload-side correlation ID needed.
- **E2** — room-worker generates the sys-message ID, so the existing
loadgen's `LastMsgID` trick doesn't apply. Instead, the collector
keys on `(roomID, sortedAddedAccounts)`. The candidate-pool design
guarantees disjoint user sets across concurrent requests to the same
room (each request pops K unique candidates), so this composite key
is unique without coordination. The broadcast handler decodes the added
accounts from the `member_added` payload. The field is assumed to be
`RoomEvent.Message.Members`, to be confirmed against `pkg/model` during
implementation; if no such field exists, the implementation plan must
add a correlation-friendly field to the broadcast event.

## Observability

New Prometheus metrics on the existing registry:

- `loadgen_member_published_total{phase,inject,shape}`
- `loadgen_member_publish_errors_total{reason=publish|room_service|timeout}`
- `loadgen_member_e1_latency_seconds` (histogram; frontdoor only)
- `loadgen_member_e2_latency_seconds` (histogram)
- `loadgen_member_room_size{room_id}` (gauge; capacity mode only)
- Existing `loadgen_consumer_pending{stream,consumer}` reused, sampled
against `room-worker` consumer on ROOMS stream.

Summary printer extended with a `MembersSummary` variant:

```
Preset members-medium
Inject frontdoor
Shape users
Target rate 100 req/s
Actual rate 98.7 req/s
Duration 60s (warmup 10s)
Users per add 10
Members added 59220
Errors publish=0 room_service=2 timeout=0
E1 p50/p95/p99 4.2 / 12.1 / 28.0 ms
E2 p50/p95/p99 9.7 / 31.0 / 78.4 ms
Consumer lag room-worker final_pending=0
```

Capacity mode adds a per-bucket table and a `room_id, final_size` block.

CSV export for capacity mode adds `room_size` and `inject` columns.

## Error handling

Three classes, distinct labels on `loadgen_member_publish_errors_total`:

- `publish` — NATS or JetStream publish failure → log + count, continue.
- `room_service` — frontdoor reply with non-empty `error` field
(capacity, dedup, permission) → log first 5 verbatim, count rest.
Hitting `"room is at maximum capacity"` is the natural stop signal in
sustained mode if the candidate pool is sized too generously.
- `timeout` — E2 not seen within `--e2-timeout` (default 30s) → count,
mark missing in collector. In capacity mode this is a hard stop for
that room (subsequent adds would queue behind a stuck worker).

Generator abort is fatal-but-clean: cancel the run context, drain
in-flight, print the partial summary with an `exhausted` or
`timed_out` flag set. Reuses the existing `signal.NotifyContext` and
2-second reply-drain from `runRun`.

## Testing

Per CLAUDE.md TDD rules (Red → Green → Refactor → Commit, ≥80% coverage):

- `members_test.go` — generator unit tests with stub `MemberPublisher`
and synthetic clock: rate adherence, candidate-pool exhaustion abort,
shape selector distribution (chi-squared over N=10000), capacity-mode
sequential ordering, `--max-rate` cap.
- `collector_member_test.go` — E1/E2 correlation join, size-bucketing
edges, percentile output, `member_added` filter rejects other
RoomEvent types.
Comment on lines +283 to +285

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inconsistent test filename in the testing section.

Line [283] uses collector_member_test.go, while the layout uses members_collector_test.go. Keep one canonical name to prevent churn and mislinked implementation tasks.

Also applies to: 303-305

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 283 - 285, The docs use two different test filenames —
`collector_member_test.go` and `members_collector_test.go` — causing
inconsistency; pick the canonical name (use `members_collector_test.go` to match
the layout) and replace all occurrences of `collector_member_test.go` in this
spec (including the testing section and the other spot around the later mention)
with `members_collector_test.go` so the spec and implementation tasks reference
the same filename.

- `seed_members_test.go` — fixture builder determinism for a given seed,
candidate pool disjoint from seeded members, owner role present on
every room.
- `members_integration_test.go` (`//go:build integration`) —
testcontainers: NATS+JS, Mongo, Valkey, real `room-service` and
`room-worker`. Seeds `members-small`, runs 5s sustained + a 1-room
capacity to size 50, asserts zero errors, non-empty E1+E2, final
Mongo subscription count matches expected.
- Flag validation: `--shape=channels --inject=canonical` rejected at
parse time; `--users-per-add` ≥ 1; `--target-size` ≤ MAX_ROOM_SIZE.

## File layout

```
tools/loadgen/
  members.go                   # MembersPreset, generators, publisher
  members_test.go
  members_collector.go         # E1/E2 join, size buckets, summary printer
  members_collector_test.go
  members_integration_test.go
  seed_members.go              # MembersFixtures builder, Seed/Teardown integration
  seed_members_test.go
  main.go                      # +members-sustained, members-capacity subcommands;
                               # +--workload flag on seed/teardown
  report.go                    # +MembersSummary variant
  metrics.go                   # +member_* metric families
  deploy/Makefile              # +seed-members, run-members-sustained,
                               # run-members-capacity, reset-members targets
  README.md                    # +members section, presets table, examples
```

Existing files touched: `main.go`, `report.go`, `metrics.go`,
`deploy/Makefile`, `README.md`. New code mostly isolated to
`members*.go` and `seed_members*.go` so the messaging benchmark stays
readable.
6 changes: 6 additions & 0 deletions pkg/subject/subject.go
@@ -368,6 +368,12 @@ func RoomMemberEvent(roomID string) string {
	return fmt.Sprintf("chat.room.%s.event.member", roomID)
}

// RoomMemberEventWildcard is the subscription pattern matching member events
// (member_added / member_removed) across all rooms on this site.
func RoomMemberEventWildcard() string {
	return "chat.room.*.event.member"
}

func MsgThreadParentPattern(siteID string) string {
	return fmt.Sprintf("chat.user.{account}.request.room.{roomID}.%s.msg.thread.parent", siteID)
}
2 changes: 2 additions & 0 deletions pkg/subject/subject_test.go
@@ -70,6 +70,8 @@ func TestSubjectBuilders(t *testing.T) {
			"chat.user.alice.request.room.r1.site-a.member.add"},
		{"MemberEvent", subject.MemberEvent("r1"),
			"chat.room.r1.event.member"},
		{"RoomMemberEventWildcard", subject.RoomMemberEventWildcard(),
			"chat.room.*.event.member"},
		{"MemberList", subject.MemberList("alice", "r1", "site-a"),
			"chat.user.alice.request.room.r1.site-a.member.list"},
		{"MemberListWildcard", subject.MemberListWildcard("site-a"),