Skip to content

feat(loadgen): add read-receipt workload to max-rps#264

Open
hmchangw wants to merge 13 commits into
mainfrom
claude/magical-dirac-1ytWo
Open

feat(loadgen): add read-receipt workload to max-rps#264
hmchangw wants to merge 13 commits into
mainfrom
claude/magical-dirac-1ytWo

Conversation

@hmchangw
Copy link
Copy Markdown
Owner

@hmchangw hmchangw commented Jun 2, 2026

Summary

Adds a read-receipt workload to the loadgen max-rps ramp command. It drives the room-service read-receipt RPC (chat.user.{account}.request.room.{roomID}.{siteID}.message.read-receipt) — a synchronous request/reply read ("who has read message X") — at increasing RPS steps to find the maximum sustainable rate under the latency/error SLOs.

  • Mirrors the existing history workload (synchronous read, no JetStream consumer): plugs into the rpsWorkload interface and reuses the ramp engine, verdict logic, and report rendering unchanged.
  • Reuses the history seed and presets. Targets are derived from the message plan (top-level messages only); the requester is each message's sender, satisfying the handler's msgSender == requesterAccount guard.
  • New seed --workload=read-receipt --read-ratio=0.7 step stamps lastSeenAt on a configurable fraction of each room's subscribers so ListReadReceipts exercises its real $match/$lookup/$unwind path instead of short-circuiting on an empty match.
  • --slo-pending-growth is ignored (no consumer); --request-timeout sets the per-request timeout.

New files: readreceipt_collector.go, readreceipt_requester.go, readreceipt_generator.go, maxrps_readreceipt.go, readreceipt_seed.go (+ unit tests and an integration test). Wiring in maxrps.go, main.go, history_main.go. Docs in tools/loadgen/README.md. Design + plan under docs/superpowers/.

Test Plan

  • make test SERVICE=tools/loadgen — unit suite passes (race detector)
  • make lint — 0 issues
  • make fmt — no changes
  • gosec — PASS (govulncheck/semgrep run in CI; blocked locally by network policy)
  • make test-integration SERVICE=tools/loadgenTestReadReceiptWorkload_EndToEnd (compiles + go vet -tags integration clean; runs in CI — Docker unavailable in the dev environment)
  • Manual: loadgen seed --workload=read-receipt --preset=history-medium --read-ratio=0.7 then loadgen max-rps --workload=read-receipt --preset=history-medium --steps=200,500,1k,2k,5k

https://claude.ai/code/session_019TuJE9Y5nbsTMnXdycLiWo


Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Added read-receipt workload option to loadgen max-rps for capacity testing
    • Introduced --read-ratio flag to configure the fraction of subscribers marked as readers during workload preparation
    • Added read-receipt seeding support via loadgen seed command
  • Documentation

    • Updated loadgen README with read-receipt workload usage, examples, and CLI reference
    • Added design specification and implementation plan documentation

claude added 11 commits June 2, 2026 09:18
Moves the production natsReadReceiptRequester impl out of the interface
file (which caused unused-lint failures) to align with the history
pattern where the concrete NATS requester lives in the main wiring file.

https://claude.ai/code/session_019TuJE9Y5nbsTMnXdycLiWo
Wires the readReceiptWorkload into the max-rps ramp engine: restores
the production natsReadReceiptRequester, adds the read-receipt case to
runMaxRPS, and introduces buildReadReceiptInputs/runReadReceiptFor.

https://claude.ai/code/session_019TuJE9Y5nbsTMnXdycLiWo
Implements selectReaders (ceil-based uniform random selection) and
latestTopLevelByRoom helpers, then composes them in SeedReadReceiptState
to stamp lastSeenAt on a configurable fraction of subscribers per room.
Wires a new "read-receipt" workload into the seed subcommand via
runSeedReadReceipt, which seeds the full history fixture set (Mongo +
Cassandra) before stamping reader state.

https://claude.ai/code/session_019TuJE9Y5nbsTMnXdycLiWo
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This pull request implements a complete read-receipt workload for the loadgen max-rps capacity-testing tool. It adds seeding logic to mark subscribers as readers in MongoDB, a generator to issue concurrent RPC requests at target rate, and CLI wiring to integrate the workload into existing seed and max-rps commands within the ramp/verdict/report pipeline.

Changes

Read-receipt max-rps workload implementation

Layer / File(s) Summary
Design and planning documentation
docs/superpowers/plans/2026-06-02-loadgen-read-receipt-maxrps.md, docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md
Two new design documents specify the workload purpose (drive read-receipt RPC at increasing RPS and report max sustainable rate under SLOs), architecture (reuse ramp/verdict/report pipeline), seeding strategy (stamp lastSeenAt on configurable subscriber fraction), component breakdown (collector, requester, generator, workload adapter, seed helpers), testing plan (unit and integration tests), and CLI wiring changes.
Collector and requester transport
tools/loadgen/readreceipt_collector.go, tools/loadgen/readreceipt_collector_test.go, tools/loadgen/readreceipt_requester.go
ReadReceiptCollector provides thread-safe latency/error/saturation recording via RecordSample, RecordError, and RecordSaturation methods; defensive-copy accessors return latencies, failed counts, and saturation ticks. ReadReceiptRequester interface and natsReadReceiptRequester implementation issue synchronous request/reply RPCs with per-call timeouts, wrapping NATS errors contextually.
Read-receipt seeding logic
tools/loadgen/readreceipt_seed.go, tools/loadgen/readreceipt_seed_test.go
SeedReadReceiptState function deterministically selects ceil(readRatio * n) subscribers per room using seeded RNG permutation, derives the latest top-level message timestamp per room (excluding threaded replies), and stamps lastSeenAt in MongoDB via subscriptions.UpdateMany. Tests validate selection ratio, determinism, and edge cases; latestTopLevelByRoom correctness on mixed thread/top-level messages.
Read-receipt generator for concurrent RPC dispatch
tools/loadgen/readreceipt_generator.go, tools/loadgen/readreceipt_generator_test.go
ReadReceiptGenerator issues requests at configured rate in open-loop pattern, deriving targets from top-level messages only. Optional semaphore-based concurrency bounding (with MaxInFlight) records saturation when capacity exhausted. Each request classifies failures (request errors, bad replies, embedded errors) and records latencies on success using collector. Tests cover target derivation, rate validation, error classification, and request issuance.
Read-receipt workload adapter and step execution
tools/loadgen/maxrps_readreceipt.go, tools/loadgen/maxrps_readreceipt_test.go, tools/loadgen/readreceipt_integration_test.go
readReceiptWorkload implements rpsWorkload interface: RunStep executes warmup (discarded) and hold (measured) phases sequentially via generator runs, drains trailing replies, and returns normalized rpsStepInputs with latency samples, operation counts, and saturation. newReadReceiptWorkload wires NATS, metrics HTTP server, and requester; cleanup shuts down server and drains NATS. Unit test validates input field population from collector state; integration test seeds MongoDB, stubs NATS responder, runs generator, and asserts seeded count and non-zero samples with zero failures.
Seed command CLI integration
tools/loadgen/history_main.go, tools/loadgen/main.go
Adds --read-ratio flag (default 0.7) to loadgen seed; extends workload dispatcher with read-receipt case that calls runSeedReadReceipt. runSeedReadReceipt validates readRatio range, loads history preset, connects to Mongo/Cassandra, builds/seeds history fixtures, then calls SeedReadReceiptState, returning exit code on success/error.
Max-rps command CLI integration
tools/loadgen/maxrps.go
Expands --workload flag to include read-receipt; routes both history and read-receipt through same default step sequence. Adds read-receipt case in runMaxRPS that validates preset/timeout, constructs readReceiptWorkload via newReadReceiptWorkload, and wires cleanup. Updates --request-timeout help to apply to both history and read-receipt.
README documentation
tools/loadgen/README.md
Extends max-rps synopsis and quick-start examples to include read-receipt workload with preparatory loadgen seed --read-ratio step. Updates --workload flag list and --request-timeout notes to reference read-receipt. Adds dedicated "Read-receipt workload" section documenting RPC target, seeding prerequisites, timeout applicability, and teardown mapping.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The PR combines heterogeneous new components (collector, requester, generator, seeding helpers, workload adapter) across multiple files with logic density in the generator concurrency bounding and seeding selection. However, each component is self-contained with unit tests, and the architecture mirrors the existing history workload pattern. The integration points are straightforward wiring into existing CLI dispatch tables and the rpsWorkload interface.

Possibly related PRs

  • hmchangw/chat#240: Extends the existing max-rps ramp/SLO framework introduced in PR #240 by adding the new read-receipt workload into the same verdict/report pipeline.
  • hmchangw/chat#167: Depends on the room-service read-receipt RPC interface (subjects, request/response types, reader list behavior) introduced in PR #167.
  • hmchangw/chat#117: Extends the loadgen CLI seed/workload dispatch structure established in PR #117 with the new read-receipt seed and max-rps paths.

Suggested labels

ready

Suggested reviewers

  • mliu33
  • Joey0538

Poem

🐰 A workload born to test the load,
Read-receipts marching down the road,
With collectors, seeds, and rates so neat,
The max-rps pipeline stays complete!
hops off with a scroll 📜

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new read-receipt workload to the loadgen max-rps command, which is the primary objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/magical-dirac-1ytWo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-02-loadgen-read-receipt-maxrps.md`:
- Around line 700-713: The RunStep method in readReceiptWorkload uses time.Sleep
to "drain" in-flight replies; replace this with explicit synchronization by
having the generator created by w.newGenerator provide a deterministic
completion signal (for example a done channel or sync.WaitGroup) and waiting on
that signal instead of time.Sleep. Concretely: update
w.newGenerator/runReadReceiptFor contract so runReadReceiptFor returns (or
exposes) a completion indicator tied to the generator instance (e.g., a done
chan struct{} or *sync.WaitGroup) that is closed/marked when all reply
goroutines are finished; then in RunStep (the function shown) remove
time.Sleep(2 * time.Second) and wait on that completion signal before calling
buildReadReceiptInputs with the collector; ensure NewReadReceiptCollector and
collector usage remain unchanged but are only read after the generator signals
completion.

In `@docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md`:
- Around line 125-128: The RunStep description currently tells implementers to
"sleep briefly to drain trailing in-flight replies" — replace that with explicit
synchronization: when mirroring historyWorkload.RunStep (functions/methods
referenced: RunStep, buildReadReceiptInputs, warmup, hold, collector) require
the warmup and hold generators to run as distinct goroutines that signal
completion via channels or a sync.WaitGroup and ensure the measured collector
drains by waiting for a done signal from the generator/collector rather than
sleeping; document that the throwaway collector used for warmup discards samples
and does not block the synchronization, and specify the exact signal (e.g.,
generator done channel or WaitGroup) that callers must wait on before calling
buildReadReceiptInputs(targetRPS, hold, collector).
- Around line 93-95: Update the spec to reflect the actual CLI contract: replace
references to the non-existent `seed-read-receipt` subcommand with the real
invocation `loadgen seed --workload read-receipt` and show the same parameters
(`--preset`, `--seed`, `--read-ratio`) passed to `loadgen seed`; specifically
edit the sections that describe the command (currently mentioning
`seed-read-receipt`) so they present the correct command shape (e.g., `loadgen
seed --workload read-receipt --preset ... --seed ... --read-ratio ...`) and any
examples or docs that repeat `seed-read-receipt` to avoid operator confusion.
- Line 77: Remove the dangling Reset() references from the collector spec:
update the descriptions for the readreceipt_collector and history_collector to
omit "Reset()"-able and instead state the collector provides an in-memory
latency tape, timeout/reply-error/bad-reply/saturation counters and is
thread-safe (mutex); ensure any other occurrences (e.g., the second mention
around line 153) are likewise cleaned up and optionally add a short note that
Reset() will be documented when/if a Reset API is actually introduced.

In `@tools/loadgen/history_main.go`:
- Around line 82-84: The validation for the command-line flag readRatio
incorrectly allows NaN because flag.Float64 parses "NaN" and the current check
(readRatio <= 0 || readRatio > 1) doesn't catch it; update the validation in the
same block that references readRatio (before calling SeedReadReceiptState) to
explicitly reject NaN (use math.IsNaN(readRatio)) and treat it as invalid,
printing the same error message "--read-ratio must be in (0, 1]" and returning
2.

In `@tools/loadgen/maxrps_readreceipt.go`:
- Around line 71-76: In the cleanup closure, don't ignore errors from
srv.Shutdown and nc.Drain; capture each return value (e.g., err :=
srv.Shutdown(shutCtx) and err := nc.Drain()) and log a warning or error with
context (for example "server shutdown failed" and "nats drain failed") including
the error value so teardown failures are visible; keep the existing context
cancel call and ensure logging happens before returning from the cleanup func.
- Around line 112-119: The goroutine currently discards the error returned by
gen.Run(genCtx), causing failures to be hidden; change it to capture the result
(e.g., send the error to a buffered channel or store it in a variable
synchronized with the WaitGroup) and after wg.Wait() check that error and return
it instead of the nil from waitOrCancel; ensure you still call cancel() and
wait, and return the gen.Run error if present (whether it finished before or
after waitOrCancel). Reference: gen.Run, genCtx, waitOrCancel, cancel, wg.

In `@tools/loadgen/README.md`:
- Around line 320-325: The README example is misleading because
runTeardownHistory derives room IDs from the preset+seed, so teardown must be
run with the same --seed used to create the fixtures; update the docs to state
this requirement, mention that failing to pass the original --seed will leave
Valkey room keys behind, and change the example for loadgen teardown
--workload=history --preset=history-medium to include a matching --seed (e.g.,
--seed=1234) and a short note referencing runTeardownHistory so readers know why
the seed must match.

In `@tools/loadgen/readreceipt_integration_test.go`:
- Around line 78-84: Replace the fixed time.Sleep by waiting on a deterministic
signal from the generator/responder: modify the test to create a done channel
(or use a sync.WaitGroup) that the stub responder closes or signals when it has
delivered all trailing replies, then replace time.Sleep(500*time.Millisecond)
with a select that waits on that done channel (or WaitGroup completion) or the
runCtx.Done() for timeout; update the stub responder (or generator) to signal
the done channel when it finishes emitting replies so the test uses
gen.Run(runCtx) plus a synchronous wait on that signal before asserting on
collector.Samples() and collector.Failed().

In `@tools/loadgen/readreceipt_seed.go`:
- Around line 69-77: byRoom is a map so iterating over it uses a
nondeterministic order and consumes the shared RNG (rng) unpredictably; to
restore deterministic behavior for SeedReadReceiptState, collect the room IDs
from byRoom, sort them (e.g., lexicographically), and iterate over the sorted
slice when generating lastSeen and calling selectReaders so the same rng
sequence is consumed in a stable order for every run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0270cc3-2bf2-4650-a477-4f67c93bdacb

📥 Commits

Reviewing files that changed from the base of the PR and between 9c5d14a and 7695977.

📒 Files selected for processing (16)
  • docs/superpowers/plans/2026-06-02-loadgen-read-receipt-maxrps.md
  • docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md
  • tools/loadgen/README.md
  • tools/loadgen/history_main.go
  • tools/loadgen/main.go
  • tools/loadgen/maxrps.go
  • tools/loadgen/maxrps_readreceipt.go
  • tools/loadgen/maxrps_readreceipt_test.go
  • tools/loadgen/readreceipt_collector.go
  • tools/loadgen/readreceipt_collector_test.go
  • tools/loadgen/readreceipt_generator.go
  • tools/loadgen/readreceipt_generator_test.go
  • tools/loadgen/readreceipt_integration_test.go
  • tools/loadgen/readreceipt_requester.go
  • tools/loadgen/readreceipt_seed.go
  • tools/loadgen/readreceipt_seed_test.go

Comment on lines +700 to +713
func (w *readReceiptWorkload) RunStep(ctx context.Context, targetRPS int, warmup, hold time.Duration) (rpsStepInputs, error) {
if warmup > 0 {
warmCollector := NewReadReceiptCollector()
if err := runReadReceiptFor(ctx, w.newGenerator(warmCollector, targetRPS), warmup); err != nil {
return rpsStepInputs{}, err
}
}
collector := NewReadReceiptCollector()
if err := runReadReceiptFor(ctx, w.newGenerator(collector, targetRPS), hold); err != nil {
return rpsStepInputs{}, err
}
time.Sleep(2 * time.Second) // drain trailing in-flight replies into the collector
return buildReadReceiptInputs(targetRPS, hold, collector), nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace sleep-based draining with explicit synchronization in the prescribed RunStep flow.

Line 711 currently instructs time.Sleep(2 * time.Second) to drain replies. That pattern is brittle and violates the project’s concurrency rule; make the plan require deterministic completion signaling (e.g., wait on generator-owned sync primitives) instead.

As per coding guidelines: "Never use time.Sleep for goroutine synchronization — use proper sync primitives (channels, sync.WaitGroup, sync.Mutex)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-06-02-loadgen-read-receipt-maxrps.md` around
lines 700 - 713, The RunStep method in readReceiptWorkload uses time.Sleep to
"drain" in-flight replies; replace this with explicit synchronization by having
the generator created by w.newGenerator provide a deterministic completion
signal (for example a done channel or sync.WaitGroup) and waiting on that signal
instead of time.Sleep. Concretely: update w.newGenerator/runReadReceiptFor
contract so runReadReceiptFor returns (or exposes) a completion indicator tied
to the generator instance (e.g., a done chan struct{} or *sync.WaitGroup) that
is closed/marked when all reply goroutines are finished; then in RunStep (the
function shown) remove time.Sleep(2 * time.Second) and wait on that completion
signal before calling buildReadReceiptInputs with the collector; ensure
NewReadReceiptCollector and collector usage remain unchanged but are only read
after the generator signals completion.

|------|---------|----------------|
| `tools/loadgen/maxrps_readreceipt.go` | `maxrps_history.go` | `readReceiptWorkload` implementing `rpsWorkload`. `newReadReceiptWorkload` wires NATS, the metrics HTTP server, the requester, and derives targets from `BuildHistoryFixtures`. `RunStep` runs warmup (discarded) then hold (measured) and returns `rpsStepInputs`. `Label()` returns `"read-receipt"`. |
| `tools/loadgen/readreceipt_generator.go` | `history_generator.go` | `ReadReceiptGenerator` with a `Rate` ticker and a `MaxInFlight` semaphore. Each tick picks a random target, issues the request via the requester, and records the result in the collector. Saturation (pool full on tick) is recorded, not dropped silently. |
| `tools/loadgen/readreceipt_collector.go` | `history_collector.go` | In-memory latency tape plus `timeout`, `reply-error`, `bad-reply`, and `saturation` counters. Thread-safe (mutex), `Reset()`-able. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove Reset() references from collector behavior unless that API is actually added.

Line 77 and Line 153 describe a Reset()-able collector, but the implemented collector contract does not include Reset(). Align the spec to the real API to avoid misleading test/usage expectations.

Also applies to: 153-153

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md` at
line 77, Remove the dangling Reset() references from the collector spec: update
the descriptions for the readreceipt_collector and history_collector to omit
"Reset()"-able and instead state the collector provides an in-memory latency
tape, timeout/reply-error/bad-reply/saturation counters and is thread-safe
(mutex); ensure any other occurrences (e.g., the second mention around line 153)
are likewise cleaned up and optionally add a short note that Reset() will be
documented when/if a Reset API is actually introduced.

Comment on lines +93 to +95
- `main.go`: add a `seed-read-receipt` subcommand that runs the full history
seed (`Seed`, `SeedRoomKeys`, `SeedThreadRooms`, `SeedHistoryCassandra`) then
`SeedReadReceiptState`, parameterized by `--preset`, `--seed`, `--read-ratio`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the seed command contract: this spec documents a non-existent CLI shape.

Lines 93–95 and Line 101 describe seed-read-receipt, but this PR’s contract is loadgen seed --workload read-receipt .... Keeping the wrong command here will cause immediate operator error.

Also applies to: 101-101

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md`
around lines 93 - 95, Update the spec to reflect the actual CLI contract:
replace references to the non-existent `seed-read-receipt` subcommand with the
real invocation `loadgen seed --workload read-receipt` and show the same
parameters (`--preset`, `--seed`, `--read-ratio`) passed to `loadgen seed`;
specifically edit the sections that describe the command (currently mentioning
`seed-read-receipt`) so they present the correct command shape (e.g., `loadgen
seed --workload read-receipt --preset ... --seed ... --read-ratio ...`) and any
examples or docs that repeat `seed-read-receipt` to avoid operator confusion.

Comment on lines +125 to +128
Mirrors `historyWorkload.RunStep`: run a fresh generator for `warmup` against a
throwaway collector (samples discarded), then run a fresh generator for `hold`
against the measured collector, sleep briefly to drain trailing in-flight
replies, and return `buildReadReceiptInputs(targetRPS, hold, collector)`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid prescribing sleep-based draining in RunStep behavior.

The “sleep briefly to drain trailing replies” guidance should be replaced with explicit synchronization semantics; sleep-based coordination is nondeterministic and policy-inconsistent.

As per coding guidelines: "Never use time.Sleep for goroutine synchronization — use proper sync primitives (channels, sync.WaitGroup, sync.Mutex)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-06-02-loadgen-read-receipt-maxrps-design.md`
around lines 125 - 128, The RunStep description currently tells implementers to
"sleep briefly to drain trailing in-flight replies" — replace that with explicit
synchronization: when mirroring historyWorkload.RunStep (functions/methods
referenced: RunStep, buildReadReceiptInputs, warmup, hold, collector) require
the warmup and hold generators to run as distinct goroutines that signal
completion via channels or a sync.WaitGroup and ensure the measured collector
drains by waiting for a done signal from the generator/collector rather than
sleeping; document that the throwaway collector used for warmup discards samples
and does not block the synchronization, and specify the exact signal (e.g.,
generator done channel or WaitGroup) that callers must wait on before calling
buildReadReceiptInputs(targetRPS, hold, collector).

Comment on lines +82 to +84
if readRatio <= 0 || readRatio > 1 {
fmt.Fprintln(os.Stderr, "--read-ratio must be in (0, 1]")
return 2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does Go's flag package accept "NaN" for a float64 flag, and do comparisons like x <= 0 || x > 1 evaluate to false when x is NaN?

💡 Result:

Yes. The standard library flag package’s float64 flag ultimately uses strconv.ParseFloat to parse the flag value, and strconv.ParseFloat recognizes the string "NaN" (case-insensitive) as a NaN float64. [1][2] For comparisons: in Go, any comparison operation (<, <=, >, >=, ==,!=) involving a NaN produces results that follow IEEE 754 semantics: x <= 0 is false when x is NaN, and x > 1 is also false when x is NaN; therefore the boolean expression (x <= 0 || x > 1) evaluates to false when x is NaN. This is explicitly tested by Go’s float comparison test suite (nan <= f, nan > f, and nan <= nan are all false). [3] Note:!= is the only “true” comparison when comparing NaN to anything (including itself), consistent with IEEE 754. [3]

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "readRatio|read-ratio|SeedReadReceiptState|runSeedReadReceipt" tools/loadgen/history_main.go tools/loadgen/main.go || true
echo "---- history_main.go (around guard) ----"
sed -n '60,110p' tools/loadgen/history_main.go
echo "---- history_main.go (around runSeedReadReceipt) ----"
sed -n '110,190p' tools/loadgen/history_main.go
echo "---- main.go (flag wiring) ----"
sed -n '1,220p' tools/loadgen/main.go

Repository: hmchangw/chat

Length of output: 12596


Reject NaN for --read-ratio.

flag.Float64 accepts "NaN", and readRatio <= 0 || readRatio > 1 evaluates to false when readRatio is NaN, so an invalid value can reach SeedReadReceiptState.

Suggested fix
-	if readRatio <= 0 || readRatio > 1 {
+	if !(readRatio > 0 && readRatio <= 1) {
 		fmt.Fprintln(os.Stderr, "--read-ratio must be in (0, 1]")
 		return 2
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if readRatio <= 0 || readRatio > 1 {
fmt.Fprintln(os.Stderr, "--read-ratio must be in (0, 1]")
return 2
if !(readRatio > 0 && readRatio <= 1) {
fmt.Fprintln(os.Stderr, "--read-ratio must be in (0, 1]")
return 2
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/history_main.go` around lines 82 - 84, The validation for the
command-line flag readRatio incorrectly allows NaN because flag.Float64 parses
"NaN" and the current check (readRatio <= 0 || readRatio > 1) doesn't catch it;
update the validation in the same block that references readRatio (before
calling SeedReadReceiptState) to explicitly reject NaN (use
math.IsNaN(readRatio)) and treat it as invalid, printing the same error message
"--read-ratio must be in (0, 1]" and returning 2.

Comment on lines +71 to +76
cleanup := func() {
shutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_ = srv.Shutdown(shutCtx)
cancel()
_ = nc.Drain()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Log teardown failures instead of discarding them.

Lines 73 and 75 silently drop Shutdown/Drain errors, which hides failed listener teardown or NATS drain problems during loadgen cleanup. At least warn with context so the run output reflects partial shutdowns.

💡 Suggested fix
 	cleanup := func() {
 		shutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
-		_ = srv.Shutdown(shutCtx)
+		if err := srv.Shutdown(shutCtx); err != nil {
+			slog.Warn("shutdown metrics server", "error", err)
+		}
 		cancel()
-		_ = nc.Drain()
+		if err := nc.Drain(); err != nil {
+			slog.Warn("drain nats connection", "error", err)
+		}
 	}

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cleanup := func() {
shutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_ = srv.Shutdown(shutCtx)
cancel()
_ = nc.Drain()
}
cleanup := func() {
shutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
if err := srv.Shutdown(shutCtx); err != nil {
slog.Warn("shutdown metrics server", "error", err)
}
cancel()
if err := nc.Drain(); err != nil {
slog.Warn("drain nats connection", "error", err)
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_readreceipt.go` around lines 71 - 76, In the cleanup
closure, don't ignore errors from srv.Shutdown and nc.Drain; capture each return
value (e.g., err := srv.Shutdown(shutCtx) and err := nc.Drain()) and log a
warning or error with context (for example "server shutdown failed" and "nats
drain failed") including the error value so teardown failures are visible; keep
the existing context cancel call and ensure logging happens before returning
from the cleanup func.

Comment on lines +112 to +119
go func() {
defer wg.Done()
_ = gen.Run(genCtx)
}()
err := waitOrCancel(ctx, d)
cancel()
wg.Wait()
return err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate gen.Run failures instead of dropping them.

Line 114 discards the generator error, so an early requester/setup failure turns into a full-duration wait and a misleadingly successful step result. Capture the goroutine result and return it when gen.Run exits before or after cancellation.

💡 Suggested fix
 func runReadReceiptFor(ctx context.Context, gen *ReadReceiptGenerator, d time.Duration) error {
 	genCtx, cancel := context.WithCancel(ctx)
-	var wg sync.WaitGroup
-	wg.Add(1)
+	defer cancel()
+	errCh := make(chan error, 1)
+	var wg sync.WaitGroup
+	wg.Add(1)
 	go func() {
 		defer wg.Done()
-		_ = gen.Run(genCtx)
+		errCh <- gen.Run(genCtx)
 	}()
-	err := waitOrCancel(ctx, d)
-	cancel()
-	wg.Wait()
-	return err
+	timer := time.NewTimer(d)
+	defer timer.Stop()
+
+	select {
+	case err := <-errCh:
+		cancel()
+		wg.Wait()
+		return err
+	case <-ctx.Done():
+		cancel()
+		wg.Wait()
+		return ctx.Err()
+	case <-timer.C:
+		cancel()
+		wg.Wait()
+		return <-errCh
+	}
 }

As per coding guidelines: "Never ignore errors silently — comment if intentionally discarded".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_readreceipt.go` around lines 112 - 119, The goroutine
currently discards the error returned by gen.Run(genCtx), causing failures to be
hidden; change it to capture the result (e.g., send the error to a buffered
channel or store it in a variable synchronized with the WaitGroup) and after
wg.Wait() check that error and return it instead of the nil from waitOrCancel;
ensure you still call cancel() and wait, and return the gen.Run error if present
(whether it finished before or after waitOrCancel). Reference: gen.Run, genCtx,
waitOrCancel, cancel, wg.

Comment thread tools/loadgen/README.md
Comment on lines +320 to +325
To tear down, use the history teardown — read-receipt seeds the identical
history fixtures, so `loadgen teardown --workload=history --preset=<name>` drops
everything (dropping `subscriptions` removes the stamped `lastSeenAt` too):

```bash
loadgen teardown --workload=history --preset=history-medium
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mention that teardown must reuse the seed.

This example implies any seed works, but runTeardownHistory derives room IDs from the preset+seed before deleting Valkey room keys. If someone seeded read-receipt data with a non-default seed, this command will drop Mongo/Cassandra state but leave the corresponding room keys behind unless the same --seed is passed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/README.md` around lines 320 - 325, The README example is
misleading because runTeardownHistory derives room IDs from the preset+seed, so
teardown must be run with the same --seed used to create the fixtures; update
the docs to state this requirement, mention that failing to pass the original
--seed will leave Valkey room keys behind, and change the example for loadgen
teardown --workload=history --preset=history-medium to include a matching --seed
(e.g., --seed=1234) and a short note referencing runTeardownHistory so readers
know why the seed must match.

Comment on lines +78 to +84
runCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
require.NoError(t, gen.Run(runCtx))
time.Sleep(500 * time.Millisecond) // drain trailing replies

assert.NotEmpty(t, collector.Samples(), "generator produced zero samples")
assert.Equal(t, 0, collector.Failed(), "stub responder should yield zero failures")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace the fixed sleep with a real synchronization signal.

time.Sleep(500 * time.Millisecond) makes this test timing-dependent: slower CI can still observe the collector too early, while faster runs just pay an unnecessary delay. Wait on a concrete condition from the responder/generator instead of sleeping for an arbitrary interval. As per coding guidelines, "Never use time.Sleep for goroutine synchronization — use proper sync primitives (channels, sync.WaitGroup, sync.Mutex)."

💡 One way to make the assertion deterministic
 	runCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
 	defer cancel()
 	require.NoError(t, gen.Run(runCtx))
-	time.Sleep(500 * time.Millisecond) // drain trailing replies
+	require.Eventually(t, func() bool {
+		return len(collector.Samples()) > 0 || collector.Failed() > 0
+	}, time.Second, 10*time.Millisecond)
 
 	assert.NotEmpty(t, collector.Samples(), "generator produced zero samples")
 	assert.Equal(t, 0, collector.Failed(), "stub responder should yield zero failures")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
runCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
require.NoError(t, gen.Run(runCtx))
time.Sleep(500 * time.Millisecond) // drain trailing replies
assert.NotEmpty(t, collector.Samples(), "generator produced zero samples")
assert.Equal(t, 0, collector.Failed(), "stub responder should yield zero failures")
runCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
require.NoError(t, gen.Run(runCtx))
require.Eventually(t, func() bool {
return len(collector.Samples()) > 0 || collector.Failed() > 0
}, time.Second, 10*time.Millisecond)
assert.NotEmpty(t, collector.Samples(), "generator produced zero samples")
assert.Equal(t, 0, collector.Failed(), "stub responder should yield zero failures")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/readreceipt_integration_test.go` around lines 78 - 84, Replace
the fixed time.Sleep by waiting on a deterministic signal from the
generator/responder: modify the test to create a done channel (or use a
sync.WaitGroup) that the stub responder closes or signals when it has delivered
all trailing replies, then replace time.Sleep(500*time.Millisecond) with a
select that waits on that done channel (or WaitGroup completion) or the
runCtx.Done() for timeout; update the stub responder (or generator) to signal
the done channel when it finishes emitting replies so the test uses
gen.Run(runCtx) plus a synchronous wait on that signal before asserting on
collector.Samples() and collector.Failed().

Comment on lines +69 to +77
rng := rand.New(rand.NewSource(seed))
coll := db.Collection("subscriptions")
for roomID, roomSubs := range byRoom {
floor, ok := latest[roomID]
if !ok {
continue // room has no top-level messages — nothing to read
}
lastSeen := floor.Add(time.Millisecond).UTC()
chosen := selectReaders(roomSubs, readRatio, rng)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sort rooms before consuming the seeded RNG.

byRoom is a Go map, so this loop runs in nondeterministic order. Because every room pulls from the same rng, the chosen readers change across runs even with the same seed, which breaks the deterministic seeding behavior promised by SeedReadReceiptState.

💡 Proposed fix
 import (
 	"context"
 	"fmt"
 	"math"
 	"math/rand"
+	"sort"
 	"time"
@@
 	rng := rand.New(rand.NewSource(seed))
 	coll := db.Collection("subscriptions")
-	for roomID, roomSubs := range byRoom {
+	roomIDs := make([]string, 0, len(byRoom))
+	for roomID := range byRoom {
+		roomIDs = append(roomIDs, roomID)
+	}
+	sort.Strings(roomIDs)
+	for _, roomID := range roomIDs {
+		roomSubs := byRoom[roomID]
 		floor, ok := latest[roomID]
 		if !ok {
 			continue // room has no top-level messages — nothing to read
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rng := rand.New(rand.NewSource(seed))
coll := db.Collection("subscriptions")
for roomID, roomSubs := range byRoom {
floor, ok := latest[roomID]
if !ok {
continue // room has no top-level messages — nothing to read
}
lastSeen := floor.Add(time.Millisecond).UTC()
chosen := selectReaders(roomSubs, readRatio, rng)
rng := rand.New(rand.NewSource(seed))
coll := db.Collection("subscriptions")
roomIDs := make([]string, 0, len(byRoom))
for roomID := range byRoom {
roomIDs = append(roomIDs, roomID)
}
sort.Strings(roomIDs)
for _, roomID := range roomIDs {
roomSubs := byRoom[roomID]
floor, ok := latest[roomID]
if !ok {
continue // room has no top-level messages — nothing to read
}
lastSeen := floor.Add(time.Millisecond).UTC()
chosen := selectReaders(roomSubs, readRatio, rng)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/readreceipt_seed.go` around lines 69 - 77, byRoom is a map so
iterating over it uses a nondeterministic order and consumes the shared RNG
(rng) unpredictably; to restore deterministic behavior for SeedReadReceiptState,
collect the room IDs from byRoom, sort them (e.g., lexicographically), and
iterate over the sorted slice when generating lastSeen and calling selectReaders
so the same rng sequence is consumed in a stable order for every run.

claude added 2 commits June 4, 2026 06:54
Resolve conflicts between the read-receipt max-rps workload and the
room-read workload + bottleneck-attribution work from main:

- main.go / maxrps.go: keep both read-receipt and room-read workloads
  in the seed and max-rps dispatch, flag help, and default-steps list.
- README.md: keep both the read-receipt section and main's
  bottleneck-attribution + daily-IM guide additions.
- Adapt read-receipt seed/workload to main's refactored HistoryFixtures
  API: SeedThreadRooms/SeedHistoryCassandra now take *HistoryFixtures
  (SeedHistoryCassandra returns msgCount), and the in-memory plan is
  materialized via res.FullPlan() instead of the removed res.Plan field.
Resolve README.md conflict: keep main's new INCONCLUSIVE reasons
(emit-underrun vs saturation columns + rate-pacing note) ahead of the
read-receipt workload subsection. main.go auto-merged cleanly with both
the read-receipt and room-read workloads present.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants