feat(loadgen): bottleneck attribution for max-rps(messages)#262
Conversation
Pipeline-stage causality engine that fuses loadgen's per-stage signals with cAdvisor container resource trends to name the bottleneck on a max-rps(messages) breach. Scoped to messages + max-rps for v1.
Fake promclient via the consumer-defined interface covers the engine at unit scope; no container dependency to exercise.
When the messages workload trips, diagnoseBottleneck runs the attribution engine and appends the BOTTLENECK block to the rendered report and three culprit columns to the CSV trip row. Nil verdict (disabled, no trip, non-messages workload) leaves the report and CSV unchanged. Also fixes pre-existing hugeParam gocritic violations and goimports formatting in attribution.go, attribution_report.go, attribution_test.go, and main.go. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
- Rename dependency key "mongo" -> "mongodb" in messagesStageGraph and dependencyDisplayName to match the actual compose service label used by cAdvisor (com.docker.compose.service=mongodb), so MongoDB attribution no longer silently fails. - Apply dependencyDisplayName in fallbackRanking so low-confidence verdicts display "Cassandra" / "MongoDB" rather than raw keys, consistent with Pass 2. - Expand the no-baseline comment in saturated to note the over-blame risk. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
|
Warning Review limit reached
More reviews will be available in 23 minutes and 39 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements bottleneck attribution for the ChangesBottleneck Attribution for Messages Workload
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tools/loadgen/maxrps_report.go (1)
96-109:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBOTTLENECK block is dropped when no step passes.
When the first ramp step trips,
lastPassRPSreturns 0 and the functionreturn nils before reaching thebn != nilblock.diagnoseBottleneckreturns a verdict for any trip (first-step trips included), so the diagnosed BOTTLENECK is silently suppressed in precisely the case where it's most useful. The current tests don't cover this (they all have a passing step first).🐛 Render the bottleneck block on both the pass and no-pass paths
fmt.Fprintln(w) pass := lastPassRPS(results) if pass == 0 { fmt.Fprintf(w, "ANSWER: no step passed (workload=%s, preset=%s)\n", workload, preset) - return nil - } - fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset) - if trip := firstTrip(results); trip != nil { - fmt.Fprintf(w, " Next limit: %s\n", strings.Join(trip.Reasons, "; ")) - } + } else { + fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset) + if trip := firstTrip(results); trip != nil { + fmt.Fprintf(w, " Next limit: %s\n", strings.Join(trip.Reasons, "; ")) + } + } if bn != nil { renderBottleneck(w, bn) } return nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/maxrps_report.go` around lines 96 - 109, The BOTTLENECK output is skipped when lastPassRPS(results) == 0 because the function returns before checking bn; update the control flow so renderBottleneck(w, bn) is executed regardless of whether a passing step exists. Concretely, keep the existing “no step passed” ANSWER path but do not return immediately: after printing the "no step passed" message (using lastPassRPS and firstTrip), still call renderBottleneck(w, bn) when bn != nil (and then return), and ensure the existing passing-path behavior (printing max RPS, Next limit, then renderBottleneck) remains unchanged; locate changes around lastPassRPS, firstTrip, bn and renderBottleneck to implement this.tools/loadgen/deploy/Makefile (1)
78-85:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWait for Prometheus readiness before launching
max-rps.Line 80 only starts the metrics containers; Lines 81-85 immediately begin the ramp. With the 5s scrape interval in
tools/loadgen/deploy/prometheus/prometheus.yml, the first trip can happen before Prometheus has even scraped cAdvisor once, so the new attribution path frequently degrades toundeterminedon the exactmake run-max-rpsworkflow the README advertises. Please gate the exec on Prometheus readiness (for example via a healthcheck +up --wait, or an explicit/ -/readypoll) before starting the run.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/deploy/Makefile` around lines 78 - 85, The run-max-rps target currently brings up cadvisor and prometheus then immediately execs into the loadgen container to run /loadgen max-rps; modify run-max-rps so it waits for Prometheus to be ready before running the exec: after "$(COMPOSE) --profile dashboards up -d cadvisor prometheus" add a readiness gate that polls Prometheus' /-/ready (or uses compose healthcheck/up --wait) and only proceeds to "$(COMPOSE) exec -T loadgen /loadgen max-rps ..." once prometheus returns healthy; reference the run-max-rps target, $(COMPOSE) invocation, and the prometheus service name when implementing the check.
🧹 Nitpick comments (3)
tools/loadgen/maxrps_report_test.go (1)
118-130: ⚡ Quick winAdd coverage for a first-step trip with a bottleneck.
This test always has a passing step before the trip, so it doesn't catch the early-return gap in
renderRPSReportWithBottleneck(no step passed → BOTTLENECK suppressed). Once that path is fixed, a case with only averdictTripstep plus a non-nilbnwould guard against regression.💚 Suggested additional test case
func TestRenderRPSReport_AppendsBottleneck_NoPass(t *testing.T) { results := []rpsStepResult{ {TargetRPS: 500, Kind: verdictTrip, Reasons: []string{"E2 p95=400ms > 100ms"}}, } bn := bottleneckVerdict{Component: "message-worker", Resource: "Cassandra", Confidence: "high", Determined: true} var sb strings.Builder require.NoError(t, renderRPSReportWithBottleneck(&sb, results, "messages", "medium", &bn)) out := sb.String() assert.Contains(t, out, "ANSWER: no step passed") assert.Contains(t, out, "BOTTLENECK: message-worker (Cassandra-bound)") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/maxrps_report_test.go` around lines 118 - 130, renderRPSReportWithBottleneck currently early-returns/suppresses the BOTTLENECK when no step passed; update it so that if a non-nil bottleneckVerdict (bn) is provided it still appends the bottleneck info even when all rpsStepResult entries are verdictTrip (i.e., do not gate emitting the "BOTTLENECK" block on finding a passing step). Change the control flow in renderRPSReportWithBottleneck to compute the answer text (e.g., "no step passed" vs "max RPS = X") but always check bn != nil and append its formatted line (using bn.Component and bn.Resource such as "Component (Resource-bound)"), and add the suggested test TestRenderRPSReport_AppendsBottleneck_NoPass to cover the no-pass + bn case.tools/loadgen/promclient.go (1)
63-73: ⚡ Quick winSurface Prometheus HTTP status on non-2xx responses
In
tools/loadgen/promclient.goRangeQuery, Resty typically returns a nilerrfor HTTP 4xx/5xx, so non-2xx responses currently surface only as JSONdecode prometheus responseerrors (orprometheus query failed). Checkingresp.IsError()and returningresp.StatusCode()makes failures much more diagnosable.♻️ Proposed tweak
if err != nil { return nil, fmt.Errorf("query prometheus: %w", err) } + if resp.IsError() { + return nil, fmt.Errorf("query prometheus: status %d: %s", resp.StatusCode(), resp.String()) + } var parsed rangeQueryResponse🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/promclient.go` around lines 63 - 73, In RangeQuery, detect non-2xx HTTP responses from the Resty response before attempting to unmarshal: check resp.IsError() (or inspect resp.StatusCode()) after the request and return an error that includes the HTTP status code and resp.Status() or resp.Body() to make failures diagnosable; keep the subsequent json.Unmarshal into rangeQueryResponse and the existing parsed.Status check but only run them when resp.IsError() is false.tools/loadgen/attribution.go (1)
105-186: ⚡ Quick winBound total bottleneck diagnosis time (not just the per-QueryRange timeout).
RangeQueryuses a 10s HTTP timeout, butDiagnoseruns many sequential PromQL queries (viasaturated/cpuCores, plus extra work infallbackRanking). If Prometheus hangs rather than fails fast, the end-of-run BOTTLENECK report can still stall for many tens of seconds; wrap thectxused foreng.Diagnose(ctx, ...)(e.g.,context.WithTimeoutarounddiagnoseBottleneck/the call site) to cap total diagnosis duration.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/attribution.go` around lines 105 - 186, The Diagnose path can run many sequential PromQL calls (saturated, cpuCores, fallbackRanking) and needs a global timeout; wrap the context passed into eng.Diagnose (or into diagnoseBottleneck) with context.WithTimeout so the entire Diagnose run is bounded (e.g., create a child ctx with a sensible total timeout, defer cancel(), and pass that ctx into eng.Diagnose/diagnoseBottleneck), ensuring all internal calls (saturated, cpuCores, fallbackRanking) inherit the deadline and the bottleneck report cannot hang indefinitely.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md`:
- Around line 1439-1446: The markdown fenced block containing the
ANSWER/BOTTLENECK example lacks a language tag and triggers MD040; update that
fenced block (the triple-backtick block that starts with "ANSWER: max RPS =
2000..." and includes "BOTTLENECK: message-worker (Cassandra-bound)") to include
a language token such as text (i.e., change ``` to ```text) so the block is
recognized as code/text and the markdownlint warning is resolved.
In `@docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md`:
- Around line 52-59: The fenced example block containing the lines starting with
"ANSWER: max RPS = 2000..." is missing a language tag; update the opening fence
of that code block to include a language identifier (e.g., change ``` to ```text
or ```console) and ensure the closing fence remains ``` so markdownlint MD040 is
satisfied—locate the fenced block by the unique content "ANSWER: max RPS = 2000
(workload=messages, preset=medium)" and add the language tag to its opening
fence.
In `@tools/loadgen/README.md`:
- Around line 290-297: The fenced code block containing the example starting
with "ANSWER: max RPS = 2000 (workload=messages, preset=medium) ..." is missing
a language tag; update the opening fence from ``` to ```text so the block is
labeled (this resolves markdownlint MD040) and leave the block contents
unchanged.
---
Outside diff comments:
In `@tools/loadgen/deploy/Makefile`:
- Around line 78-85: The run-max-rps target currently brings up cadvisor and
prometheus then immediately execs into the loadgen container to run /loadgen
max-rps; modify run-max-rps so it waits for Prometheus to be ready before
running the exec: after "$(COMPOSE) --profile dashboards up -d cadvisor
prometheus" add a readiness gate that polls Prometheus' /-/ready (or uses
compose healthcheck/up --wait) and only proceeds to "$(COMPOSE) exec -T loadgen
/loadgen max-rps ..." once prometheus returns healthy; reference the run-max-rps
target, $(COMPOSE) invocation, and the prometheus service name when implementing
the check.
In `@tools/loadgen/maxrps_report.go`:
- Around line 96-109: The BOTTLENECK output is skipped when lastPassRPS(results)
== 0 because the function returns before checking bn; update the control flow so
renderBottleneck(w, bn) is executed regardless of whether a passing step exists.
Concretely, keep the existing “no step passed” ANSWER path but do not return
immediately: after printing the "no step passed" message (using lastPassRPS and
firstTrip), still call renderBottleneck(w, bn) when bn != nil (and then return),
and ensure the existing passing-path behavior (printing max RPS, Next limit,
then renderBottleneck) remains unchanged; locate changes around lastPassRPS,
firstTrip, bn and renderBottleneck to implement this.
---
Nitpick comments:
In `@tools/loadgen/attribution.go`:
- Around line 105-186: The Diagnose path can run many sequential PromQL calls
(saturated, cpuCores, fallbackRanking) and needs a global timeout; wrap the
context passed into eng.Diagnose (or into diagnoseBottleneck) with
context.WithTimeout so the entire Diagnose run is bounded (e.g., create a child
ctx with a sensible total timeout, defer cancel(), and pass that ctx into
eng.Diagnose/diagnoseBottleneck), ensuring all internal calls (saturated,
cpuCores, fallbackRanking) inherit the deadline and the bottleneck report cannot
hang indefinitely.
In `@tools/loadgen/maxrps_report_test.go`:
- Around line 118-130: renderRPSReportWithBottleneck currently
early-returns/suppresses the BOTTLENECK when no step passed; update it so that
if a non-nil bottleneckVerdict (bn) is provided it still appends the bottleneck
info even when all rpsStepResult entries are verdictTrip (i.e., do not gate
emitting the "BOTTLENECK" block on finding a passing step). Change the control
flow in renderRPSReportWithBottleneck to compute the answer text (e.g., "no step
passed" vs "max RPS = X") but always check bn != nil and append its formatted
line (using bn.Component and bn.Resource such as "Component (Resource-bound)"),
and add the suggested test TestRenderRPSReport_AppendsBottleneck_NoPass to cover
the no-pass + bn case.
In `@tools/loadgen/promclient.go`:
- Around line 63-73: In RangeQuery, detect non-2xx HTTP responses from the Resty
response before attempting to unmarshal: check resp.IsError() (or inspect
resp.StatusCode()) after the request and return an error that includes the HTTP
status code and resp.Status() or resp.Body() to make failures diagnosable; keep
the subsequent json.Unmarshal into rangeQueryResponse and the existing
parsed.Status check but only run them when resp.IsError() is false.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f072573-1e5e-4807-ac60-91df745e99fc
📒 Files selected for processing (27)
.gitignoredocs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.mddocs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.mdtools/loadgen/README.mdtools/loadgen/attribution.gotools/loadgen/attribution_report.gotools/loadgen/attribution_report_test.gotools/loadgen/attribution_test.gotools/loadgen/config_bottleneck_test.gotools/loadgen/deploy/Makefiletools/loadgen/deploy/docker-compose.ymltools/loadgen/deploy/prometheus/prometheus.ymltools/loadgen/identity.gotools/loadgen/identity_test.gotools/loadgen/main.gotools/loadgen/maxrps.gotools/loadgen/maxrps_report.gotools/loadgen/maxrps_report_test.gotools/loadgen/maxrps_test.gotools/loadgen/promclient.gotools/loadgen/promclient_test.gotools/loadgen/ramp.gotools/loadgen/ramp_test.gotools/loadgen/stagegraph.gotools/loadgen/stagegraph_test.gotools/loadgen/verdict.gotools/loadgen/verdict_test.go
Satisfies markdownlint MD040 (fenced-code-language) flagged by CodeRabbit on PR #262. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
…anking saturated() already computes trip-window cores; return it so fallbackRanking ranks on that value instead of re-issuing an identical Prometheus query per saturated service. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
…jan-gA5rC # Conflicts: # tools/loadgen/README.md # tools/loadgen/main.go
84213da to
c9b69ef
Compare
Summary
When a
loadgen max-rps --workload=messagesramp trips an SLO, it now appends aBOTTLENECK:block to the verdict that names the culprit component, the saturated resource, and a confidence — instead of leaving you to cross-reference the verdict against a Grafana dashboard by eye.It fuses loadgen's own per-stage signals (E1/E2 latency split, per-durable JetStream backlog) with cAdvisor container-CPU trends pulled from Prometheus, walks the messages pipeline stage-graph, and attributes the breach.
How it works
attribution.go): a 5-pass causality walk —high(a backing-up stage whose own CPU is saturated) →high(its backing dependency is saturated) →medium(backs up, no resource knee → likely I/O/lock wait) →low(resource-ranking fallback) →undetermined.promclient.go(Prometheusquery_rangeclient over the sharedrestyutil),stagegraph.go(declarative messages pipeline),identity.go(cAdvisor compose-service label → selector, with short-ID fallback),attribution_report.go(renders the block + CSV columns).BOTTLENECK_ENABLED=falseor no Prometheus URL; never returns an error or blocks the run. Prometheus down / thin data / breach on step 1 →BOTTLENECK: undetermined (<reason>)and the run reports exactly as before.make run-max-rpsnow brings up cAdvisor + Prometheus automatically (a cAdvisor scrape job + service were added to the loadgen deploy overlay). Tunables viaBOTTLENECK_*env (documented intools/loadgen/README.md).Config:
BOTTLENECK_ENABLED(defaulttrue),BOTTLENECK_PROM_URL,BOTTLENECK_KNEE_TOLERANCE(0.10),BOTTLENECK_QUERY_STEP(5s),BOTTLENECK_CONTAINER_MAP.Design + plan are included under
docs/superpowers/.Test Plan
make test SERVICE=tools/loadgen— full unit suite passes with-racepromQuerier(no live Prometheus needed in unit tests)make lint— clean (0 issues)make sast-gosec— passesmake build SERVICE=tools/loadgen— buildsmake run-max-rps PRESET=mediumagainst the local stack and confirm a realBOTTLENECK:line on a trip (heuristics — knee tolerance, ~1-core floor — may want host-specific tuning)Notes / known v1 heuristics
cpuSaturatedFloorCores = 1.0andKneeTolerance = 0.10are host-relative starting points (no CPU limits on the local stack); expect to tune.HoldEndincludes the adapter's ~2s post-hold drain, slightly diluting the trip-window CPU rate — flagged in-code for a future tighter window.https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
Generated by Claude Code
Summary by CodeRabbit
New Features
max-rps --workload=messagestests; when a test trips, the system diagnoses which component caused the failure using CPU and latency metrics.make run-max-rps.Documentation
BOTTLENECK_*configuration options.