feat(loadgen): bottleneck attribution for max-rps(messages) by hmchangw · Pull Request #262 · hmchangw/chat

hmchangw · 2026-06-02T09:46:25Z

Summary

When a loadgen max-rps --workload=messages ramp trips an SLO, it now appends a BOTTLENECK: block to the verdict that names the culprit component, the saturated resource, and a confidence — instead of leaving you to cross-reference the verdict against a Grafana dashboard by eye.

ANSWER: max RPS = 2000 (workload=messages, preset=medium)
        Next limit: E2 p95=143ms > 100ms
BOTTLENECK: message-worker (Cassandra-bound)
        message-worker consumer backlog grew (first stage to back up)
        cassandra CPU plateaued between 1000 and 2000 rps while load rose
        confidence: high

It fuses loadgen's own per-stage signals (E1/E2 latency split, per-durable JetStream backlog) with cAdvisor container-CPU trends pulled from Prometheus, walks the messages pipeline stage-graph, and attributes the breach.

How it works

New engine (attribution.go): a 5-pass causality walk — high (a backing-up stage whose own CPU is saturated) → high (its backing dependency is saturated) → medium (backs up, no resource knee → likely I/O/lock wait) → low (resource-ranking fallback) → undetermined.
Saturation = a CPU "knee" relative to the ramp. Because the local stack sets no CPU limits, "saturated" means a container's CPU plateaued between the last passing step and the tripping step while offered RPS rose — combined with an absolute floor (~1 core) so a low-flat container (waiting on a slow dependency) is attributed to that dependency rather than mislabeled CPU-bound.
Supporting units: promclient.go (Prometheus query_range client over the shared restyutil), stagegraph.go (declarative messages pipeline), identity.go (cAdvisor compose-service label → selector, with short-ID fallback), attribution_report.go (renders the block + CSV columns).
Purely additive & best-effort: messages-workload only; disabled when BOTTLENECK_ENABLED=false or no Prometheus URL; never returns an error or blocks the run. Prometheus down / thin data / breach on step 1 → BOTTLENECK: undetermined (<reason>) and the run reports exactly as before.
Deploy: make run-max-rps now brings up cAdvisor + Prometheus automatically (a cAdvisor scrape job + service were added to the loadgen deploy overlay). Tunables via BOTTLENECK_* env (documented in tools/loadgen/README.md).
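The CPU "knee" rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `kneeParams` and `cpuKnee` are hypothetical names, and treating a missing baseline as "not saturated" is a choice made for this sketch, not necessarily what attribution.go does.

```go
package main

import "fmt"

// kneeParams mirrors the two tunables named in the PR description.
// The type and field names are hypothetical.
type kneeParams struct {
	Tolerance  float64 // relative CPU growth still counted as "flat" (0.10 in the PR)
	FloorCores float64 // absolute floor below which a flat container is not CPU-bound (~1 core)
}

// cpuKnee reports whether CPU plateaued between the last passing step and the
// tripping step while offered RPS rose.
func cpuKnee(passRPS, tripRPS int, passCores, tripCores float64, p kneeParams) bool {
	if tripRPS <= passRPS || passCores <= 0 {
		return false // no load increase, or no baseline (assumption: treat as not saturated)
	}
	if tripCores < p.FloorCores {
		return false // low and flat: more likely waiting on a slow dependency
	}
	growth := (tripCores - passCores) / passCores
	return growth <= p.Tolerance
}

func main() {
	p := kneeParams{Tolerance: 0.10, FloorCores: 1.0}
	fmt.Println(cpuKnee(1000, 2000, 3.8, 3.9, p)) // plateau above the floor
	fmt.Println(cpuKnee(1000, 2000, 0.2, 0.2, p)) // flat but below the floor
	fmt.Println(cpuKnee(1000, 2000, 1.5, 3.0, p)) // still scaling with load
}
```

The floor check runs before the growth check so that an idle container never counts as saturated, which matches the intent stated for the ~1-core floor.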

Config: BOTTLENECK_ENABLED (default true), BOTTLENECK_PROM_URL, BOTTLENECK_KNEE_TOLERANCE (0.10), BOTTLENECK_QUERY_STEP (5s), BOTTLENECK_CONTAINER_MAP.

Design + plan are included under docs/superpowers/.

Test Plan

make test SERVICE=tools/loadgen — full unit suite passes with -race
Every new file ≥ 80% coverage (engine functions 83–100%); error/empty/reset/undetermined paths covered via an injected fake promQuerier (no live Prometheus needed in unit tests)
make lint — clean (0 issues)
make sast-gosec — passes
make build SERVICE=tools/loadgen — builds
Manual: make run-max-rps PRESET=medium against the local stack and confirm a real BOTTLENECK: line on a trip (heuristics — knee tolerance, ~1-core floor — may want host-specific tuning)

Notes / known v1 heuristics

cpuSaturatedFloorCores = 1.0 and KneeTolerance = 0.10 are host-relative starting points (no CPU limits on the local stack); expect to tune.
HoldEnd includes the adapter's ~2s post-hold drain, slightly diluting the trip-window CPU rate — flagged in-code for a future tighter window.
messages-only for v1; history/members would each add their own stage-graph.

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

Generated by Claude Code

Summary by CodeRabbit

New Features
- Added bottleneck attribution for max-rps --workload=messages tests; when a test trips, the system diagnoses which component caused the failure using CPU and latency metrics.
- Prometheus and cAdvisor services now automatically start with make run-max-rps.
Documentation
- Added bottleneck attribution feature documentation and new BOTTLENECK_* configuration options.

When the messages workload trips, diagnoseBottleneck runs the attribution engine and appends the BOTTLENECK block to the rendered report and three culprit columns to the CSV trip row. Nil verdict (disabled, no trip, non-messages workload) leaves the report and CSV unchanged. Also fixes pre-existing hugeParam gocritic violations and goimports formatting in attribution.go, attribution_report.go, attribution_test.go, and main.go. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

- Rename dependency key "mongo" -> "mongodb" in messagesStageGraph and dependencyDisplayName to match the actual compose service label used by cAdvisor (com.docker.compose.service=mongodb), so MongoDB attribution no longer silently fails.
- Apply dependencyDisplayName in fallbackRanking so low-confidence verdicts display "Cassandra" / "MongoDB" rather than raw keys, consistent with Pass 2.
- Expand the no-baseline comment in saturated to note the over-blame risk.

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

coderabbitai · 2026-06-02T09:46:38Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5144d648-40d2-49bf-bbbc-c9624f637c13

📥 Commits

Reviewing files that changed from the base of the PR and between 31ed4ae and c9b69ef.

📒 Files selected for processing (3)

.gitignore
docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md
docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md

📝 Walkthrough

Walkthrough

This PR implements bottleneck attribution for the max-rps --workload=messages load generator. When a ramp step trips an SLO, the feature diagnoses which component/resource is saturated by fusing loadgen per-stage signals (latency breaches and backlog deltas) with cAdvisor container CPU metrics from Prometheus, then appends a BOTTLENECK: verdict to the report.
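The precedence-based walk can be sketched as follows. This is a simplified illustration, not the PR's engine: `stageSignal`, `verdict`, and `attribute` are hypothetical names, the per-stage booleans stand in for the real backlog, latency, and CPU queries, and running each pass over all stages before moving to the next pass is an assumption about ordering.

```go
package main

import "fmt"

// stageSignal is a hypothetical pre-computed summary of one pipeline stage.
type stageSignal struct {
	Name         string
	BackingUp    bool   // backlog grew or the stage's latency SLO was breached
	CPUSaturated bool   // the stage's own container shows a CPU knee
	Dependency   string // backing dependency, "" when the stage has none
	DepSaturated bool   // the dependency's container shows a CPU knee
}

type verdict struct {
	Component  string
	Resource   string
	Confidence string
}

// attribute walks the stages in pipeline order, one pass per precedence level.
// topCPU is the container picked by the resource-ranking fallback, "" if none.
func attribute(stages []stageSignal, topCPU string) verdict {
	for _, s := range stages { // pass 1: stage backs up and is itself CPU-bound
		if s.BackingUp && s.CPUSaturated {
			return verdict{Component: s.Name, Resource: "CPU", Confidence: "high"}
		}
	}
	for _, s := range stages { // pass 2: stage backs up behind a saturated dependency
		if s.BackingUp && s.Dependency != "" && s.DepSaturated {
			return verdict{Component: s.Name, Resource: s.Dependency, Confidence: "high"}
		}
	}
	for _, s := range stages { // pass 3: backs up with no resource knee
		if s.BackingUp {
			return verdict{Component: s.Name, Resource: "I/O or lock wait", Confidence: "medium"}
		}
	}
	if topCPU != "" { // pass 4: resource-ranking fallback
		return verdict{Component: topCPU, Resource: "CPU", Confidence: "low"}
	}
	return verdict{Confidence: "undetermined"} // pass 5
}

func main() {
	stages := []stageSignal{
		{Name: "gatekeeper"},
		{Name: "message-worker", BackingUp: true, Dependency: "Cassandra", DepSaturated: true},
		{Name: "broadcast-worker"},
	}
	fmt.Printf("%+v\n", attribute(stages, ""))
}
```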

Changes

Bottleneck Attribution for Messages Workload

| Layer / File(s) | Summary |
| --- | --- |
| Design & Implementation Plan `docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md`, `docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md` | High-level architecture and task-by-task breakdown of the bottleneck attribution feature, including algorithm overview, error handling, output format, and component requirements. |
| User-Facing Documentation `tools/loadgen/README.md` | Describes the `BOTTLENECK:` block format, attribution signal fusion, environment configuration (`BOTTLENECK_*` vars), and fallback behavior when telemetry is unavailable. |
| Measurement Window & Step Results `tools/loadgen/verdict.go`, `tools/loadgen/ramp.go`, `tools/loadgen/*_test.go` | Extend `rpsStepResult` with `HoldStart`/`HoldEnd` (wall-clock measurement windows) and `Pending` (per-durable backlog deltas); capture timing in `runRamp` around each step execution. |
| Prometheus Range Query Client `tools/loadgen/promclient.go`, `tools/loadgen/promclient_test.go` | HTTP client for Prometheus `/api/v1/query_range`, parsing matrix results with timestamp/value pairs and error handling for non-success responses. |
| Pipeline Model & Stage Graph `tools/loadgen/stagegraph.go`, `tools/loadgen/stagegraph_test.go` | Declarative `stage` struct modeling the messages pipeline (gatekeeper, message worker, broadcast worker) with container labels, durability, latency series, and dependency mappings. |
| Container Identity Resolution `tools/loadgen/identity.go`, `tools/loadgen/identity_test.go` | `identityResolver` builds PromQL selectors using compose-service labels or optional short-ID fallback; `parseContainerMap` parses `BOTTLENECK_CONTAINER_MAP` environment variable. |
| Attribution Engine `tools/loadgen/attribution.go`, `tools/loadgen/attribution_test.go` | Core diagnosis logic with precedence-based causality walk: CPU saturation detection (plateau analysis), backing-up detection (backlog/latency SLO), dependency reasoning, and CPU-ranking fallback, tested across 11 scenarios. |
| Bottleneck Output Formatting `tools/loadgen/attribution_report.go`, `tools/loadgen/attribution_report_test.go` | `renderBottleneck` formats determined/undetermined verdicts; `bottleneckCSVColumns` provides CSV row values for integration into report. |
| Configuration & maxrps Integration `tools/loadgen/main.go`, `tools/loadgen/maxrps.go`, `tools/loadgen/maxrps_test.go` | `bottleneckConfig` struct with `BOTTLENECK_`-prefixed env vars; `diagnoseBottleneck` helper with guards (messages-only, enabled, Prometheus URL, trip exists); integration into `runMaxRPS`. |
| RPS Report Rendering `tools/loadgen/maxrps_report.go`, `tools/loadgen/maxrps_report_test.go` | Refactor `renderRPSReport` to delegate to `renderRPSReportWithBottleneck`; extend CSV header/rows with bottleneck columns; conditionally append `BOTTLENECK:` block to report output. |
| Deployment Infrastructure `tools/loadgen/deploy/docker-compose.yml`, `tools/loadgen/deploy/Makefile`, `tools/loadgen/deploy/prometheus/prometheus.yml`, `.gitignore` | Add cAdvisor service (privileged, host mount, health check), Prometheus scrape job for cAdvisor, service dependency ordering, `BOTTLENECK_PROM_URL` env var, and Makefile target to start services before test. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hmchangw/chat#240: Main PR extends the existing tools/loadgen max-rps messages SLO-finder implementation with bottleneck attribution wired into maxrps.go/maxrps_report.go and step result enrichment.

Suggested labels

ready

Suggested reviewers

mliu33
ngangwar962

Poem

🐰 A ramp that trips now tells a tale,
Which CPU or queue did fail?
From Prometheus and backlog signs,
The bottleneck is drawn in lines!
No more unknowns—just facts so clear, 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main feature being added: bottleneck attribution for the max-rps command when running the messages workload. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tools/loadgen/maxrps_report.go (1)
96-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

BOTTLENECK block is dropped when no step passes.

When the first ramp step trips, lastPassRPS returns 0 and the function returns nil before reaching the bn != nil block. diagnoseBottleneck returns a verdict for any trip (first-step trips included), so the diagnosed BOTTLENECK is silently suppressed in precisely the case where it's most useful. The current tests don't cover this (they all have a passing step first).
🐛 Render the bottleneck block on both the pass and no-pass paths
 	fmt.Fprintln(w)
 	pass := lastPassRPS(results)
 	if pass == 0 {
 		fmt.Fprintf(w, "ANSWER: no step passed (workload=%s, preset=%s)\n", workload, preset)
-		return nil
-	}
-	fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset)
-	if trip := firstTrip(results); trip != nil {
-		fmt.Fprintf(w, "        Next limit: %s\n", strings.Join(trip.Reasons, "; "))
-	}
+	} else {
+		fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset)
+		if trip := firstTrip(results); trip != nil {
+			fmt.Fprintf(w, "        Next limit: %s\n", strings.Join(trip.Reasons, "; "))
+		}
+	}
 	if bn != nil {
 		renderBottleneck(w, bn)
 	}
 	return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_report.go` around lines 96 - 109, The BOTTLENECK output
is skipped when lastPassRPS(results) == 0 because the function returns before
checking bn; update the control flow so renderBottleneck(w, bn) is executed
regardless of whether a passing step exists. Concretely, keep the existing “no
step passed” ANSWER path but do not return immediately: after printing the "no
step passed" message (using lastPassRPS and firstTrip), still call
renderBottleneck(w, bn) when bn != nil (and then return), and ensure the
existing passing-path behavior (printing max RPS, Next limit, then
renderBottleneck) remains unchanged; locate changes around lastPassRPS,
firstTrip, bn and renderBottleneck to implement this.
tools/loadgen/deploy/Makefile (1)
78-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for Prometheus readiness before launching max-rps.

Line 80 only starts the metrics containers; Lines 81-85 immediately begin the ramp. With the 5s scrape interval in tools/loadgen/deploy/prometheus/prometheus.yml, the first trip can happen before Prometheus has even scraped cAdvisor once, so the new attribution path frequently degrades to undetermined on the exact make run-max-rps workflow the README advertises. Please gate the exec on Prometheus readiness (for example via a healthcheck + up --wait, or an explicit /-/ready poll) before starting the run.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/deploy/Makefile` around lines 78 - 85, The run-max-rps target
currently brings up cadvisor and prometheus then immediately execs into the
loadgen container to run /loadgen max-rps; modify run-max-rps so it waits for
Prometheus to be ready before running the exec: after "$(COMPOSE) --profile
dashboards up -d cadvisor prometheus" add a readiness gate that polls
Prometheus' /-/ready (or uses compose healthcheck/up --wait) and only proceeds
to "$(COMPOSE) exec -T loadgen /loadgen max-rps ..." once prometheus returns
healthy; reference the run-max-rps target, $(COMPOSE) invocation, and the
prometheus service name when implementing the check.

🧹 Nitpick comments (3)

tools/loadgen/maxrps_report_test.go (1)
118-130: ⚡ Quick win

Add coverage for a first-step trip with a bottleneck.

This test always has a passing step before the trip, so it doesn't catch the early-return gap in renderRPSReportWithBottleneck (no step passed → BOTTLENECK suppressed). Once that path is fixed, a case with only a verdictTrip step plus a non-nil bn would guard against regression.
💚 Suggested additional test case
func TestRenderRPSReport_AppendsBottleneck_NoPass(t *testing.T) {
	results := []rpsStepResult{
		{TargetRPS: 500, Kind: verdictTrip, Reasons: []string{"E2 p95=400ms > 100ms"}},
	}
	bn := bottleneckVerdict{Component: "message-worker", Resource: "Cassandra", Confidence: "high", Determined: true}
	var sb strings.Builder
	require.NoError(t, renderRPSReportWithBottleneck(&sb, results, "messages", "medium", &bn))
	out := sb.String()
	assert.Contains(t, out, "ANSWER: no step passed")
	assert.Contains(t, out, "BOTTLENECK: message-worker (Cassandra-bound)")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_report_test.go` around lines 118 - 130,
renderRPSReportWithBottleneck currently early-returns/suppresses the BOTTLENECK
when no step passed; update it so that if a non-nil bottleneckVerdict (bn) is
provided it still appends the bottleneck info even when all rpsStepResult
entries are verdictTrip (i.e., do not gate emitting the "BOTTLENECK" block on
finding a passing step). Change the control flow in
renderRPSReportWithBottleneck to compute the answer text (e.g., "no step passed"
vs "max RPS = X") but always check bn != nil and append its formatted line
(using bn.Component and bn.Resource such as "Component (Resource-bound)"), and
add the suggested test TestRenderRPSReport_AppendsBottleneck_NoPass to cover the
no-pass + bn case.
tools/loadgen/promclient.go (1)
63-73: ⚡ Quick win

Surface Prometheus HTTP status on non-2xx responses

In tools/loadgen/promclient.go RangeQuery, Resty typically returns a nil err for HTTP 4xx/5xx, so non-2xx responses currently surface only as JSON decode prometheus response errors (or prometheus query failed). Checking resp.IsError() and returning resp.StatusCode() makes failures much more diagnosable.
♻️ Proposed tweak
 	if err != nil {
 		return nil, fmt.Errorf("query prometheus: %w", err)
 	}
+	if resp.IsError() {
+		return nil, fmt.Errorf("query prometheus: status %d: %s", resp.StatusCode(), resp.String())
+	}
 
 	var parsed rangeQueryResponse
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/promclient.go` around lines 63 - 73, In RangeQuery, detect
non-2xx HTTP responses from the Resty response before attempting to unmarshal:
check resp.IsError() (or inspect resp.StatusCode()) after the request and return
an error that includes the HTTP status code and resp.Status() or resp.Body() to
make failures diagnosable; keep the subsequent json.Unmarshal into
rangeQueryResponse and the existing parsed.Status check but only run them when
resp.IsError() is false.
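For context, decoding a `query_range` matrix body might look like the sketch below. It uses only the standard library rather than the PR's Resty-based client. The response shape follows the Prometheus HTTP API, where each value is a `[timestamp, "string value"]` pair; the `rangeQueryResponse` name comes from the snippet above, while its fields, the `sample` type, and `parseMatrix` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// sample is one (timestamp, value) point of a series.
type sample struct {
	Unix  float64
	Value float64
}

// rangeQueryResponse models the parts of the response this sketch reads.
type rangeQueryResponse struct {
	Status string `json:"status"`
	Error  string `json:"error"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string    `json:"metric"`
			Values [][2]json.RawMessage `json:"values"`
		} `json:"result"`
	} `json:"data"`
}

// parseMatrix returns one slice of samples per series in the response.
func parseMatrix(body []byte) ([][]sample, error) {
	var parsed rangeQueryResponse
	if err := json.Unmarshal(body, &parsed); err != nil {
		return nil, fmt.Errorf("decode prometheus response: %w", err)
	}
	if parsed.Status != "success" {
		return nil, fmt.Errorf("prometheus query failed: %s", parsed.Error)
	}
	if parsed.Data.ResultType != "matrix" {
		return nil, fmt.Errorf("unexpected result type %q", parsed.Data.ResultType)
	}
	series := make([][]sample, 0, len(parsed.Data.Result))
	for _, r := range parsed.Data.Result {
		points := make([]sample, 0, len(r.Values))
		for _, pair := range r.Values {
			var ts float64
			var raw string
			if err := json.Unmarshal(pair[0], &ts); err != nil {
				return nil, fmt.Errorf("decode timestamp: %w", err)
			}
			if err := json.Unmarshal(pair[1], &raw); err != nil {
				return nil, fmt.Errorf("decode value: %w", err)
			}
			v, err := strconv.ParseFloat(raw, 64)
			if err != nil {
				return nil, fmt.Errorf("parse value %q: %w", raw, err)
			}
			points = append(points, sample{Unix: ts, Value: v})
		}
		series = append(series, points)
	}
	return series, nil
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"matrix","result":[{"metric":{},"values":[[1717000000,"1.5"],[1717000005,"2.25"]]}]}}`)
	series, err := parseMatrix(body)
	fmt.Println(series, err)
}
```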
tools/loadgen/attribution.go (1)
105-186: ⚡ Quick win

Bound total bottleneck diagnosis time (not just the per-RangeQuery timeout).
RangeQuery uses a 10s HTTP timeout, but Diagnose runs many sequential PromQL queries (via saturated/cpuCores, plus extra work in fallbackRanking). If Prometheus hangs rather than fails fast, the end-of-run BOTTLENECK report can still stall for many tens of seconds; wrap the ctx used for eng.Diagnose(ctx, ...) (e.g., context.WithTimeout around diagnoseBottleneck/the call site) to cap total diagnosis duration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/attribution.go` around lines 105 - 186, The Diagnose path can
run many sequential PromQL calls (saturated, cpuCores, fallbackRanking) and
needs a global timeout; wrap the context passed into eng.Diagnose (or into
diagnoseBottleneck) with context.WithTimeout so the entire Diagnose run is
bounded (e.g., create a child ctx with a sensible total timeout, defer cancel(),
and pass that ctx into eng.Diagnose/diagnoseBottleneck), ensuring all internal
calls (saturated, cpuCores, fallbackRanking) inherit the deadline and the
bottleneck report cannot hang indefinitely.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md`:
- Around line 1439-1446: The markdown fenced block containing the
ANSWER/BOTTLENECK example lacks a language tag and triggers MD040; update that
fenced block (the triple-backtick block that starts with "ANSWER: max RPS =
2000..." and includes "BOTTLENECK: message-worker (Cassandra-bound)") to include
a language token such as text (i.e., change ``` to ```text) so the block is
recognized as code/text and the markdownlint warning is resolved.

In `@docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md`:
- Around line 52-59: The fenced example block containing the lines starting with
"ANSWER: max RPS = 2000..." is missing a language tag; update the opening fence
of that code block to include a language identifier (e.g., change ``` to ```text
or ```console) and ensure the closing fence remains ``` so markdownlint MD040 is
satisfied—locate the fenced block by the unique content "ANSWER: max RPS = 2000
(workload=messages, preset=medium)" and add the language tag to its opening
fence.

In `@tools/loadgen/README.md`:
- Around line 290-297: The fenced code block containing the example starting
with "ANSWER: max RPS = 2000 (workload=messages, preset=medium) ..." is missing
a language tag; update the opening fence from ``` to ```text so the block is
labeled (this resolves markdownlint MD040) and leave the block contents
unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f072573-1e5e-4807-ac60-91df745e99fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9c5d14a and 31ed4ae.

📒 Files selected for processing (27)

.gitignore
docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md
docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md
tools/loadgen/README.md
tools/loadgen/attribution.go
tools/loadgen/attribution_report.go
tools/loadgen/attribution_report_test.go
tools/loadgen/attribution_test.go
tools/loadgen/config_bottleneck_test.go
tools/loadgen/deploy/Makefile
tools/loadgen/deploy/docker-compose.yml
tools/loadgen/deploy/prometheus/prometheus.yml
tools/loadgen/identity.go
tools/loadgen/identity_test.go
tools/loadgen/main.go
tools/loadgen/maxrps.go
tools/loadgen/maxrps_report.go
tools/loadgen/maxrps_report_test.go
tools/loadgen/maxrps_test.go
tools/loadgen/promclient.go
tools/loadgen/promclient_test.go
tools/loadgen/ramp.go
tools/loadgen/ramp_test.go
tools/loadgen/stagegraph.go
tools/loadgen/stagegraph_test.go
tools/loadgen/verdict.go
tools/loadgen/verdict_test.go

…anking saturated() already computes trip-window cores; return it so fallbackRanking ranks on that value instead of re-issuing an identical Prometheus query per saturated service. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

mliu33

Thanks!

claude added 25 commits June 2, 2026 04:07

docs: design for loadgen bottleneck attribution (v1)

d35dfd4

Pipeline-stage causality engine that fuses loadgen's per-stage signals with cAdvisor container resource trends to name the bottleneck on a max-rps(messages) breach. Scoped to messages + max-rps for v1.

docs: drop optional integration test from bottleneck design

9f8abfc

Fake promclient via the consumer-defined interface covers the engine at unit scope; no container dependency to exercise.

docs: implementation plan for loadgen bottleneck attribution

57a43de

feat(loadgen): carry hold window + per-durable deltas on step result

1b0da6d

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): clarify Task 1 window test + note drain in HoldEnd

f81f786

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

feat(loadgen): add Prometheus range-query client

1c7c1e8

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): cover promclient error + skip paths; comment polish

135e4ac

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

feat(loadgen): add messages pipeline stage-graph

45d726d

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): assert all stage-graph fields; clarify DependsOn comment

9610b7a

feat(loadgen): add cAdvisor container identity resolver

b3ce772

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): assert parseContainerMap returns non-nil map on empty …

157ac6a

…input

feat(loadgen): add bottleneck engine CPU-knee primitive

57b236d

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): cover cpuCores error paths; document window assumption

1182602

feat(loadgen): add bottleneck causality walk + fallback

a4bc57a

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): cover all-clear verdict; dedup error stub; align depen…

a015465

…dency wording https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

feat(loadgen): render BOTTLENECK verdict block

6bc1e1a

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

test(loadgen): cover empty-reasons undetermined render path

bd2337a

feat(loadgen): add bottleneck attribution config

e2d0b96

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

refactor(loadgen): consolidate bottleneck gating; cover diagnose guards

4d89cf5

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

feat(loadgen): bring up cAdvisor for bottleneck attribution + docs

d14c884

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

fix(loadgen): run-max-rps starts cAdvisor+Prometheus so attribution w…

725a6a8

…orks

docs(loadgen): tighten bottleneck attribution deploy note

44b99c8

chore: gitignore stray root-level loadgen build artifact

3689242

coderabbitai Bot reviewed Jun 2, 2026

Comment thread docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md Outdated

Comment thread docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md Outdated

Comment thread tools/loadgen/README.md Outdated

claude added 3 commits June 2, 2026 10:09

docs(loadgen): tag ANSWER/BOTTLENECK example fences as text (MD040)

f4c8790

Satisfies markdownlint MD040 (fenced-code-language) flagged by CodeRabbit on PR #262. https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j

Merge remote-tracking branch 'origin/main' into claude/magical-ramanu…

c9b69ef

…jan-gA5rC # Conflicts: # tools/loadgen/README.md # tools/loadgen/main.go

hmchangw force-pushed the claude/magical-ramanujan-gA5rC branch from 84213da to c9b69ef Compare June 4, 2026 02:34

mliu33 approved these changes Jun 4, 2026

mliu33 merged commit 04bb191 into main Jun 4, 2026
6 checks passed

mliu33 merged 28 commits into main from claude/magical-ramanujan-gA5rC