
feat(loadgen): bottleneck attribution for max-rps(messages) #262

Merged
mliu33 merged 28 commits into main from claude/magical-ramanujan-gA5rC
Jun 4, 2026

Conversation

@hmchangw
Owner

@hmchangw hmchangw commented Jun 2, 2026

Summary

When a loadgen max-rps --workload=messages ramp trips an SLO, it now appends a BOTTLENECK: block to the verdict that names the culprit component, the saturated resource, and a confidence level, so you no longer have to cross-reference the verdict against a Grafana dashboard by eye.

ANSWER: max RPS = 2000 (workload=messages, preset=medium)
        Next limit: E2 p95=143ms > 100ms
BOTTLENECK: message-worker (Cassandra-bound)
        message-worker consumer backlog grew (first stage to back up)
        cassandra CPU plateaued between 1000 and 2000 rps while load rose
        confidence: high

It fuses loadgen's own per-stage signals (E1/E2 latency split, per-durable JetStream backlog) with cAdvisor container-CPU trends pulled from Prometheus, walks the messages pipeline stage-graph, and attributes the breach.

How it works

  • New engine (attribution.go): a 5-pass causality walk — high (a backing-up stage whose own CPU is saturated) → high (its backing dependency is saturated) → medium (backs up, no resource knee → likely I/O/lock wait) → low (resource-ranking fallback) → undetermined.
  • Saturation = a CPU "knee" relative to the ramp. Because the local stack sets no CPU limits, "saturated" means a container's CPU plateaued between the last passing step and the tripping step while offered RPS rose — combined with an absolute floor (~1 core) so a low-flat container (waiting on a slow dependency) is attributed to that dependency rather than mislabeled CPU-bound.
  • Supporting units: promclient.go (Prometheus query_range client over the shared restyutil), stagegraph.go (declarative messages pipeline), identity.go (cAdvisor compose-service label → selector, with short-ID fallback), attribution_report.go (renders the block + CSV columns).
  • Purely additive & best-effort: messages-workload only; disabled when BOTTLENECK_ENABLED=false or no Prometheus URL; never returns an error or blocks the run. Prometheus down / thin data / breach on step 1 → BOTTLENECK: undetermined (<reason>) and the run reports exactly as before.
  • Deploy: make run-max-rps now brings up cAdvisor + Prometheus automatically (a cAdvisor scrape job + service were added to the loadgen deploy overlay). Tunables via BOTTLENECK_* env (documented in tools/loadgen/README.md).
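
The pass ordering described in the list above can be sketched as a simple precedence walk. This is a minimal illustration, not the code in attribution.go: the stageObs and verdict types, their field names, and the attribute function are hypothetical, and the per-stage signals (backing up, saturated) are assumed to be computed beforehand.

```go
package main

import "fmt"

// stageObs is a hypothetical, pre-computed observation for one pipeline stage.
type stageObs struct {
	Name         string
	BackingUp    bool    // backlog grew or the stage's latency SLO was breached
	OwnSaturated bool    // the stage's own container shows a CPU knee
	SaturatedDep string  // a saturated backing dependency, "" if none
	TripCores    float64 // CPU cores used during the tripping step
}

// verdict mirrors the fields shown in the BOTTLENECK block (names assumed).
type verdict struct {
	Component, Resource, Confidence string
	Determined                      bool
}

// attribute walks the stages in pipeline order, one pass per confidence tier.
func attribute(stages []stageObs) verdict {
	// Pass 1 (high): a backing-up stage whose own CPU is saturated.
	for _, s := range stages {
		if s.BackingUp && s.OwnSaturated {
			return verdict{s.Name, "CPU", "high", true}
		}
	}
	// Pass 2 (high): a backing-up stage whose backing dependency is saturated.
	for _, s := range stages {
		if s.BackingUp && s.SaturatedDep != "" {
			return verdict{s.Name, s.SaturatedDep, "high", true}
		}
	}
	// Pass 3 (medium): backs up with no resource knee, likely I/O or lock wait.
	for _, s := range stages {
		if s.BackingUp {
			return verdict{s.Name, "I/O or lock wait", "medium", true}
		}
	}
	// Pass 4 (low): fall back to the saturated stage using the most CPU.
	best := -1
	for i, s := range stages {
		if s.OwnSaturated && (best < 0 || s.TripCores > stages[best].TripCores) {
			best = i
		}
	}
	if best >= 0 {
		return verdict{stages[best].Name, "CPU", "low", true}
	}
	// Pass 5: nothing matched.
	return verdict{Confidence: "undetermined"}
}

func main() {
	v := attribute([]stageObs{
		{Name: "gatekeeper"},
		{Name: "message-worker", BackingUp: true, SaturatedDep: "Cassandra"},
		{Name: "broadcast-worker"},
	})
	fmt.Printf("BOTTLENECK: %s (%s-bound), confidence: %s\n", v.Component, v.Resource, v.Confidence)
}
```

Running each pass over all stages before moving to the next tier keeps a high-confidence match on a later stage from being shadowed by a medium-confidence match on an earlier one.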

Config: BOTTLENECK_ENABLED (default true), BOTTLENECK_PROM_URL, BOTTLENECK_KNEE_TOLERANCE (0.10), BOTTLENECK_QUERY_STEP (5s), BOTTLENECK_CONTAINER_MAP.
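
As a rough sketch of how these tunables and their defaults might be read, the snippet below uses only the standard library. The loadBottleneckConfig and active helpers are hypothetical, the real bottleneckConfig in main.go may be loaded differently, and BOTTLENECK_CONTAINER_MAP is kept as a raw string because its format is not described here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// bottleneckConfig holds the BOTTLENECK_* tunables (field names assumed).
type bottleneckConfig struct {
	Enabled       bool
	PromURL       string
	KneeTolerance float64
	QueryStep     time.Duration
	ContainerMap  string // raw value, parsed elsewhere
}

// loadBottleneckConfig applies the documented defaults, then any overrides
// found through lookup (os.LookupEnv in production, a map in tests).
func loadBottleneckConfig(lookup func(string) (string, bool)) (bottleneckConfig, error) {
	cfg := bottleneckConfig{Enabled: true, KneeTolerance: 0.10, QueryStep: 5 * time.Second}
	if v, ok := lookup("BOTTLENECK_ENABLED"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("BOTTLENECK_ENABLED: %w", err)
		}
		cfg.Enabled = b
	}
	if v, ok := lookup("BOTTLENECK_KNEE_TOLERANCE"); ok {
		f, err := strconv.ParseFloat(v, 64)
		if err != nil {
			return cfg, fmt.Errorf("BOTTLENECK_KNEE_TOLERANCE: %w", err)
		}
		cfg.KneeTolerance = f
	}
	if v, ok := lookup("BOTTLENECK_QUERY_STEP"); ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("BOTTLENECK_QUERY_STEP: %w", err)
		}
		cfg.QueryStep = d
	}
	cfg.PromURL, _ = lookup("BOTTLENECK_PROM_URL")
	cfg.ContainerMap, _ = lookup("BOTTLENECK_CONTAINER_MAP")
	return cfg, nil
}

// active reports whether attribution should run at all: it is off when
// disabled explicitly or when no Prometheus URL is configured.
func (c bottleneckConfig) active() bool {
	return c.Enabled && c.PromURL != ""
}

func main() {
	cfg, err := loadBottleneckConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("active=%v step=%s tolerance=%.2f\n", cfg.active(), cfg.QueryStep, cfg.KneeTolerance)
}
```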

Design + plan are included under docs/superpowers/.

Test Plan

  • make test SERVICE=tools/loadgen — full unit suite passes with -race
  • Every new file ≥ 80% coverage (engine functions 83–100%); error/empty/reset/undetermined paths covered via an injected fake promQuerier (no live Prometheus needed in unit tests)
  • make lint — clean (0 issues)
  • make sast-gosec — passes
  • make build SERVICE=tools/loadgen — builds
  • Manual: make run-max-rps PRESET=medium against the local stack and confirm a real BOTTLENECK: line on a trip (heuristics — knee tolerance, ~1-core floor — may want host-specific tuning)

Notes / known v1 heuristics

  • cpuSaturatedFloorCores = 1.0 and KneeTolerance = 0.10 are host-relative starting points (no CPU limits on the local stack); expect to tune.
  • HoldEnd includes the adapter's ~2s post-hold drain, slightly diluting the trip-window CPU rate — flagged in-code for a future tighter window.
  • messages-only for v1; history/members would each add their own stage-graph.
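
To make the two heuristics concrete, here is one possible reading of the knee check. It assumes the tolerance is the maximum relative CPU growth still counted as a plateau, which is an interpretation and not something this PR text confirms. The saturated signature below is hypothetical and takes pre-computed averages rather than querying Prometheus.

```go
package main

import "fmt"

const (
	cpuSaturatedFloorCores = 1.0  // absolute floor from the notes above
	kneeTolerance          = 0.10 // assumed: max relative CPU growth still counted as flat
)

// saturated reports whether a container's CPU plateaued between the last
// passing step and the tripping step while offered RPS rose.
func saturated(passCores, tripCores, passRPS, tripRPS float64) bool {
	if passCores <= 0 || tripRPS <= passRPS {
		return false // no baseline, or load did not rise
	}
	if tripCores < cpuSaturatedFloorCores {
		return false // low and flat: probably waiting on a dependency
	}
	growth := (tripCores - passCores) / passCores
	return growth <= kneeTolerance
}

func main() {
	// Load doubles from 1000 to 2000 rps in every case.
	fmt.Println("cassandra 3.8 -> 3.9 cores:", saturated(3.8, 3.9, 1000, 2000))
	fmt.Println("worker    0.2 -> 0.2 cores:", saturated(0.2, 0.2, 1000, 2000))
	fmt.Println("gateway   1.0 -> 1.9 cores:", saturated(1.0, 1.9, 1000, 2000))
}
```

Under these assumptions only the first container counts as saturated: the second is flat but below the floor, and the third is still scaling with load.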

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j


Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Added bottleneck attribution for max-rps --workload=messages tests; when a test trips, the system diagnoses which component caused the failure using CPU and latency metrics.
    • Prometheus and cAdvisor services now automatically start with make run-max-rps.
  • Documentation

    • Added bottleneck attribution feature documentation and new BOTTLENECK_* configuration options.

claude added 25 commits June 2, 2026 04:07
Pipeline-stage causality engine that fuses loadgen's per-stage signals
with cAdvisor container resource trends to name the bottleneck on a
max-rps(messages) breach. Scoped to messages + max-rps for v1.
Fake promclient via the consumer-defined interface covers the engine at
unit scope; no container dependency to exercise.
When the messages workload trips, diagnoseBottleneck runs the attribution
engine and appends the BOTTLENECK block to the rendered report and three
culprit columns to the CSV trip row. Nil verdict (disabled, no trip,
non-messages workload) leaves the report and CSV unchanged. Also fixes
pre-existing hugeParam gocritic violations and goimports formatting in
attribution.go, attribution_report.go, attribution_test.go, and main.go.

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
- Rename dependency key "mongo" -> "mongodb" in messagesStageGraph and
  dependencyDisplayName to match the actual compose service label used by
  cAdvisor (com.docker.compose.service=mongodb), so MongoDB attribution
  no longer silently fails.
- Apply dependencyDisplayName in fallbackRanking so low-confidence verdicts
  display "Cassandra" / "MongoDB" rather than raw keys, consistent with Pass 2.
- Expand the no-baseline comment in saturated to note the over-blame risk.

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5144d648-40d2-49bf-bbbc-c9624f637c13

📥 Commits

Reviewing files that changed from the base of the PR and between 31ed4ae and c9b69ef.

📒 Files selected for processing (3)
  • .gitignore
  • docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md
  • docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md
📝 Walkthrough

Walkthrough

This PR implements bottleneck attribution for the max-rps --workload=messages load generator. When a ramp step trips an SLO, the feature diagnoses which component/resource is saturated by fusing loadgen per-stage signals (latency breaches and backlog deltas) with cAdvisor container CPU metrics from Prometheus, then appends a BOTTLENECK: verdict to the report.

Changes

Bottleneck Attribution for Messages Workload

Layer: Design & Implementation Plan
Files: docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md, docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md
Summary: High-level architecture and task-by-task breakdown of the bottleneck attribution feature, including algorithm overview, error handling, output format, and component requirements.

Layer: User-Facing Documentation
Files: tools/loadgen/README.md
Summary: Describes the BOTTLENECK: block format, attribution signal fusion, environment configuration (BOTTLENECK_* vars), and fallback behavior when telemetry is unavailable.

Layer: Measurement Window & Step Results
Files: tools/loadgen/verdict.go, tools/loadgen/ramp.go, tools/loadgen/*_test.go
Summary: Extend rpsStepResult with HoldStart/HoldEnd (wall-clock measurement windows) and Pending (per-durable backlog deltas); capture timing in runRamp around each step execution.

Layer: Prometheus Range Query Client
Files: tools/loadgen/promclient.go, tools/loadgen/promclient_test.go
Summary: HTTP client for Prometheus /api/v1/query_range, parsing matrix results with timestamp/value pairs and error handling for non-success responses.

Layer: Pipeline Model & Stage Graph
Files: tools/loadgen/stagegraph.go, tools/loadgen/stagegraph_test.go
Summary: Declarative stage struct modeling the messages pipeline (gatekeeper, message worker, broadcast worker) with container labels, durability, latency series, and dependency mappings.

Layer: Container Identity Resolution
Files: tools/loadgen/identity.go, tools/loadgen/identity_test.go
Summary: identityResolver builds PromQL selectors using compose-service labels or optional short-ID fallback; parseContainerMap parses BOTTLENECK_CONTAINER_MAP environment variable.

Layer: Attribution Engine
Files: tools/loadgen/attribution.go, tools/loadgen/attribution_test.go
Summary: Core diagnosis logic with precedence-based causality walk: CPU saturation detection (plateau analysis), backing-up detection (backlog/latency SLO), dependency reasoning, and CPU-ranking fallback, tested across 11 scenarios.

Layer: Bottleneck Output Formatting
Files: tools/loadgen/attribution_report.go, tools/loadgen/attribution_report_test.go
Summary: renderBottleneck formats determined/undetermined verdicts; bottleneckCSVColumns provides CSV row values for integration into report.

Layer: Configuration & maxrps Integration
Files: tools/loadgen/main.go, tools/loadgen/maxrps.go, tools/loadgen/maxrps_test.go
Summary: bottleneckConfig struct with BOTTLENECK_-prefixed env vars; diagnoseBottleneck helper with guards (messages-only, enabled, Prometheus URL, trip exists); integration into runMaxRPS.

Layer: RPS Report Rendering
Files: tools/loadgen/maxrps_report.go, tools/loadgen/maxrps_report_test.go
Summary: Refactor renderRPSReport to delegate to renderRPSReportWithBottleneck; extend CSV header/rows with bottleneck columns; conditionally append BOTTLENECK: block to report output.

Layer: Deployment Infrastructure
Files: tools/loadgen/deploy/docker-compose.yml, tools/loadgen/deploy/Makefile, tools/loadgen/deploy/prometheus/prometheus.yml, .gitignore
Summary: Add cAdvisor service (privileged, host mount, health check), Prometheus scrape job for cAdvisor, service dependency ordering, BOTTLENECK_PROM_URL env var, and Makefile target to start services before test.
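
For the container identity layer summarized above, a selector builder could look roughly like the following. Several details are assumptions made for illustration: the "service=shortID" comma-separated format for BOTTLENECK_CONTAINER_MAP, the use of cAdvisor's id label for the short-ID match, and the choice to let a mapped ID take precedence over the compose-service label. The selector and cpuQuery helpers are hypothetical, and this parseContainerMap is a stand-in rather than the one in identity.go.

```go
package main

import (
	"fmt"
	"strings"
)

// parseContainerMap parses "service=shortID,service=shortID".
// The format is assumed for this sketch.
func parseContainerMap(raw string) (map[string]string, error) {
	out := map[string]string{}
	if strings.TrimSpace(raw) == "" {
		return out, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		k, v = strings.TrimSpace(k), strings.TrimSpace(v)
		if !ok || k == "" || v == "" {
			return nil, fmt.Errorf("bad container map entry %q", pair)
		}
		out[k] = v
	}
	return out, nil
}

// selector returns a PromQL label matcher for a service: the mapped short ID
// if one exists, otherwise the compose-service label exported by cAdvisor.
func selector(service string, overrides map[string]string) string {
	if id, ok := overrides[service]; ok {
		return fmt.Sprintf(`id=~".*%s.*"`, id)
	}
	return fmt.Sprintf(`container_label_com_docker_compose_service=%q`, service)
}

// cpuQuery wraps a selector in a CPU-cores rate expression.
func cpuQuery(sel, window string) string {
	return fmt.Sprintf(`sum(rate(container_cpu_usage_seconds_total{%s}[%s]))`, sel, window)
}

func main() {
	overrides, err := parseContainerMap("cassandra=ab12cd34")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cpuQuery(selector("mongodb", overrides), "30s"))
	fmt.Println(cpuQuery(selector("cassandra", overrides), "30s"))
}
```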

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hmchangw/chat#240: Main PR extends the existing tools/loadgen max-rps messages SLO-finder implementation with bottleneck attribution wired into maxrps.go/maxrps_report.go and step result enrichment.

Suggested labels

ready

Suggested reviewers

  • mliu33
  • ngangwar962

Poem

🐰 A ramp that trips now tells a tale,
Which CPU or queue did fail?
From Prometheus and backlog signs,
The bottleneck is drawn in lines!
No more unknowns—just facts so clear, 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.81%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main feature being added: bottleneck attribution for the max-rps command when running the messages workload.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tools/loadgen/maxrps_report.go (1)

96-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

BOTTLENECK block is dropped when no step passes.

When the first ramp step trips, lastPassRPS returns 0 and the function returns nil before reaching the bn != nil block. diagnoseBottleneck returns a verdict for any trip (first-step trips included), so the diagnosed BOTTLENECK is silently suppressed in precisely the case where it's most useful. The current tests don't cover this (they all have a passing step first).

🐛 Render the bottleneck block on both the pass and no-pass paths
 	fmt.Fprintln(w)
 	pass := lastPassRPS(results)
 	if pass == 0 {
 		fmt.Fprintf(w, "ANSWER: no step passed (workload=%s, preset=%s)\n", workload, preset)
-		return nil
-	}
-	fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset)
-	if trip := firstTrip(results); trip != nil {
-		fmt.Fprintf(w, "        Next limit: %s\n", strings.Join(trip.Reasons, "; "))
-	}
+	} else {
+		fmt.Fprintf(w, "ANSWER: max RPS = %d (workload=%s, preset=%s)\n", pass, workload, preset)
+		if trip := firstTrip(results); trip != nil {
+			fmt.Fprintf(w, "        Next limit: %s\n", strings.Join(trip.Reasons, "; "))
+		}
+	}
 	if bn != nil {
 		renderBottleneck(w, bn)
 	}
 	return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_report.go` around lines 96 - 109, The BOTTLENECK output
is skipped when lastPassRPS(results) == 0 because the function returns before
checking bn; update the control flow so renderBottleneck(w, bn) is executed
regardless of whether a passing step exists. Concretely, keep the existing “no
step passed” ANSWER path but do not return immediately: after printing the "no
step passed" message (using lastPassRPS and firstTrip), still call
renderBottleneck(w, bn) when bn != nil (and then return), and ensure the
existing passing-path behavior (printing max RPS, Next limit, then
renderBottleneck) remains unchanged; locate changes around lastPassRPS,
firstTrip, bn and renderBottleneck to implement this.
tools/loadgen/deploy/Makefile (1)

78-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for Prometheus readiness before launching max-rps.

Line 80 only starts the metrics containers; Lines 81-85 immediately begin the ramp. With the 5s scrape interval in tools/loadgen/deploy/prometheus/prometheus.yml, the first trip can happen before Prometheus has even scraped cAdvisor once, so the new attribution path frequently degrades to undetermined on the exact make run-max-rps workflow the README advertises. Please gate the exec on Prometheus readiness (for example via a healthcheck + up --wait, or an explicit /-/ready poll) before starting the run.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/deploy/Makefile` around lines 78 - 85, The run-max-rps target
currently brings up cadvisor and prometheus then immediately execs into the
loadgen container to run /loadgen max-rps; modify run-max-rps so it waits for
Prometheus to be ready before running the exec: after "$(COMPOSE) --profile
dashboards up -d cadvisor prometheus" add a readiness gate that polls
Prometheus' /-/ready (or uses compose healthcheck/up --wait) and only proceeds
to "$(COMPOSE) exec -T loadgen /loadgen max-rps ..." once prometheus returns
healthy; reference the run-max-rps target, $(COMPOSE) invocation, and the
prometheus service name when implementing the check.
🧹 Nitpick comments (3)
tools/loadgen/maxrps_report_test.go (1)

118-130: ⚡ Quick win

Add coverage for a first-step trip with a bottleneck.

This test always has a passing step before the trip, so it doesn't catch the early-return gap in renderRPSReportWithBottleneck (no step passed → BOTTLENECK suppressed). Once that path is fixed, a case with only a verdictTrip step plus a non-nil bn would guard against regression.

💚 Suggested additional test case
func TestRenderRPSReport_AppendsBottleneck_NoPass(t *testing.T) {
	results := []rpsStepResult{
		{TargetRPS: 500, Kind: verdictTrip, Reasons: []string{"E2 p95=400ms > 100ms"}},
	}
	bn := bottleneckVerdict{Component: "message-worker", Resource: "Cassandra", Confidence: "high", Determined: true}
	var sb strings.Builder
	require.NoError(t, renderRPSReportWithBottleneck(&sb, results, "messages", "medium", &bn))
	out := sb.String()
	assert.Contains(t, out, "ANSWER: no step passed")
	assert.Contains(t, out, "BOTTLENECK: message-worker (Cassandra-bound)")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/maxrps_report_test.go` around lines 118 - 130,
renderRPSReportWithBottleneck currently early-returns/suppresses the BOTTLENECK
when no step passed; update it so that if a non-nil bottleneckVerdict (bn) is
provided it still appends the bottleneck info even when all rpsStepResult
entries are verdictTrip (i.e., do not gate emitting the "BOTTLENECK" block on
finding a passing step). Change the control flow in
renderRPSReportWithBottleneck to compute the answer text (e.g., "no step passed"
vs "max RPS = X") but always check bn != nil and append its formatted line
(using bn.Component and bn.Resource such as "Component (Resource-bound)"), and
add the suggested test TestRenderRPSReport_AppendsBottleneck_NoPass to cover the
no-pass + bn case.
tools/loadgen/promclient.go (1)

63-73: ⚡ Quick win

Surface Prometheus HTTP status on non-2xx responses

In tools/loadgen/promclient.go RangeQuery, Resty typically returns a nil err for HTTP 4xx/5xx, so non-2xx responses currently surface only as a JSON "decode prometheus response" error (or as "prometheus query failed"). Checking resp.IsError() and returning resp.StatusCode() makes failures much more diagnosable.

♻️ Proposed tweak
 	if err != nil {
 		return nil, fmt.Errorf("query prometheus: %w", err)
 	}
+	if resp.IsError() {
+		return nil, fmt.Errorf("query prometheus: status %d: %s", resp.StatusCode(), resp.String())
+	}
 
 	var parsed rangeQueryResponse
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/promclient.go` around lines 63 - 73, In RangeQuery, detect
non-2xx HTTP responses from the Resty response before attempting to unmarshal:
check resp.IsError() (or inspect resp.StatusCode()) after the request and return
an error that includes the HTTP status code and resp.Status() or resp.Body() to
make failures diagnosable; keep the subsequent json.Unmarshal into
rangeQueryResponse and the existing parsed.Status check but only run them when
resp.IsError() is false.
tools/loadgen/attribution.go (1)

105-186: ⚡ Quick win

Bound total bottleneck diagnosis time (not just the per-RangeQuery timeout).
RangeQuery uses a 10s HTTP timeout, but Diagnose runs many sequential PromQL queries (via saturated/cpuCores, plus extra work in fallbackRanking). If Prometheus hangs rather than fails fast, the end-of-run BOTTLENECK report can still stall for many tens of seconds; wrap the ctx used for eng.Diagnose(ctx, ...) (e.g., context.WithTimeout around diagnoseBottleneck/the call site) to cap total diagnosis duration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/attribution.go` around lines 105 - 186, The Diagnose path can
run many sequential PromQL calls (saturated, cpuCores, fallbackRanking) and
needs a global timeout; wrap the context passed into eng.Diagnose (or into
diagnoseBottleneck) with context.WithTimeout so the entire Diagnose run is
bounded (e.g., create a child ctx with a sensible total timeout, defer cancel(),
and pass that ctx into eng.Diagnose/diagnoseBottleneck), ensuring all internal
calls (saturated, cpuCores, fallbackRanking) inherit the deadline and the
bottleneck report cannot hang indefinitely.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md`:
- Around line 1439-1446: The markdown fenced block containing the
ANSWER/BOTTLENECK example lacks a language tag and triggers MD040; update that
fenced block (the triple-backtick block that starts with "ANSWER: max RPS =
2000..." and includes "BOTTLENECK: message-worker (Cassandra-bound)") to include
a language token such as text (i.e., change ``` to ```text) so the block is
recognized as code/text and the markdownlint warning is resolved.

In `@docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md`:
- Around line 52-59: The fenced example block containing the lines starting with
"ANSWER: max RPS = 2000..." is missing a language tag; update the opening fence
of that code block to include a language identifier (e.g., change ``` to ```text
or ```console) and ensure the closing fence remains ``` so markdownlint MD040 is
satisfied—locate the fenced block by the unique content "ANSWER: max RPS = 2000
(workload=messages, preset=medium)" and add the language tag to its opening
fence.

In `@tools/loadgen/README.md`:
- Around line 290-297: The fenced code block containing the example starting
with "ANSWER: max RPS = 2000 (workload=messages, preset=medium) ..." is missing
a language tag; update the opening fence from ``` to ```text so the block is
labeled (this resolves markdownlint MD040) and leave the block contents
unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f072573-1e5e-4807-ac60-91df745e99fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9c5d14a and 31ed4ae.

📒 Files selected for processing (27)
  • .gitignore
  • docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md
  • docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md
  • tools/loadgen/README.md
  • tools/loadgen/attribution.go
  • tools/loadgen/attribution_report.go
  • tools/loadgen/attribution_report_test.go
  • tools/loadgen/attribution_test.go
  • tools/loadgen/config_bottleneck_test.go
  • tools/loadgen/deploy/Makefile
  • tools/loadgen/deploy/docker-compose.yml
  • tools/loadgen/deploy/prometheus/prometheus.yml
  • tools/loadgen/identity.go
  • tools/loadgen/identity_test.go
  • tools/loadgen/main.go
  • tools/loadgen/maxrps.go
  • tools/loadgen/maxrps_report.go
  • tools/loadgen/maxrps_report_test.go
  • tools/loadgen/maxrps_test.go
  • tools/loadgen/promclient.go
  • tools/loadgen/promclient_test.go
  • tools/loadgen/ramp.go
  • tools/loadgen/ramp_test.go
  • tools/loadgen/stagegraph.go
  • tools/loadgen/stagegraph_test.go
  • tools/loadgen/verdict.go
  • tools/loadgen/verdict_test.go

Comment thread docs/superpowers/plans/2026-06-02-loadgen-bottleneck-attribution.md Outdated
Comment thread docs/superpowers/specs/2026-06-02-loadgen-bottleneck-attribution-design.md Outdated
Comment thread tools/loadgen/README.md Outdated
claude added 3 commits June 2, 2026 10:09
…anking

saturated() already computes trip-window cores; return it so fallbackRanking ranks on that value instead of re-issuing an identical Prometheus query per saturated service.

https://claude.ai/code/session_01GsqveU92hRQdC1nF8eEA6j
…jan-gA5rC

# Conflicts:
#	tools/loadgen/README.md
#	tools/loadgen/main.go
@hmchangw hmchangw force-pushed the claude/magical-ramanujan-gA5rC branch from 84213da to c9b69ef Compare June 4, 2026 02:34
Collaborator

@mliu33 mliu33 left a comment


Thanks!

@mliu33 mliu33 merged commit 04bb191 into main Jun 4, 2026
6 checks passed
3 participants