loadgen: batched pacer to remove the RPS ramp ceiling + actionable shortfall diagnosis#270
Conversation
|
Warning Review limit reached
More reviews will be available in 34 minutes and 44 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
📝 WalkthroughWalkthroughThis PR adds rate-based pacing to the loadgen tool, replacing single-ticker emission with a pacer that computes per-tick event counts. Generator and HistoryGenerator are split into serial (MaxInFlight ≤ 0) and paced (MaxInFlight > 0) execution paths. Underrun (when pacer target exceeds load capacity) and saturation (when worker pool is full) are tracked as load-box signals, propagated through verdict evaluation, and exported in CSV output. ChangesLoad generator pacing and underrun metrics
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tools/loadgen/maxrps_report_test.go (1)
58-59: ⚡ Quick winAssert exact CSV header order, not just presence.
These checks only confirm columns exist. Since column order is part of the report contract, assert the full header string (or parsed slice equality) to prevent silent ordering regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/maxrps_report_test.go` around lines 58 - 59, Replace the loose presence checks on the CSV header (the two assert.Contains calls that inspect lines[0]) with an exact-order assertion: either compare lines[0] to the expected full header string or parse lines[0] into a slice and assert slice equality against the exact expected column order; update the assertions that reference lines and the assert package accordingly so the test enforces header order rather than just presence.tools/loadgen/verdict_test.go (1)
194-226: ⚡ Quick winAdd a tie-case shortfall test for equal underrun/saturation counts.
The new tests cover both dominant paths, but not the boundary where
EmitUnderrun == Saturation > 0. Adding this case will lock in expected attribution behavior and prevent regressions in the main diagnosis path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/verdict_test.go` around lines 194 - 226, Add a test that covers the tie case where EmitUnderrun == Saturation > 0: create an rpsStepInputs with TargetRPS/AttemptedOps similar to existing tests, set EmitUnderrun and Saturation to the same non-zero value (e.g., 250), and call evaluateRPSStep; assert the returned verdict.Kind is verdictInconclusive and assert the diagnosis consistently attributes the shortfall to the chosen path (for example assert Reasons[0] contains "underrun" and "shard" and that got.EmitUnderrun equals the tie value) so the behavior is locked in and will fail if attribution changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/loadgen/generator.go`:
- Around line 91-94: The current logic treats any MaxInFlight <= 0 as serial;
instead reject negative values and reserve serial only for exactly 0: in
Generator.Run check g.cfg.MaxInFlight and if it's negative return an error
(don’t fall back to runSerial), if it equals 0 call runSerial(ctx), otherwise
call runPaced(ctx); apply the same change to HistoryGenerator.Run which mirrors
this logic so both return a clear error for negative MaxInFlight rather than
silently switching modes.
In `@tools/loadgen/history_generator.go`:
- Around line 229-232: The current branch treats any non-positive
g.cfg.MaxInFlight as serial, which hides invalid negative settings; change the
logic to explicitly validate g.cfg.MaxInFlight: if g.cfg.MaxInFlight < 0 return
an error (fail fast) describing the invalid concurrency value, if == 0 call
g.runSerial(ctx) and otherwise call g.runPaced(ctx); update the calling site or
signature to propagate the error like Generator.Run does so invalid negatives
are rejected rather than silently using runSerial.
In `@tools/loadgen/verdict.go`:
- Around line 175-183: The switch that returns fmt.Sprintf(...) misattributes
when in.EmitUnderrun == in.Saturation and both are non-zero, because it falls
through to the saturation branch; add a dedicated equality case before the
existing comparisons that checks in.EmitUnderrun == in.Saturation &&
in.Saturation > 0 and returns a neutral/tie message (using base) without
suggesting a single dominant remediation, then keep the current branches for
in.EmitUnderrun > in.Saturation and in.Saturation > 0 as-is.
---
Nitpick comments:
In `@tools/loadgen/maxrps_report_test.go`:
- Around line 58-59: Replace the loose presence checks on the CSV header (the
two assert.Contains calls that inspect lines[0]) with an exact-order assertion:
either compare lines[0] to the expected full header string or parse lines[0]
into a slice and assert slice equality against the exact expected column order;
update the assertions that reference lines and the assert package accordingly so
the test enforces header order rather than just presence.
In `@tools/loadgen/verdict_test.go`:
- Around line 194-226: Add a test that covers the tie case where EmitUnderrun ==
Saturation > 0: create an rpsStepInputs with TargetRPS/AttemptedOps similar to
existing tests, set EmitUnderrun and Saturation to the same non-zero value
(e.g., 250), and call evaluateRPSStep; assert the returned verdict.Kind is
verdictInconclusive and assert the diagnosis consistently attributes the
shortfall to the chosen path (for example assert Reasons[0] contains "underrun"
and "shard" and that got.EmitUnderrun equals the tie value) so the behavior is
locked in and will fail if attribution changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 39d1a90d-cba4-4485-b268-b47adf879d57
📒 Files selected for processing (15)
tools/loadgen/README.mdtools/loadgen/generator.gotools/loadgen/generator_test.gotools/loadgen/history_collector.gotools/loadgen/history_generator.gotools/loadgen/maxrps_history.gotools/loadgen/maxrps_history_test.gotools/loadgen/maxrps_messages.gotools/loadgen/maxrps_messages_test.gotools/loadgen/maxrps_report.gotools/loadgen/maxrps_report_test.gotools/loadgen/pacer.gotools/loadgen/pacer_test.gotools/loadgen/verdict.gotools/loadgen/verdict_test.go
| if g.cfg.MaxInFlight <= 0 { | ||
| return g.runSerial(ctx) | ||
| } | ||
| return g.runPaced(ctx) |
There was a problem hiding this comment.
Reject negative MaxInFlight instead of silently switching to serial mode.
MaxInFlight is the concurrency knob for paced runs, but <= 0 currently routes invalid negative values into the legacy bisection path. That will make a misconfigured run look artificially slow and skew the new underrun/saturation diagnostics instead of failing fast. Reserve serial mode for exactly 0 and reject negatives here (and in HistoryGenerator.Run, which mirrors this logic).
Suggested fix
func (g *Generator) Run(ctx context.Context) error {
if g.cfg.Rate <= 0 {
return fmt.Errorf("rate must be > 0")
}
- if g.cfg.MaxInFlight <= 0 {
+ if g.cfg.MaxInFlight < 0 {
+ return fmt.Errorf("max in flight must be >= 0")
+ }
+ if g.cfg.MaxInFlight == 0 {
return g.runSerial(ctx)
}
return g.runPaced(ctx)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if g.cfg.MaxInFlight <= 0 { | |
| return g.runSerial(ctx) | |
| } | |
| return g.runPaced(ctx) | |
| if g.cfg.MaxInFlight < 0 { | |
| return fmt.Errorf("max in flight must be >= 0") | |
| } | |
| if g.cfg.MaxInFlight == 0 { | |
| return g.runSerial(ctx) | |
| } | |
| return g.runPaced(ctx) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/generator.go` around lines 91 - 94, The current logic treats
any MaxInFlight <= 0 as serial; instead reject negative values and reserve
serial only for exactly 0: in Generator.Run check g.cfg.MaxInFlight and if it's
negative return an error (don’t fall back to runSerial), if it equals 0 call
runSerial(ctx), otherwise call runPaced(ctx); apply the same change to
HistoryGenerator.Run which mirrors this logic so both return a clear error for
negative MaxInFlight rather than silently switching modes.
| if g.cfg.MaxInFlight <= 0 { | ||
| return g.runSerial(ctx) | ||
| } | ||
| return g.runPaced(ctx) |
There was a problem hiding this comment.
Fail fast on negative MaxInFlight here as well.
This path has the same problem as Generator.Run: <= 0 quietly turns an invalid negative concurrency setting into the legacy serial mode. For a benchmarking tool that changes run semantics materially, that should be a validation error, not an implicit fallback.
Suggested fix
func (g *HistoryGenerator) Run(ctx context.Context) error {
if g.cfg.Rate <= 0 {
return fmt.Errorf("rate must be > 0")
}
- if g.cfg.MaxInFlight <= 0 {
+ if g.cfg.MaxInFlight < 0 {
+ return fmt.Errorf("max in flight must be >= 0")
+ }
+ if g.cfg.MaxInFlight == 0 {
return g.runSerial(ctx)
}
return g.runPaced(ctx)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if g.cfg.MaxInFlight <= 0 { | |
| return g.runSerial(ctx) | |
| } | |
| return g.runPaced(ctx) | |
| func (g *HistoryGenerator) Run(ctx context.Context) error { | |
| if g.cfg.Rate <= 0 { | |
| return fmt.Errorf("rate must be > 0") | |
| } | |
| if g.cfg.MaxInFlight < 0 { | |
| return fmt.Errorf("max in flight must be >= 0") | |
| } | |
| if g.cfg.MaxInFlight == 0 { | |
| return g.runSerial(ctx) | |
| } | |
| return g.runPaced(ctx) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/history_generator.go` around lines 229 - 232, The current
branch treats any non-positive g.cfg.MaxInFlight as serial, which hides invalid
negative settings; change the logic to explicitly validate g.cfg.MaxInFlight: if
g.cfg.MaxInFlight < 0 return an error (fail fast) describing the invalid
concurrency value, if == 0 call g.runSerial(ctx) and otherwise call
g.runPaced(ctx); update the calling site or signature to propagate the error
like Generator.Run does so invalid negatives are rejected rather than silently
using runSerial.
| switch { | ||
| case in.EmitUnderrun > in.Saturation: | ||
| return fmt.Sprintf("%s — load box could not emit on schedule (emit underrun=%d, saturation=%d); "+ | ||
| "reduce per-box rate, add load shards, or give the load box more CPU", | ||
| base, in.EmitUnderrun, in.Saturation) | ||
| case in.Saturation > 0: | ||
| return fmt.Sprintf("%s — in-flight pool saturated (saturation=%d, emit underrun=%d); "+ | ||
| "raise MaxInFlight (and/or reduce backend latency)", | ||
| base, in.Saturation, in.EmitUnderrun) |
There was a problem hiding this comment.
Handle tie case explicitly to avoid misleading “dominant” attribution.
When EmitUnderrun == Saturation and both are non-zero, the current switch falls into the saturation branch, even though neither signal dominates. This can misdirect remediation guidance in INCONCLUSIVE diagnostics.
Suggested fix
switch {
case in.EmitUnderrun > in.Saturation:
return fmt.Sprintf("%s — load box could not emit on schedule (emit underrun=%d, saturation=%d); "+
"reduce per-box rate, add load shards, or give the load box more CPU",
base, in.EmitUnderrun, in.Saturation)
+case in.EmitUnderrun > 0 && in.EmitUnderrun == in.Saturation:
+ return fmt.Sprintf("%s — load box limited by both emit scheduling and in-flight capacity equally "+
+ "(emit underrun=%d, saturation=%d); reduce per-box rate/add shards/CPU and raise MaxInFlight as needed",
+ base, in.EmitUnderrun, in.Saturation)
case in.Saturation > 0:
return fmt.Sprintf("%s — in-flight pool saturated (saturation=%d, emit underrun=%d); "+
"raise MaxInFlight (and/or reduce backend latency)",
base, in.Saturation, in.EmitUnderrun)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| switch { | |
| case in.EmitUnderrun > in.Saturation: | |
| return fmt.Sprintf("%s — load box could not emit on schedule (emit underrun=%d, saturation=%d); "+ | |
| "reduce per-box rate, add load shards, or give the load box more CPU", | |
| base, in.EmitUnderrun, in.Saturation) | |
| case in.Saturation > 0: | |
| return fmt.Sprintf("%s — in-flight pool saturated (saturation=%d, emit underrun=%d); "+ | |
| "raise MaxInFlight (and/or reduce backend latency)", | |
| base, in.Saturation, in.EmitUnderrun) | |
| switch { | |
| case in.EmitUnderrun > in.Saturation: | |
| return fmt.Sprintf("%s — load box could not emit on schedule (emit underrun=%d, saturation=%d); "+ | |
| "reduce per-box rate, add load shards, or give the load box more CPU", | |
| base, in.EmitUnderrun, in.Saturation) | |
| case in.EmitUnderrun > 0 && in.EmitUnderrun == in.Saturation: | |
| return fmt.Sprintf("%s — load box limited by both emit scheduling and in-flight capacity equally "+ | |
| "(emit underrun=%d, saturation=%d); reduce per-box rate/add shards/CPU and raise MaxInFlight as needed", | |
| base, in.EmitUnderrun, in.Saturation) | |
| case in.Saturation > 0: | |
| return fmt.Sprintf("%s — in-flight pool saturated (saturation=%d, emit underrun=%d); "+ | |
| "raise MaxInFlight (and/or reduce backend latency)", | |
| base, in.Saturation, in.EmitUnderrun) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/loadgen/verdict.go` around lines 175 - 183, The switch that returns
fmt.Sprintf(...) misattributes when in.EmitUnderrun == in.Saturation and both
are non-zero, because it falls through to the saturation branch; add a dedicated
equality case before the existing comparisons that checks in.EmitUnderrun ==
in.Saturation && in.Saturation > 0 and returns a neutral/tie message (using
base) without suggesting a single dominant remediation, then keep the current
branches for in.EmitUnderrun > in.Saturation and in.Saturation > 0 as-is.
3dd88cd to
17dac7c
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tools/loadgen/main_test.go (1)
20-37: ⚡ Quick winConsider adding edge case tests for nil and empty inputs.
The test validates the happy path well, but the coding guidelines require edge case coverage. Adding test cases for
hardPublishErrorCount(nil)andhardPublishErrorCount([]*dto.MetricFamily{})would document the function's safe handling of these inputs.📋 Proposed edge case tests
+func TestHardPublishErrorCount_NilInput(t *testing.T) { + assert.Equal(t, 0, hardPublishErrorCount(nil)) +} + +func TestHardPublishErrorCount_EmptyMetrics(t *testing.T) { + assert.Equal(t, 0, hardPublishErrorCount([]*dto.MetricFamily{})) +}As per coding guidelines: "Test coverage must include: happy path, error paths, edge cases (empty collections, boundary conditions), and invalid input."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/main_test.go` around lines 20 - 37, Add edge-case assertions to TestHardPublishErrorCount_ExcludesSelfLimitAndGatekeeper: call hardPublishErrorCount(nil) and hardPublishErrorCount([]*dto.MetricFamily{}) and assert both return 0; place these two checks alongside the existing happy-path assertion in the same test (or as subtests) so the function hardPublishErrorCount is validated for nil and empty-slice inputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tools/loadgen/main_test.go`:
- Around line 20-37: Add edge-case assertions to
TestHardPublishErrorCount_ExcludesSelfLimitAndGatekeeper: call
hardPublishErrorCount(nil) and hardPublishErrorCount([]*dto.MetricFamily{}) and
assert both return 0; place these two checks alongside the existing happy-path
assertion in the same test (or as subtests) so the function
hardPublishErrorCount is validated for nil and empty-slice inputs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 535ca8c7-e77c-41da-a079-bb0c2e52eb22
📒 Files selected for processing (17)
tools/loadgen/README.mdtools/loadgen/generator.gotools/loadgen/generator_test.gotools/loadgen/history_collector.gotools/loadgen/history_generator.gotools/loadgen/main.gotools/loadgen/main_test.gotools/loadgen/maxrps_history.gotools/loadgen/maxrps_history_test.gotools/loadgen/maxrps_messages.gotools/loadgen/maxrps_messages_test.gotools/loadgen/maxrps_report.gotools/loadgen/maxrps_report_test.gotools/loadgen/pacer.gotools/loadgen/pacer_test.gotools/loadgen/verdict.gotools/loadgen/verdict_test.go
✅ Files skipped from review due to trivial changes (2)
- tools/loadgen/README.md
- tools/loadgen/maxrps_report.go
🚧 Files skipped from review as they are similar to previous changes (10)
- tools/loadgen/pacer.go
- tools/loadgen/history_collector.go
- tools/loadgen/maxrps_messages_test.go
- tools/loadgen/maxrps_history_test.go
- tools/loadgen/maxrps_history.go
- tools/loadgen/maxrps_messages.go
- tools/loadgen/verdict.go
- tools/loadgen/verdict_test.go
- tools/loadgen/generator_test.go
- tools/loadgen/pacer_test.go
…ortfall diagnosis
The open-loop generators drove load with a single time.Ticker at 1s/rate.
At high rates that interval goes sub-millisecond, which the Go runtime can't
honor — the ticker silently coalesces ticks (buffer-of-1) and one event is
released per delivered tick. Achieved RPS plateaued at a few thousand
regardless of --steps, so most max-rps ramps reported INCONCLUSIVE
("load box limited") well below the real service capacity.
Fix A — batched pacer (pacer.go): tick on a coarse, reliably-schedulable
interval (clamped to a floor) and release rate*interval events per tick,
driven by wall-clock elapsed so the long-run rate stays exact. Wired into
both Generator.runPaced and HistoryGenerator.runPaced; MaxInFlight=0 keeps
the legacy serial path for bisection.
Fix E — make INCONCLUSIVE actionable: the pacer reports events it could not
release on schedule as "emit underrun", recorded distinctly from pool
"saturation". evaluateRPSStep now names the dominant load-box limit in the
shortfall reason — emit underrun (add CPU/shards, lower per-box rate) vs
saturation (raise MaxInFlight) — and both surface as CSV columns.
TDD: pacer math, verdict shortfall attribution, collector/build-inputs
plumbing, and a relative serial-vs-paced throughput regression (race-safe).
https://claude.ai/code/session_01JTVrS6qxLaYjA1S837L8Dp
…ned-run summary The `loadgen run` summary computed PublishErrors as "sum of all loadgen_publish_errors_total series minus gatekeeper". That bucket includes the load-box self-limit reasons, which are pacing diagnostics rather than publish failures: "saturated" (pre-existing) and "underrun" (added with the batched pacer). A plain sustained run therefore over-reported publish errors by however much the box saturated the in-flight pool or fell behind schedule. Replace the fragile sum-all-minus-gatekeeper pattern with an explicit enumeration of the real error reasons (publish, marshal, bad_reply), mirroring buildMessagesInputs. This fixes the new underrun miscount and the pre-existing saturated miscount, and won't silently break when future non-error reasons are added. https://claude.ai/code/session_01JTVrS6qxLaYjA1S837L8Dp
…allReason args Cleanup pass — no behavior change. - Extract the batched-pacer worker-pool loop and the legacy serial-ticker loop into shared serialDispatch/pacedDispatch helpers in pacer.go. Both Generator and HistoryGenerator now delegate via thin adapters, removing ~40 lines of byte-identical loop duplication that differed only in the dispatched method and the underrun sink (Prometheus label vs HistoryCollector tally). - shortfallReason no longer takes *rpsStepInputs: every field it reads (TargetRPS, AchievedRPS, Saturation, EmitUnderrun) is already on the *rpsStepResult it was passed, so the redundant input arg is dropped. https://claude.ai/code/session_01JTVrS6qxLaYjA1S837L8Dp
17dac7c to
3ff7f44
Compare
The open-loop generators drove load with a single time.Ticker at 1s/rate.
At high rates that interval goes sub-millisecond, which the Go runtime can't
honor — the ticker silently coalesces ticks (buffer-of-1) and one event is
released per delivered tick. Achieved RPS plateaued at a few thousand
regardless of --steps, so most max-rps ramps reported INCONCLUSIVE
("load box limited") well below the real service capacity.
Fix A — batched pacer (pacer.go): tick on a coarse, reliably-schedulable
interval (clamped to a floor) and release rate*interval events per tick,
driven by wall-clock elapsed so the long-run rate stays exact. Wired into
both Generator.runPaced and HistoryGenerator.runPaced; MaxInFlight=0 keeps
the legacy serial path for bisection.
Fix E — make INCONCLUSIVE actionable: the pacer reports events it could not
release on schedule as "emit underrun", recorded distinctly from pool
"saturation". evaluateRPSStep now names the dominant load-box limit in the
shortfall reason — emit underrun (add CPU/shards, lower per-box rate) vs
saturation (raise MaxInFlight) — and both surface as CSV columns.
TDD: pacer math, verdict shortfall attribution, collector/build-inputs
plumbing, and a relative serial-vs-paced throughput regression (race-safe).
https://claude.ai/code/session_01JTVrS6qxLaYjA1S837L8Dp
Summary by CodeRabbit
New Features
Documentation
Tests