PMM-14671 Randomize QAN delivery timing within collection interval by JiriCtvrtka · Pull Request #5448 · percona/pmm

JiriCtvrtka · 2026-06-02T10:53:46Z

Ticket number: PMM-14671

Adjusted scope based on review feedback: metric collection timing remains unchanged, and only QANCollectRequest delivery is randomized within the interval. This preserves existing QAN bucket timestamps/periods while distributing request delivery to reduce server-side request bursts.

If this PR adds, removes or alters one or more API endpoints, please review and update the relevant API documentation as well:

API Docs updated

If this PR is related to other PRs, contributions, or ongoing work in this or other repositories, please reference them here:

Links to related work items (optional).

codecov · 2026-06-02T11:02:14Z

Codecov Report

❌ Patch coverage is 70.37037% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.37%. Comparing base (868e6bd) to head (f8c8b0a).

Files with missing lines	Patch %	Lines
agent/agents/supervisor/supervisor.go	54.16%	19 Missing and 3 partials ⚠️
agent/agents/scheduler.go	83.33%	6 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##               v3    #5448      +/-   ##
==========================================
+ Coverage   43.23%   43.37%   +0.13%     
==========================================
  Files         414      414              
  Lines       42776    42278     -498     
==========================================
- Hits        18496    18337     -159     
+ Misses      22408    22178     -230     
+ Partials     1872     1763     -109

Flag	Coverage Δ
agent	`49.25% <70.37%> (+0.24%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

JiriCtvrtka · 2026-06-03T09:14:39Z

@copilot review

… into PMM-14671-qan-load-distribution

JiriCtvrtka · 2026-06-04T12:42:42Z

@copilot review

ademidoff

I think we raised the right problem, but didn't validate the proposed solution.

It matters little when the collection takes place. "min:00" was quite a good approach to reason about and test. However, the sending part was causing a storm of requests and potentially even network congestion server-side. That's the real problem we want to solve.

Therefore, IMO we should change the goal from

Metric collection and delivery should be evenly distributed or randomised within the collection interval

to

Metric collection should stay the same but the delivery should be evenly distributed or randomised within the collection interval.

That would be a much smaller change to review and test.

JiriCtvrtka · 2026-06-04T15:57:52Z

I think we raised the right problem, but didn't validate the proposed solution.

It matters little when the collection takes place. "min:00" was quite a good approach to reason about and test. However, the sending part was causing a storm of requests and potentially even network congestion server-side. That's the real problem we want to solve.

Therefore, IMO we should change the goal from

Metric collection and delivery should be evenly distributed or randomised within the collection interval

to

Metric collection should stay the same but the delivery should be evenly distributed or randomised within the collection interval.

That would be a much smaller change to review and test.

Fair point, I agree. I changed the approach so collection stays unchanged and randomization happens only before sending QANCollectRequest.

This keeps QAN bucket timestamps/periods the same while spreading delivery to PMM Server across the interval. The trade-off is that collection load on a single pmm-agent can still spike if many services collect at the same time.

JiriCtvrtka · 2026-06-04T16:06:28Z

@copilot review

JiriCtvrtka · 2026-06-05T07:34:22Z

@copilot review

JiriCtvrtka · 2026-06-05T08:15:07Z

@copilot review

JiriCtvrtka · 2026-06-08T05:44:23Z

After discussion with @4nte, he will create a ticket related to server/CH changes.

4nte · 2026-06-08T16:13:30Z

@JiriCtvrtka Do you think we should make the QAN delivery interval configurable via an env var?

The current implementation will benefit large deployments that suffer from spikes by essentially delaying the delivery but at cost of increased latency for QAN data.

What if we let users lower it (e.g. 60s to 20s) to trade some server load for fresher QAN data, do you think this is something users would care for?

JiriCtvrtka · 2026-06-08T16:21:32Z

@JiriCtvrtka Do you think we should make the QAN delivery interval configurable via an env var?

The current implementation will benefit large deployments that suffer from spikes by essentially delaying the delivery but at cost of increased latency for QAN data.

What if we let users lower it (e.g. 60s to 20s) to trade some server load for fresher QAN data, do you think this is something users would care for?

@4nte this is not possible at this moment. PGSM needs to have 1 minute otherwise it will desync. Not sure about other sources right now from top of my head.

maxkondr · 2026-06-09T07:13:10Z

 	go func() {
+		qanDeliveryRequests := make(chan *agentv1.QANCollectRequest, qanRequestsBufferSize)
+		qanDeliveryDone := make(chan struct{})
+		go func() {


this func startBuiltin is common for QAN and RTA agents. This particular go-routine is created unconditionally for all agent types 9even when it is not required (RTA case) .

one more point - waiting for each such nested go-routine finish is performed via done channel pattern, isn't it better to have one workGroup instead channel per each go-routine?

maxkondr · 2026-06-09T07:57:44Z

 	return nil
 }

+func (s *Supervisor) deliverQANRequests(ctx context.Context, l *logrus.Entry, agentID string, requests <-chan *agentv1.QANCollectRequest) {


there is an alternative via rateLimiter:

package main import ( "context" "log" "time" "golang.org/x/time/rate" ) // Allow exactly 10 requests per second, with a maximum burst of 1. // This enforces a strict, even flow to the backend. var backendLimiter = rate.NewLimiter(rate.Every(100*time.Millisecond), 1) func HandleRequest(ctx context.Context, payload string) error { // 1. We wait for permission to proceed. // If the rate is exceeded, this Goroutine goes to sleep (is parked) // and wakes up exactly when it's allowed to hit the backend. if err := backendLimiter.Wait(ctx); err != nil { // This error ONLY triggers if the context expires or is canceled // before a token becomes available. return err } // 2. We are cleared to go. Hit the backing service. return callBackingService(payload) } func callBackingService(payload string) error { // Simulate backend work return nil }

you may define a time distribution factor in ..rate.Every(100*time.Millisecond).... It looks less complex

maxkondr · 2026-06-09T08:05:43Z

 	go pprof.Do(ctx, pprof.Labels("agentID", agentID, "type", agentType), agent.Run)

 	go func() {
+		qanDeliveryRequests := make(chan *agentv1.QANCollectRequest, qanRequestsBufferSize)


qanRequestsBufferSize == only 100 items. It may become full very quickly if we postpone sending QAN requests to network on really loaded systems. And it results into the following:

this go-routine will stop reading from agent.Changes()

but this agent.Changes() is used for delivering agent's status changes as well.

agent's status delivery will be postponed till QAN requests are processed

It looks like there is a need in moving agent's status change channel into a separate high priority channel that is not blocked by the rest channels.

maxkondr · 2026-06-09T08:10:41Z

+	for request := range requests {
+		if !qanDeliveryOffsetAssigned {
+			var offset time.Duration
+			offset, releaseQANDeliveryOffset = agents.RandomMinuteOffset(agentID)


the problem with this approach - it is applied even when there is only 1 QAN agent because this logic is placed on Agent side that knows nothing about number of QAN agents registered on PMM server.

It still may create local spikes (yes, they will be not so critical in comparing to heavy-loaded systems) because it only shifts sending the whole collected/buffered QAN requests collection from 0-s second of each minute to N-s second of each minute.

PMM-14671 Load distribution for QAN agents.

6c2cdd6

JiriCtvrtka and others added 2 commits June 2, 2026 13:26

PMM-14671 Lint.

8fcfe44

Merge branch 'v3' into PMM-14671-qan-load-distribution

1f82c3b

Copilot started work on behalf of JiriCtvrtka June 3, 2026 09:14 View session

Copilot finished work on behalf of JiriCtvrtka June 3, 2026 09:17

JiriCtvrtka and others added 3 commits June 4, 2026 11:52

Merge branch 'v3' into PMM-14671-qan-load-distribution

43527de

PMM-14671 Improvements.

d0a7eca

Merge remote-tracking branch 'origin/PMM-14671-qan-load-distribution'…

abcd6f5

… into PMM-14671-qan-load-distribution

Copilot started work on behalf of JiriCtvrtka June 4, 2026 12:42 View session

Copilot finished work on behalf of JiriCtvrtka June 4, 2026 12:44

JiriCtvrtka marked this pull request as ready for review June 4, 2026 13:00

JiriCtvrtka requested a review from a team as a code owner June 4, 2026 13:00

JiriCtvrtka requested review from 4nte, ademidoff and maxkondr and removed request for a team June 4, 2026 13:00

ademidoff reviewed Jun 4, 2026

View reviewed changes

Comment thread agent/agents/mongodb/mongolog/internal/mongolog.go Outdated

JiriCtvrtka added 2 commits June 4, 2026 17:54

PMM-14671 Improved solution based on Alex suggestion.

e5757f5

PMM-14671 Supervisor changes.

686a41b

Merge branch 'v3' into PMM-14671-qan-load-distribution

5187c1e

Copilot started work on behalf of JiriCtvrtka June 4, 2026 16:06 View session

Copilot AI changed the title ~~PMM-14671 Load distribution for QAN agents.~~ PMM-14671 Randomize QAN delivery timing within collection interval Jun 4, 2026

Copilot finished work on behalf of JiriCtvrtka June 4, 2026 16:08

JiriCtvrtka added 2 commits June 5, 2026 09:31

PMM-14671 Improvements.

97b9802

Merge branch 'v3' into PMM-14671-qan-load-distribution

907d442

Copilot started work on behalf of JiriCtvrtka June 5, 2026 07:34 View session

Copilot finished work on behalf of JiriCtvrtka June 5, 2026 07:35

github-code-quality Bot found potential problems Jun 5, 2026

View reviewed changes

Comment thread agent/agents/supervisor/supervisor.go Fixed

JiriCtvrtka added 2 commits June 5, 2026 09:53

PMM-14671 Defer fix.

05bbd38

PMM-14671 Release fix.

dced80e

Copilot started work on behalf of JiriCtvrtka June 5, 2026 08:15 View session

Copilot finished work on behalf of JiriCtvrtka June 5, 2026 08:16

JiriCtvrtka requested a review from ademidoff June 5, 2026 08:23

Merge branch 'v3' into PMM-14671-qan-load-distribution

9d161ae

JiriCtvrtka added 2 commits June 8, 2026 13:40

Merge branch 'v3' into PMM-14671-qan-load-distribution

f61430e

Merge branch 'v3' into PMM-14671-qan-load-distribution

f8c8b0a

maxkondr reviewed Jun 9, 2026

View reviewed changes

Conversation

JiriCtvrtka commented Jun 2, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JiriCtvrtka commented Jun 3, 2026

Uh oh!

JiriCtvrtka commented Jun 4, 2026

Uh oh!

ademidoff left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JiriCtvrtka commented Jun 4, 2026

Uh oh!

JiriCtvrtka commented Jun 4, 2026

Uh oh!

JiriCtvrtka commented Jun 5, 2026

Uh oh!

Uh oh!

JiriCtvrtka commented Jun 5, 2026

Uh oh!

JiriCtvrtka commented Jun 8, 2026

Uh oh!

4nte commented Jun 8, 2026

Uh oh!

JiriCtvrtka commented Jun 8, 2026

Uh oh!

maxkondr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

maxkondr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

maxkondr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

maxkondr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JiriCtvrtka commented Jun 2, 2026 •

edited by Copilot AI

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading

ademidoff left a comment •

edited

Loading