Skip to content

PMM-14671 Randomize QAN delivery timing within collection interval#5448

Open
JiriCtvrtka wants to merge 16 commits into
v3from
PMM-14671-qan-load-distribution
Open

PMM-14671 Randomize QAN delivery timing within collection interval#5448
JiriCtvrtka wants to merge 16 commits into
v3from
PMM-14671-qan-load-distribution

Conversation

@JiriCtvrtka

@JiriCtvrtka JiriCtvrtka commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Ticket number: PMM-14671

Feature build: SUBMODULES-0

Adjusted scope based on review feedback: metric collection timing remains unchanged, and only QANCollectRequest delivery is randomized within the interval. This preserves existing QAN bucket timestamps/periods while distributing request delivery to reduce server-side request bursts.

If this PR adds, removes or alters one or more API endpoints, please review and update the relevant API documentation as well:

  • API Docs updated

If this PR is related to other PRs, contributions, or ongoing work in this or other repositories, please reference them here:

  • Links to related work items (optional).

@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.37037% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.37%. Comparing base (868e6bd) to head (f8c8b0a).

Files with missing lines Patch % Lines
agent/agents/supervisor/supervisor.go 54.16% 19 Missing and 3 partials ⚠️
agent/agents/scheduler.go 83.33% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5448      +/-   ##
==========================================
+ Coverage   43.23%   43.37%   +0.13%     
==========================================
  Files         414      414              
  Lines       42776    42278     -498     
==========================================
- Hits        18496    18337     -159     
+ Misses      22408    22178     -230     
+ Partials     1872     1763     -109     
Flag Coverage Δ
agent 49.25% <70.37%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka JiriCtvrtka marked this pull request as ready for review June 4, 2026 13:00
@JiriCtvrtka JiriCtvrtka requested a review from a team as a code owner June 4, 2026 13:00
@JiriCtvrtka JiriCtvrtka requested review from 4nte, ademidoff and maxkondr and removed request for a team June 4, 2026 13:00

@ademidoff ademidoff left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we raised the right problem, but didn't validate the proposed solution.

It matters little when the collection takes place. "min:00" was quite a good approach to reason about and test. However, the sending part was causing a storm of requests and potentially even network congestion server-side. That's the real problem we want to solve.

Therefore, IMO we should change the goal from

Metric collection and delivery should be evenly distributed or randomised within the collection interval

to

Metric collection should stay the same but the delivery should be evenly distributed or randomised within the collection interval.

That would be a much smaller change to review and test.

Comment thread agent/agents/mongodb/mongolog/internal/mongolog.go Outdated
@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

I think we raised the right problem, but didn't validate the proposed solution.

It matters little when the collection takes place. "min:00" was quite a good approach to reason about and test. However, the sending part was causing a storm of requests and potentially even network congestion server-side. That's the real problem we want to solve.

Therefore, IMO we should change the goal from

Metric collection and delivery should be evenly distributed or randomised within the collection interval

to

Metric collection should stay the same but the delivery should be evenly distributed or randomised within the collection interval.

That would be a much smaller change to review and test.

Fair point, I agree. I changed the approach so collection stays unchanged and randomization happens only before sending QANCollectRequest.

This keeps QAN bucket timestamps/periods the same while spreading delivery to PMM Server across the interval. The trade-off is that collection load on a single pmm-agent can still spike if many services collect at the same time.

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

Copilot AI changed the title PMM-14671 Load distribution for QAN agents. PMM-14671 Randomize QAN delivery timing within collection interval Jun 4, 2026
@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

Comment thread agent/agents/supervisor/supervisor.go Fixed
@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

After discussion with @4nte, he will create a ticket related to server/CH changes.

@4nte

4nte commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@JiriCtvrtka Do you think we should make the QAN delivery interval configurable via an env var?

The current implementation will benefit large deployments that suffer from spikes by essentially delaying the delivery but at cost of increased latency for QAN data.

What if we let users lower it (e.g. 60s to 20s) to trade some server load for fresher QAN data, do you think this is something users would care for?

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@JiriCtvrtka Do you think we should make the QAN delivery interval configurable via an env var?

The current implementation will benefit large deployments that suffer from spikes by essentially delaying the delivery but at cost of increased latency for QAN data.

What if we let users lower it (e.g. 60s to 20s) to trade some server load for fresher QAN data, do you think this is something users would care for?

@4nte this is not possible at this moment. PGSM needs to have 1 minute otherwise it will desync. Not sure about other sources right now from top of my head.

go func() {
qanDeliveryRequests := make(chan *agentv1.QANCollectRequest, qanRequestsBufferSize)
qanDeliveryDone := make(chan struct{})
go func() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this func startBuiltin is common for QAN and RTA agents. This particular go-routine is created unconditionally for all agent types 9even when it is not required (RTA case) .

one more point - waiting for each such nested go-routine finish is performed via done channel pattern, isn't it better to have one workGroup instead channel per each go-routine?

return nil
}

func (s *Supervisor) deliverQANRequests(ctx context.Context, l *logrus.Entry, agentID string, requests <-chan *agentv1.QANCollectRequest) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an alternative via rateLimiter:

package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// Allow exactly 10 requests per second, with a maximum burst of 1.
// This enforces a strict, even flow to the backend.
var backendLimiter = rate.NewLimiter(rate.Every(100*time.Millisecond), 1)

func HandleRequest(ctx context.Context, payload string) error {
	// 1. We wait for permission to proceed. 
	// If the rate is exceeded, this Goroutine goes to sleep (is parked)
	// and wakes up exactly when it's allowed to hit the backend.
	if err := backendLimiter.Wait(ctx); err != nil {
		// This error ONLY triggers if the context expires or is canceled 
		// before a token becomes available.
		return err 
	}

	// 2. We are cleared to go. Hit the backing service.
	return callBackingService(payload)
}

func callBackingService(payload string) error {
	// Simulate backend work
	return nil
}

you may define a time distribution factor in ..rate.Every(100*time.Millisecond).... It looks less complex

go pprof.Do(ctx, pprof.Labels("agentID", agentID, "type", agentType), agent.Run)

go func() {
qanDeliveryRequests := make(chan *agentv1.QANCollectRequest, qanRequestsBufferSize)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qanRequestsBufferSize == only 100 items. It may become full very quickly if we postpone sending QAN requests to network on really loaded systems. And it results into the following:

  1. this go-routine will stop reading from agent.Changes()
  2. but this agent.Changes() is used for delivering agent's status changes as well.
  3. agent's status delivery will be postponed till QAN requests are processed

It looks like there is a need in moving agent's status change channel into a separate high priority channel that is not blocked by the rest channels.

for request := range requests {
if !qanDeliveryOffsetAssigned {
var offset time.Duration
offset, releaseQANDeliveryOffset = agents.RandomMinuteOffset(agentID)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with this approach - it is applied even when there is only 1 QAN agent because this logic is placed on Agent side that knows nothing about number of QAN agents registered on PMM server.

It still may create local spikes (yes, they will be not so critical in comparing to heavy-loaded systems) because it only shifts sending the whole collected/buffered QAN requests collection from 0-s second of each minute to N-s second of each minute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants