PMM-15143 Fix resource leaks in managed by ademidoff · Pull Request #5484 · percona/pmm

ademidoff · 2026-06-10T18:13:53Z

Ticket number: PMM-15143

getReleaseNotesText returned on a non-200 status before the defer resp.Body.Close() was registered, leaking the response body and its TCP connection on every missing release note (a recurring path via the update-check loop and ListUpdates). Move the defer above the status check so the body is closed on all return paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
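The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the PMM code: `fetchText` stands in for getReleaseNotesText, and `trackedBody`, `stubTransport`, and `fetchWithStub` are hypothetical helpers that only exist to observe whether the body was closed without touching the network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// trackedBody is a hypothetical response body that records whether Close ran.
type trackedBody struct {
	io.Reader
	closed bool
}

func (b *trackedBody) Close() error {
	b.closed = true
	return nil
}

// stubTransport is a hypothetical RoundTripper that answers every request
// with a fixed status and body, so no network access is needed.
type stubTransport struct {
	status int
	body   *trackedBody
}

func (t *stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: t.status,
		Status:     fmt.Sprintf("%d %s", t.status, http.StatusText(t.status)),
		Header:     make(http.Header),
		Body:       t.body,
		Request:    req,
	}, nil
}

// fetchText stands in for getReleaseNotesText (the name and signature are
// assumptions made for this sketch).
func fetchText(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	// Registered before the status check, so every return path below
	// closes the body.
	defer resp.Body.Close() //nolint:errcheck

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// fetchWithStub runs fetchText against the stub transport and reports
// whether the response body ended up closed.
func fetchWithStub(status int, payload string) (text string, closed bool, err error) {
	body := &trackedBody{Reader: strings.NewReader(payload)}
	client := &http.Client{Transport: &stubTransport{status: status, body: body}}
	text, err = fetchText(client, "http://example.invalid/notes")
	return text, body.closed, err
}
```

Calling `fetchWithStub(404, "")` should return an error while still reporting the body as closed, which is the path that previously leaked.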

StartChecks ran its asynchronous advisor checks on context.Background(), so the goroutine could not be cancelled on shutdown or loss of HA leadership and kept running (up to the per-check timeout) on a node that should no longer be doing leader work. Record the context passed to Run and use it for the StartChecks goroutine; it defaults to Background until Run starts, preserving the previous behavior before the service is run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
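A rough sketch of the approach this commit describes, under the assumption that the recorded context is guarded by a mutex and falls back to Background before Run. The `service` type and the `record`, `lifecycle`, `startChecks`, and `demoCancel` names are hypothetical and do not mirror the actual checks service.

```go
package main

import (
	"context"
	"sync"
)

// service is a hypothetical stand-in for the checks service.
type service struct {
	mu     sync.Mutex
	runCtx context.Context // recorded by Run; nil until then
}

// record stores the lifecycle context (what Run would do on entry).
func (s *service) record(ctx context.Context) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.runCtx = ctx
}

// lifecycle returns the recorded context, or Background if Run has not
// started yet, which preserves the earlier behavior in that case.
func (s *service) lifecycle() context.Context {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.runCtx == nil {
		return context.Background()
	}
	return s.runCtx
}

// startChecks launches work bound to the lifecycle context and returns a
// channel that is closed once the work has returned.
func (s *service) startChecks(work func(context.Context)) <-chan struct{} {
	ctx := s.lifecycle()
	done := make(chan struct{})
	go func() {
		defer close(done)
		work(ctx)
	}()
	return done
}

// demoCancel shows that cancelling the recorded context stops the
// asynchronous work; it returns the error the work observed.
func demoCancel() error {
	s := &service{}
	ctx, cancel := context.WithCancel(context.Background())
	s.record(ctx)

	var observed error
	done := s.startChecks(func(ctx context.Context) {
		<-ctx.Done()
		observed = ctx.Err()
	})
	cancel()
	<-done
	return observed
}
```

Storing a context in a struct is a trade-off the Go documentation advises against in general, so this shape is only one of several possible designs.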

StartAllServices added each service to the running map under the lock but called wg.Add(1) later, outside it, while StopAllServices and removeService decrement based on running-map membership. Two races followed:

- removeService called wg.Done() unconditionally, so when a service's Start failed at the same time StopAllServices ran, both paths decremented the same Add(1), driving the counter negative and panicking.
- A Stop/remove that observed a service in running between the unlock and the deferred Add(1) could Done() before Add().

Pair wg.Add(1) with running-map insertion under the lock, and only Done() in removeService when it actually removed the entry. Add a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
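The pairing rule in this commit can be illustrated with a small, self-contained registry. This is a sketch under assumptions: `registry`, `add`, and `remove` are hypothetical names, and the real code tracks LeaderService values rather than empty structs.

```go
package main

import "sync"

// registry is a hypothetical, reduced version of the running-services map.
type registry struct {
	mu      sync.Mutex
	wg      sync.WaitGroup
	running map[string]struct{}
}

func newRegistry() *registry {
	return &registry{running: make(map[string]struct{})}
}

// add inserts id and increments the WaitGroup under the same lock, so no
// other goroutine can see the entry without its matching Add(1).
func (r *registry) add(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.running[id]; ok {
		return false
	}
	r.running[id] = struct{}{}
	r.wg.Add(1)
	return true
}

// remove calls Done only when this call actually deleted the entry, so two
// competing removals cannot decrement the same Add(1) twice.
func (r *registry) remove(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.running[id]; !ok {
		return false
	}
	delete(r.running, id)
	r.wg.Done()
	return true
}
```

With this shape, a second `remove` for the same id returns false instead of driving the counter negative.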

The telemetry HA leader service was the only leader service coupled to the top-level shutdown WaitGroup: wg.Add(1) ran unconditionally at registration while wg.Done() was deferred inside the leader closure. On a follower node the closure never runs, so wg.Done() is never called and wg.Wait() blocks forever, hanging shutdown; on leadership re-acquisition the closure runs again and over-decrements the single Add. Drop the wg coupling so telemetry matches the other leader services (checks, scheduler, versionCache, cleaner). Its shutdown is already awaited transitively: haService.Run stops all leader services on ctx cancellation and waits for them, and haService.Run itself runs under the top-level wg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Run created s.rareTicker/standardTicker/frequentTicker without holding s.tm, while UpdateIntervals read and Reset them under s.tm - a data race on the ticker pointers. UpdateIntervals also dereferenced them with no nil check, so a settings change before Run created the tickers (e.g. on a node that is not the HA leader) panicked with a nil pointer. Create the tickers under s.tm in Run, and guard UpdateIntervals against nil tickers (Run reads the new intervals from the persisted settings when it starts). Add a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
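A minimal sketch of the guard described here, reduced to a single ticker. The `scheduler` type and its `start`, `updateInterval`, and `stop` methods are hypothetical; the actual service keeps three tickers (rare, standard, frequent) under s.tm.

```go
package main

import (
	"sync"
	"time"
)

// scheduler is a hypothetical, single-ticker version of the service.
type scheduler struct {
	tm     sync.Mutex
	ticker *time.Ticker
}

// start creates the ticker while holding the mutex, so a concurrent
// updateInterval never observes a half-initialized pointer.
func (s *scheduler) start(interval time.Duration) {
	s.tm.Lock()
	defer s.tm.Unlock()
	s.ticker = time.NewTicker(interval)
}

// updateInterval resets the ticker if it exists. Before start (for example
// on a non-leader node) it is a no-op and reports false.
func (s *scheduler) updateInterval(interval time.Duration) bool {
	s.tm.Lock()
	defer s.tm.Unlock()
	if s.ticker == nil {
		return false
	}
	s.ticker.Reset(interval)
	return true
}

// stop releases the ticker, again under the mutex.
func (s *scheduler) stop() {
	s.tm.Lock()
	defer s.tm.Unlock()
	if s.ticker != nil {
		s.ticker.Stop()
		s.ticker = nil
	}
}
```

Calling `updateInterval` on a zero-value `scheduler` returns false rather than dereferencing a nil ticker.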

StopDump dereferenced s.cancel with no nil check, so calling it before any dump started would panic. The method has no callers anywhere in the repo, so remove it (dead code) rather than guard it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

TrimPITRArtifact passed artifact.MetadataList[0].RestoreTo to deleteArtifactPITRChunks right after MetadataRemoveFirstN, which clamps to the slice length and can leave MetadataList empty when firstN covers every record - indexing [0] then panics in the background goroutine. Pass a nil "until" when no metadata remains, which makes deleteArtifactPITRChunks remove all remaining chunks (its documented behavior). Add a regression subtest that trims all metadata. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
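The empty-slice case can be sketched as below. The `metadata` type, `removeFirstN`, `trimUntil`, and `at` are hypothetical simplifications; in particular, the assumption that RestoreTo is a `*time.Time` is made for illustration only.

```go
package main

import "time"

// metadata is a hypothetical, reduced artifact metadata record.
type metadata struct {
	RestoreTo *time.Time
}

// removeFirstN drops the first n records, clamping n to the slice length,
// which mirrors the clamping behavior described for MetadataRemoveFirstN.
func removeFirstN(list []metadata, n int) []metadata {
	if n < 0 {
		n = 0
	}
	if n > len(list) {
		n = len(list)
	}
	return list[n:]
}

// trimUntil picks the "until" bound for chunk deletion: nil when no
// metadata remains (delete everything), otherwise the first record's
// RestoreTo.
func trimUntil(list []metadata) *time.Time {
	if len(list) == 0 {
		return nil
	}
	return list[0].RestoreTo
}

// at is a small helper that builds a timestamp pointer from Unix seconds.
func at(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}
```

Trimming every record leaves an empty slice, and `trimUntil` then returns nil instead of indexing element 0.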

ademidoff · 2026-06-10T18:18:38Z

 	for id, service := range s.all {
 		if _, ok := s.running[id]; !ok {
 			s.running[id] = service
+			s.wg.Add(1)


This is the most important fix here )

codecov · 2026-06-10T18:18:49Z

Codecov Report

❌ Patch coverage is 56.52174% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.55%. Comparing base (29af54e) to head (6086c01).

Files with missing lines	Patch %	Lines
managed/services/checks/checks.go	28.57%	10 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##               v3    #5484      +/-   ##
==========================================
+ Coverage   43.46%   43.55%   +0.09%     
==========================================
  Files         413      413              
  Lines       42928    42728     -200     
==========================================
- Hits        18659    18611      -48     
+ Misses      22393    22272     -121     
+ Partials     1876     1845      -31

Flag	Coverage Δ
managed	`42.88% <56.52%> (+0.01%)`	⬆️


maxkondr · 2026-06-12T12:22:53Z

+	runCtxM sync.Mutex
+	// runCtx is the service lifecycle context recorded by Run. It bounds
+	// asynchronous work started via StartChecks so it is cancelled on shutdown.
+	runCtx context.Context //nolint:containedctx


This is an anti-pattern; see https://pkg.go.dev/context:

Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.

maxkondr · 2026-06-12T13:33:27Z

 	}

+	s.runCtxM.Lock()
+	ctx := s.runCtx


No need to pass the upper-level context here; it is much simpler to implement this via an on-demand channel that is read in the runChecksLoop func.
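One possible reading of this suggestion, sketched under assumptions: the `checker` type, its `onDemand` channel, and `runLoop` are hypothetical, and the real runChecksLoop also handles tickers that are omitted here.

```go
package main

import "context"

// checker is a hypothetical service with an on-demand trigger channel.
type checker struct {
	onDemand chan struct{}
}

// StartChecks requests a run without touching any context. It reports
// false when a request is already pending.
func (c *checker) StartChecks() bool {
	select {
	case c.onDemand <- struct{}{}:
		return true
	default:
		return false
	}
}

// runLoop owns the lifecycle context, so on-demand runs are cancelled
// together with the loop.
func (c *checker) runLoop(ctx context.Context, run func(context.Context)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-c.onDemand:
			run(ctx)
		}
	}
}

// demoOnDemand triggers one run, then cancels the loop, and returns how
// many runs were executed.
func demoOnDemand() int {
	ctx, cancel := context.WithCancel(context.Background())
	c := &checker{onDemand: make(chan struct{}, 1)}

	runs := 0
	ran := make(chan struct{})
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		c.runLoop(ctx, func(context.Context) {
			runs++
			ran <- struct{}{}
		})
	}()

	c.StartChecks()
	<-ran
	cancel()
	<-stopped
	return runs
}
```

In this shape the context is only ever passed as a function argument, at the cost of runs being serialized with the loop.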

maxkondr · 2026-06-12T13:36:25Z

+	// Tickers are created by Run; if it has not started on this node (e.g. not
+	// the leader), there is nothing to reset - Run reads the new intervals from
+	// the persisted settings when it starts.
+	if s.rareTicker == nil || s.standardTicker == nil || s.frequentTicker == nil {


shouldn't the update request be sent to the current leader node only?

maxkondr · 2026-06-12T14:04:36Z

 	for id, service := range s.all {
 		if _, ok := s.running[id]; !ok {
 			s.running[id] = service
+			s.wg.Add(1)
 			toStart = append(toStart, startItem{svc: service, id: id})
 		}
 	}
 	s.rw.Unlock()

 	for _, service := range toStart {
-		s.wg.Add(1)
 		go func(svc LeaderService, svcID string) {
 			s.l.Infoln("Starting", svcID)
 			err := svc.Start(ctx)
 			if err != nil {
 				s.l.Errorln(err)
 				s.removeService(svcID)
 			}
 		}(service.svc, service.id)
 	}
 }


No need for two loops; the goroutines can be spun up within the same loop.

Using wg.Go() would also avoid the issue that is being fixed here.
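A sketch of the single-loop shape, assuming a small helper that does what sync.WaitGroup.Go does in Go 1.25 and later: Add(1) before the goroutine starts and Done() deferred inside it. The `goTracked` and `startAll` names are hypothetical, and the helper is used here only so the example also builds on older toolchains.

```go
package main

import "sync"

// goTracked runs f in a goroutine tracked by wg. Add happens before the
// goroutine starts and Done is deferred inside it, so the two can never
// get out of step.
func goTracked(wg *sync.WaitGroup, f func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		f()
	}()
}

// startAll launches one tracked goroutine per id in a single loop and
// waits for all of them to return.
func startAll(ids []string, start func(id string)) {
	var wg sync.WaitGroup
	for _, id := range ids {
		id := id // per-iteration copy for toolchains older than Go 1.22
		goTracked(&wg, func() { start(id) })
	}
	wg.Wait()
}
```

For example, `startAll([]string{"checks", "scheduler"}, start)` returns only after both `start` calls have finished.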

maxkondr · 2026-06-12T14:21:43Z

 // StopAllServices stops all running services.
 func (s *services) StopAllServices() {
 	s.rw.Lock()
 	toStop := make([]LeaderService, 0, len(s.running))


Looks like there is no need to create a copy; the services can be stopped within the same loop, because service.Stop() just calls the context's cancel function, which is fast and doesn't block.

maxkondr · 2026-06-12T14:25:25Z

 	for _, service := range toStop {
 		s.l.Infoln("Stopping", service.ID())
 		service.Stop()
 		s.wg.Done()


This is conceptually wrong. s.wg.Done() must be called at the end of the goroutine that is part of the sync.WaitGroup; otherwise it leads to unpredictable results, and goroutines may keep running even after wg.Wait() has returned.

maxkondr · 2026-06-12T14:27:29Z

 	toStop := make([]LeaderService, 0, len(s.running))
 	for id, service := range s.running {
 		toStop = append(toStop, service)
 		delete(s.running, id)


Removal from s.running should happen when the service has really stopped, not when there is merely an intention to stop it.

maxkondr · 2026-06-12T14:35:33Z

@@ -122,7 +122,12 @@
 // removeService removes a service from the registry of running services.
 func (s *services) removeService(id string) {


The logic of this function should live in only one place: at the end of the goroutine at line 90. This function is not needed at all, because it creates a dangerous situation with the sync.WaitGroup counter.
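One way to read this group of comments is a design in which each service goroutine owns its own cleanup, and stopping only signals cancellation. The sketch below assumes that design; `supervisor`, `start`, `stopAll`, `runningCount`, and `demoSupervisor` are hypothetical names, and the sketch assumes `start` is not called concurrently with `stopAll`.

```go
package main

import (
	"context"
	"sync"
)

// supervisor is a hypothetical registry in which each goroutine removes
// its own entry and calls Done when it exits.
type supervisor struct {
	mu      sync.Mutex
	wg      sync.WaitGroup
	running map[string]context.CancelFunc
}

func newSupervisor() *supervisor {
	return &supervisor{running: make(map[string]context.CancelFunc)}
}

// start registers id and launches run. Cleanup lives in exactly one place:
// the deferred calls at the end of the goroutine.
func (s *supervisor) start(parent context.Context, id string, run func(context.Context) error) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.running[id]; ok {
		return false
	}
	ctx, cancel := context.WithCancel(parent)
	s.running[id] = cancel
	s.wg.Add(1)
	go func() {
		defer s.wg.Done() // runs last, after the entry is removed
		defer func() {
			cancel()
			s.mu.Lock()
			delete(s.running, id)
			s.mu.Unlock()
		}()
		_ = run(ctx) // a failed or finished run falls through to cleanup
	}()
	return true
}

// stopAll only signals cancellation, then waits for the goroutines to
// finish their own cleanup.
func (s *supervisor) stopAll() {
	s.mu.Lock()
	for _, cancel := range s.running {
		cancel()
	}
	s.mu.Unlock()
	s.wg.Wait()
}

func (s *supervisor) runningCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.running)
}

// demoSupervisor starts two blocking services, stops them, and returns how
// many entries remain registered afterwards.
func demoSupervisor() int {
	s := newSupervisor()
	block := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	s.start(context.Background(), "checks", block)
	s.start(context.Background(), "scheduler", block)
	s.stopAll()
	return s.runningCount()
}
```

Because Done is deferred inside the goroutine, `stopAll` cannot return while a service is still running, and no separate remove function touches the counter.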

ademidoff and others added 7 commits June 10, 2026 19:44

ademidoff mentioned this pull request Jun 10, 2026

PMM-15143-fix-resource-leaks Percona-Lab/pmm-submodules#4401

Draft


Merge branch 'v3' into PMM-15143-fix-resource-leaks

6086c01

ademidoff marked this pull request as ready for review June 11, 2026 13:47

ademidoff requested a review from a team as a code owner June 11, 2026 13:47

ademidoff requested review from 4nte and maxkondr and removed request for a team June 11, 2026 13:47

JiriCtvrtka approved these changes Jun 12, 2026


maxkondr reviewed Jun 12, 2026

ademidoff wants to merge 8 commits into v3 from PMM-15143-fix-resource-leaks


3 participants
