Skip to content

PMM-15143 Fix resource leaks in managed#5484

Open
ademidoff wants to merge 8 commits into
v3from
PMM-15143-fix-resource-leaks
Open

PMM-15143 Fix resource leaks in managed#5484
ademidoff wants to merge 8 commits into
v3from
PMM-15143-fix-resource-leaks

Conversation

@ademidoff

@ademidoff ademidoff commented Jun 10, 2026

Copy link
Copy Markdown
Member

Ticket number: PMM-15143

Feature build: SUBMODULES-4401

ademidoff and others added 7 commits June 10, 2026 19:44
getReleaseNotesText returned on a non-200 status before the
defer resp.Body.Close() was registered, leaking the response body and
its TCP connection on every missing release note (a recurring path via
the update-check loop and ListUpdates). Move the defer above the status
check so the body is closed on all return paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StartChecks ran its asynchronous advisor checks on context.Background(),
so the goroutine could not be cancelled on shutdown or loss of HA
leadership and kept running (up to the per-check timeout) on a node that
should no longer be doing leader work. Record the context passed to Run
and use it for the StartChecks goroutine; it defaults to Background until
Run starts, preserving the previous behavior before the service is run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StartAllServices added each service to the running map under the lock but
called wg.Add(1) later, outside it, while StopAllServices and removeService
decrement based on running-map membership. Two races followed:

- removeService called wg.Done() unconditionally, so when a service's Start
  failed at the same time StopAllServices ran, both paths decremented the
  same Add(1), driving the counter negative and panicking.
- A Stop/remove that observed a service in running between the unlock and
  the deferred Add(1) could Done() before Add().

Pair wg.Add(1) with running-map insertion under the lock, and only Done()
in removeService when it actually removed the entry. Add a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The telemetry HA leader service was the only leader service coupled to the
top-level shutdown WaitGroup: wg.Add(1) ran unconditionally at registration
while wg.Done() was deferred inside the leader closure. On a follower node
the closure never runs, so wg.Done() is never called and wg.Wait() blocks
forever, hanging shutdown; on leadership re-acquisition the closure runs
again and over-decrements the single Add.

Drop the wg coupling so telemetry matches the other leader services
(checks, scheduler, versionCache, cleaner). Its shutdown is already awaited
transitively: haService.Run stops all leader services on ctx cancellation
and waits for them, and haService.Run itself runs under the top-level wg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run created s.rareTicker/standardTicker/frequentTicker without holding
s.tm, while UpdateIntervals read and Reset them under s.tm - a data race on
the ticker pointers. UpdateIntervals also dereferenced them with no nil
check, so a settings change before Run created the tickers (e.g. on a node
that is not the HA leader) panicked with a nil pointer.

Create the tickers under s.tm in Run, and guard UpdateIntervals against
nil tickers (Run reads the new intervals from the persisted settings when
it starts). Add a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StopDump dereferenced s.cancel with no nil check, so calling it before any
dump started would panic. The method has no callers anywhere in the repo,
so remove it (dead code) rather than guard it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TrimPITRArtifact passed artifact.MetadataList[0].RestoreTo to
deleteArtifactPITRChunks right after MetadataRemoveFirstN, which clamps to
the slice length and can leave MetadataList empty when firstN covers every
record - indexing [0] then panics in the background goroutine.

Pass a nil "until" when no metadata remains, which makes
deleteArtifactPITRChunks remove all remaining chunks (its documented
behavior). Add a regression subtest that trims all metadata.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
for id, service := range s.all {
if _, ok := s.running[id]; !ok {
s.running[id] = service
s.wg.Add(1)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most important fix here )

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 56.52174% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.55%. Comparing base (29af54e) to head (6086c01).

Files with missing lines Patch % Lines
managed/services/checks/checks.go 28.57% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5484      +/-   ##
==========================================
+ Coverage   43.46%   43.55%   +0.09%     
==========================================
  Files         413      413              
  Lines       42928    42728     -200     
==========================================
- Hits        18659    18611      -48     
+ Misses      22393    22272     -121     
+ Partials     1876     1845      -31     
Flag Coverage Δ
managed 42.88% <56.52%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ademidoff ademidoff marked this pull request as ready for review June 11, 2026 13:47
@ademidoff ademidoff requested a review from a team as a code owner June 11, 2026 13:47
@ademidoff ademidoff requested review from 4nte and maxkondr and removed request for a team June 11, 2026 13:47
runCtxM sync.Mutex
// runCtx is the service lifecycle context recorded by Run. It bounds
// asynchronous work started via StartChecks so it is cancelled on shutdown.
runCtx context.Context //nolint:containedctx

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is anti-pattern. https://pkg.go.dev/context

Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.

}

s.runCtxM.Lock()
ctx := s.runCtx

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass upper-level context here, much simpler to implement via on demand channel that is read in runChecksLoop func

// Tickers are created by Run; if it has not started on this node (e.g. not
// the leader), there is nothing to reset - Run reads the new intervals from
// the persisted settings when it starts.
if s.rareTicker == nil || s.standardTicker == nil || s.frequentTicker == nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the update request be sent to the current leader node only?

Comment on lines 74 to 93
for id, service := range s.all {
if _, ok := s.running[id]; !ok {
s.running[id] = service
s.wg.Add(1)
toStart = append(toStart, startItem{svc: service, id: id})
}
}
s.rw.Unlock()

for _, service := range toStart {
s.wg.Add(1)
go func(svc LeaderService, svcID string) {
s.l.Infoln("Starting", svcID)
err := svc.Start(ctx)
if err != nil {
s.l.Errorln(err)
s.removeService(svcID)
}
}(service.svc, service.id)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. no need to create 2 loops, it is possible to spin up within the same loop
  2. with using wg.Go() would avoid the issue that is fixed here now

// StopAllServices stops all running services.
func (s *services) StopAllServices() {
s.rw.Lock()
toStop := make([]LeaderService, 0, len(s.running))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks line no need to to create a copy, it is possible to stop within the same loop because service.Stop() just calls context.Cancel() - it is fast enough and doesn't block

for _, service := range toStop {
s.l.Infoln("Stopping", service.ID())
service.Stop()
s.wg.Done()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is conceptually wrong. s.wg.Done() must be called at the end of go-routine that is part of sync.waitGroup otherwise it leads to unpredictable results and go-routines may keep running even after wg.Wait() has returned.

toStop := make([]LeaderService, 0, len(s.running))
for id, service := range s.running {
toStop = append(toStop, service)
delete(s.running, id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing from s.running shall be called when the service is really stopped but not when there is an intention to stop it.

@@ -122,7 +122,12 @@
// removeService removes a service from the registry of running services.
func (s *services) removeService(id string) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic of this function shall be only in one place - at the ent of go-routine in line 90. This function is not needed at all because it creates a dangerous situation with sync.WaitGroup counter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants