PMM-15143 Fix resource leaks in managed#5484
Conversation
getReleaseNotesText returned on a non-200 status before the defer resp.Body.Close() was registered, leaking the response body and its TCP connection on every missing release note (a recurring path via the update-check loop and ListUpdates). Move the defer above the status check so the body is closed on all return paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StartChecks ran its asynchronous advisor checks on context.Background(), so the goroutine could not be cancelled on shutdown or loss of HA leadership and kept running (up to the per-check timeout) on a node that should no longer be doing leader work. Record the context passed to Run and use it for the StartChecks goroutine; it defaults to Background until Run starts, preserving the previous behavior before the service is run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StartAllServices added each service to the running map under the lock but called wg.Add(1) later, outside it, while StopAllServices and removeService decrement based on running-map membership. Two races followed: - removeService called wg.Done() unconditionally, so when a service's Start failed at the same time StopAllServices ran, both paths decremented the same Add(1), driving the counter negative and panicking. - A Stop/remove that observed a service in running between the unlock and the deferred Add(1) could Done() before Add(). Pair wg.Add(1) with running-map insertion under the lock, and only Done() in removeService when it actually removed the entry. Add a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The telemetry HA leader service was the only leader service coupled to the top-level shutdown WaitGroup: wg.Add(1) ran unconditionally at registration while wg.Done() was deferred inside the leader closure. On a follower node the closure never runs, so wg.Done() is never called and wg.Wait() blocks forever, hanging shutdown; on leadership re-acquisition the closure runs again and over-decrements the single Add. Drop the wg coupling so telemetry matches the other leader services (checks, scheduler, versionCache, cleaner). Its shutdown is already awaited transitively: haService.Run stops all leader services on ctx cancellation and waits for them, and haService.Run itself runs under the top-level wg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run created s.rareTicker/standardTicker/frequentTicker without holding s.tm, while UpdateIntervals read and Reset them under s.tm - a data race on the ticker pointers. UpdateIntervals also dereferenced them with no nil check, so a settings change before Run created the tickers (e.g. on a node that is not the HA leader) panicked with a nil pointer. Create the tickers under s.tm in Run, and guard UpdateIntervals against nil tickers (Run reads the new intervals from the persisted settings when it starts). Add a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
StopDump dereferenced s.cancel with no nil check, so calling it before any dump started would panic. The method has no callers anywhere in the repo, so remove it (dead code) rather than guard it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TrimPITRArtifact passed artifact.MetadataList[0].RestoreTo to deleteArtifactPITRChunks right after MetadataRemoveFirstN, which clamps to the slice length and can leave MetadataList empty when firstN covers every record - indexing [0] then panics in the background goroutine. Pass a nil "until" when no metadata remains, which makes deleteArtifactPITRChunks remove all remaining chunks (its documented behavior). Add a regression subtest that trims all metadata. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| for id, service := range s.all { | ||
| if _, ok := s.running[id]; !ok { | ||
| s.running[id] = service | ||
| s.wg.Add(1) |
There was a problem hiding this comment.
This is the most important fix here )
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## v3 #5484 +/- ##
==========================================
+ Coverage 43.46% 43.55% +0.09%
==========================================
Files 413 413
Lines 42928 42728 -200
==========================================
- Hits 18659 18611 -48
+ Misses 22393 22272 -121
+ Partials 1876 1845 -31
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| runCtxM sync.Mutex | ||
| // runCtx is the service lifecycle context recorded by Run. It bounds | ||
| // asynchronous work started via StartChecks so it is cancelled on shutdown. | ||
| runCtx context.Context //nolint:containedctx |
There was a problem hiding this comment.
this is anti-pattern. https://pkg.go.dev/context
Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.
| } | ||
|
|
||
| s.runCtxM.Lock() | ||
| ctx := s.runCtx |
There was a problem hiding this comment.
no need to pass upper-level context here, much simpler to implement via on demand channel that is read in runChecksLoop func
| // Tickers are created by Run; if it has not started on this node (e.g. not | ||
| // the leader), there is nothing to reset - Run reads the new intervals from | ||
| // the persisted settings when it starts. | ||
| if s.rareTicker == nil || s.standardTicker == nil || s.frequentTicker == nil { |
There was a problem hiding this comment.
shouldn't the update request be sent to the current leader node only?
| for id, service := range s.all { | ||
| if _, ok := s.running[id]; !ok { | ||
| s.running[id] = service | ||
| s.wg.Add(1) | ||
| toStart = append(toStart, startItem{svc: service, id: id}) | ||
| } | ||
| } | ||
| s.rw.Unlock() | ||
|
|
||
| for _, service := range toStart { | ||
| s.wg.Add(1) | ||
| go func(svc LeaderService, svcID string) { | ||
| s.l.Infoln("Starting", svcID) | ||
| err := svc.Start(ctx) | ||
| if err != nil { | ||
| s.l.Errorln(err) | ||
| s.removeService(svcID) | ||
| } | ||
| }(service.svc, service.id) | ||
| } | ||
| } |
There was a problem hiding this comment.
- no need to create 2 loops, it is possible to spin up within the same loop
- with using wg.Go() would avoid the issue that is fixed here now
| // StopAllServices stops all running services. | ||
| func (s *services) StopAllServices() { | ||
| s.rw.Lock() | ||
| toStop := make([]LeaderService, 0, len(s.running)) |
There was a problem hiding this comment.
looks line no need to to create a copy, it is possible to stop within the same loop because service.Stop() just calls context.Cancel() - it is fast enough and doesn't block
| for _, service := range toStop { | ||
| s.l.Infoln("Stopping", service.ID()) | ||
| service.Stop() | ||
| s.wg.Done() |
There was a problem hiding this comment.
this is conceptually wrong. s.wg.Done() must be called at the end of go-routine that is part of sync.waitGroup otherwise it leads to unpredictable results and go-routines may keep running even after wg.Wait() has returned.
| toStop := make([]LeaderService, 0, len(s.running)) | ||
| for id, service := range s.running { | ||
| toStop = append(toStop, service) | ||
| delete(s.running, id) |
There was a problem hiding this comment.
removing from s.running shall be called when the service is really stopped but not when there is an intention to stop it.
| @@ -122,7 +122,12 @@ | |||
| // removeService removes a service from the registry of running services. | |||
| func (s *services) removeService(id string) { | |||
There was a problem hiding this comment.
the logic of this function shall be only in one place - at the ent of go-routine in line 90. This function is not needed at all because it creates a dangerous situation with sync.WaitGroup counter
Ticket number: PMM-15143
Feature build: SUBMODULES-4401