fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock#654
Open
kevinheneveld wants to merge 1 commit into
Conversation
…sient SQLite lock DownloadMonitorService's poll loop caught only OperationCanceledException, so any other exception from a cycle propagated out of ExecuteAsync. A BackgroundService that throws trips the .NET host's default BackgroundServiceExceptionBehavior.StopHost, so the host shut down gracefully (exit 0) and the container's unless-stopped policy restarted it. In practice the cycle threw a Microsoft.Data.Sqlite "database is locked" SqliteException while persisting a download (EfDownloadRepository.UpdateAsync/ GetByIdAsync) under write contention, restarting the whole app every poll cycle (~18 restarts in 9h live). Each restart re-ran startup scans and reset the stall-reaper's in-memory 60-minute timers so it never reaped, keeping CPU pegged. The sibling background services (QueueMonitorService, MovedDownloadProcessor, AutomaticSearchService) already log-and-continue on a failed cycle. The poll loop is extracted to RunMonitorLoopAsync (for isolation + a fast, parallel-safe unit test) and now swallows-and-continues on any non-cancellation, non-fatal exception instead of letting it stop the host. The underlying lock contention (no busy_timeout on the SQLite connection) is a separate hardening follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DownloadMonitorService's poll loop caught onlyOperationCanceledException, so any other exception thrown during a cycle escapedExecuteAsyncand tripped the .NET host's defaultBackgroundServiceExceptionBehavior.StopHost— shutting the whole app down. Under a container restart policy that manifests as a crash-loop. This makes the monitor loop log-and-continue on a failed cycle, matching every sibling background service.Why this matters
A transient
Microsoft.Data.Sqlite.SqliteException("database is locked") is a normal, recoverable event under concurrent writes — it's raised while persisting a download (e.g.EfDownloadRepository.UpdateAsync/GetByIdAsync). With the old loop, a single such exception didn't just fail one cycle; it took the entire host down. On an instance running a container restart policy this became a crash-loop, restarting roughly every poll cycle (observed ~18 restarts in 9 hours), and each restart re-ran startup work — so a recoverable lock turned into sustained downtime and wasted CPU.DownloadMonitorServicewas the lone background service with this gap:QueueMonitorService,MovedDownloadProcessor, andAutomaticSearchServicealready log-and-continue.Implementation notes
internal async Task RunMonitorLoopAsync(CancellationToken)so the resilience behavior is isolated and unit-testable (the publicExecuteAsynckeeps its 5s polling interval, which a test can't wait on).OperationCanceledExceptionwith cancellation requested → break (graceful shutdown, unchanged).OperationCanceledExceptionwithout cancellation → a transient timeout within a cycle; logged and the loop continues.OutOfMemoryException/StackOverflowExceptionare deliberately not swallowed.DownloadMonitorService.cs) plus a new test. No API, schema, or behavioral change to what the monitor does on a successful cycle.What this PR does NOT include
busy_timeout, so concurrent writes can still raise transient locks — this PR stops a single transient lock from being fatal, but reducing how often it occurs is a separate hardening change.Test plan
cd tests && dotnet test— 683 / 683 passing (1 new regression test:RunMonitorLoopAsync_CycleThrows_KeepsLoopingInsteadOfStoppingHostdrives the extracted loop, forces a cycle to throw, and asserts the loop task is still running rather than completed).