Improve robustness of service termination#503
Conversation
ComputeManager used a plain time.sleep() between cycles, so stop() (and SIGINT) did not take effect until the full sleep_interval elapsed. It now sleeps via the InterruptableSleep functor, woken by stop(). SynchronousComputeService already constructed an InterruptableSleep and interrupted it in stop(), but never slept through it: every sleep used plain time.sleep(), leaving the mechanism (and the SleepInterrupted handler in start()) as dead code. Its cycle sleeps now go through int_sleep, and start() clears the event so a stopped service can be cleanly restarted. Also removed the unused sched.scheduler instances and corrected the InterruptableSleep docstring, which described a scheduler integration that was never wired up. Adds tests asserting stop() promptly wakes a sleeping manager/service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #503 +/- ##
==========================================
+ Coverage 79.63% 80.54% +0.90%
==========================================
Files 29 31 +2
Lines 4833 4877 +44
==========================================
+ Hits 3849 3928 +79
+ Misses 984 949 -35 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ep interruptible Moves InterruptableSleep/SleepInterrupted out of compute.service into a new dependency-free alchemiscale.sleep module, now that a third consumer (the strategist) needs them from a different subpackage. They are re-exported from compute.service for backwards compatibility, and compute.manager imports from the canonical location. StrategistService now sleeps between cycles via the InterruptableSleep functor (interrupted by stop()), so termination takes effect promptly instead of waiting out the sleep interval. No KeyboardInterrupt is raised: the cooperative _stop checks and ProcessPoolExecutor teardown are preserved, since raising into the pool-managing thread risks orphaning child processes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wires SIGHUP/SIGINT/SIGTERM to a service's stop(). Signal disposition is process-global and main-thread-only, so it belongs at the entry point; the helper keeps each CLI from re-implementing (and forgetting) the wiring. The compute service 'synchronous' CLI command now uses it. Also updates the news fragment to cover the broader service-termination work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marks the three new tests that spawn services in background threads with `@pytest.mark.skip` to isolate whether they are the cause of the 6h hang at `test_validate_network_nonself` on Python 3.11/3.13. If CI goes green on 3.11/3.13 with this commit, the trigger lives in one of these tests (or the threads they leave behind on failure). To be reverted once root cause is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 of bisect: previous push (skipping all three new threaded tests) went green on 3.11/3.13 in ~16m, confirming the hang is triggered by one of those three tests. Re-enabling the compute-manager test first because it spawns a non-daemon thread and exercises the most stateful service (Neo4j registration, polling). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bisect step 2 (re-enabling test_manager_interruptible_sleep) reproduced the 3.11/3.13 hang while 3.12 passed cleanly. The test itself runs to completion on gw0 in ~4.5s, but gw0 stops making progress immediately after — strongly suggesting an issue with what the test leaves behind, not what it asserts. See log analysis in PR thread. Re-skipping so CI is green and the remaining bisect work (strategist and compute-service threaded tests) can proceed without re-burning 6h.
With n4js_preloaded, the manager's first cycle saw num_tasks > 0 and called create_compute_services -> multiprocessing.Process(...). Because integration tests force fork as the global start method (see running_service in alchemiscale/tests/integration/utils.py) and this test runs manager.start() in a background thread, that Process forks a multi-threaded Python worker. Locks held by other threads at the time of fork are inherited as held-with-no-owner in the child, which then deadlocks on its first acquire (typically the logging lock). The deadlocked child hung the xdist worker on Python 3.11 and 3.13; 3.12 happened to miss the trip. n4js_fresh has no preloaded tasks, so cycle() reports num_tasks == 0 and no Process is spawned. The test still exercises the actual code path it cares about (start() entering its interruptible sleep, stop() waking it). Also drops the @pytest.mark.skip added during bisection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ices Previous attempt swapped n4js_preloaded for n4js_fresh to avoid the fork-from-thread deadlock, but n4js_fresh leaves Neo4j empty including the compute identities (n4js_preloaded writes those via create_credentialed_entity), so the manager's registration POST hit /token -> 500 and start() never reached its sleep. Keep n4js_preloaded (credentials are in place), but replace manager.create_compute_services with a no-op so cycle() does not fork a Process. The test still exercises register / get_instruction / update_status / int_sleep / stop / deregister, which is what it's about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le_sleep Both were temporarily skipped during the PR #503 bisection. They are predicted innocent under the fork-from-thread hypothesis: - test_start_interruptible_sleep (SynchronousComputeService): the service runs tasks in-thread, never forks a multiprocessing.Process during its cycle. The test thread is also daemon=True. - test_service_interruptible_sleep (StrategistService): when the strategist does spawn subprocesses, it uses mp.get_context("spawn") (see strategist/service.py:589), which avoids the fork-from-thread pitfall entirely. Drop the skip markers and let CI confirm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously heartbeat() slept with plain time.sleep(self.heartbeat_interval), which stop() could not wake. The daemon heartbeat thread therefore stayed alive — and the worker stayed multi-threaded — for up to heartbeat_interval seconds after the main loop had already torn down. In production this means SIGTERM on the CLI service has a long tail where fork-from-a-multi-threaded-process is possible (any later multiprocessing call inherits other threads' locks as held-with-no-owner -> child deadlock). In tests this is exactly what tripped up test_compute_api / test_compute_synchronous on 3.13 after test_start_interruptible_sleep ran on the same xdist worker: those tests fork a server via running_service, which deadlocked because the worker still had the heartbeat thread alive. Python flagged it explicitly: DeprecationWarning: This process (pid=...) is multi-threaded, use of fork() may lead to deadlocks in the child. Switch heartbeat to the same int_sleep / SleepInterrupted dance the main loop uses. stop() now wakes both threads in one call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dotsdl
left a comment
There was a problem hiding this comment.
I'm happy with this! Ready for review @ianmkenney, @atravitz, or @mikemhenry!
- install_stop_handlers: add raise_keyboard_interrupt kwarg (default True) so the strategist CLI can opt out. The strategist's stop() triggers a ProcessPoolExecutor shutdown that must not be interrupted partway through. Strategist CLI now uses install_stop_handlers instead of an inline signal.signal loop, which removes the last in-tree use of the signal module from cli.py. New unit test covers the raise_keyboard_interrupt=False path. - compute/service.py: document the fact that self.int_sleep is shared between the main loop and the heartbeat thread, with the gotcha that any future split must interrupt both from stop() --- otherwise the heartbeat thread keeps the worker multi-threaded until its sleep naturally expires (the bug fixed in be8fc49). - Rename test_start_interruptible_sleep -> test_service_interruptible_sleep for naming parity with test_manager_interruptible_sleep and the strategist's test_service_interruptible_sleep. - Replace direct attribute assignment in the two test monkeypatches with monkeypatch.setattr, the idiomatic form in this codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #502 ("Move autoscaling sizing logic into ComputeManager base", merged to main as 7dfb3de) changed ComputeManager.create_compute_services from ``(self, data)`` to ``(self, data, target)``. Our monkeypatch was still the one-arg lambda; once the post-#502 ``cycle()`` calls it with two positional arguments, TypeError fires, ``start()``'s ``except Exception`` catches it and ERROR-deregisters, and the loop exits before logging "Sleeping for ...". The test's 30s deadline then trips with "manager never reached its sleep". Switch the no-op to ``lambda data, target: 0`` so the call signature matches and cycle proceeds normally into the interruptible sleep that the test is actually exercising. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikemhenry
left a comment
There was a problem hiding this comment.
Just a few suggestions!
Also, I think a test (even a smoke test) for the CLI would help with coverage and also make sure the changes to the CLI are doing what we expect them to.
| # stop conditions will use these | ||
| self._tasks_counter = 0 | ||
| self._start_time = time.time() | ||
|
|
There was a problem hiding this comment.
If there is an exception before this point, self._register() will not get the deregistered with self._deregister(). I think for this class and ComputeManager.start() you should do all the registering also within the try block so we can be sure nothing gets left behind
|
Okay pushed two tests and a quick fix that should work, my plan is to make a context manager that should make this easier |
The pre-refactor SynchronousComputeService.start() papered over a fragility with an inline check at the top of each main-loop iteration: if the heartbeat thread had died, it was resurrected with a fresh Thread before the next cycle. Mike's _running() refactor (30747fd) drops that check, which is correct structurally but exposes the underlying problem: client.heartbeat() can raise after exhausting its retry policy (sustained outage, auth-token expiry, persistent 5xx), and an uncaught exception in heartbeat() would silently kill the daemon thread while the main loop kept happily claiming and executing tasks. The cleanest fix is at the source rather than restoring the resurrection: catch per-beat failures inside heartbeat(), log them, and try again on the next interval. The thread now only exits on stop() (via SleepInterrupted) or process teardown. No resurrection check needed in start(). Regression test monkeypatches beat() to raise once, runs heartbeat() in a thread for a few intervals, and asserts (a) the thread is still alive, (b) beat was called more than once, (c) the failure was logged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dotsdl
left a comment
There was a problem hiding this comment.
I'm happy with this, thank you @mikemhenry! Merging once CI passes one last time!
Summary
Makes the long-running services —
SynchronousComputeService,ComputeManager, andStrategistService— shut down promptly and cleanly when asked to stop, instead of ignoringstop()(and termination signals) until the current sleep interval elapses. Along the way it fixes a latent dead mechanism, removes dead code, and consolidates the sleep primitive into one shared home.Motivation
All three services run a
while-loop that sleeps between work cycles. They relied on a plain, uninterruptibletime.sleep(), so astop()(e.g. from a SIGTERM/SIGINT handler) was not noticed until the sleep finished on its own — up todeep_sleep_interval(300s) for the compute service,sleep_interval(default 1800s) for the manager. For services that hold a registration in the state store, a slow or skipped shutdown means an orphanedComputeManagerRegistration/ComputeServiceRegistration.Changes
Interruptible sleep across all three services
ComputeManagernow sleeps via theInterruptableSleepfunctor, woken bystop().SynchronousComputeServiceconstructed anInterruptableSleepand interrupted it instop(), but every actual sleep used plaintime.sleep()— the functor was never called and theexcept SleepInterruptedhandler was dead code. Its cycle sleeps now go through it.StrategistServicesleeps between cycles via the functor too. NoKeyboardInterruptis raised for it: its cooperative_stopchecks andProcessPoolExecutorteardown are preserved, since raising asynchronously into the pool-managing thread risks orphaning child processes.start()clears the sleep event so a stopped service can be cleanly restarted.Shared primitive
InterruptableSleep/SleepInterruptedinto a new dependency-freealchemiscale.sleepmodule (a third consumer in a different subpackage made the previous home incompute.servicean awkward coupling). Re-exported fromalchemiscale.compute.servicefor backwards compatibility.Signal-handling helper
alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wiresSIGHUP/SIGINT/SIGTERMto a service'sstop(). Signal disposition is process-global and main-thread-only, so it stays at the entry point — the helper just removes the per-CLI boilerplate (and the risk of forgetting it). The compute servicesynchronousCLI command now uses it.Cleanup
sched.schedulerinstances fromSynchronousComputeService/AsynchronousComputeServiceand the now-unusedimport sched; corrected theInterruptableSleepdocstring, which described a scheduler integration that was never wired up.Tests
stop()promptly wakes a sleeping manager, compute service, and strategist, plus unit tests forinstall_stop_handlers.Companion PRs (separate repos)
The autoscaling manager CLIs in
alchemiscale-hpc(SLURM) andalchemiscale-k8sinstall no signal handlers today, so SIGTERM — the signalsystemd/docker stop/scancel/Kubernetes actually send — bypasses thefinallyblock and orphans the manager registration. Those CLIs will be wired toinstall_stop_handlersin follow-up PRs to their respective repos (they consume the helper added here).🤖 Generated with Claude Code