Improve robustness of service termination by dotsdl · Pull Request #503 · OpenFreeEnergy/alchemiscale

dotsdl · 2026-05-28T01:10:06Z

Summary

Makes the long-running services — SynchronousComputeService, ComputeManager, and StrategistService — shut down promptly and cleanly when asked to stop, instead of ignoring stop() (and termination signals) until the current sleep interval elapses. Along the way it fixes a latent dead mechanism, removes dead code, and consolidates the sleep primitive into one shared home.

Motivation

All three services run a while-loop that sleeps between work cycles. They relied on a plain, uninterruptible time.sleep(), so a stop() (e.g. from a SIGTERM/SIGINT handler) was not noticed until the sleep finished on its own — up to deep_sleep_interval (300s) for the compute service, sleep_interval (default 1800s) for the manager. For services that hold a registration in the state store, a slow or skipped shutdown means an orphaned ComputeManagerRegistration/ComputeServiceRegistration.

Changes

Interruptible sleep across all three services

ComputeManager now sleeps via the InterruptableSleep functor, woken by stop().
SynchronousComputeService constructed an InterruptableSleep and interrupted it in stop(), but every actual sleep used plain time.sleep() — the functor was never called and the except SleepInterrupted handler was dead code. Its cycle sleeps now go through it.
StrategistService sleeps between cycles via the functor too. No KeyboardInterrupt is raised for it: its cooperative _stop checks and ProcessPoolExecutor teardown are preserved, since raising asynchronously into the pool-managing thread risks orphaning child processes.
Each start() clears the sleep event so a stopped service can be cleanly restarted.

Shared primitive

Moved InterruptableSleep/SleepInterrupted into a new dependency-free alchemiscale.sleep module (a third consumer in a different subpackage made the previous home in compute.service an awkward coupling). Re-exported from alchemiscale.compute.service for backwards compatibility.

Signal-handling helper

Added alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wires SIGHUP/SIGINT/SIGTERM to a service's stop(). Signal disposition is process-global and main-thread-only, so it stays at the entry point — the helper just removes the per-CLI boilerplate (and the risk of forgetting it). The compute service synchronous CLI command now uses it.

Cleanup

Removed the unused sched.scheduler instances from SynchronousComputeService/AsynchronousComputeService and the now-unused import sched; corrected the InterruptableSleep docstring, which described a scheduler integration that was never wired up.

Tests

Added tests asserting stop() promptly wakes a sleeping manager, compute service, and strategist, plus unit tests for install_stop_handlers.

Companion PRs (separate repos)

The autoscaling manager CLIs in alchemiscale-hpc (SLURM) and alchemiscale-k8s install no signal handlers today, so SIGTERM — the signal systemd/docker stop/scancel/Kubernetes actually send — bypasses the finally block and orphans the manager registration. Those CLIs will be wired to install_stop_handlers in follow-up PRs to their respective repos (they consume the helper added here).

🤖 Generated with Claude Code

ComputeManager used a plain time.sleep() between cycles, so stop() (and SIGINT) did not take effect until the full sleep_interval elapsed. It now sleeps via the InterruptableSleep functor, woken by stop(). SynchronousComputeService already constructed an InterruptableSleep and interrupted it in stop(), but never slept through it: every sleep used plain time.sleep(), leaving the mechanism (and the SleepInterrupted handler in start()) as dead code. Its cycle sleeps now go through int_sleep, and start() clears the event so a stopped service can be cleanly restarted. Also removed the unused sched.scheduler instances and corrected the InterruptableSleep docstring, which described a scheduler integration that was never wired up. Adds tests asserting stop() promptly wakes a sleeping manager/service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-28T01:27:27Z

Codecov Report

❌ Patch coverage is 92.06349% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.54%. Comparing base (b146148) to head (30e82d3).

Files with missing lines	Patch %	Lines
alchemiscale/cli.py	0.00%	5 Missing ⚠️
alchemiscale/compute/service.py	91.22%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #503      +/-   ##
==========================================
+ Coverage   79.63%   80.54%   +0.90%     
==========================================
  Files          29       31       +2     
  Lines        4833     4877      +44     
==========================================
+ Hits         3849     3928      +79     
+ Misses        984      949      -35

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ep interruptible Moves InterruptableSleep/SleepInterrupted out of compute.service into a new dependency-free alchemiscale.sleep module, now that a third consumer (the strategist) needs them from a different subpackage. They are re-exported from compute.service for backwards compatibility, and compute.manager imports from the canonical location. StrategistService now sleeps between cycles via the InterruptableSleep functor (interrupted by stop()), so termination takes effect promptly instead of waiting out the sleep interval. No KeyboardInterrupt is raised: the cooperative _stop checks and ProcessPoolExecutor teardown are preserved, since raising into the pool-managing thread risks orphaning child processes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wires SIGHUP/SIGINT/SIGTERM to a service's stop(). Signal disposition is process-global and main-thread-only, so it belongs at the entry point; the helper keeps each CLI from re-implementing (and forgetting) the wiring. The compute service 'synchronous' CLI command now uses it. Also updates the news fragment to cover the broader service-termination work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Marks the three new tests that spawn services in background threads with `@pytest.mark.skip` to isolate whether they are the cause of the 6h hang at `test_validate_network_nonself` on Python 3.11/3.13. If CI goes green on 3.11/3.13 with this commit, the trigger lives in one of these tests (or the threads they leave behind on failure). To be reverted once root cause is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 2 of bisect: previous push (skipping all three new threaded tests) went green on 3.11/3.13 in ~16m, confirming the hang is triggered by one of those three tests. Re-enabling the compute-manager test first because it spawns a non-daemon thread and exercises the most stateful service (Neo4j registration, polling). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bisect step 2 (re-enabling test_manager_interruptible_sleep) reproduced the 3.11/3.13 hang while 3.12 passed cleanly. The test itself runs to completion on gw0 in ~4.5s, but gw0 stops making progress immediately after — strongly suggesting an issue with what the test leaves behind, not what it asserts. See log analysis in PR thread. Re-skipping so CI is green and the remaining bisect work (strategist and compute-service threaded tests) can proceed without re-burning 6h.

With n4js_preloaded, the manager's first cycle saw num_tasks > 0 and called create_compute_services -> multiprocessing.Process(...). Because integration tests force fork as the global start method (see running_service in alchemiscale/tests/integration/utils.py) and this test runs manager.start() in a background thread, that Process forks a multi-threaded Python worker. Locks held by other threads at the time of fork are inherited as held-with-no-owner in the child, which then deadlocks on its first acquire (typically the logging lock). The deadlocked child hung the xdist worker on Python 3.11 and 3.13; 3.12 happened to miss the trip. n4js_fresh has no preloaded tasks, so cycle() reports num_tasks == 0 and no Process is spawned. The test still exercises the actual code path it cares about (start() entering its interruptible sleep, stop() waking it). Also drops the @pytest.mark.skip added during bisection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ices Previous attempt swapped n4js_preloaded for n4js_fresh to avoid the fork-from-thread deadlock, but n4js_fresh leaves Neo4j empty including the compute identities (n4js_preloaded writes those via create_credentialed_entity), so the manager's registration POST hit /token -> 500 and start() never reached its sleep. Keep n4js_preloaded (credentials are in place), but replace manager.create_compute_services with a no-op so cycle() does not fork a Process. The test still exercises register / get_instruction / update_status / int_sleep / stop / deregister, which is what it's about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le_sleep Both were temporarily skipped during the PR #503 bisection. They are predicted innocent under the fork-from-thread hypothesis: - test_start_interruptible_sleep (SynchronousComputeService): the service runs tasks in-thread, never forks a multiprocessing.Process during its cycle. The test thread is also daemon=True. - test_service_interruptible_sleep (StrategistService): when the strategist does spawn subprocesses, it uses mp.get_context("spawn") (see strategist/service.py:589), which avoids the fork-from-thread pitfall entirely. Drop the skip markers and let CI confirm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously heartbeat() slept with plain time.sleep(self.heartbeat_interval), which stop() could not wake. The daemon heartbeat thread therefore stayed alive — and the worker stayed multi-threaded — for up to heartbeat_interval seconds after the main loop had already torn down. In production this means SIGTERM on the CLI service has a long tail where fork-from-a-multi-threaded-process is possible (any later multiprocessing call inherits other threads' locks as held-with-no-owner -> child deadlock). In tests this is exactly what tripped up test_compute_api / test_compute_synchronous on 3.13 after test_start_interruptible_sleep ran on the same xdist worker: those tests fork a server via running_service, which deadlocked because the worker still had the heartbeat thread alive. Python flagged it explicitly: DeprecationWarning: This process (pid=...) is multi-threaded, use of fork() may lead to deadlocks in the child. Switch heartbeat to the same int_sleep / SleepInterrupted dance the main loop uses. stop() now wakes both threads in one call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dotsdl

I'm happy with this! Ready for review @ianmkenney, @atravitz, or @mikemhenry!

- install_stop_handlers: add raise_keyboard_interrupt kwarg (default True) so the strategist CLI can opt out. The strategist's stop() triggers a ProcessPoolExecutor shutdown that must not be interrupted partway through. Strategist CLI now uses install_stop_handlers instead of an inline signal.signal loop, which removes the last in-tree use of the signal module from cli.py. New unit test covers the raise_keyboard_interrupt=False path. - compute/service.py: document the fact that self.int_sleep is shared between the main loop and the heartbeat thread, with the gotcha that any future split must interrupt both from stop() --- otherwise the heartbeat thread keeps the worker multi-threaded until its sleep naturally expires (the bug fixed in be8fc49). - Rename test_start_interruptible_sleep -> test_service_interruptible_sleep for naming parity with test_manager_interruptible_sleep and the strategist's test_service_interruptible_sleep. - Replace direct attribute assignment in the two test monkeypatches with monkeypatch.setattr, the idiomatic form in this codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #502 ("Move autoscaling sizing logic into ComputeManager base", merged to main as 7dfb3de) changed ComputeManager.create_compute_services from ``(self, data)`` to ``(self, data, target)``. Our monkeypatch was still the one-arg lambda; once the post-#502 ``cycle()`` calls it with two positional arguments, TypeError fires, ``start()``'s ``except Exception`` catches it and ERROR-deregisters, and the loop exits before logging "Sleeping for ...". The test's 30s deadline then trips with "manager never reached its sleep". Switch the no-op to ``lambda data, target: 0`` so the call signature matches and cycle proceeds normally into the interruptible sleep that the test is actually exercising. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mikemhenry

Just a few suggestions!

Also, I think a test (even a smoke test) for the CLI would help with coverage and also make sure the changes to the CLI are doing what we expect them to.

mikemhenry · 2026-06-10T19:26:04Z

        # stop conditions will use these
        self._tasks_counter = 0
        self._start_time = time.time()



If there is an exception before this point, self._register() will not get the deregistered with self._deregister(). I think for this class and ComputeManager.start() you should do all the registering also within the try block so we can be sure nothing gets left behind

mikemhenry · 2026-06-10T23:01:02Z

Okay pushed two tests and a quick fix that should work, my plan is to make a context manager that should make this easier

mike made the changes himself

The pre-refactor SynchronousComputeService.start() papered over a fragility with an inline check at the top of each main-loop iteration: if the heartbeat thread had died, it was resurrected with a fresh Thread before the next cycle. Mike's _running() refactor (30747fd) drops that check, which is correct structurally but exposes the underlying problem: client.heartbeat() can raise after exhausting its retry policy (sustained outage, auth-token expiry, persistent 5xx), and an uncaught exception in heartbeat() would silently kill the daemon thread while the main loop kept happily claiming and executing tasks. The cleanest fix is at the source rather than restoring the resurrection: catch per-beat failures inside heartbeat(), log them, and try again on the next interval. The thread now only exits on stop() (via SleepInterrupted) or process teardown. No resurrection check needed in start(). Regression test monkeypatches beat() to raise once, runs heartbeat() in a thread for a few intervals, and asserts (a) the thread is still alive, (b) beat was called more than once, (c) the failure was logged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dotsdl

I'm happy with this, thank you @mikemhenry! Merging once CI passes one last time!

dotsdl and others added 2 commits May 27, 2026 19:09

Add news fragment for #503

38c45de

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dotsdl and others added 2 commits May 27, 2026 19:44

dotsdl changed the title ~~Make ComputeManager and compute service sleeps interruptible~~ Improve robustness of service termination May 28, 2026

This was referenced May 28, 2026

Install stop-signal handlers in the SLURM manager CLI datryllic/alchemiscale-hpc#4

Open

Install stop-signal handlers in the K8s manager CLI datryllic/alchemiscale-k8s#27

Open

dotsdl and others added 2 commits May 28, 2026 17:43

dotsdl added this to the Release 0.8.0 - compute introspection and user experience milestone May 29, 2026

dotsdl and others added 4 commits May 28, 2026 18:42

dotsdl added this to alchemiscale : advancement cycles Jun 1, 2026

dotsdl moved this to Sprint - In Review in alchemiscale : advancement cycles Jun 1, 2026

dotsdl commented Jun 2, 2026

View reviewed changes

dotsdl requested review from atravitz, ianmkenney and mikemhenry June 2, 2026 04:54

dotsdl and others added 2 commits June 1, 2026 23:21

Merge branch 'main' into compute-int-sleep

a9769a0

dotsdl self-assigned this Jun 2, 2026

dotsdl and others added 4 commits June 3, 2026 18:20

Merge branch 'main' into compute-int-sleep

6db7bb5

Merge branch 'main' into compute-int-sleep

099f4c5

do min fix + add regression tests

2f29fd1

mikemhenry previously requested changes Jun 10, 2026

View reviewed changes

dotsdl removed request for atravitz and ianmkenney June 10, 2026 23:27

use a context manager instead

30747fd

mikemhenry self-requested a review June 11, 2026 21:48

run black

c8c55a7

dotsdl and others added 3 commits June 15, 2026 18:20

Merge branch 'main' into compute-int-sleep

f28c832

Consistency changes to manager for parity with compute service

30e82d3

dotsdl commented Jun 17, 2026

View reviewed changes

dotsdl merged commit 08a386c into main Jun 17, 2026
7 checks passed

github-project-automation Bot moved this from Cycle - In Review to Done in alchemiscale : advancement cycles Jun 17, 2026

dotsdl deleted the compute-int-sleep branch June 17, 2026 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve robustness of service termination#503

Improve robustness of service termination#503
dotsdl merged 22 commits into
mainfrom
compute-int-sleep

dotsdl commented May 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 28, 2026 •

edited

Loading

Uh oh!

dotsdl left a comment

Uh oh!

mikemhenry left a comment

Uh oh!

mikemhenry Jun 10, 2026

Uh oh!

mikemhenry commented Jun 10, 2026

Uh oh!

dotsdl left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dotsdl commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

Companion PRs (separate repos)

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dotsdl left a comment

Choose a reason for hiding this comment

Uh oh!

mikemhenry left a comment

Choose a reason for hiding this comment

Uh oh!

mikemhenry Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

mikemhenry commented Jun 10, 2026

Uh oh!

dotsdl left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dotsdl commented May 28, 2026 •

edited

Loading

codecov Bot commented May 28, 2026 •

edited

Loading