Skip to content

Improve robustness of service termination#503

Merged
dotsdl merged 22 commits into
mainfrom
compute-int-sleep
Jun 17, 2026
Merged

Improve robustness of service termination#503
dotsdl merged 22 commits into
mainfrom
compute-int-sleep

Conversation

@dotsdl

@dotsdl dotsdl commented May 28, 2026

Copy link
Copy Markdown
Member

Summary

Makes the long-running services — SynchronousComputeService, ComputeManager, and StrategistService — shut down promptly and cleanly when asked to stop, instead of ignoring stop() (and termination signals) until the current sleep interval elapses. Along the way it fixes a latent dead mechanism, removes dead code, and consolidates the sleep primitive into one shared home.

Motivation

All three services run a while-loop that sleeps between work cycles. They relied on a plain, uninterruptible time.sleep(), so a stop() (e.g. from a SIGTERM/SIGINT handler) was not noticed until the sleep finished on its own — up to deep_sleep_interval (300s) for the compute service, sleep_interval (default 1800s) for the manager. For services that hold a registration in the state store, a slow or skipped shutdown means an orphaned ComputeManagerRegistration/ComputeServiceRegistration.

Changes

Interruptible sleep across all three services

  • ComputeManager now sleeps via the InterruptableSleep functor, woken by stop().
  • SynchronousComputeService constructed an InterruptableSleep and interrupted it in stop(), but every actual sleep used plain time.sleep() — the functor was never called and the except SleepInterrupted handler was dead code. Its cycle sleeps now go through it.
  • StrategistService sleeps between cycles via the functor too. No KeyboardInterrupt is raised for it: its cooperative _stop checks and ProcessPoolExecutor teardown are preserved, since raising asynchronously into the pool-managing thread risks orphaning child processes.
  • Each start() clears the sleep event so a stopped service can be cleanly restarted.

Shared primitive

  • Moved InterruptableSleep/SleepInterrupted into a new dependency-free alchemiscale.sleep module (a third consumer in a different subpackage made the previous home in compute.service an awkward coupling). Re-exported from alchemiscale.compute.service for backwards compatibility.

Signal-handling helper

  • Added alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wires SIGHUP/SIGINT/SIGTERM to a service's stop(). Signal disposition is process-global and main-thread-only, so it stays at the entry point — the helper just removes the per-CLI boilerplate (and the risk of forgetting it). The compute service synchronous CLI command now uses it.

Cleanup

  • Removed the unused sched.scheduler instances from SynchronousComputeService/AsynchronousComputeService and the now-unused import sched; corrected the InterruptableSleep docstring, which described a scheduler integration that was never wired up.

Tests

  • Added tests asserting stop() promptly wakes a sleeping manager, compute service, and strategist, plus unit tests for install_stop_handlers.

Companion PRs (separate repos)

The autoscaling manager CLIs in alchemiscale-hpc (SLURM) and alchemiscale-k8s install no signal handlers today, so SIGTERM — the signal systemd/docker stop/scancel/Kubernetes actually send — bypasses the finally block and orphans the manager registration. Those CLIs will be wired to install_stop_handlers in follow-up PRs to their respective repos (they consume the helper added here).

🤖 Generated with Claude Code

dotsdl and others added 2 commits May 27, 2026 19:09
ComputeManager used a plain time.sleep() between cycles, so stop() (and SIGINT) did not take effect until the full sleep_interval elapsed. It now sleeps via the InterruptableSleep functor, woken by stop().

SynchronousComputeService already constructed an InterruptableSleep and interrupted it in stop(), but never slept through it: every sleep used plain time.sleep(), leaving the mechanism (and the SleepInterrupted handler in start()) as dead code. Its cycle sleeps now go through int_sleep, and start() clears the event so a stopped service can be cleanly restarted.

Also removed the unused sched.scheduler instances and corrected the InterruptableSleep docstring, which described a scheduler integration that was never wired up.

Adds tests asserting stop() promptly wakes a sleeping manager/service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.06349% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.54%. Comparing base (b146148) to head (30e82d3).

Files with missing lines Patch % Lines
alchemiscale/cli.py 0.00% 5 Missing ⚠️
alchemiscale/compute/service.py 91.22% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #503      +/-   ##
==========================================
+ Coverage   79.63%   80.54%   +0.90%     
==========================================
  Files          29       31       +2     
  Lines        4833     4877      +44     
==========================================
+ Hits         3849     3928      +79     
+ Misses        984      949      -35     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dotsdl and others added 2 commits May 27, 2026 19:44
…ep interruptible

Moves InterruptableSleep/SleepInterrupted out of compute.service into a new dependency-free alchemiscale.sleep module, now that a third consumer (the strategist) needs them from a different subpackage. They are re-exported from compute.service for backwards compatibility, and compute.manager imports from the canonical location.

StrategistService now sleeps between cycles via the InterruptableSleep functor (interrupted by stop()), so termination takes effect promptly instead of waiting out the sleep interval. No KeyboardInterrupt is raised: the cooperative _stop checks and ProcessPoolExecutor teardown are preserved, since raising into the pool-managing thread risks orphaning child processes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds alchemiscale.compute.signals.install_stop_handlers, a small reusable helper that wires SIGHUP/SIGINT/SIGTERM to a service's stop(). Signal disposition is process-global and main-thread-only, so it belongs at the entry point; the helper keeps each CLI from re-implementing (and forgetting) the wiring. The compute service 'synchronous' CLI command now uses it.

Also updates the news fragment to cover the broader service-termination work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dotsdl dotsdl changed the title Make ComputeManager and compute service sleeps interruptible Improve robustness of service termination May 28, 2026
dotsdl and others added 2 commits May 28, 2026 17:43
Marks the three new tests that spawn services in background threads
with `@pytest.mark.skip` to isolate whether they are the cause of the
6h hang at `test_validate_network_nonself` on Python 3.11/3.13. If CI
goes green on 3.11/3.13 with this commit, the trigger lives in one of
these tests (or the threads they leave behind on failure). To be
reverted once root cause is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 of bisect: previous push (skipping all three new threaded tests)
went green on 3.11/3.13 in ~16m, confirming the hang is triggered by
one of those three tests. Re-enabling the compute-manager test first
because it spawns a non-daemon thread and exercises the most stateful
service (Neo4j registration, polling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dotsdl and others added 4 commits May 28, 2026 18:42
Bisect step 2 (re-enabling test_manager_interruptible_sleep) reproduced
the 3.11/3.13 hang while 3.12 passed cleanly. The test itself runs to
completion on gw0 in ~4.5s, but gw0 stops making progress immediately
after — strongly suggesting an issue with what the test leaves behind,
not what it asserts. See log analysis in PR thread.

Re-skipping so CI is green and the remaining bisect work (strategist
and compute-service threaded tests) can proceed without re-burning 6h.
With n4js_preloaded, the manager's first cycle saw num_tasks > 0 and
called create_compute_services -> multiprocessing.Process(...). Because
integration tests force fork as the global start method (see
running_service in alchemiscale/tests/integration/utils.py) and this
test runs manager.start() in a background thread, that Process forks a
multi-threaded Python worker. Locks held by other threads at the time
of fork are inherited as held-with-no-owner in the child, which then
deadlocks on its first acquire (typically the logging lock). The
deadlocked child hung the xdist worker on Python 3.11 and 3.13; 3.12
happened to miss the trip.

n4js_fresh has no preloaded tasks, so cycle() reports num_tasks == 0
and no Process is spawned. The test still exercises the actual code
path it cares about (start() entering its interruptible sleep, stop()
waking it). Also drops the @pytest.mark.skip added during bisection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ices

Previous attempt swapped n4js_preloaded for n4js_fresh to avoid the
fork-from-thread deadlock, but n4js_fresh leaves Neo4j empty including
the compute identities (n4js_preloaded writes those via
create_credentialed_entity), so the manager's registration POST hit
/token -> 500 and start() never reached its sleep.

Keep n4js_preloaded (credentials are in place), but replace
manager.create_compute_services with a no-op so cycle() does not fork
a Process. The test still exercises register / get_instruction /
update_status / int_sleep / stop / deregister, which is what it's
about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le_sleep

Both were temporarily skipped during the PR #503 bisection. They are
predicted innocent under the fork-from-thread hypothesis:

- test_start_interruptible_sleep (SynchronousComputeService): the
  service runs tasks in-thread, never forks a multiprocessing.Process
  during its cycle. The test thread is also daemon=True.

- test_service_interruptible_sleep (StrategistService): when the
  strategist does spawn subprocesses, it uses mp.get_context("spawn")
  (see strategist/service.py:589), which avoids the fork-from-thread
  pitfall entirely.

Drop the skip markers and let CI confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously heartbeat() slept with plain time.sleep(self.heartbeat_interval),
which stop() could not wake. The daemon heartbeat thread therefore stayed
alive — and the worker stayed multi-threaded — for up to heartbeat_interval
seconds after the main loop had already torn down.

In production this means SIGTERM on the CLI service has a long tail where
fork-from-a-multi-threaded-process is possible (any later multiprocessing
call inherits other threads' locks as held-with-no-owner -> child deadlock).

In tests this is exactly what tripped up test_compute_api /
test_compute_synchronous on 3.13 after test_start_interruptible_sleep ran
on the same xdist worker: those tests fork a server via running_service,
which deadlocked because the worker still had the heartbeat thread alive.
Python flagged it explicitly:

  DeprecationWarning: This process (pid=...) is multi-threaded, use of
  fork() may lead to deadlocks in the child.

Switch heartbeat to the same int_sleep / SleepInterrupted dance the main
loop uses. stop() now wakes both threads in one call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@dotsdl dotsdl left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this! Ready for review @ianmkenney, @atravitz, or @mikemhenry!

dotsdl and others added 2 commits June 1, 2026 23:21
- install_stop_handlers: add raise_keyboard_interrupt kwarg (default True)
  so the strategist CLI can opt out. The strategist's stop() triggers a
  ProcessPoolExecutor shutdown that must not be interrupted partway
  through. Strategist CLI now uses install_stop_handlers instead of an
  inline signal.signal loop, which removes the last in-tree use of
  the signal module from cli.py. New unit test covers the
  raise_keyboard_interrupt=False path.

- compute/service.py: document the fact that self.int_sleep is shared
  between the main loop and the heartbeat thread, with the gotcha that
  any future split must interrupt both from stop() --- otherwise the
  heartbeat thread keeps the worker multi-threaded until its sleep
  naturally expires (the bug fixed in be8fc49).

- Rename test_start_interruptible_sleep -> test_service_interruptible_sleep
  for naming parity with test_manager_interruptible_sleep and the
  strategist's test_service_interruptible_sleep.

- Replace direct attribute assignment in the two test monkeypatches
  with monkeypatch.setattr, the idiomatic form in this codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dotsdl dotsdl self-assigned this Jun 2, 2026
dotsdl and others added 4 commits June 3, 2026 18:20
PR #502 ("Move autoscaling sizing logic into ComputeManager base", merged
to main as 7dfb3de) changed ComputeManager.create_compute_services from
``(self, data)`` to ``(self, data, target)``. Our monkeypatch was still
the one-arg lambda; once the post-#502 ``cycle()`` calls it with two
positional arguments, TypeError fires, ``start()``'s ``except Exception``
catches it and ERROR-deregisters, and the loop exits before logging
"Sleeping for ...". The test's 30s deadline then trips with "manager
never reached its sleep".

Switch the no-op to ``lambda data, target: 0`` so the call signature
matches and cycle proceeds normally into the interruptible sleep that
the test is actually exercising.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mikemhenry mikemhenry left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions!

Also, I think a test (even a smoke test) for the CLI would help with coverage and also make sure the changes to the CLI are doing what we expect them to.

# stop conditions will use these
self._tasks_counter = 0
self._start_time = time.time()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an exception before this point, self._register() will not get the deregistered with self._deregister(). I think for this class and ComputeManager.start() you should do all the registering also within the try block so we can be sure nothing gets left behind

@mikemhenry

Copy link
Copy Markdown
Collaborator

Okay pushed two tests and a quick fix that should work, my plan is to make a context manager that should make this easier

@dotsdl dotsdl removed request for atravitz and ianmkenney June 10, 2026 23:27
@mikemhenry mikemhenry self-requested a review June 11, 2026 21:48
@mikemhenry mikemhenry dismissed their stale review June 15, 2026 16:41

mike made the changes himself

dotsdl and others added 3 commits June 15, 2026 18:20
The pre-refactor SynchronousComputeService.start() papered over a fragility
with an inline check at the top of each main-loop iteration: if the heartbeat
thread had died, it was resurrected with a fresh Thread before the next cycle.
Mike's _running() refactor (30747fd) drops that check, which is correct
structurally but exposes the underlying problem: client.heartbeat() can raise
after exhausting its retry policy (sustained outage, auth-token expiry,
persistent 5xx), and an uncaught exception in heartbeat() would silently kill
the daemon thread while the main loop kept happily claiming and executing
tasks.

The cleanest fix is at the source rather than restoring the resurrection:
catch per-beat failures inside heartbeat(), log them, and try again on the
next interval. The thread now only exits on stop() (via SleepInterrupted) or
process teardown. No resurrection check needed in start().

Regression test monkeypatches beat() to raise once, runs heartbeat() in a
thread for a few intervals, and asserts (a) the thread is still alive,
(b) beat was called more than once, (c) the failure was logged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@dotsdl dotsdl left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, thank you @mikemhenry! Merging once CI passes one last time!

@dotsdl dotsdl merged commit 08a386c into main Jun 17, 2026
7 checks passed
@github-project-automation github-project-automation Bot moved this from Cycle - In Review to Done in alchemiscale : advancement cycles Jun 17, 2026
@dotsdl dotsdl deleted the compute-int-sleep branch June 17, 2026 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants