Guard the cancel and signal-handler job-status writes too (fast-follow on #1338) by mihow · Pull Request #1342 · RolnickLab/antenna

mihow · 2026-06-19T01:11:59Z

Summary

A focused fast-follow on #1338 (now merged). #1338 made the result handler's terminal status write safe under concurrency, but the job's status is written from several places. This PR routes the other lock-free terminal writers — cancel() and the two Celery task-signal handlers — through one shared guarded transition, so a stale writer can't resurrect a job another writer already finished.

The motivation is the same lost-update race as #1337: a plain save() of the whole Job row from an out-of-date snapshot can overwrite a status another worker just set. #1338 stopped the result handler; this stops cancel() and the signal handlers doing it from the other direction (e.g. a late task-success signal flipping a just-cancelled job back to SUCCESS).
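The lost update described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: it uses the standard-library `sqlite3` module in place of the real database, and the `job` table, its columns, and the `load`/`full_save` helpers are hypothetical stand-ins for a `Job` row and a plain `save()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, progress TEXT)")
conn.execute("INSERT INTO job VALUES (1, 'STARTED', '50%')")


def load(job_id):
    """Read a full-row snapshot, as an ORM instance would hold it in memory."""
    row = conn.execute(
        "SELECT id, status, progress FROM job WHERE id = ?", (job_id,)
    ).fetchone()
    return {"id": row[0], "status": row[1], "progress": row[2]}


def full_save(snapshot):
    """Write every column back from the snapshot, like a plain save()."""
    conn.execute(
        "UPDATE job SET status = ?, progress = ? WHERE id = ?",
        (snapshot["status"], snapshot["progress"], snapshot["id"]),
    )


# Both workers read the row while it is still STARTED.
worker_a = load(1)
worker_b = load(1)

# Worker A finishes the job.
worker_a["status"] = "SUCCESS"
full_save(worker_a)

# Worker B only meant to update progress, but its stale status rides along.
worker_b["progress"] = "60%"
full_save(worker_b)

final_status = load(1)["status"]
print(final_status)  # STARTED: worker A's SUCCESS write was lost
```

Worker B never intended to touch the status, yet the full-row write carries its out-of-date value back to the database.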

On main (no longer stacked — #1338 is merged).

List of Changes

One guarded helper for the lock-free terminal writers. Added Job._guarded_status_update(to_status, from_statuses, *, set_finished=False): a statement-scope UPDATE ... WHERE status IN (from_statuses) that holds no row lock (so it doesn't reintroduce the contention that #1261 removed) and advances the in-memory instance only when a row actually changes. The default from-set is JobState.finalizable_states().
cancel() no longer clobbers a finished job, and async cancel no longer SIGTERMs the bootstrap. CANCELING/REVOKED now go through the guarded helper, so cancelling an already-finished job leaves its terminal status intact. Additionally, folding in the one useful change from #1324, an ASYNC_API cancel now revokes the local run_job without terminate: that task only queues images and has usually finished, and the remote ADC work is stopped by the NATS/Redis teardown, not by killing the bootstrap. Sync/internal jobs still terminate (that task is the work). Teardown (cleanup_async_job_if_needed) runs unconditionally.
The task-success and task-failure signal handlers can no longer resurrect a terminal job. The terminal SUCCESS write in update_job_status and the FAILURE write in update_job_failure go through the guarded helper; their existing pre-checks are unchanged. Minor behaviour change: a FAILURE set via the task-failure signal now also records finished_at, matching _fail_job and the result handler.
Tests. TestTerminalTransitionChokepoint covers the guard for each writer; TestCancelCompletionRace reproduces the real concurrent interleave between a cancel and a completing result batch in both directions (mirrors #1338's TestConcurrentStatusRace). 107 tests pass. Also validated on a dev deployment: cancelling a job mid-flight leaves it REVOKED with no resurrection and full NATS/Redis teardown.
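The guarded transition in the first item can be sketched as a single conditional UPDATE. This is an illustration under stated assumptions, not the PR's implementation: it uses the standard-library `sqlite3` module instead of the Django ORM, the `job` table is hypothetical, and the `FINALIZABLE` tuple is only a guess at what `JobState.finalizable_states()` might contain. In Django the same shape would typically be a `filter(...).update(...)` call, whose return value is the number of matched rows.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed stand-in for JobState.finalizable_states(); the real set may differ.
FINALIZABLE = ("CREATED", "PENDING", "STARTED")


def guarded_status_update(conn, job_id, to_status, from_statuses=FINALIZABLE,
                          *, set_finished=False):
    """Move a job to to_status only if its current status is in from_statuses.

    One UPDATE statement, no row lock. Returns True when a row changed.
    """
    assignments = "status = ?"
    params = [to_status]
    if set_finished:
        assignments += ", finished_at = ?"
        params.append(datetime.now(timezone.utc).isoformat())
    placeholders = ", ".join("?" for _ in from_statuses)
    sql = f"UPDATE job SET {assignments} WHERE id = ? AND status IN ({placeholders})"
    cursor = conn.execute(sql, [*params, job_id, *from_statuses])
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, finished_at TEXT)")
conn.execute("INSERT INTO job (id, status) VALUES (1, 'STARTED')")

# The first terminal writer wins; a late writer finds no matching row.
first = guarded_status_update(conn, 1, "REVOKED", set_finished=True)
late = guarded_status_update(conn, 1, "SUCCESS", set_finished=True)
status = conn.execute("SELECT status FROM job WHERE id = 1").fetchone()[0]
print(first, late, status)  # True False REVOKED
```

Because the status check and the write happen in one statement, a caller holding a stale snapshot cannot overwrite a terminal status. The boolean return value lets the caller decide whether to update its in-memory copy.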

Detailed Description

There are six terminal-status writers (the earlier "five" count missed the reaper). This PR brings the three remaining lock-free ones onto the guarded transition (the result handler was already guarded by #1338). The two lock-based ones, _fail_job and the reaper, are already safe, and the reaper intentionally keeps a broader from-set:

| Writer | Discipline |
| --- | --- |
| `_update_job_progress` (result handler) | guarded conditional UPDATE (#1338) |
| `Job.cancel()` | guarded conditional UPDATE (this PR) |
| `update_job_status` (task_postrun) | guarded conditional UPDATE (this PR) |
| `update_job_failure` (task_failure) | guarded conditional UPDATE (this PR) |
| `_fail_job` | `select_for_update` + terminal/CANCELING precondition (already safe) |
| `check_stale_jobs` (reaper) | `select_for_update`; intentionally keeps a broader from-set so it can force a genuinely stuck CANCELING/UNKNOWN job terminal as last resort |

So _guarded_status_update is the chokepoint for the lock-free writers — not a literal "single chokepoint" for all status writes; the two lock-based writers enforce the same no-resurrect invariant under their row lock. (An earlier docstring overclaimed this; corrected here.)

cancel()'s REVOKED transition includes CANCELING in its from-set, since cancel sets CANCELING itself and must complete that progression.
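The role of CANCELING in the from-set can be shown with a small in-memory model. This is a sketch only: `JobRow`, `PRE_TERMINAL`, and the logged strings are hypothetical, the `guarded` method stands in for the conditional UPDATE, and the from-set used for the CANCELING step is an assumption.

```python
from dataclasses import dataclass, field

# Assumed pre-terminal state names; the real JobState values may differ.
PRE_TERMINAL = {"CREATED", "PENDING", "STARTED"}


@dataclass
class JobRow:
    status: str = "STARTED"
    log: list = field(default_factory=list)

    def guarded(self, to_status, from_statuses):
        """Stand-in for the conditional UPDATE: change only from an allowed status."""
        if self.status in from_statuses:
            self.status = to_status
            return True
        return False


def cancel(job, *, is_async_api):
    job.guarded("CANCELING", PRE_TERMINAL)
    # An async bootstrap task is revoked without terminate; sync work is terminated.
    job.log.append(f"revoke(terminate={not is_async_api})")
    # CANCELING must be in the from-set, or cancel could never finish its own progression.
    job.guarded("REVOKED", PRE_TERMINAL | {"CANCELING"})
    # Teardown runs whether or not either guard fired.
    job.log.append("cleanup")


running = JobRow(status="STARTED")
cancel(running, is_async_api=True)
late_success = running.guarded("SUCCESS", PRE_TERMINAL)  # late signal, no-op

finished = JobRow(status="SUCCESS")
cancel(finished, is_async_api=False)

stuck = JobRow(status="CANCELING")
without_canceling = stuck.guarded("REVOKED", PRE_TERMINAL)

print(running.status, late_success)      # REVOKED False
print(finished.status, finished.log)     # SUCCESS ['revoke(terminate=True)', 'cleanup']
print(stuck.status, without_canceling)   # CANCELING False
```

The last case shows why the narrower default from-set would not be enough for the REVOKED step: a job that cancel itself moved to CANCELING would stay there.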

Supersedes the cancel() rewrite in #1324 (its skip-terminate idea is folded in here). #1324's other parts — a no-op CELERY_WORKER_POOL_OPTIMIZATION="fair" setting and a consumer_timeout-sensitive acks_late change — are tracked separately; see the note on #1324.

How to Test the Changes

pytest ami/jobs/tests/test_tasks.py::TestTerminalTransitionChokepoint
pytest ami/jobs/tests/test_tasks.py::TestCancelCompletionRace
Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (107 passed).

Checklist

I have tested these changes appropriately.
I have added and/or modified relevant tests.
I updated relevant documentation or comments.
I have verified that this PR follows the project's coding standards.
Any dependent changes have already been merged to main.

Refs #1337. Supersedes the cancel rewrite in #1324.

netlify · 2026-06-19T01:12:05Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`a8bce8d`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a349762e518b20007c62536

netlify · 2026-06-19T01:12:06Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`f112e6d`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a35e681ec8dcb0008c98014

coderabbitai · 2026-06-19T01:12:07Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


netlify · 2026-06-19T21:21:06Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`f112e6d`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a35e681d06aa400088ee4ba

…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup.

No behaviour change: both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status transitions are irreversible, surface the two cases where a terminal verdict may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING, log a warning. Often legitimate (cancel/reaper won the race) but, if frequent, the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the job age/status/dispatch first. A small age points to a not-yet-seeded or redelivered-run_job race rather than genuine cleanup.

No behaviour change: both warnings sit on existing code paths. Lets us confirm the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every in-flight result that arrived after a job finished, but _fail_job no-ops on a terminal job, so after a cancel (which deletes the Redis state) this fired once per in-flight batch and misdescribed normal cleanup as a failure.

Now: a terminal job logs at info ('ignoring in-flight result for already-terminal job'); only a NON-terminal job with missing state logs the warning, which is the case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes: ticket numbers and race-theory in the runtime message. Move the rationale and the issue reference into code comments and make the log lines plain operational statements an operator can act on without chasing a ticket. Also drop the redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a cancel-in-flight result logs the benign info line instead of the misleading 'still running / marking it failed' warning (matches _fail_job's no-op set). Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…point

Issue #1337 is a lost-update race on the job status column. PR #1338 fixed one writer, `_update_job_progress`, by splitting the terminal status write out of the progress-blob save and performing it as a guarded, statement-scope UPDATE that only fires from a pre-terminal status. The other four terminal writers still did an unguarded full-row `save()` and could clobber a terminal status from the opposite direction: a cancel could overwrite a just-committed SUCCESS with REVOKED, and a stale `task_postrun` SUCCESS or `task_failure` FAILURE could resurrect a job another writer had already revoked.

This change adds a single `Job._guarded_status_update(to_status, from_statuses, *, set_finished=False)` helper that performs the guarded UPDATE (no row lock, so it does not reintroduce the contention #1261 removed) and advances the in-memory instance only when the transition actually fires. The remaining terminal writers are routed through it:

- `Job.cancel()`: CANCELING and REVOKED are now guarded UPDATEs. The `task.revoke()` and `cleanup_async_job_if_needed()` calls still run regardless of whether the guard fired, since a job may already be terminal but still need its NATS/Redis resources released.
- `update_job_status` (task_postrun): only the terminal SUCCESS path is guarded; non-terminal celery states still flow through the dual-use `update_status()` unchanged.
- `update_job_failure` (task_failure): the terminal FAILURE write is guarded, keeping the existing in-flight-async deferral guard intact.

`_update_job_progress` and `_fail_job` are left as-is: the former is already guarded by #1338, and the latter is already safe via `select_for_update` plus a status precondition.

After a guarded transition, callers persist `progress.summary.status` into the JSONB with a narrow `save(update_fields=["progress", ...])` rather than a full save, matching #1338 and avoiding clobbering other columns. The save only happens when the guard fired, so an already-terminal job keeps both its status column and its summary.status.

One intentional behavior change: `update_job_failure` now sets `finished_at` when it marks FAILURE (it previously left it unset), making a failed terminal job consistent with `_fail_job` and the result handler.

Adds sequential regression tests (postrun/failure cannot resurrect a REVOKED job; cancel of an already-SUCCESS job no-ops on status but still cleans up) and two real-concurrency tests that interleave cancel against a completing result batch in both directions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not routed through _guarded_status_update. Correct the docstring's false 'single chokepoint' claim: this helper is the chokepoint for the lock-free writers; _fail_job and the reaper enforce the same no-resurrect invariant under select_for_update (the reaper deliberately keeps a broader from-set so it can still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes the local run_job WITHOUT terminate. That task only queues images and has usually finished; the remote ADC work is stopped by the NATS/Redis teardown, not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mihow mentioned this pull request Jun 19, 2026

Log premature-terminal and missing-state job failures (observability for #1337) #1343

Merged

5 tasks

Base automatically changed from fix/1337-conditional-status-transition to main June 19, 2026 21:18

mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21

This was referenced Jun 20, 2026

Improve celery task dispatch and cancellation to prevent stuck jobs #1324

Open

[Draft] Don't overwrite logs & status in concurrent background tasks #1026

Closed

mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from aaa5365 to f112e6d Compare June 20, 2026 01:01

mihow wants to merge 2 commits into main from fix/1337-terminal-transition-chokepoint
