
Guard the cancel and signal-handler job-status writes too (fast-follow on #1338)#1342

Draft
mihow wants to merge 2 commits into main from fix/1337-terminal-transition-chokepoint

@mihow mihow commented Jun 19, 2026

Collaborator

Summary

A focused fast-follow on #1338 (now merged). #1338 made the result handler's terminal status write safe under concurrency, but the job's status is written from several places. This PR routes the other lock-free terminal writers — cancel() and the two Celery task-signal handlers — through one shared guarded transition, so a stale writer can't resurrect a job another writer already finished.

The motivation is the same lost-update race as #1337: a plain save() of the whole Job row from an out-of-date snapshot can overwrite a status another worker just set. #1338 stopped the result handler; this stops cancel() and the signal handlers doing it from the other direction (e.g. a late task-success signal flipping a just-cancelled job back to SUCCESS).
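The lost-update race described above can be reproduced in a few lines. This is a minimal sketch using the standard-library sqlite3 module rather than the project's Django models; the `job` table, the `load` and `full_save` helpers, and the column names are all hypothetical stand-ins for a full-row ORM `save()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, progress TEXT)")
conn.execute("INSERT INTO job VALUES (1, 'STARTED', '50%')")


def load(job_id):
    """Read the whole row into an in-memory snapshot (hypothetical helper)."""
    row = conn.execute(
        "SELECT id, status, progress FROM job WHERE id = ?", (job_id,)
    ).fetchone()
    return {"id": row[0], "status": row[1], "progress": row[2]}


def full_save(snapshot):
    """Write every column back from the snapshot, like an unscoped save()."""
    conn.execute(
        "UPDATE job SET status = ?, progress = ? WHERE id = ?",
        (snapshot["status"], snapshot["progress"], snapshot["id"]),
    )


# Writer A (e.g. a late signal handler) takes its snapshot while the job runs.
stale = load(1)

# Writer B (e.g. cancel) commits a terminal status in the meantime.
conn.execute("UPDATE job SET status = 'REVOKED' WHERE id = 1")

# Writer A now saves from its stale snapshot and overwrites B's verdict.
stale["status"] = "SUCCESS"
full_save(stale)

final_status = load(1)["status"]  # the REVOKED job has been resurrected
```

Nothing in `full_save` checks what the row currently holds, which is why the order of commits alone decides the outcome.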

This branch now targets main directly; it is no longer stacked on #1338, which has merged.

List of Changes

  1. One guarded helper for the lock-free terminal writers. Added Job._guarded_status_update(to_status, from_statuses, *, set_finished=False): a statement-scope UPDATE ... WHERE status IN (from_statuses) that holds no row lock (so it doesn't reintroduce the contention that #1261, "fix(jobs): fixes for concurrent ML processing jobs", removed) and advances the in-memory instance only when a row actually changes. The default from-set is JobState.finalizable_states().
  2. cancel() no longer clobbers a finished job, and async cancel no longer SIGTERMs the bootstrap. CANCELING/REVOKED now go through the guarded helper, so cancelling an already-finished job leaves its terminal status intact. Additionally, folding in the one useful change from #1324 ("Improve celery task dispatch and cancellation to prevent stuck jobs"), an ASYNC_API cancel now revokes the local run_job without terminate: that task only queues images and has usually finished, and the remote ADC work is stopped by the NATS/Redis teardown, not by killing the bootstrap. Sync/internal jobs still terminate (that task is the work). Teardown (cleanup_async_job_if_needed) runs unconditionally.
  3. The task-success and task-failure signal handlers can no longer resurrect a terminal job. The terminal SUCCESS write in update_job_status and the FAILURE write in update_job_failure go through the guarded helper; their existing pre-checks are unchanged. Minor behaviour change: a FAILURE set via the task-failure signal now also records finished_at, matching _fail_job and the result handler.
  4. Tests. TestTerminalTransitionChokepoint covers the guard for each writer; TestCancelCompletionRace reproduces the real concurrent interleave between a cancel and a completing result batch in both directions (mirroring TestConcurrentStatusRace from #1338). 107 tests pass. Also validated on a dev deployment: cancelling a job mid-flight leaves it REVOKED with no resurrection and full NATS/Redis teardown.
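The guarded transition in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it uses sqlite3 instead of the Django ORM, the free function `guarded_status_update` only mimics the shape of `Job._guarded_status_update`, and the `FINALIZABLE` tuple is a guessed stand-in for `JobState.finalizable_states()`.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed stand-in for JobState.finalizable_states(); the real set may differ.
FINALIZABLE = ("CREATED", "PENDING", "STARTED")


def guarded_status_update(conn, job_id, to_status, from_statuses=FINALIZABLE,
                          *, set_finished=False):
    """Run one conditional UPDATE; return True only if a row actually changed."""
    assignments = "status = ?"
    params = [to_status]
    if set_finished:
        assignments += ", finished_at = ?"
        params.append(datetime.now(timezone.utc).isoformat())
    placeholders = ", ".join("?" for _ in from_statuses)
    params.append(job_id)
    params.extend(from_statuses)
    cursor = conn.execute(
        f"UPDATE job SET {assignments} "
        f"WHERE id = ? AND status IN ({placeholders})",
        params,
    )
    # rowcount is 0 when the status had already left the from-set.
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, finished_at TEXT)")
conn.execute("INSERT INTO job (id, status) VALUES (1, 'STARTED')")

# The cancel path wins the race and finishes the job.
cancel_fired = guarded_status_update(conn, 1, "REVOKED", set_finished=True)

# A late success writer finds the job already terminal and changes nothing.
late_success_fired = guarded_status_update(conn, 1, "SUCCESS", set_finished=True)

status = conn.execute("SELECT status FROM job WHERE id = 1").fetchone()[0]
```

The caller would advance its in-memory instance only when the function returns True, which mirrors the "advances the in-memory instance only when a row actually changes" behaviour described in item 1.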

Detailed Description

There are six terminal-status writers (the earlier "five" count missed the reaper). #1338 already guarded the result handler; this PR brings the other three lock-free writers onto the guarded transition. The two lock-based writers are already safe, and one of them, the reaper, intentionally keeps a broader from-set:

| Writer | Discipline |
| --- | --- |
| _update_job_progress (result handler) | guarded conditional UPDATE (#1338) |
| Job.cancel() | guarded conditional UPDATE (this PR) |
| update_job_status (task_postrun) | guarded conditional UPDATE (this PR) |
| update_job_failure (task_failure) | guarded conditional UPDATE (this PR) |
| _fail_job | select_for_update + terminal/CANCELING precondition (already safe) |
| check_stale_jobs (reaper) | select_for_update; intentionally keeps a broader from-set so it can force a genuinely stuck CANCELING/UNKNOWN job terminal as last resort |

So _guarded_status_update is the chokepoint for the lock-free writers — not a literal "single chokepoint" for all status writes; the two lock-based writers enforce the same no-resurrect invariant under their row lock. (An earlier docstring overclaimed this; corrected here.)

cancel()'s REVOKED transition includes CANCELING in its from-set, since cancel sets CANCELING itself and must complete that progression.
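Why the REVOKED step needs CANCELING in its from-set can be shown with a small pure-Python model. This is a sketch only: the `JobState` members listed, the contents of `FINALIZABLE`, and the `transition` and `cancel` helpers are assumptions made for illustration, not the project's definitions.

```python
from enum import Enum


class JobState(str, Enum):
    CREATED = "CREATED"
    PENDING = "PENDING"
    STARTED = "STARTED"
    CANCELING = "CANCELING"
    REVOKED = "REVOKED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


# Assumed default from-set: states a job can still be finalized from.
FINALIZABLE = frozenset({JobState.CREATED, JobState.PENDING, JobState.STARTED})

# cancel() sets CANCELING itself, so its second step must accept CANCELING.
REVOKE_FROM = FINALIZABLE | {JobState.CANCELING}


def transition(current, to_status, from_statuses):
    """Model of a guarded write: change state only from an allowed state."""
    return to_status if current in from_statuses else current


def cancel(current):
    """Two-step cancel: mark CANCELING, then complete to REVOKED."""
    current = transition(current, JobState.CANCELING, FINALIZABLE)
    return transition(current, JobState.REVOKED, REVOKE_FROM)


running_after_cancel = cancel(JobState.STARTED)   # progresses to REVOKED
finished_after_cancel = cancel(JobState.SUCCESS)  # terminal status untouched

# With the default from-set alone, the second step could never fire.
stuck = transition(JobState.CANCELING, JobState.REVOKED, FINALIZABLE)
```

In this model a running job ends up REVOKED, a finished job keeps SUCCESS, and the last line shows the job stranded in CANCELING if the wider from-set were not used.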

Supersedes the cancel() rewrite in #1324 (its skip-terminate idea is folded in here). #1324's other parts — a no-op CELERY_WORKER_POOL_OPTIMIZATION="fair" setting and a consumer_timeout-sensitive acks_late change — are tracked separately; see the note on #1324.

How to Test the Changes

  • pytest ami/jobs/tests/test_tasks.py::TestTerminalTransitionChokepoint
  • pytest ami/jobs/tests/test_tasks.py::TestCancelCompletionRace
  • Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (107 passed).

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Refs #1337. Supersedes the cancel rewrite in #1324.

@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit a8bce8d
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a349762e518b20007c62536

@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit f112e6d
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a35e681ec8dcb0008c98014

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2db3698-6290-46d9-a1a0-25ab6d4713c3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Base automatically changed from fix/1337-conditional-status-transition to main June 19, 2026 21:18
@mihow mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21
@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit f112e6d
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a35e681d06aa400088ee4ba

mihow added a commit that referenced this pull request Jun 19, 2026
…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mihow added a commit that referenced this pull request Jun 19, 2026
…for #1337) (#1343)

* feat(jobs): log premature-terminal and missing-state failures for diagnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(jobs): downgrade missing-state log to info when the job is already terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every
in-flight result that arrived after a job finished — but _fail_job no-ops on a
terminal job, so after a cancel (which deletes the Redis state) this fired once
per in-flight batch and misdescribed normal cleanup as a failure. Now: a
terminal job logs at info ('ignoring in-flight result for already-terminal
job'); only a NON-terminal job with missing state logs the warning, which is the
case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): make the diagnostic log lines operator-readable and leaner

The missing-state and completed-after-terminal logs read like insider notes —
ticket numbers and race-theory in the runtime message. Move the rationale and
the issue reference into code comments and make the log lines plain operational
statements an operator can act on without chasing a ticket. Also drop the
redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(jobs): address review on missing-state diagnostics

- Treat CANCELING as terminal-like in the missing-state classification so a
  cancel-in-flight result logs the benign info line instead of the misleading
  'still running / marking it failed' warning (matches _fail_job's no-op set).
  Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and
  attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…point

Issue #1337 is a lost-update race on the job status column. PR #1338 fixed
one writer — `_update_job_progress` — by splitting the terminal status write
out of the progress-blob save and performing it as a guarded, statement-scope
UPDATE that only fires from a pre-terminal status. The other four terminal
writers still did an unguarded full-row `save()` and could clobber a terminal
status from the opposite direction: a cancel could overwrite a just-committed
SUCCESS with REVOKED, and a stale `task_postrun` SUCCESS or `task_failure`
FAILURE could resurrect a job another writer had already revoked.

This change adds a single `Job._guarded_status_update(to_status, from_statuses,
*, set_finished=False)` helper that performs the guarded UPDATE (no row lock,
so it does not reintroduce the contention #1261 removed) and advances the
in-memory instance only when the transition actually fires. The remaining
terminal writers are routed through it:

- `Job.cancel()`: CANCELING and REVOKED are now guarded UPDATEs. The
  `task.revoke()` and `cleanup_async_job_if_needed()` calls still run
  regardless of whether the guard fired, since a job may already be terminal
  but still need its NATS/Redis resources released.
- `update_job_status` (task_postrun): only the terminal SUCCESS path is
  guarded; non-terminal celery states still flow through the dual-use
  `update_status()` unchanged.
- `update_job_failure` (task_failure): the terminal FAILURE write is guarded,
  keeping the existing in-flight-async deferral guard intact.

`_update_job_progress` and `_fail_job` are left as-is: the former is already
guarded by #1338, and the latter is already safe via `select_for_update` plus a
status precondition.

After a guarded transition, callers persist `progress.summary.status` into the
JSONB with a narrow `save(update_fields=["progress", ...])` rather than a full
save, matching #1338 and avoiding clobbering other columns. The save only
happens when the guard fired, so an already-terminal job keeps both its status
column and its summary.status.

One intentional behavior change: `update_job_failure` now sets `finished_at`
when it marks FAILURE (it previously left it unset), making a failed terminal
job consistent with `_fail_job` and the result handler.

Adds sequential regression tests (postrun/failure cannot resurrect a REVOKED
job; cancel of an already-SUCCESS job no-ops on status but still cleans up) and
two real-concurrency tests that interleave cancel against a completing result
batch in both directions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e into cancel

- The reaper (check_stale_jobs) is a 6th terminal-status writer, lock-based, not
  routed through _guarded_status_update. Correct the docstring's false 'single
  chokepoint' claim: this helper is the chokepoint for the lock-free writers;
  _fail_job and the reaper enforce the same no-resurrect invariant under
  select_for_update (the reaper deliberately keeps a broader from-set so it can
  still force a stuck CANCELING/UNKNOWN job terminal as last resort).
- Fold the one useful change from #1324: cancel() of an ASYNC_API job now revokes
  the local run_job WITHOUT terminate. That task only queues images and has
  usually finished; the remote ADC work is stopped by the NATS/Redis teardown,
  not by SIGTERM-ing the bootstrap. Sync/internal jobs still terminate.

Refs #1337. Supersedes the cancel rewrite in #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mihow mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from aaa5365 to f112e6d Compare June 20, 2026 01:01