From 091e2fc7867ec31856793908db8ea098981de363 Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Wed, 29 Apr 2026 14:00:24 -0700 Subject: [PATCH 1/6] docs(jobs): plan reusable Job.is_actually_complete() backed by Redis MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Background: production job 2521 ended REVOKED despite all 4510 NATS messages acked and Redis pending sets empty. Root cause is a race in _update_job_progress: it writes the entire Job.progress JSONB blob without a row lock (PR #1261 dropped select_for_update to break a contention bottleneck), and two concurrent workers can each read the same pre-state then commit last-writer-wins. The slower committer with processed=4509 clobbered the faster committer's processed=4510. Job.progress cannot be the source of truth for state-machine decisions. This plan introduces: - AsyncJobStateManager.summarize() — single Redis pipeline call returning a JobCompletionSnapshot with state_present, total, pending_per_stage, failed, is_complete. - Job.is_actually_complete() — Tier 1 Redis → Tier 2 DB detection-count fallback (logs WARNING) → Tier 3 progress.is_complete() for sync_api jobs. - Replaces progress.is_complete() reads at the four state-machine call sites (reaper async_api guard, _update_job_progress cleanup gate, task_postrun, task_failure). Job.progress stays as the cosmetic mirror for the UI. Out of scope (filed as follow-ups in the doc): the underlying race in _update_job_progress, persisting queued_image_ids on Job, fixing the zombie_streams 'CANCELLED' enum crash surfaced during this investigation. Co-Authored-By: Claude --- docs/claude/planning/job-complete-check.md | 305 +++++++++++++++++++++ 1 file changed, 305 insertions(+) create mode 100644 docs/claude/planning/job-complete-check.md diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md new file mode 100644 index 000000000..8c8418ddf --- /dev/null +++ b/docs/claude/planning/job-complete-check.md @@ -0,0 +1,305 @@ +# Plan: reusable `Job.is_actually_complete()` (fix/job-complete-check) + +## Why + +Production job 2521 (project 192, ~4510 images, async_api) finished with all +4510 NATS messages acked and Redis pending sets empty, but `Job.progress` +JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The +`jobs_health_check` reaper at the 10-min cutoff: + +1. `mark_lost_images_failed` correctly skipped the job (Redis pending was + empty, so `lost_ids` was empty, which is the documented skip path). +2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard at + `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which + returned False because of the clobbered `process` stage. +3. Reaper REVOKEd the job. Result: 5058 detections / 8129 classifications + already saved to DB, but the Job row reads REVOKED. + +PR #1244's reconciler is the wrong tool for this symptom — there is nothing +to reconcile in Redis. The bug is upstream: `_update_job_progress` writes +the entire `progress` JSONB blob without a row lock (see PR #1261, which +removed `select_for_update` to fix unrelated row-lock contention), and two +concurrent workers can each read the same pre-state then commit +last-writer-wins. The `max()` guard at `tasks.py:598` only protects +percentage regression *based on what the writer read* — it does not +serialize the readers, so a slower committer with a stale snapshot +clobbers a faster committer's blob. + +`Job.progress` cannot be the source of truth for state-machine decisions. +This PR makes that explicit: introduce a reliable completion check backed +by Redis (which uses atomic SREM/SADD and is not racy), with a DB-count +fallback for jobs whose Redis state has been cleaned up. `Job.progress` +stays as the cosmetic mirror for the UI. + +Related context: + +- Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md` +- Last-writer-wins admission: `ami/jobs/tasks.py:563-571` +- PR #1261 commit `50677444` removed `select_for_update` to fix contention. + +## Scope + +In scope: + +- New `JobCompletionSnapshot` dataclass + `AsyncJobStateManager.summarize()`. +- New `Job.is_actually_complete()` method with Tier 1 (Redis) → Tier 2 + (DB count) → Tier 3 (`progress.is_complete()` for sync_api) fallback. +- WARNING log whenever Tier 2 fires, so we can monitor frequency and + catch unexpected fallbacks. +- Replace `progress.is_complete()` reads at the four state-machine call + sites that currently consult it. +- Tests covering the 4-way matrix (dispatch_mode × Redis-present-or-not). + +Out of scope: + +- Race fix in `_update_job_progress` itself. That is a separate, larger + refactor (move counts to atomic Redis counters, or scope a row lock to + microseconds via raw `jsonb_set`). Filed as follow-up: see + `### Follow-ups` below. +- UI changes. `Job.progress` stays as-is. The new method is backend-only. +- Touching `mark_lost_images_failed`. The reconciler's behavior is + correct for its intended bug (NATS-lost messages); 2521 was a different + failure that this PR addresses upstream. + +## Design + +### Tier 1 — Redis oracle + +`AsyncJobStateManager.summarize() -> JobCompletionSnapshot`. Single +Redis pipeline round-trip: + +```python +pipe.scard(self._get_pending_key("process")) +pipe.scard(self._get_pending_key("results")) +pipe.scard(self._failed_key) +pipe.get(self._total_key) +``` + +```python +@dataclass +class JobCompletionSnapshot: + state_present: bool # total_raw is not None + total: int # 0 when state_present=False + pending_per_stage: dict[str, int] # {"process": N, "results": M} + failed: int + is_complete: bool # state_present AND all pending == 0 + + # Convenience for callers: + @property + def all_acked(self) -> bool: + return self.state_present and all(v == 0 for v in self.pending_per_stage.values()) +``` + +`is_complete` is True only when state_present AND every stage's pending +set is empty. A job whose Redis state is gone returns +`state_present=False`, regardless of stage state — caller decides what to +do via Tier 2 fallback. + +`RedisError` is swallowed inside `summarize()` and returns +`state_present=False`. We do NOT propagate Redis transients up to the +reaper — a Redis blip should not flip a job to REVOKED. + +### Tier 2 — DB Detection-row count + +When Redis state is gone, query DB to determine whether all queued images +have detections from the job's pipeline algorithms: + +```python +def _is_complete_via_db(self) -> bool: + """Slow but durable. Only runs when Redis state is missing + (post-cleanup, post-TTL, or post-restart). Logs a WARNING so + we know how often we fall through to this path. + """ + queued_ids = self._collected_image_ids() # see below + if not queued_ids: + return True # zero-images job is trivially complete + unprocessed = self.pipeline.filter_processed_images( + SourceImage.objects.filter(pk__in=queued_ids) + ).only("pk") + is_complete = not unprocessed.exists() + self.logger.warning( + "job %s: Redis state gone, fell back to DB detection-count check; " + "queued=%d, unprocessed=%d, is_complete=%s", + self.pk, len(queued_ids), unprocessed.count(), is_complete, + ) + return is_complete +``` + +`self._collected_image_ids()`: derive from `Job.source_image_collection`, +`Job.source_image_single`, or `Job.deployment` exactly the way +`MLJob.run`/`pipeline.collect_images` does. Or, if the cost of +re-resolving is too high, persist the queued id list at dispatch time. +The `summarize()` payload includes `total` from Redis but Tier 2 needs +the full id set, so we cannot rely on Redis alone here. + +For the first PR cut, derive on the fly via `pipeline.collect_images()` +to avoid a schema change. If profiling shows this is slow at 5k+ images, +follow-up adds a `Job.queued_image_ids` jsonb column or similar. + +### Tier 3 — sync_api fallback + +`progress.is_complete()` for non-async_api jobs. Unchanged. Sync mode +does not use Redis at all, and its writes are serialized by the single +Celery task driving each stage. + +### `Job.is_actually_complete()` + +```python +def is_actually_complete(self) -> bool: + """Source-of-truth completion check. Use everywhere instead of + self.progress.is_complete() for state-machine decisions. + self.progress.is_complete() remains valid for cosmetic/UI reads. + """ + if self.dispatch_mode == JobDispatchMode.ASYNC_API: + snap = AsyncJobStateManager(self.pk).summarize() + if snap.state_present: + return snap.is_complete + return self._is_complete_via_db() + return self.progress.is_complete() +``` + +Method on the Job model, not a free function — lets call sites read like +`if not job.is_actually_complete(): ...` matching the existing +`progress.is_complete()` pattern. + +### Call sites to update + +| File:line | Current call | Replacement | +|---|---|---| +| `ami/jobs/tasks.py:968` (reaper async_api guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | +| `ami/jobs/tasks.py:654` (`_update_job_progress` cleanup gate) | `if job.progress.is_complete():` | `if job.is_actually_complete():` | +| `ami/jobs/tasks.py:1331` (`task_postrun` SUCCESS guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | +| `ami/jobs/tasks.py:1361` (`task_failure` ASYNC_API guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | +| `ami/jobs/models.py:496` (zero-images path in `MLJob.run`) | `if job.progress.is_complete():` | leave as-is — runs before Redis is initialized | +| `ami/jobs/models.py:159` (`run_job` post-task log) | `not job.progress.is_complete()` | leave as-is — log gating only, not state-machine | + +The four state-machine sites (rows 1-4) are the ones that decide SUCCESS +vs REVOKED vs deferred-cleanup. They all currently misread the clobbered +`progress.is_complete()`. The two leave-as-is rows are pre-Redis +initialization or log-only — the replacement would be no-op or wrong. + +### Tests + +`ami/jobs/tests/test_async_job_state.py` (new file or extend existing): + +- `summarize()` against fresh init (state_present=True, all pending=total) +- `summarize()` mid-flight (some pending, some processed) +- `summarize()` complete (state_present=True, all pending=0, is_complete=True) +- `summarize()` cleaned up (state_present=False, is_complete=False) +- `summarize()` Redis transient (RedisError → state_present=False) + +`ami/jobs/tests/test_job_completion.py` (new): + +- `Job.is_actually_complete()` — async_api + Redis present + all pending=0 → True +- async_api + Redis present + pending>0 → False (mid-flight defer) +- async_api + Redis gone + DB shows all detections present → True +- async_api + Redis gone + DB shows missing detections → False (logs WARNING) +- sync_api → falls through to `progress.is_complete()`, unchanged +- async_api with 2521 shape (Redis empty, progress JSONB stuck at 99.98%) → True + +`ami/jobs/tests/test_tasks.py` extension: + +- Reaper reaches the async_api guard with stale `progress` (process=99.98% + STARTED) but Redis empty → lands SUCCESS, not REVOKED. Mirrors 2521. + +### Logging + +Tier 2 fires a WARNING with `job_id, queued_count, unprocessed_count, +is_complete`. This is the signal we use to monitor: + +- How often do we fall through to DB? (Should be rare. High rate == + Redis cleanup is firing too aggressively, separate bug.) +- When we fall through, do we usually find complete or incomplete? + (Mostly complete == Redis TTL expiring on long-finished jobs. Mostly + incomplete == genuine partial-completion that the reaper will REVOKE, + which is the correct outcome.) + +Tier 1 is silent in the happy path; only ERRORS log on RedisError. + +## Acceptance criteria + +- All four state-machine call sites use `Job.is_actually_complete()`. +- Existing test suite `ami.jobs` stays green. +- New tests for the 6+ matrix cases pass. +- E2E run on a real ADC against a 50+ image collection lands SUCCESS. +- Manual chaos test: dispatch real job, manually clobber `Job.progress` + to mimic 2521 shape via Django shell, wait 11 min, verify reaper lands + SUCCESS not REVOKED. Verify Redis pending was empty at decision time. +- Tier 2 WARNING log appears when expected and only when expected. + +## Rollout + +Single PR. No feature flag. The new method strictly fixes a +known-broken path; no breaking change. Deploy + monitor reaper logs for: + +- Drop in REVOKEd async_api jobs that have `progress=~100%` in their final + Job row state. +- Any unexpected uptick in `fell back to DB detection-count check` + WARNING lines. + +## Risks + +- **DB fallback is slow on big jobs**: bounded — only fires after Redis + cleanup, on a stale job at the reaper tick. Worst case one slow query + per stale job per 15-min beat. +- **`pipeline.collect_images` re-resolves the queued list**: if the + Collection has changed since dispatch (rare for completed jobs but + possible), the recomputed id list may not match the originally queued + list. Acceptable for v1 — the WARNING log will show the count and we + can iterate. +- **Race window between Redis SREM and `summarize()`**: a result that + lands during the reaper's `summarize()` call could leave Redis briefly + in an inconsistent intermediate state. SCARD is atomic per-key, so the + worst case is the snapshot misses a just-completed image and reports + pending=1 → reaper waits another tick → next tick sees 0 → SUCCESS. + Bounded. + +## Follow-ups (separate PRs) + +1. **Make `_update_job_progress` writes atomic.** Either re-introduce a + tightly-scoped `select_for_update`, or move counts (`detections`, + `classifications`, `captures`) to Redis HINCRBY counters and have a + separate sync task mirror them into `Job.progress` periodically. The + cosmetic counter drift admitted at `tasks.py:563-571` becomes invisible + under (1), and the failure mode in this PR's plan goes away even + without `is_actually_complete()`. +2. **Persist `queued_image_ids` on the Job row at dispatch time.** + Replaces the on-the-fly `pipeline.collect_images` call in Tier 2. +3. **Fix `'CANCELLED' is not a valid JobState`** in + `ami/jobs/tasks.py:1193` (`_drain_all`). Separate bug surfaced during + 2521 investigation — `zombie_streams` check is throwing every tick on + a legacy job row whose `status='CANCELLED'` is not in the JobState + enum (only `CANCELING` / `REVOKED` are valid). + +## Diagnostic evidence (job 2521) + +Captured from prod via `ami-devops` on 2026-04-29: + +- Job row at REVOKE time: + - `process=99.98% STARTED processed=4509 remaining=1` + - `results=100% SUCCESS` + - `progress.errors=[]` +- NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0 + num_ack_pending=0 num_redelivered=0` +- Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml + log shows no `Stage 'X' progress lifted to 100% by max() guard` warning + (the diagnostic hook for this race), suggesting both writers had stale + reads. +- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` at + 11:45:45.510, then a later `99.97782%` write to results, then silence + for 14.5 min until reaper at 12:00:00.117. +- Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages: + collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS` + — note the reaper's in-memory snapshot read shows 100% at the moment of + REVOKE. The persisted blob at that moment was 99.98%. The + `progress.is_complete()` call inside the guard re-read the persisted + blob (or evaluated against the 100%-but-STARTED status, which fails the + `final_states` check in `models.py:268`). Either way: guard fired, + REVOKED. + +This is the second case (status STARTED with progress=100% on the same +stage) — it confirms the race is on the `status` field write, not just +the percentage. Both writers had stale snapshots; the slower one +overwrote the SUCCESS-status with STARTED. + +Last updated: 2026-04-29. From 10cdac28372b1dea2ad4ee24d3716e3d38533969 Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Wed, 29 Apr 2026 14:21:40 -0700 Subject: [PATCH 2/6] docs(jobs): narrow plan to reaper-only Redis check via all_tasks_processed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Revise per review: the only user-visible site bitten by the _update_job_progress clobber race is the reaper async_api guard at tasks.py:968. Other sites that read progress.is_complete() either run at single-writer moments (task_postrun, task_failure) or self-correct later (cleanup gate, log gating). Narrow the fix accordingly. Also drops the proposed Job.is_actually_complete() wrapper in favor of inlining the Redis check at the one call site that needs it. Pydantic JobProgress.is_complete() stays pure data; AsyncJobStateManager grows all_tasks_processed() with tri-state bool|None semantics. Naming makes the scope explicit: Redis tracks NATS-task SREM completeness for the process and results stages, not job-wide multi-stage completion. DB-count fallback (Tier 2) deferred until monitoring shows the Redis-gone WARNING fires often enough to matter. Sync_api jobs are unaffected — dispatch_mode gate keeps them on the existing trust-Celery path. Co-Authored-By: Claude --- docs/claude/planning/job-complete-check.md | 455 +++++++++++---------- 1 file changed, 234 insertions(+), 221 deletions(-) diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md index 8c8418ddf..bc0847dfb 100644 --- a/docs/claude/planning/job-complete-check.md +++ b/docs/claude/planning/job-complete-check.md @@ -1,4 +1,4 @@ -# Plan: reusable `Job.is_actually_complete()` (fix/job-complete-check) +# Plan: reaper checks Redis directly via `all_tasks_processed()` (fix/job-complete-check) ## Why @@ -8,268 +8,281 @@ JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The `jobs_health_check` reaper at the 10-min cutoff: 1. `mark_lost_images_failed` correctly skipped the job (Redis pending was - empty, so `lost_ids` was empty, which is the documented skip path). -2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard at - `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which + empty, so `lost_ids` was empty — the documented skip path). +2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard + at `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which returned False because of the clobbered `process` stage. -3. Reaper REVOKEd the job. Result: 5058 detections / 8129 classifications - already saved to DB, but the Job row reads REVOKED. +3. Reaper REVOKEd the job. 5058 detections / 8129 classifications already + saved to DB; Job row reads REVOKED. PR #1244's reconciler is the wrong tool for this symptom — there is nothing -to reconcile in Redis. The bug is upstream: `_update_job_progress` writes -the entire `progress` JSONB blob without a row lock (see PR #1261, which -removed `select_for_update` to fix unrelated row-lock contention), and two -concurrent workers can each read the same pre-state then commit -last-writer-wins. The `max()` guard at `tasks.py:598` only protects -percentage regression *based on what the writer read* — it does not -serialize the readers, so a slower committer with a stale snapshot +to reconcile in Redis. The bug is upstream: `_update_job_progress` +(`tasks.py:556`) writes the entire `progress` JSONB blob without a row +lock (PR #1261 dropped `select_for_update` to break unrelated row-lock +contention), and two concurrent workers can each read the same pre-state +then commit last-writer-wins. The `max()` guard at `tasks.py:598` only +protects percentage regression *based on what each writer read* — it does +not serialize the readers, so a slower committer with a stale snapshot clobbers a faster committer's blob. -`Job.progress` cannot be the source of truth for state-machine decisions. -This PR makes that explicit: introduce a reliable completion check backed -by Redis (which uses atomic SREM/SADD and is not racy), with a DB-count -fallback for jobs whose Redis state has been cleaned up. `Job.progress` -stays as the cosmetic mirror for the UI. +`Job.progress` cannot be the source of truth for state-machine decisions +on async_api jobs. This PR makes that explicit at the one user-visible +site: the reaper's async_api guard. Reaper consults Redis directly. +`Job.progress` stays as the cosmetic mirror for the UI. Related context: - Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md` - Last-writer-wins admission: `ami/jobs/tasks.py:563-571` -- PR #1261 commit `50677444` removed `select_for_update` to fix contention. +- PR #1261 commit `50677444` removed `select_for_update`. -## Scope +## Scope (narrow) In scope: -- New `JobCompletionSnapshot` dataclass + `AsyncJobStateManager.summarize()`. -- New `Job.is_actually_complete()` method with Tier 1 (Redis) → Tier 2 - (DB count) → Tier 3 (`progress.is_complete()` for sync_api) fallback. -- WARNING log whenever Tier 2 fires, so we can monitor frequency and - catch unexpected fallbacks. -- Replace `progress.is_complete()` reads at the four state-machine call - sites that currently consult it. -- Tests covering the 4-way matrix (dispatch_mode × Redis-present-or-not). - -Out of scope: - -- Race fix in `_update_job_progress` itself. That is a separate, larger - refactor (move counts to atomic Redis counters, or scope a row lock to - microseconds via raw `jsonb_set`). Filed as follow-up: see - `### Follow-ups` below. -- UI changes. `Job.progress` stays as-is. The new method is backend-only. -- Touching `mark_lost_images_failed`. The reconciler's behavior is - correct for its intended bug (NATS-lost messages); 2521 was a different - failure that this PR addresses upstream. - -## Design - -### Tier 1 — Redis oracle +- New `AsyncJobStateManager.all_tasks_processed() -> bool | None` — + one Redis pipeline call (SCARD × 2 + GET) returning a tri-state. +- Replace the `not job.progress.is_complete()` read at `tasks.py:968` + (reaper async_api guard) with the new Redis-backed check. +- WARNING log when Redis state is absent (`None` return), so we can + monitor how often this fires. +- Tests for the 3-state matrix on `all_tasks_processed()` + a reaper + test mirroring the 2521 shape (Redis empty, progress JSONB stuck at + 99.98% STARTED → reaper lands SUCCESS, not REVOKED). + +Out of scope (filed as follow-ups): + +- Race fix in `_update_job_progress` itself. Move counts to atomic + Redis HINCRBY, or scope a row lock to microseconds via raw `jsonb_set`. +- Updating the other call sites that read `progress.is_complete()` + (`tasks.py:654`, `:1331`, `:1361`, `:159`, `:496`). They are not + affected by the clobber pattern under current operation — see + "Why only the reaper" below. +- DB-count fallback for the `None` (Redis-gone) case. Today the + WARNING log + `progress.is_complete()` fallback is sufficient. If + the WARNING fires often enough to matter, follow-up adds a + `pipeline.filter_processed_images()`-based check. +- Job-level method (`Job.processing_complete()`). Inline at the one + call site that needs it; if a future site needs the same logic, + extract then. +- UI changes. `Job.progress` shape is unchanged. +- `'CANCELLED' is not a valid JobState` crash in `tasks.py:1193` + (`zombie_streams._drain_all`). Separate bug surfaced during 2521 + triage; legacy job row has a status not in the JobState enum. + +## Why only the reaper + +The clobber pattern in `_update_job_progress` exists at six call sites +that read `progress.is_complete()`. Only one is user-visible: + +| Site | Race exposure | User-visible consequence | +|---|---|---| +| `tasks.py:968` reaper async_api guard | **Yes** — runs minutes after the racy writes have settled, sees the clobbered final state | **Yes — REVOKED instead of SUCCESS** | +| `tasks.py:654` cleanup gate inside `_update_job_progress` | Same writer is itself the racer; `is_complete()=False` after clobber prevents early cleanup | No — cleanup fires later from terminal-state path | +| `tasks.py:1331` `task_postrun` SUCCESS guard | None — fires at queue-completion, no NATS results yet, single writer | No | +| `tasks.py:1361` `task_failure` ASYNC_API guard | None — same single-writer moment | No | +| `tasks.py:159` `run_job` post-run log | None — log-gating only | No | +| `tasks.py:496` `MLJob.run` zero-images path | None — runs before any concurrent writers | No | -`AsyncJobStateManager.summarize() -> JobCompletionSnapshot`. Single -Redis pipeline round-trip: +This PR fixes (1). The others either have a single writer at the moment +they run, or their failure mode is internal and self-correcting. If +monitoring later shows another site biting, expand then. -```python -pipe.scard(self._get_pending_key("process")) -pipe.scard(self._get_pending_key("results")) -pipe.scard(self._failed_key) -pipe.get(self._total_key) -``` +## Why not put the check inside `progress.is_complete()` -```python -@dataclass -class JobCompletionSnapshot: - state_present: bool # total_raw is not None - total: int # 0 when state_present=False - pending_per_stage: dict[str, int] # {"process": N, "results": M} - failed: int - is_complete: bool # state_present AND all pending == 0 - - # Convenience for callers: - @property - def all_acked(self) -> bool: - return self.state_present and all(v == 0 for v in self.pending_per_stage.values()) -``` +`JobProgress` is a Pydantic model — pure data. Adding Redis I/O behind +its `is_complete()` method: -`is_complete` is True only when state_present AND every stage's pending -set is empty. A job whose Redis state is gone returns -`state_present=False`, regardless of stage state — caller decides what to -do via Tier 2 fallback. +- couples a serializable data shape to a side-effect-laden oracle +- silently makes every API serialization, log dump, and read trip + Redis (caller doesn't see the I/O cost) +- the Pydantic instance has no `job_id` reference, so the method + signature would have to grow an awkward `job_id=` argument +- breaks `JobProgress` reuse outside the Job context -`RedisError` is swallowed inside `summarize()` and returns -`state_present=False`. We do NOT propagate Redis transients up to the -reaper — a Redis blip should not flip a job to REVOKED. +Keep `JobProgress.is_complete()` as cosmetic Pydantic data check. +Do the Redis call at the state-machine call site that needs it. -### Tier 2 — DB Detection-row count +## Design -When Redis state is gone, query DB to determine whether all queued images -have detections from the job's pipeline algorithms: +### Tier 1 — Redis oracle on `AsyncJobStateManager` ```python -def _is_complete_via_db(self) -> bool: - """Slow but durable. Only runs when Redis state is missing - (post-cleanup, post-TTL, or post-restart). Logs a WARNING so - we know how often we fall through to this path. +# ami/ml/orchestration/async_job_state.py +def all_tasks_processed(self) -> bool | None: + """ + Truth signal for whether all NATS-tracked tasks have been processed + out of both pending sets (process + results stages). + + Returns: + True — both pending sets are empty AND total > 0 (all tasks SREM'd) + False — at least one pending set has members (real work outstanding) + None — Redis state absent (cleaned up, TTL expired, never initialized) + Caller decides what to do; do not assume completeness. + + Scope: Redis tracks the per-image SREM lifecycle for the `process` + and `results` stages only. It does NOT know about the `collect` stage + or any future post-results stages. Callers that need "is the entire + Job complete across ALL stages" must combine this with their own + knowledge of stage layout. + + RedisError (transient connection issues) is logged and returned as + None — a Redis blip should NOT flip a job to REVOKED on its own. """ - queued_ids = self._collected_image_ids() # see below - if not queued_ids: - return True # zero-images job is trivially complete - unprocessed = self.pipeline.filter_processed_images( - SourceImage.objects.filter(pk__in=queued_ids) - ).only("pk") - is_complete = not unprocessed.exists() - self.logger.warning( - "job %s: Redis state gone, fell back to DB detection-count check; " - "queued=%d, unprocessed=%d, is_complete=%s", - self.pk, len(queued_ids), unprocessed.count(), is_complete, - ) - return is_complete + try: + redis = self._get_redis() + with redis.pipeline() as pipe: + for stage in self.STAGES: + pipe.scard(self._get_pending_key(stage)) + pipe.get(self._total_key) + results = pipe.execute() + except RedisError as e: + logger.warning( + f"Redis error reading all_tasks_processed for job {self.job_id}: {e}" + ) + return None + + *pending_counts, total_raw = results + if total_raw is None: + return None + total = int(total_raw) + if total == 0: + # Zero-images job: trivially "processed" — no NATS work to do. + return True + return all(count == 0 for count in pending_counts) ``` -`self._collected_image_ids()`: derive from `Job.source_image_collection`, -`Job.source_image_single`, or `Job.deployment` exactly the way -`MLJob.run`/`pipeline.collect_images` does. Or, if the cost of -re-resolving is too high, persist the queued id list at dispatch time. -The `summarize()` payload includes `total` from Redis but Tier 2 needs -the full id set, so we cannot rely on Redis alone here. +One Redis pipeline round-trip. SCARD is O(1) per key. Negligible at scale. -For the first PR cut, derive on the fly via `pipeline.collect_images()` -to avoid a schema change. If profiling shows this is slow at 5k+ images, -follow-up adds a `Job.queued_image_ids` jsonb column or similar. - -### Tier 3 — sync_api fallback - -`progress.is_complete()` for non-async_api jobs. Unchanged. Sync mode -does not use Redis at all, and its writes are serialized by the single -Celery task driving each stage. - -### `Job.is_actually_complete()` +### Reaper guard replacement ```python -def is_actually_complete(self) -> bool: - """Source-of-truth completion check. Use everywhere instead of - self.progress.is_complete() for state-machine decisions. - self.progress.is_complete() remains valid for cosmetic/UI reads. - """ - if self.dispatch_mode == JobDispatchMode.ASYNC_API: - snap = AsyncJobStateManager(self.pk).summarize() - if snap.state_present: - return snap.is_complete - return self._is_complete_via_db() - return self.progress.is_complete() +# ami/jobs/tasks.py:968 — replacement +is_terminal = celery_state in states.READY_STATES +is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API +if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: + processed = AsyncJobStateManager(job.pk).all_tasks_processed() + if processed is False: + # NATS tasks still pending in Redis. mark_lost_images_failed runs + # before this and would have reconciled if it could; if we land + # here with pending > 0, the reconciler couldn't help (consumer + # gone, etc.) and REVOKE is the correct outcome. + is_terminal = False + elif processed is None: + # Redis state gone — rare. Fall back to the racy progress oracle + # but log so we know how often this path fires. If it fires + # often, follow up with a DB-count fallback. + logger.warning( + "Reaper for job %s: Redis state absent, falling back to " + "progress.is_complete()", + job.pk, + ) + if not job.progress.is_complete(): + is_terminal = False + # processed is True -> trust Celery's terminal state, leave is_terminal as-is ``` -Method on the Job model, not a free function — lets call sites read like -`if not job.is_actually_complete(): ...` matching the existing -`progress.is_complete()` pattern. +Sync_api jobs hit none of this branch — they fall through `is_terminal = +celery_state in READY_STATES` exactly as today. -### Call sites to update +### Behavior change matrix -| File:line | Current call | Replacement | +| Scenario | Today | After this PR | |---|---|---| -| `ami/jobs/tasks.py:968` (reaper async_api guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | -| `ami/jobs/tasks.py:654` (`_update_job_progress` cleanup gate) | `if job.progress.is_complete():` | `if job.is_actually_complete():` | -| `ami/jobs/tasks.py:1331` (`task_postrun` SUCCESS guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | -| `ami/jobs/tasks.py:1361` (`task_failure` ASYNC_API guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` | -| `ami/jobs/models.py:496` (zero-images path in `MLJob.run`) | `if job.progress.is_complete():` | leave as-is — runs before Redis is initialized | -| `ami/jobs/models.py:159` (`run_job` post-task log) | `not job.progress.is_complete()` | leave as-is — log gating only, not state-machine | - -The four state-machine sites (rows 1-4) are the ones that decide SUCCESS -vs REVOKED vs deferred-cleanup. They all currently misread the clobbered -`progress.is_complete()`. The two leave-as-is rows are pre-Redis -initialization or log-only — the replacement would be no-op or wrong. +| async_api job done; Redis empty; `progress` clobbered (2521 shape) | REVOKED | **SUCCESS** | +| async_api job mid-flight; Celery SUCCESS but Redis pending > 0 | REVOKED (was already, but for the wrong reason) | REVOKED (same outcome, correct reason: NATS tasks pending) | +| async_api job done; Redis pending == 0; `progress` accurate | SUCCESS | SUCCESS (no change) | +| async_api job done; Redis state cleaned up early | varies — depends on `progress` | falls back to `progress.is_complete()` with WARNING log | +| sync_api job at reaper | unchanged | unchanged | ### Tests -`ami/jobs/tests/test_async_job_state.py` (new file or extend existing): - -- `summarize()` against fresh init (state_present=True, all pending=total) -- `summarize()` mid-flight (some pending, some processed) -- `summarize()` complete (state_present=True, all pending=0, is_complete=True) -- `summarize()` cleaned up (state_present=False, is_complete=False) -- `summarize()` Redis transient (RedisError → state_present=False) - -`ami/jobs/tests/test_job_completion.py` (new): - -- `Job.is_actually_complete()` — async_api + Redis present + all pending=0 → True -- async_api + Redis present + pending>0 → False (mid-flight defer) -- async_api + Redis gone + DB shows all detections present → True -- async_api + Redis gone + DB shows missing detections → False (logs WARNING) -- sync_api → falls through to `progress.is_complete()`, unchanged -- async_api with 2521 shape (Redis empty, progress JSONB stuck at 99.98%) → True - -`ami/jobs/tests/test_tasks.py` extension: - -- Reaper reaches the async_api guard with stale `progress` (process=99.98% - STARTED) but Redis empty → lands SUCCESS, not REVOKED. Mirrors 2521. +Extend `ami/ml/tests.py` (or `ami/jobs/tests/test_async_job_state.py`): + +- `all_tasks_processed()` after fresh `initialize_job([...])` → False + (all images still in pending sets) +- After SREM-ing every id from both stages → True +- After SREM-ing some → False +- With `total=0` (zero-images path) → True +- With Redis state never initialized → None +- With Redis state cleaned up via `cleanup()` → None +- Simulated `RedisError` on the pipeline → None + WARNING logged + +Extend `ami/jobs/tests/test_tasks.py` — `TestCheckStaleJobs`: + +- async_api job, Celery SUCCESS, Redis pending empty, `progress` + hand-set to `process=99.98% STARTED processed=4509 remaining=1` + (mirroring 2521) → reaper lands SUCCESS, not REVOKED. Asserts + `is_terminal=True` path was taken. +- async_api job, Celery SUCCESS, Redis pending > 0 → REVOKED (no regression + for genuine partial-completion). +- async_api job, Redis state absent, `progress.is_complete()=True`, + Celery SUCCESS → SUCCESS via the fallback path. Assert WARNING log + emitted. +- async_api job, Redis state absent, `progress.is_complete()=False`, + Celery SUCCESS → REVOKED via the fallback path. Assert WARNING log emitted. +- sync_api job at reaper → behavior unchanged (existing tests cover). ### Logging -Tier 2 fires a WARNING with `job_id, queued_count, unprocessed_count, -is_complete`. This is the signal we use to monitor: +`logger.warning(...)` once per stale-async_api job per tick when Redis +state is absent. Format includes `job.pk` so we can grep. -- How often do we fall through to DB? (Should be rare. High rate == - Redis cleanup is firing too aggressively, separate bug.) -- When we fall through, do we usually find complete or incomplete? - (Mostly complete == Redis TTL expiring on long-finished jobs. Mostly - incomplete == genuine partial-completion that the reaper will REVOKE, - which is the correct outcome.) +Monitoring expectation post-deploy: -Tier 1 is silent in the happy path; only ERRORS log on RedisError. +- Drop in REVOKED async_api jobs whose final `Job.progress` snapshot has + `process>=99% STARTED` (the clobber signature). +- Low-rate WARNING lines from the Redis-gone fallback. High rate == + Redis cleanup is firing too aggressively, separate bug to chase. ## Acceptance criteria -- All four state-machine call sites use `Job.is_actually_complete()`. -- Existing test suite `ami.jobs` stays green. -- New tests for the 6+ matrix cases pass. -- E2E run on a real ADC against a 50+ image collection lands SUCCESS. -- Manual chaos test: dispatch real job, manually clobber `Job.progress` - to mimic 2521 shape via Django shell, wait 11 min, verify reaper lands - SUCCESS not REVOKED. Verify Redis pending was empty at decision time. -- Tier 2 WARNING log appears when expected and only when expected. +- `all_tasks_processed()` exists with the three-state semantics. +- Reaper at `tasks.py:968` uses it; sync_api branch unchanged. +- Existing `ami.jobs` test suite stays green. +- New tests for the 3-state matrix + reaper-eats-2521-shape case pass. +- Manual chaos: dispatch real async_api job, wait until completion, then + via Django shell hand-clobber `Job.progress` to mimic 2521. Wait 11 + min. Confirm reaper lands SUCCESS, not REVOKED. Confirm Redis was + clean at decision time. ## Rollout -Single PR. No feature flag. The new method strictly fixes a -known-broken path; no breaking change. Deploy + monitor reaper logs for: - -- Drop in REVOKEd async_api jobs that have `progress=~100%` in their final - Job row state. -- Any unexpected uptick in `fell back to DB detection-count check` - WARNING lines. +Single PR. No feature flag. The change is strictly additive at the call +site (one site replaced; no API or schema change). Deploy + monitor. ## Risks -- **DB fallback is slow on big jobs**: bounded — only fires after Redis - cleanup, on a stale job at the reaper tick. Worst case one slow query - per stale job per 15-min beat. -- **`pipeline.collect_images` re-resolves the queued list**: if the - Collection has changed since dispatch (rare for completed jobs but - possible), the recomputed id list may not match the originally queued - list. Acceptable for v1 — the WARNING log will show the count and we - can iterate. -- **Race window between Redis SREM and `summarize()`**: a result that - lands during the reaper's `summarize()` call could leave Redis briefly - in an inconsistent intermediate state. SCARD is atomic per-key, so the - worst case is the snapshot misses a just-completed image and reports - pending=1 → reaper waits another tick → next tick sees 0 → SUCCESS. - Bounded. +- **Race window between Redis SREM and `all_tasks_processed()`**: a + result that lands during the reaper's call could leave Redis briefly + in an intermediate state. SCARD is atomic per-key; worst case is the + snapshot misses a just-completed image and reports pending=1 → reaper + defers → next tick sees pending=0 → SUCCESS. Bounded. +- **Redis-gone fallback rate**: if WARNING fires frequently in + production, we punted too early on Tier 2 (DB count). Mitigation: + add `pipeline.filter_processed_images()`-based fallback in a + follow-up PR. Bounded — only fires when Redis state is absent AND the + job is past the reaper cutoff. +- **Sync_api jobs unaffected**: dispatch_mode gate ensures non-async + jobs hit the existing path. Verified by inspection + sync_api reaper + tests. ## Follow-ups (separate PRs) -1. **Make `_update_job_progress` writes atomic.** Either re-introduce a - tightly-scoped `select_for_update`, or move counts (`detections`, - `classifications`, `captures`) to Redis HINCRBY counters and have a - separate sync task mirror them into `Job.progress` periodically. The - cosmetic counter drift admitted at `tasks.py:563-571` becomes invisible - under (1), and the failure mode in this PR's plan goes away even - without `is_actually_complete()`. -2. **Persist `queued_image_ids` on the Job row at dispatch time.** - Replaces the on-the-fly `pipeline.collect_images` call in Tier 2. -3. **Fix `'CANCELLED' is not a valid JobState`** in - `ami/jobs/tasks.py:1193` (`_drain_all`). Separate bug surfaced during - 2521 investigation — `zombie_streams` check is throwing every tick on - a legacy job row whose `status='CANCELLED'` is not in the JobState - enum (only `CANCELING` / `REVOKED` are valid). +1. **Make `_update_job_progress` writes atomic.** Either tightly-scoped + `select_for_update`, or move counts (`detections`, `classifications`, + `captures`) to Redis HINCRBY counters with a separate sync task + mirroring them into `Job.progress`. The cosmetic counter drift + admitted at `tasks.py:563-571` becomes invisible, and the failure + mode this PR addresses goes away even without the new method. +2. **DB-count fallback for the Redis-gone case** if monitoring shows + the WARNING firing often. `pipeline.filter_processed_images()` + against the Job's queued image set. +3. **Persist `queued_image_ids` on the Job row at dispatch** to make + (2) cheaper. +4. **Fix `'CANCELLED' is not a valid JobState`** in `tasks.py:1193`. +5. **Expand `all_tasks_processed()` usage** to the other 5 sites if + monitoring shows a clobber-driven failure at any of them. ## Diagnostic evidence (job 2521) @@ -282,23 +295,23 @@ Captured from prod via `ami-devops` on 2026-04-29: - NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0 num_ack_pending=0 num_redelivered=0` - Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml - log shows no `Stage 'X' progress lifted to 100% by max() guard` warning - (the diagnostic hook for this race), suggesting both writers had stale - reads. -- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` at - 11:45:45.510, then a later `99.97782%` write to results, then silence - for 14.5 min until reaper at 12:00:00.117. + log shows no `Stage 'X' progress lifted to 100% by max() guard` + warning (the diagnostic hook for this race), suggesting both writers + had stale reads. +- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` + at 11:45:45.510, then a later `99.97782%` write to results, then + silence for 14.5 min until reaper at 12:00:00.117. - Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages: collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS` - — note the reaper's in-memory snapshot read shows 100% at the moment of - REVOKE. The persisted blob at that moment was 99.98%. The + — note the reaper's in-memory snapshot read shows 100% at the moment + of REVOKE. The persisted blob at that moment was 99.98%. The `progress.is_complete()` call inside the guard re-read the persisted - blob (or evaluated against the 100%-but-STARTED status, which fails the - `final_states` check in `models.py:268`). Either way: guard fired, - REVOKED. + blob (or evaluated against the 100%-but-STARTED status, which fails + the `final_states` check in `models.py:268`). Either way: guard + fired, REVOKED. This is the second case (status STARTED with progress=100% on the same -stage) — it confirms the race is on the `status` field write, not just +stage) — confirms the race is on the `status` field write, not just the percentage. Both writers had stale snapshots; the slower one overwrote the SUCCESS-status with STARTED. From c828d9e1f5b3577a22b9e8fb9f28ef99bc18b0ed Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Wed, 29 Apr 2026 14:28:57 -0700 Subject: [PATCH 3/6] docs(jobs): add NEXT_SESSION_PROMPT for fix/job-complete-check impl MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Resume instructions for the next session — implementation steps for all_tasks_processed() + reaper guard swap, test cases, e2e chaos validation, follow-up tickets, and the saved diagnostic log paths on the ami-devops host. References the plan doc at docs/claude/planning/job-complete-check.md (commit 10cdac28). Co-Authored-By: Claude --- docs/claude/prompts/NEXT_SESSION_PROMPT.md | 218 +++++++++++++++++++++ 1 file changed, 218 insertions(+) create mode 100644 docs/claude/prompts/NEXT_SESSION_PROMPT.md diff --git a/docs/claude/prompts/NEXT_SESSION_PROMPT.md b/docs/claude/prompts/NEXT_SESSION_PROMPT.md new file mode 100644 index 000000000..baca7df2f --- /dev/null +++ b/docs/claude/prompts/NEXT_SESSION_PROMPT.md @@ -0,0 +1,218 @@ +# Next session: implement `fix/job-complete-check` + +Branch: `fix/job-complete-check` at commit `10cdac28` (worktree +`/home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks`). + +Plan: `docs/claude/planning/job-complete-check.md` (read first — has the +full design, risks, follow-ups, and 2521 evidence). + +## TL;DR + +Production job 2521 ended REVOKED because `_update_job_progress` +(`ami/jobs/tasks.py:556`) writes the entire `Job.progress` JSONB blob +without a row lock since PR #1261 dropped `select_for_update`. Two +concurrent workers raced; slower committer with `processed=4509 +remaining=1` clobbered the faster committer's `processed=4510 SUCCESS` +write. NATS + Redis both showed all 4510 tasks done; only `Job.progress` +lied. Reaper at `tasks.py:968` consults `progress.is_complete()` → +False (because clobbered) → REVOKED. + +Fix: add `AsyncJobStateManager.all_tasks_processed() -> bool | None`, +inline at the reaper guard. Sync_api unaffected. Other 5 `progress.is_complete()` +readers untouched (not bitten by clobber). Pydantic `JobProgress` stays +pure data — Redis I/O lives on the manager class. + +## Implementation steps + +### 1. `all_tasks_processed()` + +File: `ami/ml/orchestration/async_job_state.py` + +After `get_pending_image_ids()` (around line 229), add: + +```python +def all_tasks_processed(self) -> bool | None: + """Tri-state truth signal for NATS-task SREM completeness across both + process and results pending sets. + + True — both pending sets empty AND total > 0 (or total == 0) + False — at least one pending set has members + None — Redis state absent (cleaned up, expired, never initialized, + or transient RedisError) + + Scope: tracks NATS task lifecycle only; does not know about `collect` + or any future post-results stages. + """ + try: + redis = self._get_redis() + with redis.pipeline() as pipe: + for stage in self.STAGES: + pipe.scard(self._get_pending_key(stage)) + pipe.get(self._total_key) + results = pipe.execute() + except RedisError as e: + logger.warning( + f"Redis error reading all_tasks_processed for job {self.job_id}: {e}" + ) + return None + + *pending_counts, total_raw = results + if total_raw is None: + return None + if int(total_raw) == 0: + return True + return all(count == 0 for count in pending_counts) +``` + +### 2. Reaper guard replacement + +File: `ami/jobs/tasks.py:963-969` + +Replace: +```python +is_terminal = celery_state in states.READY_STATES +is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API +if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete(): + is_terminal = False +``` + +With: +```python +is_terminal = celery_state in states.READY_STATES +is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API +if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: + processed = AsyncJobStateManager(job.pk).all_tasks_processed() + if processed is False: + is_terminal = False + elif processed is None: + logger.warning( + "Reaper for job %s: Redis state absent, falling back to " + "progress.is_complete()", + job.pk, + ) + if not job.progress.is_complete(): + is_terminal = False + # processed is True -> trust Celery's terminal state +``` + +Need import: `from ami.ml.orchestration.async_job_state import AsyncJobStateManager` +(check if already imported at top of file). + +### 3. Tests + +**`ami/ml/tests.py`** — add `TestAsyncJobStateManagerAllTasksProcessed`: +- fresh `initialize_job([...])` → False +- after SREM-ing every id from both stages → True +- after SREM-ing only one stage → False +- `total=0` zero-images → True +- never initialized → None +- after `cleanup()` → None +- mocked `RedisError` on pipeline → None + WARNING + +**`ami/jobs/tests/test_tasks.py::TestCheckStaleJobs`** — add cases: +- async_api + Celery SUCCESS + Redis empty + `progress` clobbered to + `process=99.98% STARTED processed=4509 remaining=1` → SUCCESS, not REVOKED + (the 2521 case) +- async_api + Celery SUCCESS + Redis pending > 0 → REVOKED +- async_api + Redis state absent + `progress.is_complete()=True` → SUCCESS + via fallback + assert WARNING log +- async_api + Redis state absent + `progress.is_complete()=False` → REVOKED + via fallback + assert WARNING log +- sync_api → existing behavior unchanged + +### 4. Run tests + +```bash +cd /home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks +docker compose -f docker-compose.ci.yml run --rm django \ + python manage.py test ami.ml.tests.TestAsyncJobStateManagerAllTasksProcessed ami.jobs.tests.test_tasks --keepdb +``` + +If tests pass, run the full `ami.jobs` suite: +```bash +docker compose -f docker-compose.ci.yml run --rm django \ + python manage.py test ami.jobs --keepdb +``` + +### 5. E2E chaos validation + +Per the chaos pattern in `docs/claude/debugging/chaos-scenarios.md`: + +1. Dispatch real async_api job (50+ images) on local stack with ADC + running. Wait until SUCCESS naturally. +2. Via Django shell, hand-clobber `Job.progress` to mimic 2521: + ```python + from ami.jobs.models import Job, JobState + j = Job.objects.get(pk=) + j.status = JobState.STARTED + j.progress.update_stage("process", status=JobState.STARTED, progress=0.9998, + processed=N-1, remaining=1, failed=0) + j.updated_at = j.updated_at - datetime.timedelta(minutes=15) + j.save() + ``` +3. Trigger `jobs_health_check` manually: + ```python + from ami.jobs.tasks import jobs_health_check + jobs_health_check() + ``` +4. Verify: job lands SUCCESS (not REVOKED). Verify Redis still empty + (cleanup runs from terminal-state path). + +### 6. PR + +```bash +gh pr create --title "fix(jobs): reaper checks Redis directly via all_tasks_processed" --body "..." +``` + +PR body template — start from the plan doc's "Why", "Behavior change +matrix", and "Risks" sections. Reference job 2521. + +## Pre-requisites and gotchas + +- Worktree currently at `dangling-tasks` path but on branch + `fix/job-complete-check`. `git status` should be clean. +- Pydantic v1 in container — `.dict()` / `.json()` not `.model_dump()`. +- CI compose file is `docker-compose.ci.yml` for tests with `--keepdb`. +- Stale RabbitMQ connections after days: restart Django/Celery if + enqueueing throws `ConnectionResetError`. +- `cachalot_disabled()` decorator wraps `process_nats_pipeline_result` + — not relevant here but flagged in case a related test fails on cache. + +## Saved diagnostic logs from 2521 + +On the ami-devops host (not local): +- `/tmp/job2521_celery_72h.log` +- `/tmp/job2521_worker_jobs_72h.log` +- `/tmp/job2521_worker_ml_72h.log` +- `/tmp/job2521_worker_default_72h.log` +- `/tmp/job2521_beat_72h.log` + +If next session needs to re-grep, dispatch a subagent in +`~/Projects/AMI/ami-devops/` rather than re-fetching from prod. + +## Open follow-up tickets to file (not in this PR) + +1. Make `_update_job_progress` writes atomic. Either re-introduce a + tightly-scoped `select_for_update`, or move counts to Redis HINCRBY + counters with separate sync task. Solves the underlying race that + this PR works around. +2. DB-count fallback for the Redis-gone case. Add only if the WARNING + log fires often in production. +3. Persist `queued_image_ids` on Job at dispatch — makes (2) cheaper. +4. Fix `'CANCELLED' is not a valid JobState` crash in + `ami/jobs/tasks.py:1193` (`zombie_streams._drain_all`). Throws + every beat tick on a legacy job row whose status is not in the + JobState enum. +5. Expand `all_tasks_processed()` usage to other 5 `progress.is_complete()` + sites if monitoring shows clobber-driven failures elsewhere. + +## Branch state + +``` +$ git log --oneline fix/job-complete-check ^main +10cdac28 docs(jobs): narrow plan to reaper-only Redis check via all_tasks_processed +091e2fc7 docs(jobs): plan reusable Job.is_actually_complete() backed by Redis +``` + +(091e2fc7 is the original broader plan — superseded by 10cdac28. Do NOT +implement from 091e2fc7.) From 36b17c420855d3eb6a5b265cc2231d6a99143bb3 Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Wed, 29 Apr 2026 14:49:53 -0700 Subject: [PATCH 4/6] fix(jobs): reaper checks Redis directly via all_tasks_processed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Job 2521 ended REVOKED despite NATS+Redis both confirming all 4510 tasks processed. Root cause: `_update_job_progress` writes the entire `Job.progress` JSONB blob without a row lock since #1261 dropped `select_for_update`. Concurrent workers raced; a slower committer clobbered the faster committer's SUCCESS write, leaving progress at `processed=4509 remaining=1 STARTED`. The reaper read the clobbered `progress.is_complete()` -> False -> REVOKE. Add `AsyncJobStateManager.all_tasks_processed() -> bool | None` (tri-state) backed directly by Redis SCARD across both pending sets, and inline it at the reaper guard in `check_stale_jobs`. Sync_api jobs unchanged (dispatch_mode gate). When Redis state is absent (cleanup, expiry, never initialized, or transient RedisError), fall back to `progress.is_complete()` and emit a WARNING so monitoring can flag the fallback path. Other 5 `progress.is_complete()` callers stay as-is — only the reaper guard is bitten by the clobber. Atomic-write follow-up tracked separately. Co-Authored-By: Claude --- ami/jobs/tasks.py | 28 +++- ami/jobs/tests/test_tasks.py | 195 ++++++++++++++++++++++++ ami/ml/orchestration/async_job_state.py | 30 ++++ ami/ml/tests.py | 54 +++++++ 4 files changed, 300 insertions(+), 7 deletions(-) diff --git a/ami/jobs/tasks.py b/ami/jobs/tasks.py index f3c996f86..554c3ba18 100644 --- a/ami/jobs/tasks.py +++ b/ami/jobs/tasks.py @@ -909,9 +909,10 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[ For each stale job, checks Celery for a terminal task status. REVOKED is always trusted. For async_api jobs, SUCCESS and FAILURE are only accepted - when job.progress.is_complete() — NATS workers may still be delivering - results after the Celery task finishes. All other cases result in revocation. - Async resources (NATS/Redis) are cleaned up in both branches. + when AsyncJobStateManager.all_tasks_processed() reports True (i.e. the + Redis pending sets are drained). When Redis state is absent, falls back to + job.progress.is_complete(). All other cases result in revocation. Async + resources (NATS/Redis) are cleaned up in both branches. Returns a list of dicts describing what was done to each job. """ @@ -961,12 +962,25 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[ # Treat as unknown state — job will be revoked below. # Only trust terminal Celery states. For async_api jobs, SUCCESS and - # FAILURE are only accepted when progress is complete — NATS workers may - # still be delivering results after the Celery task finishes. + # FAILURE are only accepted when all NATS tasks are processed — workers + # may still be delivering results after the Celery task finishes. + # Consult Redis (source of truth for SREM completeness) directly rather + # than Job.progress.is_complete(), which mirrors a JSONB blob racy under + # concurrent _update_job_progress writes since #1261. is_terminal = celery_state in states.READY_STATES is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API - if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete(): - is_terminal = False + if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: + processed = AsyncJobStateManager(job.pk).all_tasks_processed() + if processed is False: + is_terminal = False + elif processed is None: + logger.warning( + "Reaper for job %s: Redis state absent, falling back to " "progress.is_complete()", + job.pk, + ) + if not job.progress.is_complete(): + is_terminal = False + # processed is True -> trust Celery's terminal state previous_status = job.status if is_terminal: diff --git a/ami/jobs/tests/test_tasks.py b/ami/jobs/tests/test_tasks.py index 89587f3f1..3126f5486 100644 --- a/ami/jobs/tests/test_tasks.py +++ b/ami/jobs/tests/test_tasks.py @@ -1229,3 +1229,198 @@ def test_jobs_health_check_runs_lost_images_before_stale_jobs(self, mock_manager job.refresh_from_db() self.assertEqual(job.status, JobState.SUCCESS.value) + + +class TestCheckStaleJobsReaperGuard(TransactionTestCase): + """Reaper guard for async_api jobs: SUCCESS/FAILURE Celery state is only + accepted when AsyncJobStateManager.all_tasks_processed() reports True. The + earlier guard read Job.progress.is_complete() — racy under concurrent + _update_job_progress writes since #1261 dropped select_for_update. Job 2521 + landed REVOKED with NATS+Redis fully drained because a slower committer + clobbered the SUCCESS write. This class verifies the new Redis-direct path, + the absent-state fallback to progress.is_complete() with a WARNING, and + that sync_api jobs are unaffected. + """ + + def setUp(self): + cache.clear() + self.project = Project.objects.create(name="Reaper Guard Project") + self.pipeline = Pipeline.objects.create(name="Reaper Pipeline", slug="reaper-pipeline") + self.pipeline.projects.add(self.project) + self.collection = SourceImageCollection.objects.create(name="Reaper Coll", project=self.project) + + def tearDown(self): + cache.clear() + + def _stale_async_job(self, *, task_id: str = "reaper-task") -> Job: + job = Job.objects.create( + job_type_key=MLJob.key, + project=self.project, + name="reaper async job", + pipeline=self.pipeline, + source_image_collection=self.collection, + dispatch_mode=JobDispatchMode.ASYNC_API, + ) + job.task_id = task_id + job.update_status(JobState.STARTED, save=True) + return job + + def _mark_stale(self, job: Job) -> None: + """Push updated_at back past STALLED_JOBS_MAX_MINUTES via raw update so + Job.save() side-effects (auto_now) don't undo it. Call AFTER any helper + that touches the job model.""" + Job.objects.filter(pk=job.pk).update( + updated_at=datetime.datetime.now() - datetime.timedelta(minutes=Job.STALLED_JOBS_MAX_MINUTES + 1) + ) + job.refresh_from_db() + + def _set_progress_clobbered(self, job: Job, total: int, processed: int) -> None: + """Mimic the 2521 Job.progress shape: process stage at processed/total + with status STARTED, even though Redis was actually fully drained.""" + progress = job.progress + collect = progress.get_stage("collect") + collect.progress = 1.0 + collect.status = JobState.SUCCESS + progress.update_stage( + "process", + progress=processed / total, + status=JobState.STARTED, + processed=processed, + remaining=total - processed, + failed=0, + ) + progress.update_stage( + "results", + progress=processed / total, + status=JobState.STARTED, + captures=processed, + ) + job.save() + + def _set_progress_complete(self, job: Job) -> None: + progress = job.progress + for key in ("collect", "process", "results"): + stage = progress.get_stage(key) + stage.progress = 1.0 + stage.status = JobState.SUCCESS + job.save() + + @patch("celery.result.AsyncResult") + def test_async_celery_success_redis_empty_progress_clobbered_lands_success(self, mock_async_result): + """The 2521 case. Pre-fix, this came back REVOKED because the reaper + consulted progress.is_complete() (False, due to clobber). Post-fix, + Redis says all_tasks_processed() → True, so SUCCESS is honored.""" + from ami.jobs.tasks import check_stale_jobs + + job = self._stale_async_job() + ids = [str(i) for i in range(10)] + manager = AsyncJobStateManager(job.pk) + manager.initialize_job(ids) + manager.update_state(set(ids), stage="process") + manager.update_state(set(ids), stage="results") + # Clobber: progress shows mid-flight even though Redis is drained. + self._set_progress_clobbered(job, total=10, processed=9) + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.SUCCESS + + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual( + job.status, + JobState.SUCCESS.value, + f"clobbered progress should not block SUCCESS when Redis says drained; got {job.status}", + ) + + @patch("celery.result.AsyncResult") + def test_async_celery_success_redis_pending_lands_revoked(self, mock_async_result): + """Redis still has pending ids → genuine in-flight; reaper revokes.""" + from ami.jobs.tasks import check_stale_jobs + + job = self._stale_async_job() + ids = [str(i) for i in range(10)] + AsyncJobStateManager(job.pk).initialize_job(ids) + # No SREMs — pending sets still full. + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.SUCCESS + + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual(job.status, JobState.REVOKED.value) + + @patch("celery.result.AsyncResult") + def test_async_celery_success_redis_absent_progress_complete_lands_success(self, mock_async_result): + """Redis state is gone (cleaned up / never initialized). Reaper falls + back to progress.is_complete(). Complete progress → SUCCESS + WARNING.""" + from ami.jobs.tasks import check_stale_jobs + + job = self._stale_async_job() + # Don't initialize Redis state. + self._set_progress_complete(job) + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.SUCCESS + + with self.assertLogs("ami.jobs.tasks", level="WARNING") as cm: + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual(job.status, JobState.SUCCESS.value) + self.assertTrue(any("Redis state absent" in m for m in cm.output)) + + @patch("celery.result.AsyncResult") + def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(self, mock_async_result): + """Redis absent + progress incomplete → REVOKED via fallback + WARNING.""" + from ami.jobs.tasks import check_stale_jobs + + job = self._stale_async_job() + # Don't initialize Redis. Default progress is fresh (not complete). + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.SUCCESS + + with self.assertLogs("ami.jobs.tasks", level="WARNING") as cm: + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual(job.status, JobState.REVOKED.value) + self.assertTrue(any("Redis state absent" in m for m in cm.output)) + + @patch("celery.result.AsyncResult") + def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_async_result): + """sync_api jobs skip the Redis guard entirely — Celery's terminal + state is authoritative. No Redis state initialized; if the new path + leaked into sync_api this would REVOKE.""" + from ami.jobs.tasks import check_stale_jobs + + job = Job.objects.create( + job_type_key=MLJob.key, + project=self.project, + name="reaper sync job", + pipeline=self.pipeline, + source_image_collection=self.collection, + dispatch_mode=JobDispatchMode.SYNC_API, + ) + job.task_id = "reaper-sync-task" + job.update_status(JobState.STARTED, save=True) + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.SUCCESS + + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual(job.status, JobState.SUCCESS.value) diff --git a/ami/ml/orchestration/async_job_state.py b/ami/ml/orchestration/async_job_state.py index 26e7c3024..4c9502a60 100644 --- a/ami/ml/orchestration/async_job_state.py +++ b/ami/ml/orchestration/async_job_state.py @@ -228,6 +228,36 @@ def get_pending_image_ids(self) -> set[str]: return set() return {m.decode() if isinstance(m, (bytes, bytearray)) else str(m) for m in members} + def all_tasks_processed(self) -> bool | None: + """Tri-state truth signal for NATS-task SREM completeness across both + process and results pending sets. + + True — both pending sets empty AND total > 0 (or total == 0) + False — at least one pending set has members + None — Redis state absent (cleaned up, expired, never initialized, + or transient RedisError) + + Scope: tracks NATS task lifecycle only; does not know about `collect` + or any future post-results stages. + """ + try: + redis = self._get_redis() + with redis.pipeline() as pipe: + for stage in self.STAGES: + pipe.scard(self._get_pending_key(stage)) + pipe.get(self._total_key) + results = pipe.execute() + except RedisError as e: + logger.warning(f"Redis error reading all_tasks_processed for job {self.job_id}: {e}") + return None + + *pending_counts, total_raw = results + if total_raw is None: + return None + if int(total_raw) == 0: + return True + return all(count == 0 for count in pending_counts) + def cleanup(self) -> None: """ Delete all Redis keys associated with this job. diff --git a/ami/ml/tests.py b/ami/ml/tests.py index 18549d8cb..0297e4d8f 100644 --- a/ami/ml/tests.py +++ b/ami/ml/tests.py @@ -1407,6 +1407,60 @@ def test_update_state_returns_none_when_state_genuinely_missing(self): progress = self.manager.update_state({"img1", "img2"}, "process") self.assertIsNone(progress) + def test_all_tasks_processed_fresh_init_returns_false(self): + """Just-initialized job has all images pending in both stages.""" + self._init_and_verify(self.image_ids) + self.assertFalse(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_after_drain_returns_true(self): + """SREM-ing every id from both stages → True.""" + self._init_and_verify(self.image_ids) + self.manager.update_state(set(self.image_ids), "process") + self.manager.update_state(set(self.image_ids), "results") + self.assertTrue(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_partial_stage_returns_false(self): + """Process drained but results still pending → False.""" + self._init_and_verify(self.image_ids) + self.manager.update_state(set(self.image_ids), "process") + # results stage untouched + self.assertFalse(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_zero_total_returns_true(self): + """A zero-images job is trivially complete.""" + self.manager.initialize_job([]) + self.assertTrue(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_never_initialized_returns_none(self): + """No init → total key absent → None (caller falls back).""" + # Do NOT call initialize_job + self.assertIsNone(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_after_cleanup_returns_none(self): + """After cleanup() the total key is gone → None.""" + self._init_and_verify(self.image_ids) + self.manager.cleanup() + self.assertIsNone(self.manager.all_tasks_processed()) + + def test_all_tasks_processed_returns_none_on_redis_error(self): + """Transient RedisError → WARNING + None (caller falls back).""" + from unittest.mock import MagicMock, patch + + from redis.exceptions import RedisError + + self._init_and_verify(self.image_ids) + + pipe = MagicMock() + pipe.execute.side_effect = RedisError("Connection reset by peer") + fake_redis = MagicMock() + fake_redis.pipeline.return_value.__enter__.return_value = pipe + + with patch.object(self.manager, "_get_redis", return_value=fake_redis): + with self.assertLogs("ami.ml.orchestration.async_job_state", level="WARNING") as cm: + result = self.manager.all_tasks_processed() + self.assertIsNone(result) + self.assertTrue(any("Redis error reading all_tasks_processed" in m for m in cm.output)) + class TestSaveResultsRefreshesDeploymentCounts(TestCase): """save_results must refresh Deployment cached counts, not just Event counts. From 0a7480b87dad54b6d4f4e3018235e9e02214de4e Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Thu, 30 Apr 2026 16:22:39 -0700 Subject: [PATCH 5/6] docs(jobs): capture state-triangulation discussion (DB + Redis + NATS) Low-priority follow-up to #1276. Three sources of truth for async_api job lifecycle (Job model, Redis sets, NATS consumer state) lie differently; unifying them into a diagnostic view catches bug classes Redis alone can't see (#1247-style stranding, result-handler swallow, hung in-flight workers). Doc proposes snapshot-on-terminal, diagnostics endpoint, JobReconciler.diagnose(), and an admin UI panel. Tracked in #1285. Co-Authored-By: Claude --- .../planning/job-state-triangulation.md | 145 ++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 docs/claude/planning/job-state-triangulation.md diff --git a/docs/claude/planning/job-state-triangulation.md b/docs/claude/planning/job-state-triangulation.md new file mode 100644 index 000000000..0c719f1a4 --- /dev/null +++ b/docs/claude/planning/job-state-triangulation.md @@ -0,0 +1,145 @@ +# Job state triangulation: DB + Redis + NATS + +**Status:** captured for future work — low priority. Tracked in +[#1285](https://github.com/RolnickLab/antenna/issues/1285). Discussion +artifact from session 2026-04-30 following PR #1276. + +## Problem + +The platform has three distinct sources of truth for the lifecycle of an +async_api job: + +1. **DB (`ami.jobs.models.Job`)** — `Job.status` (lifecycle enum) and + `Job.progress` (Pydantic `JobProgress` JSONB). External-facing; what UI + reads. Clobber-prone since #1261 dropped `select_for_update` from + `_update_job_progress` (`ami/jobs/tasks.py`). +2. **Redis (`AsyncJobStateManager`)** — atomic per-image lifecycle: pending + set per stage, failed set, total. Ephemeral (7-day TTL). Single source of + truth used by reaper guard after PR #1276. +3. **NATS JetStream consumer state** — message lifecycle: `delivered`, + `ack_floor`, `num_pending`, `num_ack_pending`, `num_redelivered`, + `num_waiting`. Authoritative for what was dispatched and acked. + +Each lies differently. None alone catches every bug class observed in +production. + +## What each source knows that others don't + +| Bug class | DB | Redis | NATS | Diagnosis signal | +|---|---|---|---|---| +| 2521 clobber (#1276 root cause) | progress incomplete | empty | `ack_floor == delivered` | last-writer race in `_update_job_progress` | +| Lost images (#1247) | stuck STARTED | non-empty | `num_pending > 0`, redelivery exhausted | NATS gave up, Redis still tracks | +| Worker crash loop | stuck STARTED | non-empty | `num_redelivered` climbing | infrastructure issue | +| Result-handler swallow | SUCCESS | failed set non-empty | drained | error path bug | +| Hung in-flight worker | progress 95% | non-empty | `ack_pending > 0` for minutes | worker stalled mid-batch | + +DB alone catches none of these reliably. Redis catches most. Redis+NATS +catches all. + +## Proposed: diagnostics endpoint + +`GET /api/v2/jobs//diagnostics` returning structured triangulation: + +```json +{ + "db": { + "status": "STARTED", + "celery_state": "SUCCESS", + "progress_complete": false, + "stages": {"process": 1.0, "results": 1.0, "collect": 0.0} + }, + "redis": { + "process": {"pending": 0, "total": 450}, + "results": {"pending": 0, "total": 450}, + "failed": [], + "all_tasks_processed": true, + "ttl_seconds": 580000 + }, + "nats": { + "stream": "...", + "consumer": "...", + "delivered": 450, + "ack_floor": 450, + "num_pending": 0, + "num_ack_pending": 0, + "num_redelivered": 12, + "num_waiting": 4 + }, + "divergence": [ + { + "severity": "warn", + "msg": "DB progress incomplete but Redis drained and NATS fully acked — likely _update_job_progress clobber" + } + ] +} +``` + +Cache 5–10s server-side (NATS info is async socket round-trip). + +## Proposed: UI diagnostic panel + +Default jobs detail view: keep existing progress bars (simple). + +Add expandable **Diagnostics** panel (admin/power-user). Per stage row: + +``` +process dispatched acked pending failed redelivered + 450/450 445 5 0 12 + ── NATS ── ─NATS─ ─Redis─ Redis NATS +``` + +Header strip: NATS consumer name, Redis TTL, DB vs Celery vs Redis truth +mismatch flags. + +"Force reconcile" admin button = re-run reaper guard logic on demand. + +## Proposed: snapshot on terminal transition + +On terminal state change (SUCCESS / FAILURE / REVOKED), persist NATS +snapshot into `Job.progress.diagnostics` (or a new `Job.diagnostics` JSONB +field). Forensic record for post-mortems — NATS consumer state is +short-lived after job completion. + +## Reuse for reaper + +`AsyncJobStateManager.all_tasks_processed()` (added in #1276) is half of +this. Promote to `JobReconciler.diagnose()` returning the full structured +triangulation. Reaper consumes `diagnose().is_terminal_safe`. Diagnostics +endpoint consumes the same call. Single source of triangulation logic. + +## Tradeoffs + +- NATS info call needs JetStream credentials in Django (already have via + natsconn work). +- More numbers = more confusion for non-admins → gate diag panel by + permission. +- NATS consumer state goes stale fast post-completion (consumers are + deleted). Snapshot-on-terminal mitigates. +- Adds a per-job NATS round-trip on diag endpoint hit. Cache + admin-only + surface keeps cost bounded. + +## Recommended implementation order + +1. **Snapshot-on-terminal** first — cheap, no UI work, big forensic value. +2. **Diagnostics endpoint** + JSON shape. Lets reaper + admin tooling + consume it; no UI needed yet. +3. **Promote reaper to consume `JobReconciler.diagnose()`** — replaces + inline tri-state branch. +4. **UI diagnostic panel** last, once endpoint shape is stable. + +## Related tickets + +- #1232 — Redis-only follow-ups from PR #1231 retry-conflation work +- #721 — Backend bug: status stuck STARTED (user-facing symptom) +- #1085 — PSv2: integrate incomplete job monitoring +- #1134 — COMPLETED job state for partial-error jobs +- #1112 — Worker visibility per job +- #1166 — Processing service status indicator (UI side) +- #1168 — Async jobs hang when NATS exhausts max_deliver + +## Why this is low priority + +PR #1276 closes the immediate clobber-revoke loop with Redis-direct check. +Triangulation is the next step up: forensic visibility + catching adjacent +bug classes (#1247-style, result-handler swallow). Not blocking; pull when +the next stuck-job report lands and Redis-only signal is insufficient. From 563530f9db8f44ed95d5a2d6737e2de1f7cca116 Mon Sep 17 00:00:00 2001 From: Michael Bunsen Date: Wed, 17 Jun 2026 16:27:28 -0700 Subject: [PATCH 6/6] fix(jobs): scope reaper fast-path to Celery SUCCESS, not FAILURE Address review feedback on the stale-job reaper guard. CodeRabbit (major): the reaper trusted both Celery SUCCESS and FAILURE as terminal for drained async_api jobs. But update_job_failure() defers post-queue run_job failures for async_api jobs to the async result handler, which decides the final status from the processed/failed counts against FAILURE_THRESHOLD (a drained-but-failed task can still resolve to SUCCESS). Trusting Celery FAILURE here would force the job to FAILURE and bypass that logic. Scope the Redis-backed fast-path to SUCCESS only; a stale async_api job whose Celery task ended FAILURE now falls through to the revoke branch instead. Add a regression test. Copilot (nit): the fallback WARNING said "Redis state absent" but all_tasks_processed() also returns None on transient RedisError. Reworded to "Redis state unavailable" to cover both causes; updated the assertions. Narrow both reaper save points to update_fields=[status, progress, finished_at, updated_at]. A full save() re-writes the whole row from the snapshot fetched at select_for_update() time, which can clobber `logs` and `progress.errors` written by a concurrent _update_job_progress (no row lock since #1261). Mirrors the narrowed-write pattern already used in _update_job_progress. Also remove three docs/claude session-artifact files that were added on this branch. They contain production job IDs, internal infrastructure references, and other operational detail that does not belong in a public repository. Co-Authored-By: Claude Opus 4.8 (1M context) --- ami/jobs/tasks.py | 60 +++- ami/jobs/tests/test_tasks.py | 55 ++- docs/claude/planning/job-complete-check.md | 318 ------------------ .../planning/job-state-triangulation.md | 145 -------- docs/claude/prompts/NEXT_SESSION_PROMPT.md | 218 ------------ 5 files changed, 89 insertions(+), 707 deletions(-) delete mode 100644 docs/claude/planning/job-complete-check.md delete mode 100644 docs/claude/planning/job-state-triangulation.md delete mode 100644 docs/claude/prompts/NEXT_SESSION_PROMPT.md diff --git a/ami/jobs/tasks.py b/ami/jobs/tasks.py index 554c3ba18..f8f199045 100644 --- a/ami/jobs/tasks.py +++ b/ami/jobs/tasks.py @@ -908,11 +908,15 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[ healthy. Default cutoff is :attr:`Job.STALLED_JOBS_MAX_MINUTES`. For each stale job, checks Celery for a terminal task status. REVOKED is - always trusted. For async_api jobs, SUCCESS and FAILURE are only accepted - when AsyncJobStateManager.all_tasks_processed() reports True (i.e. the - Redis pending sets are drained). When Redis state is absent, falls back to - job.progress.is_complete(). All other cases result in revocation. Async - resources (NATS/Redis) are cleaned up in both branches. + always trusted. For async_api jobs, only SUCCESS is fast-pathed to a + terminal status, and only when AsyncJobStateManager.all_tasks_processed() + reports True (i.e. the Redis pending sets are drained); when Redis state is + unavailable it falls back to job.progress.is_complete(). Celery FAILURE is + not trusted for async_api jobs — update_job_failure() defers the terminal + outcome to the async result handler (FAILURE_THRESHOLD logic), so a stale + FAILED async_api job is revoked rather than forced to FAILURE. All other + cases result in revocation. Async resources (NATS/Redis) are cleaned up in + both branches. Returns a list of dicts describing what was done to each job. """ @@ -961,33 +965,54 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[ ) # Treat as unknown state — job will be revoked below. - # Only trust terminal Celery states. For async_api jobs, SUCCESS and - # FAILURE are only accepted when all NATS tasks are processed — workers - # may still be delivering results after the Celery task finishes. - # Consult Redis (source of truth for SREM completeness) directly rather - # than Job.progress.is_complete(), which mirrors a JSONB blob racy under - # concurrent _update_job_progress writes since #1261. + # Only trust terminal Celery states. For async_api jobs, a SUCCESS + # Celery state is only accepted when all NATS tasks are processed — + # workers may still be delivering results after the Celery task + # finishes. Consult Redis (source of truth for SREM completeness) + # directly rather than Job.progress.is_complete(), which mirrors a + # JSONB blob racy under concurrent _update_job_progress writes since + # #1261. + # + # Celery FAILURE is deliberately NOT fast-pathed to a terminal + # status here: update_job_failure() defers post-queue run_job + # failures for async_api jobs to the async result handler, which + # decides the terminal outcome from the final processed/failed + # counts against FAILURE_THRESHOLD (a drained-but-failed Celery task + # can still resolve to SUCCESS). Trusting Celery FAILURE here would + # force the job to FAILURE and bypass that threshold logic, so a + # stale async_api job whose Celery task ended FAILURE falls through + # to the revoke branch instead. is_terminal = celery_state in states.READY_STATES is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API - if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: + if is_async_api and celery_state == states.SUCCESS: processed = AsyncJobStateManager(job.pk).all_tasks_processed() if processed is False: is_terminal = False elif processed is None: logger.warning( - "Reaper for job %s: Redis state absent, falling back to " "progress.is_complete()", + "Reaper for job %s: Redis state unavailable, falling back to " "progress.is_complete()", job.pk, ) if not job.progress.is_complete(): is_terminal = False - # processed is True -> trust Celery's terminal state + # processed is True -> trust Celery SUCCESS + elif is_async_api and celery_state == states.FAILURE: + # Don't treat Celery FAILURE as authoritative for async_api jobs + # (see comment above); revoke instead of forcing FAILURE. + is_terminal = False previous_status = job.status if is_terminal: if not dry_run: job.update_status(celery_state, save=False) job.finished_at = datetime.datetime.now() - job.save() + # Narrow the write to the fields the reaper actually mutates. + # A full save() would re-write the whole row from the snapshot + # fetched at select_for_update() time, clobbering `logs` and + # `progress.errors` that a concurrent _update_job_progress (no + # row lock since #1261) may commit. update_status() touches + # status + progress.summary.status, so `progress` is included. + job.save(update_fields=["status", "progress", "finished_at", "updated_at"]) else: # Per-job diagnostic: surface enough state at revoke time that an # operator can answer "why was this stalled?" without grepping @@ -1009,7 +1034,10 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[ if not dry_run: job.update_status(JobState.REVOKED, save=False) job.finished_at = datetime.datetime.now() - job.save() + # See note on the terminal branch above: narrow the write so + # the reaper doesn't clobber a concurrent progress writer's + # `logs` / `progress.errors`. + job.save(update_fields=["status", "progress", "finished_at", "updated_at"]) # Async resource cleanup runs outside the transaction — it makes network # calls (NATS/Redis) that should not hold the DB row lock. diff --git a/ami/jobs/tests/test_tasks.py b/ami/jobs/tests/test_tasks.py index 3126f5486..343d54d8a 100644 --- a/ami/jobs/tests/test_tasks.py +++ b/ami/jobs/tests/test_tasks.py @@ -1232,14 +1232,15 @@ def test_jobs_health_check_runs_lost_images_before_stale_jobs(self, mock_manager class TestCheckStaleJobsReaperGuard(TransactionTestCase): - """Reaper guard for async_api jobs: SUCCESS/FAILURE Celery state is only - accepted when AsyncJobStateManager.all_tasks_processed() reports True. The - earlier guard read Job.progress.is_complete() — racy under concurrent - _update_job_progress writes since #1261 dropped select_for_update. Job 2521 - landed REVOKED with NATS+Redis fully drained because a slower committer - clobbered the SUCCESS write. This class verifies the new Redis-direct path, - the absent-state fallback to progress.is_complete() with a WARNING, and - that sync_api jobs are unaffected. + """Reaper guard for async_api jobs: a SUCCESS Celery state is only accepted + when AsyncJobStateManager.all_tasks_processed() reports True. The earlier + guard read Job.progress.is_complete() — racy under concurrent + _update_job_progress writes since #1261 dropped select_for_update. A + production job once landed REVOKED with NATS+Redis fully drained because a + slower committer clobbered the SUCCESS write. This class verifies the new + Redis-direct path, the unavailable-state fallback to progress.is_complete() + with a WARNING, that Celery FAILURE is not fast-pathed to a terminal + FAILURE for async_api jobs, and that sync_api jobs are unaffected. """ def setUp(self): @@ -1375,7 +1376,7 @@ def test_async_celery_success_redis_absent_progress_complete_lands_success(self, job.refresh_from_db() self.assertEqual(job.status, JobState.SUCCESS.value) - self.assertTrue(any("Redis state absent" in m for m in cm.output)) + self.assertTrue(any("Redis state unavailable" in m for m in cm.output)) @patch("celery.result.AsyncResult") def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(self, mock_async_result): @@ -1395,7 +1396,7 @@ def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(sel job.refresh_from_db() self.assertEqual(job.status, JobState.REVOKED.value) - self.assertTrue(any("Redis state absent" in m for m in cm.output)) + self.assertTrue(any("Redis state unavailable" in m for m in cm.output)) @patch("celery.result.AsyncResult") def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_async_result): @@ -1424,3 +1425,37 @@ def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_as job.refresh_from_db() self.assertEqual(job.status, JobState.SUCCESS.value) + + @patch("celery.result.AsyncResult") + def test_async_celery_failure_redis_drained_lands_revoked_not_failure(self, mock_async_result): + """A Celery FAILURE on an async_api job is never fast-pathed to a + terminal FAILURE here, even when Redis reports the pending sets drained. + + update_job_failure() deliberately defers post-queue run_job failures for + async_api jobs to the async result handler, which decides the terminal + outcome from the final processed/failed counts against FAILURE_THRESHOLD + (a drained-but-failed task can still resolve to SUCCESS). The reaper must + not pre-empt that by forcing FAILURE, so the stale job is REVOKED instead. + """ + from ami.jobs.tasks import check_stale_jobs + + job = self._stale_async_job(task_id="reaper-failure-task") + ids = [str(i) for i in range(10)] + manager = AsyncJobStateManager(job.pk) + manager.initialize_job(ids) + manager.update_state(set(ids), stage="process") + manager.update_state(set(ids), stage="results") + self._mark_stale(job) + + from celery import states as celery_states + + mock_async_result.return_value.state = celery_states.FAILURE + + check_stale_jobs() + + job.refresh_from_db() + self.assertEqual( + job.status, + JobState.REVOKED.value, + f"Celery FAILURE must not be forced to terminal FAILURE for async_api; got {job.status}", + ) diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md deleted file mode 100644 index bc0847dfb..000000000 --- a/docs/claude/planning/job-complete-check.md +++ /dev/null @@ -1,318 +0,0 @@ -# Plan: reaper checks Redis directly via `all_tasks_processed()` (fix/job-complete-check) - -## Why - -Production job 2521 (project 192, ~4510 images, async_api) finished with all -4510 NATS messages acked and Redis pending sets empty, but `Job.progress` -JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The -`jobs_health_check` reaper at the 10-min cutoff: - -1. `mark_lost_images_failed` correctly skipped the job (Redis pending was - empty, so `lost_ids` was empty — the documented skip path). -2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard - at `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which - returned False because of the clobbered `process` stage. -3. Reaper REVOKEd the job. 5058 detections / 8129 classifications already - saved to DB; Job row reads REVOKED. - -PR #1244's reconciler is the wrong tool for this symptom — there is nothing -to reconcile in Redis. The bug is upstream: `_update_job_progress` -(`tasks.py:556`) writes the entire `progress` JSONB blob without a row -lock (PR #1261 dropped `select_for_update` to break unrelated row-lock -contention), and two concurrent workers can each read the same pre-state -then commit last-writer-wins. The `max()` guard at `tasks.py:598` only -protects percentage regression *based on what each writer read* — it does -not serialize the readers, so a slower committer with a stale snapshot -clobbers a faster committer's blob. - -`Job.progress` cannot be the source of truth for state-machine decisions -on async_api jobs. This PR makes that explicit at the one user-visible -site: the reaper's async_api guard. Reaper consults Redis directly. -`Job.progress` stays as the cosmetic mirror for the UI. - -Related context: - -- Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md` -- Last-writer-wins admission: `ami/jobs/tasks.py:563-571` -- PR #1261 commit `50677444` removed `select_for_update`. - -## Scope (narrow) - -In scope: - -- New `AsyncJobStateManager.all_tasks_processed() -> bool | None` — - one Redis pipeline call (SCARD × 2 + GET) returning a tri-state. -- Replace the `not job.progress.is_complete()` read at `tasks.py:968` - (reaper async_api guard) with the new Redis-backed check. -- WARNING log when Redis state is absent (`None` return), so we can - monitor how often this fires. -- Tests for the 3-state matrix on `all_tasks_processed()` + a reaper - test mirroring the 2521 shape (Redis empty, progress JSONB stuck at - 99.98% STARTED → reaper lands SUCCESS, not REVOKED). - -Out of scope (filed as follow-ups): - -- Race fix in `_update_job_progress` itself. Move counts to atomic - Redis HINCRBY, or scope a row lock to microseconds via raw `jsonb_set`. -- Updating the other call sites that read `progress.is_complete()` - (`tasks.py:654`, `:1331`, `:1361`, `:159`, `:496`). They are not - affected by the clobber pattern under current operation — see - "Why only the reaper" below. -- DB-count fallback for the `None` (Redis-gone) case. Today the - WARNING log + `progress.is_complete()` fallback is sufficient. If - the WARNING fires often enough to matter, follow-up adds a - `pipeline.filter_processed_images()`-based check. -- Job-level method (`Job.processing_complete()`). Inline at the one - call site that needs it; if a future site needs the same logic, - extract then. -- UI changes. `Job.progress` shape is unchanged. -- `'CANCELLED' is not a valid JobState` crash in `tasks.py:1193` - (`zombie_streams._drain_all`). Separate bug surfaced during 2521 - triage; legacy job row has a status not in the JobState enum. - -## Why only the reaper - -The clobber pattern in `_update_job_progress` exists at six call sites -that read `progress.is_complete()`. Only one is user-visible: - -| Site | Race exposure | User-visible consequence | -|---|---|---| -| `tasks.py:968` reaper async_api guard | **Yes** — runs minutes after the racy writes have settled, sees the clobbered final state | **Yes — REVOKED instead of SUCCESS** | -| `tasks.py:654` cleanup gate inside `_update_job_progress` | Same writer is itself the racer; `is_complete()=False` after clobber prevents early cleanup | No — cleanup fires later from terminal-state path | -| `tasks.py:1331` `task_postrun` SUCCESS guard | None — fires at queue-completion, no NATS results yet, single writer | No | -| `tasks.py:1361` `task_failure` ASYNC_API guard | None — same single-writer moment | No | -| `tasks.py:159` `run_job` post-run log | None — log-gating only | No | -| `tasks.py:496` `MLJob.run` zero-images path | None — runs before any concurrent writers | No | - -This PR fixes (1). The others either have a single writer at the moment -they run, or their failure mode is internal and self-correcting. If -monitoring later shows another site biting, expand then. - -## Why not put the check inside `progress.is_complete()` - -`JobProgress` is a Pydantic model — pure data. Adding Redis I/O behind -its `is_complete()` method: - -- couples a serializable data shape to a side-effect-laden oracle -- silently makes every API serialization, log dump, and read trip - Redis (caller doesn't see the I/O cost) -- the Pydantic instance has no `job_id` reference, so the method - signature would have to grow an awkward `job_id=` argument -- breaks `JobProgress` reuse outside the Job context - -Keep `JobProgress.is_complete()` as cosmetic Pydantic data check. -Do the Redis call at the state-machine call site that needs it. - -## Design - -### Tier 1 — Redis oracle on `AsyncJobStateManager` - -```python -# ami/ml/orchestration/async_job_state.py -def all_tasks_processed(self) -> bool | None: - """ - Truth signal for whether all NATS-tracked tasks have been processed - out of both pending sets (process + results stages). - - Returns: - True — both pending sets are empty AND total > 0 (all tasks SREM'd) - False — at least one pending set has members (real work outstanding) - None — Redis state absent (cleaned up, TTL expired, never initialized) - Caller decides what to do; do not assume completeness. - - Scope: Redis tracks the per-image SREM lifecycle for the `process` - and `results` stages only. It does NOT know about the `collect` stage - or any future post-results stages. Callers that need "is the entire - Job complete across ALL stages" must combine this with their own - knowledge of stage layout. - - RedisError (transient connection issues) is logged and returned as - None — a Redis blip should NOT flip a job to REVOKED on its own. - """ - try: - redis = self._get_redis() - with redis.pipeline() as pipe: - for stage in self.STAGES: - pipe.scard(self._get_pending_key(stage)) - pipe.get(self._total_key) - results = pipe.execute() - except RedisError as e: - logger.warning( - f"Redis error reading all_tasks_processed for job {self.job_id}: {e}" - ) - return None - - *pending_counts, total_raw = results - if total_raw is None: - return None - total = int(total_raw) - if total == 0: - # Zero-images job: trivially "processed" — no NATS work to do. - return True - return all(count == 0 for count in pending_counts) -``` - -One Redis pipeline round-trip. SCARD is O(1) per key. Negligible at scale. - -### Reaper guard replacement - -```python -# ami/jobs/tasks.py:968 — replacement -is_terminal = celery_state in states.READY_STATES -is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API -if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: - processed = AsyncJobStateManager(job.pk).all_tasks_processed() - if processed is False: - # NATS tasks still pending in Redis. mark_lost_images_failed runs - # before this and would have reconciled if it could; if we land - # here with pending > 0, the reconciler couldn't help (consumer - # gone, etc.) and REVOKE is the correct outcome. - is_terminal = False - elif processed is None: - # Redis state gone — rare. Fall back to the racy progress oracle - # but log so we know how often this path fires. If it fires - # often, follow up with a DB-count fallback. - logger.warning( - "Reaper for job %s: Redis state absent, falling back to " - "progress.is_complete()", - job.pk, - ) - if not job.progress.is_complete(): - is_terminal = False - # processed is True -> trust Celery's terminal state, leave is_terminal as-is -``` - -Sync_api jobs hit none of this branch — they fall through `is_terminal = -celery_state in READY_STATES` exactly as today. - -### Behavior change matrix - -| Scenario | Today | After this PR | -|---|---|---| -| async_api job done; Redis empty; `progress` clobbered (2521 shape) | REVOKED | **SUCCESS** | -| async_api job mid-flight; Celery SUCCESS but Redis pending > 0 | REVOKED (was already, but for the wrong reason) | REVOKED (same outcome, correct reason: NATS tasks pending) | -| async_api job done; Redis pending == 0; `progress` accurate | SUCCESS | SUCCESS (no change) | -| async_api job done; Redis state cleaned up early | varies — depends on `progress` | falls back to `progress.is_complete()` with WARNING log | -| sync_api job at reaper | unchanged | unchanged | - -### Tests - -Extend `ami/ml/tests.py` (or `ami/jobs/tests/test_async_job_state.py`): - -- `all_tasks_processed()` after fresh `initialize_job([...])` → False - (all images still in pending sets) -- After SREM-ing every id from both stages → True -- After SREM-ing some → False -- With `total=0` (zero-images path) → True -- With Redis state never initialized → None -- With Redis state cleaned up via `cleanup()` → None -- Simulated `RedisError` on the pipeline → None + WARNING logged - -Extend `ami/jobs/tests/test_tasks.py` — `TestCheckStaleJobs`: - -- async_api job, Celery SUCCESS, Redis pending empty, `progress` - hand-set to `process=99.98% STARTED processed=4509 remaining=1` - (mirroring 2521) → reaper lands SUCCESS, not REVOKED. Asserts - `is_terminal=True` path was taken. -- async_api job, Celery SUCCESS, Redis pending > 0 → REVOKED (no regression - for genuine partial-completion). -- async_api job, Redis state absent, `progress.is_complete()=True`, - Celery SUCCESS → SUCCESS via the fallback path. Assert WARNING log - emitted. -- async_api job, Redis state absent, `progress.is_complete()=False`, - Celery SUCCESS → REVOKED via the fallback path. Assert WARNING log emitted. -- sync_api job at reaper → behavior unchanged (existing tests cover). - -### Logging - -`logger.warning(...)` once per stale-async_api job per tick when Redis -state is absent. Format includes `job.pk` so we can grep. - -Monitoring expectation post-deploy: - -- Drop in REVOKED async_api jobs whose final `Job.progress` snapshot has - `process>=99% STARTED` (the clobber signature). -- Low-rate WARNING lines from the Redis-gone fallback. High rate == - Redis cleanup is firing too aggressively, separate bug to chase. - -## Acceptance criteria - -- `all_tasks_processed()` exists with the three-state semantics. -- Reaper at `tasks.py:968` uses it; sync_api branch unchanged. -- Existing `ami.jobs` test suite stays green. -- New tests for the 3-state matrix + reaper-eats-2521-shape case pass. -- Manual chaos: dispatch real async_api job, wait until completion, then - via Django shell hand-clobber `Job.progress` to mimic 2521. Wait 11 - min. Confirm reaper lands SUCCESS, not REVOKED. Confirm Redis was - clean at decision time. - -## Rollout - -Single PR. No feature flag. The change is strictly additive at the call -site (one site replaced; no API or schema change). Deploy + monitor. - -## Risks - -- **Race window between Redis SREM and `all_tasks_processed()`**: a - result that lands during the reaper's call could leave Redis briefly - in an intermediate state. SCARD is atomic per-key; worst case is the - snapshot misses a just-completed image and reports pending=1 → reaper - defers → next tick sees pending=0 → SUCCESS. Bounded. -- **Redis-gone fallback rate**: if WARNING fires frequently in - production, we punted too early on Tier 2 (DB count). Mitigation: - add `pipeline.filter_processed_images()`-based fallback in a - follow-up PR. Bounded — only fires when Redis state is absent AND the - job is past the reaper cutoff. -- **Sync_api jobs unaffected**: dispatch_mode gate ensures non-async - jobs hit the existing path. Verified by inspection + sync_api reaper - tests. - -## Follow-ups (separate PRs) - -1. **Make `_update_job_progress` writes atomic.** Either tightly-scoped - `select_for_update`, or move counts (`detections`, `classifications`, - `captures`) to Redis HINCRBY counters with a separate sync task - mirroring them into `Job.progress`. The cosmetic counter drift - admitted at `tasks.py:563-571` becomes invisible, and the failure - mode this PR addresses goes away even without the new method. -2. **DB-count fallback for the Redis-gone case** if monitoring shows - the WARNING firing often. `pipeline.filter_processed_images()` - against the Job's queued image set. -3. **Persist `queued_image_ids` on the Job row at dispatch** to make - (2) cheaper. -4. **Fix `'CANCELLED' is not a valid JobState`** in `tasks.py:1193`. -5. **Expand `all_tasks_processed()` usage** to the other 5 sites if - monitoring shows a clobber-driven failure at any of them. - -## Diagnostic evidence (job 2521) - -Captured from prod via `ami-devops` on 2026-04-29: - -- Job row at REVOKE time: - - `process=99.98% STARTED processed=4509 remaining=1` - - `results=100% SUCCESS` - - `progress.errors=[]` -- NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0 - num_ack_pending=0 num_redelivered=0` -- Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml - log shows no `Stage 'X' progress lifted to 100% by max() guard` - warning (the diagnostic hook for this race), suggesting both writers - had stale reads. -- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` - at 11:45:45.510, then a later `99.97782%` write to results, then - silence for 14.5 min until reaper at 12:00:00.117. -- Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages: - collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS` - — note the reaper's in-memory snapshot read shows 100% at the moment - of REVOKE. The persisted blob at that moment was 99.98%. The - `progress.is_complete()` call inside the guard re-read the persisted - blob (or evaluated against the 100%-but-STARTED status, which fails - the `final_states` check in `models.py:268`). Either way: guard - fired, REVOKED. - -This is the second case (status STARTED with progress=100% on the same -stage) — confirms the race is on the `status` field write, not just -the percentage. Both writers had stale snapshots; the slower one -overwrote the SUCCESS-status with STARTED. - -Last updated: 2026-04-29. diff --git a/docs/claude/planning/job-state-triangulation.md b/docs/claude/planning/job-state-triangulation.md deleted file mode 100644 index 0c719f1a4..000000000 --- a/docs/claude/planning/job-state-triangulation.md +++ /dev/null @@ -1,145 +0,0 @@ -# Job state triangulation: DB + Redis + NATS - -**Status:** captured for future work — low priority. Tracked in -[#1285](https://github.com/RolnickLab/antenna/issues/1285). Discussion -artifact from session 2026-04-30 following PR #1276. - -## Problem - -The platform has three distinct sources of truth for the lifecycle of an -async_api job: - -1. **DB (`ami.jobs.models.Job`)** — `Job.status` (lifecycle enum) and - `Job.progress` (Pydantic `JobProgress` JSONB). External-facing; what UI - reads. Clobber-prone since #1261 dropped `select_for_update` from - `_update_job_progress` (`ami/jobs/tasks.py`). -2. **Redis (`AsyncJobStateManager`)** — atomic per-image lifecycle: pending - set per stage, failed set, total. Ephemeral (7-day TTL). Single source of - truth used by reaper guard after PR #1276. -3. **NATS JetStream consumer state** — message lifecycle: `delivered`, - `ack_floor`, `num_pending`, `num_ack_pending`, `num_redelivered`, - `num_waiting`. Authoritative for what was dispatched and acked. - -Each lies differently. None alone catches every bug class observed in -production. - -## What each source knows that others don't - -| Bug class | DB | Redis | NATS | Diagnosis signal | -|---|---|---|---|---| -| 2521 clobber (#1276 root cause) | progress incomplete | empty | `ack_floor == delivered` | last-writer race in `_update_job_progress` | -| Lost images (#1247) | stuck STARTED | non-empty | `num_pending > 0`, redelivery exhausted | NATS gave up, Redis still tracks | -| Worker crash loop | stuck STARTED | non-empty | `num_redelivered` climbing | infrastructure issue | -| Result-handler swallow | SUCCESS | failed set non-empty | drained | error path bug | -| Hung in-flight worker | progress 95% | non-empty | `ack_pending > 0` for minutes | worker stalled mid-batch | - -DB alone catches none of these reliably. Redis catches most. Redis+NATS -catches all. - -## Proposed: diagnostics endpoint - -`GET /api/v2/jobs//diagnostics` returning structured triangulation: - -```json -{ - "db": { - "status": "STARTED", - "celery_state": "SUCCESS", - "progress_complete": false, - "stages": {"process": 1.0, "results": 1.0, "collect": 0.0} - }, - "redis": { - "process": {"pending": 0, "total": 450}, - "results": {"pending": 0, "total": 450}, - "failed": [], - "all_tasks_processed": true, - "ttl_seconds": 580000 - }, - "nats": { - "stream": "...", - "consumer": "...", - "delivered": 450, - "ack_floor": 450, - "num_pending": 0, - "num_ack_pending": 0, - "num_redelivered": 12, - "num_waiting": 4 - }, - "divergence": [ - { - "severity": "warn", - "msg": "DB progress incomplete but Redis drained and NATS fully acked — likely _update_job_progress clobber" - } - ] -} -``` - -Cache 5–10s server-side (NATS info is async socket round-trip). - -## Proposed: UI diagnostic panel - -Default jobs detail view: keep existing progress bars (simple). - -Add expandable **Diagnostics** panel (admin/power-user). Per stage row: - -``` -process dispatched acked pending failed redelivered - 450/450 445 5 0 12 - ── NATS ── ─NATS─ ─Redis─ Redis NATS -``` - -Header strip: NATS consumer name, Redis TTL, DB vs Celery vs Redis truth -mismatch flags. - -"Force reconcile" admin button = re-run reaper guard logic on demand. - -## Proposed: snapshot on terminal transition - -On terminal state change (SUCCESS / FAILURE / REVOKED), persist NATS -snapshot into `Job.progress.diagnostics` (or a new `Job.diagnostics` JSONB -field). Forensic record for post-mortems — NATS consumer state is -short-lived after job completion. - -## Reuse for reaper - -`AsyncJobStateManager.all_tasks_processed()` (added in #1276) is half of -this. Promote to `JobReconciler.diagnose()` returning the full structured -triangulation. Reaper consumes `diagnose().is_terminal_safe`. Diagnostics -endpoint consumes the same call. Single source of triangulation logic. - -## Tradeoffs - -- NATS info call needs JetStream credentials in Django (already have via - natsconn work). -- More numbers = more confusion for non-admins → gate diag panel by - permission. -- NATS consumer state goes stale fast post-completion (consumers are - deleted). Snapshot-on-terminal mitigates. -- Adds a per-job NATS round-trip on diag endpoint hit. Cache + admin-only - surface keeps cost bounded. - -## Recommended implementation order - -1. **Snapshot-on-terminal** first — cheap, no UI work, big forensic value. -2. **Diagnostics endpoint** + JSON shape. Lets reaper + admin tooling - consume it; no UI needed yet. -3. **Promote reaper to consume `JobReconciler.diagnose()`** — replaces - inline tri-state branch. -4. **UI diagnostic panel** last, once endpoint shape is stable. - -## Related tickets - -- #1232 — Redis-only follow-ups from PR #1231 retry-conflation work -- #721 — Backend bug: status stuck STARTED (user-facing symptom) -- #1085 — PSv2: integrate incomplete job monitoring -- #1134 — COMPLETED job state for partial-error jobs -- #1112 — Worker visibility per job -- #1166 — Processing service status indicator (UI side) -- #1168 — Async jobs hang when NATS exhausts max_deliver - -## Why this is low priority - -PR #1276 closes the immediate clobber-revoke loop with Redis-direct check. -Triangulation is the next step up: forensic visibility + catching adjacent -bug classes (#1247-style, result-handler swallow). Not blocking; pull when -the next stuck-job report lands and Redis-only signal is insufficient. diff --git a/docs/claude/prompts/NEXT_SESSION_PROMPT.md b/docs/claude/prompts/NEXT_SESSION_PROMPT.md deleted file mode 100644 index baca7df2f..000000000 --- a/docs/claude/prompts/NEXT_SESSION_PROMPT.md +++ /dev/null @@ -1,218 +0,0 @@ -# Next session: implement `fix/job-complete-check` - -Branch: `fix/job-complete-check` at commit `10cdac28` (worktree -`/home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks`). - -Plan: `docs/claude/planning/job-complete-check.md` (read first — has the -full design, risks, follow-ups, and 2521 evidence). - -## TL;DR - -Production job 2521 ended REVOKED because `_update_job_progress` -(`ami/jobs/tasks.py:556`) writes the entire `Job.progress` JSONB blob -without a row lock since PR #1261 dropped `select_for_update`. Two -concurrent workers raced; slower committer with `processed=4509 -remaining=1` clobbered the faster committer's `processed=4510 SUCCESS` -write. NATS + Redis both showed all 4510 tasks done; only `Job.progress` -lied. Reaper at `tasks.py:968` consults `progress.is_complete()` → -False (because clobbered) → REVOKED. - -Fix: add `AsyncJobStateManager.all_tasks_processed() -> bool | None`, -inline at the reaper guard. Sync_api unaffected. Other 5 `progress.is_complete()` -readers untouched (not bitten by clobber). Pydantic `JobProgress` stays -pure data — Redis I/O lives on the manager class. - -## Implementation steps - -### 1. `all_tasks_processed()` - -File: `ami/ml/orchestration/async_job_state.py` - -After `get_pending_image_ids()` (around line 229), add: - -```python -def all_tasks_processed(self) -> bool | None: - """Tri-state truth signal for NATS-task SREM completeness across both - process and results pending sets. - - True — both pending sets empty AND total > 0 (or total == 0) - False — at least one pending set has members - None — Redis state absent (cleaned up, expired, never initialized, - or transient RedisError) - - Scope: tracks NATS task lifecycle only; does not know about `collect` - or any future post-results stages. - """ - try: - redis = self._get_redis() - with redis.pipeline() as pipe: - for stage in self.STAGES: - pipe.scard(self._get_pending_key(stage)) - pipe.get(self._total_key) - results = pipe.execute() - except RedisError as e: - logger.warning( - f"Redis error reading all_tasks_processed for job {self.job_id}: {e}" - ) - return None - - *pending_counts, total_raw = results - if total_raw is None: - return None - if int(total_raw) == 0: - return True - return all(count == 0 for count in pending_counts) -``` - -### 2. Reaper guard replacement - -File: `ami/jobs/tasks.py:963-969` - -Replace: -```python -is_terminal = celery_state in states.READY_STATES -is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API -if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete(): - is_terminal = False -``` - -With: -```python -is_terminal = celery_state in states.READY_STATES -is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API -if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}: - processed = AsyncJobStateManager(job.pk).all_tasks_processed() - if processed is False: - is_terminal = False - elif processed is None: - logger.warning( - "Reaper for job %s: Redis state absent, falling back to " - "progress.is_complete()", - job.pk, - ) - if not job.progress.is_complete(): - is_terminal = False - # processed is True -> trust Celery's terminal state -``` - -Need import: `from ami.ml.orchestration.async_job_state import AsyncJobStateManager` -(check if already imported at top of file). - -### 3. Tests - -**`ami/ml/tests.py`** — add `TestAsyncJobStateManagerAllTasksProcessed`: -- fresh `initialize_job([...])` → False -- after SREM-ing every id from both stages → True -- after SREM-ing only one stage → False -- `total=0` zero-images → True -- never initialized → None -- after `cleanup()` → None -- mocked `RedisError` on pipeline → None + WARNING - -**`ami/jobs/tests/test_tasks.py::TestCheckStaleJobs`** — add cases: -- async_api + Celery SUCCESS + Redis empty + `progress` clobbered to - `process=99.98% STARTED processed=4509 remaining=1` → SUCCESS, not REVOKED - (the 2521 case) -- async_api + Celery SUCCESS + Redis pending > 0 → REVOKED -- async_api + Redis state absent + `progress.is_complete()=True` → SUCCESS - via fallback + assert WARNING log -- async_api + Redis state absent + `progress.is_complete()=False` → REVOKED - via fallback + assert WARNING log -- sync_api → existing behavior unchanged - -### 4. Run tests - -```bash -cd /home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks -docker compose -f docker-compose.ci.yml run --rm django \ - python manage.py test ami.ml.tests.TestAsyncJobStateManagerAllTasksProcessed ami.jobs.tests.test_tasks --keepdb -``` - -If tests pass, run the full `ami.jobs` suite: -```bash -docker compose -f docker-compose.ci.yml run --rm django \ - python manage.py test ami.jobs --keepdb -``` - -### 5. E2E chaos validation - -Per the chaos pattern in `docs/claude/debugging/chaos-scenarios.md`: - -1. Dispatch real async_api job (50+ images) on local stack with ADC - running. Wait until SUCCESS naturally. -2. Via Django shell, hand-clobber `Job.progress` to mimic 2521: - ```python - from ami.jobs.models import Job, JobState - j = Job.objects.get(pk=) - j.status = JobState.STARTED - j.progress.update_stage("process", status=JobState.STARTED, progress=0.9998, - processed=N-1, remaining=1, failed=0) - j.updated_at = j.updated_at - datetime.timedelta(minutes=15) - j.save() - ``` -3. Trigger `jobs_health_check` manually: - ```python - from ami.jobs.tasks import jobs_health_check - jobs_health_check() - ``` -4. Verify: job lands SUCCESS (not REVOKED). Verify Redis still empty - (cleanup runs from terminal-state path). - -### 6. PR - -```bash -gh pr create --title "fix(jobs): reaper checks Redis directly via all_tasks_processed" --body "..." -``` - -PR body template — start from the plan doc's "Why", "Behavior change -matrix", and "Risks" sections. Reference job 2521. - -## Pre-requisites and gotchas - -- Worktree currently at `dangling-tasks` path but on branch - `fix/job-complete-check`. `git status` should be clean. -- Pydantic v1 in container — `.dict()` / `.json()` not `.model_dump()`. -- CI compose file is `docker-compose.ci.yml` for tests with `--keepdb`. -- Stale RabbitMQ connections after days: restart Django/Celery if - enqueueing throws `ConnectionResetError`. -- `cachalot_disabled()` decorator wraps `process_nats_pipeline_result` - — not relevant here but flagged in case a related test fails on cache. - -## Saved diagnostic logs from 2521 - -On the ami-devops host (not local): -- `/tmp/job2521_celery_72h.log` -- `/tmp/job2521_worker_jobs_72h.log` -- `/tmp/job2521_worker_ml_72h.log` -- `/tmp/job2521_worker_default_72h.log` -- `/tmp/job2521_beat_72h.log` - -If next session needs to re-grep, dispatch a subagent in -`~/Projects/AMI/ami-devops/` rather than re-fetching from prod. - -## Open follow-up tickets to file (not in this PR) - -1. Make `_update_job_progress` writes atomic. Either re-introduce a - tightly-scoped `select_for_update`, or move counts to Redis HINCRBY - counters with separate sync task. Solves the underlying race that - this PR works around. -2. DB-count fallback for the Redis-gone case. Add only if the WARNING - log fires often in production. -3. Persist `queued_image_ids` on Job at dispatch — makes (2) cheaper. -4. Fix `'CANCELLED' is not a valid JobState` crash in - `ami/jobs/tasks.py:1193` (`zombie_streams._drain_all`). Throws - every beat tick on a legacy job row whose status is not in the - JobState enum. -5. Expand `all_tasks_processed()` usage to other 5 `progress.is_complete()` - sites if monitoring shows clobber-driven failures elsewhere. - -## Branch state - -``` -$ git log --oneline fix/job-complete-check ^main -10cdac28 docs(jobs): narrow plan to reaper-only Redis check via all_tasks_processed -091e2fc7 docs(jobs): plan reusable Job.is_actually_complete() backed by Redis -``` - -(091e2fc7 is the original broader plan — superseded by 10cdac28. Do NOT -implement from 091e2fc7.)