From 091e2fc7867ec31856793908db8ea098981de363 Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Wed, 29 Apr 2026 14:00:24 -0700
Subject: [PATCH 1/6] docs(jobs): plan reusable Job.is_actually_complete()
 backed by Redis
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Background: production job 2521 ended REVOKED despite all 4510 NATS
messages acked and Redis pending sets empty. Root cause is a race in
_update_job_progress: it writes the entire Job.progress JSONB blob
without a row lock (PR #1261 dropped select_for_update to break a
contention bottleneck), and two concurrent workers can each read the
same pre-state then commit last-writer-wins. The slower committer with
processed=4509 clobbered the faster committer's processed=4510.

Job.progress cannot be the source of truth for state-machine decisions.
This plan introduces:

- AsyncJobStateManager.summarize() — single Redis pipeline call
  returning a JobCompletionSnapshot with state_present, total,
  pending_per_stage, failed, is_complete.
- Job.is_actually_complete() — Tier 1 Redis → Tier 2 DB
  detection-count fallback (logs WARNING) → Tier 3 progress.is_complete()
  for sync_api jobs.
- Replaces progress.is_complete() reads at the four state-machine call
  sites (reaper async_api guard, _update_job_progress cleanup gate,
  task_postrun, task_failure). Job.progress stays as the cosmetic
  mirror for the UI.

Out of scope (filed as follow-ups in the doc): the underlying race in
_update_job_progress, persisting queued_image_ids on Job, fixing the
zombie_streams 'CANCELLED' enum crash surfaced during this
investigation.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 docs/claude/planning/job-complete-check.md | 305 +++++++++++++++++++++
 1 file changed, 305 insertions(+)
 create mode 100644 docs/claude/planning/job-complete-check.md

diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md
new file mode 100644
index 000000000..8c8418ddf
--- /dev/null
+++ b/docs/claude/planning/job-complete-check.md
@@ -0,0 +1,305 @@
+# Plan: reusable `Job.is_actually_complete()` (fix/job-complete-check)
+
+## Why
+
+Production job 2521 (project 192, ~4510 images, async_api) finished with all
+4510 NATS messages acked and Redis pending sets empty, but `Job.progress`
+JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The
+`jobs_health_check` reaper at the 10-min cutoff:
+
+1. `mark_lost_images_failed` correctly skipped the job (Redis pending was
+   empty, so `lost_ids` was empty, which is the documented skip path).
+2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard at
+   `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which
+   returned False because of the clobbered `process` stage.
+3. Reaper REVOKEd the job. Result: 5058 detections / 8129 classifications
+   already saved to DB, but the Job row reads REVOKED.
+
+PR #1244's reconciler is the wrong tool for this symptom — there is nothing
+to reconcile in Redis. The bug is upstream: `_update_job_progress` writes
+the entire `progress` JSONB blob without a row lock (see PR #1261, which
+removed `select_for_update` to fix unrelated row-lock contention), and two
+concurrent workers can each read the same pre-state then commit
+last-writer-wins. The `max()` guard at `tasks.py:598` only protects
+percentage regression *based on what the writer read* — it does not
+serialize the readers, so a slower committer with a stale snapshot
+clobbers a faster committer's blob.
+
+`Job.progress` cannot be the source of truth for state-machine decisions.
+This PR makes that explicit: introduce a reliable completion check backed
+by Redis (which uses atomic SREM/SADD and is not racy), with a DB-count
+fallback for jobs whose Redis state has been cleaned up. `Job.progress`
+stays as the cosmetic mirror for the UI.
+
+Related context:
+
+- Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md`
+- Last-writer-wins admission: `ami/jobs/tasks.py:563-571`
+- PR #1261 commit `50677444` removed `select_for_update` to fix contention.
+
+## Scope
+
+In scope:
+
+- New `JobCompletionSnapshot` dataclass + `AsyncJobStateManager.summarize()`.
+- New `Job.is_actually_complete()` method with Tier 1 (Redis) → Tier 2
+  (DB count) → Tier 3 (`progress.is_complete()` for sync_api) fallback.
+- WARNING log whenever Tier 2 fires, so we can monitor frequency and
+  catch unexpected fallbacks.
+- Replace `progress.is_complete()` reads at the four state-machine call
+  sites that currently consult it.
+- Tests covering the 4-way matrix (dispatch_mode × Redis-present-or-not).
+
+Out of scope:
+
+- Race fix in `_update_job_progress` itself. That is a separate, larger
+  refactor (move counts to atomic Redis counters, or scope a row lock to
+  microseconds via raw `jsonb_set`). Filed as follow-up: see
+  `### Follow-ups` below.
+- UI changes. `Job.progress` stays as-is. The new method is backend-only.
+- Touching `mark_lost_images_failed`. The reconciler's behavior is
+  correct for its intended bug (NATS-lost messages); 2521 was a different
+  failure that this PR addresses upstream.
+
+## Design
+
+### Tier 1 — Redis oracle
+
+`AsyncJobStateManager.summarize() -> JobCompletionSnapshot`. Single
+Redis pipeline round-trip:
+
+```python
+pipe.scard(self._get_pending_key("process"))
+pipe.scard(self._get_pending_key("results"))
+pipe.scard(self._failed_key)
+pipe.get(self._total_key)
+```
+
+```python
+@dataclass
+class JobCompletionSnapshot:
+    state_present: bool                 # total_raw is not None
+    total: int                          # 0 when state_present=False
+    pending_per_stage: dict[str, int]   # {"process": N, "results": M}
+    failed: int
+    is_complete: bool                   # state_present AND all pending == 0
+
+    # Convenience for callers:
+    @property
+    def all_acked(self) -> bool:
+        return self.state_present and all(v == 0 for v in self.pending_per_stage.values())
+```
+
+`is_complete` is True only when state_present AND every stage's pending
+set is empty. A job whose Redis state is gone returns
+`state_present=False`, regardless of stage state — caller decides what to
+do via Tier 2 fallback.
+
+`RedisError` is swallowed inside `summarize()` and returns
+`state_present=False`. We do NOT propagate Redis transients up to the
+reaper — a Redis blip should not flip a job to REVOKED.
+
+### Tier 2 — DB Detection-row count
+
+When Redis state is gone, query DB to determine whether all queued images
+have detections from the job's pipeline algorithms:
+
+```python
+def _is_complete_via_db(self) -> bool:
+    """Slow but durable. Only runs when Redis state is missing
+    (post-cleanup, post-TTL, or post-restart). Logs a WARNING so
+    we know how often we fall through to this path.
+    """
+    queued_ids = self._collected_image_ids()  # see below
+    if not queued_ids:
+        return True  # zero-images job is trivially complete
+    unprocessed = self.pipeline.filter_processed_images(
+        SourceImage.objects.filter(pk__in=queued_ids)
+    ).only("pk")
+    is_complete = not unprocessed.exists()
+    self.logger.warning(
+        "job %s: Redis state gone, fell back to DB detection-count check; "
+        "queued=%d, unprocessed=%d, is_complete=%s",
+        self.pk, len(queued_ids), unprocessed.count(), is_complete,
+    )
+    return is_complete
+```
+
+`self._collected_image_ids()`: derive from `Job.source_image_collection`,
+`Job.source_image_single`, or `Job.deployment` exactly the way
+`MLJob.run`/`pipeline.collect_images` does. Or, if the cost of
+re-resolving is too high, persist the queued id list at dispatch time.
+The `summarize()` payload includes `total` from Redis but Tier 2 needs
+the full id set, so we cannot rely on Redis alone here.
+
+For the first PR cut, derive on the fly via `pipeline.collect_images()`
+to avoid a schema change. If profiling shows this is slow at 5k+ images,
+follow-up adds a `Job.queued_image_ids` jsonb column or similar.
+
+### Tier 3 — sync_api fallback
+
+`progress.is_complete()` for non-async_api jobs. Unchanged. Sync mode
+does not use Redis at all, and its writes are serialized by the single
+Celery task driving each stage.
+
+### `Job.is_actually_complete()`
+
+```python
+def is_actually_complete(self) -> bool:
+    """Source-of-truth completion check. Use everywhere instead of
+    self.progress.is_complete() for state-machine decisions.
+    self.progress.is_complete() remains valid for cosmetic/UI reads.
+    """
+    if self.dispatch_mode == JobDispatchMode.ASYNC_API:
+        snap = AsyncJobStateManager(self.pk).summarize()
+        if snap.state_present:
+            return snap.is_complete
+        return self._is_complete_via_db()
+    return self.progress.is_complete()
+```
+
+Method on the Job model, not a free function — lets call sites read like
+`if not job.is_actually_complete(): ...` matching the existing
+`progress.is_complete()` pattern.
+
+### Call sites to update
+
+| File:line | Current call | Replacement |
+|---|---|---|
+| `ami/jobs/tasks.py:968` (reaper async_api guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
+| `ami/jobs/tasks.py:654` (`_update_job_progress` cleanup gate) | `if job.progress.is_complete():` | `if job.is_actually_complete():` |
+| `ami/jobs/tasks.py:1331` (`task_postrun` SUCCESS guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
+| `ami/jobs/tasks.py:1361` (`task_failure` ASYNC_API guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
+| `ami/jobs/models.py:496` (zero-images path in `MLJob.run`) | `if job.progress.is_complete():` | leave as-is — runs before Redis is initialized |
+| `ami/jobs/models.py:159` (`run_job` post-task log) | `not job.progress.is_complete()` | leave as-is — log gating only, not state-machine |
+
+The four state-machine sites (rows 1-4) are the ones that decide SUCCESS
+vs REVOKED vs deferred-cleanup. They all currently misread the clobbered
+`progress.is_complete()`. The two leave-as-is rows are pre-Redis
+initialization or log-only — the replacement would be no-op or wrong.
+
+### Tests
+
+`ami/jobs/tests/test_async_job_state.py` (new file or extend existing):
+
+- `summarize()` against fresh init (state_present=True, all pending=total)
+- `summarize()` mid-flight (some pending, some processed)
+- `summarize()` complete (state_present=True, all pending=0, is_complete=True)
+- `summarize()` cleaned up (state_present=False, is_complete=False)
+- `summarize()` Redis transient (RedisError → state_present=False)
+
+`ami/jobs/tests/test_job_completion.py` (new):
+
+- `Job.is_actually_complete()` — async_api + Redis present + all pending=0 → True
+- async_api + Redis present + pending>0 → False (mid-flight defer)
+- async_api + Redis gone + DB shows all detections present → True
+- async_api + Redis gone + DB shows missing detections → False (logs WARNING)
+- sync_api → falls through to `progress.is_complete()`, unchanged
+- async_api with 2521 shape (Redis empty, progress JSONB stuck at 99.98%) → True
+
+`ami/jobs/tests/test_tasks.py` extension:
+
+- Reaper reaches the async_api guard with stale `progress` (process=99.98%
+  STARTED) but Redis empty → lands SUCCESS, not REVOKED. Mirrors 2521.
+
+### Logging
+
+Tier 2 fires a WARNING with `job_id, queued_count, unprocessed_count,
+is_complete`. This is the signal we use to monitor:
+
+- How often do we fall through to DB? (Should be rare. High rate ==
+  Redis cleanup is firing too aggressively, separate bug.)
+- When we fall through, do we usually find complete or incomplete?
+  (Mostly complete == Redis TTL expiring on long-finished jobs. Mostly
+  incomplete == genuine partial-completion that the reaper will REVOKE,
+  which is the correct outcome.)
+
+Tier 1 is silent in the happy path; only ERRORS log on RedisError.
+
+## Acceptance criteria
+
+- All four state-machine call sites use `Job.is_actually_complete()`.
+- Existing test suite `ami.jobs` stays green.
+- New tests for the 6+ matrix cases pass.
+- E2E run on a real ADC against a 50+ image collection lands SUCCESS.
+- Manual chaos test: dispatch real job, manually clobber `Job.progress`
+  to mimic 2521 shape via Django shell, wait 11 min, verify reaper lands
+  SUCCESS not REVOKED. Verify Redis pending was empty at decision time.
+- Tier 2 WARNING log appears when expected and only when expected.
+
+## Rollout
+
+Single PR. No feature flag. The new method strictly fixes a
+known-broken path; no breaking change. Deploy + monitor reaper logs for:
+
+- Drop in REVOKEd async_api jobs that have `progress=~100%` in their final
+  Job row state.
+- Any unexpected uptick in `fell back to DB detection-count check`
+  WARNING lines.
+
+## Risks
+
+- **DB fallback is slow on big jobs**: bounded — only fires after Redis
+  cleanup, on a stale job at the reaper tick. Worst case one slow query
+  per stale job per 15-min beat.
+- **`pipeline.collect_images` re-resolves the queued list**: if the
+  Collection has changed since dispatch (rare for completed jobs but
+  possible), the recomputed id list may not match the originally queued
+  list. Acceptable for v1 — the WARNING log will show the count and we
+  can iterate.
+- **Race window between Redis SREM and `summarize()`**: a result that
+  lands during the reaper's `summarize()` call could leave Redis briefly
+  in an inconsistent intermediate state. SCARD is atomic per-key, so the
+  worst case is the snapshot misses a just-completed image and reports
+  pending=1 → reaper waits another tick → next tick sees 0 → SUCCESS.
+  Bounded.
+
+## Follow-ups (separate PRs)
+
+1. **Make `_update_job_progress` writes atomic.** Either re-introduce a
+   tightly-scoped `select_for_update`, or move counts (`detections`,
+   `classifications`, `captures`) to Redis HINCRBY counters and have a
+   separate sync task mirror them into `Job.progress` periodically. The
+   cosmetic counter drift admitted at `tasks.py:563-571` becomes invisible
+   under (1), and the failure mode in this PR's plan goes away even
+   without `is_actually_complete()`.
+2. **Persist `queued_image_ids` on the Job row at dispatch time.**
+   Replaces the on-the-fly `pipeline.collect_images` call in Tier 2.
+3. **Fix `'CANCELLED' is not a valid JobState`** in
+   `ami/jobs/tasks.py:1193` (`_drain_all`). Separate bug surfaced during
+   2521 investigation — `zombie_streams` check is throwing every tick on
+   a legacy job row whose `status='CANCELLED'` is not in the JobState
+   enum (only `CANCELING` / `REVOKED` are valid).
+
+## Diagnostic evidence (job 2521)
+
+Captured from prod via `ami-devops` on 2026-04-29:
+
+- Job row at REVOKE time:
+  - `process=99.98% STARTED processed=4509 remaining=1`
+  - `results=100% SUCCESS`
+  - `progress.errors=[]`
+- NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0
+  num_ack_pending=0 num_redelivered=0`
+- Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml
+  log shows no `Stage 'X' progress lifted to 100% by max() guard` warning
+  (the diagnostic hook for this race), suggesting both writers had stale
+  reads.
+- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` at
+  11:45:45.510, then a later `99.97782%` write to results, then silence
+  for 14.5 min until reaper at 12:00:00.117.
+- Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages:
+  collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS`
+  — note the reaper's in-memory snapshot read shows 100% at the moment of
+  REVOKE. The persisted blob at that moment was 99.98%. The
+  `progress.is_complete()` call inside the guard re-read the persisted
+  blob (or evaluated against the 100%-but-STARTED status, which fails the
+  `final_states` check in `models.py:268`). Either way: guard fired,
+  REVOKED.
+
+This is the second case (status STARTED with progress=100% on the same
+stage) — it confirms the race is on the `status` field write, not just
+the percentage. Both writers had stale snapshots; the slower one
+overwrote the SUCCESS-status with STARTED.
+
+Last updated: 2026-04-29.

From 10cdac28372b1dea2ad4ee24d3716e3d38533969 Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Wed, 29 Apr 2026 14:21:40 -0700
Subject: [PATCH 2/6] docs(jobs): narrow plan to reaper-only Redis check via
 all_tasks_processed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Revise per review: the only user-visible site bitten by the
_update_job_progress clobber race is the reaper async_api guard at
tasks.py:968. Other sites that read progress.is_complete() either run
at single-writer moments (task_postrun, task_failure) or self-correct
later (cleanup gate, log gating). Narrow the fix accordingly.

Also drops the proposed Job.is_actually_complete() wrapper in favor of
inlining the Redis check at the one call site that needs it. Pydantic
JobProgress.is_complete() stays pure data; AsyncJobStateManager grows
all_tasks_processed() with tri-state bool|None semantics. Naming makes
the scope explicit: Redis tracks NATS-task SREM completeness for the
process and results stages, not job-wide multi-stage completion.

DB-count fallback (Tier 2) deferred until monitoring shows the
Redis-gone WARNING fires often enough to matter.

Sync_api jobs are unaffected — dispatch_mode gate keeps them on the
existing trust-Celery path.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 docs/claude/planning/job-complete-check.md | 455 +++++++++++----------
 1 file changed, 234 insertions(+), 221 deletions(-)

diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md
index 8c8418ddf..bc0847dfb 100644
--- a/docs/claude/planning/job-complete-check.md
+++ b/docs/claude/planning/job-complete-check.md
@@ -1,4 +1,4 @@
-# Plan: reusable `Job.is_actually_complete()` (fix/job-complete-check)
+# Plan: reaper checks Redis directly via `all_tasks_processed()` (fix/job-complete-check)
 
 ## Why
 
@@ -8,268 +8,281 @@ JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The
 `jobs_health_check` reaper at the 10-min cutoff:
 
 1. `mark_lost_images_failed` correctly skipped the job (Redis pending was
-   empty, so `lost_ids` was empty, which is the documented skip path).
-2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard at
-   `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which
+   empty, so `lost_ids` was empty — the documented skip path).
+2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard
+   at `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which
    returned False because of the clobbered `process` stage.
-3. Reaper REVOKEd the job. Result: 5058 detections / 8129 classifications
-   already saved to DB, but the Job row reads REVOKED.
+3. Reaper REVOKEd the job. 5058 detections / 8129 classifications already
+   saved to DB; Job row reads REVOKED.
 
 PR #1244's reconciler is the wrong tool for this symptom — there is nothing
-to reconcile in Redis. The bug is upstream: `_update_job_progress` writes
-the entire `progress` JSONB blob without a row lock (see PR #1261, which
-removed `select_for_update` to fix unrelated row-lock contention), and two
-concurrent workers can each read the same pre-state then commit
-last-writer-wins. The `max()` guard at `tasks.py:598` only protects
-percentage regression *based on what the writer read* — it does not
-serialize the readers, so a slower committer with a stale snapshot
+to reconcile in Redis. The bug is upstream: `_update_job_progress`
+(`tasks.py:556`) writes the entire `progress` JSONB blob without a row
+lock (PR #1261 dropped `select_for_update` to break unrelated row-lock
+contention), and two concurrent workers can each read the same pre-state
+then commit last-writer-wins. The `max()` guard at `tasks.py:598` only
+protects percentage regression *based on what each writer read* — it does
+not serialize the readers, so a slower committer with a stale snapshot
 clobbers a faster committer's blob.
 
-`Job.progress` cannot be the source of truth for state-machine decisions.
-This PR makes that explicit: introduce a reliable completion check backed
-by Redis (which uses atomic SREM/SADD and is not racy), with a DB-count
-fallback for jobs whose Redis state has been cleaned up. `Job.progress`
-stays as the cosmetic mirror for the UI.
+`Job.progress` cannot be the source of truth for state-machine decisions
+on async_api jobs. This PR makes that explicit at the one user-visible
+site: the reaper's async_api guard. Reaper consults Redis directly.
+`Job.progress` stays as the cosmetic mirror for the UI.
 
 Related context:
 
 - Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md`
 - Last-writer-wins admission: `ami/jobs/tasks.py:563-571`
-- PR #1261 commit `50677444` removed `select_for_update` to fix contention.
+- PR #1261 commit `50677444` removed `select_for_update`.
 
-## Scope
+## Scope (narrow)
 
 In scope:
 
-- New `JobCompletionSnapshot` dataclass + `AsyncJobStateManager.summarize()`.
-- New `Job.is_actually_complete()` method with Tier 1 (Redis) → Tier 2
-  (DB count) → Tier 3 (`progress.is_complete()` for sync_api) fallback.
-- WARNING log whenever Tier 2 fires, so we can monitor frequency and
-  catch unexpected fallbacks.
-- Replace `progress.is_complete()` reads at the four state-machine call
-  sites that currently consult it.
-- Tests covering the 4-way matrix (dispatch_mode × Redis-present-or-not).
-
-Out of scope:
-
-- Race fix in `_update_job_progress` itself. That is a separate, larger
-  refactor (move counts to atomic Redis counters, or scope a row lock to
-  microseconds via raw `jsonb_set`). Filed as follow-up: see
-  `### Follow-ups` below.
-- UI changes. `Job.progress` stays as-is. The new method is backend-only.
-- Touching `mark_lost_images_failed`. The reconciler's behavior is
-  correct for its intended bug (NATS-lost messages); 2521 was a different
-  failure that this PR addresses upstream.
-
-## Design
-
-### Tier 1 — Redis oracle
+- New `AsyncJobStateManager.all_tasks_processed() -> bool | None` —
+  one Redis pipeline call (SCARD × 2 + GET) returning a tri-state.
+- Replace the `not job.progress.is_complete()` read at `tasks.py:968`
+  (reaper async_api guard) with the new Redis-backed check.
+- WARNING log when Redis state is absent (`None` return), so we can
+  monitor how often this fires.
+- Tests for the 3-state matrix on `all_tasks_processed()` + a reaper
+  test mirroring the 2521 shape (Redis empty, progress JSONB stuck at
+  99.98% STARTED → reaper lands SUCCESS, not REVOKED).
+
+Out of scope (filed as follow-ups):
+
+- Race fix in `_update_job_progress` itself. Move counts to atomic
+  Redis HINCRBY, or scope a row lock to microseconds via raw `jsonb_set`.
+- Updating the other call sites that read `progress.is_complete()`
+  (`tasks.py:654`, `:1331`, `:1361`, `:159`, `:496`). They are not
+  affected by the clobber pattern under current operation — see
+  "Why only the reaper" below.
+- DB-count fallback for the `None` (Redis-gone) case. Today the
+  WARNING log + `progress.is_complete()` fallback is sufficient. If
+  the WARNING fires often enough to matter, follow-up adds a
+  `pipeline.filter_processed_images()`-based check.
+- Job-level method (`Job.processing_complete()`). Inline at the one
+  call site that needs it; if a future site needs the same logic,
+  extract then.
+- UI changes. `Job.progress` shape is unchanged.
+- `'CANCELLED' is not a valid JobState` crash in `tasks.py:1193`
+  (`zombie_streams._drain_all`). Separate bug surfaced during 2521
+  triage; legacy job row has a status not in the JobState enum.
+
+## Why only the reaper
+
+The clobber pattern in `_update_job_progress` exists at six call sites
+that read `progress.is_complete()`. Only one is user-visible:
+
+| Site | Race exposure | User-visible consequence |
+|---|---|---|
+| `tasks.py:968` reaper async_api guard | **Yes** — runs minutes after the racy writes have settled, sees the clobbered final state | **Yes — REVOKED instead of SUCCESS** |
+| `tasks.py:654` cleanup gate inside `_update_job_progress` | Same writer is itself the racer; `is_complete()=False` after clobber prevents early cleanup | No — cleanup fires later from terminal-state path |
+| `tasks.py:1331` `task_postrun` SUCCESS guard | None — fires at queue-completion, no NATS results yet, single writer | No |
+| `tasks.py:1361` `task_failure` ASYNC_API guard | None — same single-writer moment | No |
+| `tasks.py:159` `run_job` post-run log | None — log-gating only | No |
+| `tasks.py:496` `MLJob.run` zero-images path | None — runs before any concurrent writers | No |
 
-`AsyncJobStateManager.summarize() -> JobCompletionSnapshot`. Single
-Redis pipeline round-trip:
+This PR fixes (1). The others either have a single writer at the moment
+they run, or their failure mode is internal and self-correcting. If
+monitoring later shows another site biting, expand then.
 
-```python
-pipe.scard(self._get_pending_key("process"))
-pipe.scard(self._get_pending_key("results"))
-pipe.scard(self._failed_key)
-pipe.get(self._total_key)
-```
+## Why not put the check inside `progress.is_complete()`
 
-```python
-@dataclass
-class JobCompletionSnapshot:
-    state_present: bool                 # total_raw is not None
-    total: int                          # 0 when state_present=False
-    pending_per_stage: dict[str, int]   # {"process": N, "results": M}
-    failed: int
-    is_complete: bool                   # state_present AND all pending == 0
-
-    # Convenience for callers:
-    @property
-    def all_acked(self) -> bool:
-        return self.state_present and all(v == 0 for v in self.pending_per_stage.values())
-```
+`JobProgress` is a Pydantic model — pure data. Adding Redis I/O behind
+its `is_complete()` method:
 
-`is_complete` is True only when state_present AND every stage's pending
-set is empty. A job whose Redis state is gone returns
-`state_present=False`, regardless of stage state — caller decides what to
-do via Tier 2 fallback.
+- couples a serializable data shape to a side-effect-laden oracle
+- silently makes every API serialization, log dump, and read trip
+  Redis (caller doesn't see the I/O cost)
+- the Pydantic instance has no `job_id` reference, so the method
+  signature would have to grow an awkward `job_id=` argument
+- breaks `JobProgress` reuse outside the Job context
 
-`RedisError` is swallowed inside `summarize()` and returns
-`state_present=False`. We do NOT propagate Redis transients up to the
-reaper — a Redis blip should not flip a job to REVOKED.
+Keep `JobProgress.is_complete()` as cosmetic Pydantic data check.
+Do the Redis call at the state-machine call site that needs it.
 
-### Tier 2 — DB Detection-row count
+## Design
 
-When Redis state is gone, query DB to determine whether all queued images
-have detections from the job's pipeline algorithms:
+### Tier 1 — Redis oracle on `AsyncJobStateManager`
 
 ```python
-def _is_complete_via_db(self) -> bool:
-    """Slow but durable. Only runs when Redis state is missing
-    (post-cleanup, post-TTL, or post-restart). Logs a WARNING so
-    we know how often we fall through to this path.
+# ami/ml/orchestration/async_job_state.py
+def all_tasks_processed(self) -> bool | None:
+    """
+    Truth signal for whether all NATS-tracked tasks have been processed
+    out of both pending sets (process + results stages).
+
+    Returns:
+        True  — both pending sets are empty AND total > 0 (all tasks SREM'd)
+        False — at least one pending set has members (real work outstanding)
+        None  — Redis state absent (cleaned up, TTL expired, never initialized)
+                Caller decides what to do; do not assume completeness.
+
+    Scope: Redis tracks the per-image SREM lifecycle for the `process`
+    and `results` stages only. It does NOT know about the `collect` stage
+    or any future post-results stages. Callers that need "is the entire
+    Job complete across ALL stages" must combine this with their own
+    knowledge of stage layout.
+
+    RedisError (transient connection issues) is logged and returned as
+    None — a Redis blip should NOT flip a job to REVOKED on its own.
     """
-    queued_ids = self._collected_image_ids()  # see below
-    if not queued_ids:
-        return True  # zero-images job is trivially complete
-    unprocessed = self.pipeline.filter_processed_images(
-        SourceImage.objects.filter(pk__in=queued_ids)
-    ).only("pk")
-    is_complete = not unprocessed.exists()
-    self.logger.warning(
-        "job %s: Redis state gone, fell back to DB detection-count check; "
-        "queued=%d, unprocessed=%d, is_complete=%s",
-        self.pk, len(queued_ids), unprocessed.count(), is_complete,
-    )
-    return is_complete
+    try:
+        redis = self._get_redis()
+        with redis.pipeline() as pipe:
+            for stage in self.STAGES:
+                pipe.scard(self._get_pending_key(stage))
+            pipe.get(self._total_key)
+            results = pipe.execute()
+    except RedisError as e:
+        logger.warning(
+            f"Redis error reading all_tasks_processed for job {self.job_id}: {e}"
+        )
+        return None
+
+    *pending_counts, total_raw = results
+    if total_raw is None:
+        return None
+    total = int(total_raw)
+    if total == 0:
+        # Zero-images job: trivially "processed" — no NATS work to do.
+        return True
+    return all(count == 0 for count in pending_counts)
 ```
 
-`self._collected_image_ids()`: derive from `Job.source_image_collection`,
-`Job.source_image_single`, or `Job.deployment` exactly the way
-`MLJob.run`/`pipeline.collect_images` does. Or, if the cost of
-re-resolving is too high, persist the queued id list at dispatch time.
-The `summarize()` payload includes `total` from Redis but Tier 2 needs
-the full id set, so we cannot rely on Redis alone here.
+One Redis pipeline round-trip. SCARD is O(1) per key. Negligible at scale.
 
-For the first PR cut, derive on the fly via `pipeline.collect_images()`
-to avoid a schema change. If profiling shows this is slow at 5k+ images,
-follow-up adds a `Job.queued_image_ids` jsonb column or similar.
-
-### Tier 3 — sync_api fallback
-
-`progress.is_complete()` for non-async_api jobs. Unchanged. Sync mode
-does not use Redis at all, and its writes are serialized by the single
-Celery task driving each stage.
-
-### `Job.is_actually_complete()`
+### Reaper guard replacement
 
 ```python
-def is_actually_complete(self) -> bool:
-    """Source-of-truth completion check. Use everywhere instead of
-    self.progress.is_complete() for state-machine decisions.
-    self.progress.is_complete() remains valid for cosmetic/UI reads.
-    """
-    if self.dispatch_mode == JobDispatchMode.ASYNC_API:
-        snap = AsyncJobStateManager(self.pk).summarize()
-        if snap.state_present:
-            return snap.is_complete
-        return self._is_complete_via_db()
-    return self.progress.is_complete()
+# ami/jobs/tasks.py:968 — replacement
+is_terminal = celery_state in states.READY_STATES
+is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
+if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
+    processed = AsyncJobStateManager(job.pk).all_tasks_processed()
+    if processed is False:
+        # NATS tasks still pending in Redis. mark_lost_images_failed runs
+        # before this and would have reconciled if it could; if we land
+        # here with pending > 0, the reconciler couldn't help (consumer
+        # gone, etc.) and REVOKE is the correct outcome.
+        is_terminal = False
+    elif processed is None:
+        # Redis state gone — rare. Fall back to the racy progress oracle
+        # but log so we know how often this path fires. If it fires
+        # often, follow up with a DB-count fallback.
+        logger.warning(
+            "Reaper for job %s: Redis state absent, falling back to "
+            "progress.is_complete()",
+            job.pk,
+        )
+        if not job.progress.is_complete():
+            is_terminal = False
+    # processed is True -> trust Celery's terminal state, leave is_terminal as-is
 ```
 
-Method on the Job model, not a free function — lets call sites read like
-`if not job.is_actually_complete(): ...` matching the existing
-`progress.is_complete()` pattern.
+Sync_api jobs hit none of this branch — they fall through `is_terminal =
+celery_state in READY_STATES` exactly as today.
 
-### Call sites to update
+### Behavior change matrix
 
-| File:line | Current call | Replacement |
+| Scenario | Today | After this PR |
 |---|---|---|
-| `ami/jobs/tasks.py:968` (reaper async_api guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
-| `ami/jobs/tasks.py:654` (`_update_job_progress` cleanup gate) | `if job.progress.is_complete():` | `if job.is_actually_complete():` |
-| `ami/jobs/tasks.py:1331` (`task_postrun` SUCCESS guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
-| `ami/jobs/tasks.py:1361` (`task_failure` ASYNC_API guard) | `not job.progress.is_complete()` | `not job.is_actually_complete()` |
-| `ami/jobs/models.py:496` (zero-images path in `MLJob.run`) | `if job.progress.is_complete():` | leave as-is — runs before Redis is initialized |
-| `ami/jobs/models.py:159` (`run_job` post-task log) | `not job.progress.is_complete()` | leave as-is — log gating only, not state-machine |
-
-The four state-machine sites (rows 1-4) are the ones that decide SUCCESS
-vs REVOKED vs deferred-cleanup. They all currently misread the clobbered
-`progress.is_complete()`. The two leave-as-is rows are pre-Redis
-initialization or log-only — the replacement would be no-op or wrong.
+| async_api job done; Redis empty; `progress` clobbered (2521 shape) | REVOKED | **SUCCESS** |
+| async_api job mid-flight; Celery SUCCESS but Redis pending > 0 | REVOKED (was already, but for the wrong reason) | REVOKED (same outcome, correct reason: NATS tasks pending) |
+| async_api job done; Redis pending == 0; `progress` accurate | SUCCESS | SUCCESS (no change) |
+| async_api job done; Redis state cleaned up early | varies — depends on `progress` | falls back to `progress.is_complete()` with WARNING log |
+| sync_api job at reaper | unchanged | unchanged |
 
 ### Tests
 
-`ami/jobs/tests/test_async_job_state.py` (new file or extend existing):
-
-- `summarize()` against fresh init (state_present=True, all pending=total)
-- `summarize()` mid-flight (some pending, some processed)
-- `summarize()` complete (state_present=True, all pending=0, is_complete=True)
-- `summarize()` cleaned up (state_present=False, is_complete=False)
-- `summarize()` Redis transient (RedisError → state_present=False)
-
-`ami/jobs/tests/test_job_completion.py` (new):
-
-- `Job.is_actually_complete()` — async_api + Redis present + all pending=0 → True
-- async_api + Redis present + pending>0 → False (mid-flight defer)
-- async_api + Redis gone + DB shows all detections present → True
-- async_api + Redis gone + DB shows missing detections → False (logs WARNING)
-- sync_api → falls through to `progress.is_complete()`, unchanged
-- async_api with 2521 shape (Redis empty, progress JSONB stuck at 99.98%) → True
-
-`ami/jobs/tests/test_tasks.py` extension:
-
-- Reaper reaches the async_api guard with stale `progress` (process=99.98%
-  STARTED) but Redis empty → lands SUCCESS, not REVOKED. Mirrors 2521.
+Extend `ami/ml/tests.py` (or `ami/jobs/tests/test_async_job_state.py`):
+
+- `all_tasks_processed()` after fresh `initialize_job([...])` → False
+  (all images still in pending sets)
+- After SREM-ing every id from both stages → True
+- After SREM-ing some → False
+- With `total=0` (zero-images path) → True
+- With Redis state never initialized → None
+- With Redis state cleaned up via `cleanup()` → None
+- Simulated `RedisError` on the pipeline → None + WARNING logged
+
+Extend `ami/jobs/tests/test_tasks.py` — `TestCheckStaleJobs`:
+
+- async_api job, Celery SUCCESS, Redis pending empty, `progress`
+  hand-set to `process=99.98% STARTED processed=4509 remaining=1`
+  (mirroring 2521) → reaper lands SUCCESS, not REVOKED. Asserts
+  `is_terminal=True` path was taken.
+- async_api job, Celery SUCCESS, Redis pending > 0 → REVOKED (no regression
+  for genuine partial-completion).
+- async_api job, Redis state absent, `progress.is_complete()=True`,
+  Celery SUCCESS → SUCCESS via the fallback path. Assert WARNING log
+  emitted.
+- async_api job, Redis state absent, `progress.is_complete()=False`,
+  Celery SUCCESS → REVOKED via the fallback path. Assert WARNING log emitted.
+- sync_api job at reaper → behavior unchanged (existing tests cover).
 
 ### Logging
 
-Tier 2 fires a WARNING with `job_id, queued_count, unprocessed_count,
-is_complete`. This is the signal we use to monitor:
+`logger.warning(...)` once per stale-async_api job per tick when Redis
+state is absent. Format includes `job.pk` so we can grep.
 
-- How often do we fall through to DB? (Should be rare. High rate ==
-  Redis cleanup is firing too aggressively, separate bug.)
-- When we fall through, do we usually find complete or incomplete?
-  (Mostly complete == Redis TTL expiring on long-finished jobs. Mostly
-  incomplete == genuine partial-completion that the reaper will REVOKE,
-  which is the correct outcome.)
+Monitoring expectation post-deploy:
 
-Tier 1 is silent in the happy path; only ERRORS log on RedisError.
+- Drop in REVOKED async_api jobs whose final `Job.progress` snapshot has
+  `process>=99% STARTED` (the clobber signature).
+- Low-rate WARNING lines from the Redis-gone fallback. High rate ==
+  Redis cleanup is firing too aggressively, separate bug to chase.
 
 ## Acceptance criteria
 
-- All four state-machine call sites use `Job.is_actually_complete()`.
-- Existing test suite `ami.jobs` stays green.
-- New tests for the 6+ matrix cases pass.
-- E2E run on a real ADC against a 50+ image collection lands SUCCESS.
-- Manual chaos test: dispatch real job, manually clobber `Job.progress`
-  to mimic 2521 shape via Django shell, wait 11 min, verify reaper lands
-  SUCCESS not REVOKED. Verify Redis pending was empty at decision time.
-- Tier 2 WARNING log appears when expected and only when expected.
+- `all_tasks_processed()` exists with the three-state semantics.
+- Reaper at `tasks.py:968` uses it; sync_api branch unchanged.
+- Existing `ami.jobs` test suite stays green.
+- New tests for the 3-state matrix + reaper-eats-2521-shape case pass.
+- Manual chaos: dispatch real async_api job, wait until completion, then
+  via Django shell hand-clobber `Job.progress` to mimic 2521. Wait 11
+  min. Confirm reaper lands SUCCESS, not REVOKED. Confirm Redis was
+  clean at decision time.
 
 ## Rollout
 
-Single PR. No feature flag. The new method strictly fixes a
-known-broken path; no breaking change. Deploy + monitor reaper logs for:
-
-- Drop in REVOKEd async_api jobs that have `progress=~100%` in their final
-  Job row state.
-- Any unexpected uptick in `fell back to DB detection-count check`
-  WARNING lines.
+Single PR. No feature flag. The change is strictly additive at the call
+site (one site replaced; no API or schema change). Deploy + monitor.
 
 ## Risks
 
-- **DB fallback is slow on big jobs**: bounded — only fires after Redis
-  cleanup, on a stale job at the reaper tick. Worst case one slow query
-  per stale job per 15-min beat.
-- **`pipeline.collect_images` re-resolves the queued list**: if the
-  Collection has changed since dispatch (rare for completed jobs but
-  possible), the recomputed id list may not match the originally queued
-  list. Acceptable for v1 — the WARNING log will show the count and we
-  can iterate.
-- **Race window between Redis SREM and `summarize()`**: a result that
-  lands during the reaper's `summarize()` call could leave Redis briefly
-  in an inconsistent intermediate state. SCARD is atomic per-key, so the
-  worst case is the snapshot misses a just-completed image and reports
-  pending=1 → reaper waits another tick → next tick sees 0 → SUCCESS.
-  Bounded.
+- **Race window between Redis SREM and `all_tasks_processed()`**: a
+  result that lands during the reaper's call could leave Redis briefly
+  in an intermediate state. SCARD is atomic per-key; worst case is the
+  snapshot misses a just-completed image and reports pending=1 → reaper
+  defers → next tick sees pending=0 → SUCCESS. Bounded.
+- **Redis-gone fallback rate**: if WARNING fires frequently in
+  production, we punted too early on Tier 2 (DB count). Mitigation:
+  add `pipeline.filter_processed_images()`-based fallback in a
+  follow-up PR. Bounded — only fires when Redis state is absent AND the
+  job is past the reaper cutoff.
+- **Sync_api jobs unaffected**: dispatch_mode gate ensures non-async
+  jobs hit the existing path. Verified by inspection + sync_api reaper
+  tests.
 
 ## Follow-ups (separate PRs)
 
-1. **Make `_update_job_progress` writes atomic.** Either re-introduce a
-   tightly-scoped `select_for_update`, or move counts (`detections`,
-   `classifications`, `captures`) to Redis HINCRBY counters and have a
-   separate sync task mirror them into `Job.progress` periodically. The
-   cosmetic counter drift admitted at `tasks.py:563-571` becomes invisible
-   under (1), and the failure mode in this PR's plan goes away even
-   without `is_actually_complete()`.
-2. **Persist `queued_image_ids` on the Job row at dispatch time.**
-   Replaces the on-the-fly `pipeline.collect_images` call in Tier 2.
-3. **Fix `'CANCELLED' is not a valid JobState`** in
-   `ami/jobs/tasks.py:1193` (`_drain_all`). Separate bug surfaced during
-   2521 investigation — `zombie_streams` check is throwing every tick on
-   a legacy job row whose `status='CANCELLED'` is not in the JobState
-   enum (only `CANCELING` / `REVOKED` are valid).
+1. **Make `_update_job_progress` writes atomic.** Either tightly-scoped
+   `select_for_update`, or move counts (`detections`, `classifications`,
+   `captures`) to Redis HINCRBY counters with a separate sync task
+   mirroring them into `Job.progress`. The cosmetic counter drift
+   admitted at `tasks.py:563-571` becomes invisible, and the failure
+   mode this PR addresses goes away even without the new method.
+2. **DB-count fallback for the Redis-gone case** if monitoring shows
+   the WARNING firing often. `pipeline.filter_processed_images()`
+   against the Job's queued image set.
+3. **Persist `queued_image_ids` on the Job row at dispatch** to make
+   (2) cheaper.
+4. **Fix `'CANCELLED' is not a valid JobState`** in `tasks.py:1193`.
+5. **Expand `all_tasks_processed()` usage** to the other 5 sites if
+   monitoring shows a clobber-driven failure at any of them.
 
 ## Diagnostic evidence (job 2521)
 
@@ -282,23 +295,23 @@ Captured from prod via `ami-devops` on 2026-04-29:
 - NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0
   num_ack_pending=0 num_redelivered=0`
 - Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml
-  log shows no `Stage 'X' progress lifted to 100% by max() guard` warning
-  (the diagnostic hook for this race), suggesting both writers had stale
-  reads.
-- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%` at
-  11:45:45.510, then a later `99.97782%` write to results, then silence
-  for 14.5 min until reaper at 12:00:00.117.
+  log shows no `Stage 'X' progress lifted to 100% by max() guard`
+  warning (the diagnostic hook for this race), suggesting both writers
+  had stale reads.
+- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%`
+  at 11:45:45.510, then a later `99.97782%` write to results, then
+  silence for 14.5 min until reaper at 12:00:00.117.
 - Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages:
   collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS`
-  — note the reaper's in-memory snapshot read shows 100% at the moment of
-  REVOKE. The persisted blob at that moment was 99.98%. The
+  — note the reaper's in-memory snapshot read shows 100% at the moment
+  of REVOKE. The persisted blob at that moment was 99.98%. The
   `progress.is_complete()` call inside the guard re-read the persisted
-  blob (or evaluated against the 100%-but-STARTED status, which fails the
-  `final_states` check in `models.py:268`). Either way: guard fired,
-  REVOKED.
+  blob (or evaluated against the 100%-but-STARTED status, which fails
+  the `final_states` check in `models.py:268`). Either way: guard
+  fired, REVOKED.
 
 This is the second case (status STARTED with progress=100% on the same
-stage) — it confirms the race is on the `status` field write, not just
+stage) — confirms the race is on the `status` field write, not just
 the percentage. Both writers had stale snapshots; the slower one
 overwrote the SUCCESS-status with STARTED.
 

From c828d9e1f5b3577a22b9e8fb9f28ef99bc18b0ed Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Wed, 29 Apr 2026 14:28:57 -0700
Subject: [PATCH 3/6] docs(jobs): add NEXT_SESSION_PROMPT for
 fix/job-complete-check impl
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Resume instructions for the next session — implementation steps for
all_tasks_processed() + reaper guard swap, test cases, e2e chaos
validation, follow-up tickets, and the saved diagnostic log paths on
the ami-devops host. References the plan doc at
docs/claude/planning/job-complete-check.md (commit 10cdac28).

Co-Authored-By: Claude <noreply@anthropic.com>
---
 docs/claude/prompts/NEXT_SESSION_PROMPT.md | 218 +++++++++++++++++++++
 1 file changed, 218 insertions(+)
 create mode 100644 docs/claude/prompts/NEXT_SESSION_PROMPT.md

diff --git a/docs/claude/prompts/NEXT_SESSION_PROMPT.md b/docs/claude/prompts/NEXT_SESSION_PROMPT.md
new file mode 100644
index 000000000..baca7df2f
--- /dev/null
+++ b/docs/claude/prompts/NEXT_SESSION_PROMPT.md
@@ -0,0 +1,218 @@
+# Next session: implement `fix/job-complete-check`
+
+Branch: `fix/job-complete-check` at commit `10cdac28` (worktree
+`/home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks`).
+
+Plan: `docs/claude/planning/job-complete-check.md` (read first — has the
+full design, risks, follow-ups, and 2521 evidence).
+
+## TL;DR
+
+Production job 2521 ended REVOKED because `_update_job_progress`
+(`ami/jobs/tasks.py:556`) writes the entire `Job.progress` JSONB blob
+without a row lock since PR #1261 dropped `select_for_update`. Two
+concurrent workers raced; slower committer with `processed=4509
+remaining=1` clobbered the faster committer's `processed=4510 SUCCESS`
+write. NATS + Redis both showed all 4510 tasks done; only `Job.progress`
+lied. Reaper at `tasks.py:968` consults `progress.is_complete()` →
+False (because clobbered) → REVOKED.
+
+Fix: add `AsyncJobStateManager.all_tasks_processed() -> bool | None`,
+inline at the reaper guard. Sync_api unaffected. Other 5 `progress.is_complete()`
+readers untouched (not bitten by clobber). Pydantic `JobProgress` stays
+pure data — Redis I/O lives on the manager class.
+
+## Implementation steps
+
+### 1. `all_tasks_processed()`
+
+File: `ami/ml/orchestration/async_job_state.py`
+
+After `get_pending_image_ids()` (around line 229), add:
+
+```python
+def all_tasks_processed(self) -> bool | None:
+    """Tri-state truth signal for NATS-task SREM completeness across both
+    process and results pending sets.
+
+    True  — both pending sets empty AND total > 0 (or total == 0)
+    False — at least one pending set has members
+    None  — Redis state absent (cleaned up, expired, never initialized,
+            or transient RedisError)
+
+    Scope: tracks NATS task lifecycle only; does not know about `collect`
+    or any future post-results stages.
+    """
+    try:
+        redis = self._get_redis()
+        with redis.pipeline() as pipe:
+            for stage in self.STAGES:
+                pipe.scard(self._get_pending_key(stage))
+            pipe.get(self._total_key)
+            results = pipe.execute()
+    except RedisError as e:
+        logger.warning(
+            f"Redis error reading all_tasks_processed for job {self.job_id}: {e}"
+        )
+        return None
+
+    *pending_counts, total_raw = results
+    if total_raw is None:
+        return None
+    if int(total_raw) == 0:
+        return True
+    return all(count == 0 for count in pending_counts)
+```
+
+### 2. Reaper guard replacement
+
+File: `ami/jobs/tasks.py:963-969`
+
+Replace:
+```python
+is_terminal = celery_state in states.READY_STATES
+is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
+if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete():
+    is_terminal = False
+```
+
+With:
+```python
+is_terminal = celery_state in states.READY_STATES
+is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
+if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
+    processed = AsyncJobStateManager(job.pk).all_tasks_processed()
+    if processed is False:
+        is_terminal = False
+    elif processed is None:
+        logger.warning(
+            "Reaper for job %s: Redis state absent, falling back to "
+            "progress.is_complete()",
+            job.pk,
+        )
+        if not job.progress.is_complete():
+            is_terminal = False
+    # processed is True -> trust Celery's terminal state
+```
+
+Need import: `from ami.ml.orchestration.async_job_state import AsyncJobStateManager`
+(check if already imported at top of file).
+
+### 3. Tests
+
+**`ami/ml/tests.py`** — add `TestAsyncJobStateManagerAllTasksProcessed`:
+- fresh `initialize_job([...])` → False
+- after SREM-ing every id from both stages → True
+- after SREM-ing only one stage → False
+- `total=0` zero-images → True
+- never initialized → None
+- after `cleanup()` → None
+- mocked `RedisError` on pipeline → None + WARNING
+
+**`ami/jobs/tests/test_tasks.py::TestCheckStaleJobs`** — add cases:
+- async_api + Celery SUCCESS + Redis empty + `progress` clobbered to
+  `process=99.98% STARTED processed=4509 remaining=1` → SUCCESS, not REVOKED
+  (the 2521 case)
+- async_api + Celery SUCCESS + Redis pending > 0 → REVOKED
+- async_api + Redis state absent + `progress.is_complete()=True` → SUCCESS
+  via fallback + assert WARNING log
+- async_api + Redis state absent + `progress.is_complete()=False` → REVOKED
+  via fallback + assert WARNING log
+- sync_api → existing behavior unchanged
+
+### 4. Run tests
+
+```bash
+cd /home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks
+docker compose -f docker-compose.ci.yml run --rm django \
+  python manage.py test ami.ml.tests.TestAsyncJobStateManagerAllTasksProcessed ami.jobs.tests.test_tasks --keepdb
+```
+
+If tests pass, run the full `ami.jobs` suite:
+```bash
+docker compose -f docker-compose.ci.yml run --rm django \
+  python manage.py test ami.jobs --keepdb
+```
+
+### 5. E2E chaos validation
+
+Per the chaos pattern in `docs/claude/debugging/chaos-scenarios.md`:
+
+1. Dispatch real async_api job (50+ images) on local stack with ADC
+   running. Wait until SUCCESS naturally.
+2. Via Django shell, hand-clobber `Job.progress` to mimic 2521:
+   ```python
+   from ami.jobs.models import Job, JobState
+   j = Job.objects.get(pk=<id>)
+   j.status = JobState.STARTED
+   j.progress.update_stage("process", status=JobState.STARTED, progress=0.9998,
+                           processed=N-1, remaining=1, failed=0)
+   j.updated_at = j.updated_at - datetime.timedelta(minutes=15)
+   j.save()
+   ```
+3. Trigger `jobs_health_check` manually:
+   ```python
+   from ami.jobs.tasks import jobs_health_check
+   jobs_health_check()
+   ```
+4. Verify: job lands SUCCESS (not REVOKED). Verify Redis still empty
+   (cleanup runs from terminal-state path).
+
+### 6. PR
+
+```bash
+gh pr create --title "fix(jobs): reaper checks Redis directly via all_tasks_processed" --body "..."
+```
+
+PR body template — start from the plan doc's "Why", "Behavior change
+matrix", and "Risks" sections. Reference job 2521.
+
+## Pre-requisites and gotchas
+
+- Worktree currently at `dangling-tasks` path but on branch
+  `fix/job-complete-check`. `git status` should be clean.
+- Pydantic v1 in container — `.dict()` / `.json()` not `.model_dump()`.
+- CI compose file is `docker-compose.ci.yml` for tests with `--keepdb`.
+- Stale RabbitMQ connections after days: restart Django/Celery if
+  enqueueing throws `ConnectionResetError`.
+- `cachalot_disabled()` decorator wraps `process_nats_pipeline_result`
+  — not relevant here but flagged in case a related test fails on cache.
+
+## Saved diagnostic logs from 2521
+
+On the ami-devops host (not local):
+- `/tmp/job2521_celery_72h.log`
+- `/tmp/job2521_worker_jobs_72h.log`
+- `/tmp/job2521_worker_ml_72h.log`
+- `/tmp/job2521_worker_default_72h.log`
+- `/tmp/job2521_beat_72h.log`
+
+If next session needs to re-grep, dispatch a subagent in
+`~/Projects/AMI/ami-devops/` rather than re-fetching from prod.
+
+## Open follow-up tickets to file (not in this PR)
+
+1. Make `_update_job_progress` writes atomic. Either re-introduce a
+   tightly-scoped `select_for_update`, or move counts to Redis HINCRBY
+   counters with separate sync task. Solves the underlying race that
+   this PR works around.
+2. DB-count fallback for the Redis-gone case. Add only if the WARNING
+   log fires often in production.
+3. Persist `queued_image_ids` on Job at dispatch — makes (2) cheaper.
+4. Fix `'CANCELLED' is not a valid JobState` crash in
+   `ami/jobs/tasks.py:1193` (`zombie_streams._drain_all`). Throws
+   every beat tick on a legacy job row whose status is not in the
+   JobState enum.
+5. Expand `all_tasks_processed()` usage to other 5 `progress.is_complete()`
+   sites if monitoring shows clobber-driven failures elsewhere.
+
+## Branch state
+
+```
+$ git log --oneline fix/job-complete-check ^main
+10cdac28 docs(jobs): narrow plan to reaper-only Redis check via all_tasks_processed
+091e2fc7 docs(jobs): plan reusable Job.is_actually_complete() backed by Redis
+```
+
+(091e2fc7 is the original broader plan — superseded by 10cdac28. Do NOT
+implement from 091e2fc7.)

From 36b17c420855d3eb6a5b265cc2231d6a99143bb3 Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Wed, 29 Apr 2026 14:49:53 -0700
Subject: [PATCH 4/6] fix(jobs): reaper checks Redis directly via
 all_tasks_processed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Job 2521 ended REVOKED despite NATS+Redis both confirming all 4510 tasks
processed. Root cause: `_update_job_progress` writes the entire
`Job.progress` JSONB blob without a row lock since #1261 dropped
`select_for_update`. Concurrent workers raced; a slower committer
clobbered the faster committer's SUCCESS write, leaving progress at
`processed=4509 remaining=1 STARTED`. The reaper read the clobbered
`progress.is_complete()` -> False -> REVOKE.

Add `AsyncJobStateManager.all_tasks_processed() -> bool | None`
(tri-state) backed directly by Redis SCARD across both pending sets, and
inline it at the reaper guard in `check_stale_jobs`. Sync_api jobs
unchanged (dispatch_mode gate). When Redis state is absent (cleanup,
expiry, never initialized, or transient RedisError), fall back to
`progress.is_complete()` and emit a WARNING so monitoring can flag the
fallback path.

Other 5 `progress.is_complete()` callers stay as-is — only the reaper
guard is bitten by the clobber. Atomic-write follow-up tracked separately.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 ami/jobs/tasks.py                       |  28 +++-
 ami/jobs/tests/test_tasks.py            | 195 ++++++++++++++++++++++++
 ami/ml/orchestration/async_job_state.py |  30 ++++
 ami/ml/tests.py                         |  54 +++++++
 4 files changed, 300 insertions(+), 7 deletions(-)

diff --git a/ami/jobs/tasks.py b/ami/jobs/tasks.py
index f3c996f86..554c3ba18 100644
--- a/ami/jobs/tasks.py
+++ b/ami/jobs/tasks.py
@@ -909,9 +909,10 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[
 
     For each stale job, checks Celery for a terminal task status. REVOKED is
     always trusted. For async_api jobs, SUCCESS and FAILURE are only accepted
-    when job.progress.is_complete() — NATS workers may still be delivering
-    results after the Celery task finishes. All other cases result in revocation.
-    Async resources (NATS/Redis) are cleaned up in both branches.
+    when AsyncJobStateManager.all_tasks_processed() reports True (i.e. the
+    Redis pending sets are drained). When Redis state is absent, falls back to
+    job.progress.is_complete(). All other cases result in revocation. Async
+    resources (NATS/Redis) are cleaned up in both branches.
 
     Returns a list of dicts describing what was done to each job.
     """
@@ -961,12 +962,25 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[
                     # Treat as unknown state — job will be revoked below.
 
             # Only trust terminal Celery states. For async_api jobs, SUCCESS and
-            # FAILURE are only accepted when progress is complete — NATS workers may
-            # still be delivering results after the Celery task finishes.
+            # FAILURE are only accepted when all NATS tasks are processed — workers
+            # may still be delivering results after the Celery task finishes.
+            # Consult Redis (source of truth for SREM completeness) directly rather
+            # than Job.progress.is_complete(), which mirrors a JSONB blob racy under
+            # concurrent _update_job_progress writes since #1261.
             is_terminal = celery_state in states.READY_STATES
             is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
-            if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete():
-                is_terminal = False
+            if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
+                processed = AsyncJobStateManager(job.pk).all_tasks_processed()
+                if processed is False:
+                    is_terminal = False
+                elif processed is None:
+                    logger.warning(
+                        "Reaper for job %s: Redis state absent, falling back to " "progress.is_complete()",
+                        job.pk,
+                    )
+                    if not job.progress.is_complete():
+                        is_terminal = False
+                # processed is True -> trust Celery's terminal state
 
             previous_status = job.status
             if is_terminal:
diff --git a/ami/jobs/tests/test_tasks.py b/ami/jobs/tests/test_tasks.py
index 89587f3f1..3126f5486 100644
--- a/ami/jobs/tests/test_tasks.py
+++ b/ami/jobs/tests/test_tasks.py
@@ -1229,3 +1229,198 @@ def test_jobs_health_check_runs_lost_images_before_stale_jobs(self, mock_manager
 
         job.refresh_from_db()
         self.assertEqual(job.status, JobState.SUCCESS.value)
+
+
+class TestCheckStaleJobsReaperGuard(TransactionTestCase):
+    """Reaper guard for async_api jobs: SUCCESS/FAILURE Celery state is only
+    accepted when AsyncJobStateManager.all_tasks_processed() reports True. The
+    earlier guard read Job.progress.is_complete() — racy under concurrent
+    _update_job_progress writes since #1261 dropped select_for_update. Job 2521
+    landed REVOKED with NATS+Redis fully drained because a slower committer
+    clobbered the SUCCESS write. This class verifies the new Redis-direct path,
+    the absent-state fallback to progress.is_complete() with a WARNING, and
+    that sync_api jobs are unaffected.
+    """
+
+    def setUp(self):
+        cache.clear()
+        self.project = Project.objects.create(name="Reaper Guard Project")
+        self.pipeline = Pipeline.objects.create(name="Reaper Pipeline", slug="reaper-pipeline")
+        self.pipeline.projects.add(self.project)
+        self.collection = SourceImageCollection.objects.create(name="Reaper Coll", project=self.project)
+
+    def tearDown(self):
+        cache.clear()
+
+    def _stale_async_job(self, *, task_id: str = "reaper-task") -> Job:
+        job = Job.objects.create(
+            job_type_key=MLJob.key,
+            project=self.project,
+            name="reaper async job",
+            pipeline=self.pipeline,
+            source_image_collection=self.collection,
+            dispatch_mode=JobDispatchMode.ASYNC_API,
+        )
+        job.task_id = task_id
+        job.update_status(JobState.STARTED, save=True)
+        return job
+
+    def _mark_stale(self, job: Job) -> None:
+        """Push updated_at back past STALLED_JOBS_MAX_MINUTES via raw update so
+        Job.save() side-effects (auto_now) don't undo it. Call AFTER any helper
+        that touches the job model."""
+        Job.objects.filter(pk=job.pk).update(
+            updated_at=datetime.datetime.now() - datetime.timedelta(minutes=Job.STALLED_JOBS_MAX_MINUTES + 1)
+        )
+        job.refresh_from_db()
+
+    def _set_progress_clobbered(self, job: Job, total: int, processed: int) -> None:
+        """Mimic the 2521 Job.progress shape: process stage at processed/total
+        with status STARTED, even though Redis was actually fully drained."""
+        progress = job.progress
+        collect = progress.get_stage("collect")
+        collect.progress = 1.0
+        collect.status = JobState.SUCCESS
+        progress.update_stage(
+            "process",
+            progress=processed / total,
+            status=JobState.STARTED,
+            processed=processed,
+            remaining=total - processed,
+            failed=0,
+        )
+        progress.update_stage(
+            "results",
+            progress=processed / total,
+            status=JobState.STARTED,
+            captures=processed,
+        )
+        job.save()
+
+    def _set_progress_complete(self, job: Job) -> None:
+        progress = job.progress
+        for key in ("collect", "process", "results"):
+            stage = progress.get_stage(key)
+            stage.progress = 1.0
+            stage.status = JobState.SUCCESS
+        job.save()
+
+    @patch("celery.result.AsyncResult")
+    def test_async_celery_success_redis_empty_progress_clobbered_lands_success(self, mock_async_result):
+        """The 2521 case. Pre-fix, this came back REVOKED because the reaper
+        consulted progress.is_complete() (False, due to clobber). Post-fix,
+        Redis says all_tasks_processed() → True, so SUCCESS is honored."""
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = self._stale_async_job()
+        ids = [str(i) for i in range(10)]
+        manager = AsyncJobStateManager(job.pk)
+        manager.initialize_job(ids)
+        manager.update_state(set(ids), stage="process")
+        manager.update_state(set(ids), stage="results")
+        # Clobber: progress shows mid-flight even though Redis is drained.
+        self._set_progress_clobbered(job, total=10, processed=9)
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.SUCCESS
+
+        check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(
+            job.status,
+            JobState.SUCCESS.value,
+            f"clobbered progress should not block SUCCESS when Redis says drained; got {job.status}",
+        )
+
+    @patch("celery.result.AsyncResult")
+    def test_async_celery_success_redis_pending_lands_revoked(self, mock_async_result):
+        """Redis still has pending ids → genuine in-flight; reaper revokes."""
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = self._stale_async_job()
+        ids = [str(i) for i in range(10)]
+        AsyncJobStateManager(job.pk).initialize_job(ids)
+        # No SREMs — pending sets still full.
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.SUCCESS
+
+        check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(job.status, JobState.REVOKED.value)
+
+    @patch("celery.result.AsyncResult")
+    def test_async_celery_success_redis_absent_progress_complete_lands_success(self, mock_async_result):
+        """Redis state is gone (cleaned up / never initialized). Reaper falls
+        back to progress.is_complete(). Complete progress → SUCCESS + WARNING."""
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = self._stale_async_job()
+        # Don't initialize Redis state.
+        self._set_progress_complete(job)
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.SUCCESS
+
+        with self.assertLogs("ami.jobs.tasks", level="WARNING") as cm:
+            check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(job.status, JobState.SUCCESS.value)
+        self.assertTrue(any("Redis state absent" in m for m in cm.output))
+
+    @patch("celery.result.AsyncResult")
+    def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(self, mock_async_result):
+        """Redis absent + progress incomplete → REVOKED via fallback + WARNING."""
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = self._stale_async_job()
+        # Don't initialize Redis. Default progress is fresh (not complete).
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.SUCCESS
+
+        with self.assertLogs("ami.jobs.tasks", level="WARNING") as cm:
+            check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(job.status, JobState.REVOKED.value)
+        self.assertTrue(any("Redis state absent" in m for m in cm.output))
+
+    @patch("celery.result.AsyncResult")
+    def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_async_result):
+        """sync_api jobs skip the Redis guard entirely — Celery's terminal
+        state is authoritative. No Redis state initialized; if the new path
+        leaked into sync_api this would REVOKE."""
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = Job.objects.create(
+            job_type_key=MLJob.key,
+            project=self.project,
+            name="reaper sync job",
+            pipeline=self.pipeline,
+            source_image_collection=self.collection,
+            dispatch_mode=JobDispatchMode.SYNC_API,
+        )
+        job.task_id = "reaper-sync-task"
+        job.update_status(JobState.STARTED, save=True)
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.SUCCESS
+
+        check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(job.status, JobState.SUCCESS.value)
diff --git a/ami/ml/orchestration/async_job_state.py b/ami/ml/orchestration/async_job_state.py
index 26e7c3024..4c9502a60 100644
--- a/ami/ml/orchestration/async_job_state.py
+++ b/ami/ml/orchestration/async_job_state.py
@@ -228,6 +228,36 @@ def get_pending_image_ids(self) -> set[str]:
             return set()
         return {m.decode() if isinstance(m, (bytes, bytearray)) else str(m) for m in members}
 
+    def all_tasks_processed(self) -> bool | None:
+        """Tri-state truth signal for NATS-task SREM completeness across both
+        process and results pending sets.
+
+        True  — both pending sets empty AND total > 0 (or total == 0)
+        False — at least one pending set has members
+        None  — Redis state absent (cleaned up, expired, never initialized,
+                or transient RedisError)
+
+        Scope: tracks NATS task lifecycle only; does not know about `collect`
+        or any future post-results stages.
+        """
+        try:
+            redis = self._get_redis()
+            with redis.pipeline() as pipe:
+                for stage in self.STAGES:
+                    pipe.scard(self._get_pending_key(stage))
+                pipe.get(self._total_key)
+                results = pipe.execute()
+        except RedisError as e:
+            logger.warning(f"Redis error reading all_tasks_processed for job {self.job_id}: {e}")
+            return None
+
+        *pending_counts, total_raw = results
+        if total_raw is None:
+            return None
+        if int(total_raw) == 0:
+            return True
+        return all(count == 0 for count in pending_counts)
+
     def cleanup(self) -> None:
         """
         Delete all Redis keys associated with this job.
diff --git a/ami/ml/tests.py b/ami/ml/tests.py
index 18549d8cb..0297e4d8f 100644
--- a/ami/ml/tests.py
+++ b/ami/ml/tests.py
@@ -1407,6 +1407,60 @@ def test_update_state_returns_none_when_state_genuinely_missing(self):
         progress = self.manager.update_state({"img1", "img2"}, "process")
         self.assertIsNone(progress)
 
+    def test_all_tasks_processed_fresh_init_returns_false(self):
+        """Just-initialized job has all images pending in both stages."""
+        self._init_and_verify(self.image_ids)
+        self.assertFalse(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_after_drain_returns_true(self):
+        """SREM-ing every id from both stages → True."""
+        self._init_and_verify(self.image_ids)
+        self.manager.update_state(set(self.image_ids), "process")
+        self.manager.update_state(set(self.image_ids), "results")
+        self.assertTrue(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_partial_stage_returns_false(self):
+        """Process drained but results still pending → False."""
+        self._init_and_verify(self.image_ids)
+        self.manager.update_state(set(self.image_ids), "process")
+        # results stage untouched
+        self.assertFalse(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_zero_total_returns_true(self):
+        """A zero-images job is trivially complete."""
+        self.manager.initialize_job([])
+        self.assertTrue(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_never_initialized_returns_none(self):
+        """No init → total key absent → None (caller falls back)."""
+        # Do NOT call initialize_job
+        self.assertIsNone(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_after_cleanup_returns_none(self):
+        """After cleanup() the total key is gone → None."""
+        self._init_and_verify(self.image_ids)
+        self.manager.cleanup()
+        self.assertIsNone(self.manager.all_tasks_processed())
+
+    def test_all_tasks_processed_returns_none_on_redis_error(self):
+        """Transient RedisError → WARNING + None (caller falls back)."""
+        from unittest.mock import MagicMock, patch
+
+        from redis.exceptions import RedisError
+
+        self._init_and_verify(self.image_ids)
+
+        pipe = MagicMock()
+        pipe.execute.side_effect = RedisError("Connection reset by peer")
+        fake_redis = MagicMock()
+        fake_redis.pipeline.return_value.__enter__.return_value = pipe
+
+        with patch.object(self.manager, "_get_redis", return_value=fake_redis):
+            with self.assertLogs("ami.ml.orchestration.async_job_state", level="WARNING") as cm:
+                result = self.manager.all_tasks_processed()
+        self.assertIsNone(result)
+        self.assertTrue(any("Redis error reading all_tasks_processed" in m for m in cm.output))
+
 
 class TestSaveResultsRefreshesDeploymentCounts(TestCase):
     """save_results must refresh Deployment cached counts, not just Event counts.

From 0a7480b87dad54b6d4f4e3018235e9e02214de4e Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Thu, 30 Apr 2026 16:22:39 -0700
Subject: [PATCH 5/6] docs(jobs): capture state-triangulation discussion (DB +
 Redis + NATS)

Low-priority follow-up to #1276. Three sources of truth for async_api job
lifecycle (Job model, Redis sets, NATS consumer state) lie differently;
unifying them into a diagnostic view catches bug classes Redis alone
can't see (#1247-style stranding, result-handler swallow, hung in-flight
workers). Doc proposes snapshot-on-terminal, diagnostics endpoint,
JobReconciler.diagnose(), and an admin UI panel.

Tracked in #1285.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 .../planning/job-state-triangulation.md       | 145 ++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 docs/claude/planning/job-state-triangulation.md

diff --git a/docs/claude/planning/job-state-triangulation.md b/docs/claude/planning/job-state-triangulation.md
new file mode 100644
index 000000000..0c719f1a4
--- /dev/null
+++ b/docs/claude/planning/job-state-triangulation.md
@@ -0,0 +1,145 @@
+# Job state triangulation: DB + Redis + NATS
+
+**Status:** captured for future work — low priority. Tracked in
+[#1285](https://github.com/RolnickLab/antenna/issues/1285). Discussion
+artifact from session 2026-04-30 following PR #1276.
+
+## Problem
+
+The platform has three distinct sources of truth for the lifecycle of an
+async_api job:
+
+1. **DB (`ami.jobs.models.Job`)** — `Job.status` (lifecycle enum) and
+   `Job.progress` (Pydantic `JobProgress` JSONB). External-facing; what UI
+   reads. Clobber-prone since #1261 dropped `select_for_update` from
+   `_update_job_progress` (`ami/jobs/tasks.py`).
+2. **Redis (`AsyncJobStateManager`)** — atomic per-image lifecycle: pending
+   set per stage, failed set, total. Ephemeral (7-day TTL). Single source of
+   truth used by reaper guard after PR #1276.
+3. **NATS JetStream consumer state** — message lifecycle: `delivered`,
+   `ack_floor`, `num_pending`, `num_ack_pending`, `num_redelivered`,
+   `num_waiting`. Authoritative for what was dispatched and acked.
+
+Each lies differently. None alone catches every bug class observed in
+production.
+
+## What each source knows that others don't
+
+| Bug class | DB | Redis | NATS | Diagnosis signal |
+|---|---|---|---|---|
+| 2521 clobber (#1276 root cause) | progress incomplete | empty | `ack_floor == delivered` | last-writer race in `_update_job_progress` |
+| Lost images (#1247) | stuck STARTED | non-empty | `num_pending > 0`, redelivery exhausted | NATS gave up, Redis still tracks |
+| Worker crash loop | stuck STARTED | non-empty | `num_redelivered` climbing | infrastructure issue |
+| Result-handler swallow | SUCCESS | failed set non-empty | drained | error path bug |
+| Hung in-flight worker | progress 95% | non-empty | `ack_pending > 0` for minutes | worker stalled mid-batch |
+
+DB alone catches none of these reliably. Redis catches most. Redis+NATS
+catches all.
+
+## Proposed: diagnostics endpoint
+
+`GET /api/v2/jobs/<id>/diagnostics` returning structured triangulation:
+
+```json
+{
+  "db": {
+    "status": "STARTED",
+    "celery_state": "SUCCESS",
+    "progress_complete": false,
+    "stages": {"process": 1.0, "results": 1.0, "collect": 0.0}
+  },
+  "redis": {
+    "process": {"pending": 0, "total": 450},
+    "results": {"pending": 0, "total": 450},
+    "failed": [],
+    "all_tasks_processed": true,
+    "ttl_seconds": 580000
+  },
+  "nats": {
+    "stream": "...",
+    "consumer": "...",
+    "delivered": 450,
+    "ack_floor": 450,
+    "num_pending": 0,
+    "num_ack_pending": 0,
+    "num_redelivered": 12,
+    "num_waiting": 4
+  },
+  "divergence": [
+    {
+      "severity": "warn",
+      "msg": "DB progress incomplete but Redis drained and NATS fully acked — likely _update_job_progress clobber"
+    }
+  ]
+}
+```
+
+Cache 5–10s server-side (NATS info is async socket round-trip).
+
+## Proposed: UI diagnostic panel
+
+Default jobs detail view: keep existing progress bars (simple).
+
+Add expandable **Diagnostics** panel (admin/power-user). Per stage row:
+
+```
+process     dispatched  acked   pending  failed  redelivered
+            450/450     445     5        0       12
+            ── NATS ──  ─NATS─  ─Redis─  Redis   NATS
+```
+
+Header strip: NATS consumer name, Redis TTL, DB vs Celery vs Redis truth
+mismatch flags.
+
+"Force reconcile" admin button = re-run reaper guard logic on demand.
+
+## Proposed: snapshot on terminal transition
+
+On terminal state change (SUCCESS / FAILURE / REVOKED), persist NATS
+snapshot into `Job.progress.diagnostics` (or a new `Job.diagnostics` JSONB
+field). Forensic record for post-mortems — NATS consumer state is
+short-lived after job completion.
+
+## Reuse for reaper
+
+`AsyncJobStateManager.all_tasks_processed()` (added in #1276) is half of
+this. Promote to `JobReconciler.diagnose()` returning the full structured
+triangulation. Reaper consumes `diagnose().is_terminal_safe`. Diagnostics
+endpoint consumes the same call. Single source of triangulation logic.
+
+## Tradeoffs
+
+- NATS info call needs JetStream credentials in Django (already have via
+  natsconn work).
+- More numbers = more confusion for non-admins → gate diag panel by
+  permission.
+- NATS consumer state goes stale fast post-completion (consumers are
+  deleted). Snapshot-on-terminal mitigates.
+- Adds a per-job NATS round-trip on diag endpoint hit. Cache + admin-only
+  surface keeps cost bounded.
+
+## Recommended implementation order
+
+1. **Snapshot-on-terminal** first — cheap, no UI work, big forensic value.
+2. **Diagnostics endpoint** + JSON shape. Lets reaper + admin tooling
+   consume it; no UI needed yet.
+3. **Promote reaper to consume `JobReconciler.diagnose()`** — replaces
+   inline tri-state branch.
+4. **UI diagnostic panel** last, once endpoint shape is stable.
+
+## Related tickets
+
+- #1232 — Redis-only follow-ups from PR #1231 retry-conflation work
+- #721 — Backend bug: status stuck STARTED (user-facing symptom)
+- #1085 — PSv2: integrate incomplete job monitoring
+- #1134 — COMPLETED job state for partial-error jobs
+- #1112 — Worker visibility per job
+- #1166 — Processing service status indicator (UI side)
+- #1168 — Async jobs hang when NATS exhausts max_deliver
+
+## Why this is low priority
+
+PR #1276 closes the immediate clobber-revoke loop with Redis-direct check.
+Triangulation is the next step up: forensic visibility + catching adjacent
+bug classes (#1247-style, result-handler swallow). Not blocking; pull when
+the next stuck-job report lands and Redis-only signal is insufficient.

From 563530f9db8f44ed95d5a2d6737e2de1f7cca116 Mon Sep 17 00:00:00 2001
From: Michael Bunsen <notbot@gmail.com>
Date: Wed, 17 Jun 2026 16:27:28 -0700
Subject: [PATCH 6/6] fix(jobs): scope reaper fast-path to Celery SUCCESS, not
 FAILURE

Address review feedback on the stale-job reaper guard.

CodeRabbit (major): the reaper trusted both Celery SUCCESS and FAILURE
as terminal for drained async_api jobs. But update_job_failure() defers
post-queue run_job failures for async_api jobs to the async result
handler, which decides the final status from the processed/failed counts
against FAILURE_THRESHOLD (a drained-but-failed task can still resolve to
SUCCESS). Trusting Celery FAILURE here would force the job to FAILURE and
bypass that logic. Scope the Redis-backed fast-path to SUCCESS only; a
stale async_api job whose Celery task ended FAILURE now falls through to
the revoke branch instead. Add a regression test.

Copilot (nit): the fallback WARNING said "Redis state absent" but
all_tasks_processed() also returns None on transient RedisError. Reworded
to "Redis state unavailable" to cover both causes; updated the assertions.

Narrow both reaper save points to update_fields=[status, progress,
finished_at, updated_at]. A full save() re-writes the whole row from the
snapshot fetched at select_for_update() time, which can clobber `logs`
and `progress.errors` written by a concurrent _update_job_progress (no
row lock since #1261). Mirrors the narrowed-write pattern already used in
_update_job_progress.

Also remove three docs/claude session-artifact files that were added on
this branch. They contain production job IDs, internal infrastructure
references, and other operational detail that does not belong in a public
repository.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ami/jobs/tasks.py                             |  60 +++-
 ami/jobs/tests/test_tasks.py                  |  55 ++-
 docs/claude/planning/job-complete-check.md    | 318 ------------------
 .../planning/job-state-triangulation.md       | 145 --------
 docs/claude/prompts/NEXT_SESSION_PROMPT.md    | 218 ------------
 5 files changed, 89 insertions(+), 707 deletions(-)
 delete mode 100644 docs/claude/planning/job-complete-check.md
 delete mode 100644 docs/claude/planning/job-state-triangulation.md
 delete mode 100644 docs/claude/prompts/NEXT_SESSION_PROMPT.md

diff --git a/ami/jobs/tasks.py b/ami/jobs/tasks.py
index 554c3ba18..f8f199045 100644
--- a/ami/jobs/tasks.py
+++ b/ami/jobs/tasks.py
@@ -908,11 +908,15 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[
     healthy. Default cutoff is :attr:`Job.STALLED_JOBS_MAX_MINUTES`.
 
     For each stale job, checks Celery for a terminal task status. REVOKED is
-    always trusted. For async_api jobs, SUCCESS and FAILURE are only accepted
-    when AsyncJobStateManager.all_tasks_processed() reports True (i.e. the
-    Redis pending sets are drained). When Redis state is absent, falls back to
-    job.progress.is_complete(). All other cases result in revocation. Async
-    resources (NATS/Redis) are cleaned up in both branches.
+    always trusted. For async_api jobs, only SUCCESS is fast-pathed to a
+    terminal status, and only when AsyncJobStateManager.all_tasks_processed()
+    reports True (i.e. the Redis pending sets are drained); when Redis state is
+    unavailable it falls back to job.progress.is_complete(). Celery FAILURE is
+    not trusted for async_api jobs — update_job_failure() defers the terminal
+    outcome to the async result handler (FAILURE_THRESHOLD logic), so a stale
+    FAILED async_api job is revoked rather than forced to FAILURE. All other
+    cases result in revocation. Async resources (NATS/Redis) are cleaned up in
+    both branches.
 
     Returns a list of dicts describing what was done to each job.
     """
@@ -961,33 +965,54 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[
                     )
                     # Treat as unknown state — job will be revoked below.
 
-            # Only trust terminal Celery states. For async_api jobs, SUCCESS and
-            # FAILURE are only accepted when all NATS tasks are processed — workers
-            # may still be delivering results after the Celery task finishes.
-            # Consult Redis (source of truth for SREM completeness) directly rather
-            # than Job.progress.is_complete(), which mirrors a JSONB blob racy under
-            # concurrent _update_job_progress writes since #1261.
+            # Only trust terminal Celery states. For async_api jobs, a SUCCESS
+            # Celery state is only accepted when all NATS tasks are processed —
+            # workers may still be delivering results after the Celery task
+            # finishes. Consult Redis (source of truth for SREM completeness)
+            # directly rather than Job.progress.is_complete(), which mirrors a
+            # JSONB blob racy under concurrent _update_job_progress writes since
+            # #1261.
+            #
+            # Celery FAILURE is deliberately NOT fast-pathed to a terminal
+            # status here: update_job_failure() defers post-queue run_job
+            # failures for async_api jobs to the async result handler, which
+            # decides the terminal outcome from the final processed/failed
+            # counts against FAILURE_THRESHOLD (a drained-but-failed Celery task
+            # can still resolve to SUCCESS). Trusting Celery FAILURE here would
+            # force the job to FAILURE and bypass that threshold logic, so a
+            # stale async_api job whose Celery task ended FAILURE falls through
+            # to the revoke branch instead.
             is_terminal = celery_state in states.READY_STATES
             is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
-            if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
+            if is_async_api and celery_state == states.SUCCESS:
                 processed = AsyncJobStateManager(job.pk).all_tasks_processed()
                 if processed is False:
                     is_terminal = False
                 elif processed is None:
                     logger.warning(
-                        "Reaper for job %s: Redis state absent, falling back to " "progress.is_complete()",
+                        "Reaper for job %s: Redis state unavailable, falling back to " "progress.is_complete()",
                         job.pk,
                     )
                     if not job.progress.is_complete():
                         is_terminal = False
-                # processed is True -> trust Celery's terminal state
+                # processed is True -> trust Celery SUCCESS
+            elif is_async_api and celery_state == states.FAILURE:
+                # Don't treat Celery FAILURE as authoritative for async_api jobs
+                # (see comment above); revoke instead of forcing FAILURE.
+                is_terminal = False
 
             previous_status = job.status
             if is_terminal:
                 if not dry_run:
                     job.update_status(celery_state, save=False)
                     job.finished_at = datetime.datetime.now()
-                    job.save()
+                    # Narrow the write to the fields the reaper actually mutates.
+                    # A full save() would re-write the whole row from the snapshot
+                    # fetched at select_for_update() time, clobbering `logs` and
+                    # `progress.errors` that a concurrent _update_job_progress (no
+                    # row lock since #1261) may commit. update_status() touches
+                    # status + progress.summary.status, so `progress` is included.
+                    job.save(update_fields=["status", "progress", "finished_at", "updated_at"])
             else:
                 # Per-job diagnostic: surface enough state at revoke time that an
                 # operator can answer "why was this stalled?" without grepping
@@ -1009,7 +1034,10 @@ def check_stale_jobs(minutes: int | None = None, dry_run: bool = False) -> list[
                 if not dry_run:
                     job.update_status(JobState.REVOKED, save=False)
                     job.finished_at = datetime.datetime.now()
-                    job.save()
+                    # See note on the terminal branch above: narrow the write so
+                    # the reaper doesn't clobber a concurrent progress writer's
+                    # `logs` / `progress.errors`.
+                    job.save(update_fields=["status", "progress", "finished_at", "updated_at"])
 
         # Async resource cleanup runs outside the transaction — it makes network
         # calls (NATS/Redis) that should not hold the DB row lock.
diff --git a/ami/jobs/tests/test_tasks.py b/ami/jobs/tests/test_tasks.py
index 3126f5486..343d54d8a 100644
--- a/ami/jobs/tests/test_tasks.py
+++ b/ami/jobs/tests/test_tasks.py
@@ -1232,14 +1232,15 @@ def test_jobs_health_check_runs_lost_images_before_stale_jobs(self, mock_manager
 
 
 class TestCheckStaleJobsReaperGuard(TransactionTestCase):
-    """Reaper guard for async_api jobs: SUCCESS/FAILURE Celery state is only
-    accepted when AsyncJobStateManager.all_tasks_processed() reports True. The
-    earlier guard read Job.progress.is_complete() — racy under concurrent
-    _update_job_progress writes since #1261 dropped select_for_update. Job 2521
-    landed REVOKED with NATS+Redis fully drained because a slower committer
-    clobbered the SUCCESS write. This class verifies the new Redis-direct path,
-    the absent-state fallback to progress.is_complete() with a WARNING, and
-    that sync_api jobs are unaffected.
+    """Reaper guard for async_api jobs: a SUCCESS Celery state is only accepted
+    when AsyncJobStateManager.all_tasks_processed() reports True. The earlier
+    guard read Job.progress.is_complete() — racy under concurrent
+    _update_job_progress writes since #1261 dropped select_for_update. A
+    production job once landed REVOKED with NATS+Redis fully drained because a
+    slower committer clobbered the SUCCESS write. This class verifies the new
+    Redis-direct path, the unavailable-state fallback to progress.is_complete()
+    with a WARNING, that Celery FAILURE is not fast-pathed to a terminal
+    FAILURE for async_api jobs, and that sync_api jobs are unaffected.
     """
 
     def setUp(self):
@@ -1375,7 +1376,7 @@ def test_async_celery_success_redis_absent_progress_complete_lands_success(self,
 
         job.refresh_from_db()
         self.assertEqual(job.status, JobState.SUCCESS.value)
-        self.assertTrue(any("Redis state absent" in m for m in cm.output))
+        self.assertTrue(any("Redis state unavailable" in m for m in cm.output))
 
     @patch("celery.result.AsyncResult")
     def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(self, mock_async_result):
@@ -1395,7 +1396,7 @@ def test_async_celery_success_redis_absent_progress_incomplete_lands_revoked(sel
 
         job.refresh_from_db()
         self.assertEqual(job.status, JobState.REVOKED.value)
-        self.assertTrue(any("Redis state absent" in m for m in cm.output))
+        self.assertTrue(any("Redis state unavailable" in m for m in cm.output))
 
     @patch("celery.result.AsyncResult")
     def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_async_result):
@@ -1424,3 +1425,37 @@ def test_sync_api_celery_success_lands_success_without_redis_check(self, mock_as
 
         job.refresh_from_db()
         self.assertEqual(job.status, JobState.SUCCESS.value)
+
+    @patch("celery.result.AsyncResult")
+    def test_async_celery_failure_redis_drained_lands_revoked_not_failure(self, mock_async_result):
+        """A Celery FAILURE on an async_api job is never fast-pathed to a
+        terminal FAILURE here, even when Redis reports the pending sets drained.
+
+        update_job_failure() deliberately defers post-queue run_job failures for
+        async_api jobs to the async result handler, which decides the terminal
+        outcome from the final processed/failed counts against FAILURE_THRESHOLD
+        (a drained-but-failed task can still resolve to SUCCESS). The reaper must
+        not pre-empt that by forcing FAILURE, so the stale job is REVOKED instead.
+        """
+        from ami.jobs.tasks import check_stale_jobs
+
+        job = self._stale_async_job(task_id="reaper-failure-task")
+        ids = [str(i) for i in range(10)]
+        manager = AsyncJobStateManager(job.pk)
+        manager.initialize_job(ids)
+        manager.update_state(set(ids), stage="process")
+        manager.update_state(set(ids), stage="results")
+        self._mark_stale(job)
+
+        from celery import states as celery_states
+
+        mock_async_result.return_value.state = celery_states.FAILURE
+
+        check_stale_jobs()
+
+        job.refresh_from_db()
+        self.assertEqual(
+            job.status,
+            JobState.REVOKED.value,
+            f"Celery FAILURE must not be forced to terminal FAILURE for async_api; got {job.status}",
+        )
diff --git a/docs/claude/planning/job-complete-check.md b/docs/claude/planning/job-complete-check.md
deleted file mode 100644
index bc0847dfb..000000000
--- a/docs/claude/planning/job-complete-check.md
+++ /dev/null
@@ -1,318 +0,0 @@
-# Plan: reaper checks Redis directly via `all_tasks_processed()` (fix/job-complete-check)
-
-## Why
-
-Production job 2521 (project 192, ~4510 images, async_api) finished with all
-4510 NATS messages acked and Redis pending sets empty, but `Job.progress`
-JSONB had `process=99.98% STARTED processed=4509 remaining=1`. The
-`jobs_health_check` reaper at the 10-min cutoff:
-
-1. `mark_lost_images_failed` correctly skipped the job (Redis pending was
-   empty, so `lost_ids` was empty — the documented skip path).
-2. `check_stale_jobs` found Celery `state=SUCCESS` but the async_api guard
-   at `ami/jobs/tasks.py:968` consults `job.progress.is_complete()`, which
-   returned False because of the clobbered `process` stage.
-3. Reaper REVOKEd the job. 5058 detections / 8129 classifications already
-   saved to DB; Job row reads REVOKED.
-
-PR #1244's reconciler is the wrong tool for this symptom — there is nothing
-to reconcile in Redis. The bug is upstream: `_update_job_progress`
-(`tasks.py:556`) writes the entire `progress` JSONB blob without a row
-lock (PR #1261 dropped `select_for_update` to break unrelated row-lock
-contention), and two concurrent workers can each read the same pre-state
-then commit last-writer-wins. The `max()` guard at `tasks.py:598` only
-protects percentage regression *based on what each writer read* — it does
-not serialize the readers, so a slower committer with a stale snapshot
-clobbers a faster committer's blob.
-
-`Job.progress` cannot be the source of truth for state-machine decisions
-on async_api jobs. This PR makes that explicit at the one user-visible
-site: the reaper's async_api guard. Reaper consults Redis directly.
-`Job.progress` stays as the cosmetic mirror for the UI.
-
-Related context:
-
-- Lifecycle and bug taxonomy: `docs/claude/processing-lifecycle.md`
-- Last-writer-wins admission: `ami/jobs/tasks.py:563-571`
-- PR #1261 commit `50677444` removed `select_for_update`.
-
-## Scope (narrow)
-
-In scope:
-
-- New `AsyncJobStateManager.all_tasks_processed() -> bool | None` —
-  one Redis pipeline call (SCARD × 2 + GET) returning a tri-state.
-- Replace the `not job.progress.is_complete()` read at `tasks.py:968`
-  (reaper async_api guard) with the new Redis-backed check.
-- WARNING log when Redis state is absent (`None` return), so we can
-  monitor how often this fires.
-- Tests for the 3-state matrix on `all_tasks_processed()` + a reaper
-  test mirroring the 2521 shape (Redis empty, progress JSONB stuck at
-  99.98% STARTED → reaper lands SUCCESS, not REVOKED).
-
-Out of scope (filed as follow-ups):
-
-- Race fix in `_update_job_progress` itself. Move counts to atomic
-  Redis HINCRBY, or scope a row lock to microseconds via raw `jsonb_set`.
-- Updating the other call sites that read `progress.is_complete()`
-  (`tasks.py:654`, `:1331`, `:1361`, `:159`, `:496`). They are not
-  affected by the clobber pattern under current operation — see
-  "Why only the reaper" below.
-- DB-count fallback for the `None` (Redis-gone) case. Today the
-  WARNING log + `progress.is_complete()` fallback is sufficient. If
-  the WARNING fires often enough to matter, follow-up adds a
-  `pipeline.filter_processed_images()`-based check.
-- Job-level method (`Job.processing_complete()`). Inline at the one
-  call site that needs it; if a future site needs the same logic,
-  extract then.
-- UI changes. `Job.progress` shape is unchanged.
-- `'CANCELLED' is not a valid JobState` crash in `tasks.py:1193`
-  (`zombie_streams._drain_all`). Separate bug surfaced during 2521
-  triage; legacy job row has a status not in the JobState enum.
-
-## Why only the reaper
-
-The clobber pattern in `_update_job_progress` exists at six call sites
-that read `progress.is_complete()`. Only one is user-visible:
-
-| Site | Race exposure | User-visible consequence |
-|---|---|---|
-| `tasks.py:968` reaper async_api guard | **Yes** — runs minutes after the racy writes have settled, sees the clobbered final state | **Yes — REVOKED instead of SUCCESS** |
-| `tasks.py:654` cleanup gate inside `_update_job_progress` | Same writer is itself the racer; `is_complete()=False` after clobber prevents early cleanup | No — cleanup fires later from terminal-state path |
-| `tasks.py:1331` `task_postrun` SUCCESS guard | None — fires at queue-completion, no NATS results yet, single writer | No |
-| `tasks.py:1361` `task_failure` ASYNC_API guard | None — same single-writer moment | No |
-| `tasks.py:159` `run_job` post-run log | None — log-gating only | No |
-| `tasks.py:496` `MLJob.run` zero-images path | None — runs before any concurrent writers | No |
-
-This PR fixes (1). The others either have a single writer at the moment
-they run, or their failure mode is internal and self-correcting. If
-monitoring later shows another site biting, expand then.
-
-## Why not put the check inside `progress.is_complete()`
-
-`JobProgress` is a Pydantic model — pure data. Adding Redis I/O behind
-its `is_complete()` method:
-
-- couples a serializable data shape to a side-effect-laden oracle
-- silently makes every API serialization, log dump, and read trip
-  Redis (caller doesn't see the I/O cost)
-- the Pydantic instance has no `job_id` reference, so the method
-  signature would have to grow an awkward `job_id=` argument
-- breaks `JobProgress` reuse outside the Job context
-
-Keep `JobProgress.is_complete()` as cosmetic Pydantic data check.
-Do the Redis call at the state-machine call site that needs it.
-
-## Design
-
-### Tier 1 — Redis oracle on `AsyncJobStateManager`
-
-```python
-# ami/ml/orchestration/async_job_state.py
-def all_tasks_processed(self) -> bool | None:
-    """
-    Truth signal for whether all NATS-tracked tasks have been processed
-    out of both pending sets (process + results stages).
-
-    Returns:
-        True  — both pending sets are empty AND total > 0 (all tasks SREM'd)
-        False — at least one pending set has members (real work outstanding)
-        None  — Redis state absent (cleaned up, TTL expired, never initialized)
-                Caller decides what to do; do not assume completeness.
-
-    Scope: Redis tracks the per-image SREM lifecycle for the `process`
-    and `results` stages only. It does NOT know about the `collect` stage
-    or any future post-results stages. Callers that need "is the entire
-    Job complete across ALL stages" must combine this with their own
-    knowledge of stage layout.
-
-    RedisError (transient connection issues) is logged and returned as
-    None — a Redis blip should NOT flip a job to REVOKED on its own.
-    """
-    try:
-        redis = self._get_redis()
-        with redis.pipeline() as pipe:
-            for stage in self.STAGES:
-                pipe.scard(self._get_pending_key(stage))
-            pipe.get(self._total_key)
-            results = pipe.execute()
-    except RedisError as e:
-        logger.warning(
-            f"Redis error reading all_tasks_processed for job {self.job_id}: {e}"
-        )
-        return None
-
-    *pending_counts, total_raw = results
-    if total_raw is None:
-        return None
-    total = int(total_raw)
-    if total == 0:
-        # Zero-images job: trivially "processed" — no NATS work to do.
-        return True
-    return all(count == 0 for count in pending_counts)
-```
-
-One Redis pipeline round-trip. SCARD is O(1) per key. Negligible at scale.
-
-### Reaper guard replacement
-
-```python
-# ami/jobs/tasks.py:968 — replacement
-is_terminal = celery_state in states.READY_STATES
-is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
-if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
-    processed = AsyncJobStateManager(job.pk).all_tasks_processed()
-    if processed is False:
-        # NATS tasks still pending in Redis. mark_lost_images_failed runs
-        # before this and would have reconciled if it could; if we land
-        # here with pending > 0, the reconciler couldn't help (consumer
-        # gone, etc.) and REVOKE is the correct outcome.
-        is_terminal = False
-    elif processed is None:
-        # Redis state gone — rare. Fall back to the racy progress oracle
-        # but log so we know how often this path fires. If it fires
-        # often, follow up with a DB-count fallback.
-        logger.warning(
-            "Reaper for job %s: Redis state absent, falling back to "
-            "progress.is_complete()",
-            job.pk,
-        )
-        if not job.progress.is_complete():
-            is_terminal = False
-    # processed is True -> trust Celery's terminal state, leave is_terminal as-is
-```
-
-Sync_api jobs hit none of this branch — they fall through `is_terminal =
-celery_state in READY_STATES` exactly as today.
-
-### Behavior change matrix
-
-| Scenario | Today | After this PR |
-|---|---|---|
-| async_api job done; Redis empty; `progress` clobbered (2521 shape) | REVOKED | **SUCCESS** |
-| async_api job mid-flight; Celery SUCCESS but Redis pending > 0 | REVOKED (was already, but for the wrong reason) | REVOKED (same outcome, correct reason: NATS tasks pending) |
-| async_api job done; Redis pending == 0; `progress` accurate | SUCCESS | SUCCESS (no change) |
-| async_api job done; Redis state cleaned up early | varies — depends on `progress` | falls back to `progress.is_complete()` with WARNING log |
-| sync_api job at reaper | unchanged | unchanged |
-
-### Tests
-
-Extend `ami/ml/tests.py` (or `ami/jobs/tests/test_async_job_state.py`):
-
-- `all_tasks_processed()` after fresh `initialize_job([...])` → False
-  (all images still in pending sets)
-- After SREM-ing every id from both stages → True
-- After SREM-ing some → False
-- With `total=0` (zero-images path) → True
-- With Redis state never initialized → None
-- With Redis state cleaned up via `cleanup()` → None
-- Simulated `RedisError` on the pipeline → None + WARNING logged
-
-Extend `ami/jobs/tests/test_tasks.py` — `TestCheckStaleJobs`:
-
-- async_api job, Celery SUCCESS, Redis pending empty, `progress`
-  hand-set to `process=99.98% STARTED processed=4509 remaining=1`
-  (mirroring 2521) → reaper lands SUCCESS, not REVOKED. Asserts
-  `is_terminal=True` path was taken.
-- async_api job, Celery SUCCESS, Redis pending > 0 → REVOKED (no regression
-  for genuine partial-completion).
-- async_api job, Redis state absent, `progress.is_complete()=True`,
-  Celery SUCCESS → SUCCESS via the fallback path. Assert WARNING log
-  emitted.
-- async_api job, Redis state absent, `progress.is_complete()=False`,
-  Celery SUCCESS → REVOKED via the fallback path. Assert WARNING log emitted.
-- sync_api job at reaper → behavior unchanged (existing tests cover).
-
-### Logging
-
-`logger.warning(...)` once per stale-async_api job per tick when Redis
-state is absent. Format includes `job.pk` so we can grep.
-
-Monitoring expectation post-deploy:
-
-- Drop in REVOKED async_api jobs whose final `Job.progress` snapshot has
-  `process>=99% STARTED` (the clobber signature).
-- Low-rate WARNING lines from the Redis-gone fallback. High rate ==
-  Redis cleanup is firing too aggressively, separate bug to chase.
-
-## Acceptance criteria
-
-- `all_tasks_processed()` exists with the three-state semantics.
-- Reaper at `tasks.py:968` uses it; sync_api branch unchanged.
-- Existing `ami.jobs` test suite stays green.
-- New tests for the 3-state matrix + reaper-eats-2521-shape case pass.
-- Manual chaos: dispatch real async_api job, wait until completion, then
-  via Django shell hand-clobber `Job.progress` to mimic 2521. Wait 11
-  min. Confirm reaper lands SUCCESS, not REVOKED. Confirm Redis was
-  clean at decision time.
-
-## Rollout
-
-Single PR. No feature flag. The change is strictly additive at the call
-site (one site replaced; no API or schema change). Deploy + monitor.
-
-## Risks
-
-- **Race window between Redis SREM and `all_tasks_processed()`**: a
-  result that lands during the reaper's call could leave Redis briefly
-  in an intermediate state. SCARD is atomic per-key; worst case is the
-  snapshot misses a just-completed image and reports pending=1 → reaper
-  defers → next tick sees pending=0 → SUCCESS. Bounded.
-- **Redis-gone fallback rate**: if WARNING fires frequently in
-  production, we punted too early on Tier 2 (DB count). Mitigation:
-  add `pipeline.filter_processed_images()`-based fallback in a
-  follow-up PR. Bounded — only fires when Redis state is absent AND the
-  job is past the reaper cutoff.
-- **Sync_api jobs unaffected**: dispatch_mode gate ensures non-async
-  jobs hit the existing path. Verified by inspection + sync_api reaper
-  tests.
-
-## Follow-ups (separate PRs)
-
-1. **Make `_update_job_progress` writes atomic.** Either tightly-scoped
-   `select_for_update`, or move counts (`detections`, `classifications`,
-   `captures`) to Redis HINCRBY counters with a separate sync task
-   mirroring them into `Job.progress`. The cosmetic counter drift
-   admitted at `tasks.py:563-571` becomes invisible, and the failure
-   mode this PR addresses goes away even without the new method.
-2. **DB-count fallback for the Redis-gone case** if monitoring shows
-   the WARNING firing often. `pipeline.filter_processed_images()`
-   against the Job's queued image set.
-3. **Persist `queued_image_ids` on the Job row at dispatch** to make
-   (2) cheaper.
-4. **Fix `'CANCELLED' is not a valid JobState`** in `tasks.py:1193`.
-5. **Expand `all_tasks_processed()` usage** to the other 5 sites if
-   monitoring shows a clobber-driven failure at any of them.
-
-## Diagnostic evidence (job 2521)
-
-Captured from prod via `ami-devops` on 2026-04-29:
-
-- Job row at REVOKE time:
-  - `process=99.98% STARTED processed=4509 remaining=1`
-  - `results=100% SUCCESS`
-  - `progress.errors=[]`
-- NATS at cleanup: `delivered=4510 ack_floor=4510 num_pending=0
-  num_ack_pending=0 num_redelivered=0`
-- Redis: all keys deleted at cleanup; pre-cleanup unknown but worker_ml
-  log shows no `Stage 'X' progress lifted to 100% by max() guard`
-  warning (the diagnostic hook for this race), suggesting both writers
-  had stale reads.
-- Worker log: `Updated job 2521 progress in stage 'process' to 100.0%`
-  at 11:45:45.510, then a later `99.97782%` write to results, then
-  silence for 14.5 min until reaper at 12:00:00.117.
-- Reaper line: `Reaping stalled job: no progress for 14.2 min ... stages:
-  collect=100.0% SUCCESS, process=100.0% STARTED, results=100.0% SUCCESS`
-  — note the reaper's in-memory snapshot read shows 100% at the moment
-  of REVOKE. The persisted blob at that moment was 99.98%. The
-  `progress.is_complete()` call inside the guard re-read the persisted
-  blob (or evaluated against the 100%-but-STARTED status, which fails
-  the `final_states` check in `models.py:268`). Either way: guard
-  fired, REVOKED.
-
-This is the second case (status STARTED with progress=100% on the same
-stage) — confirms the race is on the `status` field write, not just
-the percentage. Both writers had stale snapshots; the slower one
-overwrote the SUCCESS-status with STARTED.
-
-Last updated: 2026-04-29.
diff --git a/docs/claude/planning/job-state-triangulation.md b/docs/claude/planning/job-state-triangulation.md
deleted file mode 100644
index 0c719f1a4..000000000
--- a/docs/claude/planning/job-state-triangulation.md
+++ /dev/null
@@ -1,145 +0,0 @@
-# Job state triangulation: DB + Redis + NATS
-
-**Status:** captured for future work — low priority. Tracked in
-[#1285](https://github.com/RolnickLab/antenna/issues/1285). Discussion
-artifact from session 2026-04-30 following PR #1276.
-
-## Problem
-
-The platform has three distinct sources of truth for the lifecycle of an
-async_api job:
-
-1. **DB (`ami.jobs.models.Job`)** — `Job.status` (lifecycle enum) and
-   `Job.progress` (Pydantic `JobProgress` JSONB). External-facing; what UI
-   reads. Clobber-prone since #1261 dropped `select_for_update` from
-   `_update_job_progress` (`ami/jobs/tasks.py`).
-2. **Redis (`AsyncJobStateManager`)** — atomic per-image lifecycle: pending
-   set per stage, failed set, total. Ephemeral (7-day TTL). Single source of
-   truth used by reaper guard after PR #1276.
-3. **NATS JetStream consumer state** — message lifecycle: `delivered`,
-   `ack_floor`, `num_pending`, `num_ack_pending`, `num_redelivered`,
-   `num_waiting`. Authoritative for what was dispatched and acked.
-
-Each lies differently. None alone catches every bug class observed in
-production.
-
-## What each source knows that others don't
-
-| Bug class | DB | Redis | NATS | Diagnosis signal |
-|---|---|---|---|---|
-| 2521 clobber (#1276 root cause) | progress incomplete | empty | `ack_floor == delivered` | last-writer race in `_update_job_progress` |
-| Lost images (#1247) | stuck STARTED | non-empty | `num_pending > 0`, redelivery exhausted | NATS gave up, Redis still tracks |
-| Worker crash loop | stuck STARTED | non-empty | `num_redelivered` climbing | infrastructure issue |
-| Result-handler swallow | SUCCESS | failed set non-empty | drained | error path bug |
-| Hung in-flight worker | progress 95% | non-empty | `ack_pending > 0` for minutes | worker stalled mid-batch |
-
-DB alone catches none of these reliably. Redis catches most. Redis+NATS
-catches all.
-
-## Proposed: diagnostics endpoint
-
-`GET /api/v2/jobs/<id>/diagnostics` returning structured triangulation:
-
-```json
-{
-  "db": {
-    "status": "STARTED",
-    "celery_state": "SUCCESS",
-    "progress_complete": false,
-    "stages": {"process": 1.0, "results": 1.0, "collect": 0.0}
-  },
-  "redis": {
-    "process": {"pending": 0, "total": 450},
-    "results": {"pending": 0, "total": 450},
-    "failed": [],
-    "all_tasks_processed": true,
-    "ttl_seconds": 580000
-  },
-  "nats": {
-    "stream": "...",
-    "consumer": "...",
-    "delivered": 450,
-    "ack_floor": 450,
-    "num_pending": 0,
-    "num_ack_pending": 0,
-    "num_redelivered": 12,
-    "num_waiting": 4
-  },
-  "divergence": [
-    {
-      "severity": "warn",
-      "msg": "DB progress incomplete but Redis drained and NATS fully acked — likely _update_job_progress clobber"
-    }
-  ]
-}
-```
-
-Cache 5–10s server-side (NATS info is async socket round-trip).
-
-## Proposed: UI diagnostic panel
-
-Default jobs detail view: keep existing progress bars (simple).
-
-Add expandable **Diagnostics** panel (admin/power-user). Per stage row:
-
-```
-process     dispatched  acked   pending  failed  redelivered
-            450/450     445     5        0       12
-            ── NATS ──  ─NATS─  ─Redis─  Redis   NATS
-```
-
-Header strip: NATS consumer name, Redis TTL, DB vs Celery vs Redis truth
-mismatch flags.
-
-"Force reconcile" admin button = re-run reaper guard logic on demand.
-
-## Proposed: snapshot on terminal transition
-
-On terminal state change (SUCCESS / FAILURE / REVOKED), persist NATS
-snapshot into `Job.progress.diagnostics` (or a new `Job.diagnostics` JSONB
-field). Forensic record for post-mortems — NATS consumer state is
-short-lived after job completion.
-
-## Reuse for reaper
-
-`AsyncJobStateManager.all_tasks_processed()` (added in #1276) is half of
-this. Promote to `JobReconciler.diagnose()` returning the full structured
-triangulation. Reaper consumes `diagnose().is_terminal_safe`. Diagnostics
-endpoint consumes the same call. Single source of triangulation logic.
-
-## Tradeoffs
-
-- NATS info call needs JetStream credentials in Django (already have via
-  natsconn work).
-- More numbers = more confusion for non-admins → gate diag panel by
-  permission.
-- NATS consumer state goes stale fast post-completion (consumers are
-  deleted). Snapshot-on-terminal mitigates.
-- Adds a per-job NATS round-trip on diag endpoint hit. Cache + admin-only
-  surface keeps cost bounded.
-
-## Recommended implementation order
-
-1. **Snapshot-on-terminal** first — cheap, no UI work, big forensic value.
-2. **Diagnostics endpoint** + JSON shape. Lets reaper + admin tooling
-   consume it; no UI needed yet.
-3. **Promote reaper to consume `JobReconciler.diagnose()`** — replaces
-   inline tri-state branch.
-4. **UI diagnostic panel** last, once endpoint shape is stable.
-
-## Related tickets
-
-- #1232 — Redis-only follow-ups from PR #1231 retry-conflation work
-- #721 — Backend bug: status stuck STARTED (user-facing symptom)
-- #1085 — PSv2: integrate incomplete job monitoring
-- #1134 — COMPLETED job state for partial-error jobs
-- #1112 — Worker visibility per job
-- #1166 — Processing service status indicator (UI side)
-- #1168 — Async jobs hang when NATS exhausts max_deliver
-
-## Why this is low priority
-
-PR #1276 closes the immediate clobber-revoke loop with Redis-direct check.
-Triangulation is the next step up: forensic visibility + catching adjacent
-bug classes (#1247-style, result-handler swallow). Not blocking; pull when
-the next stuck-job report lands and Redis-only signal is insufficient.
diff --git a/docs/claude/prompts/NEXT_SESSION_PROMPT.md b/docs/claude/prompts/NEXT_SESSION_PROMPT.md
deleted file mode 100644
index baca7df2f..000000000
--- a/docs/claude/prompts/NEXT_SESSION_PROMPT.md
+++ /dev/null
@@ -1,218 +0,0 @@
-# Next session: implement `fix/job-complete-check`
-
-Branch: `fix/job-complete-check` at commit `10cdac28` (worktree
-`/home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks`).
-
-Plan: `docs/claude/planning/job-complete-check.md` (read first — has the
-full design, risks, follow-ups, and 2521 evidence).
-
-## TL;DR
-
-Production job 2521 ended REVOKED because `_update_job_progress`
-(`ami/jobs/tasks.py:556`) writes the entire `Job.progress` JSONB blob
-without a row lock since PR #1261 dropped `select_for_update`. Two
-concurrent workers raced; slower committer with `processed=4509
-remaining=1` clobbered the faster committer's `processed=4510 SUCCESS`
-write. NATS + Redis both showed all 4510 tasks done; only `Job.progress`
-lied. Reaper at `tasks.py:968` consults `progress.is_complete()` →
-False (because clobbered) → REVOKED.
-
-Fix: add `AsyncJobStateManager.all_tasks_processed() -> bool | None`,
-inline at the reaper guard. Sync_api unaffected. Other 5 `progress.is_complete()`
-readers untouched (not bitten by clobber). Pydantic `JobProgress` stays
-pure data — Redis I/O lives on the manager class.
-
-## Implementation steps
-
-### 1. `all_tasks_processed()`
-
-File: `ami/ml/orchestration/async_job_state.py`
-
-After `get_pending_image_ids()` (around line 229), add:
-
-```python
-def all_tasks_processed(self) -> bool | None:
-    """Tri-state truth signal for NATS-task SREM completeness across both
-    process and results pending sets.
-
-    True  — both pending sets empty AND total > 0 (or total == 0)
-    False — at least one pending set has members
-    None  — Redis state absent (cleaned up, expired, never initialized,
-            or transient RedisError)
-
-    Scope: tracks NATS task lifecycle only; does not know about `collect`
-    or any future post-results stages.
-    """
-    try:
-        redis = self._get_redis()
-        with redis.pipeline() as pipe:
-            for stage in self.STAGES:
-                pipe.scard(self._get_pending_key(stage))
-            pipe.get(self._total_key)
-            results = pipe.execute()
-    except RedisError as e:
-        logger.warning(
-            f"Redis error reading all_tasks_processed for job {self.job_id}: {e}"
-        )
-        return None
-
-    *pending_counts, total_raw = results
-    if total_raw is None:
-        return None
-    if int(total_raw) == 0:
-        return True
-    return all(count == 0 for count in pending_counts)
-```
-
-### 2. Reaper guard replacement
-
-File: `ami/jobs/tasks.py:963-969`
-
-Replace:
-```python
-is_terminal = celery_state in states.READY_STATES
-is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
-if is_async_api and celery_state in {states.SUCCESS, states.FAILURE} and not job.progress.is_complete():
-    is_terminal = False
-```
-
-With:
-```python
-is_terminal = celery_state in states.READY_STATES
-is_async_api = job.dispatch_mode == JobDispatchMode.ASYNC_API
-if is_async_api and celery_state in {states.SUCCESS, states.FAILURE}:
-    processed = AsyncJobStateManager(job.pk).all_tasks_processed()
-    if processed is False:
-        is_terminal = False
-    elif processed is None:
-        logger.warning(
-            "Reaper for job %s: Redis state absent, falling back to "
-            "progress.is_complete()",
-            job.pk,
-        )
-        if not job.progress.is_complete():
-            is_terminal = False
-    # processed is True -> trust Celery's terminal state
-```
-
-Need import: `from ami.ml.orchestration.async_job_state import AsyncJobStateManager`
-(check if already imported at top of file).
-
-### 3. Tests
-
-**`ami/ml/tests.py`** — add `TestAsyncJobStateManagerAllTasksProcessed`:
-- fresh `initialize_job([...])` → False
-- after SREM-ing every id from both stages → True
-- after SREM-ing only one stage → False
-- `total=0` zero-images → True
-- never initialized → None
-- after `cleanup()` → None
-- mocked `RedisError` on pipeline → None + WARNING
-
-**`ami/jobs/tests/test_tasks.py::TestCheckStaleJobs`** — add cases:
-- async_api + Celery SUCCESS + Redis empty + `progress` clobbered to
-  `process=99.98% STARTED processed=4509 remaining=1` → SUCCESS, not REVOKED
-  (the 2521 case)
-- async_api + Celery SUCCESS + Redis pending > 0 → REVOKED
-- async_api + Redis state absent + `progress.is_complete()=True` → SUCCESS
-  via fallback + assert WARNING log
-- async_api + Redis state absent + `progress.is_complete()=False` → REVOKED
-  via fallback + assert WARNING log
-- sync_api → existing behavior unchanged
-
-### 4. Run tests
-
-```bash
-cd /home/michael/Projects/AMI/antenna/.claude/worktrees/dangling-tasks
-docker compose -f docker-compose.ci.yml run --rm django \
-  python manage.py test ami.ml.tests.TestAsyncJobStateManagerAllTasksProcessed ami.jobs.tests.test_tasks --keepdb
-```
-
-If tests pass, run the full `ami.jobs` suite:
-```bash
-docker compose -f docker-compose.ci.yml run --rm django \
-  python manage.py test ami.jobs --keepdb
-```
-
-### 5. E2E chaos validation
-
-Per the chaos pattern in `docs/claude/debugging/chaos-scenarios.md`:
-
-1. Dispatch real async_api job (50+ images) on local stack with ADC
-   running. Wait until SUCCESS naturally.
-2. Via Django shell, hand-clobber `Job.progress` to mimic 2521:
-   ```python
-   from ami.jobs.models import Job, JobState
-   j = Job.objects.get(pk=<id>)
-   j.status = JobState.STARTED
-   j.progress.update_stage("process", status=JobState.STARTED, progress=0.9998,
-                           processed=N-1, remaining=1, failed=0)
-   j.updated_at = j.updated_at - datetime.timedelta(minutes=15)
-   j.save()
-   ```
-3. Trigger `jobs_health_check` manually:
-   ```python
-   from ami.jobs.tasks import jobs_health_check
-   jobs_health_check()
-   ```
-4. Verify: job lands SUCCESS (not REVOKED). Verify Redis still empty
-   (cleanup runs from terminal-state path).
-
-### 6. PR
-
-```bash
-gh pr create --title "fix(jobs): reaper checks Redis directly via all_tasks_processed" --body "..."
-```
-
-PR body template — start from the plan doc's "Why", "Behavior change
-matrix", and "Risks" sections. Reference job 2521.
-
-## Pre-requisites and gotchas
-
-- Worktree currently at `dangling-tasks` path but on branch
-  `fix/job-complete-check`. `git status` should be clean.
-- Pydantic v1 in container — `.dict()` / `.json()` not `.model_dump()`.
-- CI compose file is `docker-compose.ci.yml` for tests with `--keepdb`.
-- Stale RabbitMQ connections after days: restart Django/Celery if
-  enqueueing throws `ConnectionResetError`.
-- `cachalot_disabled()` decorator wraps `process_nats_pipeline_result`
-  — not relevant here but flagged in case a related test fails on cache.
-
-## Saved diagnostic logs from 2521
-
-On the ami-devops host (not local):
-- `/tmp/job2521_celery_72h.log`
-- `/tmp/job2521_worker_jobs_72h.log`
-- `/tmp/job2521_worker_ml_72h.log`
-- `/tmp/job2521_worker_default_72h.log`
-- `/tmp/job2521_beat_72h.log`
-
-If next session needs to re-grep, dispatch a subagent in
-`~/Projects/AMI/ami-devops/` rather than re-fetching from prod.
-
-## Open follow-up tickets to file (not in this PR)
-
-1. Make `_update_job_progress` writes atomic. Either re-introduce a
-   tightly-scoped `select_for_update`, or move counts to Redis HINCRBY
-   counters with separate sync task. Solves the underlying race that
-   this PR works around.
-2. DB-count fallback for the Redis-gone case. Add only if the WARNING
-   log fires often in production.
-3. Persist `queued_image_ids` on Job at dispatch — makes (2) cheaper.
-4. Fix `'CANCELLED' is not a valid JobState` crash in
-   `ami/jobs/tasks.py:1193` (`zombie_streams._drain_all`). Throws
-   every beat tick on a legacy job row whose status is not in the
-   JobState enum.
-5. Expand `all_tasks_processed()` usage to other 5 `progress.is_complete()`
-   sites if monitoring shows clobber-driven failures elsewhere.
-
-## Branch state
-
-```
-$ git log --oneline fix/job-complete-check ^main
-10cdac28 docs(jobs): narrow plan to reaper-only Redis check via all_tasks_processed
-091e2fc7 docs(jobs): plan reusable Job.is_actually_complete() backed by Redis
-```
-
-(091e2fc7 is the original broader plan — superseded by 10cdac28. Do NOT
-implement from 091e2fc7.)