
Log premature-terminal and missing-state job failures (observability for #1337)#1343

Merged: mihow merged 4 commits into main from fix/1337-terminal-anomaly-logs on Jun 19, 2026

Conversation

@mihow

@mihow mihow commented Jun 19, 2026

Collaborator

Summary

This is an observability-only change (no behaviour change) that makes two classes of "the job ended up in a terminal state" event visible in the logs, so we can tell legitimate terminal verdicts from premature ones before adding any corrective logic.

It follows #1338 (now merged), which made the result handler's terminal status transition atomic and non-regressing (the companion #1342 extends the same guard to the cancel and signal-handler writers). A side effect of that hardening is that a terminal verdict is now irreversible: a late-arriving completion can no longer pull a job back out of REVOKED/FAILURE. That is correct when the terminal verdict was right (a user cancel, a real crash), but it cements the verdict when it was wrong (for example, the stale-job reaper revoked a slow-but-alive job and its results then landed, or a result arrived while the job's Redis state was momentarily absent). These logs surface those cases instead of letting them disappear silently.
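The "atomic and non-regressing" guard described above can be sketched as a single conditional UPDATE. This is a minimal illustration, not the code from #1338: the `jobs` table, the `apply_terminal_status` function, and the `SUCCESS`/`STARTED` state names are assumptions made for the example (only REVOKED, FAILURE and CANCELING are named in this PR), and sqlite3 stands in for the real database.

```python
import logging
import sqlite3

logger = logging.getLogger("jobs.sketch")

# Assumed set of states that a late completion must not overwrite.
NO_OVERWRITE = ("SUCCESS", "FAILURE", "REVOKED", "CANCELING")


def apply_terminal_status(conn: sqlite3.Connection, job_id: int, new_status: str, stage: str) -> bool:
    """Set a terminal status only if the job is not already terminal or CANCELING.

    The check and the write are one UPDATE statement, so no other writer can
    change the status between the check and the write.
    """
    placeholders = ",".join("?" for _ in NO_OVERWRITE)
    cur = conn.execute(
        f"UPDATE jobs SET status = ? WHERE id = ? AND status NOT IN ({placeholders})",
        (new_status, job_id, *NO_OVERWRITE),
    )
    conn.commit()
    if cur.rowcount == 0:
        # Zero rows updated: the guard refused the write. Log it instead of
        # letting the skipped completion disappear silently.
        logger.warning(
            "Job %s: stage %r completed but %s was not applied; job is already terminal or CANCELING",
            job_id, stage, new_status,
        )
        return False
    return True


# Hypothetical data: job 1 is still running, job 2 was revoked earlier.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [(1, "STARTED"), (2, "REVOKED")])

applied_running = apply_terminal_status(conn, 1, "SUCCESS", "results")
applied_revoked = apply_terminal_status(conn, 2, "SUCCESS", "results")
```

In this sketch the late completion for job 2 leaves REVOKED in place and only produces a warning, which is the irreversibility this PR makes visible.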

List of Changes

  1. Log when work completes for a job that is already terminal. In _update_job_progress (ami/jobs/tasks.py), when the guarded terminal transition does not fire because the job is already terminal/CANCELING, emit a warning via the per-job log, naming the stage and the terminal state that was not applied. This is often legitimate (a cancel or the reaper genuinely won the race), but if it happens frequently it signals a premature terminal verdict. Observation only: the guard behaviour is unchanged.
  2. Log context when a result arrives for missing Redis state. The result handler treats a missing total-images key as fatal (ack + _fail_job). That single condition conflates three very different situations: state genuinely cleaned up (end of life), state never seeded yet (startup race), and state wiped by a duplicate/redelivered run_job re-running initialize_job. A new _log_missing_state_context helper records the job's age and status at both missing-state branches and splits the log by job state: a terminal/CANCELING job logs an info line (a late result after the job already finished — benign, e.g. cancel cleanup, which _fail_job no-ops on anyway), while a still-running job with missing state logs a warning (the case worth investigating). Behaviour unchanged — the job is still failed where it was before; we just log why first, at the right severity.
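The severity split in item 2 can be sketched as follows. This is an illustration under stated assumptions rather than the actual `_log_missing_state_context` helper: the function name `classify_missing_state`, the `TERMINAL_LIKE` set, and the message wording are all hypothetical.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("jobs.sketch")

# Assumed: final states plus CANCELING, i.e. states where failing the job is a no-op.
TERMINAL_LIKE = frozenset({"SUCCESS", "FAILURE", "REVOKED", "CANCELING"})


def classify_missing_state(status: str, created_at: datetime, now: datetime) -> tuple[int, str]:
    """Pick a log level and message for a result whose Redis state is missing."""
    age_s = (now - created_at).total_seconds()
    if status in TERMINAL_LIKE:
        # Late result after the job finished: benign cleanup.
        return logging.INFO, f"late result for finished job, status={status}, age={age_s:.0f}s"
    # Job is still running but its state is gone: worth investigating.
    # A small age hints at a not-yet-seeded or re-initialization race.
    return logging.WARNING, f"state missing for running job, status={status}, age={age_s:.0f}s"


now = datetime(2026, 6, 19, 21, 0, tzinfo=timezone.utc)
level, message = classify_missing_state("STARTED", now - timedelta(seconds=4), now)
logger.log(level, message)
```

Recording the age next to the status is what lets a reader of the logs separate a startup race (seconds old) from genuine end-of-life cleanup.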

The log messages are plain operational statements (no ticket numbers or internal jargon in the runtime strings); the rationale and issue reference live in code comments.

Why observation before correction

We have a report of Redis state appearing "missing for a moment at the beginning" of jobs, and the result handler currently fails a job on the first missing-state read with no second chance. Before adding grace/retry logic we want to confirm the actual trigger (a small age in the new log, on a still-running job, would point to a not-yet-seeded or redispatch race rather than genuine cleanup). Instrument, confirm, then fix.

Follow-up (NOT in this PR — proposed)

Once the logs confirm the trigger, the corrective changes to make are:

  1. Grace on missing-state in the result handler. If state is missing but the job is young / not yet STARTED / recently dispatched, do not ack-and-fail; re-raise so NATS redelivers and the brief gap self-heals, and only fail after a grace window. Today it fails immediately on the first read.
  2. Make initialize_job non-clobbering / idempotent. It currently deletes the pending sets before re-adding, so a second run_job (a re-run, or an acks_late redelivery) wipes a job's live in-flight state. Refuse or no-op re-initialization of a job that already has pending state unless it's an explicit reset. This is a prerequisite for the acks_late redelivery in #1324 (Improve celery task dispatch and cancellation to prevent stuck jobs), which can re-trigger run_job.
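The two proposed follow-ups might look roughly like the sketch below. Nothing here is implemented in this PR: `GRACE_SECONDS`, `Action`, `decide_on_missing_state`, and the in-memory `PendingState` class are hypothetical stand-ins (the real state lives in Redis), and the 30-second window is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum

GRACE_SECONDS = 30.0  # hypothetical grace window
TERMINAL_LIKE = frozenset({"SUCCESS", "FAILURE", "REVOKED", "CANCELING"})  # assumed


class Action(Enum):
    IGNORE = "ack and ignore the late result"
    RETRY = "do not ack; let the broker redeliver"
    FAIL = "ack and fail the job"


def decide_on_missing_state(status: str, age_seconds: float) -> Action:
    """Follow-up 1: give a young job a second chance before failing it."""
    if status in TERMINAL_LIKE:
        return Action.IGNORE
    if age_seconds < GRACE_SECONDS:
        return Action.RETRY
    return Action.FAIL


@dataclass
class PendingState:
    """In-memory stand-in for the per-job pending sets."""

    pending: dict[int, set[str]] = field(default_factory=dict)

    def initialize_job(self, job_id: int, image_ids: list[str], reset: bool = False) -> bool:
        """Follow-up 2: refuse to overwrite live pending state unless reset=True."""
        if self.pending.get(job_id) and not reset:
            return False  # no-op: a duplicate run_job must not wipe in-flight work
        self.pending[job_id] = set(image_ids)
        return True


state = PendingState()
first = state.initialize_job(7, ["a", "b", "c"])
state.pending[7].discard("a")  # one image finishes processing
second = state.initialize_job(7, ["a", "b", "c"])  # simulated redelivered run_job
```

With the non-clobbering check, the simulated redelivery leaves the remaining pending images untouched instead of re-seeding the full set.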

How to Test the Changes

  • Full ami/jobs/tests/test_tasks.py and ami/jobs/tests/test_jobs.py pass locally (no behaviour change).
  • The new log lines sit on the existing missing-state and already-terminal code paths; no new branches in control flow.
  • Verified live on a dev deployment: cancelling a job mid-process produces the expected info line ("result arrived after the job already finished, status=REVOKED ... ignoring") for each in-flight result, rather than a misleading failure warning.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Builds on #1338 (merged). Sibling fast-follow of #1342 (not a dependency). Refs #1337, #1219, #1324.

Summary by CodeRabbit

  • Chores
    • Enhanced diagnostic logging for job processing to improve visibility into edge cases, including better detection of missing state conditions and unexpected job state transitions for troubleshooting purposes.

@netlify

netlify Bot commented Jun 19, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 7b848cb
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a35b8ea18f8ce0008a56815

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

Walkthrough

Adds a new helper _log_missing_state_context(job_id, stage) in ami/jobs/tasks.py that fetches job metadata and logs an info or warning message depending on whether the job is already terminal or not. This helper is called at both the "process" and "results" missing-state paths in process_nats_pipeline_result. A warning log is also added in _update_job_progress when became_complete=True but the guarded DB transition is skipped because the job is already terminal.

Changes

Missing Redis State & Premature Terminal Diagnostics

| Layer / File(s) | Summary |
| --- | --- |
| _log_missing_state_context helper and call sites (ami/jobs/tasks.py) | Adds _log_missing_state_context(job_id, stage) (lines 439–488) that queries job status, dispatch mode, and age, then logs info if the job is already terminal or a warning if it is non-terminal with missing Redis state. Wires this call into both the "process"-stage (line 285) and "results"-stage (line 368) missing-state branches of process_nats_pipeline_result, before the existing ACK/fail path. |
| Warning log for skipped completion transition in _update_job_progress (ami/jobs/tasks.py) | Adds an else branch (lines 708–722) for the became_complete=True / zero-row-updated case, fetching the current job status and logging a warning that completion was not applied because the job was already terminal or CANCELING. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

  • Review how the state and progress of jobs are tracked #1285: This PR directly implements forensic logging for missing Redis progress state and premature terminal verdicts across process_nats_pipeline_result and _update_job_progress, which aligns with the framework proposed in #1285 for triangulating job state across DB, Redis, and NATS.
  • Review and simplify job logs #1236: Both touch logging behavior in process_nats_pipeline_result; this PR adds structured diagnostic logging at the same missing-state ACK/fail path discussed in #1236.

Possibly related PRs

  • RolnickLab/antenna#1234: Directly modifies the same process_nats_pipeline_result and _update_job_progress control flow paths — including the missing-state ACK/fail path — that this PR now augments with diagnostic logging.

Suggested labels

PSv2

Poem

🐇 Hoppity-hop through the pipeline I go,
When Redis goes missing, I now let you know!
A warning for non-terminals, info for the done,
No silent failures — diagnostics are fun!
The job's age and status, all logged with great care,
So no sneaky transitions slip by unaware. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding diagnostic logging for premature-terminal and missing-state job failures. It is clear, concise, and specific about the observability improvements being made. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description provides comprehensive coverage of all required template sections including summary, list of changes, detailed description with rationale, testing instructions, and a completed checklist. |

@mihow mihow marked this pull request as ready for review June 19, 2026 21:21
@mihow mihow force-pushed the fix/1337-terminal-transition-chokepoint branch from a8bce8d to aaa5365 Compare June 19, 2026 21:21
mihow and others added 2 commits June 19, 2026 14:21
…gnosis

Observation-only follow-up to #1338/#1342. Now that terminal status
transitions are irreversible, surface the two cases where a terminal verdict
may have been wrong, instead of letting them disappear silently:

1. When work completes for a job the guard finds already terminal/CANCELING,
   log a warning. Often legitimate (cancel/reaper won the race) but, if frequent,
   the signal of a premature terminal verdict.
2. When a result is failed because the job's Redis state is missing, log the
   job age/status/dispatch first. A small age points to a not-yet-seeded or
   redelivered-run_job race rather than genuine cleanup.

No behaviour change — both warnings sit on existing code paths. Lets us confirm
the trigger before adding grace/idempotency logic (see PR body follow-up).

Refs #1337, #1219, #1324.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y terminal

The missing-state diagnostic logged a WARNING saying 'Failing job' for every
in-flight result that arrived after a job finished — but _fail_job no-ops on a
terminal job, so after a cancel (which deletes the Redis state) this fired once
per in-flight batch and misdescribed normal cleanup as a failure. Now: a
terminal job logs at info ('ignoring in-flight result for already-terminal
job'); only a NON-terminal job with missing state logs the warning, which is the
case actually worth investigating.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mihow mihow changed the base branch from fix/1337-terminal-transition-chokepoint to main June 19, 2026 21:21
@mihow mihow force-pushed the fix/1337-terminal-anomaly-logs branch from 42d127b to 0f52c17 Compare June 19, 2026 21:21
@netlify

netlify Bot commented Jun 19, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 7b848cb
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a35b8ea46c9560008d9737a

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 461-487: The condition checking `if row["status"] in
JobState.final_states()` at line 461 does not include the CANCELING state, but
_fail_job treats CANCELING as terminal/no-op. This causes cancel-in-flight
cleanup to incorrectly trigger the non-terminal warning log with misleading
"Failing job" message. Modify the condition to also treat CANCELING as terminal,
either by adding CANCELING to the final_states check or by explicitly including
it in the condition, so that expected cancel races are properly classified as
expected cleanup rather than anomalies.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe12e339-3fc2-4556-af0c-12faa47373e2

📥 Commits

Reviewing files that changed from the base of the PR and between df04cf5 and 0f52c17.

📒 Files selected for processing (1)
  • ami/jobs/tasks.py

Comment thread ami/jobs/tasks.py Outdated
…eaner

The missing-state and completed-after-terminal logs read like insider notes —
ticket numbers and race-theory in the runtime message. Move the rationale and
the issue reference into code comments and make the log lines plain operational
statements an operator can act on without chasing a ticket. Also drop the
redundant dispatch_mode field and the extra status re-query.

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 19, 2026 21:36

Copilot AI left a comment

Contributor

Pull request overview

Adds targeted logging to improve observability around two “job reached a terminal state unexpectedly” scenarios in the async results processing pipeline, without changing the underlying guard/fail behavior. This helps distinguish legitimate terminal transitions (cancel/reaper) from premature/incorrect ones that would otherwise be silent.

Changes:

  • Emit missing-Redis-state context logs before ack+fail in both stage="process" and stage="results" missing-state branches.
  • Emit a warning when _update_job_progress detects completion but the guarded terminal transition does not apply (job already terminal/CANCELING).

Comment thread ami/jobs/tasks.py
Comment thread ami/jobs/tasks.py
Comment thread ami/jobs/tasks.py Outdated
@mihow mihow added the PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515. label Jun 19, 2026
- Treat CANCELING as terminal-like in the missing-state classification so a
  cancel-in-flight result logs the benign info line instead of the misleading
  'still running / marking it failed' warning (matches _fail_job's no-op set).
  Caught by CodeRabbit and Copilot.
- Rename the values() dict from 'row' to 'job_values' (per review).
- Log the completed-after-terminal case via job.logger and include the stage and
  attempted terminal state, without an extra status re-query (per Copilot).

Refs #1337.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mihow mihow merged commit 702de1e into main Jun 19, 2026
7 checks passed
@mihow mihow deleted the fix/1337-terminal-anomaly-logs branch June 19, 2026 23:24

Labels

PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.

2 participants