Skip to content

Serve direct thumbnail URLs to reduce load on main app server#1331

Open
mihow wants to merge 64 commits into
mainfrom
feat/thumbnails-serializer-direct-urls
Open

Serve direct thumbnail URLs to reduce load on main app server#1331
mihow wants to merge 64 commits into
mainfrom
feat/thumbnails-serializer-direct-urls

Conversation

@mihow

@mihow mihow commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #1306. Base branch: feat/thumbnail-source-images. Several pieces of the original proposal were promoted to the parent PR on 2026-06-10 so it deploys safely on its own; what remains here is only the warm/cold URL classification (+223/−20 over four files).

Summary

If a thumbnail has already been generated, and the source image hasn't changed, list responses return the direct file-storage URL instead of the Django redirect route. Cold or stale thumbnails keep the existing redirect route, which generates them on first request.

Here you can see the small thumbnail has already been fetched, but the medium one has not.
image

Two separate costs — what this fixes and what it must not introduce

Earlier drafts of this PR (and some review discussion) conflated these, so spelling them out:

  1. Browser HTTP round-trips — what this PR removes. With the parent alone, every thumbnail <img> is a request through SourceImageThumbnailViewSet: DRF auth, get_object(), a thumbnail-row check, then a 302 to storage. This is not a database N+1 — it is per-image HTTP fan-out, paid by every client. A 40-card gallery page is 80 of those requests. The parent has since reduced the per-request cost (no storage existence check, and the 302 is now browser-cached for five minutes), but each client still pays the full fan-out on first view and again whenever the cache expires. This PR removes it for warm rows: a measurement on a dev deployment showed Django requests per warm gallery view going from 80 to 0.

  2. Database N+1 — what this PR must avoid introducing. The parent's serializer never touches the thumbnail table: it builds route URLs from obj.pk plus settings, so there is no pre-existing N+1 to fix (an earlier draft wrongly claimed there was — see this comment and Luke's review thread). This PR makes the serializer read obj.thumbnails.all() per row to decide warm vs cold, which would be a new N+1 on its own. SourceImageQuerySet.with_thumbnails() prefetches it as one extra query per page, and the tests pin that contract.

In short: the win is fewer HTTP requests to Django; the prefetch is self-defense against a regression this PR would otherwise create, not a query-count improvement over the parent.

Why generation stays lazy (and why the cached row matters)

Captures have a sparse access pattern: a project can hold on the order of a million captures, but users review occurrence crops — captures themselves are opened only when troubleshooting or checking the context around a detection, so only a few hundred per project are ever viewed. That rules out pre-warming thumbnails at upload/sync time (it would spend storage and compute on images nobody opens) and with it the simpler "deterministic URL, no lookup" design, which only works when the blob is guaranteed to exist.

With generation lazy, something must answer "has this thumbnail been generated, at the current spec, and is it still fresh?" per capture and label — that is exactly what the SourceImageThumbnail row records, and why this PR reads it (via the prefetch below) instead of constructing URLs blindly.

What changes

ami/main/models.py

  • New SourceImageQuerySet.with_thumbnails() — a plain prefetch_related("thumbnails") so thumbnail_urls can decide warm/cold in memory instead of firing a SELECT per row. (An earlier revision pushed an is_source_changed annotation into this prefetch; it was dropped in favour of the in-memory thumbnail_is_valid predicate below, so the same check is shared with the generator.)
  • New SourceImage.thumbnail_urls(request) — per-label {label: url} builder, the single source of truth for the warm/cold decision. Warm = a cached row exists with a non-blank path, a width equal to the requested spec width, and no source change → default_storage.url(row.path). Anything else → route URL via reverse_with_params, exactly as the parent serializer built it. The width check is strict equality because the generator (since the spec-width fix in Thumbnails API for SourceImages #1306) records the requested width on the row; legacy rows that stored the encoder's output read as stale here and self-heal through one regeneration.
  • New SourceImage.thumbnail_is_valid(spec, thumb) — the in-memory validity predicate (non-blank path, width equal to the spec, source not changed) shared by thumbnail_urls and the lazy generator gate, so the warm/cold rule lives in one place (per Luke's review).
  • Un-prefetched fallback — if thumbnails was not prefetched (a freshly created instance in a write response, or a caller that skipped with_thumbnails), every label falls back to the route URL without querying. This keeps single-object serialization working and guarantees no per-row N+1; it also means the warm-cache win is enforced by the list query-count test rather than at runtime (until Django zen-queries lands).

ami/main/api/serializers.py

SourceImageThumbnailSerializer.get_thumbnails becomes a one-line passthrough:

def get_thumbnails(self, obj):
    return obj.thumbnail_urls(request=self.context.get("request"))

ami/main/api/views.py

  • SourceImageViewSet.get_queryset.with_thumbnails().
  • EventViewSet.get_queryset's inner example_captures Prefetch → same.

List of changes

Change (user/operator effect) How (implementation) Notes
Warm-cache gallery pages no longer route thumbnail requests through Django Serializer emits default_storage.url(row.path) when the cached row matches the configured width and isn't source-changed Removes up to 80 browser→Django round-trips per page view. Frontend treats the URL as opaque (<img src=>); no FE change.
Cold and stale thumbnails still regenerate via the redirect viewset Same SourceImageThumbnailViewSet + concurrency-safe upsert No behavior change on this path.
Re-uploaded source images don't serve stale thumbnails from the warm path In-memory thumbnail_is_valid compares row vs source last_modified; stale rows fall back to the route URL Mirrors the predicate the generator already uses. Guards r3373715397.
Thumbnails generated before the #1306 spec-width fix self-heal Rows whose stored width is the encoder output (e.g. 239 for a 240 spec) read as stale and get the route URL, regenerating once One extra regeneration per legacy row, then warm forever.
The serializer's new per-row thumbnail read can't become an N+1 with_thumbnails() prefetch is the contract; viewsets wire it; un-prefetched callers fall back to route URLs without querying; a multi-row test asserts a warm list serves storage URLs with one thumbnail query One extra query per page, not per row.
Per-label URL contract tested at the model surface test_thumbnail_urls_model_method_emits_warm_and_route exercises SourceImage.thumbnail_urls directly Future template tags / management commands pick up the contract for free.

Contract change

thumbnails.{small,medium} URL shape changes for cached rows from http://<host>/api/v2/captures/thumbnails/<id>/?label=small to e.g. https://<bucket>.s3.amazonaws.com/thumbnails/capture_<id>/small.jpg.

Frontend treats the value as opaque (see ui/src/data-services/models/capture.ts:170-182, ui/src/pages/captures/capture-columns.tsx). No frontend change required.

On deployments using FileSystemStorage (e.g. local dev), default_storage.url() returns a relative /media/... URL that Django still serves — the "bypass Django" win applies fully only where storage URLs are external (S3/MinIO). Compose with a CDN later via AWS_S3_CUSTOM_DOMAIN and these URLs front-load automatically.

Promoted to #1306 (no longer in this diff)

These shipped on the parent so it is solo-deployable: removal of the per-request default_storage.exists() check, progressive JPEG encoding, Cache-Control: private, max-age=300 on the redirect, the blank-path regen guard, and the spec-width fix (replacing the 2px width tolerance an earlier revision of this branch carried).

Luke's review threads addressed

What this does NOT change

  • The redirect viewset (SourceImageThumbnailViewSet) and its concurrency-safe upsert path are unchanged.
  • Browser cache semantics for storage URLs are governed by the storage backend's headers, unchanged here.
  • Frontend code is unchanged.

Known follow-ups

  • Orphan-thumbnail reconciler — a row whose storage blob was deleted out of band now produces a broken <img> until the row is removed (consequence of the exists() removal, which lives in Thumbnails API for SourceImages #1306). Planned as a periodic integrity-check task; separate ticket.
  • Event.first_capture() at ami/main/models.py:1277 is a per-row method that runs a fresh query; pre-existing, unchanged.

Verification

  • Browser network panel on a warm gallery page: <img> requests go direct to storage; zero requests to /api/v2/captures/thumbnails/....
  • Captures-list endpoint query count: +1 thumbnail prefetch query vs the parent, not +N (asserted by test_warm_list_serves_storage_urls_with_single_thumbnail_query).
  • First measurement on a dev deployment is in this comment: Django round-trips per gallery view went from 80 to 0 once the cache was warm.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Performance

    • Optimized thumbnail URL generation to reduce database queries when loading image galleries.
    • Improved thumbnail caching and validation to ensure efficient serving of image previews.
  • Tests

    • Added comprehensive test coverage for thumbnail serving behavior, including cache validation and query efficiency checks.

loppear and others added 29 commits May 14, 2026 09:55
…edirects, and add index for (source_image, label).
Conflicts:
- ami/main/tests.py: combined import additions (create_captures_from_files +
  create_detections, both used).
- ami/main/migrations/0088_*: renamed 0088_sourceimagethumbnail_and_more.py to
  0089_*, repointed dependency to the new 0088_detection_det_srcimg_created_idx
  from main. Linear chain, no merge migration needed.
JPEG natively supports only L, RGB, and CMYK modes. Source images in
RGBA/P/LA/PA modes (common for uploaded PNGs) raised
"OSError: cannot write mode <X> as JPEG" on `img.save(format="JPEG")`,
500-ing the thumbnail endpoint. Convert to RGB first when needed.

Adds a regression test that feeds an RGBA PNG through find_or_generate
and asserts the redirect succeeds and the thumbnail row lands.

Closes review thread on ami/main/models.py L2241.

Co-Authored-By: Claude <noreply@anthropic.com>
Two changes, one root cause.

`SourceImage.last_modified` tracks the source bytes' mtime (S3 LastModified
on the sync path) and is the freshness signal the thumbnail regen check uses
to decide whether to invalidate a cached thumbnail. It is NOT the same as
`updated_at` (BaseModel auto_now), which bumps on every row save (cached
counts, detection counts, etc.) and would force regen far too often.

The upload path (`create_source_image_from_upload`) never set this field,
so:
  1. Uploaded captures had `last_modified=None` forever, breaking parity
     with sync-created rows.
  2. The thumbnail regen check `thumb.last_modified < self.last_modified`
     raised `TypeError` on the 2nd+ request for any uploaded capture's
     thumbnail (datetime < None).

Fixes:

* `create_source_image_from_upload`: set `last_modified=timezone.now()` so
  uploaded captures have a real source mtime, matching the sync path's
  behaviour of copying the S3 LastModified header.
* `find_or_generate_thumbnail_for_label`: guard the comparison. None on
  either side now means "no signal of source change, trust the cached
  thumb" instead of crashing. Covers legacy rows synced before the upload
  backfill.

Tests:

* `test_thumbnail_source_last_modified_none_does_not_crash` reproduces the
  TypeError class of bug by nulling `last_modified` after a thumbnail row
  exists and asserts the 2nd request reuses the cache.
* `test_upload_handler_backfills_last_modified` covers the upload-path
  backfill (asserts the new field falls inside [before, after] of the
  upload call).

Closes review thread on ami/main/models.py L2227.

Co-Authored-By: Claude <noreply@anthropic.com>
Two F541 (Ruff: f-string without placeholders) hits, plus a small
behavioural cleanup on the list action.

* `SourceImageThumbnailViewSet.list` now raises
  `MethodNotAllowed` (405) instead of `NotFound` (404). The route exists
  for `/captures/thumbnails/<pk>/?label=...` only; listing has no
  defined semantics. 405 sets the Allow header and is the correct REST
  response.
* Updated `test_thumbnail_no_list` to expect 405 and to use a plain
  string literal (the previous test used an f-string with no
  placeholders).

Closes review thread on ami/main/tests.py L237 (Ruff F541).

Co-Authored-By: Claude <noreply@anthropic.com>
When `settings.THUMBNAILS['SIZES']` is empty (misconfiguration), the
default-label fallback `next(iter(_sizes))` raised `StopIteration` and
surfaced as a 500 with no useful detail. Replace with an explicit empty
check that returns 404 with a clear API error message, and use
`next(iter(_sizes))` only after we know the dict is non-empty (also
switch the fallback expression to `or next(...)` since the previous
form passed `next(iter(...))` as the default arg unconditionally — fine
for non-empty dicts but still worth tidying alongside the guard).

Adds a test that overrides THUMBNAILS with an empty SIZES dict and
asserts the endpoint returns 404 with the configured-error message.

Co-Authored-By: Claude <noreply@anthropic.com>
`requests.get(url)` with no timeout blocks indefinitely if the upstream
storage is slow or unreachable. The thumbnail-generation path
(`SourceImage.find_or_generate_thumbnail_for_label`) runs this
synchronously inside a Django web request, so a stuck upstream ties up
a gunicorn worker; a gallery's worth of stuck calls drains the pool.

Default timeout is `(connect=5s, read=30s)`. Callers that need
different limits (e.g. very large originals on a slow link) can override
by passing `timeout=`. Behaviour for fast / responsive upstreams is
unchanged.

Tests verify the default is finite and that explicit overrides are
honored.

Co-Authored-By: Claude <noreply@anthropic.com>
…rent-gen race

Two related bugs in `find_or_generate_thumbnail_for_label`:

1. Under concurrent gen for the same (source_image, label) pair the loser
   could either (a) hit the new UniqueConstraint and 500 with
   IntegrityError when its delete-loop ran before the winner's INSERT,
   or (b) destroy the winner's row + storage blob when the delete-loop
   ran after. Either outcome is wrong; both were possible because the
   code did `filter(label=label).delete()` then `thumbnails.create(...)`
   instead of a single atomic upsert.

2. The pre-save delete-loop also deleted the storage blob BEFORE the
   new `default_storage.save()`, which on storage backends that suffix
   colliding keys (FileSystemStorage default) left the row pointing at
   a path that the next save could overwrite in the same way the racer
   would. Belt-and-braces failure where the cleanup order itself was the
   bug.

Fix:

* Replace the delete-then-create with `update_or_create` keyed on
  `label`. Django serializes via the UniqueConstraint and, on collision,
  catches IntegrityError on INSERT and re-runs as UPDATE. No 500s, no
  destructive deletes of concurrent rows.
* Snapshot the prior `path` before the new save and clean it up after
  the row update only when the backend gave us a new key
  (`prior_path != thumbnail_path`). Backends that overwrite in place
  (boto3 with `AWS_S3_FILE_OVERWRITE=True`) skip the cleanup and avoid
  the file-deletion race entirely.
* Force-bump `thumb.last_modified` on the UPDATE branch via
  `QuerySet.update` because `auto_now_add=True` only fires on INSERT —
  without this, regen via source-mtime bump would re-trigger regen on
  every subsequent request as the field's pre_save callback preserved
  the existing (now-stale) value.

Tests:

* `test_thumbnail_regen_reuses_row_via_upsert`: delete the storage blob
  out from under a cached row, re-request, assert the row's PK is
  preserved (proving the UPDATE path was taken, not a fresh INSERT).
* `test_thumbnail_exists_newer_modified_source` updated to assert row
  reuse + freshness bump rather than the previous (incidental) PK
  change.

The `SourceImageThumbnail.last_modified` field semantics are still
muddled (see PR thread; rename or drop tracked separately).

Co-Authored-By: Claude <noreply@anthropic.com>
@netlify

netlify Bot commented Jun 4, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit dd40bae
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a3108a62e7a250008b76932

mihow and others added 8 commits June 10, 2026 17:26
…EAD, progressive JPEG, redirect caching, blank-path guard) into feat/thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>

# Conflicts:
#	ami/main/models.py
#	ami/main/tests.py
…ec width

The generator records the requested spec width on the row (see parent
branch), so the warm-check can use strict equality instead of the 2px
tolerance that compensated for PIL's aspect-ratio rounding. Legacy rows
that stored encoder output read as stale, fall back to the route URL,
and self-heal with a single regeneration.

Co-Authored-By: Claude <noreply@anthropic.com>
The test pinned a Pillow encoder flag rather than behavior we own; it
served its purpose during development.

Co-Authored-By: Claude <noreply@anthropic.com>
…facts

Several comment blocks added during review explained why a change was
correct relative to code that no longer exists. Keep only the constraints
a future reader needs (spec-width vs encoder output, auto_now_add bump,
presign-lifetime bound); drop the change-justification prose, which lives
in the commit history.

Co-Authored-By: Claude <noreply@anthropic.com>
…thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>
Same pass as the parent branch: keep the contract statements (prefetch
required, annotation required, strict width semantics) and drop the
review-era justification prose.

Co-Authored-By: Claude <noreply@anthropic.com>
… sweep

Co-Authored-By: Claude <noreply@anthropic.com>
…to feat/thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>

@loppear loppear left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor simplification/DRY suggested, but overall I agree this change is good and worth including with #1306

Comment thread ami/main/models.py Outdated
# ``thumb.width`` stores the requested spec width (see the generator), so
# strict equality works; legacy encoder-width rows read stale and self-heal.
width_match = thumb is not None and thumb.width == spec["width"]
warm = thumb is not None and thumb.path and width_match and not thumb.is_source_changed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this "warm" (thumb exists and is valid) check is implemented here (depending on a new annotation to the queryset) and in SourceImage.find_or_generate_thumbnail_for_label. In both cases we have already queried/loaded a source image and thumb (or None) objects, my inclination would be to extract the non-annotated version as a helper method e.g. SourceImage.thumbnail_is_valid(self, spec, thumb).

@mihow mihow Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for reviewing again @loppear! I implemented the singular helper as you suggested. I did add an extra guard that should ensure the thumbnails are prefetched and the thumbnail_is_valid check does not trigger n+1 queries. I'd like to integrate Django zen-queries into the project sometime this year to catch introductions of n+1 queries automatically, since that has been the main source of slow load times as the database grows

mihow added a commit that referenced this pull request Jun 16, 2026
Single-sources the 'is this thumbnail good?' predicate so the generator
gate and the warm/cold URL check (#1331) share one definition. The helper
takes (spec, thumb) and compares in memory; legacy encoder-width rows read
invalid and self-heal on next generation.

Co-Authored-By: Claude <noreply@anthropic.com>
…force prefetch

Extracts SourceImage.thumbnail_is_valid(spec, thumb) so the generator regen
gate and the warm/cold URL decision in thumbnail_urls share one in-memory
predicate (per Luke's review on #1331). With the predicate no longer needing
SQL, with_thumbnails() collapses to a plain prefetch_related('thumbnails').

thumbnail_urls now guards on prefetched thumbnails and raises if missing,
turning a forgotten prefetch into a loud error instead of a silent N+1
(TODO: drop once django-zen-queries enforces this globally). The guard
surfaced one un-prefetched caller, Event.first_capture(), now wired through
with_thumbnails().

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/thumbnails-serializer-direct-urls branch from 1ebb2d5 to 15677a2 Compare June 16, 2026 03:18
@mihow mihow force-pushed the feat/thumbnail-source-images branch from a80d20d to 63d103f Compare June 16, 2026 03:18
Base automatically changed from feat/thumbnail-source-images to main June 16, 2026 03:45
@mihow mihow changed the title Proposal: store and serve full thumbnail URLs instead of hitting server Serve direct thumbnail URLs to reduce load on main web server Jun 16, 2026
@mihow mihow changed the title Serve direct thumbnail URLs to reduce load on main web server Serve direct thumbnail URLs to reduce load on main app server Jun 16, 2026
…lizer-direct-urls

# Conflicts:
#	ami/main/api/serializers.py
#	ami/main/models.py
#	ami/main/tests.py
Copilot AI review requested due to automatic review settings June 16, 2026 03:53
@netlify

netlify Bot commented Jun 16, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit dd40bae
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a3108a6c20bbf0008503613

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the captures/events API to emit direct storage thumbnail URLs for “warm” thumbnails (cached + matching current spec + source not changed) while keeping the existing redirect endpoint URL for cold/stale thumbnails, reducing per-gallery HTTP fan-out to Django without introducing DB N+1s by enforcing a thumbnail prefetch contract.

Changes:

  • Add SourceImage.thumbnail_urls() (warm/cold URL selection) and SourceImageQuerySet.with_thumbnails() (prefetch contract) in the model layer.
  • Switch thumbnail serialization to delegate to SourceImage.thumbnail_urls() and update view querysets/prefetches to include .with_thumbnails().
  • Add tests covering warm-path direct URLs, stale conditions (width mismatch, legacy width, source-changed), and the prefetch requirement.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
ami/main/models.py Adds warm/cold thumbnail URL logic and a queryset helper to prefetch thumbnails; reuses the validity predicate for generation and serialization.
ami/main/api/serializers.py Routes the thumbnails field through SourceImage.thumbnail_urls() and documents the required prefetch contract.
ami/main/api/views.py Applies .with_thumbnails() to captures querysets and event example-captures prefetch to satisfy the serializer/model contract.
ami/main/tests.py Adds regression tests for warm direct URLs, stale fallbacks, source-changed handling, and the “requires prefetch” guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mihow

mihow commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

mihow and others added 2 commits June 15, 2026 21:15
The strict prefetch guard raised on every non-list serialization of a
SourceImage through the shared serializer — notably the create/update write
responses, which serialize a freshly-saved instance with no prefetch cache
(500 on POST /captures/). Replace the raise with a no-query fallback: when
thumbnails aren't prefetched, emit route URLs for every label without
querying. This keeps the zero-N+1 guarantee (never a per-object SELECT) while
not breaking single-object serialization; list endpoints still get warm
storage URLs via with_thumbnails(), pinned by the list query-count tests.

Co-Authored-By: Claude <noreply@anthropic.com>
…dth test

The runtime no longer raises when thumbnails are unprefetched (it falls back
to route URLs without querying), so the warm-list contract is enforced only by
tests. Add a multi-row test asserting a warm captures list serves direct
storage URLs while hitting the thumbnail table at most once — guards both a
dropped with_thumbnails() (route URLs, perf win lost) and a per-row lazy read
(N+1).

Merge the two strict-width-mismatch tests (settings-changed +100 and legacy
encoder rounding -1) into one subTest-parametrized case; they exercised the
same branch. Fix a stale comment that claimed thumbnail_urls still raises
without a prefetch.

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/models.py`:
- Around line 1435-1468: The _regroup_lock function has a race condition in its
finally block where the separate cache.get() and cache.delete() operations on
lock_key allow another worker to acquire the same lock between those two calls
if the TTL expires. Replace the two separate operations with a single atomic
compare-and-delete operation using a Lua script or Redis primitive (via
django_redis's script support) that verifies the token matches and deletes the
lock in one indivisible step, preventing the TTL expiration window from allowing
lock acquisition by another worker between the get and delete calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6ebd9a13-5f3f-4f60-a577-0dc6f1f464b0

📥 Commits

Reviewing files that changed from the base of the PR and between 870cf99 and dd40bae.

📒 Files selected for processing (4)
  • ami/main/api/serializers.py
  • ami/main/api/views.py
  • ami/main/models.py
  • ami/main/tests.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ami/main/api/views.py
  • ami/main/api/serializers.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/models.py`:
- Around line 1435-1468: The _regroup_lock function has a race condition in its
finally block where the separate cache.get() and cache.delete() operations on
lock_key allow another worker to acquire the same lock between those two calls
if the TTL expires. Replace the two separate operations with a single atomic
compare-and-delete operation using a Lua script or Redis primitive (via
django_redis's script support) that verifies the token matches and deletes the
lock in one indivisible step, preventing the TTL expiration window from allowing
lock acquisition by another worker between the get and delete calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6ebd9a13-5f3f-4f60-a577-0dc6f1f464b0

📥 Commits

Reviewing files that changed from the base of the PR and between 870cf99 and dd40bae.

📒 Files selected for processing (4)
  • ami/main/api/serializers.py
  • ami/main/api/views.py
  • ami/main/models.py
  • ami/main/tests.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ami/main/api/views.py
  • ami/main/api/serializers.py
🛑 Comments failed to post (1)
ami/main/models.py (1)

1435-1468: ⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Cache backend configuration ==\n'
rg -n -C3 'CACHES\s*=|django_redis|LocMemCache|MemcachedCache|RedisCache|CACHE_URL' \
  --glob '*.py' --glob '*.env*' --glob '*.yml' --glob '*.yaml'

printf '\n== Regroup task time limits / lock TTL ==\n'
rg -n -C4 'regroup_events|soft_time_limit|time_limit|REGROUP_LOCK_TTL_SECONDS' \
  --glob '*.py'

Repository: RolnickLab/antenna

Length of output: 31016


Make the regroup lock release atomic using a Lua script or Redis primitive.

The _regroup_lock function uses two separate cache operations (cache.get() then cache.delete() on lines 1466–1468) to verify and release the lock. This creates a race condition when a task approaches the 600-second TTL: if the task hits soft_time_limit (600 seconds) right as the key's TTL expires, another worker can acquire the same lock between the get and delete calls, and the finally block will then delete that new lock. Result: overlapping regroup runs.

Since the cache backend is Redis (django-redis), use a Lua script or Redis's atomic primitives (e.g., via django_redis's script support) to replace the separate get-and-delete with a single compare-and-delete operation that cannot be interrupted by TTL expiry.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/models.py` around lines 1435 - 1468, The _regroup_lock function has
a race condition in its finally block where the separate cache.get() and
cache.delete() operations on lock_key allow another worker to acquire the same
lock between those two calls if the TTL expires. Replace the two separate
operations with a single atomic compare-and-delete operation using a Lua script
or Redis primitive (via django_redis's script support) that verifies the token
matches and deletes the lock in one indivisible step, preventing the TTL
expiration window from allowing lock acquisition by another worker between the
get and delete calls.

@loppear loppear left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants