Orchestrate DOAB backfill drain + serialise harvest under shared flock #35
Open
rdhyee wants to merge 1 commit into
Adds the production orchestration for the one-off ~20.6k DOAB backfill (refs Gluejar/regluit#1151) and serialises it with the nightly harvest (provisioning#31) under a single host-local lock. Slow-and-gentle is enforced by code (flock) + halt-aware markers, not by hope.

New artifacts under roles/regluit_prod/:

- templates/doab-backfill.sh.j2: one bounded backfill_doab pass per cron tick. Honors the command's 0/3/4 exit-code contract via .done/.halted marker files (halt circuit-breaker honored ACROSS ticks). Holds /var/lock/doab-oai.lock via flock -n; never exits non-zero (Eric treats cron mail as a signal, so failure surfaces via .halted + log). Deliberately NOT `set -e` so we can inspect the command's rc.
- templates/doab-harvest.sh.j2: nightly load_doab, now wrapped in the SAME /var/lock/doab-oai.lock so harvest + backfill cannot hit DOAB OAI concurrently on this host. Intentionally thin: load_doab.handle() reads the Retry-After sentinel natively (race-free per c9221e2 / the parallel regluit fix), so the wrapper doesn't duplicate that logic.
- tasks/doab.yml: NEW, owns all DOAB cron. Creates the /var/lib/regluit/doab-backfill state dir, installs the two wrapper scripts, schedules the harvest (04:30) and backfill (every :00 + :30). Production-gated (deploy_type == 'prod'). Documents the CROSS-HOST INVARIANT vs doab-check (separate DO host on the same DOAB endpoint; only one host's backfill may be armed at a time; regluit drains first).
- tasks/cron.yml: the inline #31 harvest cron entry is removed; ownership moves to doab.yml so harvest + backfill + their shared flock live together. A pointer comment is left in cron.yml.
- tasks/main.yml: imports doab.yml between cron.yml and log_management.yml so the existing 30-day log rotation still covers the new doab-*.log.

Cadence: ~20.6k / 500 per pass ≈ 40+ ticks ≈ ~1 day at every-30-min, which is the intended slow drip. Self-disables via .done once drained; freezes via .halted on a halt; operator clears markers to re-arm.
Codex-reviewed end-to-end (3 rounds, final: LGTM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
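The marker handling described in the commit message above can be sketched roughly as follows. This is not the shipped doab-backfill.sh.j2. The function names (`should_run`, `record_outcome`, `run_tick`) are hypothetical. The meanings assigned to the exit codes (0 = backlog drained, 3 = pass skipped, 4 = halt) are assumptions inferred from the description, and so is the choice to only log any other code.

```shell
#!/usr/bin/env bash
# Sketch only: persist one backfill pass's outcome as marker files so that
# later cron ticks honor it. Deliberately no `set -e`: the rc must be inspected.
# ASSUMED exit-code meanings (inferred, not confirmed by the source):
#   0 = backlog drained, 3 = pass skipped, 4 = halt circuit-breaker tripped.

# should_run STATE_DIR: succeed only when neither marker is present.
should_run() {
    local state_dir=$1
    [ ! -e "$state_dir/.done" ] && [ ! -e "$state_dir/.halted" ]
}

# record_outcome RC STATE_DIR: write the marker that matches the exit code.
record_outcome() {
    local rc=$1 state_dir=$2
    case "$rc" in
        0) touch "$state_dir/.done" ;;      # drained: self-disable
        3) : ;;                             # skipped: no marker, retry next tick
        4) touch "$state_dir/.halted" ;;    # halted: freeze until operator clears
        *) echo "unexpected rc=$rc" >&2 ;;  # ASSUMPTION: log only
    esac
    return 0
}

# run_tick STATE_DIR COMMAND...: one cron tick. Always returns 0 so that
# cron never sends mail on failure; failure is visible via .halted instead.
run_tick() {
    local state_dir=$1
    shift
    should_run "$state_dir" || return 0
    local rc=0
    "$@" || rc=$?
    record_outcome "$rc" "$state_dir"
}
```

With this shape, re-arming after a halt is just removing `.halted` from the state directory; the next tick then runs the command again.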
2026-05-20 status update:
So this PR is still the right place to preserve the orchestration design, but it should not be treated as final until the merge-scope decision on Gluejar/regluit#1152 is made.
Companion to Gluejar/regluit#1152. Adds the production orchestration to actually drain the ~20.6k-record DOAB backfill and serialises it with the existing nightly harvest (#31, Gluejar/regluit#1129) under a shared host-local lock — so harvest + backfill can never hit DOAB OAI concurrently on this host. "Slow-and-gentle" enforced by code (flock) + halt-aware markers, not by hope.
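A minimal sketch of the skip-if-busy locking described above, assuming util-linux `flock` is installed. The helper name `with_shared_lock` and the skip message are hypothetical; the real wrappers may structure this differently and would write to their log rather than stdout.

```shell
#!/usr/bin/env bash
# Sketch only: run a command while holding a shared lock file, or skip
# immediately if another process already holds it (flock -n = non-blocking).

# with_shared_lock LOCK_FILE COMMAND...
with_shared_lock() {
    local lock_file=$1
    shift
    (
        # fd 9 is opened on the lock file for the lifetime of this subshell,
        # so the lock is released automatically when the subshell exits.
        flock -n 9 || { echo "lock busy, skipping this tick"; exit 0; }
        "$@"
    ) 9>"$lock_file"
}
```

If both the harvest and the backfill wrapper call such a helper with the same path (`/var/lock/doab-oai.lock` in this PR), whichever starts second skips its tick instead of running concurrently.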
What's in this PR (1 commit, 83b1bf8)

New artifacts under `roles/regluit_prod/`:

- `templates/doab-backfill.sh.j2`: one bounded `backfill_doab` pass per cron tick. Holds `/var/lock/doab-oai.lock` via `flock -n` for the whole pass. Honors the command's 0/3/4 exit codes via `.done`/`.halted` markers (the halt circuit-breaker is honored ACROSS ticks, not just within one run). Deliberately NOT `set -e` (must inspect rc); never exits non-zero (Eric treats cron mail as a signal, so failure surfaces via `.halted` + log).
- `templates/doab-harvest.sh.j2`: nightly `load_doab` run, now wrapped in the same flock. Intentionally thin: `load_doab.handle()` reads the Retry-After sentinel natively (race-free per the regluit-side commits), so the wrapper does not duplicate that logic.
- `tasks/doab.yml`: creates the `/var/lib/regluit/doab-backfill` state dir, installs both scripts, schedules harvest (04:30) and backfill (every `:00`/`:30`). All `deploy_type == 'prod'`-gated. Documents the CROSS-HOST INVARIANT vs doab-check.
- `tasks/cron.yml`: the inline #31 harvest cron entry is removed; ownership moves to `doab.yml` so harvest + backfill + their shared flock live together. A pointer comment is left.
- `tasks/main.yml`: imports `doab.yml` between `cron.yml` and `log_management.yml` so existing 30-day log rotation still covers the new `doab-*.log`.

Cadence: ~20.6k / 500-per-pass ≈ 40+ ticks ≈ ~1 day at every-30-min, which is the intended slow drip. Self-disables via `.done` once drained; freezes via `.halted` on a circuit-breaker.

Approval status
3 Codex code-review rounds, final: LGTM.
- `set +e` rc-capture, skip paths must exit 3 not 0 (forced a command-side change tracked in Gluejar/regluit#1152), explicit dir/install Ansible tasks. All folded in.
- `.done` only on rc 0, discovery-ban writes no `.done`, flock fd lifetime, `.halted` freeze. Two findings: harvest wrapper sentinel check, cross-host concurrency.
- `load_doab.handle()` reads the sentinel natively (proven inline). Finding-2 (cross-host) addressed by design (doab-check runner is disabled-by-default + invariant documented + minute-offset stagger). Caught new real defect: existing `doab-check/scripts/doab_load.sh` had no flock → harvest+backfill could collide on the doab-check host. `doab_load.sh` in EbookFoundation/doab-check#19 closed the collision.

What deserves human eyes (not Codex's domain)
Replacing the inline #31 harvest cron (Automate DOAB harvest on nightly cadence, Gluejar/regluit#1129) with a templated wrapper is a small behavioural delta: the harvest now skips a tick if the lock is held, and the 3-day rolling window self-heals on the next clear night, per the script's own pre-existing comments. This is the price of the slow-and-gentle invariant.
The CROSS-HOST INVARIANT is operator-managed (no distributed lock). regluit drains first (`.done`), then doab-check is armed by setting `IDS_FILE` in its crontab. Building a distributed lock between an AWS box and a DO droplet for a one-off catch-up would be over-engineering against the slow-and-gentle posture.

Backlinks

5ed52c1c)

🤖 Generated with Claude Code (Opus 4.7, 1M context), multi-round Codex CLI adversarial review