
Orchestrate DOAB backfill drain + serialise harvest under shared flock #35

Open
rdhyee wants to merge 1 commit into master from feature/doab-backfill-orchestration

rdhyee commented May 19, 2026

Companion to Gluejar/regluit#1152. Adds the production orchestration to actually drain the ~20.6k-record DOAB backfill and serialises it with the existing nightly harvest (#31, Gluejar/regluit#1129) under a shared host-local lock — so harvest + backfill can never hit DOAB OAI concurrently on this host. "Slow-and-gentle" enforced by code (flock) + halt-aware markers, not by hope.

What's in this PR (1 commit, 83b1bf8)

New artifacts under roles/regluit_prod/:

| Path | Role |
| --- | --- |
| templates/doab-backfill.sh.j2 | One bounded backfill_doab pass per cron tick. Holds /var/lock/doab-oai.lock via flock -n for the whole pass. Honors the command's 0/3/4 exit codes via .done / .halted markers (the halt circuit-breaker is honored ACROSS ticks, not just within one run). Deliberately NOT set -e (must inspect rc); never exits non-zero (Eric treats cron mail as a signal, so failure surfaces via .halted + log). |
| templates/doab-harvest.sh.j2 | The existing nightly load_doab run, now wrapped in the same flock. Intentionally thin: load_doab.handle() reads the Retry-After sentinel natively (race-free per the regluit-side commits), so the wrapper does not duplicate that logic. |
| tasks/doab.yml | NEW. Owns all DOAB cron. Creates the /var/lib/regluit/doab-backfill state dir, installs both scripts, schedules harvest (04:30) and backfill (every :00/:30). All gated on deploy_type == 'prod'. Documents the CROSS-HOST INVARIANT vs doab-check. |
| tasks/cron.yml | The inline #31 harvest cron entry is removed; ownership moves to doab.yml so harvest + backfill + their shared flock live together. A pointer comment is left. |
| tasks/main.yml | Imports doab.yml between cron.yml and log_management.yml so existing 30-day log rotation still covers the new doab-*.log. |
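The marker handling described for doab-backfill.sh.j2 can be sketched roughly as follows. This is an illustration, not the template itself. The PR does not spell out what each exit code means, so the mapping used here (0 = worklist drained, 3 = pass skipped or deferred, 4 = halt) is an assumption. `backfill_tick` and `run_backfill_pass` are hypothetical names, and the stub stands in for the real `backfill_doab` command.

```shell
#!/usr/bin/env bash
# Sketch only. Exit-code meanings are ASSUMED (0 drained, 3 skipped, 4 halt).
# Deliberately no `set -e`: the wrapper has to inspect the command's rc itself.

# Real path per the PR is /var/lib/regluit/doab-backfill; a temp dir keeps this runnable.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# Hypothetical stand-in for one bounded backfill_doab pass; returns the rc passed as $1.
run_backfill_pass() { return "$1"; }

# One cron tick. Always returns 0 so cron never mails; failures surface via markers and log.
backfill_tick() {
    local simulated_rc="$1" rc
    # Markers are honored across ticks: once either exists, later ticks do nothing.
    if [ -e "$STATE_DIR/.done" ] || [ -e "$STATE_DIR/.halted" ]; then
        echo "marker present, skipping"
        return 0
    fi
    run_backfill_pass "$simulated_rc"
    rc=$?
    case "$rc" in
        0) touch "$STATE_DIR/.done" ;;                         # assumed: drained
        3) echo "pass skipped or deferred; retry next tick" ;; # assumed: no marker
        4) touch "$STATE_DIR/.halted" ;;                       # assumed: circuit-breaker
        *) echo "unexpected rc=$rc; no marker written" ;;
    esac
    return 0
}

backfill_tick 3   # no marker written
backfill_tick 4   # writes .halted
backfill_tick 0   # frozen by .halted, so .done is never written
```

In this sketch an operator re-arms the job by removing the marker file, which matches the "operator clears markers to re-arm" note in the commit message.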

Cadence: ~20.6k records at 500 per pass is about 42 ticks, or roughly 21 hours (≈ 1 day) at one tick every 30 minutes. That is the intended slow drip. Self-disables via .done once drained; freezes via .halted on a circuit-breaker.
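The cadence estimate can be checked with a few lines of shell arithmetic. The record count is the approximate figure quoted above, and the calculation assumes the best case where every tick completes a full 500-record pass (no lock contention, no skipped or halted ticks).

```shell
#!/usr/bin/env bash
# Best-case drain time for the backfill, using the PR's approximate figures.
records=20600       # ~20.6k records (approximate)
per_pass=500        # records handled per cron tick
tick_minutes=30     # backfill fires at :00 and :30

ticks=$(( (records + per_pass - 1) / per_pass ))   # ceiling division
total_minutes=$(( ticks * tick_minutes ))

echo "ticks=$ticks hours=$(( total_minutes / 60 ))"
```

Any tick that loses the lock to the nightly harvest pushes the finish time out by 30 minutes.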

Approval status

3 Codex code-review rounds, final: LGTM.

| Round | Outcome |
| --- | --- |
| R1 (plan) | 3 corrections: set +e rc-capture, skip paths must exit 3 not 0 (forced a command-side change tracked in Gluejar/regluit#1152), explicit dir/install Ansible tasks. All folded in. |
| R1 (code) | Verified clean: rc capture, exit mapping, .done only on rc 0, discovery-ban writes no .done, flock fd lifetime, .halted freeze. Two findings: harvest wrapper sentinel check, cross-host concurrency. |
| R2 | Finding-1 confirmed non-defect: load_doab.handle() reads the sentinel natively (proven inline). Finding-2 (cross-host) addressed by design (doab-check runner is disabled-by-default + invariant documented + minute-offset stagger). Caught new real defect: existing doab-check/scripts/doab_load.sh had no flock → harvest+backfill could collide on the doab-check host. |
| R3 | LGTM: the doab_load.sh patch in EbookFoundation/doab-check#19 closed the collision. |

What deserves human eyes (not Codex's domain)

  1. Replacing the inline #31 harvest cron (Gluejar/regluit#1129) with a templated wrapper. This is a small behavioural delta: the harvest now skips a tick if the lock is held, and the 3-day rolling window self-heals on the next clear night, per the script's own pre-existing comments. This is the price of the slow-and-gentle invariant.

  2. The CROSS-HOST INVARIANT is operator-managed (no distributed lock). regluit drains first (.done), then doab-check is armed by setting IDS_FILE in its crontab. Building a distributed lock between an AWS box and a DO droplet for a one-off catch-up would be over-engineering against the slow-and-gentle posture.
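The skip-if-locked behaviour from item 1 can be illustrated with a small sketch of the shared host-local lock. It assumes the `flock` utility from util-linux is installed. `with_doab_lock` is a hypothetical helper name, and a temp file stands in for /var/lock/doab-oai.lock. The lock is tied to file descriptor 9, so it is released when the subshell holding that descriptor exits.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and temp lock path are illustrative, not from the templates.
LOCK_FILE="${LOCK_FILE:-$(mktemp)}"   # real path per the PR: /var/lock/doab-oai.lock

# Run "$@" while holding the lock; if another process holds it, skip this tick quietly.
with_doab_lock() {
    (
        flock -n 9 || { echo "lock held, skipping tick"; exit 0; }
        "$@"
    ) 9>"$LOCK_FILE"
}

# Demo: a background "backfill" holds the lock while a "harvest" tick fires.
with_doab_lock sleep 3 &
sleep 1
during=$(with_doab_lock echo "harvest ran")   # lock is busy, so this tick is skipped
wait
after=$(with_doab_lock echo "harvest ran")    # lock is free again

echo "during: $during"
echo "after:  $after"
```

This only serialises processes on one host. As item 2 says, sequencing between the regluit host and doab-check stays an operator responsibility.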

Backlinks

🤖 Generated with Claude Code (Opus 4.7, 1M context) — multi-round Codex CLI adversarial review

Adds the production orchestration for the one-off ~20.6k DOAB backfill
(refs Gluejar/regluit#1151) and serialises it with the nightly harvest
(provisioning#31) under a single host-local lock. Slow-and-gentle is
enforced by code (flock) + halt-aware markers, not by hope.

New artifacts under roles/regluit_prod/:

- templates/doab-backfill.sh.j2: one bounded backfill_doab pass per cron
  tick. Honors the command's 0/3/4 exit-code contract via .done/.halted
  marker files (halt circuit-breaker honored ACROSS ticks). Holds
  /var/lock/doab-oai.lock via flock -n; never exits non-zero (Eric treats
  cron mail as a signal — failure surfaces via .halted + log).
  Deliberately NOT `set -e` so we can inspect the command's rc.

- templates/doab-harvest.sh.j2: nightly load_doab, now wrapped in the
  SAME /var/lock/doab-oai.lock so harvest + backfill cannot hit DOAB OAI
  concurrently on this host. Intentionally thin — load_doab.handle()
  reads the Retry-After sentinel natively (race-free per c9221e2 / the
  parallel regluit fix), so the wrapper doesn't duplicate that logic.

- tasks/doab.yml: NEW, owns all DOAB cron. Creates /var/lib/regluit/
  doab-backfill state dir, installs the two wrapper scripts, schedules
  the harvest (04:30) and backfill (every :00 + :30). Production-gated
  (deploy_type == 'prod'). Documents the CROSS-HOST INVARIANT vs
  doab-check (separate DO host on the same DOAB endpoint; only one
  host's backfill may be armed at a time; regluit drains first).

- tasks/cron.yml: the inline #31 harvest cron entry is removed; ownership
  moves to doab.yml so harvest + backfill + their shared flock live
  together. A pointer comment is left in cron.yml.

- tasks/main.yml: imports doab.yml between cron.yml and log_management.yml
  so the existing 30-day log rotation still covers the new doab-*.log.

Cadence: ~20.6k / 500 per pass ≈ 40+ ticks ≈ ~1 day at every-30-min — the
intended slow drip. Self-disables via .done once drained; freezes via
.halted on a halt; operator clears markers to re-arm.

Codex-reviewed end-to-end (3 rounds, final: LGTM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee commented May 20, 2026

2026-05-20 status update:

  • The command-side work has moved again since this PR was opened: Gluejar/regluit#1152 (Add backfill_doab management command, refs #1151) now emits reusable .active / .deleted snapshots and can consume them via --use-active-file / --use-deleted-file.
  • Test-machine validation now supports the "one OAI crawl, multiple machines catch up" story: test.unglue.it produced the universal snapshot, same-host reuse produced a byte-identical worklist without DOAB calls, and doab-check consumed that same snapshot via an operator-side diff script.
  • This provisioning PR may therefore be simplified before merge depending on Eric's scope decision. If we keep the full regluit consumer path, the current orchestration still makes sense. If we trim regluit#1152 to producer-only, this PR should be adjusted to run the simpler producer/backfill flow plus any operator-script handoff rather than assuming every reusable path is permanent product code.
  • The slow-and-gentle invariant still matters either way: host-local flock prevents regluit harvest/backfill overlap, and cross-host sequencing remains operator-managed (regluit first, then doab-check).

So this PR is still the right place to preserve the orchestration design, but it should not be treated as final until the merge-scope decision on Gluejar/regluit#1152 is made.
