refactor: bound and harden Transifex help-sync error handling #792

Draft
cwillisf wants to merge 3 commits into master from harden-help-sync-error-handling

cwillisf (Contributor) commented Jun 23, 2026

Resolves

  • No tracked issue. Addresses recurring failures of the Daily Help Update scheduled workflow, which had been ending either as a fast failure on a Transifex 502 or a six-hour job cancellation (example runs: 27810215828, 28006748291, 26935165394).

Proposed Changes

Both failure modes traced to one place. The Transifex SDK's ResourceStringsAsyncUpload.upload() creates an async upload and polls until its status is succeeded, with no timeout and no exit for a failed status. When an upload never reaches succeeded the poll loops forever (until the runner's 6-hour limit); when a poll hits a transient 502 it crashes the job with an unhandled rejection and no useful message. The commit it ran against was irrelevant, which is why the same commit produced both a 3-minute failure and a 6-hour cancellation on different days.

The first CI run on this branch confirmed the root cause and named the trigger: it failed in seconds on resource MemberQuestions_4000041438_json with parse_error: No strings could be extracted. A Freshdesk folder with no published articles reduces to empty content, Transifex rejects it, and (before the fix) that one upload hung the whole sync through the push fan-out's Promise.all.

  • txPush now does the create-then-poll itself so it can stop on a failed status and throw with the upload's reported errors, bound the wait with an overall deadline (fails fast on a stuck upload), and retry transient errors (5xx, 429, network blips) with backoff instead of dying on the first one.
  • tx-push-help now skips resources with no strings and emits a warning rather than pushing empty content, so an empty Freshdesk folder is treated as a content situation rather than a sync failure. emitWarning moved into a shared warnings module so the push script reports through the same WARNINGS_FILE channel CI already checks.
  • txPull gets the same transient retry/backoff and clearer per-resource error messages, plus a per-request download timeout.
  • Added timeout-minutes to the daily-help-update job as a last-resort backstop.
  • Quoted the runner env vars in the warnings step to satisfy shellcheck (separate commit, no behavior change).
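The transient-retry behavior described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `isTransient`, `withRetry`, the attempt count, the delays, and the assumption that errors carry a `status` or `code` property are all hypothetical.

```javascript
// Hypothetical sketch; names, defaults, and error shape are assumptions.

// Node-level error codes that usually indicate a network blip.
const NETWORK_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EAI_AGAIN', 'ECONNREFUSED']);

/** True for 5xx, 429, and common network-level errors. */
const isTransient = (err) => {
  const status = err && (err.status ?? err.statusCode);
  if (typeof status === 'number') {
    return status === 429 || (status >= 500 && status < 600);
  }
  return Boolean(err && NETWORK_CODES.has(err.code));
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Run `fn`, retrying transient failures with exponential backoff.
 * Non-transient errors, and the last transient one, are rethrown as-is.
 */
const withRetry = async (fn, { attempts = 4, baseDelayMs = 1000, wait = sleep } = {}) => {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts || !isTransient(err)) throw err;
      await wait(baseDelayMs * 2 ** (attempt - 1)); // 1s, 2s, 4s, ...
    }
  }
};
```

Taking `wait` as an option keeps the backoff testable without real delays; a production version would likely also add jitter.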

The txPush/txPull changes live in the shared lib, so tx-push-src and the pull:* scripts benefit too.

Out of scope, noted for later: daily-tx-pull.yml has the same hang exposure through the SDK's download poll and no timeout-minutes; the push fan-out uses an unbounded Promise.all rather than the existing poolMap; and the push side has no equivalent of the pull side's stale-resource detection (a TX resource whose Freshdesk folder was deleted). Each is a reasonable follow-up.
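For the fan-out follow-up, a bounded-concurrency map has roughly the shape below. This is only an illustrative sketch: `mapWithLimit` is a hypothetical name, and the repository's existing poolMap may have a different signature and error behavior.

```javascript
// Hypothetical helper, not the repository's poolMap.

/**
 * Map `worker` over `items` with at most `limit` calls in flight.
 * Results keep the order of `items`. The first rejection rejects the
 * returned promise; already-started workers are not cancelled.
 */
const mapWithLimit = async (items, limit, worker) => {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  const runner = async () => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  };

  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
};
```

Compared with a bare `Promise.all(items.map(worker))`, this caps how many uploads are in flight at once.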

cwillisf added 3 commits June 23, 2026 07:00
Quote $GITHUB_OUTPUT and $GITHUB_STEP_SUMMARY to satisfy shellcheck
(SC2086). No behavior change; the runner sets these to space-free paths.

The daily help sync had been ending one of two ways regardless of the
commit it ran against: a fast failure on a Transifex 502, or a six-hour
job cancellation. Both trace to one place. The SDK's
ResourceStringsAsyncUpload.upload() creates an async upload and then
polls until the status is 'succeeded', with no timeout and no exit for a
'failed' status. When an upload never reaches 'succeeded', that poll
loops forever (until the runner's 6-hour limit); when a poll happens to
hit a transient 502, it crashes the job with an unhandled rejection and
no useful message.

txPush now does the create-then-poll itself so it can:
- stop on a 'failed' status and throw with the upload's reported errors,
  so we learn why an upload was rejected instead of waiting forever;
- bound the wait with an overall deadline, failing fast on a stuck upload;
- retry transient errors (5xx, 429, network blips) with backoff rather
  than dying on the first blip.
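
A minimal sketch of the bounded create-then-poll loop, for illustration only. `createUpload` and `getUpload` are hypothetical stand-ins for whatever calls start an async upload and fetch its current state; the upload object's `id`, `status`, and `errors` fields and the default timings are likewise assumptions, not the real SDK surface.

```javascript
// Hypothetical sketch; function names, object shape, and defaults are assumptions.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Start an upload, then poll until it succeeds, fails, or the overall
 * deadline passes. `now` and `wait` are injectable so the loop can be
 * tested without real time passing.
 */
const pushWithDeadline = async (
  { createUpload, getUpload },
  { deadlineMs = 5 * 60 * 1000, pollIntervalMs = 2000, now = Date.now, wait = sleep } = {}
) => {
  const deadline = now() + deadlineMs;
  let upload = await createUpload();
  for (;;) {
    if (upload.status === 'succeeded') return upload;
    if (upload.status === 'failed') {
      // Surface the reported errors instead of polling forever.
      throw new Error(`Upload ${upload.id} failed: ${JSON.stringify(upload.errors)}`);
    }
    if (now() >= deadline) {
      throw new Error(`Upload ${upload.id} still '${upload.status}' after ${deadlineMs} ms`);
    }
    await wait(pollIntervalMs);
    upload = await getUpload(upload.id);
  }
};
```

The two `throw` branches are the exits the unbounded loop lacked: one for a rejected upload, one for an upload that never settles.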

The same retry/backoff and clearer error reporting are applied to
txPull's download path, plus a per-request download timeout. A
timeout-minutes backstop on the workflow guards against any remaining
unexpected hang (including the SDK's download poll, which has the same
unbounded shape).

A Freshdesk folder with no published articles reduces to an empty string
set. Uploading that made Transifex reject it with 'No strings could be
extracted', which (before the bounded poll) left the upload hanging and,
via the push fan-out's Promise.all, took the whole sync down. The first
CI run on this branch confirmed it: the job failed in seconds on resource
MemberQuestions_4000041438_json with exactly that parse error.

Skip resources with no strings and emit a warning instead of pushing
empty content. An empty folder is a content situation, not a sync
failure, so it should not fail (or stall) the run.
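
The skip-and-warn guard might look roughly like this. It is a sketch under assumptions: the `{ name, strings }` resource shape and the `partitionResources` name are hypothetical, and `warn` stands in for the emitWarning helper, whose real signature is not shown here.

```javascript
// Hypothetical sketch; resource shape and function name are assumptions.

/**
 * Return only the resources that have at least one string, calling
 * `warn` once for each resource that is skipped.
 */
const partitionResources = (resources, warn) => {
  const toPush = [];
  for (const resource of resources) {
    if (Object.keys(resource.strings ?? {}).length === 0) {
      warn(`Skipping ${resource.name}: no strings to push`);
      continue;
    }
    toPush.push(resource);
  }
  return toPush;
};
```

Filtering before the fan-out means an empty resource never reaches the upload call at all.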

Extracts the existing emitWarning helper from help-utils into a shared
warnings module so the push script reports through the same WARNINGS_FILE
channel that CI already checks.