refactor: bound and harden Transifex help-sync error handling #792
Draft
cwillisf wants to merge 3 commits into
Quote $GITHUB_OUTPUT and $GITHUB_STEP_SUMMARY to satisfy shellcheck (SC2086). No behavior change; the runner sets these to space-free paths.
The daily help sync had been ending one of two ways regardless of the commit it ran against: a fast failure on a Transifex 502, or a six-hour job cancellation. Both trace to one place. The SDK's ResourceStringsAsyncUpload.upload() creates an async upload and then polls until the status is 'succeeded', with no timeout and no exit for a 'failed' status. When an upload never reaches 'succeeded', that poll loops forever (until the runner's 6-hour limit); when a poll happens to hit a transient 502, it crashes the job with an unhandled rejection and no useful message.

txPush now does the create-then-poll itself so it can:
- stop on a 'failed' status and throw with the upload's reported errors, so we learn why an upload was rejected instead of waiting forever;
- bound the wait with an overall deadline, failing fast on a stuck upload;
- retry transient errors (5xx, 429, network blips) with backoff rather than dying on the first blip.

The same retry/backoff and clearer error reporting are applied to txPull's download path, plus a per-request download timeout. A timeout-minutes backstop on the workflow guards against any remaining unexpected hang (including the SDK's download poll, which has the same unbounded shape).
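For illustration, the three behaviors listed above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `UploadState`, `fetchState`, `waitForUpload`, `withRetry`, `isTransient`, and the `statusCode`/`code` error fields are all hypothetical names, and the default deadline, interval, and retry counts are placeholders rather than the values actually chosen.

```typescript
// Hypothetical shapes: none of these names come from the Transifex SDK.
type UploadStatus = "pending" | "processing" | "succeeded" | "failed";

interface UploadState {
  status: UploadStatus;
  errors?: { code: string; detail: string }[];
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: transient failures carry either an HTTP `statusCode`
// (429 or 5xx) or a Node-style network error `code`.
const NETWORK_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EAI_AGAIN"]);

function isTransient(err: unknown): boolean {
  const e = (err ?? {}) as { statusCode?: number; code?: string };
  if (typeof e.statusCode === "number") {
    return e.statusCode === 429 || e.statusCode >= 500;
  }
  return typeof e.code === "string" && NETWORK_CODES.has(e.code);
}

// Retry `fn` on transient errors with exponential backoff;
// non-transient errors and the final failed attempt are rethrown.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts || !isTransient(err)) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Poll until 'succeeded', stop on 'failed', and give up at the deadline.
async function waitForUpload(
  fetchState: () => Promise<UploadState>,
  { deadlineMs = 10 * 60 * 1000, intervalMs = 2000, retryBaseMs = 500 } = {},
): Promise<void> {
  const deadline = Date.now() + deadlineMs;
  while (true) {
    const state = await withRetry(fetchState, 4, retryBaseMs);
    if (state.status === "succeeded") return;
    if (state.status === "failed") {
      const details = (state.errors ?? [])
        .map((e) => `${e.code}: ${e.detail}`)
        .join("; ");
      throw new Error(`Upload failed: ${details || "no error details reported"}`);
    }
    // Stop before sleeping if the next poll would land past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(
        `Upload still '${state.status}'; did not finish within ${deadlineMs} ms`,
      );
    }
    await sleep(intervalMs);
  }
}
```

Keeping the retry wrapper separate from the poll loop means the same wrapper could also cover a download path, which is the shape the commit message describes for txPull.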
A Freshdesk folder with no published articles reduces to an empty string set. Uploading that made Transifex reject it with 'No strings could be extracted', which (before the bounded poll) left the upload hanging and, via the push fan-out's Promise.all, took the whole sync down. The first CI run on this branch confirmed it: the job failed in seconds on resource MemberQuestions_4000041438_json with exactly that parse error. Skip resources with no strings and emit a warning instead of pushing empty content. An empty folder is a content situation, not a sync failure, so it should not fail (or stall) the run. Extracts the existing emitWarning helper from help-utils into a shared warnings module so the push script reports through the same WARNINGS_FILE channel that CI already checks.
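The skip-and-warn behavior described above might look roughly like this. It is a sketch under assumptions: `planPush`, `StringMap`, and the `emitWarning` callback signature are hypothetical stand-ins, since the real helper's signature and the push script's data shapes are not shown here, and the resource names in the usage are invented.

```typescript
// Hypothetical shapes; the real push script's types are not shown in this PR text.
type StringMap = Record<string, string>;
type EmitWarning = (message: string) => void;

interface PushPlan {
  toPush: { resource: string; strings: StringMap }[];
  skipped: string[];
}

// Split resources into those worth uploading and those with nothing to upload.
// Empty resources produce a warning instead of an upload attempt.
function planPush(
  resources: Record<string, StringMap>,
  emitWarning: EmitWarning,
): PushPlan {
  const plan: PushPlan = { toPush: [], skipped: [] };
  for (const [resource, strings] of Object.entries(resources)) {
    if (Object.keys(strings).length === 0) {
      emitWarning(`Skipping ${resource}: no strings to push`);
      plan.skipped.push(resource);
      continue;
    }
    plan.toPush.push({ resource, strings });
  }
  return plan;
}

// Usage with made-up resource names and an in-memory warning sink.
const collected: string[] = [];
const examplePlan = planPush(
  { Empty_json: {}, Faq_json: { "faq.title": "Frequently asked questions" } },
  (message) => { collected.push(message); },
);
```

Passing the warning sink in as a parameter keeps the filtering step easy to exercise without touching the file that CI inspects.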
Resolves

Failures of the `Daily Help Update` scheduled workflow, which had been ending either as a fast failure on a Transifex 502 or as a six-hour job cancellation (example runs: 27810215828, 28006748291, 26935165394).

Proposed Changes

Both failure modes traced to one place. The Transifex SDK's `ResourceStringsAsyncUpload.upload()` creates an async upload and polls until its status is `succeeded`, with no timeout and no exit for a `failed` status. When an upload never reaches `succeeded`, the poll loops forever (until the runner's 6-hour limit); when a poll hits a transient 502, it crashes the job with an unhandled rejection and no useful message. The commit it ran against was irrelevant, which is why the same commit produced both a 3-minute failure and a 6-hour cancellation on different days.

The first CI run on this branch confirmed the root cause and named the trigger: it failed in seconds on resource `MemberQuestions_4000041438_json` with `parse_error: No strings could be extracted`. A Freshdesk folder with no published articles reduces to empty content, Transifex rejects it, and (before the fix) that one upload hung the whole sync through the push fan-out's `Promise.all`.

- `txPush` now does the create-then-poll itself so it can stop on a `failed` status and throw with the upload's reported `errors`, bound the wait with an overall deadline (fails fast on a stuck upload), and retry transient errors (5xx, 429, network blips) with backoff instead of dying on the first one.
- `tx-push-help` now skips resources with no strings and emits a warning rather than pushing empty content, so an empty Freshdesk folder is treated as a content situation rather than a sync failure.
- `emitWarning` moved into a shared `warnings` module so the push script reports through the same `WARNINGS_FILE` channel CI already checks.
- `txPull` gets the same transient retry/backoff and clearer per-resource error messages, plus a per-request download timeout.
- Added `timeout-minutes` to the `daily-help-update` job as a last-resort backstop.

The `txPush`/`txPull` changes live in the shared lib, so `tx-push-src` and the `pull:*` scripts benefit too.

Out of scope, noted for later:

- `daily-tx-pull.yml` has the same hang exposure through the SDK's download poll and no `timeout-minutes`;
- the push fan-out uses an unbounded `Promise.all` rather than the existing `poolMap`;
- the push side has no equivalent of the pull side's stale-resource detection (a TX resource whose Freshdesk folder was deleted).

Each is a reasonable follow-up.
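As a rough picture of the fan-out follow-up, a bounded-concurrency map could take the place of a bare `Promise.all`. The repo's existing `poolMap` is not shown here and its signature is unknown, so `mapWithLimit` below is a hypothetical stand-in written only to illustrate the idea; the argument order and the limit value are assumptions.

```typescript
// Hypothetical helper; not the repo's poolMap, whose signature is not shown here.
// Runs `worker` over `items` with at most `limit` calls in flight at once and
// returns results in input order. A rejection from any worker rejects the
// returned promise; lanes that are already running are not cancelled.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed index until none remain.
  const runLane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  };

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

With a limit in place, one slow or rejected upload still fails the run, but the number of simultaneous requests to Transifex stays fixed instead of growing with the number of Freshdesk folders.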