[TESTING] Combined #2177 + #2193 for event-log-race-repro#2197
Draft
VaguelySerious wants to merge 23 commits into
Draft
[TESTING] Combined #2177 + #2193 for event-log-race-repro#2197VaguelySerious wants to merge 23 commits into
VaguelySerious wants to merge 23 commits into
Conversation
…rded value A reused/duration sleep races a `wait_completed` replay against a non-deterministic, wall-clock-derived expected value, producing a false CorruptedEventLogError on a perfectly consistent event log. `sleep(<ms|string>)` computes its resumeAt as `Date.now() + duration` (see parseDurationToDate). The original run records that absolute timestamp into both wait_created and wait_completed. During replay the VM clock advances to each event's createdAt, so a freshly-created sleep recomputes a *different* absolute resumeAt. Normally harmless: the wait_created consumer overwrites the queue item's resumeAt with the recorded (authoritative) value before wait_completed is validated. The bug: when a wait_completed is consumed by a sleep consumer that never applied a wait_created (hasCreatedEvent=false), the queue item still holds the freshly-recomputed value, and the resumeAt comparison fails — even though the event log is internally consistent and the recorded resumeAt is the source of truth. Captured in production stress runs: hasCreatedEvent=false with ~18-42s deltas between the recomputed and recorded resumeAt. Fix: only validate resumeAt when an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail. Extracted the validation into `detectResumeAtMismatch`, which also lowers the consumer callback's cognitive-complexity warning (33 -> 21). Adds a regression test that advances the replay clock (via updateTimestamp) and asserts a consistent wait_completed with hasCreatedEvent=false no longer raises CorruptedEventLogError. Pre-existing stable bug; independent of the hook-vs-sleep race fix (#2171) and of the server-side work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the missing patch changeset for the `@workflow/core` wait_completed resumeAt replay fix so it gets a version bump and changelog entry. Also remove the fixed 250ms grace timer from the new regression test: it now races the error-vs-resolve outcomes directly, so a regression surfaces deterministically (error branch, or a hang caught by the test timeout) rather than via a flaky race against a wall-clock guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hasCreatedEvent gate that skips the resumeAt equality check (to avoid a false mismatch against a non-deterministic recomputed value) also skipped the malformed-value check. A non-finite / unparseable resumeAt is corrupt data regardless of whether an authoritative recorded value exists — the original run always records a valid parseDurationToDate(...) Date, so a consistent log never produces one. Flag it unconditionally, before the hasCreatedEvent gate; finite-value equality stays gated as before. Adds a regression test asserting an Invalid Date resumeAt raises even with hasCreatedEvent=false (the counterpart to the consistent-replay test, which must keep resolving). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ist + infra breakdown Rework the PR-comment renderer so a human can immediately see what gates the job and inspect every failing run: - 🚨 Event-Log Regressions table lists *every* gating run in full (never truncated), each with its duration, a synthesised detail line, and a direct dashboard link. Stuck runs render "no terminal state after <ms>". - Infra (non-gating) section groups harness noise by error code with a plain-language explanation and example run links, instead of flooding one table with thousands of rows. - Headline names the regression count and digests the infra noise (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH"). Adds unit coverage for the breakdown, message synthesis and the never-truncate-regressions guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Wielander <peter.wielander@vercel.com>
Forces ENFORCE_STRICT_CONCURRENCY behaviour on regardless of the env var so all CI e2e jobs exercise per-run flow topics + maxConcurrency:1. Skips the three off-by-default unit assertions. Drop this commit before merge.
On run-poll timeout, fetch the run's event log and record the latest event (type, step name, elapsed) as the stuck run's errorMessage. The summary's regression table then shows "stuck after N events; latest step_started (foo) at +12.3s" with a dashboard link, instead of only a duration — so a human can see where the run wedged without opening every link. Best-effort; falls back to the duration-only note if the event fetch fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ep-resumeat-replay # Conflicts: # .github/scripts/render-event-log-race-repro-results.js # .github/scripts/render-event-log-race-repro-results.test.js
A run flagged at the 150s poll budget can simply be slow on a loaded preview deployment — observed wrun_…EFDZ9 completed shortly after the harness gave up and was wrongly gated as `stuck`. Add a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome (completed → non-gating `SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class). Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural agreement fixes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e they occur
Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not
found`. Both came from harness-side network calls to the deployment, not the
SDK. A single dropped connection should never abort tracking an otherwise
healthy run.
- Add `withRetry` (linear backoff, transient-network detection) and apply it to
the harness network calls: getWorkflowMetadata, start, resumeHook, and the
run-status poll. On final failure the error is prefixed with the call site
(e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra
breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just
retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the
hook-propagation timeout is identifiable in the summary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ist + infra breakdown Rework the PR-comment renderer so a human can immediately see what gates the job and inspect every failing run: - 🚨 Event-Log Regressions table lists *every* gating run in full (never truncated), each with its duration, a synthesised detail line, and a direct dashboard link. Stuck runs render "no terminal state after <ms>". - Infra (non-gating) section groups harness noise by error code with a plain-language explanation and example run links, instead of flooding one table with thousands of rows. - Headline names the regression count and digests the infra noise (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH"). Adds unit coverage for the breakdown, message synthesis and the never-truncate-regressions guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8f41186)
On run-poll timeout, fetch the run's event log and record the latest event (type, step name, elapsed) as the stuck run's errorMessage. The summary's regression table then shows "stuck after N events; latest step_started (foo) at +12.3s" with a dashboard link, instead of only a duration — so a human can see where the run wedged without opening every link. Best-effort; falls back to the duration-only note if the event fetch fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit dee4370)
A run flagged at the 150s poll budget can simply be slow on a loaded preview deployment — observed wrun_…EFDZ9 completed shortly after the harness gave up and was wrongly gated as `stuck`. Add a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome (completed → non-gating `SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class). Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural agreement fixes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 31d5b99)
…e they occur
Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not
found`. Both came from harness-side network calls to the deployment, not the
SDK. A single dropped connection should never abort tracking an otherwise
healthy run.
- Add `withRetry` (linear backoff, transient-network detection) and apply it to
the harness network calls: getWorkflowMetadata, start, resumeHook, and the
run-status poll. On final failure the error is prefixed with the call site
(e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra
breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just
retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the
hook-propagation timeout is identifiable in the summary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit a9b68c0)
…issing field
A failed WorkflowRun exposes its reason as `error: { code, message }` and has no
top-level `errorCode`, so the poller's `classifyFailure(runData.errorCode)` was
always passing `undefined` — collapsing every polled failure to an
uncategorised, detail-less `other`. Read `runData.error.code`/`.message` so
USER_ERROR/RUNTIME_ERROR/CORRUPTED_EVENT_LOG are classified correctly and the
regression row shows why the run failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d2a59d4)
…issing field
A failed WorkflowRun exposes its reason as `error: { code, message }` and has no
top-level `errorCode`, so the poller's `classifyFailure(runData.errorCode)` was
always passing `undefined` — collapsing every polled failure to an
uncategorised, detail-less `other`. Read `runData.error.code`/`.message` so
USER_ERROR/RUNTIME_ERROR/CORRUPTED_EVENT_LOG are classified correctly and the
regression row shows why the run failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d2a59d4)
…t-concurrency # Conflicts: # packages/core/e2e/event-log-race-repro.test.ts
…eplay' into peter/repro-combined-2177-2193
🦋 Changeset detectedLatest commit: 72540fe The changes in this PR will be included in the next version bump. This PR includes changesets to release 17 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Contributor
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (2 failed)hono (1 failed):
nuxt (1 failed):
🌍 Community Worlds (87 failed)mongodb (11 failed):
redis (9 failed):
turso-dev (1 failed):
turso (66 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details.
Check the workflow run for details. |
Contributor
Contributor
Event Log Race Repro🚨 2 of 2000 latest repro runs hit event-log regressions — see the linked runs below. 🚨 Event-Log Regressions (2)These gate the job. Each row links to the workflow run on the dashboard.
Run History
Latest Scenario Breakdown
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Testing-only PR — do not merge. Combines all commits from #2177 (
fix(core): wait_completed.resumeAtreplay fix) and #2193 (ENFORCE_STRICT_CONCURRENCY) so the event-log-race-repro job runs against both changes together, on top of the full CI-harness improvements (actionable summary, slow≠stuck grace, fetch retries, and structured-error failure classification).Goal: see whether the combination reproduces the
CORRUPTED_EVENT_LOGobserved on #2193 (now correctly classified rather than masked asother), and whether the two fixes interact.Sources:
nate/fix-reused-sleep-resumeat-replaypeter/enforce-strict-concurrency🤖 Generated with Claude Code