From cd981c8385a6ab7fac48a37b6a39529cc0a0ace1 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
 <41898282+github-actions[bot]@users.noreply.github.com>
Date: Tue, 2 Jun 2026 10:19:26 +0000
Subject: [PATCH] [docs] Reduce noise in changelog files (#2075)

Signed-off-by: Peter Wielander <mittgfu@gmail.com>
---
 docs/content/docs/changelog/meta.json         |   2 +-
 .../docs/changelog/resilient-start.mdx        | 314 ++----------------
 2 files changed, 32 insertions(+), 284 deletions(-)

diff --git a/docs/content/docs/changelog/meta.json b/docs/content/docs/changelog/meta.json
index 0c01dff133..ba12d32964 100644
--- a/docs/content/docs/changelog/meta.json
+++ b/docs/content/docs/changelog/meta.json
@@ -1,5 +1,5 @@
 {
   "title": "Changelog",
-  "pages": ["index", "eager-processing", "resilient-start"],
+  "pages": ["index", "resilient-start"],
   "defaultOpen": false
 }
diff --git a/docs/content/docs/changelog/resilient-start.mdx b/docs/content/docs/changelog/resilient-start.mdx
index 653c439e0f..8560762807 100644
--- a/docs/content/docs/changelog/resilient-start.mdx
+++ b/docs/content/docs/changelog/resilient-start.mdx
@@ -7,321 +7,69 @@ description: Overhaul run start logic to tolerate world storage unavailability,
 
 ## Motivation
 
-When `world` storage is unavailable but the queue is up,
-`start()` previously failed entirely because `world.events.create(run_created)`
-is called before `world.queue()`. This change decouples run creation from queue
-dispatch so that runs can still be accepted when storage is degraded.
+When `world` storage is unavailable but the queue is up, `start()` previously failed entirely because `world.events.create(run_created)` is called before `world.queue()`. This change decouples run creation from queue dispatch so that runs can still be accepted when storage is degraded.
 
-Additionally, the runtime previously called `world.runs.get(runId)` before
-`run_started`, adding an extra round-trip. By always calling `run_started`
-directly, we save that round-trip and can return pre-loaded events in the
-response to skip the initial `events.list` call, reducing TTFB.
+Additionally, the runtime previously called `world.runs.get(runId)` before `run_started`, adding an extra round-trip. By always calling `run_started` directly, we save that round-trip and can return pre-loaded events in the response to skip the initial `events.list` call, reducing TTFB.
 
 ## Design
 
-### `start()` changes (packages/core)
+### `start()` changes
 
-- `world.events.create` (run_created) and `world.queue` are now called **in parallel**
-  via `Promise.allSettled`.
-- If `events.create` errors with **429 or 5xx**, we log a warning saying that run
-  creation failed but the run was accepted — creation will be re-tried async by the
-  runtime when it processes the queue message. The returned `Run` instance is marked
-  with `resilientStart = true`.
-- If `events.create` errors with **409** (EntityConflictError), the run already exists
-  (e.g., the queue handler's resilient start path created it first due to a cold-start
-  race). This is treated as success.
+- `world.events.create` (run_created) and `world.queue` are now called **in parallel** via `Promise.allSettled`.
+- If `events.create` errors with **429 or 5xx**, we log a warning saying that run creation failed but the run was accepted — creation will be re-tried async by the runtime when it processes the queue message. The returned `Run` instance is marked with `resilientStart = true`.
+- If `events.create` errors with **409** (EntityConflictError), the run already exists (e.g., the queue handler's resilient start path created it first due to a cold-start race). This is treated as success.
 - If `world.queue` fails, we still throw — the run truly failed and was not enqueued.
-- The queue invocation now receives all the run inputs (`input`, `deploymentId`,
-  `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can
-  create the run later if needed.
-- When the runtime re-enqueues itself, it does **not** pass these inputs — only the
-  first queue cycle carries them.
+- The queue invocation now receives all the run inputs (`input`, `deploymentId`, `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can create the run later if needed.
+- When the runtime re-enqueues itself, it does **not** pass these inputs — only the first queue cycle carries them.
 
-### `workflowEntrypoint` changes (packages/core)
+### `workflowEntrypoint` changes
 
-- When calling `world.events.create` with `run_started`, we now also always pass the
-  run input that was sent through the queue, if available. The response will still be on off:
-  - **200 with event (now running)**: As usual, but the server could have used the run input to create the run if it didn't exist yet. The response will be opaque to the runtime.
-  - **200 without event (already running)**: As usual
-  - **409 or 410 (already finished)**: As usual
+- When calling `world.events.create` with `run_started`, we now also always pass the run input that was sent through the queue, if available. The world is responsible for creating the run if it doesn't already exist.
 
-### `Run.returnValue` polling (packages/core)
+### `Run.returnValue` polling
 
-- When `resilientStart` is true on the Run instance (run_created failed), the
-  `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times
-  (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime
-  to create the run via `run_started`.
-- When `resilientStart` is false (normal path), 404 fails immediately — no delay
-  for the common case of a wrong run ID.
+- When `resilientStart` is true on the Run instance (run_created failed), the `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime to create the run via `run_started`.
+- When `resilientStart` is false (normal path), 404 fails immediately — no delay for the common case of a wrong run ID.
 
-### World / workflow-server changes
+### World contract changes
 
-- Posting `run_started` to a **non-existent** run is now allowed when the run input is
-  sent along with the payload. The server:
-  1. Creates a `run_created` event first (so the event log is consistent).
-  2. Strips the input from the `run_started` event data (it lives on `run_created`).
-  3. Then creates the `run_started` event normally.
-  4. Emits a log and a Datadog metric (`workflow_server.resilient_start.run_created_via_run_started`)
-     to track when this fallback path is hit.
-- When `run_started` encounters an **already-running** run, all worlds return `{ run }`
-  with `event: undefined` instead of throwing. No duplicate event is created.
+- Posting `run_started` to a **non-existent** run is now allowed when the run input is sent along with the payload. The world creates a `run_created` event first (so the event log is consistent), then creates the `run_started` event normally.
+- When `run_started` encounters an **already-running** run, all worlds return `{ run }` with `event: undefined` instead of throwing. No duplicate event is created.
 
 ### Queue transport changes
 
-`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain
-JSON serialization. Each world uses a transport that preserves binary data:
+`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain JSON serialization. Each world uses a transport that preserves binary data:
 
-- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a
-  `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively.
-- **world-local**: `TypedJsonTransport` — uses the existing `jsonReplacer`/`jsonReviver`
-  from `fs.ts` that encode Uint8Array as `{ __type: 'Uint8Array', data: '<base64>' }`.
-- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as
-  world-local, inlined since world-postgres doesn't import from world-local.
+- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively.
+- **world-local**: `TypedJsonTransport` — encodes Uint8Array as `{ __type: 'Uint8Array', data: '<base64>' }`.
+- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as world-local.
 
 ## Decisions
 
-1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to
-   minimize latency in the happy path.
+1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to minimize latency in the happy path.
 
-2. **Already-running returns run without event**: When `run_started` encounters an
-   already-running run, all worlds return `{ run }` with `event: undefined` (no
-   `events` array) instead of throwing. The runtime detects this by checking for
-   `result.event === undefined`. This avoids an extra `world.runs.get` round-trip.
+2. **Already-running returns run without event**: When `run_started` encounters an already-running run, all worlds return `{ run }` with `event: undefined` (no `events` array) instead of throwing. The runtime detects this by checking for `result.event === undefined`. This avoids an extra `world.runs.get` round-trip.
 
-3. **Events in 200 response**: We only return events on the 200 path (first caller).
-   On the already-running path, we fall back to the normal `events.list` call. This is
-   correct because only on 200 can we be certain we know the full event history.
+3. **Events in 200 response**: We only return events on the 200 path (first caller). On the already-running path, we fall back to the normal `events.list` call. This is correct because only on 200 can we be certain we know the full event history.
 
-4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true`
-   (run_created failed). Normal runs fail fast on 404.
+4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true` (run_created failed). Normal runs fail fast on 404.
 
 ## Known concerns
 
-### Cold-start race on Vercel (observed in CI)
+### Cold-start race on Vercel
 
-On Vercel, the parallel dispatch can cause the queue message to be processed before
-`run_created` completes, if `run_created` hits a cold-start lambda. Confirmed via
-Datadog: the `run_started` request hit a warm lambda (23ms) while `run_created` hit
-a cold lambda (727ms), even though `run_created` arrived at the edge 116ms earlier.
-When this happens:
+On Vercel, the parallel dispatch can cause the queue message to be processed before `run_created` completes, if `run_created` hits a cold-start lambda. When this happens:
 
 1. The runtime's resilient start path creates the run from `run_started`.
 2. The original `run_created` arrives and gets 409 (EntityConflictError).
 3. `start()` treats the 409 as success (the run exists).
 
-This is handled correctly. The `resilientStart` flag is NOT set on the Run instance
-in this case (409 is not a retryable error), so `returnValue` fails fast on 404.
+The `resilientStart` flag is NOT set on the Run instance in this case (409 is not a retryable error), so `returnValue` fails fast on 404.
 
-### Local Prod test flakiness (resolved)
+### Atomicity of run entity creation
 
-On world-local, the queue's async IIFE can deliver the message before
-`events.create(run_created)` finishes writing to the shared filesystem. The
-resilient start path should handle this, but Local Prod tests showed occasional
-runs stuck at `pending` (no `run_started` event), and Windows CI showed
-"Unconsumed event in event log" errors from duplicate `run_created` events.
+The normal `run_created` path and the resilient start path can race on creating the run entity. In `world-local`, both paths use `writeExclusive` (O_CREAT|O_EXCL) — atomic at the OS level, so exactly one writer wins and the other gets EEXIST. The normal path throws `EntityConflictError` on conflict (handled by `start()` as 409); the resilient start path re-reads the run from disk on conflict.
 
-**Root cause:** A TOCTOU race between the normal `run_created` path and the
-resilient start path. Both used `writeJSON` which checks existence with
-`fs.access()` (non-atomic), so both could pass the check and write separate
-`run_created` events with different event IDs. Fixed by switching both paths to
-`writeExclusive` (O_CREAT|O_EXCL) — see retrospective items 12 and 16.
+In `world-postgres`, the resilient start path uses `onConflictDoNothing` plus a re-read on conflict for the same effect, with the same outcome on either side of the race.
 
-## Follow-up work
-
-- [x] ~~Investigate Local Prod test flakiness~~ — resolved via `writeExclusive`
-  for run entity creation (retrospective items 12, 16).
-- [ ] Monitor the Datadog metric in production to understand how often the fallback is hit.
-- [x] ~~Events optimization for re-enqueue cycles~~ — decided against. The
-  already-running path returns early without writing an event, so preloading
-  events there would require an extra filesystem/DB query on every re-enqueue.
-  More importantly, on Vercel with at-least-once delivery, multiple lambdas can
-  process the same run concurrently — the event snapshot could be stale or
-  incomplete. The runtime's fallback to `events.list` is the correct behavior
-  for re-enqueue cycles.
-- [x] ~~CborTransport pass-through~~ — refactored. `encode()`/`decode()` now
-  live inside `CborTransport.serialize()`/`deserialize()`, matching the pattern
-  used by TypedJsonTransport (world-local) and the inline transport
-  (world-postgres). Call sites pass plain objects instead of pre-encoded buffers.
-
-## Development retrospective
-
-Chronological log of mistakes, misunderstandings, and reverted approaches during
-development. Included for future reference when working on similar cross-cutting
-runtime changes.
-
-### 1. Uint8Array corruption through JSON queue transport
-
-The initial implementation passed `runInput.input` (a `Uint8Array`) directly through
-the queue payload. `Uint8Array` doesn't survive `JSON.stringify` — it becomes
-`{"0":72,"1":101,...}`. This corrupted the workflow input when the resilient start
-path tried to recreate the run from the queue-delivered data.
-
-Caught by the `spawnWorkflowFromStepWorkflow` e2e test and the `world-testing`
-embedded tests, which failed with "Invalid input" from devalue's `unflatten()`.
-
-Three approaches were tried before landing on the final solution:
-
-1. **Base64 encoding** (`btoa`/`atob`) — worked but fragile. The decode side used
-   `typeof runInput.input === 'string'` as a discriminant, which was flagged as
-   dangerous since non-binary inputs could also be strings.
-2. **`Array.from()`/`new Uint8Array()`** — replaced base64 with a plain number array.
-   Two problems: (a) 3x JSON size regression vs base64, and (b) `Array.isArray()`
-   false-positives on v1Compat runs where `dehydrateWorkflowArguments` returns
-   devalue's flat Array format.
-3. **CBOR + BufferTransport** (final) — world-vercel CBOR-encodes the queue payload;
-   world-local and world-postgres use a `TypedJsonTransport` with a tagged envelope.
-
-### 2. Forgot to commit world-postgres transport fix (twice)
-
-After fixing world-local and world-vercel queue transports, the same `JsonTransport`
-corruption bug existed in world-postgres. The fix was written during a session but
-never committed — lost when the working directory was reset via stash/checkout. This
-happened twice. The fix only landed on the third attempt when it was committed and
-pushed immediately. All 14 Postgres e2e jobs failed each time.
-
-### 3. Incorrect diagnosis of Vercel Prod 409 errors
-
-Multiple Vercel Prod e2e tests failed with `EntityConflictError: Workflow run with
-ID wrun_... already exists` on `run_created`. The initial assumption was that VQS
-couldn't deliver the queue message fast enough to beat the `run_created` call.
-
-Datadog logs showed otherwise: the `run_created` request arrived at Vercel's edge
-116ms before `run_started`, but `run_created` hit a cold-start lambda (727ms) while
-`run_started` hit a warm one (23ms). Cold starts can invert expected execution order.
-
-### 4. Removed EntityConflictError catch, then had to restore it
-
-The `workflowEntrypoint` error handler originally caught both `EntityConflictError`
-and `RunExpiredError`. When adding the "already-running returns run without event"
-behavior, `EntityConflictError` was removed from the catch since the new worlds
-wouldn't throw it. Reviewer flagged this: old worlds or world-vercel hitting an
-older workflow-server could still throw it. The catch was restored.
-
-### 5. Duplicate `startedAt` check
-
-After refactoring the `run_started` flow, a `workflowRun.startedAt` null check
-existed both inside the `try` block and after the `catch` block. The second was
-unreachable. Removed after review.
-
-### 6. WORKFLOW_SERVER_URL_OVERRIDE left set
-
-During development, `WORKFLOW_SERVER_URL_OVERRIDE` was set to a test URL pointing
-at the workflow-server preview deployment and accidentally committed. The Vercel
-bot flagged this. Reset to empty string.
-
-### 7. e2e test assertion was wrong
-
-The resilient start e2e test stubbed `world.events.create` and asserted
-`createCallCount >= 2`. But the stub only intercepts calls from the test runner
-process — the server uses its own world. `createCallCount` was always 1. Changed
-to `expect(createCallCount).toBe(1)`.
-
-### 8. Misattributed Local Prod timeouts as "pre-existing"
-
-Local Prod tests showed 60-second timeouts across various tests. Initially dismissed
-as CI flakes. Checking main's CI showed all Local Prod tests pass on main — the
-timeouts are caused by our changes. Should have compared against main immediately.
-
-### 9. Attempted to revert parallel dispatch
-
-After identifying Local Prod timeouts, `start()` was partially reverted back to
-sequential dispatch. The user pointed out that parallel dispatch is the core value
-proposition of the PR. The revert was undone.
-
-### 10. WorkflowRunNotFoundError retry was unconditional
-
-The initial `pollReturnValue` retry on `WorkflowRunNotFoundError` applied to all
-`Run` instances. A user calling `getRun()` with a wrong ID would wait 10 seconds
-before getting a 404. Fixed by adding a `resilientStart` flag: only retries when
-`run_created` actually failed.
-
-### 11. Changeset `minor` vs `patch`
-
-The changeset was created with `"@workflow/core": minor`. Reviewer flagged this as
-violating repo rules ("all changes should be patch"). Changed after discussion.
-
-### 12. world-local TOCTOU race causing duplicate `run_created` events (Windows CI)
-
-The resilient start path AND the normal `run_created` path in `world-local/events-storage.ts`
-both used `writeJSON` to create the run entity. `writeJSON` checks file existence with
-`fs.access()` then writes via temp+rename — a classic TOCTOU race. On the local world,
-the queue delivers via an async IIFE in the same event loop, so `events.create(run_created)`
-and `events.create(run_started)` (with resilient start) run concurrently:
-
-1. Both paths call `fs.access(runPath)` → ENOENT (file doesn't exist yet)
-2. Both proceed to write → the last `fs.rename` wins
-3. Both succeed → both write their own `run_created` event with different event IDs
-4. During replay, the consumer sees two `run_created` events → "Unconsumed event" error
-
-This caused consistent failures in `world-testing` embedded tests on Windows CI (`hooks`,
-`supports null bytes in step results`, `retriable and fatal errors` — all timing out at
-60s with "Unconsumed event in event log" errors). Linux CI was not affected because the
-timing was different enough that the race window was rarely hit.
-
-Fixed by switching BOTH paths to `writeExclusive` (O_CREAT|O_EXCL), which is atomic at
-the OS level — exactly one writer wins, the other gets EEXIST. The normal `run_created`
-path throws `EntityConflictError` on conflict (handled by `start()` as 409). The resilient
-start path re-reads the run from disk on conflict. Either way, only one `run_created`
-event is written.
-
-### 13. Non-atomic run + run_created event in world-postgres resilient path
-
-The resilient start path in `world-postgres/storage.ts` did two separate writes (run
-insert, then event insert) without a transaction. If the process crashed between them,
-the run would exist without a `run_created` event — an inconsistent event log.
-
-A `drizzle.transaction()` wrapper was attempted but dropped due to TypeScript inference
-issues with drizzle's transaction callback and the insert builder's overloads. The current
-fix keeps the two writes sequential but adds the same conflict-aware re-read pattern as
-world-local: when `onConflictDoNothing` produces no result (run already existed), the run
-is re-read so downstream logic sees the real state. The narrow crash window between the
-two writes is acceptable — if the run insert succeeds but the event insert crashes, the
-run exists and `run_started` will still proceed normally (the event log will be missing a
-`run_created` entry, but the run itself is functional).
-
-### 14. Missing `WorkflowRunStatus` span attribute after parallel refactor
-
-The `start()` span previously set `Attribute.WorkflowRunStatus(result.run.status)`, but
-this was dropped in the parallel refactor because `result.run` is only available when
-`runCreatedResult` fulfilled. The attribute is now conditionally set when the result is
-available. In the resilient start case (run_created failed), the attribute is omitted
-rather than erroring.
-
-### 15. `run_started` eventData leak in world-postgres result
-
-The `...data` spread in the result construction leaked `eventData` from `run_started`
-into the returned event object. Storage was already correct (`storedEventData` is
-`undefined` for `run_started`), but the returned result carried the input data. While
-harmless (the runtime doesn't use `result.event.eventData`), it was restored to match
-the pre-refactor behavior where eventData was explicitly stripped from the result.
-
-### 16. Normal `run_created` path also needed `writeExclusive` (Windows CI)
-
-The initial TOCTOU fix (item 12) only changed the resilient start path to use
-`writeExclusive`. The normal `run_created` entity write still used `writeJSON` which
-checks existence with `fs.access()` then writes via temp+rename — not atomic. On
-Windows CI, the local queue's async IIFE delivered fast enough for both paths to pass
-their existence checks simultaneously, producing two `run_created` events with different
-event IDs. The events consumer saw the duplicate as "Unconsumed event in event log,"
-causing `hooks`, `supports null bytes in step results`, and `retriable and fatal errors`
-tests to time out at 60s. Fixed by also switching the normal `run_created` entity write to
-`writeExclusive`, making both paths use the same atomic gate.
-
-### 17. CborTransport was a pass-through wrapper
-
-`world-vercel/queue.ts` had `CborTransport` implementing `Transport<Buffer>` with a
-no-op `serialize` (identity function) and a `deserialize` that reassembled chunks into
-a Buffer without decoding. The actual CBOR `encode()`/`decode()` calls happened at the
-call sites — `queue()` pre-encoded before calling `client.send()`, and the handler
-post-decoded after receiving from `client.handleCallback()`. This violated the transport
-abstraction (every other transport does its encoding inside serialize/deserialize) and
-meant the call site had to remember to pre-encode. Refactored to move `encode()`/`decode()`
-into the transport methods and changed the type from `Transport<Buffer>` to
-`Transport<unknown>`.
-
-## Follow-up work (additional)
-
-- [x] ~~**CborTransport is a pass-through**~~ — Resolved. Moved `encode()`/`decode()`
-  into `CborTransport.serialize()`/`CborTransport.deserialize()`. The transport is now
-  self-contained: call sites pass plain objects, and the handler receives decoded objects.
-  See retrospective item 17.
+The narrow crash window in `world-postgres` between the run insert and the event insert is acceptable — if the run insert succeeds but the event insert crashes, the run exists and `run_started` will still proceed normally (the event log will be missing a `run_created` entry, but the run itself is functional).