PS-11265: Survive spot worker loss in parallel-mtr test stages #523
nogueiraanderson wants to merge 1 commit into
… cause

- Add `retry(count: 2, conditions: [agent()])` stage option to Test 2-8 so a worker lost to a spot reclaim re-acquires a fresh agent instead of failing the whole build; genuine test and script failures are not retried
- Replace `getBuildCauses()[0].userId` with a typed `UserIdCause` lookup; a cause-less run (Replay, EC2 Fleet plugin task resubmit) crashed with an NPE at WorkflowScript:740 before any work started
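The two changes above can be sketched in a single Jenkinsfile fragment. This is a minimal illustration and not the PR's actual diff: the helper name `triggeringUserId`, the `'unknown'` fallback, the `spot-worker` label, and the `run-mtr-shard.sh` script are all hypothetical. It assumes the `agent()` retry condition is available from the installed durable-task step plugin, and that the typed lookup uses the `currentBuild.getBuildCauses(String)` overload, which returns only causes of the named class.

```groovy
// Hypothetical helper: the triggering user's ID, or a fallback when the run
// carries no UserIdCause at all.
def triggeringUserId(String fallback = 'unknown') {
    // The filtered overload returns an empty list (not null) when no cause
    // of this class exists, so indexing is guarded by the truthiness check.
    // Single quotes keep Groovy from interpolating "$UserIdCause".
    def causes = currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause')
    return causes ? causes[0].userId : fallback
}

pipeline {
    agent none
    stages {
        stage('Test 2') {
            agent { label 'spot-worker' }        // label name is an assumption
            options {
                // Re-run this stage on a fresh agent only when the agent is
                // lost; count caps the number of attempts. Ordinary test or
                // script failures do not match the agent() condition.
                retry(count: 2, conditions: [agent()])
            }
            steps {
                echo "Started by ${triggeringUserId()}"
                sh './run-mtr-shard.sh 2'        // hypothetical shard runner
            }
        }
    }
}
```

Scoping the retry to the stage rather than wrapping the whole parallel block means only the shard whose worker disappeared is repeated, which matches the behavior described for Test 2-8.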
Root cause and evidence

On 2026-06-08 AWS issued three spot interruption notices against the …

Why not …

Dependency: the EC2 Fleet plugin's resubmit-on-disconnect interrupts the branch with ABORTED before the durable task can surface a retryable error, and …

Validation on ps80 (real job, A/B against the same kill)

All runs used the production job shape (binaries reused from build 479, one MTR shard on Test 2, worker terminated mid-suite via the EC2 API, which is the same call the ASG makes on a spot notice):
Known limitation: Test 1 inherits the Build agent (its …
Bug

- … `getBuildCauses()[0].userId` … (ps80 builds 487-489)

Fix

- Add `options { retry(count: 2, conditions: [agent()]) }` to Test 2-8; agent loss re-acquires a fresh worker and re-runs only that shard. Genuine test and script failures are not retried
- Replace `[0]` indexing with a typed `UserIdCause` lookup

Tickets