PS-11265: Survive spot worker loss in parallel-mtr test stages by nogueiraanderson · Pull Request #523 · Percona-Lab/ps-build

nogueiraanderson · 2026-06-09T19:03:44Z

Bug

One spot-reclaimed Graviton worker fails the whole parallel-mtr build; the lost shard is never retried (ps80 builds 479/484, three AWS spot interruption notices on 2026-06-08)
Any cause-less run (Replay, EC2 Fleet plugin resubmit) dies in seconds on an NPE at getBuildCauses()[0].userId (ps80 builds 487-489)

Fix

Add options { retry(count: 2, conditions: [agent()]) } to the Test 2 through Test 8 stages; on agent loss the stage re-acquires a fresh worker and re-runs only that shard. Genuine test and script failures are not retried
Replace the [0] indexing with a typed UserIdCause lookup
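
A minimal sketch of what a typed lookup can look like in a scripted section of the Jenkinsfile. This is an illustration, not the code from this PR: it assumes the filtered currentBuild.getBuildCauses(String) variant is used, and the 'unknown' fallback and the triggeredBy variable name are hypothetical.

```groovy
// Sketch only, not the PR's actual change.
// Ask for UserIdCause entries specifically instead of indexing the
// unfiltered cause list. Single quotes keep Groovy from treating
// '$UserIdCause' as string interpolation.
def userCauses = currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause')

// An empty list is falsy in Groovy, so a cause-less run (Replay, plugin
// resubmit) takes the fallback instead of dereferencing a null entry.
// 'unknown' is a placeholder default chosen for this example.
def triggeredBy = userCauses ? userCauses[0].userId : 'unknown'

echo "Triggered by: ${triggeredBy}"
```

The point of the filter is that an absent user cause yields an empty result that can be checked, rather than a null element that fails on property access.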

Tickets

PS-11265

… cause
- Add retry(count: 2, conditions: [agent()]) stage option to Test 2-8 so a worker lost to a spot reclaim re-acquires a fresh agent instead of failing the whole build; genuine test and script failures are not retried
- Replace getBuildCauses()[0].userId with a typed UserIdCause lookup; a cause-less run (Replay, EC2 Fleet plugin task resubmit) crashed with an NPE at WorkflowScript:740 before any work started

nogueiraanderson · 2026-06-09T19:04:56Z

Root cause and evidence

On 2026-06-08 AWS issued three spot interruption notices against the jenkins-ps80-arm-graviton fleet (ASG activity: "taken out of service in response to an EC2 Spot Instance interruption notice", 20:27/20:44/20:51 UTC). Builds 479 and 484 lost workers mid-MTR and failed. The fleet's history shows 7 such reclaims in 10 days, so this recurs.

Why not ALLOW_ABORTED_WORKERS_RERUN: it is cause-blind (it also reruns faulty-script failures), which is why it is disabled. The agent() retry condition matches only agent-loss errors, so genuine test and script failures keep failing exactly as today.
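
For orientation, a stripped-down declarative stage with the retry option might look like the sketch below. Only the options line is taken from this PR; the agent label, the shell script name, and the single-stage layout are hypothetical stand-ins for the real parallel-mtr pipeline.

```groovy
// Sketch only: a reduced pipeline showing where the stage option sits.
pipeline {
    agent none
    stages {
        stage('Test 2') {
            // Hypothetical label; the real job targets the Graviton spot fleet.
            agent { label 'graviton-worker' }
            options {
                // Retry the stage only when the failure is an agent-loss
                // error. As a stage-level option this wraps the agent
                // allocation, so a retry requests a new worker.
                retry(count: 2, conditions: [agent()])
            }
            steps {
                // Hypothetical shard runner. A non-zero exit here is an
                // ordinary failure and does not match agent().
                sh './run-mtr-shard.sh 2'
            }
        }
    }
}
```

Because the condition list contains only agent(), a failing test or a broken script still fails the stage on the first attempt, which is the behavior the cause-blind rerun flag could not provide.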

Dependency: the EC2 Fleet plugin's resubmit-on-disconnect interrupts the branch with ABORTED before the durable task can surface a retryable error, and retry correctly refuses to retry interruptions. percona-cd-platform PR 118 disables the resubmit on the arm fleets (merged; applied live to ps80, ps57, pxb, pxc and the S3 init-config canonicals).

Validation on ps80 (real job, A/B against the same kill)

All runs used the production job shape (binaries reused from build 479, one MTR shard on Test 2, worker terminated mid-suite via the EC2 API, which is the same call the ASG makes on a spot notice):

Unpatched pipeline + kill (parallel-mtr build 498): AgentOfflineException, branch dead in seconds, build FAILURE. This reproduces the incident behavior.
Patched pipeline + same kill (gate build, a job clone whose definition points at this branch): the retry fired on the same AgentOfflineException, a fresh Graviton worker was acquired, the full shard re-ran, and the build ended in SUCCESS in 27 min with 0 test failures.
Genuine non-infra failure (build 496, bad suite config): no retry, clean FAILURE. Nothing gets masked.

Known limitation: Test 1 inherits the Build agent (its agent line is deliberately commented out for unit-test workspace reuse), so losing that node is still fatal. Covering it needs an owner decision on Test 1's agent.

This was referenced Jun 9, 2026

PS-11265: Fix NPE on cause-less runs of pxc80 parallel-mtr Percona-Lab/jenkins-pipelines#4182

Open

PS-11265: Disable EC2 Fleet task resubmit on the arm fleets percona/percona-cd-platform#118

Merged

nogueiraanderson marked this pull request as ready for review June 9, 2026 21:32

nogueiraanderson wants to merge 1 commit into 8.0 from PS-11265-retry-agent-loss
