
PS-11265: Survive spot worker loss in parallel-mtr test stages #523

Open
nogueiraanderson wants to merge 1 commit into 8.0 from PS-11265-retry-agent-loss



Bug

  • One spot-reclaimed Graviton worker fails the whole parallel-mtr build; the lost shard is never retried (ps80 builds 479/484, three AWS spot interruption notices on 2026-06-08)
  • Any cause-less run (Replay, EC2 Fleet plugin resubmit) dies in seconds on an NPE at getBuildCauses()[0].userId (ps80 builds 487-489)

Fix

  • Add options { retry(count: 2, conditions: [agent()]) } to Test 2-8; agent loss re-acquires a fresh worker and re-runs only that shard. Genuine test and script failures are not retried
  • Replace the [0] indexing with a typed UserIdCause lookup
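
A minimal sketch of what a typed lookup of this kind could look like, assuming the sandbox-safe `currentBuild.getBuildCauses(String)` overload that filters causes by class name. The helper name `triggeringUserId` and the `'unknown'` fallback are hypothetical and are not taken from this PR's diff:

```groovy
// Hypothetical helper, for illustration only.
// Returns the ID of the user who triggered the run, or a fallback when the
// run carries no UserIdCause at all.
def triggeringUserId(String fallback = 'unknown') {
    // Single quotes matter: '$UserIdCause' must not be treated as GString interpolation.
    // The filtered overload returns only causes of the named class (empty if none).
    def userCauses = currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause')

    // Guard the empty case instead of dereferencing element 0 unconditionally.
    def userId = userCauses ? userCauses[0].userId : null

    // userId may also be null for a matching cause, so fall back in both cases.
    return userId ?: fallback
}
```

The unfiltered `getBuildCauses()[0].userId` form assumes both that a first cause exists and that it has a `userId`. Filtering by class and guarding the empty result removes both assumptions.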

Tickets

… cause

- Add retry(count: 2, conditions: [agent()]) stage option to Test 2-8 so a
  worker lost to a spot reclaim re-acquires a fresh agent instead of failing
  the whole build; genuine test and script failures are not retried
- Replace getBuildCauses()[0].userId with a typed UserIdCause lookup; a
  cause-less run (Replay, EC2 Fleet plugin task resubmit) crashed with an
  NPE at WorkflowScript:740 before any work started

nogueiraanderson commented Jun 9, 2026


Root cause and evidence

On 2026-06-08 AWS issued three spot interruption notices against the jenkins-ps80-arm-graviton fleet (ASG activity: "taken out of service in response to an EC2 Spot Instance interruption notice", 20:27/20:44/20:51 UTC). Builds 479 and 484 lost workers mid-MTR and failed. The fleet's history shows 7 such reclaims in 10 days, so this recurs.

Why not ALLOW_ABORTED_WORKERS_RERUN: it is cause-blind (it also reruns faulty-script failures), which is why it is disabled. The agent() retry condition matches only agent-loss errors, so genuine test and script failures keep failing exactly as today.
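
To make the scoping concrete, here is a hedged sketch of one parallel test stage carrying the option. Only the `options { retry(count: 2, conditions: [agent()]) }` line comes from this PR; the agent label, the shell command, and the surrounding pipeline shape are placeholders and do not reflect the real Jenkinsfile:

```groovy
// Sketch only: label and shell command are hypothetical placeholders.
pipeline {
    agent none
    stages {
        stage('Test') {
            parallel {
                stage('Test 2') {
                    agent { label 'hypothetical-arm-spot-worker' }
                    options {
                        // Re-run this stage only when the failure is classified as
                        // agent loss; errors raised by the steps themselves propagate
                        // unchanged and fail the build as before.
                        retry(count: 2, conditions: [agent()])
                    }
                    steps {
                        sh './hypothetical-run-mtr-shard.sh 2'
                    }
                }
                // Test 3 through Test 8 would carry the same options block.
            }
        }
    }
}
```

Because the option sits on the stage rather than around individual steps, a retry covers agent allocation as well, which is what lets the re-run land on a fresh worker.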

Dependency: the EC2 Fleet plugin's resubmit-on-disconnect interrupts the branch with ABORTED before the durable task can surface a retryable error, and retry, correctly, does not retry interrupted branches. percona-cd-platform PR 118 disables resubmit-on-disconnect on the arm fleets (merged; applied live to ps80, ps57, pxb, pxc and the S3 init-config canonicals).

Validation on ps80 (real job, A/B against the same kill)

All runs used the production job shape (binaries reused from build 479, one MTR shard on Test 2, worker terminated mid-suite via the EC2 API, which is the same call the ASG makes on a spot notice):

  • Unpatched pipeline + kill (parallel-mtr build 498): AgentOfflineException, branch dead in seconds, build FAILURE. The incident behavior, reproduced
  • Patched pipeline + same kill (gate build, a job clone whose definition points at this branch): Retrying on the same AgentOfflineException, fresh Graviton acquired, full shard re-ran, build SUCCESS in 27 min, 0 test failures
  • Genuine non-infra failure (build 496, bad suite config): no retry, clean FAILURE. Nothing gets masked

Known limitation: Test 1 inherits the Build agent (its agent line is deliberately commented out for unit-test workspace reuse), so losing that node is still fatal. Covering it needs an owner decision on Test 1's agent.

nogueiraanderson marked this pull request as ready for review June 9, 2026 21:32