PS-11265: Disable EC2 Fleet task resubmit on the arm fleets #118

Merged: nogueiraanderson merged 4 commits into main from PS-11265-disable-task-resubmit on Jun 9, 2026


@nogueiraanderson

Collaborator

Bug

  • On a spot interruption, the EC2 Fleet plugin interrupts the pipeline node body with ABORTED, which preempts durable-task recovery and defeats retry(conditions: [agent()])
  • The plugin's resubmit then re-schedules the WorkflowJob via getLastBuild() params with no cause, minting broken ghost builds (ps80 builds 487-489, one per interruption, each hitting an NPE within seconds)

Fix

  • Set disableTaskResubmit=true in all five masters' ec2FleetCloud.groovy; agent loss then surfaces as a retryable agent error
  • Already applied live on ps80 (disk + in-memory, backup kept); kill-probe verified the stage retry recovers end to end
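
A minimal sketch of the stage-level retry this fix relies on, written as a scripted-pipeline fragment. The `arm-fleet` label, the stage name, the `make build` command, and the retry count of 2 are hypothetical placeholders; only `retry(conditions: [agent()])` comes from this PR.

```groovy
// Sketch only: label, stage name, count, and build command are assumptions.
stage('Build') {
    // Re-run the whole node block if the agent is lost (e.g. a spot interruption).
    // With disableTaskResubmit=true the plugin no longer aborts the body itself,
    // so the loss can reach this retry as an agent error.
    retry(count: 2, conditions: [agent()]) {
        node('arm-fleet') {
            checkout scm
            sh 'make build'
        }
    }
}
```

The retry wraps the node block rather than sitting inside it, so each attempt allocates a fresh agent instead of trying to reuse the one that was lost.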

Tickets

- The plugin's resubmit-on-disconnect interrupts pipeline node bodies with
  ABORTED (killing the branch before the durable-task agent-wait can run)
  and re-schedules the WorkflowJob via getLastBuild() params with no cause
- For pipeline-only fleets this produces broken ghost builds instead of
  recovery; with the flag on, agent loss surfaces as a retryable agent
  error that stage-level retry(conditions: [agent()]) can handle
… dead groovy tree

- ps3's EC2 master is decommissioned; resources/jenkins-masters/ps3 is no
  longer consumed by any terraform (revert the pointless patch there)
- The live in-cluster ps3-k8s defines the fleet via the clouds catalog;
  flip it there and re-render the instance values (render-clouds.py apply)

@nogueiraanderson force-pushed the PS-11265-disable-task-resubmit branch from 4addd8c to f1613c2 (June 9, 2026 19:59)
@nogueiraanderson marked this pull request as ready for review (June 9, 2026 21:32)
@nogueiraanderson merged commit 0fca37b into main (Jun 9, 2026)
8 checks passed

@nogueiraanderson

Collaborator Author

Rollout

  • ps80, ps57, pxb, pxc: applied live (on-disk file patched on the EBS volume, then re-evaluated through the uberClassLoader; timestamped backups kept); verified disableTaskResubmit=true on each running cloud
  • S3 init-config canonicals for all four masters verified to carry the new content, so fresh-EBS rebuilds inherit the flag
  • ps3-k8s: lands with the next ArgoCD sync of jenkins-ps3-k8s (manual-sync app); the only other OutOfSync resources are the known cosmetic ESO drift
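
One way the per-cloud check above could be run from the Jenkins script console is sketched below. It assumes the plugin exposes the flag through a standard boolean getter (so that Groovy property access `cloud.disableTaskResubmit` resolves), and it matches clouds by class name to avoid importing the plugin class. This is not necessarily the procedure used for this rollout.

```groovy
// Sketch only: run in the Jenkins script console on each master.
import jenkins.model.Jenkins

Jenkins.get().clouds
    // Match by class name so the script has no hard dependency on the plugin import.
    .findAll { it.class.name.endsWith('EC2FleetCloud') }
    .each { cloud ->
        // Assumption: the flag is readable as a Groovy property via its getter.
        println "${cloud.name}: disableTaskResubmit=${cloud.disableTaskResubmit}"
    }
```

Each fleet cloud should print `disableTaskResubmit=true` once the patched configuration is loaded.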
