Skip to content

Add outer timeout for mpi testing#5150

Open
cpjordan wants to merge 1 commit into
firedrakeproject:mainfrom
cpjordan:split-tests-outer-timeout
Open

Add outer timeout for mpi testing#5150
cpjordan wants to merge 1 commit into
firedrakeproject:mainfrom
cpjordan:split-tests-outer-timeout

Conversation

@cpjordan
Copy link
Copy Markdown
Contributor

@cpjordan cpjordan commented Jun 3, 2026

Description

Fixes thetisproject/thetis#459 as demonstrated on thetisproject/thetis#438 (@connorjward).

This PR adds a shell-level timeout around each split job run by firedrake-run-split-tests.

The existing pytest timeout catches long-running pytest items, but it does not protect the whole mpiexec ... pytest process tree. We have observed (in Thetis - locally and in CI) MPI split jobs that print pytest success and then hang indefinitely during PETSc/OpenMPI finalization. In that state, the wrapper never writes jobN.errcode, GNU parallel never returns, and CI remains stuck.

Change

Each split job now runs as:

timeout --kill-after=${kill_after} ${outer_timeout} ${pytest_cmd}

inside the existing tee pipeline.

The timeout settings are configurable:

FIREDRAKE_RUN_SPLIT_TESTS_TIMEOUT=1800s
FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER=60s

This PR assumes GNU timeout is available, matching the current Linux CI environment where firedrake-run-split-tests is used. If this helper is later used in macOS CI, a follow-up can either install Homebrew coreutils and use gtimeout, or add a small platform-specific wrapper.

Copy link
Copy Markdown
Contributor

@connorjward connorjward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably go into release so it becomes available for Thetis sooner.

kill_after=${FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER:-60s}

if ! command -v timeout >/dev/null 2>&1; then
echo "GNU timeout is required" >&2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your other PR you allow for GNU parallel not being there. This should also be optional no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we definitely want this to run on macOS, so making it optional would help with that if you don't want to setup gtimeout as you suggested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense but I was waiting for #5147 to be merged (I don't have the permissions to do so).

Do you want me to make the target branch for #5147 to be release as well? Then I can update this branch for your comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah #5147 should also go into release. I plan to merge that ASAP

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the base on #5147, will update this branch once merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel MPI testing hangs

2 participants