fix(workflow): enforce per-task timeout in workflow execution#479
Open
brycehemme wants to merge 1 commit into
Open
fix(workflow): enforce per-task timeout in workflow execution#479brycehemme wants to merge 1 commit into
brycehemme wants to merge 1 commit into
Conversation
The workflow tool accepted a `timeout` field per task and applied a 300s default, but the value was never enforced. `wait()` was called without a timeout and `future.result()` was called without a timeout, so a hung task blocked the workflow loop indefinitely. Track per-task timeouts in two places: * Compute the soonest active-task deadline and pass it as `timeout=` to `wait()`, so the loop wakes up no later than the next task limit. * After `wait()` returns, check every still-running future against its per-task timeout. Tasks past their deadline are marked `error` with a `"Task execution timeout after Ns"` message and removed from `active_futures`, so the workflow can progress instead of blocking. `future.cancel()` is attempted for timed-out tasks but cannot reliably terminate an already-running worker thread (Python doesn't expose a portable mechanism). The runaway task continues consuming CPU until its function returns; this PR only unblocks the workflow loop and dependent tasks. Documented inline. Adds `test_start_workflow_enforces_task_timeout`: installs a hung `execute_task` that sleeps 2.0s, configures `timeout: 0.2`, and asserts the workflow returns in under 1.5s with the task marked as a timeout error. Resets the `WorkflowManager` singleton around the test to avoid leaking state. Fixes strands-agents#326
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Issue #326 reported that the
workflowtool accepts a per-tasktimeoutfield (and applies a 300s default increate) but never enforces it. A hung task blocks the entire workflow loop indefinitely, sincewait(active_futures.values(), return_when=FIRST_COMPLETED)is called without a timeout andfuture.result()is likewise unbounded.This PR enforces the per-task timeout in two places:
start_time + timeoutminimum across all active futures) and pass it astimeout=towait(). The loop now wakes up no later than the next task limit, even if no future completes.wait()returns, iterate every active future. Done futures process as before. Still-running futures get checked against their per-task timeout — if exceeded, the task is markedstatus: "error"with"Task execution timeout after Ns", added tocompleted_tasks, and removed fromactive_futures.future.cancel()is called as a best effort.Two small helpers are added to
WorkflowManager:_get_task_timeout(workflow, task_id)and_next_task_deadline(active_futures, workflow).Caveats (worth surfacing for review)
Worker thread can't be terminated. Python doesn't expose a portable way to kill a running thread, so
future.cancel()on a started future returnsFalseand the worker function keeps running until it returns. This PR only unblocks the workflow loop and dependent tasks; a runaway tool function still consumes CPU until it finishes. An inline comment instart_workflowexplains the constraint. The reporter's example (atime.sleep(120)tool withtimeout: 30) is handled correctly under this constraint: the workflow finishes around 30s, the underlying sleep keeps running another ~90s in the background.Related to issue [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 (deadlock on failed tasks) but doesn't fix it. The reporter noted that without timeout enforcement, hung tasks never fail and so never trigger the dependent-task-deadlock condition described in [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325. This PR makes hung tasks fail, which exposes [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 — dependent tasks may now be blocked instead of the workflow itself. [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 should be tracked separately.
Related Issues
Fixes #326
Type of Change
Bug fix
Testing
Ran
hatch run prepareend-to-end viapython:3.12in Docker:format: clean (no changes to PR files)lint: cleantest-lint(ruff format --check): cleantest: 1211 passed, 33 skipped, 1 failureThe single failure is
tests/test_file_write.py::test_file_write_error_handling. I verified it reproduces onmainwithout this PR (same Python 3.12 / hatch environment), so it's pre-existing and unrelated.Added
test_start_workflow_enforces_task_timeout(inTestWorkflowExecution): installs a hungexecute_taskthat sleeps 2.0s, configurestimeout: 0.2, asserts the workflow completes in under 1.5s with the task markedstatus: "error"and the result text containing the timeout. Resets theWorkflowManagersingleton around the test to avoid leaking the patchedexecute_task/ shut-downTaskExecutorinto other tests.hatch run prepareChecklist
start_workflowexplaining the thread-termination caveat)