fix: drain result_queue in LocalExecutor to prevent process join dead…#67881
Open
SakshamKapoor2911 wants to merge 2 commits into
Open
fix: drain result_queue in LocalExecutor to prevent process join dead…#67881SakshamKapoor2911 wants to merge 2 commits into
SakshamKapoor2911 wants to merge 2 commits into
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a potential deadlock in LocalExecutor shutdown where workers blocked on writing to a full result_queue would never exit, hanging the scheduler. Also makes terminate() actually terminate worker processes.
Changes:
- Drain
result_queuewhile waiting for workers to join inend(), withKeyboardInterrupt/SystemExitfallback that force-terminates workers. - Harden
_read_resultsagainstOSError/EOFErrorfrom a closed/broken queue. - Implement
terminate()to forcefully terminate all worker processes (previously a no-op).
…eues in finally block
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
closes: #67870
Description
This PR fixes a deadlock in the
LocalExecutorduring executor shutdown (end()).Currently, the executor calls
proc.join()on active worker processes. However, if a worker has written enough results to theresult_queueto fill the OS-level pipe buffer (typically 64KB), the worker process blocks indefinitely on itsput()call.Because the parent scheduler process is blocked on the unbounded
proc.join()and not reading fromresult_queue, a classic multiprocessing deadlock occurs. The scheduler hangs indefinitely, stopping heartbeats and preventing systemd/Kubernetes from restarting the process (as it never exits).Changes
LocalExecutor.end()to continuously drain theresult_queueusing_read_results()while waiting for worker processes to join with a small timeout._read_results()with atry/except (OSError, EOFError)block to gracefully handle cases where pipes are already broken or closed during task exit.LocalExecutor.terminate()method to forcefully kill any remaining workers when a hard stop is requested.Context
This contribution was created as part of the required capstone of the CodePath AI301 course, where students learn how to make responsible and effective use of generative AI tools for open-source software contributions.
Was generative AI tooling used to co-author this PR?
Generated-by: Antigravity IDE (Gemini) following the guidelines