Fix Edge worker fork mode reporting supervisor failures as success #67887
Merged
potiuk added a commit to potiuk/airflow that referenced this pull request on Jun 2, 2026
Incremental-update pass for commits that landed on main after this wave was prepared:
- amazon: Propogate verify/botocore_config in redshift cluster triggers (apache#67876) → Features (mirrors the batch-triggers entry apache#67508).
- databricks: Lock in workflow depends_on parent-key behavior (apache#66681) → Bug Fixes.
- edge3: Fix Edge worker fork mode reporting supervisor failures as success (apache#67887) → Bug Fixes.
- google: Migrate Stackdriver logging config to RemoteLogIO pattern (apache#66513) → Misc.
potiuk added a commit that referenced this pull request on Jun 3, 2026
* Prepare provider documentation 2026-06-02

* Address review feedback on provider changelogs
  - amazon: move S3 transfer-operators fix (#67378, vincbeck) and EcsRunTaskOperator log-level fix (#67180, jscheffl) from Features to Bug Fixes; beautify the EksPodOperator entry (#65335).
  - google: beautify the idle/auto-stop TTL entry (#65653): drop conventional-commit prefix.
  - openlineage: beautify the ProcessPoolExecutor self-heal entry (#67400).
  - edge3: reword the 3.8.0 note: the provider still supports Airflow 3.0+, only the execute-callback feature (#67679) needs 3.3+ (jscheffl); move the Swagger API docs entry (#67390) to Doc-only; beautify the team_name clarification (#66718).
  - apache/drill: move the flit.sdist housekeeping entry (#65861) to the excluded block to match the kafka convention (jscheffl).

* Make apache/drill 3.3.3 a doc-only release
  The only non-excluded drill change in this wave was the flit.sdist housekeeping entry, which jscheffl asked to exclude, leaving an empty changelog. Promote the DAG-to-Dag wording change (#66153) into a Doc-only section so drill 3.3.3 ships as a legitimate doc-only release instead of an empty one.

* Fold post-prep provider commits into changelogs
  Incremental-update pass for commits that landed on main after this wave was prepared:
  - amazon: Propogate verify/botocore_config in redshift cluster triggers (#67876) → Features (mirrors the batch-triggers entry #67508).
  - databricks: Lock in workflow depends_on parent-key behavior (#66681) → Bug Fixes.
  - edge3: Fix Edge worker fork mode reporting supervisor failures as success (#67887) → Bug Fixes.
  - google: Migrate Stackdriver logging config to RemoteLogIO pattern (#66513) → Misc.
related: #67679 (comment)
Body
In the Edge worker fork path, `_run_job_via_supervisor` runs as a `multiprocessing.Process` target. It returned an `int` exit code, but multiprocessing ignores a target's return value and sets the child's exit code to `0` on any normal return.

So when the supervisor exited abnormally (crash, kill, or a failed terminal-state delivery, i.e. a non-zero `run_workload` result), the forked child still exited `0`, and the parent's `Job.is_success` (`exitcode == 0`) reported the failed job as success.

This propagates the supervisor's exit code via `sys.exit()` so the child's exit status reflects the real outcome and the failure-handling branch (`failure_details()` + log push + `FAILED` state) runs as intended.

Note
This fix only makes errors in the supervisor itself be treated as failures; it does not correctly detect failures of the user-defined task that the supervisor runs. In those cases, the edge job stays marked as success.
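The return-value behavior described above can be reproduced outside Airflow. The following is a minimal sketch, not the provider's code: `returns_code`, `exits_with_code`, and `child_exitcode` are hypothetical names, and the `fork` start method is assumed to be available (Unix-like systems).

```python
import multiprocessing
import sys
from typing import Callable, Optional


def returns_code() -> int:
    # Stand-in for a failed supervisor run: the non-zero code is only
    # returned, and multiprocessing discards a target's return value.
    return 1


def exits_with_code() -> None:
    # Same failure, but the code is propagated as the process exit status.
    sys.exit(returns_code())


def child_exitcode(target: Callable[[], object]) -> Optional[int]:
    # Run the target in a forked child and report the child's exit code.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=target)
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    print(child_exitcode(returns_code))     # 0: the failure is invisible to the parent
    print(child_exitcode(exits_with_code))  # 1: the failure reaches the parent
```

A parent that checks `exitcode == 0` would treat the first child as successful and only the second as failed, which is the difference this PR relies on.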
Question
A normal task failure (user code raising) is reported by the supervisor through the Execution API and the subprocess exits `0`; that path is unchanged here and the task instance is still correctly marked `FAILED`.

So while the task instance state is updated correctly, the edge job is still marked as success even when the task fails, and it also shows up as success in the edge executor's job list UI. Is this intended behavior?
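To make the question concrete, here is a rough sketch of how the two signals can diverge. It is purely illustrative: a queue stands in for the Execution API, and `supervise`, `failing_task`, `passing_task`, and `run_job` are hypothetical names, not Airflow or edge3 APIs. The `fork` start method is assumed to be available.

```python
import multiprocessing
from typing import Callable, Optional, Tuple


def failing_task() -> None:
    # Stand-in for user code that raises.
    raise ValueError("user code failed")


def passing_task() -> None:
    # Stand-in for user code that completes normally.
    return None


def supervise(task: Callable[[], None], reports) -> None:
    # The supervisor catches the task failure and reports it on a side
    # channel (the queue here is only a stand-in for the Execution API).
    try:
        task()
        reports.put("success")
    except Exception:
        reports.put("failed")
    # The supervisor itself returns normally, so the child exits with 0.


def run_job(task: Callable[[], None]) -> Tuple[Optional[int], str]:
    ctx = multiprocessing.get_context("fork")
    reports = ctx.Queue()
    proc = ctx.Process(target=supervise, args=(task, reports))
    proc.start()
    state = reports.get(timeout=10)  # read before join so the child can flush
    proc.join()
    return proc.exitcode, state


if __name__ == "__main__":
    print(run_job(failing_task))  # (0, 'failed')
    print(run_job(passing_task))  # (0, 'success')
```

In this sketch a check based only on `exitcode == 0` reports both jobs as successful, while the side channel carries the real task outcome.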
Was generative AI tooling used to co-author this PR?
claude opus