
Fix Edge worker fork mode reporting supervisor failures as success #67887

Merged
jscheffl merged 2 commits into apache:main from wjddn279:fix-edge-run-workload-exception
Jun 2, 2026


@wjddn279
Contributor

@wjddn279 wjddn279 commented Jun 2, 2026

related: #67679 (comment)

Body

In the Edge worker fork path, _run_job_via_supervisor runs as a multiprocessing.Process target. It returned an int exit code, but multiprocessing ignores a target's return value and sets the child's exit code to 0 on any normal return.
So when the supervisor exited abnormally (a crash, a kill, or a failed terminal-state delivery, i.e. a non-zero run_workload result), the forked child still exited 0, and the parent's Job.is_success (exitcode == 0) reported the failed job as success.
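
For anyone who wants to reproduce the underlying behavior outside Airflow, here is a minimal sketch. The names `supervisor_returning_code` and `run_in_fork` are hypothetical stand-ins, not Edge worker code, and the "fork" start method is requested explicitly (POSIX only) to mirror fork mode.

```python
import multiprocessing
from typing import Callable, Optional


def supervisor_returning_code() -> int:
    # Hypothetical stand-in for a supervisor run that failed with code 3.
    # The value is returned, not raised, so multiprocessing never sees it.
    return 3


def run_in_fork(target: Callable[[], object]) -> Optional[int]:
    # Request the "fork" start method explicitly (available on POSIX only).
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=target)
    proc.start()
    proc.join()
    # exitcode reflects how the child process ended, not what target returned.
    return proc.exitcode


if __name__ == "__main__":
    # The target returned 3, yet the child exits with status 0.
    print(run_in_fork(supervisor_returning_code))
```

A parent that checks only `exitcode == 0` therefore cannot tell this failed run apart from a successful one.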

This propagates the supervisor's exit code via sys.exit() so the child's exit status reflects the real outcome and the failure-handling branch (failure_details() + log push + FAILED state) runs as intended.
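
A rough sketch of the propagation pattern, under the assumption that the child target simply wraps the supervisor call. `run_workload` here is a hypothetical stub that pretends to fail with code 2, and `child_target` and `run_job` are illustrative names; the actual change in the provider may be structured differently.

```python
import multiprocessing
import sys
from typing import Optional, Tuple


def run_workload() -> int:
    # Hypothetical stub: pretend the supervisor ended with a non-zero result.
    return 2


def child_target() -> None:
    # Convert the supervisor's result into the child's exit status.
    # multiprocessing maps an integer SystemExit code to Process.exitcode.
    sys.exit(run_workload())


def run_job() -> Tuple[Optional[int], bool]:
    # Fork the child (POSIX only), wait for it, and apply the same
    # success rule the description mentions: exitcode == 0.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=child_target)
    proc.start()
    proc.join()
    return proc.exitcode, proc.exitcode == 0


if __name__ == "__main__":
    exitcode, is_success = run_job()
    print(exitcode, is_success)
```

With the exit status carrying the real result, the parent's success check evaluates to false for this run and the failure-handling branch can be taken.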

Note

This fix only makes errors in the supervisor itself count as failures; it does not detect failures of the user-defined task that the supervisor runs. In those cases, the edge job stays marked as success.

Question

A normal task failure (user code raising) is reported by the supervisor through the Execution API, and the subprocess exits 0. That path is unchanged here, and the task instance is still correctly marked FAILED.

So while the task instance state is updated correctly, the edge job is still marked as success even when the task fails, and it also shows up as success in the edge executor's job list UI. Is this intended behavior?


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    claude opus


@boring-cyborg boring-cyborg Bot added area:providers provider:edge Edge Executor / Worker (AIP-69) / edge3 labels Jun 2, 2026
Contributor

@jscheffl jscheffl left a comment


Thanks for the fix!

@jscheffl jscheffl merged commit bf3afce into apache:main Jun 2, 2026
92 checks passed
potiuk added a commit to potiuk/airflow that referenced this pull request Jun 2, 2026
Incremental-update pass for commits that landed on main after this wave was prepared:

- amazon: Propogate verify/botocore_config in redshift cluster triggers (apache#67876) → Features
  (mirrors the batch-triggers entry apache#67508).
- databricks: Lock in workflow depends_on parent-key behavior (apache#66681) → Bug Fixes.
- edge3: Fix Edge worker fork mode reporting supervisor failures as success (apache#67887) → Bug Fixes.
- google: Migrate Stackdriver logging config to RemoteLogIO pattern (apache#66513) → Misc.
potiuk added a commit that referenced this pull request Jun 3, 2026
* Prepare provider documentation 2026-06-02

* Address review feedback on provider changelogs

- amazon: move S3 transfer-operators fix (#67378, vincbeck) and EcsRunTaskOperator
  log-level fix (#67180, jscheffl) from Features to Bug Fixes; beautify the
  EksPodOperator entry (#65335).
- google: beautify the idle/auto-stop TTL entry (#65653) — drop conventional-commit prefix.
- openlineage: beautify the ProcessPoolExecutor self-heal entry (#67400).
- edge3: reword the 3.8.0 note — the provider still supports Airflow 3.0+, only the
  execute-callback feature (#67679) needs 3.3+ (jscheffl); move the Swagger API docs
  entry (#67390) to Doc-only; beautify the team_name clarification (#66718).
- apache/drill: move the flit.sdist housekeeping entry (#65861) to the excluded block
  to match the kafka convention (jscheffl).

* Make apache/drill 3.3.3 a doc-only release

The only non-excluded drill change in this wave was the flit.sdist housekeeping
entry, which jscheffl asked to exclude — leaving an empty changelog. Promote the
DAG-to-Dag wording change (#66153) into a Doc-only section so drill 3.3.3 ships as
a legitimate doc-only release instead of an empty one.

* Fold post-prep provider commits into changelogs

Incremental-update pass for commits that landed on main after this wave was prepared:

- amazon: Propogate verify/botocore_config in redshift cluster triggers (#67876) → Features
  (mirrors the batch-triggers entry #67508).
- databricks: Lock in workflow depends_on parent-key behavior (#66681) → Bug Fixes.
- edge3: Fix Edge worker fork mode reporting supervisor failures as success (#67887) → Bug Fixes.
- google: Migrate Stackdriver logging config to RemoteLogIO pattern (#66513) → Misc.
