Add OpenLineage Spark conf injection to DatabricksSubmitRunOperator by rahul-madaan · Pull Request #67894 · apache/airflow

rahul-madaan · 2026-06-02T10:27:43Z

DatabricksSubmitRunOperator previously emitted no OpenLineage information. This adds optional injection of OpenLineage parent job and transport configuration into the submitted job's new_cluster.spark_conf, so the Spark job running on the Databricks cluster can correlate its lineage events with the triggering Airflow task and ship them to the same OpenLineage backend.

This mirrors the existing automatic-injection support in the Dataproc, EMR (Serverless / on-EKS) and Glue operators.

What changed

Two new parameters on DatabricksSubmitRunOperator:
openlineage_inject_parent_job_info — injects spark.openlineage.parentJobNamespace/parentJobName/parentRunId and the rootParent* properties.
openlineage_inject_transport_info — injects the spark.openlineage.transport.* properties.

Each defaults to the corresponding openlineage.spark_inject_parent_job_info / openlineage.spark_inject_transport_info config option, so injection can be enabled globally or per-operator (matching the other operators).

A provider-local helper inject_openlineage_properties_into_databricks_job that reuses the shared inject_*_into_spark_properties helpers (viaapache-airflow-providers-common-compat) and handles both the single-task (top-level new_cluster) and multi-task (tasks[].new_cluster) forms.
Injection is safely skipped when the OpenLineage provider is unavailable, when the relevant spark.openlineage.* properties are already present, or when the job has no new_cluster to modify (e.g. it targets an existing_cluster_id). Existingspark_conf entries are preserved.

Scope

This PR covers DatabricksSubmitRunOperator. DatabricksRunNowOperator triggers a pre-defined job and therefore has no new_cluster to inject into — its only injection surface is spark_submit_params (a different, list-of-strings shape), so it is intentionally not part of this change.

Tests
- Unit tests for the operator (parent-only, transport-only, both, disabled, preserves existing spark_conf) and for the helper's traversal (single-task, multi-task, existing-cluster skip, provider-inaccessible, no-mutation-of-input).
- Verified on a real Databricks workspace that a runs/submit payload carrying the injected spark.openlineage.* properties is accepted and the properties round-trip
- Unit tests for the operator (parent-only, transport-only, both, disabled, preserves existing spark_conf) and for the helper's traversal (single-task, multi-task, existing-cluster skip, provider-inaccessible, no-mutation-of-input).
- Verified on a real Databricks workspace that a runs/submit payload carrying the injected spark.openlineage.* properties is accepted and the properties round-trip unchanged (confirmed via runs/get), with pre-existing spark_conf preserved.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)
Generated-by: Claude Opus 4.7 following the guidelines

Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
When adding dependency, check compliance with the ASF 3rd Party License Policy.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

…nOperator DatabricksSubmitRunOperator did not emit any OpenLineage information. This adds optional injection of OpenLineage parent job and transport configuration into the job's ``new_cluster.spark_conf`` (single-task and multi-task forms), so the Spark job running on Databricks can correlate its lineage events with the Airflow task and send them to the same backend. The behaviour is controlled by two new operator parameters, ``openlineage_inject_parent_job_info`` and ``openlineage_inject_transport_info``, each defaulting to the corresponding ``openlineage.spark_inject_*_info`` config option, mirroring the existing Dataproc, EMR and Glue operators. Injection is skipped when the provider is unavailable, when the relevant properties are already present, or when the job has no ``new_cluster`` to modify (e.g. an existing cluster). Signed-off-by: rahul-madaan <madan.rahul9@gmail.com>

boring-cyborg Bot added area:providers kind:documentation provider:databricks provider:openlineage AIP-53 labels Jun 2, 2026

Merge branch 'main' into rahul-madaan-databricks-ol-injection

e0e1e6d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add OpenLineage Spark conf injection to DatabricksSubmitRunOperator#67894

Add OpenLineage Spark conf injection to DatabricksSubmitRunOperator#67894
rahul-madaan wants to merge 2 commits into
apache:mainfrom
rahul-madaan:rahul-madaan-databricks-ol-injection

rahul-madaan commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rahul-madaan commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Scope

Tests

Was generative AI tooling used to co-author this PR?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rahul-madaan commented Jun 2, 2026 •

edited

Loading