Add OpenLineage Spark conf injection to DatabricksSubmitRunOperator #67894

Draft
rahul-madaan wants to merge 2 commits into apache:main from rahul-madaan:rahul-madaan-databricks-ol-injection

rahul-madaan (Contributor) commented Jun 2, 2026

DatabricksSubmitRunOperator previously emitted no OpenLineage information. This adds optional injection of OpenLineage parent job and transport configuration into the submitted job's new_cluster.spark_conf, so the Spark job running on the Databricks cluster can correlate its lineage events with the triggering Airflow task and ship them to the same OpenLineage backend.

This mirrors the existing automatic-injection support in the Dataproc, EMR (Serverless / on-EKS) and Glue operators.

What changed

  • Two new parameters on DatabricksSubmitRunOperator:
      • openlineage_inject_parent_job_info: injects spark.openlineage.parentJobNamespace/parentJobName/parentRunId and the rootParent* properties.
      • openlineage_inject_transport_info: injects the spark.openlineage.transport.* properties.

Each defaults to the corresponding openlineage.spark_inject_parent_job_info / openlineage.spark_inject_transport_info config option, so injection can be enabled globally or per-operator (matching the other operators).

  • A provider-local helper inject_openlineage_properties_into_databricks_job that reuses the shared inject_*_into_spark_properties helpers (via apache-airflow-providers-common-compat) and handles both the single-task (top-level new_cluster) and multi-task (tasks[].new_cluster) forms.

  • Injection is safely skipped when the OpenLineage provider is unavailable, when the relevant spark.openlineage.* properties are already present, or when the job has no new_cluster to modify (e.g. it targets an existing_cluster_id). Existing spark_conf entries are preserved.
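The traversal and skip rules described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the helper added in this PR: the function names, the placeholder property values, and the prefix-based "already present" check are all assumptions, and the rootParent* properties are left out for brevity.

```python
from copy import deepcopy
from typing import Any, Dict

# Placeholder values (assumptions). A real run would derive these from the
# Airflow task instance and the configured OpenLineage transport.
PARENT_PROPS = {
    "spark.openlineage.parentJobNamespace": "example-namespace",
    "spark.openlineage.parentJobName": "example_dag.example_task",
    "spark.openlineage.parentRunId": "00000000-0000-0000-0000-000000000000",
}
TRANSPORT_PROPS = {
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "http://localhost:5000",
}


def _inject_group(spark_conf: Dict[str, str], props: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Add props unless the user already set any key under the given prefix."""
    if any(key.startswith(prefix) for key in spark_conf):
        return spark_conf
    return {**spark_conf, **props}


def sketch_inject(job_json: Dict[str, Any], inject_parent: bool = True,
                  inject_transport: bool = True) -> Dict[str, Any]:
    """Hypothetical stand-in for the provider-local helper; returns a modified copy."""
    result = deepcopy(job_json)  # never mutate the caller's payload
    if not (inject_parent or inject_transport):
        return result

    # Collect every new_cluster: top-level (single-task) and per-task (multi-task).
    clusters = []
    if isinstance(result.get("new_cluster"), dict):
        clusters.append(result["new_cluster"])
    for task in result.get("tasks") or []:
        if isinstance(task.get("new_cluster"), dict):
            clusters.append(task["new_cluster"])

    # Jobs that only target an existing cluster contribute nothing to this list,
    # so they pass through unchanged.
    for cluster in clusters:
        conf = dict(cluster.get("spark_conf") or {})
        if inject_parent:
            conf = _inject_group(conf, PARENT_PROPS, "spark.openlineage.parent")
        if inject_transport:
            conf = _inject_group(conf, TRANSPORT_PROPS, "spark.openlineage.transport.")
        cluster["spark_conf"] = conf
    return result
```

Returning a deep copy keeps the operator's original json argument intact, which is what the no-mutation-of-input test case is meant to guard.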

Scope

This PR covers DatabricksSubmitRunOperator. DatabricksRunNowOperator triggers a pre-defined job and therefore has no new_cluster to inject into; its only injection surface is spark_submit_params (a different, list-of-strings shape), so it is intentionally not part of this change.
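To illustrate the shape difference only: spark_conf is a key/value mapping, whereas a list-of-strings surface would presumably need each property rendered as a spark-submit style `--conf key=value` pair. The function below is hypothetical and is not part of this PR.

```python
from typing import Dict, List


def render_as_submit_params(properties: Dict[str, str]) -> List[str]:
    """Hypothetical: flatten a properties mapping into '--conf key=value' arguments."""
    params: List[str] = []
    for key, value in properties.items():
        params.extend(["--conf", f"{key}={value}"])
    return params


# The same property in the two shapes:
as_spark_conf = {"spark.openlineage.parentJobName": "example_dag.example_task"}
as_submit_params = render_as_submit_params(as_spark_conf)
```

Merging into a list like this would also need its own detection of already-present properties, which is one reason a separate code path would be required.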

Tests

  • Unit tests for the operator (parent-only, transport-only, both, disabled, preserves existing spark_conf) and for the helper's traversal (single-task, multi-task, existing-cluster skip, provider-inaccessible, no-mutation-of-input).

  • Verified on a real Databricks workspace that a runs/submit payload carrying the injected spark.openlineage.* properties is accepted and the properties round-trip unchanged (confirmed via runs/get), with pre-existing spark_conf preserved.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Generated-by: Claude Opus 4.7 following the guidelines

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

…nOperator

DatabricksSubmitRunOperator did not emit any OpenLineage information. This adds
optional injection of OpenLineage parent job and transport configuration into the
job's ``new_cluster.spark_conf`` (single-task and multi-task forms), so the Spark
job running on Databricks can correlate its lineage events with the Airflow task
and send them to the same backend.

The behaviour is controlled by two new operator parameters,
``openlineage_inject_parent_job_info`` and ``openlineage_inject_transport_info``,
each defaulting to the corresponding ``openlineage.spark_inject_*_info`` config
option, mirroring the existing Dataproc, EMR and Glue operators. Injection is
skipped when the provider is unavailable, when the relevant properties are already
present, or when the job has no ``new_cluster`` to modify (e.g. an existing
cluster).

Signed-off-by: rahul-madaan <madan.rahul9@gmail.com>