Add OpenLineage Spark conf injection to DatabricksSubmitRunOperator #67894
Draft
rahul-madaan wants to merge 2 commits into
…nOperator

DatabricksSubmitRunOperator did not emit any OpenLineage information. This adds optional injection of OpenLineage parent job and transport configuration into the job's ``new_cluster.spark_conf`` (single-task and multi-task forms), so the Spark job running on Databricks can correlate its lineage events with the Airflow task and send them to the same backend.

The behaviour is controlled by two new operator parameters, ``openlineage_inject_parent_job_info`` and ``openlineage_inject_transport_info``, each defaulting to the corresponding ``openlineage.spark_inject_*_info`` config option, mirroring the existing Dataproc, EMR and Glue operators.

Injection is skipped when the provider is unavailable, when the relevant properties are already present, or when the job has no ``new_cluster`` to modify (e.g. an existing cluster).

Signed-off-by: rahul-madaan <madan.rahul9@gmail.com>
DatabricksSubmitRunOperator previously emitted no OpenLineage information. This adds optional injection of OpenLineage parent job and transport configuration into the submitted job's new_cluster.spark_conf, so the Spark job running on the Databricks cluster can correlate its lineage events with the triggering Airflow task and ship them to the same OpenLineage backend.

This mirrors the existing automatic-injection support in the Dataproc, EMR (Serverless / on-EKS) and Glue operators.
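The injection described above can be sketched in plain Python. This is only an illustration of the idea, not the helper added by this PR: the function names `inject_into_cluster` and `inject_into_job`, the prefix-based "already present" check, and the sample property value are all assumptions made for the example.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any

# Assumed prefix used to detect parent-job properties that are already set.
PARENT_PREFIX = "spark.openlineage.parent"


def inject_into_cluster(cluster: dict[str, Any], props: dict[str, str], skip_prefix: str) -> None:
    """Merge props into cluster['spark_conf'] unless matching keys already exist."""
    spark_conf = dict(cluster.get("spark_conf") or {})
    if any(key.startswith(skip_prefix) for key in spark_conf):
        return  # properties were set explicitly; leave them untouched
    spark_conf.update(props)  # existing entries are kept, new ones are added
    cluster["spark_conf"] = spark_conf


def inject_into_job(job: dict[str, Any], props: dict[str, str], skip_prefix: str) -> dict[str, Any]:
    """Return a copy of a runs/submit-style payload with props added to each new_cluster."""
    result = deepcopy(job)  # the caller's payload is never mutated
    # Single-task form: top-level new_cluster.
    if isinstance(result.get("new_cluster"), dict):
        inject_into_cluster(result["new_cluster"], props, skip_prefix)
    # Multi-task form: tasks[].new_cluster.
    for task in result.get("tasks") or []:
        if isinstance(task.get("new_cluster"), dict):
            inject_into_cluster(task["new_cluster"], props, skip_prefix)
    return result


# Hypothetical payload: one task with a new cluster, one on an existing cluster.
job = {
    "tasks": [
        {"task_key": "a", "new_cluster": {"spark_conf": {"spark.foo": "1"}}},
        {"task_key": "b", "existing_cluster_id": "abc"},
    ]
}
props = {"spark.openlineage.parentJobName": "my_dag.my_task"}
out = inject_into_job(job, props, PARENT_PREFIX)
```

In this sketch, task `a` gains the new property alongside its existing `spark.foo` entry, task `b` is left alone because it has no `new_cluster`, and the original `job` dictionary is unchanged.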
What changed
- DatabricksSubmitRunOperator:
  - openlineage_inject_parent_job_info: injects spark.openlineage.parentJobNamespace / parentJobName / parentRunId and the rootParent* properties.
  - openlineage_inject_transport_info: injects the spark.openlineage.transport.* properties.

  Each defaults to the corresponding openlineage.spark_inject_parent_job_info / openlineage.spark_inject_transport_info config option, so injection can be enabled globally or per-operator (matching the other operators).
- A provider-local helper inject_openlineage_properties_into_databricks_job that reuses the shared inject_*_into_spark_properties helpers (via apache-airflow-providers-common-compat) and handles both the single-task (top-level new_cluster) and multi-task (tasks[].new_cluster) forms.
- Injection is safely skipped when the OpenLineage provider is unavailable, when the relevant spark.openlineage.* properties are already present, or when the job has no new_cluster to modify (e.g. it targets an existing_cluster_id). Existing spark_conf entries are preserved.

Scope
This PR covers DatabricksSubmitRunOperator. DatabricksRunNowOperator triggers a pre-defined job and therefore has no new_cluster to inject into. Its only injection surface is spark_submit_params (a different, list-of-strings shape), so it is intentionally not part of this change.

Tests
- Unit tests for the operator (parent-only, transport-only, both, disabled, preserves existing spark_conf) and for the helper's traversal (single-task, multi-task, existing-cluster skip, provider-inaccessible, no-mutation-of-input).
- Verified on a real Databricks workspace that a runs/submit payload carrying the injected spark.openlineage.* properties is accepted and the properties round-trip unchanged (confirmed via runs/get), with pre-existing spark_conf preserved.

Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.7 following the guidelines