Make the DAG processor access metadata exclusively through the API server by kaxil · Pull Request #67878 · apache/airflow

kaxil · 2026-06-01T22:57:16Z

Once this PR is merged, the standalone DAG processor (airflow dag-processor) no longer connects to the metadata
database directly. It persists parse results and reads all metadata through the API server, the
same way workers already operate. This takes direct database access away from one of the last
components that run user-adjacent code.

Persistence (serialized DAGs, import errors, warnings), stale-DAG and orphaned-import-error
reconciliation, bundle sync and state, priority-parse-request and callback claiming, and the
processor's own Job liveness record all go through a new /dag-processing API app. Parse-time
and bundle-initialization Connection/Variable reads resolve through the Execution API.

What changed

- New /dag-processing FastAPI sub-app mounted on the API server
  (airflow.api_fastapi.dag_processing), split into app.py (routes), datamodels.py, and
  security.py.
- New DagProcessingApiClient (httpx) used by the processor: pooled, with bounded retry/backoff
  and a startup readiness wait.
- DagFileProcessorManager routes all persistence and metadata reads through the client.
- Bundle-initialization credentials resolve through the Execution API (the same path workers and
  triggerers use), so a git connection stored in the metadata database keeps working without
  direct DB access.
- New config [core] dag_processing_api_server_url (defaults to the /dag-processing mount of
  the configured API server) and [dag_processor] jwt_audience.
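
For readers unfamiliar with the "bounded retry/backoff" pattern mentioned above, here is a minimal sketch of the idea. It is not the PR's DagProcessingApiClient: the names call_with_retry and TransientError, and the default attempt count and delays, are assumptions made for illustration, and the HTTP call is abstracted as a plain callable so the example runs without httpx.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (connection error, 503, ...)."""


def call_with_retry(
    func: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up after the last attempt
            # Delay doubles per attempt (base, 2*base, 4*base, ...) up to max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

Injecting sleep keeps the helper testable without real waiting; a startup readiness wait could reuse the same helper around a /health probe with a larger attempt budget.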

Design notes

Auth. The processor self-signs a token for [dag_processor] jwt_audience with the
deployment signing key, and the endpoints validate it via JWTBearer. Validation goes through
the same get_sig_validation_args path as the Execution API, so a deployment that configures
[api_auth] trusted_jwks_url validates externally-issued tokens for /dag-processing exactly
as it does for /execution. /health stays unauthenticated for readiness probes.
Resilience. Per-loop API calls are guarded so a transient API outage skips a cycle instead
of crashing the processor, the heartbeat is throttled, and startup waits for API readiness.
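
The resilience behaviour described above (skip a cycle on a transient outage, throttle the heartbeat) can be sketched roughly as follows. This is an illustration under assumptions, not the manager's actual code: ThrottledHeartbeat, run_cycle, the 5-second interval, and the choice of ConnectionError as the "transient" signal are all hypothetical.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger(__name__)


class ThrottledHeartbeat:
    """Send a heartbeat at most once per min_interval seconds."""

    def __init__(
        self,
        send: Callable[[], None],
        min_interval: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent: Optional[float] = None

    def beat(self) -> bool:
        """Return True if a heartbeat was sent, False if it was throttled."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._min_interval:
            return False
        self._send()
        self._last_sent = now  # only recorded when the send did not raise
        return True


def run_cycle(steps: list[Callable[[], None]]) -> bool:
    """Run one loop iteration; on an API failure, log and skip the rest of it."""
    try:
        for step in steps:
            step()
    except ConnectionError as exc:
        log.warning("API unavailable, skipping this cycle: %s", exc)
        return False
    return True
```

The caller keeps looping regardless of the return value, so an outage costs one cycle rather than the whole process.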

Config

[core]
# optional; defaults to the /dag-processing sibling mount of execution_api_server_url
dag_processing_api_server_url = http://api-server:8080/dag-processing

[dag_processor]
# optional; mirrors [execution_api] jwt_audience
jwt_audience = urn:airflow.apache.org:dag-processing
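
As an illustration of what a "sibling mount" default could mean, the sketch below derives a /dag-processing URL from an Execution API URL by swapping the last path segment. How the PR actually computes the default is not shown in this description, so the function name and the derivation rule are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def default_dag_processing_url(execution_api_server_url: str) -> str:
    """Swap the last path segment of the Execution API URL for 'dag-processing'."""
    parts = urlsplit(execution_api_server_url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the final segment is the execution mount; replace it with the sibling.
    sibling = segments[:-1] + ["dag-processing"]
    return urlunsplit((parts.scheme, parts.netloc, "/" + "/".join(sibling), "", ""))


print(default_dag_processing_url("http://api-server:8080/execution/"))
# -> http://api-server:8080/dag-processing
```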

Future Work

Ideally, the entire client side of the dag-processor will be moved to task-sdk in follow-up PRs.

ashb · 2026-06-02T08:01:40Z

Auth. The processor self-signs a token for [dag_processor] jwt_audience with the
deployment signing key, and the endpoints validate it via JWTBearer

I am wary of giving the dag processor the ability to mint any tokens at all -- given it runs user code this seems like it's a huge security risk 🤔

ashb · 2026-06-02T08:05:04Z

+@router.post("/jobs", status_code=201)
+def register_job(body: JobRegisterBody) -> dict:
+    """Register the processor's liveness Job row (server-side) and return its id."""
+    job = Job()
+    job.job_type = body.job_type
+    with create_session() as session:
+        job.prepare_for_execution(session=session)
+    return {"job_id": job.id}
+
+
+@router.post("/jobs/{job_id}/heartbeat")
+def job_heartbeat(job_id: int) -> dict:
+    """Update the processor Job's latest_heartbeat so the health check sees it alive."""
+    with create_session() as session:
+        job = session.get(Job, job_id)
+        if job is None:
+            raise HTTPException(status_code=404, detail="Job not found")
+        job.latest_heartbeat = timezone.utcnow()
+        session.merge(job)
+    return {"alive": True}
+
+
+@router.post("/jobs/{job_id}/complete")
+def complete_job(job_id: int, body: JobCompleteBody) -> dict:
+    """Record the processor Job's terminal state and end time."""
+    with create_session() as session:
+        job = session.get(Job, job_id)
+        if job is not None:
+            job.end_date = timezone.utcnow()
+            job.state = body.state
+            session.merge(job)
+    return {"completed": True}


I'm not sure we want to "encode" this into the API -- mostly I'm not sure that it is a good pattern that we want to follow, especially for things like "static" dags which don't need re-parsing.

ashb · 2026-06-03T12:32:14Z

+# The Execution API is task-instance scoped: its ``sub`` is validated as a UUID. The DAG processor
+# is not a task instance, so its token carries an all-zero sentinel UUID rather than a real id.
+DAG_PROCESSOR_TOKEN_SUBJECT = "00000000-0000-0000-0000-000000000000"


This feels like too much of a hack. There is nothing in JWT that says the sub claim must be a UUID, that is just our choice, so I think for dag processing the sub should be something else.

kaxil · 2026-06-03

Agreed it's a hack. It's a UUID only because the processor reuses the Execution API for parse-time Connection/Variable reads, and those routes go through CurrentTIToken -> TIToken(id: UUID) (the id also feeds team scoping), so a non-UUID sub is rejected before the route runs. The all-zero value stands in for a non-TI principal, and the one token currently carries both the execution and dag-processing audiences.

The DAG Processing validator accepts any sub, so the UUID is only needed on the Execution side. Two ways to give dag processing a real sub:

(1) Two tokens: a dag-processing token with sub=dag-processor and a separate execution token that keeps the UUID (the Execution API genuinely is TI-scoped).

(2) Generalise the Execution principal so the read-only conn/var routes accept a non-TI sub.

I lean toward (2) if you're open to it (one token, no sentinel anywhere); otherwise I'll do (1). Which would you prefer?
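
To make option (2) concrete, here is a rough sketch of a generalised principal: the sub claim is accepted either as a task-instance UUID or as a named non-TI subject, and only routes that genuinely need a task instance insist on the UUID. Principal, parse_principal, require_ti, and the "dag-processor" subject string are hypothetical names for illustration; they are not the Execution API's TIToken/TIClaims classes.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID

DAG_PROCESSOR_SUBJECT = "dag-processor"  # hypothetical non-TI subject


@dataclass(frozen=True)
class Principal:
    subject: str
    ti_id: Optional[UUID]  # set only for task-instance tokens


def parse_principal(sub: str) -> Principal:
    """Accept a known non-TI subject, otherwise require a UUID (a task instance)."""
    if sub == DAG_PROCESSOR_SUBJECT:
        return Principal(subject=sub, ti_id=None)
    try:
        return Principal(subject=sub, ti_id=UUID(sub))
    except ValueError:
        raise ValueError(f"unsupported sub claim: {sub!r}") from None


def require_ti(principal: Principal) -> UUID:
    """Guard for TI-scoped routes; read-only conn/var routes would skip this."""
    if principal.ti_id is None:
        raise PermissionError("route requires a task-instance token")
    return principal.ti_id
```

With this shape no sentinel UUID is needed: the dag-processor principal simply has no ti_id, and anything keyed on the task instance (such as team lookup) fails loudly instead of silently using a fake id.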

ashb · 2026-06-04

sub=dag-processor -- this leads to an interesting conundrum. I'd really like to be able to tie requests for a Connection or Variable to a specific file, which means some kind of token exchange to get a per-file scoped token.

I think the point about CurrentTIToken leads to a more salient design question though: is it right to use the Execution API for this? The "Execution" part of the API is not really true for DAG parsing. I also don't want us to have to duplicate things into the /dag-processing/ API (I'm already not happy with how much we have to duplicate from the public API to the Execution API; doing that a third time makes me sad).

I'm in favour of 2 generally, and probably the TIClaims and TIToken classes are a mistake/overly specific naming. Nothing seems to look at TIClaims that I can see.

$ rg 'token\.' airflow-core/src/airflow/api_fastapi/execution_api
airflow-core/src/airflow/api_fastapi/execution_api/routes/dag_runs.py
130: parent_ti = session.get(TaskInstance, token.id)

airflow-core/src/airflow/api_fastapi/execution_api/routes/xcoms.py
55: token.id,

airflow-core/src/airflow/api_fastapi/execution_api/routes/connections.py
38: token.id,

airflow-core/src/airflow/api_fastapi/execution_api/routes/variables.py
47: token.id,

airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py
318: if token.claims.scope == "workload":

airflow-core/src/airflow/api_fastapi/execution_api/datamodels/token.py
32: Validated JWT claims for a task identity token.

airflow-core/src/airflow/api_fastapi/execution_api/security.py
97: dedup or Cadwyn replays) return the cached token.
160: token_scope = token.claims.scope
187: if str(token.id) != ti_self_id:
239: return await session.scalar(_team_name_for_ti_stmt(token.id))

Also, the "per-team connection" lookup in the Execution API resolves the team via task_instance -> dag_model -> dag_bundle -> dag_bundle.teams, so it wouldn't work with a fake UUID anyway.

…rver

The standalone DAG processor no longer connects to the metadata database. It persists parse results and reads all metadata through the DAG Processing API (AIP-92): a single DagProcessingApiClient routes persistence, bundle state and sync, stale-DAG and warning sweeps, priority-parse and callback claims, and the processor Job lifecycle, with no ORM session in the manager. Bundle-initialization connection/variable reads resolve through the Execution API (the same path workers and triggerers use), so a git connection stored in the metadata database keeps working without direct DB access. The processor parses user code, so it does not hold the signing key or mint its own token: it presents a bearer token the deployment provisions, read from [dag_processor] api_token_path, to both the DAG Processing and Execution APIs. In standalone (a trusted launcher that already holds the signing key), the token is minted and provisioned automatically. Per-loop API calls are guarded so a transient API outage skips a cycle instead of crashing the processor, the heartbeat is throttled, the client retries transient failures, and startup waits for API readiness.

Mount a /dag-processing FastAPI app on the API server with the endpoints the DAG processor persists through: parsing-results, bundle reconcile/state/sync, stale-dags, purge-warnings, priority-parse and callback claim, and Job register/heartbeat/complete. Split into app.py (routes), datamodels.py, and security.py, matching the execution_api layout. The endpoints validate the bearer token the processor presents via JWTBearer, using the same get_sig_validation_args path as the Execution API, so [api_auth] trusted_jwks_url applies equally. /health is open for readiness probes.

The standalone DAG processor now requires an API server that mounts the dag-processing app, and a deployment-provisioned bearer token. List dag-processing in the api-server --apps options, note the requirement in the web-stack docs, and add a significant newsfragment for the breaking change.

The DAG processor parses user code, so it must not hold the JWT signing key or mint its own token. It now carries a bearer token a trusted component provisions to the file at [dag_processor] api_token_path and only reads that file, re-reading it as the token is rotated so a refreshed token is picked up without a restart.
- DagProcessingApiClient reads its token via a callable bearer-auth on each request (short cache plus a 401-triggered re-read), and the Execution API client used for parse-time Connection/Variable reads is re-read per parser spawn.
- airflow.api_fastapi.auth.dag_processor_token mints the dual-audience token. It signs with whatever get_signing_args resolves (symmetric secret or an asymmetric private key), so a JWKS-based control plane validates externally-issued tokens unchanged.
- 'airflow provision-dag-processor-token' is the trusted minter CLI; airflow standalone now provisions through it. [dag_processor] jwt_expiration_time sets the token lifetime.
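
The "short cache plus a 401-triggered re-read" behaviour described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: FileTokenProvider, send_with_token, and the 30-second TTL are hypothetical, and the HTTP request is reduced to a callable that returns a status code.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Union


class FileTokenProvider:
    """Read a bearer token from a file, caching it for a short time."""

    def __init__(
        self,
        path: Union[str, Path],
        cache_ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._path = Path(path)
        self._cache_ttl = cache_ttl
        self._clock = clock
        self._token: Optional[str] = None
        self._read_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now - self._read_at >= self._cache_ttl:
            # Cache miss or expired: pick up a possibly rotated token from disk.
            self._token = self._path.read_text().strip()
            self._read_at = now
        return self._token

    def invalidate(self) -> None:
        """Forget the cached token, e.g. after the server answers 401."""
        self._token = None


def send_with_token(provider: FileTokenProvider, send: Callable[[str], int]) -> int:
    """Send once; on a 401, re-read the token file and retry a single time."""
    status = send(provider.get())
    if status == 401:
        provider.invalidate()
        status = send(provider.get())
    return status
```

The process only ever reads the file, which matches the constraint stated above: rotation is the trusted component's job, and the processor never sees the signing key.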

The DAG processor needs a bearer token but must not hold the signing key. These deployments now mint the token in a trusted step and share it with the processor:
- Helm: an init container (which holds the signing key) mints the token into a shared emptyDir; the dag-processor container reads it read-only and is not given the key. Toggle with dagProcessor.apiToken.provisionViaInitContainer.
- docker-compose (quick-start docs and task-sdk integration tests): airflow-init mints the token to a shared volume that the dag-processor reads.

The test ran the manager three times expecting one parse per run, relying on dag_path.touch() to beat the default 30s min_file_process_interval via an mtime comparison. Under load that filesystem-granularity race left a run waiting out the interval and tripping the per-test timeout. Pin min_file_process_interval=0 so each run re-parses unconditionally; the touch is no longer needed.
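
A simplified model of the race described above, written only to show why pinning the interval to 0 removes it. The function and its parameters are hypothetical and deliberately much simpler than the manager's real scheduling logic.

```python
def should_reparse(
    now: float,
    last_finish: float,
    last_mtime: float,
    current_mtime: float,
    min_interval: float,
) -> bool:
    """Re-parse if the file's mtime changed, or once min_interval has elapsed."""
    if current_mtime != last_mtime:
        return True
    return now - last_finish >= min_interval


# A touch() that lands within the filesystem's timestamp granularity leaves
# the mtime unchanged, so the run has to wait out the 30s interval.
print(should_reparse(1.0, 0.5, 100.0, 100.0, min_interval=30.0))  # False
# With the interval pinned to 0 the mtime no longer matters.
print(should_reparse(1.0, 0.5, 100.0, 100.0, min_interval=0.0))   # True
```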

klisitsynaws · 2026-06-04T16:33:45Z

+                yield request
+
+
+class DagProcessingApiClient:


Would it be beneficial to use API data models declared in datamodels.py here as type hints? Otherwise those are just free-form payloads and it's anyone's guess if they are current.
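
To illustrate the suggestion, a client method typed against shared models might look like the sketch below. Stdlib dataclasses stand in for whatever datamodels.py actually defines, the field list beyond job_type is taken from the later hostname fix and is otherwise an assumption, and TypedJobsClient with its injected post callable is hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class JobRegisterBody:
    job_type: str
    hostname: str
    unixname: str
    pid: int


@dataclass(frozen=True)
class JobRegisterResponse:
    job_id: int


class TypedJobsClient:
    """Client slice whose signatures name the request and response models."""

    def __init__(self, post: Callable[[str, dict[str, Any]], dict[str, Any]]) -> None:
        self._post = post  # injected transport: (path, json payload) -> json response

    def register_job(self, body: JobRegisterBody) -> JobRegisterResponse:
        raw = self._post("/jobs", asdict(body))
        # Parsing into the response model fails fast if the server's shape drifts.
        return JobRegisterResponse(job_id=int(raw["job_id"]))
```

With both sides importing the same models, a renamed or removed field shows up as a type-check or construction error instead of a silently stale free-form payload.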

…stname

The /dag-processing /jobs endpoint built the Job server-side, so it recorded the API server's hostname/pid rather than the processor's. The dag-processor health check (airflow jobs check --job-type DagProcessorJob --hostname <host>) filters by the processor's hostname, so it never matched the row and 'docker compose up --wait' timed out waiting for the processor to become healthy (the e2e and remote-logging PROD-image tests). The processor now reports its hostname/unixname/pid when registering and the endpoint records them, restoring the in-process behaviour.

boring-cyborg Bot added the area:API, area:CLI, area:ConfigTemplates, area:DAG-processing, and kind:documentation labels Jun 1, 2026

kaxil force-pushed the dag-processor-api-persistence branch 2 times, most recently from a536afc to 47d927f Compare June 1, 2026 23:16

ashb reviewed Jun 2, 2026


kaxil added this to the Airflow 3.4.0 milestone Jun 2, 2026

kaxil force-pushed the dag-processor-api-persistence branch 3 times, most recently from c7295a6 to d58de85 Compare June 3, 2026 12:21

ashb reviewed Jun 3, 2026


kaxil added 6 commits June 3, 2026 23:54

kaxil force-pushed the dag-processor-api-persistence branch from d58de85 to 327717c Compare June 3, 2026 22:57

klisitsynaws reviewed Jun 4, 2026


kaxil wants to merge 7 commits into apache:main from astronomer:dag-processor-api-persistence
