Add task.execute detail span around task execute callable#67877
Add task.execute detail span around task execute callable#67877dstandish wants to merge 1 commit into
Conversation
Extract the execute-callable invocation (including the execution-timeout
wrapper) from _execute_task into a dedicated _run_execute_callable helper
decorated with @detail_span("task.execute"). This emits a child span around
the actual task execution when task span detail level is greater than 1,
giving finer-grained tracing of where time is spent within a task run.
Regular task failures mark the span as errored automatically via
OpenTelemetry. AirflowTaskTimeout inherits from BaseException, which
OpenTelemetry does not auto-record, so the timeout handler sets the span
status to ERROR explicitly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| log.exception("Failed to send email notification") | ||
|
|
||
|
|
||
| @detail_span("task.execute") |
There was a problem hiding this comment.
The ctx passed in here is snapshotted back in _execute_task (ctx = contextvars.copy_context(), line 1917), which runs before this task.execute span is entered. So inside ctx.run(execute, context=context) the current OTel span in that snapshot is still _execute_task, not task.execute.
Doesn't that mean spans the operator emits during execute() parent to _execute_task and land as siblings of task.execute rather than nesting under it? task.execute would record the right duration but stay empty of the operator's own child spans, which seems opposite to wrapping "the actual task execution".
Was the intent for operator child spans to nest under task.execute? If so, would re-snapshotting the context inside this function (after the span is current) be the fix? One wrinkle: ctx also carries ExecutorSafeguard.tracker.set(task) from line 1919, so a plain copy_context() here would drop that and it would need re-applying.
|
|
||
| assert result == "ok" | ||
| names = [s.name for s in exporter.get_finished_spans()] | ||
| assert "task.execute" in names |
There was a problem hiding this comment.
These tests assert that task.execute exists and check its status, but none of them assert its parent/child relationship. Given the nesting question I raised on _run_execute_callable, would it be worth asserting that the operator's child spans actually nest under task.execute (e.g. checking span.parent.span_id)? As written, would this suite stay green even if task.execute ends up a sibling of the operator's spans rather than their parent?
| # AirflowTaskTimeout inherits from BaseException, so OpenTelemetry's | ||
| # start_as_current_span won't mark the span as errored on its own | ||
| # (it only does so for Exception subclasses). Set it explicitly. | ||
| trace.get_current_span().set_status(Status(StatusCode.ERROR, "AirflowTaskTimeout")) |
There was a problem hiding this comment.
At detail level 1, task.execute isn't created (the decorator returns INVALID_SPAN), and _execute_task is also a detail_span, so here trace.get_current_span() resolves to the worker.<task_id> span and this marks that span ERROR on timeout. A regular exception at level 1 sets no status explicitly, so the worker span is left untouched unless the exception propagates through its own with block.
Is it intended that a timeout flips the worker span to ERROR at level 1 while an ordinary failure doesn't? Just flagging the asymmetry, not blocking.
Extracts the execute-callable invocation (including the execution-timeout wrapper) from
_execute_taskinto a dedicated_run_execute_callablehelper decorated with@detail_span("task.execute"). This emits a child span around the actual task execution when the task span detail level is greater than 1, giving finer-grained tracing of where time is spent within a task run.Regular task failures mark the span as errored automatically via OpenTelemetry.
AirflowTaskTimeoutinherits fromBaseException, which OpenTelemetry does not auto-record, so the timeout handler sets the span status toERRORexplicitly.Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines