Skip to content

Add task.execute detail span around task execute callable#67877

Open
dstandish wants to merge 1 commit into
apache:mainfrom
astronomer:add-task-execute-detail-span
Open

Add task.execute detail span around task execute callable#67877
dstandish wants to merge 1 commit into
apache:mainfrom
astronomer:add-task-execute-detail-span

Conversation

@dstandish
Copy link
Copy Markdown
Contributor

Extracts the execute-callable invocation (including the execution-timeout wrapper) from _execute_task into a dedicated _run_execute_callable helper decorated with @detail_span("task.execute"). This emits a child span around the actual task execution when the task span detail level is greater than 1, giving finer-grained tracing of where time is spent within a task run.

Regular task failures mark the span as errored automatically via OpenTelemetry. AirflowTaskTimeout inherits from BaseException, which OpenTelemetry does not auto-record, so the timeout handler sets the span status to ERROR explicitly.


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

Extract the execute-callable invocation (including the execution-timeout
wrapper) from _execute_task into a dedicated _run_execute_callable helper
decorated with @detail_span("task.execute"). This emits a child span around
the actual task execution when task span detail level is greater than 1,
giving finer-grained tracing of where time is spent within a task run.

Regular task failures mark the span as errored automatically via
OpenTelemetry. AirflowTaskTimeout inherits from BaseException, which
OpenTelemetry does not auto-record, so the timeout handler sets the span
status to ERROR explicitly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
log.exception("Failed to send email notification")


@detail_span("task.execute")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ctx passed in here is snapshotted back in _execute_task (ctx = contextvars.copy_context(), line 1917), which runs before this task.execute span is entered. So inside ctx.run(execute, context=context) the current OTel span in that snapshot is still _execute_task, not task.execute.

Doesn't that mean spans the operator emits during execute() parent to _execute_task and land as siblings of task.execute rather than nesting under it? task.execute would record the right duration but stay empty of the operator's own child spans, which seems opposite to wrapping "the actual task execution".

Was the intent for operator child spans to nest under task.execute? If so, would re-snapshotting the context inside this function (after the span is current) be the fix? One wrinkle: ctx also carries ExecutorSafeguard.tracker.set(task) from line 1919, so a plain copy_context() here would drop that and it would need re-applying.


assert result == "ok"
names = [s.name for s in exporter.get_finished_spans()]
assert "task.execute" in names
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests assert that task.execute exists and check its status, but none of them assert its parent/child relationship. Given the nesting question I raised on _run_execute_callable, would it be worth asserting that the operator's child spans actually nest under task.execute (e.g. checking span.parent.span_id)? As written, would this suite stay green even if task.execute ends up a sibling of the operator's spans rather than their parent?

# AirflowTaskTimeout inherits from BaseException, so OpenTelemetry's
# start_as_current_span won't mark the span as errored on its own
# (it only does so for Exception subclasses). Set it explicitly.
trace.get_current_span().set_status(Status(StatusCode.ERROR, "AirflowTaskTimeout"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At detail level 1, task.execute isn't created (the decorator returns INVALID_SPAN), and _execute_task is also a detail_span, so here trace.get_current_span() resolves to the worker.<task_id> span and this marks that span ERROR on timeout. A regular exception at level 1 sets no status explicitly, so the worker span is left untouched unless the exception propagates through its own with block.

Is it intended that a timeout flips the worker span to ERROR at level 1 while an ordinary failure doesn't? Just flagging the asymmetry, not blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants