diff --git a/ddprof-lib/src/main/java/com/datadoghq/profiler/AgentPhase.java b/ddprof-lib/src/main/java/com/datadoghq/profiler/AgentPhase.java new file mode 100644 index 000000000..df4fcd2b2 --- /dev/null +++ b/ddprof-lib/src/main/java/com/datadoghq/profiler/AgentPhase.java @@ -0,0 +1,25 @@ +package com.datadoghq.profiler; + +public enum AgentPhase { + IDLE("idle"), + CONTEXT_BUILD("context_build"), + AWAITING_INFERENCE("awaiting_inference"), + TOOL_EXECUTION("tool_execution"), + TOKENIZING("tokenizing"), + EMBEDDING("embedding"), + RETRYING("retrying"), + SERIALIZING("serializing"), + INFERENCE_WAIT("inference_wait"); + + public static final String ATTRIBUTE_NAME = "llm.agent.phase"; + + private final String token; + + AgentPhase(String token) { + this.token = token; + } + + public String token() { + return token; + } +} diff --git a/ddprof-lib/src/main/java/com/datadoghq/profiler/ContextSetter.java b/ddprof-lib/src/main/java/com/datadoghq/profiler/ContextSetter.java index bff497b80..afb6a300f 100644 --- a/ddprof-lib/src/main/java/com/datadoghq/profiler/ContextSetter.java +++ b/ddprof-lib/src/main/java/com/datadoghq/profiler/ContextSetter.java @@ -61,4 +61,12 @@ public boolean clearContextValue(int offset) { } return false; } + + public String readContextValue(int offset) { + if (offset < 0) { + return null; + } + ThreadContext ctx = profiler.getThreadContext(); + return ctx != null ? ctx.readContextAttribute(offset) : null; + } } diff --git a/doc/specs/2026-06-02-llm-phase-attribution-design.md b/doc/specs/2026-06-02-llm-phase-attribution-design.md new file mode 100644 index 000000000..e553219d1 --- /dev/null +++ b/doc/specs/2026-06-02-llm-phase-attribution-design.md @@ -0,0 +1,157 @@ +# LLM Phase Attribution — Foundation + F1 Pilot + +**Status:** Approved design, pending implementation plan +**Date:** 2026-06-02 +**Branch:** `jb/llm` +**Scope source:** `doc/temp/llm-profiling-design.md` (15-feature roadmap) + +## Problem + +LLM pipeline state is invisible to the profiler. A CPU flamegraph from an +LLM-based JVM service shows undifferentiated work with no way to tell context +assembly from inference wait from tool execution. This spec covers the shared +*foundation* for LLM phase attribution plus the first feature (F1, "AI Phase +Tagging (CPU)"), proven end-to-end and demonstrated with real recorded data. + +This is one slice of a larger roadmap. The other 14 features are explicitly out +of scope here (see Non-Goals). + +## Goals + +1. Define a per-thread *agent phase* contract owned by `java-profiler`. +2. Let `dd-trace-java` set the phase from instrumentation, async-signal-safe to + read. +3. CPU samples carry the phase so flamegraphs become phase-keyed. +4. Ship a real, runnable LangChain4j scenario that produces a JFR recording + suitable for the profiling backend. + +## Key architectural decision: Hybrid representation + +The roadmap doc proposes a new ~24-byte typed struct in the per-thread native +context (`agentPhase` uint8, etc.). **For this pilot we do not do that.** + +Evidence from the codebase: + +- The per-thread context already has **10 attribute slots** (`DD_TAGS_CAPACITY`), + each a `u32` dictionary-encoded ID, read async-signal-safe in the signal + handler via `getOtelTagEncoding(i)` (`ddprof-lib/src/main/cpp/thread.h:250`). +- `writeCurrentContext()` already emits those tag encodings into **CPU execution + samples** — not only wall-clock — via `recordExecutionSample` + (`ddprof-lib/src/main/cpp/flightRecorder.cpp:1700-1713`). +- `dd-trace-java` already has the full attribute path: `ContextSetter`, + `Profiling.get().setContextValue(...)`, `ProfilingContextAttribute`, and + `DD_PROFILING_CONTEXT_ATTRIBUTES` slot registration + (`DatadogProfiler.java:153`, `DatadogProfilerConfig.getContextAttributes`). + +`agentPhase` is an enum with 9 values — ideal for dictionary encoding (tiny, +stable cardinality). For F1 the dedicated struct buys nothing over the existing +attribute path and costs native struct churn, new JFR metadata, new JNI setters, +and a wider blast radius. + +**Decision:** Expose a typed `AgentPhase` enum as the public contract, but back +it with the existing dictionary-encoded attribute slot. No native changes. + +The dedicated typed fields *are* justified later — for high-cardinality numeric +fields (`contextTokenCount`, `requestContextId`, `requestStartNs`) that would +explode a string dictionary, and because string-encoding the full roadmap would +exhaust the 10 slots. That work is deferred to the first Phase-2 feature that +needs it. + +## Constraints (carried from the roadmap) + +- No malloc / no heap access in signal handlers — reads stay plain field / + dictionary-id accesses on pre-allocated per-thread state. +- No O(N) work in any sampler hot path. +- The phase contract is owned by `java-profiler`; instrumentation lives in + `dd-trace-java`. + +## Component design + +### java-profiler + +1. **`AgentPhase` enum** in `com.datadoghq.profiler` — the layout contract. The + 9 phases from the roadmap (`IDLE`, `INFERENCE_WAIT`, `TOOL_EXECUTION`, + `CONTEXT_BUILD`, `TOKENIZING`, `EMBEDDING`, `AWAITING_INFERENCE`, `RETRYING`, + `SERIALIZING`), each with a stable lowercase string token (e.g. + `AWAITING_INFERENCE` → `"awaiting_inference"`). The profiling backend keys on + these tokens. +2. **Enum→token helper** for use with the existing + `ContextSetter.setContextValue(attribute, value)`. No JNI, no native code, no + new JFR metadata. +3. **Well-known attribute name constant** (`llm.agent.phase`) so both sides agree + on the slot name. + +### dd-trace-java + +1. **Startup registration.** Ensure `llm.agent.phase` is in the registered + attribute set returned by `getContextAttributes`, gated by a config flag + (working name `DD_PROFILING_LLM_PHASE_ENABLED`, default-on for the pilot), + mirroring the existing `PROFILING_CONTEXT_ATTRIBUTES_SPAN_NAME_ENABLED` + pattern. Consumes 1 of 10 slots. This is mandatory: `ContextSetter` is built + once at startup and `offsetOf()` returns `-1` (setter no-ops) for any name not + registered then. +2. **Typed setter.** Add `setAgentPhase(AgentPhase)` to the profiling API + surface, implemented in `DatadogProfilingIntegration` / `DatadogProfiler` by + mapping the enum to its token and calling `setContextValue(offset, token)`. + Use scoped set/restore (`Profiling.get().newScope()`) so nested phases restore + the previous value on exit — satisfying the roadmap's "zeroed at exit" + requirement without leaving a stale phase on the slot. +3. **New LangChain4j instrumentation module.** Tags three phases at LC4j + orchestration boundaries: + - **CONTEXT_BUILD** — AiServices prompt / message assembly (incl. RAG + retrieval if present). + - **AWAITING_INFERENCE** — the model call (`ChatModel.chat`, streaming + variant noted during implementation). + - **TOOL_EXECUTION** — tool executor dispatch. + loopIteration and other numeric counters are out (F3/deferred). Exact LC4j + version and class/method signatures are pinned during implementation + (LC4j renamed `ChatLanguageModel`→`ChatModel` at 1.0, so version matters). + +### JFR output + +No new event type. The phase rides as an existing context **tag encoding** +emitted on every `T_EXECUTION_SAMPLE`. Downstream, each CPU sample carries an +`llm.agent.phase` label → phase-keyed flamegraph. + +**Open verification (implementation-time, not assumed):** confirm the +`llm.agent.phase` label surfaces *by name* in the emitted JFR via the existing +attribute machinery. + +## Testing strategy + +1. **java-profiler unit (`ddprof-test`).** Caller-simulated, no dd-trace-java + dependency: register `llm.agent.phase`, set a phase from a worker thread, run + CPU load, assert the JFR execution samples carry the phase tag encoding. Runs + under `testRelease` / `testDebug` (the `:test` task is disabled). +2. **dd-trace-java instrumentation/muzzle test.** Assert the LangChain4j call + path drives the phase attribute (set on enter, restored on exit) across the + three boundaries. +3. **Demo — mock path (CI-safe).** A canned `ChatModel` returning fixed + responses with an artificial delay; a small AiServices agent with one tool, so + all three phases appear. Deterministic; exercises full instrumentation + JFR + path. +4. **Demo — Ollama path (manual).** Same agent app against a local Ollama model + (e.g. llama3) for realistic inference latency, producing a JFR recording for + the profiling backend. No API key, offline-reproducible. + +## Non-goals / explicit deferrals + +- No dedicated native typed fields; no new JNI setters; no new JFR event type. +- Numeric fields (`contextTokenCount`, `requestContextId`, `requestStartNs`) + deferred to their first consuming Phase-2 feature. +- Roadmap features F2–F15 deferred. +- Other framework modules (Spring AI, OpenAI Java SDK, DJL, HuggingFace) and the + real-OpenAI demo backend deferred. + +## Integration reference (grounding) + +| Concern | Location | +|---|---| +| Attribute slots, async-signal-safe read | `ddprof-lib/.../cpp/thread.h:250` (`getOtelTagEncoding`) | +| Tag encodings emitted on CPU samples | `ddprof-lib/.../cpp/flightRecorder.cpp:1700-1713` | +| Java attribute setter | `ddprof-lib/.../java/com/datadoghq/profiler/ContextSetter.java` | +| Startup attribute set | dd-trace-java `DatadogProfilerConfig.getContextAttributes` | +| ContextSetter wiring | dd-trace-java `DatadogProfiler.java:153` | +| Profiler context bridge | dd-trace-java `DatadogProfilingIntegration.java` | +| Profiling API surface | dd-trace-java `dd-trace-api/.../profiling/ProfilingContext.java` | +| Existing OpenAI instrumentation (pattern reference) | dd-trace-java `dd-java-agent/instrumentation/openai-java/` |