feat: add OTel test semantic convention attributes to Experiment spans #131

Draft
anirudha wants to merge 1 commit into strands-agents:main from anirudha:feat/otel-test-semantic-conventions

@anirudha anirudha commented Feb 10, 2026

Description

Adds test.* span attributes from the OTel test semantic conventions proposal to existing Experiment evaluation spans. All changes are strictly additive: no wrapper spans, no OTel events, all existing gen_ai.evaluation.* attributes preserved unchanged.

Related: open-telemetry/semantic-conventions-genai#79

What changed

src/strands_evals/experiment.py (+34 lines)

| Change | Details |
| --- | --- |
| name parameter | Optional str on __init__, defaults to "unnamed_experiment". Stored as self._name, exposed via @property. |
| to_dict / from_dict | Serializes and restores name. from_dict falls back to "unnamed_experiment" for legacy dicts without the key. |
| import uuid | Added at module top. |
| run_evaluations (sync) | Generates run_id = str(uuid.uuid4()) at method start. Adds test.suite.name, test.suite.run.id, test.case.name, test.case.id to eval_case span initial attributes. Adds test.case.result.status to each evaluator span's set_attributes. |
| run_evaluations_async | Same run_id generation. Passes run_id to _worker. |
| _worker (async) | New run_id parameter. Adds same test.* attributes to execute_case and evaluator spans. |

No new classes, modules, or architectural changes. The diff is ~34 lines of production code.
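
To make the name handling above concrete, here is a minimal sketch of how the default and the from_dict fallback could behave. ExperimentSketch, DEFAULT_NAME, and the cases field are hypothetical stand-ins and not the actual strands_evals code; treating None as "use the default" is also an assumption.

```python
from typing import Any, Dict, List, Optional

# Assumed constant name; the PR only states the default value itself.
DEFAULT_NAME = "unnamed_experiment"


class ExperimentSketch:
    """Hypothetical stand-in for Experiment; only the name handling is modeled."""

    def __init__(self, cases: Optional[List[Any]] = None, name: Optional[str] = None) -> None:
        self._cases = list(cases or [])
        # Only None falls back to the default, so an empty string is kept as given.
        self._name = name if name is not None else DEFAULT_NAME

    @property
    def name(self) -> str:
        return self._name

    def to_dict(self) -> Dict[str, Any]:
        # Always serialize the name so a round trip restores it.
        return {"name": self._name, "cases": list(self._cases)}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExperimentSketch":
        # Legacy dicts have no "name" key; fall back to the default.
        return cls(cases=data.get("cases", []), name=data.get("name", DEFAULT_NAME))


legacy = ExperimentSketch.from_dict({"cases": []})
restored = ExperimentSketch.from_dict(ExperimentSketch(name="demo").to_dict())
```

In this sketch, legacy.name is the default and restored.name is "demo", which mirrors the backward-compatibility and round-trip behavior described in the table.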

pyproject.toml (+3 lines)

Added hypothesis>=6.0.0,<7.0.0 to three dependency sections:

  • [project.optional-dependencies] test
  • [tool.hatch.envs.hatch-test] dependencies
  • [tool.hatch.envs.default] dependencies

tests/strands_evals/test_experiment.py (+11 lines)

Updated 11 existing to_dict test assertions to include the new "name": "unnamed_experiment" key in expected dictionaries. No test logic changed.

tests/strands_evals/test_experiment_otel_conventions.py (new, 741 lines)

6 property-based tests (hypothesis, 100 iterations each) + 10 unit tests:

| Test | What it validates |
| --- | --- |
| Property 1 | Experiment(name=s).name == s for any string |
| Property 2 | to_dict → from_dict round-trip preserves name |
| Property 3 (sync) | eval_case spans have all 4 test.* attributes with correct values |
| Property 3 (async) | execute_case spans have all 4 test.* attributes with correct values |
| Property 4 | Evaluator span test.case.result.status matches aggregate_pass boolean |
| Property 5 | Existing gen_ai.evaluation.* attributes preserved on async spans |
| Property 6 | Experiments with/without name produce identical evaluation reports |
| Unit: default name | "unnamed_experiment" when name not provided |
| Unit: legacy from_dict | Dict without name key defaults correctly |
| Unit: no wrapper span | No test_suite_run span created (sync + async) |
| Unit: no add_event | No OTel events used for test.* data (sync + async) |
| Unit: run_id format | test.suite.run.id is valid UUID4 (sync + async) |
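
As an illustration of the run_id format check, the following sketch shows one way to decide whether a string is a canonical UUID4 using only the standard library. The helper name is_uuid4 is hypothetical; the actual test file may validate the value differently.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value is the canonical string form of a version-4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not parseable as a UUID at all.
        return False
    # Require version 4 and the hyphenated form that str(uuid.uuid4()) produces.
    return parsed.version == 4 and str(parsed) == value.lower()


run_id = str(uuid.uuid4())
```

Comparing str(parsed) with the input rejects alternative spellings (for example, a UUID without hyphens) that uuid.UUID would otherwise accept.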

src/strands_evals/evaluators/coherence_evaluator.py (whitespace only)

Trailing whitespace cleanup on 2 docstring lines, likely introduced by the formatter.

Design decisions worth reviewing

  1. No wrapper span: run_id is a flat attribute on each case span rather than derived from a parent test_suite_run span. This preserves the flat trace structure that backends like Langfuse/Jaeger expect for session_id-based grouping (important for ActorSimulator multi-turn conversations).

  2. Span attributes, not events: All test.* metadata uses span.set_attributes(). Maximizes backend compatibility since not all backends support event attributes.

  3. run_id per invocation, not per instance: Each call to run_evaluations/run_evaluations_async gets a fresh UUID4. Concurrent calls on the same Experiment instance get distinct IDs.

  4. Backward compatibility: name defaults to "unnamed_experiment", from_dict handles missing key gracefully. Constructor accepts all previously valid argument combinations.
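
Design decision 3 can be sketched as follows, assuming a simplified worker that only records attributes instead of creating real spans. RunIdSketch and its internals are hypothetical; only the method names run_evaluations_async and _worker and the attribute keys come from the PR description.

```python
import asyncio
import uuid
from typing import Dict, List


class RunIdSketch:
    """Hypothetical stand-in for Experiment; records attributes instead of spans."""

    def __init__(self, case_names: List[str]) -> None:
        self._case_names = list(case_names)

    async def _worker(self, queue: asyncio.Queue, results: List[Dict[str, str]], run_id: str) -> None:
        # Drain the queue; every case handled in this call shares the same run_id.
        while True:
            try:
                case_name = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await asyncio.sleep(0)  # yield control, standing in for real async work
            results.append({"test.suite.run.id": run_id, "test.case.name": case_name})
            queue.task_done()

    async def run_evaluations_async(self, max_workers: int = 2) -> List[Dict[str, str]]:
        # A fresh UUID4 per invocation, not per instance.
        run_id = str(uuid.uuid4())
        queue: asyncio.Queue = asyncio.Queue()
        for name in self._case_names:
            queue.put_nowait(name)
        results: List[Dict[str, str]] = []
        await asyncio.gather(*(self._worker(queue, results, run_id) for _ in range(max_workers)))
        return results


async def main():
    experiment = RunIdSketch(["case-a", "case-b"])
    # Two concurrent calls on the same instance.
    return await asyncio.gather(
        experiment.run_evaluations_async(),
        experiment.run_evaluations_async(),
    )


first_run, second_run = asyncio.run(main())
```

Because run_id is a local variable of each call rather than instance state, the two concurrent runs cannot overwrite each other's identifier.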

Span attribute schema

eval_case <name>  /  execute_case <name>
├── test.suite.name      = experiment.name
├── test.suite.run.id    = UUID4 (unique per run_evaluations call)
├── test.case.name       = case.name
├── test.case.id         = case.session_id
├── gen_ai.evaluation.*  = (unchanged)
│
└── evaluator <Name>
    ├── test.case.result.status = "pass" | "fail"
    └── gen_ai.evaluation.*     = (unchanged)
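
The schema above could be assembled with two small helpers, sketched here as plain dict builders. The function names and the gen_ai.evaluation.placeholder key are hypothetical illustrations; the real gen_ai.evaluation.* keys are not listed in this PR, and the actual code may build the attributes inline.

```python
from typing import Dict


def case_span_attributes(suite_name: str, run_id: str, case_name: str, session_id: str) -> Dict[str, str]:
    """Build the four test.* attributes for an eval_case / execute_case span."""
    return {
        "test.suite.name": suite_name,
        "test.suite.run.id": run_id,
        "test.case.name": case_name,
        "test.case.id": session_id,
    }


def evaluator_span_attributes(aggregate_pass: bool) -> Dict[str, str]:
    """Map the aggregate pass boolean to the status string on an evaluator span."""
    return {"test.case.result.status": "pass" if aggregate_pass else "fail"}


# Hypothetical pre-existing attribute, used only to show that merging is additive.
existing = {"gen_ai.evaluation.placeholder": "unchanged"}
merged = {**existing, **case_span_attributes("demo_suite", "run-123", "case-a", "session-1")}
```

Since the test.* keys do not collide with the gen_ai.evaluation.* namespace, merging them leaves the existing attributes untouched.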

How to verify

hatch test tests/ -vv

All 400 tests pass (69 existing + 17 new OTel convention tests + 314 other project tests).

@anirudha anirudha marked this pull request as draft February 10, 2026 15:21

@poshinchen poshinchen left a comment

Contributor

Now I remember why I reverted this grouping logic.

In strands-evals we have the ActorSimulator, which lets users run multi-turn conversations for evaluations.

With the current changes, a multi-turn conversation is wrapped into a single case span. But those turns are generally separate requests, and wrapping them contradicts how other OTEL-supported backends (Langfuse, Jaeger, ...) work. Users should group traces by session_id instead of wrapping all the executions in one span.

I had a long discussion with @jjbuck, and this was the final decision we made.

We can debate again whether to group a whole multi-turn conversation into a single span. But to start simple, I'm fine with having test.* and so on as attributes. I think they come from the OTEL test attributes?
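
For illustration, grouping flat spans by session rather than nesting them under a wrapper span might look like the sketch below. Representing spans as plain dicts is an assumption made for the example, and the grouping key test.case.id is used because the PR's schema maps it to case.session_id; a real backend would run an equivalent query on its own data model.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_spans_by_session(spans: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group span names by the session identifier carried in test.case.id."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for span in spans:
        grouped[span["attributes"]["test.case.id"]].append(span["name"])
    return dict(grouped)


# Hypothetical flat spans: two turns of one session and one turn of another.
flat_spans = [
    {"name": "execute_case turn-1", "attributes": {"test.case.id": "session-1"}},
    {"name": "execute_case turn-2", "attributes": {"test.case.id": "session-1"}},
    {"name": "execute_case turn-1", "attributes": {"test.case.id": "session-2"}},
]
by_session = group_spans_by_session(flat_spans)
```

Each turn stays its own span, and the multi-turn conversation is recovered at query time from the shared identifier.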

@poshinchen poshinchen left a comment

Contributor

I also intentionally made them span_attributes instead of event_attributes, because some sources do not support events.

And according to the documentation, they can be set as either span_attributes or event_attributes.

Add test.* span attributes from the OTel semantic conventions proposal to
existing evaluation spans, improving observability and interoperability
with OTel-compatible backends.

Changes:
- Add optional 'name' parameter to Experiment (default: 'unnamed_experiment')
  with serialization round-trip support in to_dict/from_dict/from_file
- Add test.suite.name, test.suite.run.id, test.case.name, test.case.id
  attributes to eval_case (sync) and execute_case (async) spans
- Add test.case.result.status ('pass'/'fail') to evaluator spans
- Generate unique UUID4 run_id per run_evaluations/run_evaluations_async call
- Add hypothesis to test dependencies in pyproject.toml
- Add property-based tests (Properties 1-6) and unit tests for all new
  functionality, backward compatibility, and edge cases

All changes are additive - no wrapper spans introduced, no OTel events used,
all existing gen_ai.evaluation.* attributes preserved unchanged.
@anirudha anirudha force-pushed the feat/otel-test-semantic-conventions branch from 82d2ff7 to 5787c93 Compare February 11, 2026 15:18
@anirudha anirudha changed the title from "feat: align Experiment telemetry with OTel test semantic conventions" to "feat: add OTel test semantic convention attributes to Experiment spans" Feb 11, 2026
queue: Queue containing cases to process
task: Task function to run on each case
results: List to store results
run_id: Unique identifier for this evaluation run

Contributor

Isn't this the same as session.id?

"""
queue: asyncio.Queue[Case[InputT, OutputT]] = asyncio.Queue()
results: list[Any] = []
run_id = str(uuid.uuid4())

Contributor

same here

@poshinchen poshinchen left a comment

Contributor

Do you really need test_experiment_otel_conventions.py?

It basically creates mocks and then verifies the mock spans. Dropping it would also remove the need for the hypothesis dependency.

@anirudha

anirudha commented Feb 18, 2026

Author

out on vacation, i'll address the comments next week

@poshinchen

Contributor

Do you need us to adjust the PR?

@yonib05 yonib05 added the area-core, area-tracing, and enhancement labels Jun 11, 2026
Labels

  • area-core: Core eval framework: Case, Experiment, task handler, evaluation data stores
  • area-tracing: Trace/session ingestion: providers, session mappers, extractors, telemetry/OTEL
  • enhancement: New feature or request
