Add OpenTelemetry tracing across the backbeat pipeline #2733

Draft
delthas wants to merge 3 commits into development/9.4 from improvement/BB-764/otel-replication-tracing

@delthas delthas commented Apr 14, 2026

Summary

Add OpenTelemetry tracing across the backbeat pipeline, gated behind ENABLE_OTEL=true. When the flag is unset, no @opentelemetry/* package is loaded, so the disabled path carries no tracing overhead.
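A minimal sketch of the gating idea, not the actual arsenal or backbeat code: `initTracing` and the injected `loadSdk` loader are hypothetical names, and `loadSdk` stands in for whatever code performs the `require('@opentelemetry/...')` calls. The point is only that the loader never runs unless `ENABLE_OTEL` is exactly `'true'`.

```javascript
'use strict';

// Hypothetical helper. `loadSdk` stands in for the code that would require
// the @opentelemetry/* packages; it is only invoked when the flag is set.
function initTracing(env, loadSdk) {
    if (env.ENABLE_OTEL !== 'true') {
        // Disabled path: hand back no-op handles without loading anything.
        return { enabled: false, close: async () => {} };
    }
    const sdk = loadSdk(); // the lazy require() calls would happen here
    sdk.start();
    return { enabled: true, close: () => sdk.shutdown() };
}

// Stand-in loader that records whether it was ever called.
let loaded = false;
const fakeLoader = () => {
    loaded = true;
    return { start() {}, shutdown: async () => {} };
};

const off = initTracing({}, fakeLoader);
const loadedWhenOff = loaded;
const on = initTracing({ ENABLE_OTEL: 'true' }, fakeLoader);
const loadedWhenOn = loaded;
```

In a real entry point the first argument would be `process.env`; passing it in keeps the sketch easy to exercise.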

The SDK bootstrap, trust-boundary host filter, and Kafka trace-context helpers now live in arsenal's shared lib/tracing module (scality/Arsenal#2632, ARSN-586); backbeat consumes it through a thin shim instead of carrying its own copy. Companion to the cloudserver (#6140, CLDSRV-884) and vault (#203, VAULT-708) PRs, so all four services share one implementation.

Commits

  1. chore: depend on arsenal OTEL tracing module — pin arsenal at the ARSN-586 branch (shared tracing module + the W3C trace-context stamping on MongoDB metadata writes, ARSN-572, which the Kafka pipeline relies on to continue traces across the oplog boundary). Drop the SDK-core packages now that arsenal carries them as optionalDependencies, and keep the four instrumentation packages backbeat configures itself: instrumentation-http / -ioredis / -mongodb / -aws-sdk.

  2. feat: replace in-tree tracing with arsenal shim - lib/tracing/index.js becomes a thin shim over require('arsenal/build/lib/tracing') carrying backbeat's config in one place: serviceName: 'backbeat', the four instrumentations, and outbound-only HTTP (...makeHttpInstrumentationConfig() for the trust-boundary requestHook, plus disableIncomingRequestInstrumentation: true since backbeat pods serve no application HTTP). lib/tracing/kafkaTraceContext.js re-exports arsenal's kafka helpers so the existing require sites are unchanged. The trust-boundary filter and SDK bootstrap that used to live here are now arsenal's.

  3. feat: instrument backbeat pods and the Kafka pipeline — wire tracing into the replication, lifecycle, GC, notification, and oplog-populator pods: init() at each of the 8 entry points, per-pod spans, and trace-context propagation across the Kafka pipeline. Producers stamp traceparent onto message headers via the kafka helpers; consumers start a span linked to (not a child of) the upstream span — out-of-process Kafka hops can fire long after the original request, so links keep traces bounded.
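The producer/consumer hand-off in commit 3 can be illustrated with plain objects. This is a sketch under stated assumptions, not arsenal's kafka helpers: `newSpanContext`, `stampTraceparent`, and `startLinkedSpan` are hypothetical names, and spans are modeled as simple records rather than OpenTelemetry SDK objects. Only the `traceparent` header layout (`00-<trace-id>-<parent-id>-<flags>`) comes from the W3C Trace Context format.

```javascript
'use strict';
const { randomBytes } = require('crypto');

// W3C traceparent, version 00: 32 hex trace-id, 16 hex parent-id, 2 hex flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function newSpanContext() {
    return {
        traceId: randomBytes(16).toString('hex'),
        spanId: randomBytes(8).toString('hex'),
        traceFlags: '01', // sampled
    };
}

// Producer side: stamp the current span context onto the message headers.
function stampTraceparent(headers, ctx) {
    const value = `00-${ctx.traceId}-${ctx.spanId}-${ctx.traceFlags}`;
    return { ...headers, traceparent: value };
}

// Consumer side: start a span in a NEW trace and record the upstream span
// as a link, so a late Kafka hop does not extend the original trace.
function startLinkedSpan(name, headers) {
    const raw = headers && headers.traceparent;
    // Kafka clients may deliver header values as Buffers, hence String().
    const match = TRACEPARENT_RE.exec(raw ? String(raw) : '');
    const links = [];
    // All-zero ids are invalid per the W3C format and are ignored.
    if (match && !/^0+$/.test(match[1]) && !/^0+$/.test(match[2])) {
        links.push({ traceId: match[1], spanId: match[2], traceFlags: match[3] });
    }
    return { name, context: newSpanContext(), parentSpanId: undefined, links };
}

const upstream = newSpanContext();
const message = { value: '{}', headers: stampTraceparent({}, upstream) };
const span = startLinkedSpan('process entry', message.headers);
const unlinked = startLinkedSpan('process entry', { traceparent: 'garbage' });
```

Here `span` gets its own trace id and no parent, while `span.links[0]` still points at the upstream span; a malformed header simply yields a span with no links.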

Incidental, intentional: the replicationStatusProcessor SIGTERM handler previously had no process.exit(0) on the success path — an inconsistency with the other 7 entry points. Since this PR adds tracing.close() to that handler, it also adds the missing process.exit(0) so the pod exits cleanly on SIGTERM like the rest (rather than potentially hanging on a non-empty event loop).
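The shutdown ordering described here could look roughly like the sketch below. It assumes promise-returning `stop` and `tracing.close` functions, whose real signatures are not shown in this PR; `makeSigtermHandler` and the injected `exit` parameter are hypothetical, introduced so the ordering can be exercised without terminating the process. Exiting non-zero when the stop step fails is also an assumption.

```javascript
'use strict';

// Hypothetical factory: builds a SIGTERM handler from injected dependencies.
function makeSigtermHandler({ stop, tracing, exit }) {
    return async function onSigterm() {
        let code = 0;
        try {
            await stop(); // stop consuming before flushing telemetry
        } catch (err) {
            code = 1; // assumed: a failed stop exits non-zero
        }
        try {
            await tracing.close(); // flush pending spans
        } catch (err) {
            // A tracing failure should not block process shutdown.
        }
        exit(code); // explicit exit so a non-empty event loop cannot hang the pod
    };
}

// Exercise the ordering with stand-ins instead of the real process.
const calls = [];
const done = makeSigtermHandler({
    stop: async () => { calls.push('stop'); },
    tracing: { close: async () => { calls.push('close'); } },
    exit: code => { calls.push(`exit ${code}`); },
})();

// In an entry point this would be wired as:
// process.on('SIGTERM', makeSigtermHandler({ stop, tracing, exit: process.exit }));
```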

Why a shim (vs cloudserver/vault's direct calls)

backbeat has 8 init() entry points and 6 Kafka-helper require sites. The shim keeps all 14 call sites untouched and the backbeat-specific config in one file; cloudserver and vault have a single entry point each, so they deep-require arsenal directly.
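To make the shim idea concrete, here is a sketch under loud assumptions. The `sharedTracing` object is a stand-in for `require('arsenal/build/lib/tracing')`, whose actual exports and option names are not shown in this PR; `makeShim`, the `http` and `instrumentations` option keys, and the stub bodies are all hypothetical. Only `serviceName: 'backbeat'`, `makeHttpInstrumentationConfig()`, and `disableIncomingRequestInstrumentation: true` are taken from the description above.

```javascript
'use strict';

// Stand-in for the shared arsenal module (assumed shape, not the real API).
const sharedTracing = {
    lastOptions: null,
    init(options) { this.lastOptions = options; },
    close() { return Promise.resolve(); },
    makeHttpInstrumentationConfig() {
        // The real helper is described as providing the trust-boundary requestHook.
        return { requestHook() {} };
    },
};

// Hypothetical shim factory: entry points keep calling init() with no
// arguments, and all backbeat-specific config lives in this one place.
function makeShim(shared) {
    return {
        init() {
            shared.init({
                serviceName: 'backbeat',
                http: {
                    ...shared.makeHttpInstrumentationConfig(),
                    // backbeat pods serve no application HTTP
                    disableIncomingRequestInstrumentation: true,
                },
                instrumentations: ['http', 'ioredis', 'mongodb', 'aws-sdk'],
            });
        },
        close: () => shared.close(),
    };
}

const tracing = makeShim(sharedTracing);
tracing.init(); // same zero-argument call every entry point would make
```

The design trade-off is the one stated above: with many call sites, centralizing the options behind a zero-argument `init()` avoids repeating them at each entry point.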

Configuration

OpenTelemetry environment variables are documented in the arsenal module.

Out of scope (follow-ups)

  • Repin arsenal to a release tag once ARSN-586 ships, replacing the branch pin.

The trust-boundary enforcement (read OTEL_TRUSTED_HOSTS, strip traceparent on untrusted outbound) lives entirely in arsenal and is wired automatically. Populating that env var per deployment is routine operator config — no backbeat-side work.
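For readers unfamiliar with the pattern, a trust-boundary filter of this kind can be sketched as follows. This is not arsenal's implementation: `parseTrustedHosts` and `outboundHeaders` are hypothetical names, and treating `OTEL_TRUSTED_HOSTS` as a comma-separated list of exact hostnames is an assumption, since the variable's format is documented in arsenal rather than here.

```javascript
'use strict';

// Assumed format: comma-separated hostnames, matched case-insensitively.
function parseTrustedHosts(env) {
    return new Set(
        (env.OTEL_TRUSTED_HOSTS || '')
            .split(',')
            .map(host => host.trim().toLowerCase())
            .filter(Boolean)
    );
}

// Return the headers to send: unchanged for trusted hosts, with
// traceparent removed for everything else.
function outboundHeaders(host, headers, trusted) {
    if (trusted.has(host.toLowerCase())) {
        return headers;
    }
    const { traceparent, ...rest } = headers; // drop only the trace header
    return rest;
}

const trusted = parseTrustedHosts({ OTEL_TRUSTED_HOSTS: 'vault.internal, s3.internal' });
const headers = { 'content-type': 'application/json', traceparent: '00-abc-def-01' };
const toTrusted = outboundHeaders('S3.internal', headers, trusted);
const toUntrusted = outboundHeaders('example.com', headers, trusted);
```

With an unset or empty variable the set is empty, so under these assumptions every outbound request would have its `traceparent` stripped.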

Related tickets

Issue: BB-764

bert-e commented Apr 14, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e commented Apr 14, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 60.19417% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.66%. Comparing base (c52fbcc) to head (8412b73).
⚠️ Report is 2 commits behind head on development/9.4.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tensions/lifecycle/conductor/LifecycleConductor.js | 50.00% | 11 Missing ⚠️ |
| lib/tracing/index.js | 33.33% | 6 Missing ⚠️ |
| bin/queuePopulator.js | 0.00% | 3 Missing ⚠️ |
| extensions/gc/service.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/bucketProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/conductor/service.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/objectProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/notification/queueProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/replication/queueProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| ...ons/replication/replicationStatusProcessor/task.js | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| extensions/lifecycle/tasks/LifecycleTask.js | 91.68% <100.00%> (+0.13%) ⬆️ |
| ...ensions/notification/NotificationQueuePopulator.js | 98.21% <100.00%> (+0.03%) ⬆️ |
| ...cation/destination/KafkaNotificationDestination.js | 81.63% <100.00%> (+2.56%) ⬆️ |
| extensions/replication/ReplicationAPI.js | 87.75% <100.00%> (+1.08%) ⬆️ |
| ...xtensions/replication/ReplicationQueuePopulator.js | 91.93% <100.00%> (-1.40%) ⬇️ |
| lib/BackbeatConsumer.js | 93.39% <100.00%> (-1.49%) ⬇️ |
| lib/BackbeatProducer.js | 89.28% <ø> (-0.90%) ⬇️ |
| lib/queuePopulator/QueuePopulatorExtension.js | 94.44% <100.00%> (+3.26%) ⬆️ |
| lib/tracing/kafkaTraceContext.js | 100.00% <100.00%> (ø) |
| bin/queuePopulator.js | 0.00% <0.00%> (ø) |

... and 9 more

... and 6 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 80.22% <75.00%> (-0.01%) ⬇️ |
| Core Library | 81.19% <80.64%> (+0.20%) ⬆️ |
| Ingestion | 70.63% <ø> (-0.61%) ⬇️ |
| Lifecycle | 78.69% <48.88%> (-0.38%) ⬇️ |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 59.65% <50.00%> (-0.14%) ⬇️ |
| Bucket Scanner | 85.76% <ø> (ø) |

@@                 Coverage Diff                 @@
##           development/9.4    #2733      +/-   ##
===================================================
- Coverage            74.73%   74.66%   -0.07%     
===================================================
  Files                  199      201       +2     
  Lines                13650    13741      +91     
===================================================
+ Hits                 10201    10260      +59     
- Misses                3439     3471      +32     
  Partials                10       10

| Flag | Coverage Δ |
|---|---|
| api:retry | 9.06% <0.00%> (-0.07%) ⬇️ |
| api:routes | 8.88% <0.00%> (-0.07%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 10.89% <9.70%> (+0.76%) ⬆️ |
| ingestion | 12.50% <8.73%> (-0.08%) ⬇️ |
| lib | 7.77% <10.67%> (-0.02%) ⬇️ |
| lifecycle | 18.99% <33.00%> (-0.01%) ⬇️ |
| notification | 1.01% <0.00%> (-0.01%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| replication | 18.60% <11.65%> (-0.12%) ⬇️ |
| unit | 51.45% <56.31%> (+0.22%) ⬆️ |

Flags with carried forward coverage won't be shown.

@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 9d08f7b to 2f7afb0 Compare April 14, 2026 10:44
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 2f7afb0 to d562a0a Compare April 14, 2026 10:47
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from d562a0a to 51a9f61 Compare April 14, 2026 11:00
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 51a9f61 to b9d3528 Compare April 14, 2026 16:07
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 970a811 to 849d6b0 Compare April 15, 2026 15:22
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@delthas delthas requested a review from SylvainSenechal June 1, 2026 13:59
@delthas delthas added the claude-review-retro label (PRs with a Claude Code review that could be improved) Jun 1, 2026
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 6357120 to a54ec82 Compare June 2, 2026 15:05
@scality scality deleted a comment from claude Bot Jun 2, 2026
@delthas delthas marked this pull request as draft June 3, 2026 13:37
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from a54ec82 to 54399e3 Compare June 3, 2026 15:40
@scality scality deleted a comment from claude Bot Jun 3, 2026
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 54399e3 to 3189d3a Compare June 3, 2026 17:08
@scality scality deleted a comment from claude Bot Jun 3, 2026
delthas added 2 commits June 3, 2026 19:47
Pin arsenal at the ARSN-586 branch (shared tracing module + W3C
trace-context stamping on MongoDB metadata writes). Drop the SDK-core
packages now that arsenal carries them as optionalDependencies, and
keep the four instrumentation packages (http, ioredis, mongodb,
aws-sdk) here — the consumer owns and configures them.

Issue: BB-764

lib/tracing/index.js becomes a thin shim over arsenal's shared module:
it carries backbeat's config (serviceName, the http/ioredis/mongodb/
aws-sdk instrumentations, outbound-only HTTP via
makeHttpInstrumentationConfig + disableIncomingRequestInstrumentation)
so the 8 entry points keep calling init() with no args.
kafkaTraceContext.js re-exports arsenal's kafka helpers so the existing
require sites are unchanged. The trust-boundary filter and SDK
bootstrap now live in arsenal.

Issue: BB-764
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 3189d3a to cedc91c Compare June 3, 2026 17:47
@scality scality deleted a comment from claude Bot Jun 3, 2026
Wire arsenal's tracing into the replication, lifecycle, GC,
notification, and oplog-populator pods: init() at each entry point,
per-pod spans, and trace-context propagation across the Kafka pipeline
(producers stamp traceparent via the kafka helpers; consumers start
linked spans from it). Out-of-process Kafka hops use span links, not
parent/child, so traces stay bounded.

Issue: BB-764

claude Bot commented Jun 3, 2026

  • package.json:62 — arsenal pinned to a branch instead of a tag; must be repinned before merge

    The rest of the PR is well-structured: OTEL init at entry points with lazy instrumentation loading (zero overhead when disabled), proper span lifecycle with try/catch guards against leaks, linked (not child) spans across Kafka hops, trust-boundary header stripping on the customer-facing notification Kafka, and good test coverage for all the new paths. The incidental process.exit(0) fix in replicationStatusProcessor is correct.

    Review by Claude Code

Labels

claude-review-retro: PRs with a Claude Code review that could be improved

3 participants