feat: add reusable OpenTelemetry tracing module #2632
Merged: bert-e merged 2 commits into development/8.4 from improvement/ARSN-586/otel-tracing-module on Jun 5, 2026
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| # OpenTelemetry tracing | ||
|
|
||
| Shared OTEL tracing bootstrap and helpers for the S3 platform services | ||
| (backbeat, cloudserver, vault). Arsenal owns the parts that are hard to get | ||
| right and identical across services — SDK lifecycle, the outbound trust | ||
| boundary, span conventions — while each consumer supplies its own | ||
| instrumentation packages. | ||
|
|
||
| ## Import | ||
|
|
||
| Deep-require the built module; do **not** use `require('arsenal').tracing`: | ||
|
|
||
|
|
||
| ```js | ||
| const tracing = require('arsenal/build/lib/tracing'); | ||
| ``` | ||
|
|
||
| The arsenal barrel (`require('arsenal')`) eagerly loads `ioredis` (via | ||
| `lib/metrics`) and `mongodb` (via `lib/storage`); reaching `init()` through it | ||
| would load those before OpenTelemetry can patch them, silently disabling the | ||
| instrumentation. The deep-require pulls only the tracing module, which loads | ||
| nothing instrumentable at import time. | ||
|
|
||
| ## Enabling | ||
|
|
||
| Tracing is **off** unless `ENABLE_OTEL=true`. When enabled, `init()` requires | ||
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` and fails fast if it is unset. | ||
|
|
||
| - `ENABLE_OTEL` — master switch (`isEnabled()`); must be exactly `true`. | ||
| Default: off. | ||
| - `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` — OTLP/HTTP traces collector URL; | ||
| required when enabled. | ||
| - `OTEL_SAMPLING_RATIO` — root sampling ratio in `[0, 1]`. Default `0.01`. | ||
| - `OTEL_SERVICE_NAME` — overrides the `serviceName` option. | ||
| - `OTEL_SERVICE_VERSION` — overrides the `serviceVersion` option. | ||
| - `OTEL_SERVICE_NAMESPACE` — resource `service.namespace`. Default `scality`. | ||
| - `OTEL_TRUSTED_HOSTS` — comma-separated hostnames trusted for outbound trace | ||
| propagation; a `.suffix` entry matches any subdomain. Default: loopback only. | ||
|
|
||
| ## Lifecycle | ||
|
|
||
| ```js | ||
| const tracing = require('arsenal/build/lib/tracing'); | ||
|
|
||
| tracing.init({ | ||
| serviceName: 'cloudserver', | ||
| serviceVersion: require('./package.json').version, | ||
| // Lazy thunk — invoked once inside init(), only when OTEL is enabled, so | ||
| // the instrumentation packages (and their patch hooks) never load when | ||
| // OTEL is off. The consumer owns these packages and their options. | ||
| instrumentations: () => { | ||
| const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http'); | ||
| const { IORedisInstrumentation } = require('@opentelemetry/instrumentation-ioredis'); | ||
| const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb'); | ||
| return [ | ||
| // makeHttpInstrumentationConfig wires the trust boundary + health | ||
| // filter. Outbound-only services (backbeat) spread it and add | ||
| // disableIncomingRequestInstrumentation: true. | ||
| new HttpInstrumentation( | ||
| tracing.makeHttpInstrumentationConfig({ | ||
| healthPaths: ['/live', '/_/healthcheck', '/metrics'], | ||
| }), | ||
| ), | ||
| new IORedisInstrumentation({ requireParentSpan: true }), | ||
| new MongoDBInstrumentation({ enhancedDatabaseReporting: false }), | ||
| ]; | ||
| }, | ||
| }); | ||
|
|
||
| // On shutdown — bounded flush (~5s), safe to call concurrently / when never inited: | ||
| await tracing.close(); | ||
| ``` | ||
|
|
||
| `init()` is idempotent and a no-op when disabled. Build the instrumentation | ||
| array entirely inside the thunk so packages resolve from the consumer's | ||
| `node_modules` and load lazily. | ||
|
|
||
| ## API | ||
|
|
||
| - `init(options)` — boot the SDK. Options: `serviceName` (required), | ||
| `serviceVersion?`, `instrumentations?: () => Instrumentation[]`. | ||
| - `close()` — bounded-flush shutdown; idempotent. | ||
| - `isEnabled()` — `process.env.ENABLE_OTEL === 'true'`. | ||
| - `makeHttpInstrumentationConfig({ healthPaths? })` → params for the consumer's | ||
| `HttpInstrumentation`: the outbound trust-boundary `requestHook` plus an | ||
| `ignoreIncomingRequestHook` that drops OPTIONS and the given health/probe | ||
| paths (none by default). arsenal never disables inbound spans — an | ||
| outbound-only service spreads the result and adds | ||
| `disableIncomingRequestInstrumentation: true`. | ||
| - `instrumentApiMethod(handler, methodName)` — wrap a callback/async/sync | ||
| handler in an `api.<methodName>` span (scope `arsenal.api`, `err.code` → | ||
| `error.type`); returns the handler unchanged when OTEL is off. | ||
| - `kafka.*` — trace-context propagation over node-rdkafka headers (scope | ||
| `arsenal.kafka`): `traceHeadersFromEntry`, `traceHeadersFromCurrentContext`, | ||
| `contextFromKafkaHeaders`, `startLinkedSpanFromKafkaEntry`. | ||
|
|
||
|
|
||
| ## Examples | ||
|
|
||
| Instrument API handlers (cloudserver / vault) — wrap each handler so calls | ||
| produce `api.<name>` spans: | ||
|
|
||
| ```js | ||
| const { instrumentApiMethod } = require('arsenal/build/lib/tracing'); | ||
| for (const [name, handler] of Object.entries(api)) { | ||
| api[name] = instrumentApiMethod(handler, name); | ||
| } | ||
| ``` | ||
|
|
||
| Kafka propagation (backbeat) — producer stamps headers from the active span or | ||
| an oplog entry; consumer starts a new trace linked to the upstream span: | ||
|
|
||
| ```js | ||
| const { kafka } = require('arsenal/build/lib/tracing'); | ||
|
|
||
| // producer side | ||
| producer.send(payload, kafka.traceHeadersFromCurrentContext()); | ||
| producer.send(payload, kafka.traceHeadersFromEntry(objectMd)); | ||
|
|
||
| // consumer side | ||
| const { ctx, span } = kafka.startLinkedSpanFromKafkaEntry(entry, 'replicate'); | ||
| try { | ||
| /* ...work within ctx... */ | ||
| } finally { | ||
| span.end(); | ||
| } | ||
| ``` | ||
|
|
||
| ## Dependencies | ||
|
|
||
| `@opentelemetry/api` is a hard dependency (inert until an SDK is registered). | ||
| The SDK-core packages (`sdk-node`, `sdk-trace-base`, `resources`, | ||
| `exporter-trace-otlp-http`) are **optional** dependencies, required lazily in | ||
| `init()`. The `instrumentation-*` packages are **not** arsenal dependencies — | ||
| consumers bring their own and pass them via the `instrumentations` thunk. | ||
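
The lazy `instrumentations` thunk described under Lifecycle can be illustrated without any OpenTelemetry packages. The sketch below is a hypothetical stand-in, not arsenal's `init()`: `makeInitSketch`, `initSketch` and `loadLog` are names invented for illustration, and the environment is passed in as a plain object instead of being read from `process.env`. It only aims to show that the thunk body runs when the switch is exactly `true`, and at most once.

```js
// Hypothetical stand-in for the gating behaviour of init(); not arsenal code.
function makeInitSketch(env) {
    let started = false;
    return function initSketch(options) {
        // Same switch semantics as isEnabled(): the value must be exactly 'true'.
        if (env.ENABLE_OTEL !== 'true' || started) {
            return [];
        }
        started = true;
        // The thunk is only invoked here, so any require() calls inside it are
        // deferred until tracing is known to be enabled.
        return options.instrumentations ? options.instrumentations() : [];
    };
}

const loadLog = [];
const thunk = () => {
    loadLog.push('instrumentation packages loaded');
    return ['http', 'ioredis']; // placeholders for real instrumentation objects
};

// '1' is not 'true', so the thunk never runs.
const initOff = makeInitSketch({ ENABLE_OTEL: '1' });
const offResult = initOff({ instrumentations: thunk });

// Enabled: the thunk runs on the first call and is skipped on the second.
const initOn = makeInitSketch({ ENABLE_OTEL: 'true' });
const onResult = initOn({ instrumentations: thunk });
const secondCall = initOn({ instrumentations: thunk });
```

In a real consumer the thunk would contain the `require('@opentelemetry/instrumentation-*')` calls, which is what keeps those packages and their patch hooks out of OTEL-off processes.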
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| import assert from 'assert'; | ||
|
|
||
| import { DEFAULT_SAMPLING_RATIO, SPAN_LIMITS, SHUTDOWN_DEADLINE_MS } from './constants'; | ||
|
|
||
| export interface InitOptions { | ||
| // service.name (OTEL_SERVICE_NAME env overrides). | ||
| serviceName: string; | ||
| // service.version (OTEL_SERVICE_VERSION env overrides). | ||
| serviceVersion?: string; | ||
| // Lazy thunk, invoked once inside init() and only when OTEL is enabled, so the | ||
| // consumer's require()s and instrumentation patch hooks never load when off. | ||
| instrumentations?: () => any[]; | ||
| } | ||
|
|
||
| let sdk: any = null; | ||
|
|
||
| let diagLog: any = null; | ||
| function getDiagLog(): any { | ||
| if (!diagLog) { | ||
| const werelogs = require('werelogs'); | ||
| diagLog = new werelogs.Logger('tracing'); | ||
| } | ||
| return diagLog; | ||
| } | ||
|
|
||
| function makeDiagLogger(): any { | ||
| const log = getDiagLog(); | ||
| const fwd = | ||
| (level: string) => | ||
| (m: string, ...a: unknown[]) => | ||
| log[level](m, a.length ? { args: a } : undefined); | ||
| return { | ||
| error: fwd('error'), | ||
| warn: fwd('warn'), | ||
| info: fwd('info'), | ||
| debug: fwd('debug'), | ||
| verbose: fwd('trace'), | ||
| }; | ||
| } | ||
|
|
||
| export function isEnabled(): boolean { | ||
| return process.env.ENABLE_OTEL === 'true'; | ||
| } | ||
|
|
||
| export function init(options: InitOptions): void { | ||
| if (!isEnabled() || sdk) { | ||
| return; | ||
| } | ||
|
|
||
| const endpoint = process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT; | ||
| assert(endpoint, 'ENABLE_OTEL=true but OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is unset'); | ||
| assert(options.serviceName, 'tracing.init: options.serviceName is required'); | ||
|
|
||
| let samplingRatio = DEFAULT_SAMPLING_RATIO; | ||
| if (process.env.OTEL_SAMPLING_RATIO !== undefined) { | ||
| const parsed = Number(process.env.OTEL_SAMPLING_RATIO); | ||
| assert( | ||
| Number.isFinite(parsed) && parsed >= 0 && parsed <= 1, | ||
| `OTEL_SAMPLING_RATIO must be a finite number in [0, 1], got: ${process.env.OTEL_SAMPLING_RATIO}`, | ||
| ); | ||
| samplingRatio = parsed; | ||
| } | ||
|
|
||
| // diag + NodeSDK are required lazily here (never at module top) so OTEL-off | ||
| // processes load nothing beyond @opentelemetry/api. | ||
| const { diag, DiagLogLevel } = require('@opentelemetry/api'); | ||
| const { NodeSDK } = require('@opentelemetry/sdk-node'); | ||
| diag.setLogger(makeDiagLogger(), DiagLogLevel.WARN); | ||
|
|
||
| sdk = new NodeSDK(_buildSdkConfig(options, endpoint, samplingRatio)); | ||
| sdk.start(); | ||
| } | ||
|
|
||
| export function _buildSdkConfig(options: InitOptions, endpoint: string, samplingRatio: number): any { | ||
| const { resourceFromAttributes } = require('@opentelemetry/resources'); | ||
| const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http'); | ||
| const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base'); | ||
|
|
||
| return { | ||
| resource: resourceFromAttributes({ | ||
| 'service.name': process.env.OTEL_SERVICE_NAME || options.serviceName, | ||
| 'service.version': process.env.OTEL_SERVICE_VERSION || options.serviceVersion, | ||
| // Operators set OTEL_SERVICE_NAME per-pod but not the namespace, | ||
| // so the 'scality' default marks Scality-owned traces. | ||
| 'service.namespace': process.env.OTEL_SERVICE_NAMESPACE || 'scality', | ||
| }), | ||
| traceExporter: new OTLPTraceExporter({ url: endpoint }), | ||
| // Disable OTLP log + metric exporters with empties | ||
| logRecordProcessors: [], | ||
| metricReaders: [], | ||
| spanLimits: SPAN_LIMITS, | ||
| // ParentBased so a service honors the upstream sampled flag; without | ||
| // it, it re-samples at `ratio` and the pipeline rate collapses to | ||
| // ratio × ratio. | ||
| sampler: new ParentBasedSampler({ | ||
| root: new TraceIdRatioBasedSampler(samplingRatio), | ||
| }), | ||
| instrumentations: options.instrumentations ? options.instrumentations() : [], | ||
| }; | ||
| } | ||
|
|
||
| export async function close(): Promise<void> { | ||
| if (!sdk) { | ||
| return; | ||
| } | ||
| // Capture the sdk via the IIFE param and null the module ref synchronously | ||
| // (before any await) so concurrent callers don't both call shutdown(), which | ||
| // isn't idempotent. The IIFE owns the rejection, so a failure that lands after | ||
| // the timeout has won the race can't crash the process. | ||
| const shutdown = (async (running: any) => { | ||
| try { | ||
| await running.shutdown(); | ||
| } catch (err) { | ||
| getDiagLog().error('tracing close failed', { err }); | ||
| } | ||
| })(sdk); | ||
| sdk = null; | ||
| await Promise.race([ | ||
| shutdown, | ||
| // .unref() so the timer doesn't keep the event loop alive. | ||
| new Promise<void>(resolve => { | ||
| setTimeout(resolve, SHUTDOWN_DEADLINE_MS).unref(); | ||
| }), | ||
| ]); | ||
| } |
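
The capture-and-null pattern in `close()` can be exercised in isolation. The following is a minimal sketch under stated assumptions: `fakeSdk` is a hypothetical object that only counts `shutdown()` calls, `closeSketch` mirrors the structure of `close()` above without the werelogs logging, and `DEADLINE_MS` is an illustrative value, not the module's `SHUTDOWN_DEADLINE_MS`.

```js
// Hypothetical fake standing in for NodeSDK; it only counts shutdown() calls.
let shutdownCalls = 0;
const fakeSdk = {
    shutdown() {
        shutdownCalls += 1;
        return Promise.resolve();
    },
};

let sdk = fakeSdk;
const DEADLINE_MS = 50; // illustrative; the module uses SHUTDOWN_DEADLINE_MS

async function closeSketch() {
    if (!sdk) {
        return;
    }
    // The async IIFE runs synchronously up to its first await, so shutdown()
    // has already been called when the next statement clears the reference.
    const shutdown = (async running => {
        try {
            await running.shutdown();
        } catch (err) {
            // Swallowed here so a late rejection is never left unhandled.
        }
    })(sdk);
    sdk = null;
    await Promise.race([
        shutdown,
        new Promise(resolve => {
            // unref() so the deadline timer cannot keep the process alive.
            setTimeout(resolve, DEADLINE_MS).unref();
        }),
    ]);
}

// Two overlapping calls: the second sees sdk === null and returns at once.
const first = closeSketch();
const second = closeSketch();
```

Because the module reference is cleared before the first `await`, the second caller cannot reach `shutdown()`, whichever promise settles first.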
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| export const DEFAULT_SAMPLING_RATIO = 0.01; | ||
|
|
||
| export const SPAN_LIMITS = { | ||
| attributeValueLengthLimit: 4096, | ||
| attributeCountLimit: 128, | ||
| eventCountLimit: 128, | ||
| linkCountLimit: 128, | ||
| }; | ||
|
|
||
| // Bound the shutdown flush so an unreachable collector can't block past | ||
| // Kubernetes' 30s grace (BatchSpanProcessor's own export timeout is also 30s). | ||
| export const SHUTDOWN_DEADLINE_MS = 5000; | ||
|
|
||
| // Instrumentation scopes + API span naming. Service identity comes from | ||
| // service.name on the resource, not these. | ||
| export const API_TRACER_NAME = 'arsenal.api'; | ||
| export const KAFKA_TRACER_NAME = 'arsenal.kafka'; | ||
| export const SPAN_PREFIX = 'api.'; | ||
| export const ERROR_TYPE_ATTR = 'error.type'; |
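
`DEFAULT_SAMPLING_RATIO` is the fallback `init()` uses when `OTEL_SAMPLING_RATIO` is unset. As a rough illustration of how the override is validated, the sketch below pulls that check into a standalone function. `parseSamplingRatio` is a hypothetical helper name, and it throws a `RangeError` where the module uses `assert`; the accepted range is the same.

```js
const DEFAULT_SAMPLING_RATIO = 0.01;

// Mirrors the validation in init(): an unset variable falls back to the
// default, anything else must parse to a finite number in [0, 1].
function parseSamplingRatio(raw) {
    if (raw === undefined) {
        return DEFAULT_SAMPLING_RATIO;
    }
    const parsed = Number(raw);
    if (!Number.isFinite(parsed) || parsed < 0 || parsed > 1) {
        throw new RangeError(
            `OTEL_SAMPLING_RATIO must be a finite number in [0, 1], got: ${raw}`,
        );
    }
    return parsed;
}

const unset = parseSamplingRatio(undefined);
const quarter = parseSamplingRatio('0.25');
// Number('') is 0, so an empty (but set) variable is accepted as ratio 0.
const empty = parseSamplingRatio('');
```

One consequence of using `Number()` here, assuming the check stays as shown: a variable that is set but empty is read as `0`, which turns root sampling off instead of falling back to the default.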
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| import type { ClientRequest } from 'http'; | ||
| import type { Span } from '@opentelemetry/api'; | ||
|
|
||
| // makeHttpInstrumentationConfig builds the params a consumer passes to its | ||
| // HttpInstrumentation: the outbound trust boundary and the inbound health-path | ||
| // span filter. | ||
|
|
||
| const LOOPBACK_HOSTS = ['localhost', '127.0.0.1', '::1']; | ||
|
|
||
| const HOST_BRACKET_RE = /^\[([^\]]+)\](?::\d+)?$/; | ||
| const HOST_PORT_RE = /:\d+$/; | ||
| const TRAILING_DOT_RE = /\.$/; | ||
|
|
||
| // Operator-supplied bare hostnames (+ always-trusted loopback). A '.'-prefixed | ||
| // entry is a NO_PROXY-style suffix (see makeTrustedHostsHook). Unset → only | ||
| // loopback, so every other outbound call gets traceparent stripped. | ||
| export function loadTrustedHosts(): Set<string> { | ||
| const hosts = new Set(LOOPBACK_HOSTS); | ||
| const raw = process.env.OTEL_TRUSTED_HOSTS; | ||
| if (typeof raw === 'string' && raw.length > 0) { | ||
| raw.split(',').forEach(entry => { | ||
| if (entry) { | ||
|
|
||
| hosts.add(entry.toLowerCase()); | ||
|
|
||
| } | ||
|
|
||
| }); | ||
|
|
||
| } | ||
|
|
||
| return hosts; | ||
| } | ||
|
|
||
| // Bare hostname for matching. IPv6 is bracket-handled separately so the :port | ||
| // strip can't eat into the address (naive strip of `[::1]` → `:`). A trailing | ||
| // dot (FQDN form) is stripped so it matches the same patterns. | ||
| function normalizeHostHeader(s: string): string { | ||
| const bracket = HOST_BRACKET_RE.exec(s); | ||
| if (bracket) { | ||
| return bracket[1].toLowerCase(); | ||
| } | ||
| return s.replace(HOST_PORT_RE, '').replace(TRAILING_DOT_RE, '').toLowerCase(); | ||
| } | ||
|
|
||
| // HTTP requestHook that strips traceparent/tracestate (and tags the span | ||
| // suppressed) on outbound calls to untrusted hosts, so trace IDs don't leak off-platform. | ||
| export function makeTrustedHostsHook(trustedHosts: Set<string>) { | ||
| // '.'-prefixed entries are suffix patterns, partitioned once (not per request). | ||
| const suffixes = [...trustedHosts].filter(h => h.startsWith('.')); | ||
| const isTrusted = (host: string): boolean => | ||
| trustedHosts.has(host) || | ||
| // endsWith keeps the dot boundary (`xsvc.cluster.local` not matched); | ||
| // the slice(1) check also trusts the bare apex. | ||
| suffixes.some(sfx => host.endsWith(sfx) || host === sfx.slice(1)); | ||
|
|
||
| return function requestHook(span: Span, request: ClientRequest): void { | ||
| // Only ClientRequest exposes getHeader/removeHeader, not inbound IncomingMessage. | ||
| if (!request || typeof request.getHeader !== 'function') { | ||
| return; | ||
| } | ||
| const host = normalizeHostHeader((request.getHeader('host') || '').toString()); | ||
| if (isTrusted(host)) { | ||
| return; | ||
| } | ||
| if (typeof request.removeHeader === 'function') { | ||
| request.removeHeader('traceparent'); | ||
| request.removeHeader('tracestate'); | ||
| } | ||
| if (span && typeof span.setAttribute === 'function') { | ||
| span.setAttribute('scality.trace.suppressed', true); | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| // True when url's path (query string stripped) exactly matches an entry in | ||
| // pathSet — used to skip spans for health/probe endpoints. The set differs per | ||
| // service, so the caller passes it in. | ||
| export function isHealthPath(url: string | undefined, pathSet: Set<string>): boolean { | ||
| if (typeof url !== 'string' || url.length === 0) { | ||
| return false; | ||
| } | ||
| const qIdx = url.indexOf('?'); | ||
| const path = qIdx === -1 ? url : url.slice(0, qIdx); | ||
| return pathSet.has(path); | ||
| } | ||
|
|
||
| // HttpInstrumentation params (the consumer constructs it): trust-boundary | ||
| // requestHook + an ignore hook for OPTIONS and the given health paths. | ||
| export function makeHttpInstrumentationConfig(options: { healthPaths?: string[] } = {}): Record<string, any> { | ||
| const healthPaths = new Set(options.healthPaths ?? []); | ||
| return { | ||
| ignoreIncomingRequestHook: (req: { method?: string; url?: string }) => | ||
| req.method === 'OPTIONS' || isHealthPath(req.url, healthPaths), | ||
| requestHook: makeTrustedHostsHook(loadTrustedHosts()), | ||
| }; | ||
| } | ||
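
To make the trust-boundary matching rules concrete, here is a small standalone sketch that combines the host normalization and the suffix matching from the diff above into one matcher. `normalizeHost` and `makeIsTrusted` are hypothetical names, the list is passed as a string argument instead of being read from `OTEL_TRUSTED_HOSTS`, and the hostnames are made-up examples.

```js
const LOOPBACK_HOSTS = ['localhost', '127.0.0.1', '::1'];

// Strip an optional :port and trailing dot; bracketed IPv6 is unwrapped first
// so the port strip cannot eat into the address.
function normalizeHost(header) {
    const bracket = /^\[([^\]]+)\](?::\d+)?$/.exec(header);
    if (bracket) {
        return bracket[1].toLowerCase();
    }
    return header.replace(/:\d+$/, '').replace(/\.$/, '').toLowerCase();
}

// Build a matcher from a comma-separated list in the OTEL_TRUSTED_HOSTS format.
function makeIsTrusted(raw) {
    const hosts = new Set(LOOPBACK_HOSTS);
    (raw || '').split(',').forEach(entry => {
        if (entry) {
            hosts.add(entry.toLowerCase());
        }
    });
    // '.'-prefixed entries are suffix patterns, partitioned once up front.
    const suffixes = [...hosts].filter(h => h.startsWith('.'));
    return hostHeader => {
        const host = normalizeHost(hostHeader);
        return hosts.has(host) ||
            suffixes.some(sfx => host.endsWith(sfx) || host === sfx.slice(1));
    };
}

const isTrusted = makeIsTrusted('vault.internal,.svc.cluster.local');
const results = {
    exact: isTrusted('vault.internal:8500'),      // port stripped, exact match
    subdomain: isTrusted('s3.svc.cluster.local'), // matches the suffix entry
    apex: isTrusted('svc.cluster.local.'),        // trailing dot stripped, bare apex
    lookalike: isTrusted('xsvc.cluster.local'),   // no dot boundary, rejected
    ipv6Loopback: isTrusted('[::1]:8000'),        // bracketed IPv6 loopback
    external: isTrusted('example.com'),           // not listed, rejected
};
```

Note that in the revision shown here entries are split on commas without trimming, so a list written with spaces after the commas would produce entries that never match a normalized host.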
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| import * as bootstrap from './bootstrap'; | ||
|
|
||
| export { makeHttpInstrumentationConfig } from './httpHooks'; | ||
| export { instrumentApiMethod, endSpan } from './instrumentation'; | ||
| export type { InitOptions } from './bootstrap'; | ||
| export * as kafka from './kafkaTraceContext'; | ||
|
|
||
| export const isEnabled = bootstrap.isEnabled; | ||
| export const init = bootstrap.init; | ||
| export const close = bootstrap.close; |