feat(relay): add OpenTelemetry tracing, keep Prometheus metrics #1398

Merged
wpfleger96 merged 6 commits into main from duncan/otel-migration on Jul 1, 2026

@wpfleger96 wpfleger96 commented Jun 30, 2026

Collaborator

Summary

Adds distributed tracing to the buzz relay via OpenTelemetry while keeping metrics on the existing Prometheus scrape path. OTLP carries traces only; the :9102 Prometheus text endpoint and every existing metric name, label, and bucket are unchanged. Also adds additive DB/Redis connection-pool gauges on the existing metrics-rs path.

Changes

Tracing (new)

  • New telemetry.rs: installs an OTLP gRPC span exporter + SdkTracerProvider only when OTEL_EXPORTER_OTLP_ENDPOINT is set; no-ops (zero overhead, no connection attempted) when unset.
  • try_init_tracer returns a TracerInit enum (Enabled/Disabled/ExporterBuildFailed) so exporter-build failures are logged by the caller after tracing_subscriber is installed — prevents silent drop of the warning.
  • The trace Resource reads OTEL_SERVICE_NAME explicitly with a buzz-relay fallback, plus an EnvResourceDetector overlay for OTEL_RESOURCE_ATTRIBUTES.
  • OpenTelemetryLayer wired into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer — stdout structured logs are unchanged.
  • Spans added on hot paths: ws.auth, ws.event, ws.req, ws.count in connection.rs/handlers/auth.rs/handlers/event.rs (carrying conn_id/event_id/kind/sub_id), and #[instrument(skip_all)] on SubRegistry::fan_out_scoped.
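
The opt-in gating and the deferred warning described above can be sketched roughly as follows. This is a std-only illustration, not the code from telemetry.rs: `build_exporter` and `startup_message` are hypothetical stand-ins for the real OTLP exporter builder and the caller's logging, and the payload types on the enum variants are assumptions.

```rust
/// Outcome of tracer setup, returned to the caller instead of being logged
/// inside the initializer (which may run before any subscriber exists).
#[derive(Debug, PartialEq)]
pub enum TracerInit {
    /// Endpoint was set and the stand-in exporter was built.
    Enabled { endpoint: String },
    /// Endpoint unset: nothing is built and no connection is attempted.
    Disabled,
    /// Endpoint was set but the exporter could not be built.
    ExporterBuildFailed(String),
}

/// Hypothetical stand-in for the OTLP exporter builder; it only checks the scheme.
fn build_exporter(endpoint: &str) -> Result<(), String> {
    if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        Ok(())
    } else {
        Err(format!("invalid endpoint URI: {endpoint}"))
    }
}

/// `endpoint` is the raw value of OTEL_EXPORTER_OTLP_ENDPOINT, if any.
pub fn try_init_tracer(endpoint: Option<&str>) -> TracerInit {
    match endpoint {
        None => TracerInit::Disabled,
        Some(endpoint) => match build_exporter(endpoint) {
            Ok(()) => TracerInit::Enabled { endpoint: endpoint.to_owned() },
            Err(reason) => TracerInit::ExporterBuildFailed(reason),
        },
    }
}

/// Meant to be called only after the log subscriber is installed, so that a
/// build failure is reported rather than silently dropped.
pub fn startup_message(init: &TracerInit) -> String {
    match init {
        TracerInit::Enabled { endpoint } => {
            format!("info: OTLP tracing enabled, exporting to {endpoint}")
        }
        TracerInit::Disabled => "info: OTLP tracing disabled (endpoint unset)".to_owned(),
        TracerInit::ExporterBuildFailed(reason) => {
            format!("warn: OTLP exporter build failed: {reason}")
        }
    }
}
```

The point of returning an enum is ordering: the caller can run `try_init_tracer` first, install the subscriber, and only then turn the result into a log line.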

Metrics — unchanged path

  • metrics.rs is unchanged from main: the metrics-rs / PrometheusBuilder setup, every metric name, label set, and histogram bucket boundary are preserved. Existing Prometheus scrapers and the Datadog Agent openmetrics annotation need no changes.

DB and Redis pool gauges (new, additive)

  • Db::pool_stats() -> DbPoolStats extended to include max: u32 (pool ceiling from DbConfig.max_connections). The max_connections field is stored on Db at construction so it's available without re-reading config.
  • Background task in main.rs polls pool stats (interval via BUZZ_POOL_METRICS_INTERVAL_SECS, clamped to >= 1s) and emits via metrics::gauge!:
    • buzz_db_pool_size, buzz_db_pool_idle, buzz_db_pool_active, buzz_db_pool_max
    • buzz_redis_pool_available, buzz_redis_pool_size, buzz_redis_pool_max, buzz_redis_pool_waiting
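
As a rough illustration of how one pool snapshot maps onto the four DB gauges, here is a std-only sketch. The field names `size` and `idle` and the `gauges` helper are assumptions for illustration; only `max: u32` and the gauge names come from this PR, and the real task emits through `metrics::gauge!` rather than returning pairs.

```rust
/// Snapshot of the Postgres pool at one poll tick (field names are assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DbPoolStats {
    /// Connections currently held by the pool, idle or in use.
    pub size: u32,
    /// Connections sitting idle in the pool.
    pub idle: u32,
    /// Configured ceiling, taken from DbConfig.max_connections.
    pub max: u32,
}

impl DbPoolStats {
    /// In-use connections, derived as size - idle (saturating, in case the
    /// two readings were taken at slightly different moments).
    pub fn active(&self) -> u32 {
        self.size.saturating_sub(self.idle)
    }

    /// Gauge name/value pairs that a poll tick would emit.
    pub fn gauges(&self) -> [(&'static str, f64); 4] {
        [
            ("buzz_db_pool_size", f64::from(self.size)),
            ("buzz_db_pool_idle", f64::from(self.idle)),
            ("buzz_db_pool_active", f64::from(self.active())),
            ("buzz_db_pool_max", f64::from(self.max)),
        ]
    }
}
```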

Graceful shutdown

  • The OTLP tracer provider is flushed on SIGTERM drain (after audit drain), with warning-only error handling. No meter-provider shutdown — metrics stay on the Prometheus exporter.

Environment variables

| Variable | Default | Purpose |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (unset = tracing disabled) | OTLP gRPC trace endpoint |
| OTEL_SERVICE_NAME | buzz-relay | service.name resource attribute on traces |
| OTEL_RESOURCE_ATTRIBUTES | | Extra trace resource attributes |
| BUZZ_METRICS_PORT | 9102 | Prometheus scrape port (unchanged) |
| BUZZ_POOL_METRICS_INTERVAL_SECS | 10 | Pool stats poll interval (seconds) |

Backward compatibility

Metrics are unconditional. /metrics on :9102 serves the same Prometheus text format with identical metric names, labels, and buckets regardless of any OpenTelemetry setting — this PR does not change the metrics path at all (metrics.rs is byte-identical to main).

OpenTelemetry adds traces only, and is opt-in. OTEL_EXPORTER_OTLP_ENDPOINT gates tracing exclusively: when it is unset — every current deployment — no tracer is initialized and no OTLP connection is attempted. JSON stdout logs are unchanged. Net effect for existing deployments: zero behavioral change.

Notes

OTEL 0.31 / 0.32 dual version in Cargo.lock: opentelemetry-otlp 0.31.1 appears transitively via mesh-llm-host-runtime 0.68.0 (a dev-dependency). It does not affect the production binary; both versions coexist because they are independent crate versions in the dependency graph. Not controllable from this PR without bumping the mesh-llm pin.

Related: block-coder-tf-stacks#2267 — staging relay OTLP endpoint config.

…eus export

Replace metrics-rs/metrics-exporter-prometheus with OpenTelemetry native
instruments backed by both a Prometheus text endpoint (:9102) and an OTLP
gRPC exporter. Add distributed tracing via tracing-opentelemetry. Add DB
and Redis pool metrics.

## What changed

### Metrics
- Rewrote metrics.rs as an OTEL setup module: SdkMeterProvider with a
  PrometheusExporter (pull-based, same /metrics endpoint) and an optional
  PeriodicReader+OTLP exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT.
- Migrated all 41 metrics::counter!/histogram!/gauge! call sites across
  connection.rs, subscription.rs, state.rs, handlers/, and api/ to the
  pre-built Metrics struct (OnceLock, zero per-call-site allocation).
- Preserved every metric name, type, label set, and histogram bucket from
  the prior implementation so existing Prometheus scrapers (including the
  Datadog Agent openmetrics annotation) need no changes.
- Instruments lazy-init to OTEL noop meter when install() hasn't been
  called, matching prior metrics-rs behaviour in unit tests.

### Tracing
- Added telemetry.rs: try_init_tracer() initialises an OTLP gRPC span
  exporter + SdkTracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is set;
  returns None (zero overhead) when unset.
- Wired OpenTelemetryLayer into the tracing_subscriber stack in main.rs
  alongside the existing JSON fmt layer (stdout logs unchanged).
- Added #[instrument] spans on hot paths: handle_event, fan_out_pubsub_event,
  handle_auth, SubRegistry::fan_out_scoped.

### DB and Redis pool metrics
- Added Db::pool_stats() -> DbPoolStats in buzz-db (exposes sqlx pool
  size and num_idle).
- Added background task in main.rs polling pool stats every 10 s
  (configurable via BUZZ_POOL_METRICS_INTERVAL_SECS) and emitting
  buzz_db_pool_{size,idle,active} and buzz_redis_pool_{available,size,
  max,waiting} gauges.

### Graceful shutdown
- SdkMeterProvider and optional SdkTracerProvider flushed on SIGTERM
  drain.

## Environment variables (all optional)
- OTEL_EXPORTER_OTLP_ENDPOINT — unset disables OTEL entirely
- OTEL_SERVICE_NAME — defaults to buzz-relay
- OTEL_RESOURCE_ATTRIBUTES — extra resource attributes
- OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG — sampling strategy
- BUZZ_POOL_METRICS_INTERVAL_SECS — pool poll interval (default 10)

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 marked this pull request as draft June 30, 2026 15:50
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 30, 2026 12:02
…entity, metrics port bind

## Fixes

### IMPORTANT 1 — Prometheus bind failure now fails startup
Bind the metrics TcpListener synchronously in install() before tokio::spawn.
serve_prometheus() now accepts a pre-bound TcpListener instead of a port
number. A port conflict panics at startup (matching prior behaviour) rather
than silently dropping the metrics endpoint from a detached task.
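
A minimal sketch of the bind-before-spawn pattern, using std::net and std::thread in place of tokio so it stays self-contained. `bind_metrics_listener` is a hypothetical helper name, and this `serve_prometheus` only mirrors the shape of the real function (it takes a pre-bound listener), not its body.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Bind synchronously so that a port conflict surfaces to the caller at
/// startup instead of inside a detached background task.
pub fn bind_metrics_listener(port: u16) -> io::Result<TcpListener> {
    TcpListener::bind(SocketAddr::from(([127, 0, 0, 1], port)))
}

/// Accept loop that takes an already-bound listener rather than a port number.
pub fn serve_prometheus(listener: TcpListener) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for stream in listener.incoming() {
            // A real server would write the Prometheus text exposition here.
            drop(stream);
        }
    })
}
```

With this split, startup code can call `bind_metrics_listener(port).expect(...)` before spawning anything, which reproduces the fail-fast behaviour the fix restores.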

### IMPORTANT 2 — OTLP service.name defaults to buzz-relay
Build a single shared Resource via service_resource() in telemetry.rs using
ResourceBuilder::with_service_name(buzz-relay) followed by
with_detector(EnvResourceDetector) so that OTEL_SERVICE_NAME and
OTEL_RESOURCE_ATTRIBUTES still win when set. Both SdkTracerProvider and
SdkMeterProvider receive the same Resource so traces and metrics correlate
under the same service identity in Datadog.

### IMPORTANT 3 — Span topology: WS flow now produces one connected trace
Create explicit parent spans in handle_text_message() for EVENT (ws.event),
REQ (ws.req), COUNT (ws.count), and AUTH (ws.auth) messages, each carrying
conn_id. Spawned handler futures are wrapped with .instrument(span) so the
tracing context is not dropped at the tokio::spawn boundary. handle_event()
and handle_auth() now call Span::current().record() to populate the event_id
and kind/conn_id fields declared in their #[instrument] attributes.

### NIT 4 — target_info series suppressed for byte-parity
Add .without_target_info() to the Prometheus exporter builder so the new
Resource (non-empty after fix 2) does not inject a target_info series that
the old metrics-rs endpoint never emitted.

### NIT 5 — BUZZ_POOL_METRICS_INTERVAL_SECS=0 no longer panics
Clamp interval_secs to >= 1. tokio::time::interval(Duration::ZERO) panics;
a config typo of 0 would have silently killed the pool metrics task.
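
The clamp can be sketched as below. `pool_metrics_interval` is a hypothetical helper name, and falling back to the default of 10 on an unparseable value is an assumption rather than something this commit states.

```rust
use std::time::Duration;

/// Default poll interval in seconds when the variable is unset.
const DEFAULT_POOL_METRICS_INTERVAL_SECS: u64 = 10;

/// `raw` is the value of BUZZ_POOL_METRICS_INTERVAL_SECS, if set.
/// The result is never zero, so it is safe to hand to tokio::time::interval.
pub fn pool_metrics_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_POOL_METRICS_INTERVAL_SECS)
        .max(1); // clamp to >= 1 second
    Duration::from_secs(secs)
}
```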

### CI — cargo fmt drift
Run cargo fmt --all to fix rustfmt line-wrapping across the migrated
crate::metrics::metrics().<handle>.add(...) call sites.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
EnvResourceDetector reads OTEL_RESOURCE_ATTRIBUTES only, not
OTEL_SERVICE_NAME (opentelemetry_sdk 0.32.1 resource/env.rs:23).
SdkProvidedResourceDetector does read OTEL_SERVICE_NAME but always
emits a service.name key, falling back to unknown_service:<exe> when
unset — which would clobber the buzz-relay default.

Read OTEL_SERVICE_NAME explicitly: non-empty value wins over the
buzz-relay fallback; OTEL_RESOURCE_ATTRIBUTES (via EnvResourceDetector
overlaid last) still wins over OTEL_SERVICE_NAME per OTEL spec.
Correct the module and function doc comments that claimed the SDK
detector handled OTEL_SERVICE_NAME automatically.
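
The explicit read amounts to a small fallback rule, sketched here without the SDK. `resolve_service_name` is a hypothetical name, and the OTEL_RESOURCE_ATTRIBUTES overlay is deliberately left out of the sketch.

```rust
/// Fallback used when OTEL_SERVICE_NAME is unset or empty.
const DEFAULT_SERVICE_NAME: &str = "buzz-relay";

/// `otel_service_name` is the raw value of OTEL_SERVICE_NAME, if set.
/// A non-empty value wins; unset and empty both fall back to the default.
pub fn resolve_service_name(otel_service_name: Option<&str>) -> String {
    match otel_service_name {
        Some(name) if !name.is_empty() => name.to_owned(),
        _ => DEFAULT_SERVICE_NAME.to_owned(),
    }
}
```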

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96

Collaborator Author

Exported metrics reference

All metric names, types, and label sets are preserved verbatim from the prior metrics-rs implementation, so existing Prometheus scrapers and Datadog dashboards need no changes. Names are emitted identically on both readers: the Prometheus /metrics text endpoint on :9102 (always on) and the OTLP push exporter (only when OTEL_EXPORTER_OTLP_ENDPOINT is set).

Naming note: the two HTTP framework metrics are intentionally unprefixed (http_*); every other relay metric carries the buzz_ prefix. The Prometheus exporter is configured .without_units() / .without_counter_suffixes() / .without_scope_info() / .without_target_info() to keep byte-parity with the old endpoint.

HTTP framework (track_metrics middleware)

| Metric | Type | Labels | Description |
|---|---|---|---|
| http_requests_total | Counter | code, caller, action | HTTP requests served. caller from the Istio x-envoy-downstream-service-cluster header (validated, unknown fallback); action is the matched route pattern. Health/metrics/unmatched paths skipped to bound cardinality. |
| http_request_latency_ms | Histogram | code, caller, action | Request latency in ms. Explicit buckets: 5/10/25/50/100/250/500/1000/2500/5000/10000. |

WebSocket connections

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_ws_connections_total | Counter | | WebSocket connections accepted. |
| buzz_ws_connections_active | UpDownCounter | | Currently-open WebSocket connections (incremented on register, decremented on close). |
| buzz_ws_backpressure_disconnects_total | Counter | | Connections dropped because the client could not keep up with the send queue. |
| buzz_ws_auth_timeouts_total | Counter | | Connections closed for failing to authenticate within the auth window. |

Subscriptions

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_subscriptions_active | UpDownCounter | | Currently-active REQ subscriptions across all connections. |

Event ingest

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_events_received_total | Counter | kind | Events received over WS. kind is bounded to a known allow-list (else other) to prevent cardinality explosion. |
| buzz_events_stored_total | Counter | kind | Events successfully persisted. |
| buzz_events_rejected_total | Counter | reason | Events rejected, labeled by reason. |
| buzz_event_processing_seconds | Histogram | | End-to-end event processing time. Buckets (s): 0.001/0.005/0.01/0.025/0.05/0.1/0.25/0.5/1/5. |
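
To illustrate how the kind label stays bounded, here is a minimal sketch. The contents of `KNOWN_KINDS` and the `kind_label` name are hypothetical; the relay's actual allow-list is not shown in this thread.

```rust
/// Hypothetical allow-list of event kinds that get their own label value.
const KNOWN_KINDS: &[u32] = &[0, 1, 3, 7];

/// Map an event kind to a metric label value. Anything outside the
/// allow-list collapses to "other", so the label has at most
/// KNOWN_KINDS.len() + 1 distinct values.
pub fn kind_label(kind: u32) -> String {
    if KNOWN_KINDS.contains(&kind) {
        kind.to_string()
    } else {
        "other".to_owned()
    }
}
```

Because clients choose the kind number, labelling by the raw value would let a single client create an unbounded number of series; collapsing unknown kinds avoids that.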

Fan-out / multi-node

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_fanout_recipients | Histogram | | Number of recipients per fanned-out event. Integer-count buckets: 0/1/5/10/25/50/100/500/1000. |
| buzz_multinode_fanout_total | Counter | | Cross-pod fan-out operations published to the pub/sub bus. |
| buzz_multinode_fanout_lag_total | Counter | | Messages dropped because a pod's multi-node fan-out consumer lagged the broadcast channel. |
| buzz_cache_invalidation_lag_total | Counter | | Cache-invalidation messages dropped because a pod's consumer lagged. |

Auth

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_auth_attempts_total | Counter | method | NIP-42 auth attempts (method=nip42). |
| buzz_auth_failures_total | Counter | reason | Auth failures by reason (allowlist_denied, not_relay_member, nip42_invalid). |

Media uploads

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_media_uploads_total | Counter | mime | Successful media uploads, labeled by MIME type. |
| buzz_media_upload_rejections_total | Counter | reason | Upload rejections (rate_limit, concurrency). |

Workflows

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_workflow_runs_total | Counter | trigger | Workflow runs, labeled by trigger kind. |

Audit log

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_audit_log_seconds | Histogram | | Audit-log write latency. Buckets (s): same DURATION_BUCKETS_S as event processing. |
| buzz_audit_log_errors_total | Counter | | Audit-log write failures. |
| buzz_audit_send_errors_total | Counter | | Failures sending audit entries downstream. |

Caches

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_membership_cache_hits_total | Counter | | Membership-cache hits. |
| buzz_membership_cache_misses_total | Counter | | Membership-cache misses. |
| buzz_accessible_channels_cache_hits_total | Counter | | Accessible-channels-cache hits. |
| buzz_accessible_channels_cache_misses_total | Counter | | Accessible-channels-cache misses. |

COUNT fallback

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_count_fallback_rejections_total | Counter | | COUNT queries rejected for requiring a too-broad fallback scan. |

Connection-pool gauges (periodic, every BUZZ_POOL_METRICS_INTERVAL_SECS, default 10s)

| Metric | Type | Labels | Description |
|---|---|---|---|
| buzz_db_pool_size | Gauge | | Total Postgres pool connections. |
| buzz_db_pool_idle | Gauge | | Idle Postgres pool connections. |
| buzz_db_pool_active | Gauge | | In-use Postgres pool connections (size - idle). |
| buzz_redis_pool_size | Gauge | | Current Redis pool connections. |
| buzz_redis_pool_available | Gauge | | Available (idle) Redis pool connections. |
| buzz_redis_pool_max | Gauge | | Configured Redis pool max. |
| buzz_redis_pool_waiting | Gauge | | Callers waiting on a Redis connection. |

Two buzz_search_index_* handles (_seconds histogram, _errors_total counter) are declared but currently have no emitters — the search path moved off the Typesense backend, so they register as zero-value series. Left in place to avoid touching unrelated code in this migration PR; can be pruned in a follow-up.

@wpfleger96 wpfleger96 marked this pull request as ready for review June 30, 2026 17:39
@tlongwell-block

Collaborator

I did the cross-check Tyler asked for.

Confirmation / recommendation: I would keep Prometheus/OpenMetrics scraped by the Datadog Agent as the primary production metrics path, and treat the OTEL/OTLP path as opt-in / less paved for now.

What I found:

  • Datadog Agent can scrape Prometheus/OpenMetrics from Kubernetes pods. Block docs explicitly say the Datadog Agent has built-in OpenMetrics support, configured per pod/container via ad.datadoghq.com/${CONTAINER_NAME}.checks annotations, and that the Agent watches the Kubernetes API for annotated pods/containers to scrape. See: https://dev-guides.sqprod.co/cash/docs/platform/architecture/custom_metrics_collection
  • Afterpay/Block observability docs give the same Kubernetes pod-annotation flow, including openmetrics_endpoint, metrics, histogram_buckets_as_distributions, send_distribution_buckets, and send_monotonic_counter: https://dev-guides.sqprod.co/afterpay/docs/observability/how-to/ingest-prom-metrics
  • This PR preserves a Prometheus text endpoint on :9102/metrics and configures the OTEL Prometheus exporter with .without_units(), .without_counter_suffixes(), .without_scope_info(), and .without_target_info() to keep metric names/output compatible with the old metrics-rs/Prometheus endpoint. That supports the claim that existing Datadog OpenMetrics scraping should not need to change.
  • Internal guidance/search results point to OTEL/OTLP metrics as an active migration/trial area rather than the universally paved path. I found support for Tyler’s “finicky internally” read: collector/OTLP metrics work exists, but docs and internal references repeatedly steer production custom metrics toward Datadog OpenMetrics scraping; OTEL collector/SDK migration is still more situational. I would phrase it as “not abandoned, but less mature/paved internally than Prometheus/OpenMetrics → Datadog Agent.”
  • For tracing, the companion infra PR’s target endpoint matches existing in-cluster usage in block-coder-tf-stacks (cachew and blox-orchestrator already point at http://datadog-agent.datadog-agent.svc.cluster.local:4317). So using the Datadog Agent OTLP gRPC receiver for staging traces is directionally reasonable, but I’d avoid coupling metrics to OTLP unless there’s a specific reason.

Net: Will’s stated refactor direction — Prometheus/OpenMetrics for relay health/activity metrics, Datadog Agent scrape from pods, OTEL only where explicitly needed (e.g. traces behind OTEL_EXPORTER_OTLP_ENDPOINT) — matches the internal paved path better than a full OTEL metrics migration.

…e, keep OTLP tracing

Drop the OpenTelemetry metrics exporter (opentelemetry-prometheus, prometheus
crate, OTLP metric push) in favour of the original metrics-rs facade
(metrics::counter!/gauge!/histogram! macros) backed by metrics-exporter-prometheus.

OTLP tracing (telemetry.rs, tracing-opentelemetry, OTEL trace spans in
connection/event/auth handlers) is intentionally preserved — only the metrics
path is reverted.

Changes:
- Cargo.toml: restore metrics + metrics-exporter-prometheus workspace deps;
  strip metrics/logs features from opentelemetry, opentelemetry_sdk,
  opentelemetry-otlp, tracing-opentelemetry; remove opentelemetry-prometheus
  and prometheus = "0.14"
- metrics.rs: restore PrometheusBuilder-based setup verbatim from origin/main
- All call sites: crate::metrics::metrics().<handle>.add/record -> metrics-rs
  macros (counter!/gauge!/histogram!) across connection, auth, event, count,
  state, subscription, bridge, media
- main.rs: fanout_lag + cache_inval_lag background consumers converted to
  metrics::counter!; pool gauge section (net-new, kept per spec) converted from
  relay_metrics::meter() OTEL API to metrics::gauge!.set(); meter_provider
  shutdown removed; telemetry comment updated to trace-only
- telemetry.rs: doc comment updated — Resource is for trace provider only
- Cargo.lock: opentelemetry-prometheus and prometheus crates removed

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 changed the title from "feat(relay): migrate telemetry to OpenTelemetry with OTLP and Prometheus export" to "feat(relay): add OpenTelemetry tracing, keep Prometheus metrics" Jun 30, 2026
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits July 1, 2026 14:42
…p, telemetry tests, db pool max

Finding 1: change try_init_tracer to return TracerInit enum (Enabled/Disabled/ExporterBuildFailed)
so the exporter-build failure warning is emitted by the caller after tracing_subscriber is
installed, preventing the warn! from firing with no subscriber attached.

Finding 2: remove opentelemetry-semantic-conventions from root Cargo.toml and
crates/buzz-relay/Cargo.toml — the crate had zero references in buzz-relay/src;
confirmed dropped from Cargo.lock after build.

Finding 3: add #[cfg(test)] unit tests in telemetry.rs covering service_resource()
(default, OTEL_SERVICE_NAME honored, empty string fallback) and try_init_tracer()
(Disabled when endpoint unset, not-Disabled when endpoint set). ENV_LOCK Mutex
serializes env-mutating tests to prevent cross-test races.

Finding 4: store max_connections on Db struct (set from DbConfig in new(),
read from pool.options() in from_pool()); add max: u32 to DbPoolStats; emit
buzz_db_pool_max gauge alongside existing db pool gauges in main.rs.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…sult helper

Extract Err->TracerInit classification into a private classify_exporter_result fn
so the test can synthesize an ExporterBuildError directly rather than relying on
tonic's URI-validation behaviour at build time (which may lazily accept bad URIs).

Rewrite test_try_init_tracer_exporter_build_failed_on_bad_endpoint as
test_classify_exporter_result_maps_err_to_exporter_build_failed: constructs
ExporterBuildError::InvalidUri directly, asserts matches!(result,
TracerInit::ExporterBuildFailed(_)). No env vars needed — no lock required.

Production behaviour unchanged: try_init_tracer delegates to the helper.
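
The extraction can be pictured with the std-only sketch below. Both enums are local stand-ins defined for illustration, not the opentelemetry-otlp or telemetry.rs types, and the generic parameter `T` stands in for whatever the real exporter type is.

```rust
/// Local stand-in for the exporter builder's error type.
#[derive(Debug)]
pub enum ExporterBuildError {
    InvalidUri(String),
}

/// Local stand-in covering only the two outcomes the classifier can produce.
#[derive(Debug)]
pub enum TracerInit<T> {
    Enabled(T),
    ExporterBuildFailed(ExporterBuildError),
}

/// Pure mapping from the builder's Result to the init outcome. Because it
/// takes the Result as input, a test can pass in a synthesized Err without
/// depending on when the transport layer validates the URI.
pub fn classify_exporter_result<T>(result: Result<T, ExporterBuildError>) -> TracerInit<T> {
    match result {
        Ok(exporter) => TracerInit::Enabled(exporter),
        Err(err) => TracerInit::ExporterBuildFailed(err),
    }
}
```

A test then needs no environment variables and no lock: it constructs the error value directly and asserts on the returned variant.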

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 merged commit b1d9d95 into main Jul 1, 2026
29 checks passed
@wpfleger96 wpfleger96 deleted the duncan/otel-migration branch July 1, 2026 20:06