feat(relay): add OpenTelemetry tracing, keep Prometheus metrics by wpfleger96 · Pull Request #1398 · block/buzz

wpfleger96 · 2026-06-30T15:38:26Z

Summary

Adds distributed tracing to the buzz relay via OpenTelemetry while keeping metrics on the existing Prometheus scrape path. OTLP carries traces only; the :9102 Prometheus text endpoint and every existing metric name, label, and bucket are unchanged. Also adds new DB/Redis connection-pool gauges on the existing metrics-rs path.

Changes

Tracing (new)

New telemetry.rs: installs an OTLP gRPC span exporter + SdkTracerProvider only when OTEL_EXPORTER_OTLP_ENDPOINT is set; no-ops (zero overhead, no connection attempted) when unset.
try_init_tracer returns a TracerInit enum (Enabled/Disabled/ExporterBuildFailed) so exporter-build failures are logged by the caller after tracing_subscriber is installed — prevents silent drop of the warning.
The trace Resource reads OTEL_SERVICE_NAME explicitly with a buzz-relay fallback, plus an EnvResourceDetector overlay for OTEL_RESOURCE_ATTRIBUTES.
OpenTelemetryLayer wired into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer — stdout structured logs are unchanged.
Spans added on hot paths: ws.auth, ws.event, ws.req, ws.count in connection.rs/handlers/auth.rs/handlers/event.rs (carrying conn_id/event_id/kind/sub_id), and #[instrument(skip_all)] on SubRegistry::fan_out_scoped.
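The three-way init result described above can be sketched with the standard library alone. This is a minimal illustration, not the code in telemetry.rs: `TracerHandle`, `ExporterBuildError`, the `build` closure, and `deferred_warning` are hypothetical stand-ins for the real OTLP exporter types, and the endpoint is passed in as an argument instead of being read from OTEL_EXPORTER_OTLP_ENDPOINT.

```rust
use std::fmt;

/// Hypothetical stand-in for the exporter builder's error type.
#[derive(Debug)]
pub struct ExporterBuildError(pub String);

impl fmt::Display for ExporterBuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Hypothetical stand-in for an installed tracer provider.
#[derive(Debug)]
pub struct TracerHandle {
    pub endpoint: String,
}

/// Outcome of tracer initialisation, returned to the caller instead of logged in place.
#[derive(Debug)]
pub enum TracerInit {
    Enabled(TracerHandle),
    Disabled,
    ExporterBuildFailed(ExporterBuildError),
}

/// `endpoint` models the value of OTEL_EXPORTER_OTLP_ENDPOINT (None = unset).
/// `build` models exporter construction and is never called when the endpoint is unset.
pub fn try_init_tracer<F>(endpoint: Option<&str>, build: F) -> TracerInit
where
    F: FnOnce(&str) -> Result<TracerHandle, ExporterBuildError>,
{
    match endpoint {
        None => TracerInit::Disabled,
        Some(ep) => match build(ep) {
            Ok(handle) => TracerInit::Enabled(handle),
            Err(err) => TracerInit::ExporterBuildFailed(err),
        },
    }
}

/// Warning text the caller would log once its logging subscriber is installed.
pub fn deferred_warning(init: &TracerInit) -> Option<String> {
    match init {
        TracerInit::ExporterBuildFailed(err) => {
            Some(format!("OTLP exporter build failed, tracing disabled: {err}"))
        }
        _ => None,
    }
}
```

Returning the failure as a value lets the caller decide when to log it, which is the point of the enum: a warning emitted before any subscriber exists would be lost.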

Metrics — unchanged path

metrics.rs is unchanged from main: the metrics-rs / PrometheusBuilder setup, every metric name, label set, and histogram bucket boundary are preserved. Existing Prometheus scrapers and the Datadog Agent openmetrics annotation need no changes.

DB and Redis pool gauges (new, additive)

Db::pool_stats() -> DbPoolStats extended to include max: u32 (pool ceiling from DbConfig.max_connections). The max_connections field is stored on Db at construction so it's available without re-reading config.
Background task in main.rs polls pool stats (interval via BUZZ_POOL_METRICS_INTERVAL_SECS, clamped to >= 1s) and emits via metrics::gauge!:
- buzz_db_pool_size, buzz_db_pool_idle, buzz_db_pool_active, buzz_db_pool_max
- buzz_redis_pool_available, buzz_redis_pool_size, buzz_redis_pool_max, buzz_redis_pool_waiting
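As a rough sketch of the interval clamp and the derived `active` gauge, assuming the raw env value is passed in as a string: the function names, the `size`/`idle` field names, and the fall-back-to-default behaviour for unparseable values are assumptions for illustration, not taken from main.rs.

```rust
use std::time::Duration;

/// Default poll interval in seconds, per the environment variable table.
const DEFAULT_POOL_METRICS_INTERVAL_SECS: u64 = 10;

/// Resolve the poll interval from the raw BUZZ_POOL_METRICS_INTERVAL_SECS value.
/// Missing or unparseable input falls back to the default (an assumption here);
/// the result is clamped to at least 1s so a value of 0 never reaches the timer.
pub fn pool_metrics_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_POOL_METRICS_INTERVAL_SECS)
        .max(1);
    Duration::from_secs(secs)
}

/// Simplified stand-in for DbPoolStats; only `max: u32` is confirmed by the PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DbPoolStats {
    pub size: u32,
    pub idle: u32,
    pub max: u32,
}

/// Map one stats snapshot to (gauge name, value) pairs; `active` is derived as size - idle.
pub fn db_pool_gauges(stats: DbPoolStats) -> [(&'static str, f64); 4] {
    let active = stats.size.saturating_sub(stats.idle);
    [
        ("buzz_db_pool_size", f64::from(stats.size)),
        ("buzz_db_pool_idle", f64::from(stats.idle)),
        ("buzz_db_pool_active", f64::from(active)),
        ("buzz_db_pool_max", f64::from(stats.max)),
    ]
}
```

In the relay each pair would be handed to `metrics::gauge!` on every tick; the sketch returns them instead so the logic can be checked without a recorder installed.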

Graceful shutdown

The OTLP tracer provider is flushed on SIGTERM drain (after audit drain), with warning-only error handling. No meter-provider shutdown — metrics stay on the Prometheus exporter.

Environment variables

Variable	Default	Purpose
`OTEL_EXPORTER_OTLP_ENDPOINT`	(unset = tracing disabled)	OTLP gRPC trace endpoint
`OTEL_SERVICE_NAME`	`buzz-relay`	`service.name` resource attribute on traces
`OTEL_RESOURCE_ATTRIBUTES`	—	Extra trace resource attributes
`BUZZ_METRICS_PORT`	`9102`	Prometheus scrape port (unchanged)
`BUZZ_POOL_METRICS_INTERVAL_SECS`	`10`	Pool stats poll interval

Backward compatibility

Metrics are unconditional. /metrics on :9102 serves the same Prometheus text format with identical metric names, labels, and buckets regardless of any OpenTelemetry setting — this PR does not change the metrics path at all (metrics.rs is byte-identical to main).

OpenTelemetry adds traces only, and is opt-in. OTEL_EXPORTER_OTLP_ENDPOINT gates tracing exclusively: when it is unset — every current deployment — no tracer is initialized and no OTLP connection is attempted. JSON stdout logs are unchanged. Net effect for existing deployments: zero behavioral change.

Notes

OTEL 0.31 / 0.32 dual version in Cargo.lock: opentelemetry-otlp 0.31.1 appears transitively via mesh-llm-host-runtime 0.68.0 (a dev-dependency). It does not affect the production binary; both versions coexist because they are independent crate versions in the dependency graph. Not controllable from this PR without bumping the mesh-llm pin.

Related: block-coder-tf-stacks#2267 — staging relay OTLP endpoint config.

…eus export

Replace metrics-rs/metrics-exporter-prometheus with OpenTelemetry native instruments backed by both a Prometheus text endpoint (:9102) and an OTLP gRPC exporter. Add distributed tracing via tracing-opentelemetry. Add DB and Redis pool metrics.

## What changed

### Metrics

- Rewrote metrics.rs as an OTEL setup module: SdkMeterProvider with a PrometheusExporter (pull-based, same /metrics endpoint) and an optional PeriodicReader+OTLP exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT.
- Migrated all 41 metrics::counter!/histogram!/gauge! call sites across connection.rs, subscription.rs, state.rs, handlers/, and api/ to the pre-built Metrics struct (OnceLock, zero per-call-site allocation).
- Preserved every metric name, type, label set, and histogram bucket from the prior implementation so existing Prometheus scrapers (including the Datadog Agent openmetrics annotation) need no changes.
- Instruments lazy-init to OTEL noop meter when install() hasn't been called, matching prior metrics-rs behaviour in unit tests.

### Tracing

- Added telemetry.rs: try_init_tracer() initialises an OTLP gRPC span exporter + SdkTracerProvider when OTEL_EXPORTER_OTLP_ENDPOINT is set; returns None (zero overhead) when unset.
- Wired OpenTelemetryLayer into the tracing_subscriber stack in main.rs alongside the existing JSON fmt layer (stdout logs unchanged).
- Added #[instrument] spans on hot paths: handle_event, fan_out_pubsub_event, handle_auth, SubRegistry::fan_out_scoped.

### DB and Redis pool metrics

- Added Db::pool_stats() -> DbPoolStats in buzz-db (exposes sqlx pool size and num_idle).
- Added background task in main.rs polling pool stats every 10 s (configurable via BUZZ_POOL_METRICS_INTERVAL_SECS) and emitting buzz_db_pool_{size,idle,active} and buzz_redis_pool_{available,size,max,waiting} gauges.

### Graceful shutdown

- SdkMeterProvider and optional SdkTracerProvider flushed on SIGTERM drain.

## Environment variables (all optional)

- OTEL_EXPORTER_OTLP_ENDPOINT: unset disables OTEL entirely
- OTEL_SERVICE_NAME: defaults to buzz-relay
- OTEL_RESOURCE_ATTRIBUTES: extra resource attributes
- OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG: sampling strategy
- BUZZ_POOL_METRICS_INTERVAL_SECS: pool poll interval (default 10)

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…entity, metrics port bind

## Fixes

### IMPORTANT 1: Prometheus bind failure now fails startup

Bind the metrics TcpListener synchronously in install() before tokio::spawn. serve_prometheus() now accepts a pre-bound TcpListener instead of a port number. A port conflict panics at startup (matching prior behaviour) rather than silently dropping the metrics endpoint from a detached task.

### IMPORTANT 2: OTLP service.name defaults to buzz-relay

Build a single shared Resource via service_resource() in telemetry.rs using ResourceBuilder::with_service_name(buzz-relay) followed by with_detector(EnvResourceDetector) so that OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES still win when set. Both SdkTracerProvider and SdkMeterProvider receive the same Resource so traces and metrics correlate under the same service identity in Datadog.

### IMPORTANT 3: Span topology, WS flow now produces one connected trace

Create explicit parent spans in handle_text_message() for EVENT (ws.event), REQ (ws.req), COUNT (ws.count), and AUTH (ws.auth) messages, each carrying conn_id. Spawned handler futures are wrapped with .instrument(span) so the tracing context is not dropped at the tokio::spawn boundary. handle_event() and handle_auth() now call Span::current().record() to populate the event_id and kind/conn_id fields declared in their #[instrument] attributes.

### NIT 4: target_info series suppressed for byte-parity

Add .without_target_info() to the Prometheus exporter builder so the new Resource (non-empty after fix 2) does not inject a target_info series that the old metrics-rs endpoint never emitted.

### NIT 5: BUZZ_POOL_METRICS_INTERVAL_SECS=0 no longer panics

Clamp interval_secs to >= 1. tokio::time::interval(Duration::ZERO) panics; a config typo of 0 would have silently killed the pool metrics task.

### CI: cargo fmt drift

Run cargo fmt --all to fix rustfmt line-wrapping across the migrated crate::metrics::metrics().<handle>.add(...) call sites.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
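The bind-before-spawn idea from IMPORTANT 1 can be illustrated with std threads in place of tokio. This is only a sketch under that substitution: `bind_metrics_listener` is a hypothetical name, and the serving loop drops each connection instead of writing Prometheus text.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread::{self, JoinHandle};

/// Bind synchronously so a port conflict is returned to the caller at startup,
/// before any background work has been spawned.
pub fn bind_metrics_listener(addr: SocketAddr) -> io::Result<TcpListener> {
    TcpListener::bind(addr)
}

/// Serve from an already-bound listener on a background thread.
/// Taking a listener (not a port) means this function cannot hit a bind error.
pub fn serve_prometheus(listener: TcpListener) -> JoinHandle<()> {
    thread::spawn(move || {
        for stream in listener.incoming() {
            // A real server would write the Prometheus text exposition here;
            // this sketch just closes the connection.
            drop(stream);
        }
    })
}
```

With this shape the caller can `expect` (or otherwise fail fast on) the bind result, so a conflict stops startup instead of disappearing inside a detached task.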

EnvResourceDetector reads OTEL_RESOURCE_ATTRIBUTES only, not OTEL_SERVICE_NAME (opentelemetry_sdk 0.32.1 resource/env.rs:23). SdkProvidedResourceDetector does read OTEL_SERVICE_NAME but always emits a service.name key, falling back to unknown_service:<exe> when unset, which would clobber the buzz-relay default.

Read OTEL_SERVICE_NAME explicitly: a non-empty value wins over the buzz-relay fallback. Because EnvResourceDetector is overlaid last, a service.name set in OTEL_RESOURCE_ATTRIBUTES still wins over OTEL_SERVICE_NAME. (The OpenTelemetry specification gives OTEL_SERVICE_NAME precedence in that case, so this ordering is an implementation choice rather than spec behaviour.)

Correct the module and function doc comments that claimed the SDK detector handled OTEL_SERVICE_NAME automatically.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
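The resolution order this commit describes (buzz-relay fallback, then a non-empty OTEL_SERVICE_NAME, then the OTEL_RESOURCE_ATTRIBUTES overlay applied last) might look roughly like the following. It is a std-only approximation: the real code builds an opentelemetry_sdk Resource, whereas this sketch uses a plain map, takes the env values as arguments, and skips the percent-decoding a real parser would do.

```rust
use std::collections::BTreeMap;

const DEFAULT_SERVICE_NAME: &str = "buzz-relay";

/// Parse a comma-separated `key=value` list, ignoring malformed pairs.
/// Simplified: no percent-decoding, unlike a full OTEL_RESOURCE_ATTRIBUTES parser.
fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (key, value) = pair.split_once('=')?;
            let (key, value) = (key.trim(), value.trim());
            if key.is_empty() {
                None
            } else {
                Some((key.to_string(), value.to_string()))
            }
        })
        .collect()
}

/// Build the attribute map in the order the commit message states:
/// 1. `service.name` from a non-empty OTEL_SERVICE_NAME, else the buzz-relay default;
/// 2. OTEL_RESOURCE_ATTRIBUTES overlaid last, so its keys overwrite earlier ones.
pub fn service_resource(
    service_name: Option<&str>,
    resource_attributes: Option<&str>,
) -> BTreeMap<String, String> {
    let mut attrs = BTreeMap::new();
    let name = service_name
        .filter(|s| !s.is_empty())
        .unwrap_or(DEFAULT_SERVICE_NAME);
    attrs.insert("service.name".to_string(), name.to_string());
    if let Some(raw) = resource_attributes {
        for (key, value) in parse_resource_attributes(raw) {
            attrs.insert(key, value);
        }
    }
    attrs
}
```

Taking the values as parameters instead of calling `std::env::var` inside keeps the function free of global state, which also sidesteps the env-mutation races the PR's ENV_LOCK test mutex exists to prevent.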

wpfleger96 · 2026-06-30T17:37:25Z

Exported metrics reference

All metric names, types, and label sets are preserved verbatim from the prior metrics-rs implementation, so existing Prometheus scrapers and Datadog dashboards need no changes. Names are emitted identically on both readers: the Prometheus /metrics text endpoint on :9102 (always on) and the OTLP push exporter (only when OTEL_EXPORTER_OTLP_ENDPOINT is set).

Naming note: the two HTTP framework metrics are intentionally unprefixed (http_*); every other relay metric carries the buzz_ prefix. The Prometheus exporter is configured .without_units() / .without_counter_suffixes() / .without_scope_info() / .without_target_info() to keep byte-parity with the old endpoint.

HTTP framework (`track_metrics` middleware)

Metric	Type	Labels	Description
`http_requests_total`	Counter	`code`, `caller`, `action`	HTTP requests served. `caller` from the Istio `x-envoy-downstream-service-cluster` header (validated, `unknown` fallback); `action` is the matched route pattern. Health/metrics/unmatched paths skipped to bound cardinality.
`http_request_latency_ms`	Histogram	`code`, `caller`, `action`	Request latency in ms. Explicit buckets: 5/10/25/50/100/250/500/1000/2500/5000/10000.

WebSocket connections

Metric	Type	Labels	Description
`buzz_ws_connections_total`	Counter	—	WebSocket connections accepted.
`buzz_ws_connections_active`	UpDownCounter	—	Currently-open WebSocket connections (incremented on register, decremented on close).
`buzz_ws_backpressure_disconnects_total`	Counter	—	Connections dropped because the client could not keep up with the send queue.
`buzz_ws_auth_timeouts_total`	Counter	—	Connections closed for failing to authenticate within the auth window.

Subscriptions

Metric	Type	Labels	Description
`buzz_subscriptions_active`	UpDownCounter	—	Currently-active REQ subscriptions across all connections.

Event ingest

Metric	Type	Labels	Description
`buzz_events_received_total`	Counter	`kind`	Events received over WS. `kind` is bounded to a known allow-list (else `other`) to prevent cardinality explosion.
`buzz_events_stored_total`	Counter	`kind`	Events successfully persisted.
`buzz_events_rejected_total`	Counter	`reason`	Events rejected, labeled by reason.
`buzz_event_processing_seconds`	Histogram	—	End-to-end event processing time. Buckets (s): 0.001/0.005/0.01/0.025/0.05/0.1/0.25/0.5/1/5.
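The bounded `kind` label can be sketched as a lookup against a fixed allow-list. The list contents below are purely hypothetical (the relay's actual allow-list is not shown in this PR), and `kind_label` is an illustrative name.

```rust
/// Hypothetical allow-list of event kinds; the relay's real list is not given in the PR.
const KNOWN_KINDS: &[u32] = &[0, 1, 3, 7];

/// Map an event kind to a metric label value.
/// Kinds outside the allow-list collapse to "other", so the number of distinct
/// label values is at most KNOWN_KINDS.len() + 1 no matter what clients send.
pub fn kind_label(kind: u32) -> String {
    if KNOWN_KINDS.contains(&kind) {
        kind.to_string()
    } else {
        "other".to_string()
    }
}
```

Because clients choose the kind, labelling with the raw value would let any client mint new time series; collapsing unknown kinds keeps the series count fixed.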

Fan-out / multi-node

Metric	Type	Labels	Description
`buzz_fanout_recipients`	Histogram	—	Number of recipients per fanned-out event. Integer-count buckets: 0/1/5/10/25/50/100/500/1000.
`buzz_multinode_fanout_total`	Counter	—	Cross-pod fan-out operations published to the pub/sub bus.
`buzz_multinode_fanout_lag_total`	Counter	—	Messages dropped because a pod's multi-node fan-out consumer lagged the broadcast channel.
`buzz_cache_invalidation_lag_total`	Counter	—	Cache-invalidation messages dropped because a pod's consumer lagged.

Auth

Metric	Type	Labels	Description
`buzz_auth_attempts_total`	Counter	`method`	NIP-42 auth attempts (`method=nip42`).
`buzz_auth_failures_total`	Counter	`reason`	Auth failures by reason (`allowlist_denied`, `not_relay_member`, `nip42_invalid`).

Media uploads

Metric	Type	Labels	Description
`buzz_media_uploads_total`	Counter	`mime`	Successful media uploads, labeled by MIME type.
`buzz_media_upload_rejections_total`	Counter	`reason`	Upload rejections (`rate_limit`, `concurrency`).

Workflows

Metric	Type	Labels	Description
`buzz_workflow_runs_total`	Counter	`trigger`	Workflow runs, labeled by trigger kind.

Audit log

Metric	Type	Labels	Description
`buzz_audit_log_seconds`	Histogram	—	Audit-log write latency. Buckets (s): same `DURATION_BUCKETS_S` as event processing.
`buzz_audit_log_errors_total`	Counter	—	Audit-log write failures.
`buzz_audit_send_errors_total`	Counter	—	Failures sending audit entries downstream.

Caches

Metric	Type	Labels	Description
`buzz_membership_cache_hits_total`	Counter	—	Membership-cache hits.
`buzz_membership_cache_misses_total`	Counter	—	Membership-cache misses.
`buzz_accessible_channels_cache_hits_total`	Counter	—	Accessible-channels-cache hits.
`buzz_accessible_channels_cache_misses_total`	Counter	—	Accessible-channels-cache misses.

COUNT fallback

Metric	Type	Labels	Description
`buzz_count_fallback_rejections_total`	Counter	—	COUNT queries rejected for requiring a too-broad fallback scan.

Connection-pool gauges (periodic, every `BUZZ_POOL_METRICS_INTERVAL_SECS`, default 10s)

Metric	Type	Labels	Description
`buzz_db_pool_size`	Gauge	—	Total Postgres pool connections.
`buzz_db_pool_idle`	Gauge	—	Idle Postgres pool connections.
`buzz_db_pool_active`	Gauge	—	In-use Postgres pool connections (`size - idle`).
`buzz_redis_pool_size`	Gauge	—	Current Redis pool connections.
`buzz_redis_pool_available`	Gauge	—	Available (idle) Redis pool connections.
`buzz_redis_pool_max`	Gauge	—	Configured Redis pool max.
`buzz_redis_pool_waiting`	Gauge	—	Callers waiting on a Redis connection.

Two buzz_search_index_* handles (_seconds histogram, _errors_total counter) are declared but currently have no emitters — the search path moved off the Typesense backend, so they register as zero-value series. Left in place to avoid touching unrelated code in this migration PR; can be pruned in a follow-up.

tlongwell-block · 2026-06-30T17:58:22Z

I did the cross-check Tyler asked for.

Confirmation / recommendation: I would keep Prometheus/OpenMetrics scraped by the Datadog Agent as the primary production metrics path, and treat the OTEL/OTLP path as opt-in / less paved for now.

What I found:

Datadog Agent can scrape Prometheus/OpenMetrics from Kubernetes pods. Block docs explicitly say the Datadog Agent has built-in OpenMetrics support, configured per pod/container via ad.datadoghq.com/${CONTAINER_NAME}.checks annotations, and that the Agent watches the Kubernetes API for annotated pods/containers to scrape. See: https://dev-guides.sqprod.co/cash/docs/platform/architecture/custom_metrics_collection
Afterpay/Block observability docs give the same Kubernetes pod-annotation flow, including openmetrics_endpoint, metrics, histogram_buckets_as_distributions, send_distribution_buckets, and send_monotonic_counter: https://dev-guides.sqprod.co/afterpay/docs/observability/how-to/ingest-prom-metrics
This PR preserves a Prometheus text endpoint on :9102/metrics and configures the OTEL Prometheus exporter with .without_units(), .without_counter_suffixes(), .without_scope_info(), and .without_target_info() to keep metric names/output compatible with the old metrics-rs/Prometheus endpoint. That supports the claim that existing Datadog OpenMetrics scraping should not need to change.
Internal guidance/search results point to OTEL/OTLP metrics as an active migration/trial area rather than the universally paved path. I found support for Tyler’s “finicky internally” read: collector/OTLP metrics work exists, but docs and internal references repeatedly steer production custom metrics toward Datadog OpenMetrics scraping; OTEL collector/SDK migration is still more situational. I would phrase it as “not abandoned, but less mature/paved internally than Prometheus/OpenMetrics → Datadog Agent.”
For tracing, the companion infra PR’s target endpoint matches existing in-cluster usage in block-coder-tf-stacks (cachew and blox-orchestrator already point at http://datadog-agent.datadog-agent.svc.cluster.local:4317). So using the Datadog Agent OTLP gRPC receiver for staging traces is directionally reasonable, but I’d avoid coupling metrics to OTLP unless there’s a specific reason.

Net: Will’s stated refactor direction — Prometheus/OpenMetrics for relay health/activity metrics, Datadog Agent scrape from pods, OTEL only where explicitly needed (e.g. traces behind OTEL_EXPORTER_OTLP_ENDPOINT) — matches the internal paved path better than a full OTEL metrics migration.

…e, keep OTLP tracing

Drop the OpenTelemetry metrics exporter (opentelemetry-prometheus, prometheus crate, OTLP metric push) in favour of the original metrics-rs facade (metrics::counter!/gauge!/histogram! macros) backed by metrics-exporter-prometheus. OTLP tracing (telemetry.rs, tracing-opentelemetry, OTEL trace spans in connection/event/auth handlers) is intentionally preserved; only the metrics path is reverted.

Changes:

- Cargo.toml: restore metrics + metrics-exporter-prometheus workspace deps; strip metrics/logs features from opentelemetry, opentelemetry_sdk, opentelemetry-otlp, tracing-opentelemetry; remove opentelemetry-prometheus and prometheus = "0.14"
- metrics.rs: restore PrometheusBuilder-based setup verbatim from origin/main
- All call sites: crate::metrics::metrics().<handle>.add/record -> metrics-rs macros (counter!/gauge!/histogram!) across connection, auth, event, count, state, subscription, bridge, media
- main.rs: fanout_lag + cache_inval_lag background consumers converted to metrics::counter!; pool gauge section (net-new, kept per spec) converted from relay_metrics::meter() OTEL API to metrics::gauge!.set(); meter_provider shutdown removed; telemetry comment updated to trace-only
- telemetry.rs: doc comment updated to state that Resource is for the trace provider only
- Cargo.lock: opentelemetry-prometheus and prometheus crates removed

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…p, telemetry tests, db pool max

Finding 1: change try_init_tracer to return TracerInit enum (Enabled/Disabled/ExporterBuildFailed) so the exporter-build failure warning is emitted by the caller after tracing_subscriber is installed, preventing the warn! from firing with no subscriber attached.

Finding 2: remove opentelemetry-semantic-conventions from root Cargo.toml and crates/buzz-relay/Cargo.toml. The crate had zero references in buzz-relay/src; confirmed dropped from Cargo.lock after build.

Finding 3: add #[cfg(test)] unit tests in telemetry.rs covering service_resource() (default, OTEL_SERVICE_NAME honored, empty string fallback) and try_init_tracer() (Disabled when endpoint unset, not-Disabled when endpoint set). ENV_LOCK Mutex serializes env-mutating tests to prevent cross-test races.

Finding 4: store max_connections on Db struct (set from DbConfig in new(), read from pool.options() in from_pool()); add max: u32 to DbPoolStats; emit buzz_db_pool_max gauge alongside existing db pool gauges in main.rs.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…sult helper

Extract Err->TracerInit classification into a private classify_exporter_result fn so the test can synthesize an ExporterBuildError directly rather than relying on tonic's URI-validation behaviour at build time (which may lazily accept bad URIs).

Rewrite test_try_init_tracer_exporter_build_failed_on_bad_endpoint as test_classify_exporter_result_maps_err_to_exporter_build_failed: constructs ExporterBuildError::InvalidUri directly, asserts matches!(result, TracerInit::ExporterBuildFailed(_)). No env vars are needed, so no lock is required.

Production behaviour unchanged: try_init_tracer delegates to the helper.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 marked this pull request as draft June 30, 2026 15:50

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 30, 2026 12:02

wpfleger96 marked this pull request as ready for review June 30, 2026 17:39

wpfleger96 changed the title ~~feat(relay): migrate telemetry to OpenTelemetry with OTLP and Prometheus export~~ feat(relay): add OpenTelemetry tracing, keep Prometheus metrics Jun 30, 2026

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits July 1, 2026 14:42

wpfleger96 merged commit b1d9d95 into main Jul 1, 2026
29 checks passed

wpfleger96 deleted the duncan/otel-migration branch July 1, 2026 20:06

wesbillman mentioned this pull request Jul 2, 2026

chore(release): release Buzz Desktop version 0.3.42 #1479

Open

wpfleger96 merged 6 commits into main from duncan/otel-migration

2 participants
