PMM-14689 Add centralized logging and tracing (OpenTelemetry + ClickHouse)#5449

Draft
ademidoff wants to merge 13 commits into v3 from PMM-14689-logs-and-traces-phase1

@ademidoff

Member

Context

PMM monitors infrastructure well but troubleshooting is weak: logs are exposed only as a point-in-time /logs.zip snapshot (one file per component, no real-time view, exporter logs truncated to a small ring buffer). This change moves logging/tracing to a central, queryable, real-time store.

Client and Server logs/traces flow through an OpenTelemetry Collector (contrib) on PMM Server into the existing ClickHouse using the OTel schema (pmm.logs / pmm.traces), queried via Grafana's grafana-clickhouse-datasource.

What's included

  • Collector (server): new supervised otelcol-contrib program plus a pmm-managed-rendered /etc/otelcol/config.yaml with a filelog receiver that tails /srv/logs/*.log (capturing all server components with no per-component change), loopback OTLP receivers, and a ClickHouse exporter with create_schema=false. An Ansible role installs the binary.
  • Schema + TTL owned by pmm-managed (managed/services/clickhouse): pmm.logs / pmm.traces migrations on a dedicated logs_schema_migrations table; ApplyTTL via ALTER TABLE … MODIFY TTL. qan-api2 is intentionally left unaware of logging/tracing.
  • Retention: new LogsRetention setting, applied instantly on ChangeSettings.
  • Client log shipping over the existing authenticated gRPC channel: new LogShipService/LogShipChannel (mirrors the RTA channel — no new port/auth). The supervisor's LogWriter ships pmm-agent's own logs and all controlled-agent (exporter/built-in) logs; the server forwards batches to the collector via OTLP/HTTP.
  • Database log-watcher agent (db-log-watcher-agent): tails configured DB log files (rotation-safe, path allowlist) and ships tagged lines. Exposed via pmm-admin add mysql --watch-logs --log-file=PATH.
  • Log-based alerting: alert templates gain a datasource: clickhouse option that builds a 3-node Grafana rule (ClickHouse SQL → reduce → threshold); ships a built-in mysql_log_down.yml.
  • Grafana: a dedicated ClickHouseLogs datasource for OTel logs/traces in Explore (the existing ClickHouse datasource is unchanged).
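
As an illustration of the retention item above, here is a minimal sketch of how a retention value could be turned into the ALTER TABLE … MODIFY TTL statement. The helper name `ttlStatement`, the whole-day granularity, and the `Timestamp` column are assumptions for the example (the column name follows the OTel ClickHouse exporter's usual schema and may differ in the pinned version); the PR's actual ApplyTTL may be written differently.

```go
package main

import (
	"fmt"
	"time"
)

// ttlStatement builds an ALTER TABLE ... MODIFY TTL statement.
// Hypothetical helper: it assumes the table has a `Timestamp` column and
// that retention is rounded down to whole days. The table name is
// interpolated directly, so it must come from a fixed list, never user input.
func ttlStatement(table string, retention time.Duration) (string, error) {
	days := int(retention / (24 * time.Hour))
	if days < 1 {
		return "", fmt.Errorf("retention %s is shorter than one day", retention)
	}
	return fmt.Sprintf(
		"ALTER TABLE %s MODIFY TTL toDateTime(Timestamp) + INTERVAL %d DAY",
		table, days,
	), nil
}

func main() {
	// 30 days is used here only as an example retention value.
	retention := 30 * 24 * time.Hour
	for _, table := range []string{"pmm.logs", "pmm.traces"} {
		stmt, err := ttlStatement(table, retention)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(stmt)
	}
}
```

In a real service the resulting statement would be executed with a context-aware call such as database/sql's ExecContext, once per table, whenever the setting changes.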

Tagging / data model

Client and DB logs are tagged with service.name, db.system, pmm.log_type, pmm.source=client, pmm.agent_id; server component logs with pmm.source=server. All land in the OTel pmm.logs schema; spans in pmm.traces.
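
The tagging scheme can be sketched as plain attribute maps. The function names and the sample values (`mysql-prod`, `error`, `agent-1`) are hypothetical and exist only for the example; the attribute keys are the ones listed above.

```go
package main

import "fmt"

// clientLogAttributes returns the attributes attached to client and DB log
// records, using the keys named in the PR description.
// Hypothetical helper, not the PR's actual code.
func clientLogAttributes(serviceName, dbSystem, logType, agentID string) map[string]string {
	return map[string]string{
		"service.name": serviceName,
		"db.system":    dbSystem,
		"pmm.log_type": logType,
		"pmm.source":   "client",
		"pmm.agent_id": agentID,
	}
}

// serverLogAttributes returns the attributes attached to server component logs.
func serverLogAttributes() map[string]string {
	return map[string]string{"pmm.source": "server"}
}

func main() {
	// Sample values are made up for illustration.
	fmt.Println(clientLogAttributes("mysql-prod", "mysql", "error", "agent-1"))
	fmt.Println(serverLogAttributes())
}
```

Keeping pmm.source on every record lets a single query over pmm.logs separate client and server lines without joining on anything else.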

Testing

  • Unit tests: supervisord config rendering, otelcol config, DB log-watcher (tail→ship + allowlist), ClickHouse alert template parsing.
  • All modules build (api, managed, agent, admin); touched packages vet clean.
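
For readers unfamiliar with the path allowlist covered by the DB log-watcher tests, the check could look roughly like the sketch below. `isAllowed` and the directory-based allowlist format are assumptions for illustration; the PR's real implementation may use a different shape, and this version does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isAllowed reports whether path, after cleaning, lies inside one of the
// allowed directories. Hypothetical helper: relative paths are rejected and
// symlinks are not resolved.
func isAllowed(path string, allowedDirs []string) bool {
	if !filepath.IsAbs(path) {
		return false
	}
	clean := filepath.Clean(path)
	for _, dir := range allowedDirs {
		rel, err := filepath.Rel(filepath.Clean(dir), clean)
		if err != nil {
			continue
		}
		// A relative path that climbs out of dir starts with "..".
		if rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			return true
		}
	}
	return false
}

func main() {
	// Example allowlist; the directory is made up for illustration.
	allowed := []string{"/var/log/mysql"}
	for _, p := range []string{
		"/var/log/mysql/error.log",
		"/var/log/mysql/../../../etc/passwd",
		"/var/log/mysqlx/error.log",
	} {
		fmt.Println(p, isAllowed(p, allowed))
	}
}
```

Comparing with filepath.Rel rather than a raw string prefix avoids treating /var/log/mysqlx as if it were inside /var/log/mysql.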

Caveats / follow-ups

  • otelcol-contrib version pin (Ansible role) must match the pmm.logs/pmm.traces DDL — the clickhouseexporter schema is version-sensitive.
  • The Grafana rule JSON (SQL + expression nodes) and the grafana-clickhouse-datasource otelEnabled mappings are plugin-version-sensitive and should be validated against the shipped Grafana.
  • Regenerated code was kept only for the changed proto packages; the full make gen (swagger clients, etc.) should run in CI. DB-backed tests and the live pipeline (collector → ClickHouse → Explore, alert firing) need a running stack.
  • Tracing stands up the pipeline/table/datasource; instrumenting PMM's own services with the OTel SDK is deferred.
  • Scoped follow-ups: other DB engines + log-path auto-detection for the watcher; surfacing the template datasource in the alerting API/UI.

🤖 Generated with Claude Code

…ouse)

Ship Client and Server logs/traces to a central, queryable store using an
OpenTelemetry Collector (contrib) on PMM Server and the existing ClickHouse,
queried via Grafana's grafana-clickhouse-datasource.

- Collector: new supervised otelcol-contrib program + pmm-managed-rendered
  /etc/otelcol/config.yaml (filelog of /srv/logs, loopback OTLP receivers,
  clickhouse exporter, create_schema=false). Ansible role installs the binary.
- Schema + TTL owned by pmm-managed (managed/services/clickhouse): pmm.logs /
  pmm.traces migrations on their own version table, ApplyTTL via ALTER TABLE.
  qan-api2 is left unaware of logging/tracing.
- New LogsRetention setting controls log/trace TTL, applied on ChangeSettings.
- Client log shipping over the existing authenticated gRPC channel: new
  LogShipService/LogShipChannel (mirrors RTAChannel), supervisor LogWriter ships
  pmm-agent's own logs and all controlled-agent logs; server forwards to the
  collector via OTLP/HTTP.
- Database log-watcher agent (db-log-watcher-agent): tails configured DB log
  files (rotation-safe, path allowlist) and ships tagged lines. Exposed via
  pmm-admin add mysql --watch-logs --log-file.
- Log-based alerting: alert templates gain a "clickhouse" datasource that builds
  a 3-node Grafana rule (SQL -> reduce -> threshold); built-in mysql_log_down.yml.
- Dedicated ClickHouseLogs Grafana datasource for OTel logs/traces in Explore;
  the existing ClickHouse datasource is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ademidoff ademidoff requested a review from a team as a code owner June 3, 2026 06:40
@ademidoff ademidoff requested review from JiriCtvrtka and maxkondr and removed request for a team June 3, 2026 06:40
@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 20.15267% with 523 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.90%. Comparing base (36137ea) to head (9be0bad).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| managed/services/clickhouse/clickhouse.go | 0.00% | 100 Missing ⚠️ |
| managed/services/logship/service.go | 0.00% | 71 Missing ⚠️ |
| managed/services/alerting/service.go | 0.00% | 68 Missing ⚠️ |
| agent/client/channel/logship_channel.go | 0.00% | 54 Missing ⚠️ |
| agent/agents/supervisor/supervisor.go | 15.90% | 35 Missing and 2 partials ⚠️ |
| agent/agents/logs/dbwatcher/dbwatcher.go | 64.70% | 26 Missing and 10 partials ⚠️ |
| agent/client/client.go | 12.50% | 34 Missing and 1 partial ⚠️ |
| managed/services/agents/dblogwatcher.go | 0.00% | 24 Missing ⚠️ |
| managed/services/converters.go | 0.00% | 15 Missing ⚠️ |
| managed/services/supervisord/otelcol_config.go | 50.00% | 13 Missing and 1 partial ⚠️ |

... and 12 more
Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5449      +/-   ##
==========================================
- Coverage   43.18%   42.90%   -0.28%     
==========================================
  Files         413      419       +6     
  Lines       42326    42946     +620     
==========================================
+ Hits        18280    18428     +148     
- Misses      22175    22605     +430     
- Partials     1871     1913      +42     
| Flag | Coverage Δ |
|---|---|
| admin | 34.69% <0.00%> (-0.02%) ⬇️ |
| agent | 48.56% <32.66%> (-0.60%) ⬇️ |
| managed | 42.09% <12.59%> (-0.25%) ⬇️ |
| vmproxy | 72.41% <ø> (ø) |


ademidoff and others added 5 commits June 3, 2026 10:02
- settings_helpers_test: include LogsRetention default (30d) in expected Settings.
- handler_test (TestCheckPortChanged): add the new log_watcher_options column to
  agentColumns and to each sqlmock AddRow so reform can scan the agent row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fix the golangci-lint findings reported on the PR for the new code:
- funcorder: order exported methods before unexported (dbwatcher, logship_channel).
- unparam: dbwatcher.New no longer returns an always-nil error.
- godot: capitalize toplevel doc-comment sentences (clickhouse, alerting).
- noctx: clickhouse ApplyTTL uses ExecContext (ctx threaded through callers).
- nolintlint: drop the unused //nolint:errcheck in alerting.
- depguard: use stdlib "errors"/fmt instead of github.com/pkg/errors in the new
  files (clickhouse, otelcol_config, dbwatcher).
- noinlineerr / embeddedstructfieldcheck / errorsastype: minor cleanups in the
  new files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TestAgentTypes iterates every AgentType enum value through
types.AgentTypeName, which panics on unmapped types. Add the
AGENT_TYPE_DB_LOG_WATCHER_AGENT constant and its display name so the new
agent type is handled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ademidoff ademidoff marked this pull request as draft June 3, 2026 10:51
ademidoff and others added 7 commits June 4, 2026 02:52
Rename the ClickHouse log/trace lifecycle dependency consistently:
- interface logsClickhouseService -> logService
- Server field / Params field / local var logsClickhouse / LogsClickhouse -> logService / LogService

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…qan-api2

After moving schema/TTL ownership to managed/services/clickhouse, the collector
config comment still credited qan-api2 for enforcing retention. Correct it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>