PMM-14689 Add centralized logging and tracing (OpenTelemetry + ClickHouse) #5449
Draft
ademidoff wants to merge 13 commits into
…ouse)

Ship Client and Server logs/traces to a central, queryable store using an OpenTelemetry Collector (contrib) on PMM Server and the existing ClickHouse, queried via Grafana's grafana-clickhouse-datasource.

- Collector: new supervised otelcol-contrib program + pmm-managed-rendered /etc/otelcol/config.yaml (filelog of /srv/logs, loopback OTLP receivers, clickhouse exporter, create_schema=false). Ansible role installs the binary.
- Schema + TTL owned by pmm-managed (managed/services/clickhouse): pmm.logs / pmm.traces migrations on their own version table, ApplyTTL via ALTER TABLE. qan-api2 is left unaware of logging/tracing.
- New LogsRetention setting controls log/trace TTL, applied on ChangeSettings.
- Client log shipping over the existing authenticated gRPC channel: new LogShipService/LogShipChannel (mirrors RTAChannel), supervisor LogWriter ships pmm-agent's own logs and all controlled-agent logs; server forwards to the collector via OTLP/HTTP.
- Database log-watcher agent (db-log-watcher-agent): tails configured DB log files (rotation-safe, path allowlist) and ships tagged lines. Exposed via pmm-admin add mysql --watch-logs --log-file.
- Log-based alerting: alert templates gain a "clickhouse" datasource that builds a 3-node Grafana rule (SQL -> reduce -> threshold); built-in mysql_log_down.yml.
- Dedicated ClickHouseLogs Grafana datasource for OTel logs/traces in Explore; the existing ClickHouse datasource is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
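The "path allowlist" mentioned for the database log-watcher agent could look roughly like the sketch below. This is an illustration only, not the PR's code: the helper name `isAllowedLogPath` and the example directories are hypothetical, and the check is purely lexical (it does not resolve symlinks, which a real implementation would probably need to consider).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isAllowedLogPath reports whether path is an absolute path that, once
// cleaned, points to something strictly inside one of allowedDirs.
// Hypothetical helper: the name and semantics are assumptions.
func isAllowedLogPath(path string, allowedDirs []string) bool {
	if !filepath.IsAbs(path) {
		return false
	}
	clean := filepath.Clean(path) // collapses "." and ".." segments
	for _, dir := range allowedDirs {
		rel, err := filepath.Rel(filepath.Clean(dir), clean)
		if err != nil {
			continue
		}
		// Reject the directory itself and anything that escapes it.
		escapes := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
		if rel != "." && !escapes {
			return true
		}
	}
	return false
}

func main() {
	// Example allowlist; these directories are illustrative only.
	allowed := []string{"/var/log/mysql", "/var/lib/mysql"}
	candidates := []string{
		"/var/log/mysql/error.log",
		"/var/log/mysql/../../../etc/passwd",
		"/var/log/mysqlx/error.log",
		"error.log",
	}
	for _, p := range candidates {
		fmt.Printf("%-40s allowed=%v\n", p, isAllowedLogPath(p, allowed))
	}
}
```

Comparing the cleaned path with `filepath.Rel` rather than a plain string prefix avoids accepting sibling directories such as `/var/log/mysqlx`.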
Codecov Report
@@ Coverage Diff @@
## v3 #5449 +/- ##
==========================================
- Coverage 43.18% 42.90% -0.28%
==========================================
Files 413 419 +6
Lines 42326 42946 +620
==========================================
+ Hits 18280 18428 +148
- Misses 22175 22605 +430
- Partials 1871 1913 +42
- settings_helpers_test: include LogsRetention default (30d) in expected Settings.
- handler_test (TestCheckPortChanged): add the new log_watcher_options column to agentColumns and to each sqlmock AddRow so reform can scan the agent row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fix the golangci-lint findings reported on the PR for the new code:

- funcorder: order exported methods before unexported (dbwatcher, logship_channel).
- unparam: dbwatcher.New no longer returns an always-nil error.
- godot: capitalize toplevel doc-comment sentences (clickhouse, alerting).
- noctx: clickhouse ApplyTTL uses ExecContext (ctx threaded through callers).
- nolintlint: drop the unused //nolint:errcheck in alerting.
- depguard: use stdlib "errors"/fmt instead of github.com/pkg/errors in the new files (clickhouse, otelcol_config, dbwatcher).
- noinlineerr / embeddedstructfieldcheck / errorsastype: minor cleanups in the new files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TestAgentTypes iterates every AgentType enum value through types.AgentTypeName, which panics on unmapped types. Add the AGENT_TYPE_DB_LOG_WATCHER_AGENT constant and its display name so the new agent type is handled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rename the ClickHouse log/trace lifecycle dependency consistently:

- interface logsClickhouseService -> logService
- Server field / Params field / local var logsClickhouse / LogsClickhouse -> logService / LogService

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…qan-api2

After moving schema/TTL ownership to managed/services/clickhouse, the collector config comment still credited qan-api2 for enforcing retention. Correct it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Context
PMM monitors infrastructure well, but troubleshooting is weak: logs are exposed only as a point-in-time `/logs.zip` snapshot (one file per component, no real-time view, exporter logs truncated to a small ring buffer). This change moves logging/tracing to a central, queryable, real-time store.

Client and Server logs/traces flow through an OpenTelemetry Collector (contrib) on PMM Server into the existing ClickHouse using the OTel schema (`pmm.logs` / `pmm.traces`), queried via Grafana's `grafana-clickhouse-datasource`.

What's included
- Collector: new supervised `otelcol-contrib` program + a pmm-managed-rendered `/etc/otelcol/config.yaml`: the `filelog` receiver tails `/srv/logs/*.log` (captures all server components with no per-component change), loopback OTLP receivers, ClickHouse exporter with `create_schema=false`. Ansible role installs the binary.
- Schema + TTL owned by pmm-managed (`managed/services/clickhouse`): `pmm.logs` / `pmm.traces` migrations on a dedicated `logs_schema_migrations` table; `ApplyTTL` via `ALTER TABLE … MODIFY TTL`. qan-api2 is intentionally left unaware of logging/tracing.
- New `LogsRetention` setting (controls log/trace TTL), applied instantly on `ChangeSettings`.
- Client log shipping over the existing authenticated gRPC channel: new `LogShipService` / `LogShipChannel` (mirrors the RTA channel; no new port/auth). The supervisor's `LogWriter` ships pmm-agent's own logs and all controlled-agent (exporter/built-in) logs; the server forwards batches to the collector via OTLP/HTTP.
- Database log-watcher agent (`db-log-watcher-agent`): tails configured DB log files (rotation-safe, path allowlist) and ships tagged lines. Exposed via `pmm-admin add mysql --watch-logs --log-file=PATH`.
- Log-based alerting: alert templates gain a `datasource: clickhouse` option that builds a 3-node Grafana rule (ClickHouse SQL → `reduce` → `threshold`); ships a built-in `mysql_log_down.yml`.
- Dedicated `ClickHouseLogs` datasource for OTel logs/traces in Explore (the existing `ClickHouse` datasource is unchanged).

Tagging / data model
Client and DB logs are tagged with `service.name`, `db.system`, `pmm.log_type`, `pmm.source=client`, `pmm.agent_id`; server component logs with `pmm.source=server`. All land in the OTel `pmm.logs` schema; spans in `pmm.traces`.

Testing
- … (`api`, `managed`, `agent`, `admin`); touched packages vet clean.

Caveats / follow-ups
- … `pmm.logs` / `pmm.traces` DDL: the `clickhouseexporter` schema is version-sensitive.
- The `grafana-clickhouse-datasource` `otelEnabled` mappings are plugin-version-sensitive and should be validated against the shipped Grafana.
- `make gen` (swagger clients, etc.) should run in CI. DB-backed tests and the live pipeline (collector → ClickHouse → Explore, alert firing) need a running stack.
- … `datasource` in the alerting API/UI.

🤖 Generated with Claude Code
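The attribute scheme from the "Tagging / data model" section can be sketched as two small constructors. This is only an illustration under assumptions: the function names and the sample values (`mysql-prod`, `error`, `agent-1`) are hypothetical, and only the attribute keys and the `pmm.source` values come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// clientLogAttrs returns the attributes attached to client and DB log lines.
// Keys follow the PR description; the function itself is hypothetical.
func clientLogAttrs(serviceName, dbSystem, logType, agentID string) map[string]string {
	return map[string]string{
		"service.name": serviceName,
		"db.system":    dbSystem,
		"pmm.log_type": logType,
		"pmm.source":   "client",
		"pmm.agent_id": agentID,
	}
}

// serverLogAttrs returns the attribute attached to server component logs.
func serverLogAttrs() map[string]string {
	return map[string]string{"pmm.source": "server"}
}

// printSorted prints attributes in key order so the output is deterministic.
func printSorted(attrs map[string]string) {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, attrs[k])
	}
}

func main() {
	// Sample values are made up for the example.
	printSorted(clientLogAttrs("mysql-prod", "mysql", "error", "agent-1"))
	fmt.Println("---")
	printSorted(serverLogAttrs())
}
```

Keeping `pmm.source` fixed inside each constructor, rather than passing it as an argument, makes it harder for a caller to mislabel a client line as a server one.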