PMM-15145 Fix DB down alerts auto-resolving when agent is down#5492

Open
theTibi wants to merge 1 commit into v3 from PMM-15145

@theTibi theTibi commented Jun 11, 2026

Problem

The MongoDB, PostgreSQL, Valkey, and Redis down alerts share the flaw fixed for
MySQL in PMM-14193: when the DB and its agent/exporter go down together, the
*_up metric disappears, the expression has no series to evaluate, and the alert
silently resolves at exactly the moment it should fire.
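
To make the failure mode concrete, here is a minimal Python simulation. It is not the actual PromQL: it assumes the old rule fired for any series whose *_up value was 0, and the service IDs and the `old_down_alerts` helper are hypothetical.

```python
# Hypothetical samples: {service_id: latest value of the *_up metric}.
def old_down_alerts(up_samples):
    """Mimic a rule that fires only for series whose value is 0."""
    return {sid for sid, value in up_samples.items() if value == 0}


# Exporter alive, database down: the series exists with value 0.
db_down_only = {"svc-1": 0, "svc-2": 1}

# Database and exporter both down: svc-1 reports no sample at all.
db_and_exporter_down = {"svc-2": 1}

print(old_down_alerts(db_down_only))          # {'svc-1'}
print(old_down_alerts(db_and_exporter_down))  # set()
```

In the second case there is nothing left to compare against 0, so the rule returns an empty result and the alert for svc-1 resolves.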

Fix

Anchor the expression on pmm_managed_inventory_agents, which is emitted by
pmm-managed and is therefore always present. Any enabled service that is not
currently reporting *_up == 1 (down or missing) stays at value 1 → Alerting; a
baseline of 0 keeps healthy services Normal. Keying on service_id also fixes the
old on(node_name) join for multi-instance nodes.
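
The evaluation described above can be sketched in Python. This is a simulation of the logic only, not the PromQL from the PR: `enabled_services` stands in for the services listed by pmm_managed_inventory_agents, and the function name and sample data are hypothetical.

```python
def down_alert_values(enabled_services, up_samples):
    """Return {service_id: 1 or 0}; 1 means Alerting, 0 means Normal.

    enabled_services: service IDs known to the inventory (the anchor).
    up_samples: {service_id: latest *_up value}; a missing key means
                the exporter reported nothing for that service.
    """
    values = {}
    for sid in enabled_services:
        # A missing sample is treated the same as any value other than 1.
        values[sid] = 0 if up_samples.get(sid) == 1 else 1
    return values


enabled = ["svc-1", "svc-2", "svc-3"]
up = {"svc-1": 1, "svc-2": 0}  # svc-3: exporter is down, no sample

print(down_alert_values(enabled, up))
# {'svc-1': 0, 'svc-2': 1, 'svc-3': 1}
```

Because the loop iterates over the inventory rather than over the reported samples, a service with no sample still produces a value. Keying by service ID also keeps two services on the same node separate.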

Updated: mongodb_down, postgresql_down, valkey_down, redis_down — same
expression as PMM-14193, only metric/agent_type/labels differ.

Not changed — agent_down: its source pmm_managed_inventory_agents{agent_type="pmm-agent"}
flips to 0 (doesn't vanish) when the host dies, so it has no missing-series issue.

Note: redis_down and valkey_down are identical: both use redis_up and
valkey_exporter, and there is no service_type label to tell them apart. This is
pre-existing behavior and is preserved here.

Ref: PMM-14193

Ticket number: PMM-0

Feature build: SUBMODULES-0

… services to improve down detection logic

Signed-off-by: theTibi <tkorocz@gmail.com>
@theTibi theTibi requested a review from a team as a code owner June 11, 2026 12:36
@theTibi theTibi requested review from 4nte, JiriCtvrtka and ademidoff and removed request for a team June 11, 2026 12:36
codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.54%. Comparing base (0e531ee) to head (a8e5112).

Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #5492      +/-   ##
==========================================
- Coverage   43.62%   43.54%   -0.08%     
==========================================
  Files         413      413              
  Lines       42401    42928     +527     
==========================================
+ Hits        18498    18694     +196     
- Misses      22144    22362     +218     
- Partials     1759     1872     +113     
Flag Coverage Δ
managed 42.86% <ø> (-0.12%) ⬇️
