Grafana-to-GitHub alert bridge with metric classification #3224

Open
x3c41a wants to merge 2 commits into master from grafana-github-alert-bridge

@x3c41a
Contributor

@x3c41a x3c41a commented Feb 18, 2026

design doc

Summary

Cloudflare Worker that converts Grafana Alertmanager webhooks into categorized GitHub issues, triggering a Claude Code agent to diagnose and act on bridge alerts.

flowchart TD
    P["Prometheus :9615"] --> G["Grafana Alert Rules"]
    G --> AM["Alertmanager"]
    AM -->|webhook| W["Cloudflare Worker"]
    AM -->|webhook| M["Matrix"]

    W -->|"warning + known"| T["Issue: label claude"]
    W -->|"critical / unknown"| E["Issue: label claude-escalate"]

    T -->|Haiku| H["Fast Triage"]
    E -->|Sonnet| S["Deep Investigation"]

    H --> V["Engineer Reviews"]
    S --> V
  • Classifies 28 bridge alerts into 8 categories based on metric patterns
  • Detects environment and bridge pair from alert names
  • Tiered model: warning/known → Haiku, critical/unknown → Sonnet. Haiku handles ~90% of alerts at 1/10th the cost.
  • Deployed at https://grafana-github-bridge.parity-bridges.workers.dev
  • Located in deployments/local-scripts/grafana-github-bridge/
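
The tiered routing in the summary can be sketched as a single decision function. This is a minimal illustration, not the PR's implementation: the `ClassifiedAlert` shape and the `pickLabel` name are assumptions, and only the two labels and the "warning + known" versus "critical / unknown" rule come from the description above.

```typescript
// Hypothetical types and names; only the labels and the routing rule come from the PR summary.
type Severity = "warning" | "critical";
type Label = "claude" | "claude-escalate";

interface ClassifiedAlert {
  severity: Severity;
  // null means the classifier did not recognise the alert.
  category: string | null;
}

// Warning alerts with a known category go to fast triage; everything else escalates.
function pickLabel(alert: ClassifiedAlert): Label {
  const known = alert.category !== null;
  return alert.severity === "warning" && known ? "claude" : "claude-escalate";
}
```

Defaulting to `claude-escalate` means an alert the classifier does not recognise is never routed to the cheaper tier by accident.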

Grafana notification policy

- receiver: GitHub parity-bridges-common
  matchers:
    - alertname =~ ".*Bridge.*|.*bridge.*|.*headers mismatch"
  continue: true
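
A quick way to sanity-check the matcher locally is to replay it against sample alert names. The sketch below assumes the regex is fully anchored, which is how Alertmanager evaluates `=~` matchers; the alert names in the usage note are made up for illustration and are not taken from the real rule set.

```typescript
// The pattern is copied from the notification policy above.
const MATCHER_SOURCE = ".*Bridge.*|.*bridge.*|.*headers mismatch";

// Assumption: the matcher is anchored at both ends, as Alertmanager does for regex matchers.
const matcher = new RegExp(`^(?:${MATCHER_SOURCE})$`);

// Returns true when an alert with this name would be routed to the GitHub receiver.
function routedToGitHub(alertname: string): boolean {
  return matcher.test(alertname);
}
```

With anchoring, a hypothetical name such as `"Best headers mismatch"` matches, while `"headers mismatch detected"` does not, because the third alternative has no trailing `.*`.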

Test plan

  • All 8 alert categories correctly classified via local curl tests
  • Resolved alerts are skipped (only firing creates issues)
  • Environment detection works for both domain label and chain name fallback
  • Bridge pair extraction from alert title patterns
  • Deployed and responding at workers.dev URL
  • Set GITHUB_TOKEN secret and verify issue creation end-to-end
  • Verified no false positives: regex does not match any non-bridge alerts
  • Configure Grafana notification policy with updated matcher
  • Test e2e with real Grafana alert
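
The "resolved alerts are skipped" item can be illustrated with a small filter over the webhook payload. This is a sketch under assumptions: the interfaces model only the `status`, `labels`, and `annotations` fields of the Alertmanager webhook format, and `firingAlerts` is a hypothetical helper name rather than a function from this PR.

```typescript
// Subset of the Alertmanager webhook payload; other fields are omitted here.
interface WebhookAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

interface WebhookPayload {
  status: "firing" | "resolved";
  alerts: WebhookAlert[];
}

// Keep only alerts that are still firing; resolved alerts should not open issues.
function firingAlerts(payload: WebhookPayload): WebhookAlert[] {
  return payload.alerts.filter((alert) => alert.status === "firing");
}
```

Filtering per alert rather than on the top-level `status` matters because one grouped notification can carry both firing and resolved alerts.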

🤖 Generated with Claude Code

@x3c41a x3c41a requested a review from a team as a code owner February 18, 2026 14:35
@x3c41a x3c41a removed the request for review from a team February 18, 2026 14:39
@x3c41a x3c41a marked this pull request as draft February 18, 2026 14:40
Comment thread .github/workflows/claude-sre.yml Outdated
steps:
- uses: anthropics/claude-code-action@v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
Contributor Author

it should be a service account

Contributor Author

it might be hosted on Parity's public infra if any...

@cla-bot-2021

cla-bot-2021 Bot commented Feb 19, 2026

User @claude, please sign the CLA here.

@x3c41a x3c41a changed the title from "Add Grafana-to-GitHub alert bridge" to "Grafana-to-GitHub alert bridge with metric classification" Feb 19, 2026
Contributor Author

Not sure if that's (deployments/local-scripts/grafana-github-bridge/) the right place for this bridge but I could not find a better one

@x3c41a x3c41a force-pushed the grafana-github-alert-bridge branch 2 times, most recently from 0663195 to c10b545 February 26, 2026 13:33
Comment thread .github/workflows/claude-sre.yml Outdated
issues: write
env:
# Grafana API access for live metric queries.
# Claude can use: curl -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/..."
Contributor Author

These are set in GitHub's project settings

Comment thread .github/workflows/claude-sre.yml Outdated
Contributor Author

Extracted to a separate PR - #3236 - but may be merged as part of this PR

@x3c41a x3c41a marked this pull request as ready for review February 26, 2026 13:52
@x3c41a x3c41a force-pushed the grafana-github-alert-bridge branch 2 times, most recently from fc90b20 to ef89177 February 26, 2026 15:00
Cloudflare Worker that converts Grafana Alertmanager webhooks into
categorized GitHub issues, triggering a Claude Code agent (Haiku)
to diagnose and act on bridge alerts.

- Classifies 28 bridge alerts into 8 categories (finality-lag,
  delivery-lag, confirmation-lag, reward-lag, relay-down,
  version-guard, headers-mismatch, low-balance)
- Detects environment and bridge pair from alert names
- GitHub Action triggers on issue label:claude, uses Grafana API
- Located in deployments/local-scripts/grafana-github-bridge/
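
One way to map alerts onto the eight categories listed above is an ordered table of patterns where the first match wins. The category names come from the commit message; every regular expression below is an assumption made for illustration and is not the set of metric patterns the bridge actually uses.

```typescript
type Category =
  | "finality-lag"
  | "delivery-lag"
  | "confirmation-lag"
  | "reward-lag"
  | "relay-down"
  | "version-guard"
  | "headers-mismatch"
  | "low-balance";

// Hypothetical patterns; more specific rules are listed first because the first match wins.
const RULES: ReadonlyArray<[Category, RegExp]> = [
  ["headers-mismatch", /headers mismatch/i],
  ["version-guard", /version guard/i],
  ["low-balance", /balance/i],
  ["relay-down", /relay.*down/i],
  ["finality-lag", /finality/i],
  ["delivery-lag", /delivery/i],
  ["confirmation-lag", /confirmation/i],
  ["reward-lag", /reward/i],
];

// Returns the first matching category, or null for an unknown alert.
function classify(text: string): Category | null {
  for (const [category, pattern] of RULES) {
    if (pattern.test(text)) return category;
  }
  return null;
}
```

Returning `null` instead of a fallback category keeps "unknown" explicit, so an unrecognised alert can be escalated rather than silently grouped with a known one.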
@x3c41a x3c41a force-pushed the grafana-github-alert-bridge branch from ef89177 to 8b606c0 February 26, 2026 15:04
@x3c41a x3c41a force-pushed the grafana-github-alert-bridge branch from e530f92 to 37d2f22 February 27, 2026 12:38
@MN0B

MN0B commented Mar 3, 2026

Comments/questions (TL;DR: please don't do this before resolving my worries!)

I'm concerned by the lack of a human in the loop here. What does "diagnose and act" mean?

Let me check I understand:

The PR takes external data from Grafana/Prometheus -> Cloudflare -> GH Issue -> GH Action.

What prevents a prompt injection from Grafana/Prometheus? This could include info messages, error strings, and so on.

On the Claude side: which APIs are you using, where are the tokens stored, and what exact permissions are granted to the GH token?

How do you know Claude is going to do what you expect it to do? (It's non-deterministic.)

What access do the API tokens have? Have you used least privilege, as granular as possible, read-only by default?

@x3c41a
Contributor Author

x3c41a commented Mar 4, 2026

Hey @MN0B
thanks for looking into that. Let's resolve your questions one by one:

I'm concerned by the lack of a human in the loop here. What does "diagnose and act" mean?

A human is at the very last stage of Claude's investigation, either approving the result or guiding Claude on what needs to be fixed. "Diagnose and act" means getting to the root cause (diagnose) and fixing the issue or creating a PR (act).

A whole set of documents (runbooks, playbooks and SOPs) is being developed for that purpose. The docs will serve as guidance for both humans and agents.

The PR takes external data from Grafana/Prometheus -> Cloudflare -> GH Issue -> GH Action.

Yes, but why do you say the data is external if both the code and the Grafana alerts are owned by Parity?

What prevents a prompt injection from Grafana/Prometheus? This could include info messages, error strings, and so on.

System prompts, runbooks, and human reviewers.

On the Claude side: which APIs are you using, where are the tokens stored, and what exact permissions are granted to the GH token?

To be clear, I intentionally did not include the Claude SRE agent definition in this PR. That's a topic for a separate PR and a separate discussion. I don't want to mix things up.

which APIs are you using, where are the tokens stored

I use the Anthropic-maintained GH workflow. The token is stored in the corresponding GH repo where the workflow will run.
Go to Settings > Secrets and Variables and define anthropic_key there.

what exact permissions are granted to the GH token

Not yet confirmed, but I'd say it needs both read and write (to be able to send PRs). Writing and sending PRs is safe, since they won't be merged without human approval either way.

How do you know Claude is going to do what you expect it to do? (It's non-deterministic.)

By verifying its output (human reviewer).

What access do the API tokens have? Have you used least privilege, as granular as possible, read-only by default?

This topic is much broader than permissions alone. We need to equip Claude with Skills so that it can not only reason but also iterate and verify its hypotheses autonomously, while also limiting those Skills to reduce the blast radius.

A final thought about the security concerns: the points you're raising are all valid, but most of them are fixable with a system prompt and read-only Skills definitions (that's what I suggest in my design doc). I designed the Claude SRE agent deployment as a staged process (read first, resolve with human approval later). You can check out my design doc for details (implementation plan section).

x3c41a added a commit that referenced this pull request Mar 4, 2026
Convert from Cloudflare Worker to a plain Node.js HTTP server so it can
be deployed to any container infrastructure. No external dependencies —
uses only Node.js stdlib (node:http).

Adds Dockerfile, health endpoint, smoke tests, and deployment docs.

Addresses paritytech/devops#5019 feedback on PR #3224.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
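
The routing that a dependency-free server like the one described in this commit needs can be sketched as a pure function, which keeps it easy to smoke-test without opening a port. The `/health` and `/webhook` paths, the response bodies, and the type names are all assumptions; the commit message only states that a health endpoint exists.

```typescript
// Hypothetical request/response shapes, decoupled from node:http so they can be tested in isolation.
interface BridgeRequest {
  method: string;
  path: string;
}

interface BridgeResponse {
  status: number;
  body: string;
}

// Decide the response for a request; the paths are illustrative assumptions.
function route(req: BridgeRequest): BridgeResponse {
  if (req.method === "GET" && req.path === "/health") {
    return { status: 200, body: JSON.stringify({ status: "ok" }) };
  }
  if (req.path === "/webhook") {
    if (req.method !== "POST") {
      return { status: 405, body: "method not allowed" };
    }
    return { status: 202, body: "accepted" };
  }
  return { status: 404, body: "not found" };
}
```

A thin `createServer` callback from `node:http` could then translate the incoming request into a `BridgeRequest` and write the returned status and body.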
@MN0B

MN0B commented Mar 4, 2026

Thanks Andrii

For the points :

Human intervention

Yes, a human is at the end, but fatigue happens and sometimes people just wave PRs through. If Claude generated a subtle vulnerability, backdoor or unintended consequence, are you confident that every reviewer would spot it?

A much safer way would be for the AI to create an issue, which means the AI is only advising.

External vs internal data

Correct me if I'm wrong, but won't Grafana reflect public chain data (public RPC nodes and external chain states)? If so, a malicious XCM message, for example, could inject data here.

Telling the AI to be safe through runbooks and prompts

AIs are non-deterministic, and there are examples of this failing.

Read & Write access

AIs are non-deterministic, so you just can't trust them to do what you think they will. If it has write access, it could push to other branches, delete issues or exfiltrate stuff.

Here's what I propose :

Remove the Claude agent/workflow (claude-sre.yml).

Update the Node.js bridge to create a GitHub Issue with the diagnostic data and a link to the relevant static runbook.

Ensure the GitHub token for the service is limited to issues: write only.

This removes a lot of the risk. Once it's been "hardened" with real-life experience, we could be a little more relaxed with the Claude agent.

(one further thought I have is an AI/coded hybrid : use the AI to do messy things like summarise the logs but code common resolutions (even use Claude to write the code lol) - use the AI to deal with messy things but code to have certainty around risky things like code changes)
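
The advisory-only proposal above could look roughly like this on the bridge side: build an issue payload from the diagnostic data and point at a static runbook, with no agent involved. This is a sketch under assumptions. The `title`, `body`, and `labels` fields follow GitHub's REST "create an issue" request body, but `Diagnostic`, `buildIssue`, the label choice, and the runbook URL are hypothetical placeholders.

```typescript
// Hypothetical diagnostic record produced by the bridge.
interface Diagnostic {
  alertname: string;
  category: string;
  severity: string;
  summary: string;
}

// Fields accepted by GitHub's "create an issue" REST endpoint.
interface IssuePayload {
  title: string;
  body: string;
  labels: string[];
}

// Placeholder base URL; the real runbook location is not specified in this thread.
const RUNBOOK_BASE = "https://example.invalid/runbooks";

// Build an advisory issue: diagnostic data plus a link to the matching static runbook.
function buildIssue(d: Diagnostic): IssuePayload {
  const body = [
    `Alert: ${d.alertname}`,
    `Severity: ${d.severity}`,
    `Summary: ${d.summary}`,
    "",
    `Runbook: ${RUNBOOK_BASE}/${d.category}.md`,
  ].join("\n");
  return { title: `[${d.category}] ${d.alertname}`, body, labels: [d.category] };
}
```

Because the service only ever sends this payload to the issues endpoint, a token scoped to `issues: write` would be enough for it.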
