Grafana-to-GitHub alert bridge with metric classification by x3c41a · Pull Request #3224 · paritytech/parity-bridges-common

x3c41a · 2026-02-18T14:35:08Z

Summary

Cloudflare Worker that converts Grafana Alertmanager webhooks into categorized GitHub issues, triggering a Claude Code agent to diagnose and act on bridge alerts.

flowchart TD
    P["Prometheus :9615"] --> G["Grafana Alert Rules"]
    G --> AM["Alertmanager"]
    AM -->|webhook| W["Cloudflare Worker"]
    AM -->|webhook| M["Matrix"]

    W -->|"warning + known"| T["Issue: label claude"]
    W -->|"critical / unknown"| E["Issue: label claude-escalate"]

    T -->|Haiku| H["Fast Triage"]
    E -->|Sonnet| S["Deep Investigation"]

    H --> V["Engineer Reviews"]
    S --> V
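The label routing in the flowchart can be sketched roughly as follows. This is a minimal sketch, not the Worker's actual code: the function name `pickLabel` and its input shape are hypothetical, and only the two label names (`claude`, `claude-escalate`) come from the diagram.

```javascript
// Hypothetical helper: maps an alert's severity and category to an issue label.
// `category` is assumed to be null or "unknown" when classification failed.
function pickLabel(severity, category) {
  const known = category !== null && category !== undefined && category !== "unknown";
  if (severity === "warning" && known) {
    return "claude"; // fast triage tier
  }
  // Critical alerts, unknown categories and unexpected severities all escalate.
  return "claude-escalate";
}

console.log(pickLabel("warning", "finality-lag"));
console.log(pickLabel("critical", "finality-lag"));
```

Defaulting to `claude-escalate` means anything the sketch does not recognise lands in the deeper tier rather than the cheaper one.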

Classifies 28 bridge alerts into 8 categories based on metric patterns
Detects environment and bridge pair from alert names
Tiered model: warning/known → Haiku, critical/unknown → Sonnet. Haiku handles ~90% of alerts at 1/10th the cost.
Deployed at https://grafana-github-bridge.parity-bridges.workers.dev
Located in deployments/local-scripts/grafana-github-bridge/
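To illustrate the classification and detection steps listed above, here is a minimal sketch. The category names are the ones given in the commit message for this PR; every regex, the `domain` label semantics, the chain-to-environment mapping and the `A -> B` title format are assumptions made for illustration, not the rules the Worker actually uses.

```javascript
// Illustrative patterns only; the real metric patterns are not shown in this PR page.
const CATEGORY_PATTERNS = [
  ["headers-mismatch", /headers? mismatch/i],
  ["version-guard", /version/i],
  ["low-balance", /balance/i],
  ["relay-down", /relay.*(down|not running)/i],
  ["finality-lag", /finality/i],
  ["delivery-lag", /delivery/i],
  ["confirmation-lag", /confirmation/i],
  ["reward-lag", /reward/i],
];

// Returns the first matching category, or "unknown" if nothing matches.
function classify(text) {
  for (const [category, pattern] of CATEGORY_PATTERNS) {
    if (pattern.test(text)) return category;
  }
  return "unknown";
}

// Tries the (assumed) `domain` label first, then falls back to chain names in the title.
function detectEnvironment(labels, title) {
  const domain = String((labels && labels.domain) || "").toLowerCase();
  if (domain.includes("testnet")) return "testnet";
  if (/rococo|westend/i.test(title)) return "testnet";
  if (/polkadot|kusama/i.test(title)) return "production";
  return "unknown";
}

// Assumes titles contain the pair as "Source -> Target".
function extractBridgePair(title) {
  const match = title.match(/(\w+)\s*->\s*(\w+)/);
  return match ? `${match[1]}-${match[2]}`.toLowerCase() : null;
}

const title = "Rococo -> Westend finality lag";
console.log(classify(title), detectEnvironment({}, title), extractBridgePair(title));
```

Pattern order matters in this sketch: the more specific categories are checked first so that a broad pattern does not shadow them.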

Grafana notification policy

- receiver: GitHub parity-bridges-common
  matchers:
    - alertname =~ ".*Bridge.*|.*bridge.*|.*headers mismatch"
  continue: true
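One way to sanity-check the matcher locally is to run candidate alert names through the same expression. The sketch below assumes the matcher is fully anchored, as Alertmanager documents for regex matchers; the helper name and the sample alert names are made up for illustration.

```javascript
// The matcher string is copied from the notification policy above.
const MATCHER = ".*Bridge.*|.*bridge.*|.*headers mismatch";

// Emulate anchored matching by wrapping the expression in ^(?:...)$.
const anchored = new RegExp(`^(?:${MATCHER})$`);

// Hypothetical helper: true if an alert with this name would be routed to the receiver.
function matchesPolicy(alertname) {
  return anchored.test(alertname);
}

// Sample names are invented, not taken from the real alert rules.
for (const name of ["BridgeHub finality lag", "Rococo headers mismatch", "High CPU usage"]) {
  console.log(name, matchesPolicy(name));
}
```

Under the anchoring assumption, the third alternative only matches names that end in "headers mismatch", so a name with trailing text after it would not be routed unless it also contains "Bridge" or "bridge".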

Test plan

All 8 alert categories correctly classified via local curl tests
Resolved alerts are skipped (only firing creates issues)
Environment detection works for both domain label and chain name fallback
Bridge pair extraction from alert title patterns
Deployed and responding at workers.dev URL
Set GITHUB_TOKEN secret and verify issue creation end-to-end
Verified no false positives: regex does not match any non-bridge alerts
Configure Grafana notification policy with updated matcher
Test e2e with real Grafana alert
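The "resolved alerts are skipped" item can be pictured with a small filter over the webhook payload. This is a sketch that assumes the standard Alertmanager webhook shape (an `alerts` array whose entries carry a `status` of `firing` or `resolved`); the function name and sample payload are hypothetical.

```javascript
// Hypothetical helper: keeps only alerts that are currently firing.
function firingAlerts(payload) {
  const alerts = Array.isArray(payload && payload.alerts) ? payload.alerts : [];
  return alerts.filter((alert) => alert && alert.status === "firing");
}

// Invented sample payload in the assumed webhook shape.
const samplePayload = {
  status: "firing",
  alerts: [
    { status: "firing", labels: { alertname: "Bridge finality lag" } },
    { status: "resolved", labels: { alertname: "Bridge relay down" } },
  ],
};

console.log(firingAlerts(samplePayload).map((a) => a.labels.alertname));
```

Filtering per alert rather than on the top-level `status` matters because a single webhook can carry a mix of firing and resolved alerts.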

🤖 Generated with Claude Code

x3c41a · 2026-02-18T15:02:00Z

+    steps:
+      - uses: anthropics/claude-code-action@v1
+        with:
+          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}


it should be a service account

x3c41a · 2026-02-18T15:02:34Z

it might be hosted on Parity's public infra if any...

cla-bot-2021 · 2026-02-19T08:29:22Z

User @claude, please sign the CLA here.

x3c41a · 2026-02-26T13:13:51Z

Not sure if deployments/local-scripts/grafana-github-bridge/ is the right place for this bridge, but I could not find a better one.

x3c41a · 2026-02-26T13:34:10Z

+      issues: write
+    env:
+      # Grafana API access for live metric queries.
+      # Claude can use: curl -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/..."


These are set in GitHub's project settings

x3c41a · 2026-02-26T13:34:37Z

Extracted to a separate PR - #3236 - but may be merged as part of this PR

Cloudflare Worker that converts Grafana Alertmanager webhooks into categorized GitHub issues, triggering a Claude Code agent (Haiku) to diagnose and act on bridge alerts.
- Classifies 28 bridge alerts into 8 categories (finality-lag, delivery-lag, confirmation-lag, reward-lag, relay-down, version-guard, headers-mismatch, low-balance)
- Detects environment and bridge pair from alert names
- GitHub Action triggers on issue label:claude, uses Grafana API
- Located in deployments/local-scripts/grafana-github-bridge/

MN0B · 2026-03-03T16:19:32Z

Comments/questions (TL;DR: please don't do this before resolving my worries!)

I'm concerned by the lack of a human in the loop here. What does "diagnose and act" mean?

Let me check I understand:

The PR takes external data from Grafana/Prometheus -> Cloudflare -> GH Issue -> GH Action.

What prevents a prompt injection from Grafana/Prometheus? This could include info messages, error strings, and so on.

On the Claude side: which APIs are you using, where are the tokens stored, and what exact permissions are granted to the GH token?

How do you know Claude is going to do what you expect it to do? (It's non-deterministic.)

What access do the API tokens have? Have you applied least privilege: as granular as possible, read-only by default?

x3c41a · 2026-03-04T06:47:00Z

Hey @MN0B
thanks for looking into that. Let's resolve your questions one by one:

> I'm concerned by a lack of human in the loop here - what does "diagnose and act" mean?

A human is at the very last stage of Claude's investigation, either approving or guiding Claude on what needs to be fixed. "Diagnose and act" means getting to the root cause (diagnose) and fixing the issue or creating a PR (act).

There is a whole set of documents (runbooks, playbooks and SOPs) being developed for that purpose. The docs will serve as guidance for both humans and agents.

> PR takes external data from Grafana/Prometheus -> Cloudflare -> GH Issue -> GH action

Yes, but why do you say the data is external if both the code and the Grafana alerts are owned by Parity?

> What prevents a prompt injection from grafana/prometheus? - this could include info messages, error strings ....

System prompts, runbooks, human reviewers.

> On the Claude side - what APIs are you using and where are the tokens stored and what are the exact permissions granted to the GH token?

To be clear, I intentionally do not have the Claude SRE agent definition in this PR. That's a topic for a separate PR and a separate discussion. I don't want to mix things up.

> what APIs are you using and where are the tokens stored

I use the Anthropic-maintained GH workflow. The token is stored in the GH repo where the workflow will run.
Go to Settings > Secrets and Variables and define anthropic_key there.

> what are the exact permissions granted to the GH token

Not yet confirmed, but I'd say it needs both read and write (to be able to open PRs). Writing and opening PRs is safe; they won't be merged without human approval either way.

> How do you know Claude is going to do what you expect it to do? (its non deterministic)

By verifying its output (human reviewer).

> What are the accesses for the API tokens used - have you used least privilege, granular as possible, read only by default?

This topic is way broader than permissions alone. We need to equip Claude with Skills so that it can not only reason but also iterate and verify its hypotheses autonomously, while also limiting those Skills to reduce the blast radius.

A final thought about the security concerns: the points you're raising are all valid, but most of them are fixable with a system prompt and read-only Skills definitions (that's what I suggest in my design doc). I designed the Claude SRE agent deployment as a staged process (read first, resolve with human approval later). You can check out my design doc for details (implementation plan section).

Convert from Cloudflare Worker to a plain Node.js HTTP server so it can be deployed to any container infrastructure. No external dependencies — uses only Node.js stdlib (node:http). Adds Dockerfile, health endpoint, smoke tests, and deployment docs. Addresses paritytech/devops#5019 feedback on PR #3224. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
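For readers unfamiliar with the shape of such a server, a stdlib-only bridge along the lines of this commit message could look roughly like the sketch below. It is not the code from the commit: the `/health` and `/webhook` paths, the status codes and the function names are assumptions.

```javascript
const http = require("node:http");

// Pure routing function so the behaviour can be checked without opening a socket.
// Paths and status codes are assumed for illustration.
function route(method, path) {
  if (method === "GET" && path === "/health") return { status: 200, body: "ok" };
  if (method === "POST" && path === "/webhook") return { status: 202, body: "accepted" };
  return { status: 404, body: "not found" };
}

// Builds the server but does not start listening; the caller decides the port.
function createBridgeServer() {
  return http.createServer((req, res) => {
    // Strip any query string before routing.
    const path = new URL(req.url, "http://localhost").pathname;
    const { status, body } = route(req.method, path);
    res.writeHead(status, { "Content-Type": "text/plain" });
    res.end(body);
  });
}

console.log(route("GET", "/health"));
```

Starting it would then be a matter of calling `createBridgeServer().listen(port)` with whatever port the container exposes; a real handler would also read and parse the POST body before responding.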

MN0B · 2026-03-04T13:45:54Z

Thanks Andrii

On the points:

Human intervention

Yes, a human is at the end, but fatigue happens and sometimes people just wave PRs through. If Claude generated a subtle vulnerability, back door or unintended consequence, are you confident that every reviewer would spot it?

A much safer way would be for the AI to create an issue; that means the AI is advising.

External vs internal data

Keep me right here, but Grafana will reflect public chain data (public RPC nodes and external chain states)? So a malicious XCM message, for example, could inject data here.

Telling the AI to be safe through runbooks and prompts

AIs are non-deterministic; there are examples of this failing.

Read & Write access

AIs are non-deterministic; you just can't trust them to do what you think they will. If it has write access, it could push to other branches, delete issues or exfiltrate stuff.

Here's what I propose:

Remove the Claude agent/workflow (claude-sre.yml).

Update the Node.js bridge to create a GitHub Issue with the diagnostic data and a link to the relevant static runbook.

Ensure the GitHub token for the service is limited to issues: write only.

This removes a lot of the risk. Once it's been "hardened" with real-life experience, we could be a little more relaxed with the Claude agent.

(One further thought I have is an AI/coded hybrid: use the AI to do messy things like summarise the logs, but code the common resolutions (even use Claude to write the code lol). Use the AI to deal with messy things, but code to have certainty around risky things like code changes.)
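As an illustration of the issue-only variant proposed above, the bridge could build a request for the GitHub REST "create an issue" endpoint without involving any agent. This is a sketch under stated assumptions: the runbook URL, the `bridge-alert` label, the alert field names and the function name are all hypothetical, and the request is only constructed, not sent.

```javascript
// Three backticks, built at runtime so alert text can be shown as a literal block.
const FENCE = "`".repeat(3);

// Hypothetical category-to-runbook map; the URL is a placeholder.
const RUNBOOKS = {
  "finality-lag": "https://example.invalid/runbooks/finality-lag",
};

// Builds (but does not send) a create-issue request for `repo` ("owner/name").
function buildIssueRequest(repo, alert, category) {
  const runbook = RUNBOOKS[category] || "(no runbook mapped)";
  // Keep alert text from closing the code block it is rendered in.
  const description = String(alert.description || "").split(FENCE).join("'''");
  const body = [
    `Category: ${category}`,
    `Runbook: ${runbook}`,
    "",
    "Alert text (untrusted, shown verbatim):",
    FENCE,
    description,
    FENCE,
  ].join("\n");
  return {
    method: "POST",
    url: `https://api.github.com/repos/${repo}/issues`,
    json: { title: `[${category}] ${alert.name}`, body, labels: ["bridge-alert", category] },
  };
}

const request = buildIssueRequest(
  "paritytech/parity-bridges-common",
  { name: "Bridge finality lag", description: "lag above threshold" },
  "finality-lag"
);
console.log(request.url, request.json.title);
```

A token scoped to `issues: write` is enough for a request of this shape. Rendering the alert text inside a code block only keeps it from being interpreted as markdown in the issue; it does not by itself address prompt injection if an agent later reads the issue.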

x3c41a requested a review from a team as a code owner February 18, 2026 14:35

x3c41a removed the request for review from a team February 18, 2026 14:39

x3c41a marked this pull request as draft February 18, 2026 14:40

x3c41a changed the title ~~Add Grafana-to-GitHub alert bridge~~ Grafana-to-GitHub alert bridge with metric classification Feb 19, 2026

x3c41a force-pushed the grafana-github-alert-bridge branch 2 times, most recently from 0663195 to c10b545 Compare February 26, 2026 13:33

x3c41a marked this pull request as ready for review February 26, 2026 13:52

x3c41a requested review from bkontur, franciscoaguirre and karolk91 February 26, 2026 14:10

x3c41a force-pushed the grafana-github-alert-bridge branch 2 times, most recently from fc90b20 to ef89177 Compare February 26, 2026 15:00

x3c41a force-pushed the grafana-github-alert-bridge branch from ef89177 to 8b606c0 Compare February 26, 2026 15:04

Remove claude-sre.yml — will be merged separately

37d2f22

x3c41a force-pushed the grafana-github-alert-bridge branch from e530f92 to 37d2f22 Compare February 27, 2026 12:38

x3c41a mentioned this pull request Mar 4, 2026 in "Grafana-GitHub bridge: add Dockerfile and convert to Node.js" #3249 (Open, 3 tasks)

Grafana-to-GitHub alert bridge with metric classification #3224: x3c41a wants to merge 2 commits into master from grafana-github-alert-bridge
