Grafana-to-GitHub alert bridge with metric classification #3224
    steps:
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
it should be a service account
it might be hosted on Parity's public infra if any...
Not sure if that's (deployments/local-scripts/grafana-github-bridge/) the right place for this bridge but I could not find a better one
Force-pushed from 0663195 to c10b545.
      issues: write
    env:
      # Grafana API access for live metric queries.
      # Claude can use: curl -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/..."
These are set in GitHub's project settings.
Extracted to a separate PR - #3236 - but may be merged as part of this PR
Force-pushed from fc90b20 to ef89177.
Cloudflare Worker that converts Grafana Alertmanager webhooks into categorized GitHub issues, triggering a Claude Code agent (Haiku) to diagnose and act on bridge alerts.

- Classifies 28 bridge alerts into 8 categories (finality-lag, delivery-lag, confirmation-lag, reward-lag, relay-down, version-guard, headers-mismatch, low-balance)
- Detects environment and bridge pair from alert names
- GitHub Action triggers on issue label:claude, uses Grafana API
- Located in deployments/local-scripts/grafana-github-bridge/
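The classification step described in this commit message could look roughly like the sketch below. Only the eight category names come from the commit message. The matching patterns, the chain list, the environment mapping and all function names are assumptions made for illustration, not the PR's actual code.

```javascript
// Category names are taken from the commit message. The regular
// expressions are hypothetical: the real alert names are not shown here.
const CATEGORY_PATTERNS = [
  ["finality-lag", /finality/i],
  ["delivery-lag", /deliver/i],
  ["confirmation-lag", /confirm/i],
  ["reward-lag", /reward/i],
  ["relay-down", /relay.*(down|dead|stalled)/i],
  ["version-guard", /version/i],
  ["headers-mismatch", /header.*mismatch/i],
  ["low-balance", /balance/i],
];

// Assumed chain-to-environment mapping, for illustration only.
const CHAIN_ENVIRONMENT = {
  polkadot: "production",
  kusama: "production",
  westend: "testnet",
  rococo: "testnet",
};

// Return the first category whose pattern matches, or "unknown".
function classifyAlert(alertName) {
  for (const [category, pattern] of CATEGORY_PATTERNS) {
    if (pattern.test(alertName)) return category;
  }
  return "unknown";
}

// List known chains mentioned in the alert name, in order of appearance.
function chainsInName(alertName) {
  const lower = alertName.toLowerCase();
  return Object.keys(CHAIN_ENVIRONMENT)
    .map((chain) => ({ chain, index: lower.indexOf(chain) }))
    .filter((hit) => hit.index >= 0)
    .sort((a, b) => a.index - b.index)
    .map((hit) => hit.chain);
}

// Return "source-target" when two known chains are named, else null.
function detectBridgePair(alertName) {
  const chains = chainsInName(alertName);
  return chains.length >= 2 ? `${chains[0]}-${chains[1]}` : null;
}

// Return the environment of the first known chain named, else "unknown".
function detectEnvironment(alertName) {
  const chains = chainsInName(alertName);
  return chains.length > 0 ? CHAIN_ENVIRONMENT[chains[0]] : "unknown";
}
```

Returning "unknown" instead of guessing keeps unclassified alerts visible, so they can be routed differently from known ones.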
Force-pushed from ef89177 to 8b606c0.
Force-pushed from e530f92 to 37d2f22.
Comments/questions (TL;DR: please don't do this before resolving my worries!)

I'm concerned by the lack of a human in the loop here. What does "diagnose and act" mean?

Let me check I understand. The PR takes external data along this path: Grafana/Prometheus -> Cloudflare -> GH Issue -> GH Action.

- What prevents a prompt injection from Grafana/Prometheus? This could include info messages, error strings, ...
- On the Claude side, what APIs are you using, where are the tokens stored, and what are the exact permissions granted to the GH token?
- How do you know Claude is going to do what you expect it to do? (It's non-deterministic.)
- What are the accesses for the API tokens used? Have you used least privilege, as granular as possible, read-only by default?
Hey @MN0B
A human is at the very last stage of Claude's investigation, either approving or guiding Claude on what needs to be fixed. "Diagnose and act" means: get to the root cause (diagnose) and fix the issue/create a PR (act). There is a whole set of documents (runbooks, playbooks and SOPs) being developed for that purpose. The docs will serve as guidance for both humans and agents.
Yes but why do you say that data is
System prompts, runbooks, human-reviewers
To be clear, I intentionally do not have Claude SRE agent definition in this PR. That's a topic for a separate PR and a separate discussion. I don't want to mix things up.
I use the Anthropic-maintained GH workflow. The token is stored in the corresponding GH repo where the workflow will run.
Not yet confirmed, but I'd say it needs both read and write (to be able to send PRs). Writing and sending PRs is safe; they won't be merged without human approval either way.
By verifying its output (human reviewer).
This topic is way broader than permissions alone. We need to equip Claude with Skills so that it can not only reason but also iterate and verify its hypotheses autonomously, while also limiting those Skills to reduce the blast radius.

Final thought about security concerns: the points you're raising are all valid, but most of them are fixable with a system prompt and a read-only Skills definition (that's what I suggest in my design doc). I designed the Claude SRE agent deployment as a staged process (read first, resolve with human approval later). You can check out my design doc for details.
Convert from Cloudflare Worker to a plain Node.js HTTP server so it can be deployed to any container infrastructure.

No external dependencies: uses only the Node.js stdlib (node:http). Adds Dockerfile, health endpoint, smoke tests, and deployment docs.

Addresses paritytech/devops#5019 feedback on PR #3224.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
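For readers who have not seen the diff, a dependency-free server of the kind this commit describes might be shaped like the sketch below. It uses only `node:http`. The paths `/health` and `/webhook`, the response bodies and the function names are assumptions for illustration. The `alerts` array is the field Grafana and Alertmanager webhooks send.

```javascript
const http = require("node:http");

// Map a request onto one of the bridge's endpoints.
// The paths "/health" and "/webhook" are assumptions for this sketch.
function route(method, url) {
  if (method === "GET" && url === "/health") return "health";
  if (method === "POST" && url === "/webhook") return "webhook";
  return "not-found";
}

// Parse a webhook body and return its alerts array (empty when the
// field is missing), or null when the body is not valid JSON.
function parseAlerts(rawBody) {
  try {
    const payload = JSON.parse(rawBody);
    return Array.isArray(payload?.alerts) ? payload.alerts : [];
  } catch {
    return null;
  }
}

// Write a JSON response with the given status code.
function sendJson(res, status, value) {
  res.writeHead(status, { "Content-Type": "application/json" });
  res.end(JSON.stringify(value));
}

// Build the server without starting it. onAlerts is a caller-supplied
// callback that receives the parsed alerts array.
// Note: this sketch neither caps the body size nor authenticates the
// sender, both of which a real deployment would likely need.
function createBridgeServer(onAlerts) {
  return http.createServer((req, res) => {
    const endpoint = route(req.method, req.url);
    if (endpoint === "health") return sendJson(res, 200, { status: "ok" });
    if (endpoint === "not-found") return sendJson(res, 404, { error: "not found" });

    const chunks = [];
    req.on("data", (chunk) => chunks.push(chunk));
    req.on("end", () => {
      const alerts = parseAlerts(Buffer.concat(chunks).toString("utf8"));
      if (alerts === null) return sendJson(res, 400, { error: "invalid JSON" });
      onAlerts(alerts);
      sendJson(res, 202, { received: alerts.length });
    });
  });
}
```

Calling `createBridgeServer(handler).listen(8080)` would start it. The port is an arbitrary choice here. Keeping `route` and `parseAlerts` as pure functions means smoke tests can exercise them without opening a socket.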
Thanks Andrii. On your points:

Human intervention
Yes, a human is at the end, but fatigue happens and sometimes people just wave PRs through. If Claude generated a subtle vulnerability, back door or unintended consequence, are you confident that every reviewer would spot it? A much safer way would be for the AI to create an issue; that means the AI is advising.

External vs internal data
Keep me right here, but Grafana will reflect public chain data (public RPC nodes and external chain states)? So a malicious XCM message, for example, could inject data here.

Telling the AI to be safe through runbooks and prompts
AIs are non-deterministic; there are examples of this failing.

Read & write access
AIs are non-deterministic; you just can't trust them to do what you think they will. If it has write access it could push to other branches, delete issues or exfiltrate stuff.

Here's what I propose:
- Remove the Claude agent/workflow (claude-sre.yml).
- Update the Node.js bridge to create a GitHub Issue with the diagnostic data and a link to the relevant static runbook.
- Ensure the GitHub token for the service is limited to issues: write only.

This removes a lot of the risk. Once it's been "hardened" with real-life experience, we could be a little more relaxed with the Claude agent.

(One further thought I have is an AI/coded hybrid: use the AI to do messy things like summarising the logs, but code the common resolutions (even use Claude to write the code lol). That way the AI deals with messy things while code gives certainty around risky things like code changes.)
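The issue-only variant proposed above could be sketched as follows: the bridge formats the diagnostic data, links a static runbook, and makes a single call to GitHub's "create an issue" REST endpoint. That call fits a token scoped to issues: write. The runbook base URL, the `bridge-alert` label and the function names are hypothetical. Quoting the alert text only marks it as untrusted data and does not by itself prevent prompt injection.

```javascript
// Hypothetical location of the static runbooks, one file per category.
const RUNBOOK_BASE_URL = "https://example.com/runbooks";

// Prefix every line with "> " so alert text renders as a quotation.
function quote(text) {
  return String(text)
    .split("\n")
    .map((line) => `> ${line}`)
    .join("\n");
}

// Build the request body for GitHub's "create an issue" endpoint from a
// single alert and its category. No agent label is attached.
function buildIssuePayload(alert, category) {
  const labels = alert.labels ?? {};
  const annotations = alert.annotations ?? {};
  const name = labels.alertname ?? "unnamed alert";
  const body = [
    `Status: ${alert.status ?? "unknown"}`,
    `Severity: ${labels.severity ?? "unknown"}`,
    `Runbook: ${RUNBOOK_BASE_URL}/${category}.md`,
    "",
    "Alert summary (untrusted data, quoted verbatim):",
    quote(annotations.summary ?? "(none)"),
  ].join("\n");
  return {
    title: `[${category}] ${name}`,
    body,
    labels: ["bridge-alert", category], // "bridge-alert" is an assumed label
  };
}

// POST the payload to the GitHub REST API. Requires Node.js 18+ for the
// global fetch. owner, repo and token are supplied by the caller.
async function createIssue(payload, { owner, repo, token }) {
  const url = `https://api.github.com/repos/${owner}/${repo}/issues`;
  const response = await fetch(url, {
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
      "User-Agent": "grafana-github-bridge",
    },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`GitHub API returned ${response.status}`);
  }
  return response.json();
}
```

Because the service only ever calls the issues endpoint, the token never needs contents or pull-request write access.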
design doc
Summary
Cloudflare Worker that converts Grafana Alertmanager webhooks into categorized GitHub issues, triggering a Claude Code agent to diagnose and act on bridge alerts.
flowchart TD
    P["Prometheus :9615"] --> G["Grafana Alert Rules"]
    G --> AM["Alertmanager"]
    AM -->|webhook| W["Cloudflare Worker"]
    AM -->|webhook| M["Matrix"]
    W -->|"warning + known"| T["Issue: label claude"]
    W -->|"critical / unknown"| E["Issue: label claude-escalate"]
    T -->|Haiku| H["Fast Triage"]
    E -->|Sonnet| S["Deep Investigation"]
    H --> V["Engineer Reviews"]
    S --> V

https://grafana-github-bridge.parity-bridges.workers.dev
deployments/local-scripts/grafana-github-bridge/
Grafana notification policy
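The two-way split in the flowchart can be read as a small routing function. The sketch below is one interpretation, not the PR's code: it treats "known" as "the category is one of the eight listed in the PR" and escalates everything that is not both warning-severity and known. The function and field names are assumptions.

```javascript
// The eight categories listed in the PR description.
const KNOWN_CATEGORIES = new Set([
  "finality-lag",
  "delivery-lag",
  "confirmation-lag",
  "reward-lag",
  "relay-down",
  "version-guard",
  "headers-mismatch",
  "low-balance",
]);

// Pick the issue label and model tier for one classified alert.
// Assumption: anything that is not "warning + known" is escalated,
// including severities the flowchart does not mention.
function routeAlert({ severity, category }) {
  const isKnown = KNOWN_CATEGORIES.has(category);
  if (severity === "warning" && isKnown) {
    return { label: "claude", model: "haiku", stage: "fast-triage" };
  }
  return { label: "claude-escalate", model: "sonnet", stage: "deep-investigation" };
}
```

Defaulting to escalation means an unexpected severity or a new alert name ends up on the slower, more thorough path and not the fast one.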
Test plan
`firing` creates issues)
`domain` label and chain name fallback
`GITHUB_TOKEN` secret and verify issue creation end-to-end

🤖 Generated with Claude Code