Changes from 23 commits
38 commits
b487bed
chore: add optional validatingAdmissionWebhook, and prepare for a sep…
sunib Jun 24, 2026
ec49609
chore: creating plan and code to capture all the mechanisms in a more…
sunib Jun 24, 2026
876ff60
chore: here is M2!
sunib Jun 24, 2026
e0b6369
chore: Interesting findings on the shallow body problem
sunib Jun 25, 2026
758a629
chore: small fixes
sunib Jun 25, 2026
2ccf07a
chore: more improvements in the tests
sunib Jun 25, 2026
7c41668
chore: first draft on architecture update
sunib Jun 25, 2026
8dc3aaa
chore: finishing for now
sunib Jun 25, 2026
e39ca06
test(mutationlab): capture rows 10 (owner-ref cascade) and 13 (conflict)
sunib Jun 25, 2026
77e349f
feat(watch): parallel watch-state stream behind --watch-state-stream
sunib Jun 25, 2026
6d87865
feat(watch): diff watch-derived vs audit-derived desired sets (Phase …
sunib Jun 25, 2026
2535109
test(mutationlab): capture rows 16+17 (watch resync + bookmark)
sunib Jun 25, 2026
d41cfab
chore: finishing the design
sunib Jun 25, 2026
38f59f8
feat: let's get all testing to Kubernetes 1.36
sunib Jun 25, 2026
6dec610
chore: reran all mutatons on k8s v1.36.1
sunib Jun 25, 2026
e612fc6
chore: improving watch-ingestion document.
sunib Jun 25, 2026
d0e25a2
chore: getting the design docs better
sunib Jun 25, 2026
b2d5bc7
feat: watch-first ingestion
sunib Jun 25, 2026
5122032
chore: next steps
sunib Jun 26, 2026
79a9fe6
docs: moving architecture along with the rewrite
sunib Jun 26, 2026
b524d83
chore: relisten to a watch when possible
sunib Jun 26, 2026
7e011db
chore: details on how Redis is needed
sunib Jun 26, 2026
d7bdb16
docs: created new plan, and hopefully found why the tests are so flaky
sunib Jun 26, 2026
acf73d5
chore: easier status and streamsready
sunib Jun 26, 2026
acaea33
chore: e2e flake preventions
sunib Jun 26, 2026
04aa391
chore: overall improvements, fixing things and cleaning docs
sunib Jun 26, 2026
915b524
feat: reworking metrics to new architecture
sunib Jun 26, 2026
261c440
feat(manifestanalyzer,git): refuse unsupported GitTarget folder conte…
sunib Jun 26, 2026
cb8d4b0
feat(watch): surface a refused GitTarget folder as a Blocked stream
sunib Jun 26, 2026
d09ab73
test(e2e): prove unsupported-folder refusal end to end (Test D) + docs
sunib Jun 26, 2026
893e17f
test(e2e): apply sops-age-key in unsupported-folder test so Ready can…
sunib Jun 26, 2026
f2773a8
docs: adding skills and working on status design
sunib Jun 27, 2026
92fa490
chore: improve status, support kstatus
sunib Jun 27, 2026
12f3aa2
chore: refining names and more explicit e2e test for status behaviour
sunib Jun 27, 2026
419ab33
docs: designing gittargetignore
sunib Jun 27, 2026
1c61666
feat: refuse weird files in GitTarget path, but do allow .gittargetig…
sunib Jun 27, 2026
b895ef3
chore: removing settings and preparing merge
sunib Jun 28, 2026
a555d9d
chore: Support CommitRequest with clearer status and non-attribution …
sunib Jun 28, 2026
2 changes: 1 addition & 1 deletion .coverage-baseline
@@ -1 +1 @@
-75.0
+73.6
57 changes: 13 additions & 44 deletions charts/gitops-reverser/README.md
@@ -170,17 +170,14 @@ nodeSelector:
 | `quickstart.gitProvider.secretRef.name` | Existing Secret name used by the starter `GitProvider` | `git-creds` |
 | `quickstart.gitTarget.path` | Repository path used by the starter `GitTarget`; set `.` only to deliberately target the repo root | `live-cluster` |
 | `quickstart.watchRule.rules` | Rules used by the starter `WatchRule` | `configmaps create/update/delete` |
-| `queue.redis.addr` | Redis endpoint (`host:port`) for required durable audit queueing | `valkey:6379` |
+| `queue.redis.addr` | Redis endpoint (`host:port`) for optional audit attribution and watch resume cursors; empty = committer-only mode with replay/list recovery | `valkey:6379` |
 | `queue.redis.auth.existingSecret` | Name of a pre-created Secret holding the Redis password | `valkey-auth` |
 | `queue.redis.auth.existingSecretKey` | Key within the Secret that holds the password | `password` |
 | `queue.redis.auth.username` | Optional Redis ACL username | `""` |
-| `queue.redis.maxLen` | Approximate stream max length (`0` disables trim, allowing unbounded growth) | `10000` |
 | `queue.redis.tls.enabled` | Enable TLS for Redis connection | `false` |
-| `webhook.audit.debugStream.enabled` | Append every decoded audit event to the early Redis debug stream | `false` |
-| `webhook.audit.debugStream.stream` | Redis stream name for early decoded audit event debugging | `gitopsreverser.audit.debug.events.v1` |
-| `webhook.audit.debugStream.maxLen` | Approximate early debug stream max length (`0` disables trim, allowing unbounded growth) | `10000` |
-| `auditEventJoin.bodyTTL` | TTL for parked additional audit bodies waiting for the matching official event | `5m` |
-| `auditEventJoin.bodyWait` | Grace period for a bodyless official audit event to wait for a matching additional body while preserving official event order | `500ms` |
+| `attribution.ttl` | How long an attribution fact is retained waiting for the matching watch event to join it | `10m` |
+| `attribution.grace` | Bounded per-event wait for a matching audit fact before a watch event ships as the committer | `3s` |
+| `attribution.serviceAccountNaming` | How a matched service account is named: `name` (its own username) or `bot` (collapse to the committer) | `name` |
 | `servers.metrics.bindAddress` | Metrics listener bind address | `:8080` |
 | `servers.metrics.tls.enabled` | Serve metrics with TLS | `false` |
 | `servers.metrics.tls.certPath` | Metrics TLS certificate mount path | `/tmp/k8s-metrics-server/metrics-server-certs` |
@@ -205,47 +202,19 @@ See [`values.yaml`](values.yaml) for complete configuration options.

 ### Audit Webhook URL Contract

-`https://<service>:9444/audit-webhook` receives the canonical audit events that drive Git writes.
+`https://<service>:9444/audit-webhook` receives audit events from kube-apiserver. The operator
+extracts a minimal attribution fact from each (auditID, user, verb, resourceVersion, GVR, namespace,
+name, UID, status, timestamps) into the optional Redis attribution index. The same Redis endpoint stores
+watch resume cursors, so reconnects can resume a normal watch from the last processed resourceVersion
+when the apiserver can still serve that history. Object state itself comes from Kubernetes **watch**, not
+from audit; audit only names the commit author.

-`https://<service>:9444/audit-webhook-additional` receives events whose request or response bodies
-un-shallow the official events on `/audit-webhook`, matched by `auditID`. Every event sent here is
-eligible to contribute a parked body; no API group allowlist is required. Bodyless additional
-events are dropped as malformed.
-
-`deletecollection` audit events are preserved in the canonical stream, but per-item Git write
-fan-out for collection deletes is not implemented yet.
+When `queue.redis.addr` is empty the audit webhook is not used at all and the product runs
+committer-only — every commit is authored by the configured committer and watch recovery uses replay/list
+snapshots instead of persisted resume cursors.

 Cluster ID path segments are rejected.

-### Audit Queue Retention
-
-The Redis/Valkey stream is a durable **work queue**, not a long-term audit archive. Use
-Kubernetes audit logs or object storage for raw audit retention; this stream exists to
-buffer events between the kube-apiserver audit webhook and the controller.
-
-- `queue.redis.maxLen` is an **approximate** upper bound — the chart renders
-`--audit-redis-max-len`, which is enforced with `XADD ... MAXLEN ~ N` for performance.
-- A bounded default (`10000`) protects Valkey memory and reload time. Sizing should be
-based on expected audit events per second, the outage/catch-up window the queue
-must tolerate, and a safety factor.
-- Setting `maxLen: 0` keeps the stream unbounded. The controller logs a startup warning
-when either the canonical or debug stream is unbounded so accidental production
-installs are visible.
-- **Too-low values** can drop entries before slow consumers catch up. **Too-high
-values** create memory pressure and long Valkey RDB reload times after a restart.
-
-Operationally the queue is observable via Prometheus metrics:
-
-- `gitopsreverser_audit_queue_stream_length` — current entries in the canonical stream
-- `gitopsreverser_audit_queue_pending_entries` — claimed but unacked messages
-- `gitopsreverser_audit_queue_consumer_lag` — entries not yet read by the consumer group
-- `gitopsreverser_audit_queue_oldest_entry_age_seconds` — age of the oldest entry
-- `gitopsreverser_audit_queue_oldest_pending_age_seconds` — age of the oldest pending entry
-- `gitopsreverser_audit_debug_stream_length` — current entries in the debug stream
-
-`pending_entries` alone is not sufficient: it counts claimed but unacked messages, while
-`consumer_lag` captures backlog that may be trimmed before consumption.
-
 ## Custom Resource Definitions (CRDs)

 This chart automatically manages the following CRDs:
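
Note on the README change above: the new text describes watch events being joined to audit-derived attribution facts, with `attribution.ttl` bounding how long a fact is kept and `attribution.grace` bounding how long a watch event waits before it ships as the committer. A minimal Python sketch of that behavior follows. It is illustrative only and is not the operator's code: the join key, the `AttributionIndex` and `resolve_author` names, the polling loop, and the in-memory dict standing in for Redis are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Assumed join key; the PR does not state which audit fields are matched.
Key = Tuple[str, str, str, str]  # (GVR, namespace, name, resourceVersion)


@dataclass
class Fact:
    user: str
    stored_at: float


class AttributionIndex:
    """In-memory stand-in for the Redis attribution index (assumption)."""

    def __init__(self, ttl_seconds: float,
                 now: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.now = now
        self.facts: Dict[Key, Fact] = {}

    def put(self, key: Key, user: str) -> None:
        self.facts[key] = Fact(user=user, stored_at=self.now())

    def get(self, key: Key) -> Optional[str]:
        fact = self.facts.get(key)
        if fact is None:
            return None
        if self.now() - fact.stored_at > self.ttl_seconds:
            del self.facts[key]  # older than the TTL: treat as absent
            return None
        return fact.user


def resolve_author(index: Optional[AttributionIndex], key: Key, committer: str,
                   grace_seconds: float, poll_seconds: float = 0.05,
                   now: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Return the audit user if a fact shows up within the grace period,
    otherwise fall back to the configured committer."""
    if index is None:  # no Redis configured: committer-only mode
        return committer
    deadline = now() + grace_seconds
    while True:
        user = index.get(key)
        if user is not None:
            return user
        if now() >= deadline:
            return committer
        sleep(poll_seconds)
```

Injecting `now` and `sleep` keeps the sketch testable without real waiting. Passing `index=None` models the empty `queue.redis.addr` case, where every commit is authored by the committer.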
2 changes: 0 additions & 2 deletions charts/gitops-reverser/templates/NOTES.txt
@@ -70,8 +70,6 @@ Suggested audit webhook server URL from the control-plane node:
 - `ClusterIP` alone is usually NOT enough unless your control-plane host can route to the service CIDR.
 {{- end }}

-Supplementary audit sources, including `apiservice-audit-proxy`, should use the same host with `/audit-webhook-additional`.
-
 Copy-paste example to generate `audit-webhook.kubeconfig` after the Secrets exist:

 ```bash
17 changes: 3 additions & 14 deletions charts/gitops-reverser/templates/deployment.yaml
@@ -70,26 +70,15 @@ spec:
 - --audit-redis-username={{ . | quote }}
 {{- end }}
 - --audit-redis-db={{ .Values.queue.redis.db }}
-- --audit-redis-max-len={{ .Values.queue.redis.maxLen }}
-{{- if .Values.webhook.audit.debugStream.enabled }}
-- --audit-debug-redis-stream={{ .Values.webhook.audit.debugStream.stream }}
-- --audit-debug-redis-max-len={{ .Values.webhook.audit.debugStream.maxLen }}
-{{- end }}
-{{- if .Values.webhook.audit.diagStreams.enabled }}
-- --audit-bytype-diag
-- --audit-bytype-diag-max-len={{ .Values.webhook.audit.diagStreams.maxLen }}
-{{- with .Values.webhook.audit.diagStreams.resources }}
-- --audit-bytype-diag-resources={{ join "," . }}
-{{- end }}
-{{- end }}
 {{- if .Values.queue.redis.tls.enabled }}
 - --audit-redis-tls
 {{- end }}
+- --attribution-ttl={{ .Values.attribution.ttl }}
+- --attribution-grace={{ .Values.attribution.grace }}
+- --attribution-sa-naming={{ .Values.attribution.serviceAccountNaming }}
 {{- with .Values.controllerManager.additionalSensitiveResources }}
 - {{ printf "--additional-sensitive-resources=%s" (join "," .) | quote }}
 {{- end }}
-- --audit-event-body-ttl={{ .Values.auditEventJoin.bodyTTL }}
-- --audit-event-body-wait={{ .Values.auditEventJoin.bodyWait }}
 {{- if .Values.logging.level }}
 - --zap-log-level={{ .Values.logging.level }}
 {{- end }}
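
Note on the template hunk above: after this change the three `--attribution-*` flags are always rendered, while `--audit-redis-tls`, `--additional-sensitive-resources` and `--zap-log-level` stay conditional. The Python sketch below mirrors only the conditionals visible in this hunk, to make the resulting argument list easy to check against a given set of values. `render_args` is a hypothetical helper, not part of the chart, and flags rendered outside this hunk are omitted.

```python
from typing import Any, Dict, List


def render_args(values: Dict[str, Any]) -> List[str]:
    """Approximate the container args this hunk renders (post-change)."""
    redis = values["queue"]["redis"]
    attribution = values["attribution"]

    args = [f"--audit-redis-db={redis['db']}"]
    if redis["tls"]["enabled"]:
        args.append("--audit-redis-tls")

    # Rendered unconditionally in the new template.
    args += [
        f"--attribution-ttl={attribution['ttl']}",
        f"--attribution-grace={attribution['grace']}",
        f"--attribution-sa-naming={attribution['serviceAccountNaming']}",
    ]

    # Mirrors the `with` block: skipped when the list is empty or unset.
    extra = values.get("controllerManager", {}).get("additionalSensitiveResources") or []
    if extra:
        args.append("--additional-sensitive-resources=" + ",".join(extra))

    level = values.get("logging", {}).get("level")
    if level:
        args.append(f"--zap-log-level={level}")
    return args
```

With the defaults from this PR's `values.yaml` (`db: 0`, TLS disabled, `ttl: "10m"`, `grace: "3s"`, `serviceAccountNaming: "name"`), the sketch yields the database flag followed by the three attribution flags and nothing else.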
57 changes: 14 additions & 43 deletions charts/gitops-reverser/values.yaml
@@ -94,62 +94,33 @@ servers:
 # Leave empty to use the chart-generated Secret name.
 secretNameOverride: ""

-# Webhook behavior
-webhook:
-audit:
-# Set to true to append every decoded audit event to a separate Redis stream
-# before normal audit processing can filter, join, or drop it.
-debugStream:
-enabled: false
-stream: "gitopsreverser.audit.debug.events.v1"
-# Approximate max stream length; 0 disables trimming and allows unbounded growth.
-# The bounded default keeps Valkey memory and restart/reload time predictable; raise it
-# only if your audit rate or outage-catch-up window exceeds the default budget. The
-# debug stream stores every decoded event before filtering/joining, so unbounded mode
-# here is especially dangerous.
-maxLen: 10000
-# Opt-in <prefix>:diag_all firehose — one annotated record per ingested audit event
-# (the entry payload plus its outcome/category), for investigating ingestion/ordering.
-# Off by default; enable only while investigating. See
-# docs/design/stream/audit-diagnostic-streams-plan.md.
-diagStreams:
-enabled: false
-# Approximate max <prefix>:diag_all stream length; 0 disables trimming (unbounded — risky,
-# this captures every ingested event).
-maxLen: 100000
-# Optional list of resource names to scope the firehose to (e.g. ["commitrequests",
-# "configmaps"]). Empty captures every queued event; a non-empty list bounds the firehose
-# (and its in-lock XADD load) to those suspect types when investigating a specific failure.
-resources: []
-
-# Durable audit queue configuration.
+# Redis-backed attribution facts and watch resume cursors. Empty addr runs committer-only.
 queue:
 redis:
-# Redis endpoint in host:port format. GitOps Reverser requires a reachable Valkey/Redis service.
+# Redis endpoint (host:port). Empty disables Redis (single-replica only).
 addr: "valkey:6379"
 auth:
-# Name of a pre-existing Secret in the same namespace that holds the Redis password.
-# The Secret must be created before installing this chart.
-# Required for production deployments. You may leave it empty only for dev/local environments
-# where Valkey runs without authentication.
+# Pre-existing Secret (same namespace) holding the Redis password. Create it before install;
+# leave empty only for dev clusters where Valkey runs without authentication.
 existingSecret: "valkey-auth"
 # Key within the Secret that holds the password value.
 existingSecretKey: "password"
 # Optional Redis username for ACL-based auth (Redis 6+). Not sensitive; stays as a plain value.
 username: ""
 db: 0
-# Approximate max stream length; 0 disables trimming and allows unbounded growth.
-# 10000 is a production-safe starting point — large enough to absorb short consumer
-# outages, small enough to keep Valkey memory and reload time predictable. Size based
-# on your expected audit events per second and the outage/catch-up window the queue
-# must tolerate. Too-low values can drop entries before slow consumers catch up.
-maxLen: 10000
 tls:
 enabled: false

-auditEventJoin:
-bodyTTL: "5m"
-bodyWait: "500ms"
+# Commit-author attribution from audit facts. Only applies when queue.redis.addr is set.
+attribution:
+# How long an attribution fact is retained waiting for the matching watch event to join it.
+ttl: "10m"
+# Bounded per-event wait for a matching audit fact before a watch event ships as the committer.
+# Larger values raise attribution hit-rate at the cost of commit latency.
+grace: "3s"
+# How a matched service account is named as the author: "name" (the service account's own
+# username) or "bot" (collapse every service account to the committer).
+serviceAccountNaming: "name"

 # TLS certificate management (Kubernetes requires this for API server callbacks)
 certificates:
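
Note on `attribution.serviceAccountNaming` above: the two documented modes can be sketched as a small pure function. This is a hedged Python illustration, not the operator's implementation. `author_for` is a hypothetical name, and detecting service accounts by the standard Kubernetes username prefix `system:serviceaccount:` is an assumption about how a match would be made.

```python
# Kubernetes service account usernames have the form
# system:serviceaccount:<namespace>:<name>.
SA_PREFIX = "system:serviceaccount:"


def author_for(audit_user: str, committer: str, sa_naming: str = "name") -> str:
    """Pick the commit author for a matched audit user.

    "name": a service account keeps its own username.
    "bot":  every service account collapses to the committer.
    Human users are returned unchanged in both modes.
    """
    if sa_naming not in ("name", "bot"):
        raise ValueError(f"unsupported serviceAccountNaming: {sa_naming!r}")
    is_service_account = audit_user.startswith(SA_PREFIX)
    if is_service_account and sa_naming == "bot":
        return committer
    return audit_user
```

For example, `author_for("system:serviceaccount:flux-system:kustomize-controller", "gitops-reverser", "bot")` would return the committer name, while the default `"name"` mode returns the service account username unchanged. Both names in that call are made up for the example.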