From a475b3948ffb8819d263f26e3274736d50d2051b Mon Sep 17 00:00:00 2001
From: Anderson Nogueira
Date: Tue, 9 Jun 2026 20:04:25 +0000
Subject: [PATCH] docs: tighten README to key facts plus pointers

- Lead with the GitOps split (tofu to ArgoCD-healthy, ArgoCD after) and the ApplicationSet fan-out as key facts
- Replace component prose, TF operating guide, and compute-topology table with a layout table and pointers to docs/, ADRs, runbooks, module READMEs
---
 README.md | 131 +++++++++++++++++------------------------------------
 1 file changed, 42 insertions(+), 89 deletions(-)

diff --git a/README.md b/README.md
index d6af127..7abb103 100644
--- a/README.md
+++ b/README.md
@@ -1,114 +1,67 @@
 # percona-cd-platform
 
-Percona's CI/CD platform.
-
-It hosts Jenkins masters and the platform services around them — observability,
-single sign-on, ingress with TLS termination, and cluster autoscaling — running
-on a GitOps-managed EKS cluster in `us-east-1`. Everything is defined as code
+Percona's CI/CD platform: a GitOps-managed EKS cluster in `us-east-1` hosting
+the Jenkins masters and the platform services around them (LGTM observability,
+Authentik SSO, ingress with TLS, autoscaling). Everything is defined as code
 and reconciled from this repo; there are no manual cluster changes.
 
-Region, cluster name, hostnames, and other deployment-specific values are
-parameterized in `terraform/`.
-
-## Components
-
-**CI/CD (Jenkins).** Jenkins masters served on `*.cd.percona.com`, in one of
-two modes: an ALB → in-cluster NGINX proxy → cross-region VPC peering → EC2
-master (a reconciler keeps each EndpointSlice synced to the live instance IP),
-or an in-cluster StatefulSet. The custom master image bundles the WAR,
-Percona-patched plugins, and `init.groovy.d`.
-
-**Monitoring / observability.** A distributed LGTM stack (Mimir, Loki, Tempo,
-Grafana). EC2 masters run a master-side Alloy that pushes metrics, logs, and
-traces through `alloy-gateway` (NGINX bearer-auth + Alloy receivers) into the
-stack. Grafana fronts all three behind Authentik OIDC.
+## Key facts
 
-**Terraform / AWS.** OpenTofu owns AWS-side state through "ArgoCD healthy":
-VPC, EKS, managed node groups, Karpenter prerequisites, Pod Identity, ACM
-wildcard cert, the LGTM S3 buckets, and each EC2 Jenkins master (spot fleet or
-on-demand, IAM, EBS, userdata via a reusable module). TF outputs reach ArgoCD as
-cluster-Secret annotations consumed as Helm values.
+- OpenTofu owns AWS up to "ArgoCD healthy": VPC, EKS, node groups, Pod
+  Identity, the EC2 Jenkins masters, ARM spot fleets, S3, cleanup reapers.
+  TF outputs reach ArgoCD as cluster-Secret annotations.
+- From there ArgoCD owns everything in-cluster: a root App-of-Apps fans out
+  ApplicationSets that reconcile one Application per `resources/addons/*` dir
+  and one per `resources/jenkins/master/instances/*` dir. No manual `kubectl`.
+- Jenkins masters serve on `*.cd.percona.com` in two modes: EKS-fronted EC2
+  (ALB, in-cluster NGINX, cross-region VPC peering, an EndpointSlice
+  reconciler) or in-cluster StatefulSet.
+- Repo CI is lint + validate only; `ci-gate` is the single required check and
+  `just ci` mirrors it locally.
+- The repo is public: no account IDs, ARNs, or secrets in committed files.
 
-**Account cleanup reapers.** Two scheduled Lambdas from the
-[`scheduled-lambda` module](terraform/modules/scheduled-lambda/README.md):
-daily unattached-EBS reaper, 5-minute untagged-EC2 + orphan-`eksctl`-stack
-reaper. Tunables (schedules, `dry_run`, EKS skip regex) in
-[`terraform/locals.tf`](terraform/locals.tf); ops in
-[`docs/runbooks/cleanup-reapers.md`](docs/runbooks/cleanup-reapers.md).
+## Layout
 
-**GitOps / ArgoCD.** From "ArgoCD healthy" onward, everything in-cluster is
-GitOps-managed. A root App-of-Apps fans out to ApplicationSets that reconcile
-one Application per addon and one per in-cluster Jenkins instance. No manual
-`kubectl` mutations — drift breaks reconciliation.
-
-**Repo CI.** GitHub Actions runs lint + validate only (no plan, no deploy):
-one check per job, `ci-gate` aggregate as the single required check. `just ci`
-mirrors it locally. Dependabot bumps actions, test deps, and the Jenkins base
-image weekly.
+| Path | What lives there |
+|---|---|
+| [`terraform/`](terraform/) | AWS substrate; reusable modules with their own READMEs ([jenkins-arm-fleet](terraform/modules/jenkins-arm-fleet/README.md), [jenkins-arm-standalone](terraform/modules/jenkins-arm-standalone/README.md), [scheduled-lambda](terraform/modules/scheduled-lambda/README.md)); pins in [`versions.tf`](terraform/versions.tf) |
+| [`argocd-bootstrap/`](argocd-bootstrap/) | Root Application, ApplicationSets, AppProject |
+| [`resources/addons/`](resources/addons/) | One dir = one ArgoCD Application (observability, ingress, SSO, ...) |
+| [`resources/jenkins/`](resources/jenkins/) | In-cluster master chart, per-instance values, clouds catalog (rendered by [`scripts/render-clouds.py`](scripts/render-clouds.py), drift-gated in CI) |
+| [`images/`](images/) | Container images (controller bundle and friends), built by GitHub Actions |
+| [`scripts/`](scripts/) | Verification and render tooling; catalog in [`scripts/README.md`](scripts/README.md) |
+| [`docs/`](docs/) | [Architecture](docs/architecture.md), [ADRs](docs/adr/), [runbooks](docs/runbooks/); everything indexed in [`docs/README.md`](docs/README.md) |
+| [`justfile`](justfile) | The single entrypoint for CI and every `tofu` operation |
 
 ## Quickstart
 
 ```sh
-just ci # local lint + validate
+just ci # local lint + validate (mirrors the PR gate)
 just tf-plan # TF plan (writes tfplan)
-just tf-apply # TF apply (applies the saved tfplan; never auto-approve)
+just tf-apply # apply the saved tfplan; never auto-approve
 ```
 
-State bucket + lock are pre-created; see [the bootstrap runbook](docs/runbooks/bootstrap-state.md).
-
-## Operating Terraform via the justfile
+`AWS_PROFILE` must be exported in your shell; AWS-touching recipes fail loudly
+without it. Back up state before risky applies (`just tf-state-backup`). State
+bucket bootstrap: [runbook](docs/runbooks/bootstrap-state.md).
 
-The justfile is the single entrypoint for Terraform. Drive every `tofu`
-operation through a `just tf-*` recipe; do not run raw `tofu` or `cd terraform`
-by hand.
-
-- **`AWS_PROFILE` is required and supplied externally.** Export it in your shell
-  (e.g. `export AWS_PROFILE=percona-dev-admin`); AWS-touching recipes fail loudly
-  if it is unset. It is never baked into a default and never set in `terraform/`.
-- **Back up state before any risky apply:** run `just tf-state-backup` (timestamped
-  `tofu state pull`) first. `just tf-state-versioning-check` confirms bucket
-  versioning is on.
-- **`tf-plan` writes `tfplan`; `tf-apply` applies that saved plan** — never
-  auto-approve. There is no `tf-apply-now`.
-- **`-target` / `-exclude` are PLAN-ONLY.** `just tf-plan-masters` scopes a plan to
-  the per-master modules for inspection; there is no `tf-apply-masters`. Targeting
-  is for exceptional ops, not routine applies.
-
-## Compute topology
-
-Five tiers, each with a canonical `workload.percona.com/tier` label and
-(where exclusive) a matching taint. Workloads opt in via `nodeSelector` +
-`tolerations`. `general` is untainted and is the safe fallthrough.
-
-| Tier | Capacity | Hosts |
-|---|---|---|
-| `bootstrap` | EKS MNG, on-demand, multi-AZ | ArgoCD, Karpenter, AWS LB controller, external-secrets, external-dns, kube-state-metrics |
-| `obs-state` | EKS MNG, single-AZ | Stateful single-replica pods that block eviction (Authentik Postgres, Grafana, prometheus-operator CRDs) |
-| `jenkins-master` | EKS MNG, on-demand, single-AZ | The in-cluster Jenkins controller (ps3-k8s pilot); pinned to us-east-1a to co-locate with its zonal EBS PVC |
-| `lgtm-stateful` | Karpenter NodePool, on-demand, single-AZ | Stateful LGTM pods (Mimir, Loki, Tempo ingesters; store-gateway; compactor; alertmanager). Configured to behave like an MNG (no spot, no consolidation under load, no AMI-drift) while keeping instance-family flex |
-| `general` | Karpenter NodePool, spot + on-demand, single-AZ | Stateless LGTM components, Grafana web, the auth web tier, alloy-gateway, anything without an explicit tier |
-
-MNGs handle bootstrap and single-AZ stateful workloads whose PDBs block
-eviction. Karpenter handles the higher-volume tiers (LGTM stateful,
-stateless), trading multi-AZ HA for EBS-per-pod zonality. Full reasoning in
-[the cluster tier taxonomy ADR](docs/adr/0017-cluster-tier-taxonomy-and-lgtm-pinning.md).
-
-## Documentation
+## Where the details are
 
 | Topic | Doc |
 |---|---|
-| Architecture overview | [`docs/architecture.md`](docs/architecture.md) |
-| Architecture Decision Records | [`docs/adr/`](docs/adr/) |
-| Runbooks (bootstrap, recovery, upgrades) | [`docs/runbooks/`](docs/runbooks/) |
-
-Everything else is indexed in [`docs/README.md`](docs/README.md).
+| System architecture and components | [`docs/architecture.md`](docs/architecture.md) |
+| Compute tiers, MNG vs Karpenter reasoning | [ADR 0017](docs/adr/0017-cluster-tier-taxonomy-and-lgtm-pinning.md) |
+| Observability push pipeline | [`docs/observability.md`](docs/observability.md) |
+| EC2 master connectivity and resilience | [`docs/connectivity.md`](docs/connectivity.md), [`docs/ec2-master-resilience.md`](docs/ec2-master-resilience.md) |
+| Account cleanup reapers | [`docs/runbooks/cleanup-reapers.md`](docs/runbooks/cleanup-reapers.md) |
+| Bootstrap, recovery, upgrades | [`docs/runbooks/`](docs/runbooks/) |
+| Every past design decision | [`docs/adr/`](docs/adr/) |
 
 ## Contributing
 
-- `just ci` must pass before PR.
-- Pre-commit hooks mirror CI ([`.pre-commit-config.yaml`](.pre-commit-config.yaml)).
+- `just ci` must pass before PR; pre-commit hooks mirror it ([`.pre-commit-config.yaml`](.pre-commit-config.yaml)).
 - Propose architecture changes in [`docs/adr/`](docs/adr/) first.
-- Pinned versions live in [`terraform/versions.tf`](terraform/versions.tf); run [`scripts/check_versions.py`](scripts/check_versions.py) before bumping pins.
+- Version pins live in [`terraform/versions.tf`](terraform/versions.tf); run [`scripts/check_versions.py`](scripts/check_versions.py) before bumping.
 - Commit format: `type(scope): subject`. No AI footers.
 
 ## License