Merged
131 changes: 42 additions & 89 deletions README.md
@@ -1,114 +1,67 @@
# percona-cd-platform

Percona's CI/CD platform.

It hosts Jenkins masters and the platform services around them — observability,
single sign-on, ingress with TLS termination, and cluster autoscaling — running
on a GitOps-managed EKS cluster in `us-east-1`. Everything is defined as code
Percona's CI/CD platform: a GitOps-managed EKS cluster in `us-east-1` hosting
the Jenkins masters and the platform services around them (LGTM observability,
Authentik SSO, ingress with TLS, autoscaling). Everything is defined as code
and reconciled from this repo; there are no manual cluster changes.

Region, cluster name, hostnames, and other deployment-specific values are
parameterized in `terraform/`.

## Components

**CI/CD (Jenkins).** Jenkins masters served on `*.cd.percona.com`, in one of
two modes: an ALB → in-cluster NGINX proxy → cross-region VPC peering → EC2
master (a reconciler keeps each EndpointSlice synced to the live instance IP),
or an in-cluster StatefulSet. The custom master image bundles the WAR,
Percona-patched plugins, and `init.groovy.d`.

**Monitoring / observability.** A distributed LGTM stack (Mimir, Loki, Tempo,
Grafana). EC2 masters run a master-side Alloy that pushes metrics, logs, and
traces through `alloy-gateway` (NGINX bearer-auth + Alloy receivers) into the
stack. Grafana fronts all three behind Authentik OIDC.
## Key facts

**Terraform / AWS.** OpenTofu owns AWS-side state through "ArgoCD healthy":
VPC, EKS, managed node groups, Karpenter prerequisites, Pod Identity, ACM
wildcard cert, the LGTM S3 buckets, and each EC2 Jenkins master (spot fleet or
on-demand, IAM, EBS, userdata via a reusable module). TF outputs reach ArgoCD as
cluster-Secret annotations consumed as Helm values.
- OpenTofu owns AWS up to "ArgoCD healthy": VPC, EKS, node groups, Pod
Identity, the EC2 Jenkins masters, ARM spot fleets, S3, cleanup reapers.
TF outputs reach ArgoCD as cluster-Secret annotations.
- From there ArgoCD owns everything in-cluster: a root App-of-Apps fans out
ApplicationSets that reconcile one Application per `resources/addons/*` dir
and one per `resources/jenkins/master/instances/*` dir. No manual `kubectl`.
- Jenkins masters serve on `*.cd.percona.com` in two modes: EKS-fronted EC2
(ALB, in-cluster NGINX, cross-region VPC peering, an EndpointSlice
reconciler) or in-cluster StatefulSet.
- Repo CI is lint + validate only; `ci-gate` is the single required check and
`just ci` mirrors it locally.
- The repo is public: no account IDs, ARNs, or secrets in committed files.
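
The one-Application-per-directory rule can be previewed locally with a few lines of shell. This is only a sketch: `list_expected_apps` is a hypothetical helper, not a script in this repo, and it assumes each Application is named after its directory, which the ApplicationSet templates may not do.

```sh
# Hypothetical helper: print one name per immediate subdirectory of each
# parent directory passed in. Plain files (READMEs etc.) are ignored.
# Assumption: the Application name equals the directory name.
list_expected_apps() {
    for parent in "$@"; do
        for dir in "$parent"/*/; do
            [ -d "$dir" ] || continue   # glob matched nothing
            basename "$dir"
        done
    done
}

# From the repo root:
# list_expected_apps resources/addons resources/jenkins/master/instances
```

Comparing this list against what ArgoCD reports is a quick way to spot a directory that was added but never reconciled.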

**Account cleanup reapers.** Two scheduled Lambdas from the
[`scheduled-lambda` module](terraform/modules/scheduled-lambda/README.md):
daily unattached-EBS reaper, 5-minute untagged-EC2 + orphan-`eksctl`-stack
reaper. Tunables (schedules, `dry_run`, EKS skip regex) in
[`terraform/locals.tf`](terraform/locals.tf); ops in
[`docs/runbooks/cleanup-reapers.md`](docs/runbooks/cleanup-reapers.md).
## Layout

**GitOps / ArgoCD.** From "ArgoCD healthy" onward, everything in-cluster is
GitOps-managed. A root App-of-Apps fans out to ApplicationSets that reconcile
one Application per addon and one per in-cluster Jenkins instance. No manual
`kubectl` mutations — drift breaks reconciliation.

**Repo CI.** GitHub Actions runs lint + validate only (no plan, no deploy):
one check per job, `ci-gate` aggregate as the single required check. `just ci`
mirrors it locally. Dependabot bumps actions, test deps, and the Jenkins base
image weekly.
| Path | What lives there |
|---|---|
| [`terraform/`](terraform/) | AWS substrate; reusable modules with their own READMEs ([jenkins-arm-fleet](terraform/modules/jenkins-arm-fleet/README.md), [jenkins-arm-standalone](terraform/modules/jenkins-arm-standalone/README.md), [scheduled-lambda](terraform/modules/scheduled-lambda/README.md)); pins in [`versions.tf`](terraform/versions.tf) |
| [`argocd-bootstrap/`](argocd-bootstrap/) | Root Application, ApplicationSets, AppProject |
| [`resources/addons/`](resources/addons/) | One dir = one ArgoCD Application (observability, ingress, SSO, ...) |
| [`resources/jenkins/`](resources/jenkins/) | In-cluster master chart, per-instance values, clouds catalog (rendered by [`scripts/render-clouds.py`](scripts/render-clouds.py), drift-gated in CI) |
| [`images/`](images/) | Container images (controller bundle and friends), built by GitHub Actions |
| [`scripts/`](scripts/) | Verification and render tooling; catalog in [`scripts/README.md`](scripts/README.md) |
| [`docs/`](docs/) | [Architecture](docs/architecture.md), [ADRs](docs/adr/), [runbooks](docs/runbooks/); everything indexed in [`docs/README.md`](docs/README.md) |
| [`justfile`](justfile) | The single entrypoint for CI and every `tofu` operation |

## Quickstart

```sh
just ci # local lint + validate
just ci # local lint + validate (mirrors the PR gate)
just tf-plan # TF plan (writes tfplan)
just tf-apply # TF apply (applies the saved tfplan; never auto-approve)
just tf-apply # apply the saved tfplan; never auto-approve
```

State bucket + lock are pre-created; see [the bootstrap runbook](docs/runbooks/bootstrap-state.md).

## Operating Terraform via the justfile
`AWS_PROFILE` must be exported in your shell; AWS-touching recipes fail loudly
without it. Back up state before risky applies (`just tf-state-backup`). State
bucket bootstrap: [runbook](docs/runbooks/bootstrap-state.md).
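
The preconditions above can be sketched as a small shell wrapper. Treat it as an illustration under stated assumptions: `require_aws_profile` and `pre_apply` are hypothetical names, not recipes or functions this repo is known to define, and the ordering simply chains the `just` recipes mentioned in this README.

```sh
# Hypothetical guard: fail loudly when AWS_PROFILE is unset or empty,
# mirroring how the AWS-touching recipes are described as behaving.
require_aws_profile() {
    if [ -z "${AWS_PROFILE:-}" ]; then
        echo "AWS_PROFILE is not set; export it before running tf-* recipes" >&2
        return 1
    fi
}

# Assumed sequence before a risky change: back up state, then write a plan.
# The apply step is deliberately left out so the saved plan gets reviewed
# before anyone runs `just tf-apply`.
pre_apply() {
    require_aws_profile || return 1
    just tf-state-backup &&
    just tf-plan
}
```

After `pre_apply` succeeds, review the saved plan and run `just tf-apply` by hand.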

The justfile is the single entrypoint for Terraform. Drive every `tofu`
operation through a `just tf-*` recipe; do not run raw `tofu` or `cd terraform`
by hand.

- **`AWS_PROFILE` is required and supplied externally.** Export it in your shell
(e.g. `export AWS_PROFILE=percona-dev-admin`); AWS-touching recipes fail loudly
if it is unset. It is never baked into a default and never set in `terraform/`.
- **Back up state before any risky apply:** run `just tf-state-backup` (timestamped
`tofu state pull`) first. `just tf-state-versioning-check` confirms bucket
versioning is on.
- **`tf-plan` writes `tfplan`; `tf-apply` applies that saved plan** — never
auto-approve. There is no `tf-apply-now`.
- **`-target` / `-exclude` are PLAN-ONLY.** `just tf-plan-masters` scopes a plan to
the per-master modules for inspection; there is no `tf-apply-masters`. Targeting
is for exceptional ops, not routine applies.

## Compute topology

Five tiers, each with a canonical `workload.percona.com/tier` label and
(where exclusive) a matching taint. Workloads opt in via `nodeSelector` +
`tolerations`. `general` is untainted and is the safe fallthrough.

| Tier | Capacity | Hosts |
|---|---|---|
| `bootstrap` | EKS MNG, on-demand, multi-AZ | ArgoCD, Karpenter, AWS LB controller, external-secrets, external-dns, kube-state-metrics |
| `obs-state` | EKS MNG, single-AZ | Stateful single-replica pods that block eviction (Authentik Postgres, Grafana, prometheus-operator CRDs) |
| `jenkins-master` | EKS MNG, on-demand, single-AZ | The in-cluster Jenkins controller (ps3-k8s pilot); pinned to us-east-1a to co-locate with its zonal EBS PVC |
| `lgtm-stateful` | Karpenter NodePool, on-demand, single-AZ | Stateful LGTM pods (Mimir, Loki, Tempo ingesters; store-gateway; compactor; alertmanager). Configured to behave like an MNG (no spot, no consolidation under load, no AMI-drift) while keeping instance-family flex |
| `general` | Karpenter NodePool, spot + on-demand, single-AZ | Stateless LGTM components, Grafana web, the auth web tier, alloy-gateway, anything without an explicit tier |

MNGs handle bootstrap and single-AZ stateful workloads whose PDBs block
eviction. Karpenter handles the higher-volume tiers (LGTM stateful,
stateless), trading multi-AZ HA for EBS-per-pod zonality. Full reasoning in
[the cluster tier taxonomy ADR](docs/adr/0017-cluster-tier-taxonomy-and-lgtm-pinning.md).

## Documentation
## Where the details are

| Topic | Doc |
|---|---|
| Architecture overview | [`docs/architecture.md`](docs/architecture.md) |
| Architecture Decision Records | [`docs/adr/`](docs/adr/) |
| Runbooks (bootstrap, recovery, upgrades) | [`docs/runbooks/`](docs/runbooks/) |

Everything else is indexed in [`docs/README.md`](docs/README.md).
| System architecture and components | [`docs/architecture.md`](docs/architecture.md) |
| Compute tiers, MNG vs Karpenter reasoning | [ADR 0017](docs/adr/0017-cluster-tier-taxonomy-and-lgtm-pinning.md) |
| Observability push pipeline | [`docs/observability.md`](docs/observability.md) |
| EC2 master connectivity and resilience | [`docs/connectivity.md`](docs/connectivity.md), [`docs/ec2-master-resilience.md`](docs/ec2-master-resilience.md) |
| Account cleanup reapers | [`docs/runbooks/cleanup-reapers.md`](docs/runbooks/cleanup-reapers.md) |
| Bootstrap, recovery, upgrades | [`docs/runbooks/`](docs/runbooks/) |
| Every past design decision | [`docs/adr/`](docs/adr/) |

## Contributing

- `just ci` must pass before PR.
- Pre-commit hooks mirror CI ([`.pre-commit-config.yaml`](.pre-commit-config.yaml)).
- `just ci` must pass before PR; pre-commit hooks mirror it ([`.pre-commit-config.yaml`](.pre-commit-config.yaml)).
- Propose architecture changes in [`docs/adr/`](docs/adr/) first.
- Pinned versions live in [`terraform/versions.tf`](terraform/versions.tf); run [`scripts/check_versions.py`](scripts/check_versions.py) before bumping pins.
- Version pins live in [`terraform/versions.tf`](terraform/versions.tf); run [`scripts/check_versions.py`](scripts/check_versions.py) before bumping.
- Commit format: `type(scope): subject`. No AI footers.
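
The commit format can be checked mechanically, for example from a `commit-msg` hook. The function below is a sketch, not a hook shipped in this repo: `check_commit_subject` is a hypothetical name, and the assumption that `type` is lowercase letters and `scope` is lowercase letters, digits, or hyphens goes beyond what this README specifies.

```sh
# Hypothetical check for `type(scope): subject`.
# Only the first line of the message is inspected.
# Assumed grammar: lowercase type, lowercase/digit/hyphen scope,
# a colon and one space, then a non-empty subject.
check_commit_subject() {
    printf '%s\n' "$1" | head -n 1 |
        grep -Eq '^[a-z]+\([a-z0-9-]+\): .+'
}
```

For instance, `check_commit_subject "docs(readme): trim component list"` returns success, while a subject without a scope returns failure.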

## License