Fetch updated attested nodes in bulk#7022
Conversation
Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a batched attested-node fetch API and updates the authorized entry fetcher to page through node refreshes using the new batch call.
Changes:
- Introduces
FetchAttestedNodes()to thedatastore.DataStoreinterface and implements it in the SQL store, fakes, and telemetry wrapper. - Updates attested-nodes cache refresh to fetch nodes in pages (reducing per-node datastore calls).
- Adds/updates tests to cover paging behavior and invalid page sizes.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/fakes/fakedatastore/fakedatastore.go | Adds fake FetchAttestedNodes() support for tests. |
| pkg/server/endpoints/authorized_entryfetcher.go | Wires a pageSize argument into attested-nodes cache construction. |
| pkg/server/endpoints/authorized_entryfetcher_test.go | Updates cache build test for new function signature. |
| pkg/server/endpoints/authorized_entryfetcher_attested_nodes.go | Implements paged/batched node refresh via FetchAttestedNodes(). |
| pkg/server/endpoints/authorized_entryfetcher_attested_nodes_test.go | Adds page-size validation coverage and a multi-page refresh scenario. |
| pkg/server/datastore/datastore.go | Extends datastore interface and list request with BySpiffeIDs. |
| pkg/server/datastore/sqlstore/sqlstore.go | Implements FetchAttestedNodes() and adds BySpiffeIDs filtering to list queries. |
| pkg/server/datastore/sqlstore/sqlstore_test.go | Adds SQL store tests for FetchAttestedNodes() semantics (including deleted/missing IDs). |
| pkg/common/telemetry/server/datastore/wrapper.go | Adds metric-wrapped FetchAttestedNodes() call. |
| pkg/common/telemetry/server/datastore/wrapper_test.go | Extends wrapper tests to cover the new method. |
| spiffeIds := slices.Collect(maps.Keys(a.fetchNodes)) | ||
| for pageStart := 0; pageStart < len(spiffeIds); pageStart += int(a.pageSize) { | ||
| fetchNodes := a.fetchNodesPage(spiffeIds, pageStart) | ||
| nodes, err := a.ds.FetchAttestedNodes(ctx, fetchNodes) | ||
| if err != nil { | ||
| continue | ||
| return err | ||
| } |
There was a problem hiding this comment.
This is intentional, to stay consistent with updateCachedEntries (the registration-entry equivalent from #5970), which also returns on a failed page fetch.
There's no data loss or extended-stale risk: IDs are only removed from fetchNodes on success or confirmed deletion, so a failed page (and any later pages) stays queued and is retried on the next reload tick. It also surfaces the error in logs/telemetry, whereas the old per-node continue silently swallowed persistent failures — relevant since this PR targets the skipped-event spikes in #6876.
I'd prefer to keep the two cache paths symmetric, but I'm happy to switch to per-page skip if you'd rather change both here and in the entries path.
| func (w metricsWrapper) FetchAttestedNodes(ctx context.Context, spiffeIDs []string) (_ map[string]*common.AttestedNode, err error) { | ||
| callCounter := StartFetchNodeCall(w.m) | ||
| defer callCounter.Done(&err) | ||
| return w.ds.FetchAttestedNodes(ctx, spiffeIDs) | ||
| } |
There was a problem hiding this comment.
Agreed this would improve observability. I've left it sharing StartFetchNodeCall for now to match the existing convention — the bulk FetchRegistrationEntries similarly reuses StartFetchRegistrationCall.
To keep telemetry consistent, I'd suggest introducing dedicated batch metrics for both bulk methods (plus the corresponding telemetry_config.md entries) in a separate follow-up rather than diverging here. Let me know if you'd prefer it in this PR.
Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
|
Sorry @nweisenauer-sap, but there are some conflicts, can you plz resolve? Thanks! |
Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
Pull Request check list
Affected functionality
Events based cache
Description of change
This is the attested-node equivalent of #5970, which previously did the same for registration entries.
When the events-based cache processes attested-node events,
updateCachedNodeslooped over every changed SPIFFE ID and issued two database round-trips per node (FetchAttestedNodefollowed byGetNodeSelectors). With a large number of node events — e.g. after a full cache reload, or in environments where nodes attest frequently — this turned into thousands of individual queries, contributing to the skipped-event spikes and delayed SVID issuance described in #6876.This change introduces a bulk
DataStore.FetchAttestedNodes(ctx, spiffeIDs)method and reworks the cache update to fetch changed nodes in pages (using the existingpageSize) rather than one-at-a-time, so N node events no longer mean up to 2N queries.Specifically:
FetchAttestedNodes(ctx, spiffeIDs []string) (map[string]*common.AttestedNode, error)to theDataStoreinterface, with a correspondingBySpiffeIDsfilter onListAttestedNodesRequest.BySpiffeIDs+FetchSelectors, so each node is returned together with its selectors in a single query (theIN (...)filter is added to both the CTE and MySQL query builders).pageSizeto the attested-nodes cache and rewriteupdateCachedNodesto page through the changed SPIFFE IDs and callFetchAttestedNodesonce per page; a requested ID that is absent from the response is treated as a deletion. This mirrorsupdateCachedEntriesfrom Fetch updated cache entries in bulk #5970.Which issue this PR fixes
Partially addresses #6876 (companion to #6994).