Fetch updated attested nodes in bulk #7022

Open
nweisenauer-sap wants to merge 6 commits into spiffe:main from nweisenauer-sap:fetch-nodes

Conversation

@nweisenauer-sap

Contributor

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

Events-based cache

Description of change

This is the attested-node equivalent of #5970, which previously did the same for registration entries.

When the events-based cache processes attested-node events, updateCachedNodes looped over every changed SPIFFE ID and issued two database round-trips per node (FetchAttestedNode followed by GetNodeSelectors). With a large number of node events — e.g. after a full cache reload, or in environments where nodes attest frequently — this turned into thousands of individual queries, contributing to the skipped-event spikes and delayed SVID issuance described in #6876.

This change introduces a bulk DataStore.FetchAttestedNodes(ctx, spiffeIDs) method and reworks the cache update to fetch changed nodes in pages (using the existing pageSize) rather than one-at-a-time, so N node events no longer mean up to 2N queries.
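
As a rough illustration of the query-count difference, here is a minimal sketch. The function names and the numbers in `main` are hypothetical and chosen only for the arithmetic; they are not SPIRE defaults or measurements.

```go
package main

import "fmt"

// perNodeQueries models the worst case of the old path: one FetchAttestedNode
// plus one GetNodeSelectors round-trip for every changed node.
func perNodeQueries(changed int) int {
	return 2 * changed
}

// pagedQueries models the new path: one bulk fetch per page of changed IDs.
func pagedQueries(changed, pageSize int) int {
	if changed <= 0 || pageSize <= 0 {
		return 0
	}
	return (changed + pageSize - 1) / pageSize // ceiling division
}

func main() {
	// Illustrative values only (assumption, not taken from the PR).
	const changed, pageSize = 5000, 500
	fmt.Println("per-node queries:", perNodeQueries(changed))
	fmt.Println("paged queries:   ", pagedQueries(changed, pageSize))
}
```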

Specifically:

  • Add FetchAttestedNodes(ctx, spiffeIDs []string) (map[string]*common.AttestedNode, error) to the DataStore interface, with a corresponding BySpiffeIDs filter on ListAttestedNodesRequest.
  • Implement it in the SQL datastore by reusing the existing attested-node listing query with BySpiffeIDs + FetchSelectors, so each node is returned together with its selectors in a single query (the IN (...) filter is added to both the CTE and MySQL query builders).
  • Add the method to the metrics wrapper and the fake datastores.
  • Add a pageSize to the attested-nodes cache and rewrite updateCachedNodes to page through the changed SPIFFE IDs and call FetchAttestedNodes once per page; a requested ID that is absent from the response is treated as a deletion. This mirrors updateCachedEntries from #5970 (Fetch updated cache entries in bulk).
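
The paging and "absent means deleted" behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: `node`, `bulkFetcher`, `mapStore`, `refresh`, and `demo` are hypothetical names, and the real method returns `map[string]*common.AttestedNode`.

```go
package main

import (
	"context"
	"fmt"
)

// node is a hypothetical stand-in for *common.AttestedNode.
type node struct{ SpiffeID string }

// bulkFetcher is a hypothetical one-method slice of the DataStore interface.
type bulkFetcher interface {
	FetchAttestedNodes(ctx context.Context, spiffeIDs []string) (map[string]*node, error)
}

// mapStore is an in-memory fake: IDs it does not hold are missing from the result.
type mapStore map[string]*node

func (s mapStore) FetchAttestedNodes(_ context.Context, ids []string) (map[string]*node, error) {
	out := make(map[string]*node, len(ids))
	for _, id := range ids {
		if n, ok := s[id]; ok {
			out[id] = n
		}
	}
	return out, nil
}

// refresh pages through the changed IDs, issuing one bulk fetch per page.
// A requested ID that is absent from the response is reported as deleted.
func refresh(ctx context.Context, ds bulkFetcher, changed []string, pageSize int) ([]string, []string, error) {
	if pageSize <= 0 {
		return nil, nil, fmt.Errorf("page size must be positive, got %d", pageSize)
	}
	var updated, deleted []string
	for start := 0; start < len(changed); start += pageSize {
		end := start + pageSize
		if end > len(changed) {
			end = len(changed)
		}
		page := changed[start:end]
		nodes, err := ds.FetchAttestedNodes(ctx, page)
		if err != nil {
			return updated, deleted, err
		}
		for _, id := range page {
			if _, ok := nodes[id]; ok {
				updated = append(updated, id)
			} else {
				deleted = append(deleted, id)
			}
		}
	}
	return updated, deleted, nil
}

// demo runs refresh against a store that no longer holds node "b".
func demo() ([]string, []string, error) {
	store := mapStore{
		"spiffe://example.org/a": &node{SpiffeID: "spiffe://example.org/a"},
		"spiffe://example.org/c": &node{SpiffeID: "spiffe://example.org/c"},
	}
	changed := []string{"spiffe://example.org/a", "spiffe://example.org/b", "spiffe://example.org/c"}
	return refresh(context.Background(), store, changed, 2)
}

func main() {
	updated, deleted, err := demo()
	fmt.Println(updated, deleted, err)
}
```

With a page size of 2, the three changed IDs above take two bulk calls instead of one fetch per node.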

Which issue this PR fixes

Partially addresses #6876 (companion to #6994).

Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
@nweisenauer-sap nweisenauer-sap marked this pull request as ready for review June 8, 2026 08:36
@nweisenauer-sap nweisenauer-sap requested a review from evan2645 as a code owner June 8, 2026 08:36
Copilot AI review requested due to automatic review settings June 8, 2026 08:36

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a batched attested-node fetch API and updates the authorized entry fetcher to page through node refreshes using the new batch call.

Changes:

  • Introduces FetchAttestedNodes() to the datastore.DataStore interface and implements it in the SQL store, fakes, and telemetry wrapper.
  • Updates attested-nodes cache refresh to fetch nodes in pages (reducing per-node datastore calls).
  • Adds/updates tests to cover paging behavior and invalid page sizes.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

File | Description
--- | ---
test/fakes/fakedatastore/fakedatastore.go | Adds fake FetchAttestedNodes() support for tests.
pkg/server/endpoints/authorized_entryfetcher.go | Wires a pageSize argument into attested-nodes cache construction.
pkg/server/endpoints/authorized_entryfetcher_test.go | Updates cache build test for new function signature.
pkg/server/endpoints/authorized_entryfetcher_attested_nodes.go | Implements paged/batched node refresh via FetchAttestedNodes().
pkg/server/endpoints/authorized_entryfetcher_attested_nodes_test.go | Adds page-size validation coverage and a multi-page refresh scenario.
pkg/server/datastore/datastore.go | Extends datastore interface and list request with BySpiffeIDs.
pkg/server/datastore/sqlstore/sqlstore.go | Implements FetchAttestedNodes() and adds BySpiffeIDs filtering to list queries.
pkg/server/datastore/sqlstore/sqlstore_test.go | Adds SQL store tests for FetchAttestedNodes() semantics (including deleted/missing IDs).
pkg/common/telemetry/server/datastore/wrapper.go | Adds metric-wrapped FetchAttestedNodes() call.
pkg/common/telemetry/server/datastore/wrapper_test.go | Extends wrapper tests to cover the new method.

Comment thread pkg/server/datastore/sqlstore/sqlstore.go Outdated
Comment on lines +222 to 228
  spiffeIds := slices.Collect(maps.Keys(a.fetchNodes))
  for pageStart := 0; pageStart < len(spiffeIds); pageStart += int(a.pageSize) {
  	fetchNodes := a.fetchNodesPage(spiffeIds, pageStart)
  	nodes, err := a.ds.FetchAttestedNodes(ctx, fetchNodes)
  	if err != nil {
- 		continue
+ 		return err
  	}

Contributor Author

This is intentional, to stay consistent with updateCachedEntries (the registration-entry equivalent from #5970), which also returns on a failed page fetch.

There's no data loss or extended-stale risk: IDs are only removed from fetchNodes on success or confirmed deletion, so a failed page (and any later pages) stays queued and is retried on the next reload tick. It also surfaces the error in logs/telemetry, whereas the old per-node continue silently swallowed persistent failures — relevant since this PR targets the skipped-event spikes in #6876.
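
A minimal sketch of that retry property, under stated assumptions: `fetchFunc`, `drain`, and `failOnCall` are hypothetical names, the pending set stands in for `fetchNodes`, and the IDs are sorted only to make the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// fetchFunc is a hypothetical stand-in for a bulk datastore call.
type fetchFunc func(ids []string) (map[string]bool, error)

// drain processes pending IDs page by page. An ID leaves the pending set only
// after its page was fetched successfully; on error the failed page and all
// later pages stay queued for the next attempt.
func drain(pending map[string]struct{}, pageSize int, fetch fetchFunc) error {
	if pageSize <= 0 {
		return errors.New("page size must be positive")
	}
	ids := make([]string, 0, len(pending))
	for id := range pending {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic order for the example only
	for start := 0; start < len(ids); start += pageSize {
		end := start + pageSize
		if end > len(ids) {
			end = len(ids)
		}
		page := ids[start:end]
		if _, err := fetch(page); err != nil {
			return err
		}
		for _, id := range page {
			delete(pending, id) // found or confirmed missing: either way, handled
		}
	}
	return nil
}

// failOnCall returns a fetchFunc whose n-th invocation fails once.
func failOnCall(n int) fetchFunc {
	calls := 0
	return func(ids []string) (map[string]bool, error) {
		calls++
		if calls == n {
			return nil, errors.New("datastore unavailable")
		}
		found := make(map[string]bool, len(ids))
		for _, id := range ids {
			found[id] = true
		}
		return found, nil
	}
}

func main() {
	pending := map[string]struct{}{"a": {}, "b": {}, "c": {}, "d": {}}
	fetch := failOnCall(2)
	fmt.Println(drain(pending, 2, fetch), len(pending)) // second page fails, two IDs stay queued
	fmt.Println(drain(pending, 2, fetch), len(pending)) // retry succeeds, queue is empty
}
```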

I'd prefer to keep the two cache paths symmetric, but I'm happy to switch to per-page skip if you'd rather change both here and in the entries path.

Comment thread pkg/server/endpoints/authorized_entryfetcher_attested_nodes.go
Comment thread pkg/server/datastore/datastore.go
Comment on lines +134 to +138
func (w metricsWrapper) FetchAttestedNodes(ctx context.Context, spiffeIDs []string) (_ map[string]*common.AttestedNode, err error) {
	callCounter := StartFetchNodeCall(w.m)
	defer callCounter.Done(&err)
	return w.ds.FetchAttestedNodes(ctx, spiffeIDs)
}

Contributor Author

Agreed this would improve observability. I've left it sharing StartFetchNodeCall for now to match the existing convention — the bulk FetchRegistrationEntries similarly reuses StartFetchRegistrationCall.

To keep telemetry consistent, I'd suggest introducing dedicated batch metrics for both bulk methods (plus the corresponding telemetry_config.md entries) in a separate follow-up rather than diverging here. Let me know if you'd prefer it in this PR.
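
For readers unfamiliar with the wrapper pattern under discussion, here is a minimal sketch of why the wrapper uses a named `err` result with `defer callCounter.Done(&err)`: the deferred call reads the error through a pointer, so it observes the value finally returned. The `callCounter` type, the `record` callback, and the metric name here are hypothetical simplifications, not SPIRE's telemetry API.

```go
package main

import (
	"errors"
	"fmt"
)

// callCounter is a hypothetical, simplified stand-in for a telemetry call counter.
type callCounter struct {
	name   string
	record func(name string, failed bool)
}

// Done inspects the error through a pointer, so it sees whatever value the
// wrapped function ends up returning.
func (c callCounter) Done(errp *error) {
	c.record(c.name, errp != nil && *errp != nil)
}

// wrapped mimics the shape of a metrics-wrapped datastore method: the named
// err result is set by the return statement before the deferred Done runs.
func wrapped(fail bool, record func(string, bool)) (_ int, err error) {
	c := callCounter{name: "fetch_nodes", record: record} // metric name is illustrative
	defer c.Done(&err)
	if fail {
		return 0, errors.New("datastore unavailable")
	}
	return 3, nil
}

func main() {
	report := func(name string, failed bool) {
		fmt.Printf("%s failed=%v\n", name, failed)
	}
	_, _ = wrapped(false, report)
	_, _ = wrapped(true, report)
}
```

A dedicated batch metric would follow the same shape with a different counter; only the name passed to the counter changes.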

Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
MarcosDY previously approved these changes Jun 20, 2026
@MarcosDY MarcosDY added this to the 1.15.2 milestone Jun 20, 2026
@MarcosDY MarcosDY added this pull request to the merge queue Jun 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 20, 2026
@MarcosDY

Collaborator

Sorry @nweisenauer-sap, but there are some conflicts. Can you please resolve them?

Thanks!

Signed-off-by: nweisenauer-sap <137267159+nweisenauer-sap@users.noreply.github.com>
3 participants