Skip to content

HNSW: copy-on-write snapshots (snapshot series 01-10)#982

Open
kei-nan wants to merge 10 commits into
RedisAI:mainfrom
kei-nan:snapshot/10-repair-cow-safety
Open

HNSW: copy-on-write snapshots (snapshot series 01-10)#982
kei-nan wants to merge 10 commits into
RedisAI:mainfrom
kei-nan:snapshot/10-repair-cow-safety

Conversation

@kei-nan

@kei-nan kei-nan commented Jun 24, 2026

Copy link
Copy Markdown

Summary

Adds copy-on-write (COW) snapshot support to HNSW, enabling lock-free reads over a consistent point-in-time view of the graph while writers continue to mutate it concurrently. Built up as a 10-step snapshot series:

  1. 01 — block deep-copy primitive + design docs
  2. 02 — rooted copy-on-write block storage (RootedCowStore<T>)
  3. 03 — immutable snapshot handle + capture API
  4. 04 — COW-aware insert path (snapshots survive concurrent writes)
  5. 05 — lock-free KNN over a captured snapshot
  6. 06 — snapshot-backed batch iterator (plain HNSW, opt-in)
  7. 07 — tiered cursor captures-then-releases the main lock
  8. 08 — concurrent-writer safety (lock-free block COW + atomic metadata)
  9. 09 — swap-side reclaim horizon
  10. 10 — repair COW-safety: run graph repair under live snapshots

Test plan

  • New/updated unit tests: test_hnsw.cpp, test_hnsw_multi.cpp, test_hnsw_tiered.cpp, test_svs.cpp
  • ~4,578 insertions across 29 files

🤖 Generated with Claude Code


Note

High Risk
Touches core HNSW graph storage, concurrent insert/delete/repair, and tiered swap GC; mistakes could cause use-after-free or consistency bugs, though the default path is unchanged and the feature is opt-in with heavy test coverage.

Overview
Adds an opt-in (useGraphSnapshotIterator) path where HNSW batch/KNN reads capture a point-in-time graph under a brief exclusive lock, then traverse without holding the index lock while inserts, deletes, and tiered GC continue.

Storage & COW: Replaces flat graphDataBlocks with rooted copy-on-write (GraphDataBlocks / RootedCowStore): contiguous block headers + refcounted buffers, generation-tagged clone decisions (SnapshotRegistry), and ElementGraphData::copyTo for per-block deep copies. MetaDataStore COWs metadata on resize; vectors are pinned via captured per-block pointers in HNSWGraphSnapshot.

Writers: Inserts/repair route through getGraphDataByInternalIdForWrite; repair COWs all touched nodes before per-node locks. Per-block forks publish with atomic CAS for concurrent tiered inserts. With no live snapshot, behavior stays in-place (no COW overhead).

Iteration: topKFromSnapshot, dedicated HNSWSnapshot_*_BatchIterator, and tiered snapshot_mode (capture under exclusive lock, release main lock for cursor lifetime). SWAP compaction is gated by maxVisibleCount() reclaim horizon so slots visible to snapshots are not recycled in place.

Extensive design docs, standalone prototypes, serializer updates, and unit/concurrency tests (including tiered snapshot stress). RediSearch cursor wiring, config caps, and range-query snapshots are documented as not yet in this PR.

Reviewed by Cursor Bugbot for commit 71bd317. Bugbot is set up for automated code reviews on this repo. Configure here.

kei-nan and others added 10 commits June 11, 2026 21:13
ElementGraphData::copyTo / ElementLevelData::copyInto: value-semantic deep copy
of a graph block (fresh mutex, independent incoming-edges, FAM-aware), so a block
buffer can later be copied-on-write without sharing mutable state. Zero call-site
churn (the mutex stays in ElementGraphData; the copy never aliases its lock).

First branch of the snapshot-iterator stack. Includes the full design docs for
the feature (design, proposal, spec-delta, prototypes, task tracker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the flat vector<DataBlock> graph storage with a rooted, refcounted
copy-on-write container built on a single reusable RootedCowStore<T>
(shared_ptr<T> root + generation tag + SnapshotRegistry). The same store backs
the per-id metadata via MetaDataStore, so the COW machinery is written once and
shared by both per-id containers (the graph and idToMetaData).

This branch introduces the storage + COW primitives and routes the
read/grow/shrink/serialize paths through them. With no snapshot live every
version is unshared, so cowBackbone/cowBlock are no-ops and all mutations happen
in place: behavior and cost are identical to the previous flat storage. Snapshot
capture and COW-aware writes land in later branches.

Pure refactor: the full existing suite stays green (171 hnsw + 7 allocator,
including the updated memory-accounting test for the new block layout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWGraphSnapshot (an immutable, point-in-time view: the captured backbone
root + scalar entry/level/count state + captured vector-block base pointers +
captured metadata root) and HNSWIndex::captureGraphSnapshot(), which builds one
under the index read lock in O(1) (a shared_ptr bump, no element allocation) and
registers a live generation id whose token deregisters on the last handle drop.

Test snapshotHandleCapture: capture is allocation-free, records the right
entry-point/level/count, and resolves node access through the captured backbone
to the same topology the live index holds. Isolation under concurrent writes
depends on the copy-on-write write path and is exercised in the next branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… writes)

Add the copy-on-write write accessors getGraphDataByInternalIdForWrite /
getElementLevelDataForWrite, which copy the backbone and the touched block out of
any live snapshot before returning a writable pointer (no-op when no snapshot is
live, decided by the generation tag rather than use_count). Route the graph
mutation sites through them: mutuallyConnectNewElement, SwapLastIdWithDeletedId,
mutuallyRemoveNeighborAtPos. Also expose graphSnapshotActive() (live-id registry
gate) for the SWAP-deferral consumer in the next branch.

Tests: cowStorageSnapshotIsolation (a COW-aware write copies the shared block and
leaves the snapshot unchanged; dropping the snapshot reclaims it; an unshared
write stays in place with no allocation), snapshotHandleCapture (now asserts
isolation after a post-capture grow + write), snapshotLockFreeConsistency,
snapshotPreservedUnderConcurrentInserts (real concurrent inserts can't perturb a
captured snapshot), cowDecisionByGenerationNotUseCount, snapshotPinsUntraversedBlock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWIndex::topKFromSnapshot(snap, query, k, params): the full two-phase HNSW
KNN (greedy upper-level descent + ef-bounded level-0 best-first) reading ALL
state through the immutable snapshot — graph topology via the captured backbone,
vector data via the captured per-block pointers, and deleted/in-process flags +
labels via the captured (frozen) metadata root — with no per-node locks and a
local visited set. It touches no live graph, vectors, metadata, or mutex, so it
runs after the index read lock is released and stays point-in-time consistent
while writers proceed.

Tests: lockFreeSnapshotQuery (same results as the live topKQuery on the same
graph; stays consistent while a writer COWs links underneath) and
lockFreeSnapshotQueryDuringInserts (writer does real addVectors, growing all
three per-id containers, yet the lock-free reader's results never change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWSnapshot_BatchIterator (+ single/multi subclasses): a batch iterator that
captures an immutable graph snapshot once under the read lock in newBatchIterator,
then iterates completely lock-free. It mirrors the resumable two-phase search of
HNSW_BatchIterator (greedy descent to a level-0 entry point, then ef-bounded
best-first expansion carrying candidates/extras/lower_bound across batches), but
reads exactly like topKFromSnapshot: links via snap.getLevelData, vectors via
snap.getVectorData, deleted/in-process flags + labels from the captured frozen
metadata, with a private vector<bool> visited set and NO per-node locks. The live
HNSW_BatchIterator and its hot scanGraphInternal are left untouched (zero perf/risk
to the default path).

Opt-in via HNSWRuntimeParams.useGraphSnapshotIterator (default false, zero-init
preserves the live path). HNSWIndex_Single/Multi::newBatchIterator capture the
snapshot under lockSharedIndexDataGuard() then dispatch to the snapshot iterator.

Tests (plain single + multi): snapshotIterator_EquivalenceWithLive (flag on == off,
ids + scores in order), snapshotIterator_InsertDuringIterationInvisible
(point-in-time: post-capture inserts never appear; growth + COW mid-iteration is
safe), snapshotIterator_Reset.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the snapshot-backed backend iterator into TieredHNSW_BatchIterator. When the
useGraphSnapshotIterator flag is set, the cursor takes the main index lock only
long enough for the backend newBatchIterator to capture its immutable snapshot,
then releases it immediately and iterates lock-free. The backend iterator's
liveToken keeps the graph snapshot live (graphSnapshotActive) for the cursor's
whole life, so ingestion (flat->HNSW) and GC are no longer blocked by the cursor.
The depletion / destructor / reset paths skip the now-redundant unlock in snapshot
mode (the lock was released right after capture). The flat (frontend) side keeps
its short per-batch flatIndexGuard locking.

Tests: tieredSnapshotIteratorEquivalence (flag on == off on a fully-ingested
index, single + multi) and tieredSnapshotIngestionProgressesWhileCursorOpen — a
single-threaded driver that ingests flat->HNSW while a snapshot cursor is open and
asserts the backend grows (this would self-deadlock under the live iterator, which
holds the shared lock); the cursor stays point-in-time (only the as-of-capture
population is returned).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mic metadata)

Make a live snapshot safe against *concurrent* writers, not just a single
writer holding the lock. Two independent tears are closed:

- Graph blocks: cowBlock now forks the live buffer with an atomic
  compare-exchange (generation-tag idempotent), so two inserts COW'ing the
  same block can't free a buffer a lock-free reader is mid-scan on. blockData
  returns a raw pointer (gated on a snapshot being live) to avoid shared_ptr
  refcount churn on the hot read path. This is the fix for the use-after-free
  the concurrent insert + snapshot repro hit.

- Metadata: MetaDataStore is read through an atomic published pointer, so a
  reader never observes a torn 16-byte shared_ptr while a writer republishes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A SWAP recycles a deleted slot by overwriting its vector + metadata in place,
which copy-on-write does not version — so a live snapshot must not let a slot
it can still read be recycled. A snapshot only ever reads ids below its
captured curElementCount, so the reclaim *horizon* is the max curElementCount
over all live snapshots: a swap whose freed id is >= the horizon is invisible
to every snapshot and runs immediately; one below it stays a tombstone until
the cursors that could see it are released. horizon == 0 (no snapshot) ⇒
everything recyclable. This keeps compaction flowing under a long-lived cursor
instead of stalling all reclamation globally.

The horizon is tracked by SnapshotRegistry (max captured count over live
snapshots) and gated in executeReadySwapJobs under the exclusive main lock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pshots

Graph repair (repairNodeConnections) rewrites a node's links in place. Under a
live snapshot that is unsafe: the COW-aware link update would move a node's
neighbor-guard mutex out from under a held lock and mutate a buffer the snapshot
still shares. The fix makes repair copy-on-write-safe: mutuallyUpdateForRepairedNode
COWs every node it will touch BEFORE taking any link lock, so the snapshot keeps
its frozen view and the live rewrite is consistent. removeVectorInPlace captures
the locked candidate id so lock/unlock can't target different ids when the best
candidate is reassigned mid-loop.

With repair COW-safe, repairs run even while a snapshot is live — they are no
longer deferred — so the reclaim horizon (snapshot 09) is no longer inert: deletes'
repairs complete and their swaps become ready, then the horizon decides which
recycle now (id >= horizon) and which wait for the cursor to close (id < horizon).

Adds tieredSnapshotRepairRunsHorizonRecycles and the concurrent
parallelBatchIteratorSearchSnapshot repro (insert + delete + repair + lock-free
cursors), validated under ASan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jit-ci

jit-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 71bd317. Configure here.


// Move the last element's data to the deleted element's place
auto element = getGraphDataByInternalId(element_internal_id);
auto element = getGraphDataByInternalIdForWrite(element_internal_id);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swap path skips graph COW

High Severity

SwapLastIdWithDeletedId and the start of removeAndSwap still rewrite neighbor link lists and incoming-edge sets through read-only graph accessors, so those blocks are mutated in place while a live graph snapshot may still be reading them. Inserts and repair were moved to COW-before-write, but swap compaction was not, so lock-free snapshot iteration can observe graph topology that diverges from capture time (including when tiered GC runs swaps above the reclaim horizon).

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 71bd317. Configure here.

auto *last_element_data = getDataByInternalId(curElementCount);
DataBlock &last_gd_block = graphDataBlocks.back();
auto last_element = (ElementGraphData *)last_gd_block.removeAndFetchLastElement();
auto last_element = (ElementGraphData *)graphDataBlocks.removeAndFetchLastElement();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plain swap ignores live snapshots

High Severity

Plain HNSW still calls removeAndSwap from synchronous delete paths without checking graphSnapshotActive() or the reclaim horizon, so a concurrent opt-in snapshot iterator can keep reading a slot while SwapLastIdWithDeletedId overwrites that id’s graph, vector, and metadata in place. Tiered swap jobs defer below-horizon ids, but the non-tiered path never gates compaction while snapshots are live.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 71bd317. Configure here.

@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants