HNSW: copy-on-write snapshots (snapshot series 01-10)#982
Conversation
ElementGraphData::copyTo / ElementLevelData::copyInto: value-semantic deep copy of a graph block (fresh mutex, independent incoming-edges, FAM-aware), so a block buffer can later be copied-on-write without sharing mutable state. Zero call-site churn (the mutex stays in ElementGraphData; the copy never aliases its lock). First branch of the snapshot-iterator stack. Includes the full design docs for the feature (design, proposal, spec-delta, prototypes, task tracker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the flat vector<DataBlock> graph storage with a rooted, refcounted copy-on-write container built on a single reusable RootedCowStore<T> (shared_ptr<T> root + generation tag + SnapshotRegistry). The same store backs the per-id metadata via MetaDataStore, so the COW machinery is written once and shared by both per-id containers (the graph and idToMetaData). This branch introduces the storage + COW primitives and routes the read/grow/shrink/serialize paths through them. With no snapshot live every version is unshared, so cowBackbone/cowBlock are no-ops and all mutations happen in place: behavior and cost are identical to the previous flat storage. Snapshot capture and COW-aware writes land in later branches. Pure refactor: the full existing suite stays green (171 hnsw + 7 allocator, including the updated memory-accounting test for the new block layout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWGraphSnapshot (an immutable, point-in-time view: the captured backbone root + scalar entry/level/count state + captured vector-block base pointers + captured metadata root) and HNSWIndex::captureGraphSnapshot(), which builds one under the index read lock in O(1) (a shared_ptr bump, no element allocation) and registers a live generation id whose token deregisters on the last handle drop. Test snapshotHandleCapture: capture is allocation-free, records the right entry-point/level/count, and resolves node access through the captured backbone to the same topology the live index holds. Isolation under concurrent writes depends on the copy-on-write write path and is exercised in the next branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… writes) Add the copy-on-write write accessors getGraphDataByInternalIdForWrite / getElementLevelDataForWrite, which copy the backbone and the touched block out of any live snapshot before returning a writable pointer (no-op when no snapshot is live, decided by the generation tag rather than use_count). Route the graph mutation sites through them: mutuallyConnectNewElement, SwapLastIdWithDeletedId, mutuallyRemoveNeighborAtPos. Also expose graphSnapshotActive() (live-id registry gate) for the SWAP-deferral consumer in the next branch. Tests: cowStorageSnapshotIsolation (a COW-aware write copies the shared block and leaves the snapshot unchanged; dropping the snapshot reclaims it; an unshared write stays in place with no allocation), snapshotHandleCapture (now asserts isolation after a post-capture grow + write), snapshotLockFreeConsistency, snapshotPreservedUnderConcurrentInserts (real concurrent inserts can't perturb a captured snapshot), cowDecisionByGenerationNotUseCount, snapshotPinsUntraversedBlock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWIndex::topKFromSnapshot(snap, query, k, params): the full two-phase HNSW KNN (greedy upper-level descent + ef-bounded level-0 best-first) reading ALL state through the immutable snapshot — graph topology via the captured backbone, vector data via the captured per-block pointers, and deleted/in-process flags + labels via the captured (frozen) metadata root — with no per-node locks and a local visited set. It touches no live graph, vectors, metadata, or mutex, so it runs after the index read lock is released and stays point-in-time consistent while writers proceed. Tests: lockFreeSnapshotQuery (same results as the live topKQuery on the same graph; stays consistent while a writer COWs links underneath) and lockFreeSnapshotQueryDuringInserts (writer does real addVectors, growing all three per-id containers, yet the lock-free reader's results never change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add HNSWSnapshot_BatchIterator (+ single/multi subclasses): a batch iterator that captures an immutable graph snapshot once under the read lock in newBatchIterator, then iterates completely lock-free. It mirrors the resumable two-phase search of HNSW_BatchIterator (greedy descent to a level-0 entry point, then ef-bounded best-first expansion carrying candidates/extras/lower_bound across batches), but reads exactly like topKFromSnapshot: links via snap.getLevelData, vectors via snap.getVectorData, deleted/in-process flags + labels from the captured frozen metadata, with a private vector<bool> visited set and NO per-node locks. The live HNSW_BatchIterator and its hot scanGraphInternal are left untouched (zero perf/risk to the default path). Opt-in via HNSWRuntimeParams.useGraphSnapshotIterator (default false, zero-init preserves the live path). HNSWIndex_Single/Multi::newBatchIterator capture the snapshot under lockSharedIndexDataGuard() then dispatch to the snapshot iterator. Tests (plain single + multi): snapshotIterator_EquivalenceWithLive (flag on == off, ids + scores in order), snapshotIterator_InsertDuringIterationInvisible (point-in-time: post-capture inserts never appear; growth + COW mid-iteration is safe), snapshotIterator_Reset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the snapshot-backed backend iterator into TieredHNSW_BatchIterator. When the useGraphSnapshotIterator flag is set, the cursor takes the main index lock only long enough for the backend newBatchIterator to capture its immutable snapshot, then releases it immediately and iterates lock-free. The backend iterator's liveToken keeps the graph snapshot live (graphSnapshotActive) for the cursor's whole life, so ingestion (flat->HNSW) and GC are no longer blocked by the cursor. The depletion / destructor / reset paths skip the now-redundant unlock in snapshot mode (the lock was released right after capture). The flat (frontend) side keeps its short per-batch flatIndexGuard locking. Tests: tieredSnapshotIteratorEquivalence (flag on == off on a fully-ingested index, single + multi) and tieredSnapshotIngestionProgressesWhileCursorOpen — a single-threaded driver that ingests flat->HNSW while a snapshot cursor is open and asserts the backend grows (this would self-deadlock under the live iterator, which holds the shared lock); the cursor stays point-in-time (only the as-of-capture population is returned). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mic metadata) Make a live snapshot safe against *concurrent* writers, not just a single writer holding the lock. Two independent tears are closed: - Graph blocks: cowBlock now forks the live buffer with an atomic compare-exchange (generation-tag idempotent), so two inserts COW'ing the same block can't free a buffer a lock-free reader is mid-scan on. blockData returns a raw pointer (gated on a snapshot being live) to avoid shared_ptr refcount churn on the hot read path. This is the fix for the use-after-free the concurrent insert + snapshot repro hit. - Metadata: MetaDataStore is read through an atomic published pointer, so a reader never observes a torn 16-byte shared_ptr while a writer republishes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A SWAP recycles a deleted slot by overwriting its vector + metadata in place, which copy-on-write does not version — so a live snapshot must not let a slot it can still read be recycled. A snapshot only ever reads ids below its captured curElementCount, so the reclaim *horizon* is the max curElementCount over all live snapshots: a swap whose freed id is >= the horizon is invisible to every snapshot and runs immediately; one below it stays a tombstone until the cursors that could see it are released. horizon == 0 (no snapshot) ⇒ everything recyclable. This keeps compaction flowing under a long-lived cursor instead of stalling all reclamation globally. The horizon is tracked by SnapshotRegistry (max captured count over live snapshots) and gated in executeReadySwapJobs under the exclusive main lock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pshots Graph repair (repairNodeConnections) rewrites a node's links in place. Under a live snapshot that is unsafe: the COW-aware link update would move a node's neighbor-guard mutex out from under a held lock and mutate a buffer the snapshot still shares. The fix makes repair copy-on-write-safe: mutuallyUpdateForRepairedNode COWs every node it will touch BEFORE taking any link lock, so the snapshot keeps its frozen view and the live rewrite is consistent. removeVectorInPlace captures the locked candidate id so lock/unlock can't target different ids when the best candidate is reassigned mid-loop. With repair COW-safe, repairs run even while a snapshot is live — they are no longer deferred — so the reclaim horizon (snapshot 09) is no longer inert: deletes' repairs complete and their swaps become ready, then the horizon decides which recycle now (id >= horizon) and which wait for the cursor to close (id < horizon). Adds tieredSnapshotRepairRunsHorizonRecycles and the concurrent parallelBatchIteratorSearchSnapshot repro (insert + delete + repair + lock-free cursors), validated under ASan. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 71bd317. Configure here.
|
|
||
| // Move the last element's data to the deleted element's place | ||
| auto element = getGraphDataByInternalId(element_internal_id); | ||
| auto element = getGraphDataByInternalIdForWrite(element_internal_id); |
There was a problem hiding this comment.
Swap path skips graph COW
High Severity
SwapLastIdWithDeletedId and the start of removeAndSwap still rewrite neighbor link lists and incoming-edge sets through read-only graph accessors, so those blocks are mutated in place while a live graph snapshot may still be reading them. Inserts and repair were moved to COW-before-write, but swap compaction was not, so lock-free snapshot iteration can observe graph topology that diverges from capture time (including when tiered GC runs swaps above the reclaim horizon).
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 71bd317. Configure here.
| auto *last_element_data = getDataByInternalId(curElementCount); | ||
| DataBlock &last_gd_block = graphDataBlocks.back(); | ||
| auto last_element = (ElementGraphData *)last_gd_block.removeAndFetchLastElement(); | ||
| auto last_element = (ElementGraphData *)graphDataBlocks.removeAndFetchLastElement(); |
There was a problem hiding this comment.
Plain swap ignores live snapshots
High Severity
Plain HNSW still calls removeAndSwap from synchronous delete paths without checking graphSnapshotActive() or the reclaim horizon, so a concurrent opt-in snapshot iterator can keep reading a slot while SwapLastIdWithDeletedId overwrites that id’s graph, vector, and metadata in place. Tiered swap jobs defer below-horizon ids, but the non-tiered path never gates compaction while snapshots are live.
Reviewed by Cursor Bugbot for commit 71bd317. Configure here.
|
|


Summary
Adds copy-on-write (COW) snapshot support to HNSW, enabling lock-free reads over a consistent point-in-time view of the graph while writers continue to mutate it concurrently. Built up as a 10-step snapshot series:
RootedCowStore<T>)Test plan
test_hnsw.cpp,test_hnsw_multi.cpp,test_hnsw_tiered.cpp,test_svs.cpp🤖 Generated with Claude Code
Note
High Risk
Touches core HNSW graph storage, concurrent insert/delete/repair, and tiered swap GC; mistakes could cause use-after-free or consistency bugs, though the default path is unchanged and the feature is opt-in with heavy test coverage.
Overview
Adds an opt-in (
useGraphSnapshotIterator) path where HNSW batch/KNN reads capture a point-in-time graph under a brief exclusive lock, then traverse without holding the index lock while inserts, deletes, and tiered GC continue.Storage & COW: Replaces flat
graphDataBlockswith rooted copy-on-write (GraphDataBlocks/RootedCowStore): contiguous block headers + refcounted buffers, generation-tagged clone decisions (SnapshotRegistry), andElementGraphData::copyTofor per-block deep copies.MetaDataStoreCOWs metadata on resize; vectors are pinned via captured per-block pointers inHNSWGraphSnapshot.Writers: Inserts/repair route through
getGraphDataByInternalIdForWrite; repair COWs all touched nodes before per-node locks. Per-block forks publish with atomic CAS for concurrent tiered inserts. With no live snapshot, behavior stays in-place (no COW overhead).Iteration:
topKFromSnapshot, dedicatedHNSWSnapshot_*_BatchIterator, and tieredsnapshot_mode(capture under exclusive lock, release main lock for cursor lifetime). SWAP compaction is gated bymaxVisibleCount()reclaim horizon so slots visible to snapshots are not recycled in place.Extensive design docs, standalone prototypes, serializer updates, and unit/concurrency tests (including tiered snapshot stress). RediSearch cursor wiring, config caps, and range-query snapshots are documented as not yet in this PR.
Reviewed by Cursor Bugbot for commit 71bd317. Bugbot is set up for automated code reviews on this repo. Configure here.