RedisAI · kei-nan · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026
diff --git a/docs/design/hnsw-snapshot/README.md b/docs/design/hnsw-snapshot/README.md
@@ -0,0 +1,53 @@
+# HNSW snapshot iterator — design
+
+Design exploration for a **consistent, lock-free snapshot iterator** over HNSW:
+construct an iterator under the index read lock, release the lock, and iterate
+while concurrent inserts/deletes proceed — with results consistent as of
+construction time. Mechanism: persistent (rooted) copy-on-write where the
+**clone decision is generation-tag-driven** and **reclamation is refcount-driven**
+(superseded versions free themselves when the last snapshot referencing them
+drops); slot recycling (SWAP) is gated by a separate reclaim horizon.
+
+The foundational change (the block deep-copy primitive `ElementGraphData::copyTo`
+/ `ElementLevelData::copyInto`) is implemented in
+`src/VecSim/algorithms/hnsw/graph_data.h` on this branch.
+
+## Contents
+
+| file | what it is |
+|---|---|
+| `proposal.md`   | Why + what (the user-visible surface). No code. |
+| `design.md`     | Implementation plan: subsystems touched, data model, alternatives considered/rejected, testing strategy, open questions. |
+| `design-note.md`| Deeper technical exploration: why "filter the live graph" fails, the chosen `shared_ptr<vector<Block{shared_ptr buf}>>` structure, prototype + benchmark + POC findings, costs. |
+| `tasks.md`      | Phased implementation checklist (phase 1 = the `copyTo` primitive, already started). |
+| `spec-delta.md` | Behavior spec: requirements + WHEN/THEN scenarios. |
+| `prototypes/`   | Standalone C++ programs (below). |
+
+## Prototypes
+
+All build with `g++ -O2 -std=c++20 -pthread`. `snapshot_poc.cpp` additionally
+links two `src/VecSim/memory/*.cpp` files (see its header comment).
+
+| prototype | proves |
+|---|---|
+| `cow_snapshot.cpp`          | rooted-COW snapshot mechanism: O(1) capture, consistent lock-free reads under concurrent writes, automatic cleanup |
+| `read_indirection_bench.cpp`| read-path cost of candidate layouts — the per-block `shared_ptr` scatter is the cost; contiguous headers + refcounted buffer read at flat speed |
+| `block_deepcopy.cpp`        | the `copyTo` deep-copy algorithm (FAM links + fresh mutex + independent incoming-edges) on a multi-level node |
+| `snapshot_poc.cpp`          | **end-to-end feasibility** using the REAL `ElementGraphData` + `copyTo`: snapshot stays consistent through millions of concurrent writes; clean under TSan/ASan/UBSan |
+
+## Status
+
+The **vecsim integration is implemented and validated** (a stacked series of
+branches, `snapshot/01`–`10`): the COW block storage + generation-tag clone, the
+O(1)-ish snapshot capture, the lock-free KNN/batch iterator (plain **and**
+tiered, opt-in), concurrent-insert safety (atomic block-COW), the swap reclaim
+horizon, and COW-before-lock repair so deletes' graph repair runs under live
+cursors. Full unit suite green; the concurrent insert+delete+repair+cursor repro
+is clean under ASan. The original prototypes (below) proved feasibility first.
+
+**Not yet built** (tracked in `tasks.md` Phases 5–7): RediSearch wiring
+(`hybrid_reader.c`, holding the snapshot across `FT.CURSOR READ`, a string
+param-resolver token), the snapshot memory/lifetime cap + `FT.INFO`/`FT.DEBUG`
+observability, range-query snapshot support, and production-scale benchmarks.
+`spec-delta.md` uses the OpenSpec requirement/scenario shape from the RediSearch
+repo, which is framework-neutral.
diff --git a/docs/design/hnsw-snapshot/design-note.md b/docs/design/hnsw-snapshot/design-note.md
@@ -0,0 +1,217 @@
+# Design: Consistent lock-free snapshot iterator for HNSW (VecSim)
+
+Status: **draft / prototype**. This is a design exploration, not a committed plan.
+
+## Problem & requirement
+
+A query/batch iterator over an HNSW index should be:
+
+1. **Constructed under the index read lock**, then **release the lock**, and
+2. **iterate while concurrent insertions/deletions run**, ideally
+3. returning results **consistent with the index state at construction time**.
+
+Today the multi-threaded path (tiered index `acquireSharedLocks`,
+`hnsw_tiered.h`) holds the shared lock for the *whole* query precisely because
+lock-free traversal during writes is unsafe: the block backbone can be
+reallocated, and slot reuse (SWAP) / in-place link rewrites can pull data out
+from under a reader. Releasing the lock is exactly the guarantee we must
+replace.
+
+## Why "filter the live graph" cannot work
+
+HNSW is **lossy on write**: inserting a node can *evict* an edge between two
+existing nodes (`revisitNeighborConnections` → `getNeighborsByHeuristic2` →
+`setLinks`), and delete/repair rewrites neighbor lists in place
+(`removeLink`, swap-remove). So a point-in-time view cannot be reconstructed by
+reading the live structure and filtering (by doc-id watermark, by prefix of an
+"append-only" list, etc.) — the old information is destroyed. **A snapshot must
+retain the old bytes.** The only real choice is the retention mechanism.
+
+## Chosen mechanism: persistent (rooted) copy-on-write
+
+Store the graph blocks behind a **shared, rooted structure**. Prototype B
+(below) settled the exact shape: block headers must stay **contiguous** (a
+`vector<shared_ptr<DataBlock>>` scatters them across the heap and costs +25–60%
+on traversal). The structure that keeps flat-speed reads *and* per-block COW is
+contiguous headers each holding a **refcounted data buffer**:
+
+```
+shared_ptr< vector< Block > >          // root over a CONTIGUOUS header vector
+   where  Block { shared_ptr<Buffer> data; }   // per-block refcounted buffer
+```
+
+- **root** (`shared_ptr<...>`) — swapped rarely (once per generation that writes).
+- **header vector** — contiguous; copied (headers only, cheap) on backbone COW.
+- **block buffer** (`shared_ptr<Buffer>`) — the edges; COW'd per block on write.
+
+Reads are `root` (hoisted) → contiguous `vec[b]` → `data.get()` → buffer — the
+same access shape as today's flat `vector<DataBlock>`.
+
+### Operations
+
+- **Snapshot capture** (under the *write*/exclusive lock — see "Locking model"):
+  hand out a generation id, **fork the backbone** (copy the header vector,
+  `N/blockSize` pointers, so the live index keeps mutating a private copy while
+  the snapshot owns the captured one), and snapshot the vector-block base
+  pointers. O(#blocks); no element/vector data is copied. Release the lock, then
+  iterate through the captured handle with **no lock**.
+- **No snapshot active**: every version's generation post-dates all (zero) live
+  snapshots → writers mutate in place. **Zero overhead — current behavior.**
+- **Backbone after capture**: the fork is stamped the *current* generation, so a
+  concurrent writer's backbone-COW check is a no-op — the live backbone is never
+  re-forked under the shared lock and stays structurally stable for lock-free
+  readers. Thereafter only per-block buffers are COW'd.
+- **Content write to a shared block**: clone the block buffer (deep-copy via
+  `copyTo`) iff a live snapshot can still see it, repoint its slot, mutate the
+  copy. Under concurrent tiered inserts (which mutate the live graph under the
+  *shared* lock + per-node mutexes) the slot is repointed with an **atomic
+  compare-exchange**, so two writers racing to COW the same block can't tear the
+  handle or lose a fork (the generation stamp makes the CAS idempotent — the
+  loser adopts the winner's fork).
+- **Cleanup**: **automatic** via refcounts. When the snapshot is dropped, old
+  blocks/backbones with no remaining references free themselves. No
+  hand-rolled epoch/horizon/retire-queue.
+
+**Clone decision: a generation tag, not `use_count`.** The initial sketch keyed
+the clone on `shared_ptr` `use_count` (`> 1` ⇒ shared ⇒ clone). That breaks: a
+container-level `use_count` goes **stale after the first COW** — the live root is
+unshared again while a snapshot still holds the *old* one — so it would wrongly
+mutate in place and corrupt the snapshot. Instead every version carries the
+**generation it was last written in**; a write clones iff `newestLive() >= gen`
+(some live snapshot predates it) and stamps the clone with the current
+generation. `use_count` is reserved purely for **reclamation**.
+
+**Locking model (revised for concurrency).** The first design assumed all writes
+run under the *exclusive* lock and capture under the *shared* lock (mutually
+exclusive), so the clone decision would need no atomics. That holds for the plain
+index, but the **tiered** index wires concurrent inserts into the live graph
+under the *shared* `mainIndexGuard` + per-node mutexes. So the implementation
+moved two things: **capture runs under the *exclusive* lock** (a brief quiescence
+point — no writer is mid-insert — where the backbone fork + generation handout
+can't race an in-flight insert), and **per-block COW publishes via an atomic
+CAS** so concurrent inserts COW'ing the same block are safe. After capture the
+reader still touches only its private immutable handle — no per-node locks, and
+no atomics on the snapshot read path.
+
+## Consistency guarantee
+
+The shipped guarantee is **strict for graph topology + membership + vectors**,
+**weak (best-effort) for per-element metadata flags**:
+
+- **Strict as-of-construction** for the graph the snapshot traverses: the set of
+  ids that existed at capture and their links/levels, plus each visible id's
+  vector bytes, are frozen — concurrent inserts COW, so they never perturb the
+  captured backbone/buffers. This is the property the requirement asks for and
+  the one the lock-free reader depends on.
+- **Weak** for the deleted / in-process **flags**: `markDelete` flips the flag in
+  place on the single-version metadata, so a delete that lands after capture MAY
+  or MAY NOT be observed by the reader (it tolerates either value). This matches
+  the live index's existing weak flag consistency; making it strict would mean
+  versioning the metadata flag too, which we chose not to do (see
+  `spec-delta.md`, "Deleted-flag consistency").
+
+A cheaper **safety-only** variant (drop COW entirely, just capture the backbone +
+gate slot reuse: "no crashes, no new nodes after T, but edges among old nodes may
+drift") was the recorded fallback; the implementation took the consistent COW
+version above.
+
+## Costs & open problems
+
+1. **Read-path indirection.** `root → backbone[i] → block → data` adds one
+   pointer hop vs today's flat `vector<DataBlock>` offset math, and the block
+   headers stop being contiguous. *The element-data locality is unchanged*
+   (it already lives in a separate `DataBlock::data` heap buffer). Net cost must
+   be measured — see prototype B.
+2. **Block must be deep-copyable for block-COW.** `ElementGraphData` embeds a
+   `std::mutex` (can't be copied) and a pointer to separately-allocated upper
+   levels. **Resolved by an explicit `copyTo` deep-copy** (re-inits a fresh mutex
+   on the copy, deep-copies each level's incoming-edges, walks the `others` upper
+   levels) — the mutex **stays in the struct**. The originally-sketched "move the
+   mutexes into a side array keyed by id" was implemented as a spike and
+   **rejected**: a stable side-array lock fixes lock *identity* but repair still
+   writes non-COW'd buffers, so it was neither necessary nor sufficient — the real
+   requirement is to **COW every touched block before taking any per-node lock**
+   (so a held lock's buffer can't be forked out from under it). With COW-before-
+   lock in place, deep-copy + in-struct mutex is sufficient and churns zero call
+   sites. (The mutex-identity hazard, and why the side array doesn't solve it, is
+   the central finding of the delete/repair work — see `design.md` "Subsystems".)
+3. **Pinned snapshot → retained memory.** A long-lived snapshot (open
+   `FT.CURSOR`) keeps old blocks alive; COW'd blocks double until it drops.
+   Mitigation = cap snapshot lifetime. (Same issue under any retention scheme.)
+4. **Granularity.** Block-COW copies a whole ~200 KB block to preserve one
+   node's edges. Fine if writes touch few blocks during a snapshot; approaches a
+   full copy if writes are spread across the whole index.
+
+## Feasibility: PROVEN (end-to-end POC)
+
+`prototypes/snapshot_poc.cpp` is a minimal graph index built on the **real
+VecSim `ElementGraphData`/`ElementLevelData`** (it `#include`s the production
+`graph_data.h`) and the **real `copyTo` primitive** added in Phase 1, with the
+rooted-COW block storage. It demonstrates the full requirement end to end:
+construct a snapshot under a read lock, release it, and traverse lock-free while
+a writer thread adds nodes and rewires edges (COW-ing real blocks via `copyTo`).
+
+Results (compiled against the real headers; `memory/vecsim_malloc.cpp` +
+`memory/vecsim_base.cpp` linked):
+
+- Snapshot query stayed **bit-identical** across thousands of lock-free reads
+  during ~1–2.4M concurrent writer ops; the live graph provably diverged.
+- **ThreadSanitizer: clean** (no data race) — the lock-free read path is sound.
+- **ASan/UBSan/LeakSan: clean** — the `copyTo` FAM logic and the refcounted
+  buffer deleter are memory-safe, no leaks.
+- Allocator bytes grew while the snapshot was held (pinned COW versions) and
+  **dropped on release** (automatic reclamation).
+
+This answers "is it implementable against VecSim's actual node representation?"
+— yes. The remaining work is integration scope (wiring COW through the real
+HNSW read/write paths, the batch iterator, the tiered index, and RediSearch),
+not a feasibility unknown.
+
+## Prototype findings
+
+Prototypes are in `prototypes/` (standalone). A/B/C use `g++ -O2 -std=c++20`;
+the POC additionally links two VecSim `memory/*.cpp` files (see its header).
+
+- **C — `block_deepcopy.cpp`:** validates the `copyTo` algorithm (FAM links +
+  fresh mutex + independent incoming-edges) on a multi-level node. PASSED.
+
+**A — `cow_snapshot.cpp` (mechanism): PASSED.** O(1) capture; a snapshot stayed
+bit-identical (checksum invariant) through **2.4M concurrent writes** with no
+lock during iteration; the live index diverged; dropping the snapshot freed
+exactly the COW'd-away versions automatically (276 → 212 live blocks). Confirms
+the refcount-driven COW is correct, lock-free for readers, and self-cleaning.
+
+**B — `read_indirection_bench.cpp` (read cost): layout matters a lot.**
+Dependent random walk, 1M elements, dim-128 distance per visit:
+
+| layout | lookup-only | vs flat |
+|---|---|---|
+| `vector<DataBlock>` (today, flat) | ~9.9 ns | — |
+| `shared_ptr<vector<Block>>` (blocks by value) | ~8.2 ns | ~0% |
+| **`shared_ptr<vector<Block{shared_ptr buf}>>`** (chosen) | ~9.0 ns | **~0%** |
+| `shared_ptr<vector<shared_ptr<Block>>>` | ~15.7 ns | **+58%** |
+
+With a realistic distance calc per visit, the scattered `shared_ptr<Block>`
+layout still cost **+23%** end-to-end. **Conclusion: the cost is the per-block
+`shared_ptr` heap scatter, not the root indirection.** Contiguous headers +
+refcounted buffer recover flat-speed reads, so the chosen structure imposes
+**no read regression** while keeping per-block COW.
+
+### Decision recorded: why not a "mutate-in-place + per-snapshot overlay"
+
+An overlay (keep the live index flat, copy a block into a side map just before a
+writer mutates it in place) would also avoid the read tax. **Rejected:** a
+lock-free snapshot reader that misses the overlay and reads the live block can
+**race** a writer that inserts-then-mutates that same block, yielding a
+*newer-than-T* read. In-place mutation is not cleanly lock-free. The rooted COW
+wins because snapshots read **immutable** versions — writers only ever create
+*new* blocks, never touch a block a snapshot holds. This is the property that
+makes "release the lock and iterate during writes" actually safe.
+
+## Fallback
+
+If the trivially-copyable-block refactor (cost #2) proves too invasive, fall
+back to **safety-only weak consistency**: keep the flat layout, capture the
+backbone + `curElementCount` under the lock, gate slot reuse by an owner
+horizon. Lock-free and crash-safe, excludes nodes added after T, but does not
+isolate edge drift among existing nodes.