[sketch] API for snapshot generation #542
    /// single node. If consumed by multiple sibling nodes the shared
    /// session keys cause equivocation and the libp2p identity causes
    /// peer-id collisions.
    Full,
Do you have a specific use-case in mind for this? I think for existing tests I have not encountered this yet.
No, but I can imagine that such a scenario (replicating the entire source network 1:1) will be needed. The API can be ready for it.
In general we always inject the keystore (derived from the node name), but we can also add support for not doing it iff we have this kind of snapshot. Maybe set Shareable as the default?
Thx!
One more comment explaining why I proposed Shared / Full:
The problem with the snapshot that was fixed by paritytech/smoldot#3264 was as follows:
- the snapshot contained the keystore and networking directories,
- the same snapshot was used for the alice and bob nodes,
- this led to problems with the liveness of the network (basically, the network was not progressing).
This approach was incorrect.
So when generating the snapshot we can take one of two approaches:
- we always snapshot the DB directory only, and always regenerate the keys (consensus + p2p),
- OR: we also allow taking a "full" snapshot of every node individually and then replicate the nodes from those snapshots (so we have a per-node snapshot). This could help in scenarios where we want full control over the node's directory content.
For now we only have use-cases for shared snapshots.
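The two approaches above could be modelled roughly as follows. This is a minimal sketch, not the SDK's API: the `SnapshotKind` enum, the `included_dirs` and `can_share` helpers, and the choice of `Shared` as the default are assumptions made for illustration; the directory names `data`, `keystore`, and `network` are the ones mentioned in this PR.

```rust
/// Hypothetical snapshot kinds mirroring the Shared / Full proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum SnapshotKind {
    /// DB only; consensus keys and the p2p identity are regenerated per node.
    #[default]
    Shared,
    /// Whole node directory; meant to be consumed by exactly one node.
    Full,
}

/// Top-level directories that would go into the archive (assumed layout).
fn included_dirs(kind: SnapshotKind) -> &'static [&'static str] {
    match kind {
        SnapshotKind::Shared => &["data"],
        SnapshotKind::Full => &["data", "keystore", "network"],
    }
}

/// A snapshot can be loaded on several sibling nodes only if it carries
/// neither session keys nor a libp2p identity.
fn can_share(kind: SnapshotKind) -> bool {
    !included_dirs(kind)
        .iter()
        .any(|dir| *dir == "keystore" || *dir == "network")
}

fn main() {
    for kind in [SnapshotKind::Shared, SnapshotKind::Full] {
        println!(
            "{:?}: dirs={:?}, shareable={}",
            kind,
            included_dirs(kind),
            can_share(kind)
        );
    }
}
```

With a split like this, a caller that reuses one archive for alice and bob could be rejected early when the archive is `Full`, instead of failing later with a stalled network.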
> Maybe set Shareable as default?
I think the issue (in smoldot) was a bug in the native provider (https://github.com/paritytech/zombienet-sdk/blob/main/crates/provider/src/native/node.rs#L189-L195), since we unpack the snapshot as the last step before initializing the process. In the other providers (docker/k8s), we first unpack the snapshot and then apply the _startup files (e.g. the keystore).
I will fix this and ping you.
(#543)
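To illustrate why the ordering matters, here is a simplified in-memory model, not the provider's actual code. The `Dir` type, the helper functions, and the `keystore/aura` file name are all hypothetical; the only thing taken from the discussion is that both steps write into the same node directory, so whichever runs last wins.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a node's base directory: path -> file content.
type Dir = BTreeMap<String, String>;

/// Copy every file from the snapshot into the node dir, overwriting existing files.
fn unpack_snapshot(dir: &mut Dir, snapshot: &Dir) {
    for (path, content) in snapshot {
        dir.insert(path.clone(), content.clone());
    }
}

/// Write a node-specific keystore entry (derived from the node name in this model).
fn apply_startup_files(dir: &mut Dir, node_name: &str) {
    dir.insert("keystore/aura".to_string(), format!("key-of-{node_name}"));
}

/// Prepare a node directory with the two steps in either order.
fn prepare_node(node_name: &str, snapshot: &Dir, unpack_last: bool) -> Dir {
    let mut dir = Dir::new();
    if unpack_last {
        apply_startup_files(&mut dir, node_name);
        unpack_snapshot(&mut dir, snapshot);
    } else {
        unpack_snapshot(&mut dir, snapshot);
        apply_startup_files(&mut dir, node_name);
    }
    dir
}

/// A snapshot taken from alice that (incorrectly) still contains her key.
fn sample_snapshot() -> Dir {
    let mut snapshot = Dir::new();
    snapshot.insert("data/db".to_string(), "chain-state".to_string());
    snapshot.insert("keystore/aura".to_string(), "key-of-alice".to_string());
    snapshot
}

fn main() {
    let snapshot = sample_snapshot();
    let good = prepare_node("bob", &snapshot, false);
    let bad = prepare_node("bob", &snapshot, true);
    println!("unpack then inject: {}", good["keystore/aura"]);
    println!("inject then unpack: {}", bad["keystore/aura"]);
}
```

In this model, unpacking last leaves bob with alice's key, which matches the equivocation problem described earlier in the thread. Unpacking first hides the problem but does not remove the stale key material from the archive itself.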
But still, the keys should not be in the snapshot. We should only have the DB there if we don't plan to use the keys/networking.
We can also skip the snapshot type in the API. What would be your take on this?
Implementation of the proposal from #541 (also #542). Tests that build DB snapshot artifacts (cumulus `full_node_warp_sync`, smoldot `smoke`/`bulletin`) each hand-roll the same pause → tar → checksum → bundle dance. This moves it into the SDK.

### What's new

- **`NetworkNode::snapshot_db(out_path)`**: tars a node's DB into a `.tgz` (`data/`, plus `relay-data/` for cumulus collators). Drops `keystore/` and `network/` so the archive can be loaded on several sibling nodes without key/peer-id clashes.
- **`Network::pause()` / `resume()`**: SIGSTOP/SIGCONT all nodes in parallel, so the DB on disk is consistent while you snapshot.
- **`snapshot::BundleBuilder`**: packs per-node archives + a JSON `user_data` blob into one `bundle.tar.gz` with a `manifest.json` (sha256 + size per archive). `build()` won't compile until you've added at least one archive.
- **`with_optional_default_db_snapshot(Option<…>)`** on the relay/parachain builders (and `with_optional_db_snapshot` per node): one network builder works for both the fresh and the from-snapshot case without branching.

### Example

```rust
// Produce: snapshot a relay validator + a collator, bundle them.
network.pause().await?;
let relay = network.get_node("alice")?.snapshot_db("relay.tgz").await?;
let para = network.get_node("collator")?.snapshot_db("para.tgz").await?;
network.resume().await?;

let bundle = BundleBuilder::new()
    .add(relay)
    .add(para)
    .user_data(serde_json::json!({ "para_best_block": 42 }))
    .build("bundle.tar.gz")?;
// bundle.path / .sha256 / .size

////////////////////////////////////////////////////////////////////////////////
// Consume: same builder, snapshot optional.
fn network(snapshot: Option<&Path>) -> NetworkConfig {
    NetworkConfigBuilder::new()
        .with_relaychain(|r| {
            r.with_chain("rococo-local")
                .with_default_command("polkadot")
                .with_optional_default_db_snapshot(snapshot) // None = fresh, Some = resume
                .with_validator(|n| n.with_name("alice"))
        })
        /* … */
        .build()
        .unwrap()
}
```

### Tests

- Unit tests in `crates/sdk/src/snapshot.rs`: bundle build → unpack → manifest round-trip (no network, runs in CI).
- `crates/configuration`: tests for the three `with_optional_*db_snapshot` methods.
- `crates/sdk/tests/snapshot_roundtrip.rs` (`#[ignore]`, needs polkadot binaries): snapshot a live network, bundle it, re-spawn from the snapshot, and check both chains advance and finalize past the snapshot height.

## Notes

- `snapshot_db` reads the node's local base dir, so it's native-provider only.
- The caller must pause the node first: snapshotting a running node risks a torn RocksDB state.
- A graceful `terminate()` (instead of `pause`) before snapshotting is left for a follow-up.
- Open question: whether `SIGSTOP` allows a safe DB copy. So far it has worked well.

---------

Co-authored-by: Javier Viola <javier@parity.io>
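The claim that `build()` won't compile until at least one archive has been added suggests a type-state builder. The sketch below shows one way such a guarantee can be expressed in Rust. It is an assumption about the technique, not the SDK's implementation: the `Empty` / `NonEmpty` markers are hypothetical, and `add` and `build` are simplified here to work on plain strings without writing any files.

```rust
use std::marker::PhantomData;

/// Marker types for the builder state (hypothetical names).
struct Empty;
struct NonEmpty;

/// Simplified bundle builder; `State` tracks at compile time whether
/// at least one archive has been added.
struct BundleBuilder<State> {
    archives: Vec<String>,
    _state: PhantomData<State>,
}

impl BundleBuilder<Empty> {
    /// A fresh builder starts in the `Empty` state.
    fn new() -> Self {
        BundleBuilder { archives: Vec::new(), _state: PhantomData }
    }
}

impl<State> BundleBuilder<State> {
    /// Adding an archive from any state yields a `NonEmpty` builder.
    fn add(mut self, archive: &str) -> BundleBuilder<NonEmpty> {
        self.archives.push(archive.to_string());
        BundleBuilder { archives: self.archives, _state: PhantomData }
    }
}

impl BundleBuilder<NonEmpty> {
    /// Only defined for `NonEmpty`, so an empty bundle cannot be built.
    /// Returns the archive list in insertion order (no I/O in this sketch).
    fn build(self) -> Vec<String> {
        self.archives
    }
}

fn main() {
    let bundle = BundleBuilder::new().add("relay.tgz").add("para.tgz").build();
    println!("{bundle:?}");

    // The next line is rejected by the compiler, because `build` does not
    // exist on `BundleBuilder<Empty>`:
    // BundleBuilder::new().build();
}
```

The trade-off of this pattern is that the builder's type changes after the first `add`, so it cannot be stored in a single mutable variable across a loop without some extra handling.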
This is just a proposal to agree on the shape of the API.