[sketch] API for snapshot generation#542

Draft
michalkucharczyk wants to merge 2 commits into
mainfrom
mku-snapshot-api

Conversation

@michalkucharczyk
Contributor

This is just a proposal to agree on the shape of the API.

Comment thread sketch.rs Outdated
/// single node. If consumed by multiple sibling nodes the shared
/// session keys cause equivocation and the libp2p identity causes
/// peer-id collisions.
Full,

Do you have a specific use-case in mind for this? I think for existing tests I have not encountered this yet.

Contributor Author

No. But I can imagine that such a scenario (replicating the entire source network 1:1) will be needed. The API can be ready for this.

Collaborator

In general we always inject the keystore (derived from the node name), but we can also add support for not doing it iff we have this kind of snap. Maybe set Shareable as default?
Thx!

Contributor Author

@michalkucharczyk michalkucharczyk May 25, 2026

One more comment explaining why I proposed Shared / Full:

The problem with the snapshot that was fixed by paritytech/smoldot#3264 was as follows:

  • the snapshot contained the keystore and networking directory,
  • the same snapshot was used for the alice and bob nodes,
  • this led to problems with the liveness of the network (basically the network was not progressing).

This approach was incorrect.

So when generating the snapshot we can take two approaches:

  • we always snapshot the DB directory only, and always regenerate the keys (consensus + p2p),
    OR:
  • we also allow taking a "full" snapshot of every node individually and then replicating the nodes from those snapshots (so we have a per-node snapshot). This could help in scenarios where we want full control over the node's directory content.

For now we only have use-cases for shared snapshots.
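
A minimal sketch of how the two variants discussed in this thread could be modelled, with the shareable one as the default. The enum name, the helper methods, and the exact directory names are assumptions made for illustration; they are not taken from `sketch.rs` or from the SDK.

```rust
/// Hypothetical model of the two snapshot flavours discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SnapshotKind {
    /// DB only. Keys (consensus + p2p) are regenerated per node, so the
    /// same archive can be loaded by several sibling nodes.
    #[default]
    Shareable,
    /// DB plus keystore and networking directory. Meant to be consumed by
    /// exactly one node, otherwise session keys and peer ids are duplicated.
    Full,
}

impl SnapshotKind {
    /// Top-level directories that would go into the archive.
    /// The directory names are assumed, not confirmed by the source.
    pub fn included_dirs(self) -> &'static [&'static str] {
        match self {
            SnapshotKind::Shareable => &["data"],
            SnapshotKind::Full => &["data", "keystore", "network"],
        }
    }

    /// Whether one archive of this kind may be reused by sibling nodes.
    pub fn is_safe_for_siblings(self) -> bool {
        matches!(self, SnapshotKind::Shareable)
    }
}
```

With this shape, callers that do not care about the distinction get `SnapshotKind::default()`, which is the shareable, DB-only variant.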

Contributor Author

Maybe set Shareable as default?

0a04d26

Collaborator

@pepoviola pepoviola May 25, 2026

I think the issue (in smoldot) was a bug in the native provider (https://github.com/paritytech/zombienet-sdk/blob/main/crates/provider/src/native/node.rs#L189-L195), since we are unpacking the snap as the last step before initializing the process. In the other providers (docker/k8s), we first unpack the snap and then apply the _startup files (e.g. keystore).

I will fix this and ping you.

(#543)
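
The ordering problem described in this comment can be illustrated with a toy model in which a node directory is just a map from path to content. Everything here (the `NodeDir` alias, the function names, the file paths and contents) is hypothetical and only mirrors the two orderings; it is not the provider code.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a node's base directory: path -> file content.
type NodeDir = BTreeMap<String, String>;

/// Copies every snapshot entry into the node dir, overwriting existing files.
fn unpack_snapshot(dir: &mut NodeDir, snapshot: &NodeDir) {
    for (path, content) in snapshot {
        dir.insert(path.clone(), content.clone());
    }
}

/// Writes the per-node startup files; here just one key derived from the name.
fn apply_startup_files(dir: &mut NodeDir, node_name: &str) {
    dir.insert(
        "keystore/session".to_string(),
        format!("key-derived-from-{node_name}"),
    );
}

/// Ordering described as buggy: the snapshot is unpacked last, so a keystore
/// shipped inside the snapshot overwrites the injected one.
pub fn prepare_unpack_last(snapshot: &NodeDir, node_name: &str) -> NodeDir {
    let mut dir = NodeDir::new();
    apply_startup_files(&mut dir, node_name);
    unpack_snapshot(&mut dir, snapshot);
    dir
}

/// Ordering described for docker/k8s: unpack first, then apply startup files,
/// so the per-node keystore always wins.
pub fn prepare_unpack_first(snapshot: &NodeDir, node_name: &str) -> NodeDir {
    let mut dir = NodeDir::new();
    unpack_snapshot(&mut dir, snapshot);
    apply_startup_files(&mut dir, node_name);
    dir
}
```

In this model, a snapshot that carries its own `keystore/session` entry survives intact under `prepare_unpack_last`, while `prepare_unpack_first` replaces it with the key derived from the node name. A DB-only snapshot behaves the same under both orderings.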

Contributor Author

But still - the keys should not be there in the snapshot. We should only have DB there if we don't plan to use keys/networking.

We can also skip the snapshot type in the API. What would be your take on this?

Comment thread sketch.rs Outdated
This was referenced May 26, 2026
pepoviola added a commit that referenced this pull request May 29, 2026
Implementation of proposal from #541 (also #542).

Tests that build DB snapshot artifacts (cumulus `full_node_warp_sync`,
smoldot `smoke`/`bulletin`) each hand-roll the same pause → tar →
checksum → bundle dance. This moves it into the SDK.

### What's new

- **`NetworkNode::snapshot_db(out_path)`** — tars a node's DB into a
`.tgz` (`data/`, plus `relay-data/` for cumulus collators). Drops
`keystore/` and `network/` so the archive can be loaded on several
sibling nodes without key/peer-id clashes.
- **`Network::pause()` / `resume()`** — SIGSTOP/SIGCONT all nodes in
parallel, so the DB on disk is consistent while you snapshot.
- **`snapshot::BundleBuilder`** — packs per-node archives + a JSON
`user_data` blob into one `bundle.tar.gz` with a `manifest.json` (sha256
+ size per archive). `build()` won't compile until you've added at least
one archive.
- **`with_optional_default_db_snapshot(Option<…>)`** on the
relay/parachain builders (and `with_optional_db_snapshot` per node) — so
one network builder works for both the fresh and the from-snapshot case
without branching.
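
One common way to get a `build()` that does not compile on an empty builder is the typestate pattern. The following is a minimal sketch of that idea only, assuming marker types `Empty` / `NonEmpty`, string archive names, and a plain `Bundle` result; the SDK's actual `BundleBuilder` types and signatures are not shown here and may differ.

```rust
use std::marker::PhantomData;

/// Marker: no archive has been added yet.
pub struct Empty;
/// Marker: at least one archive has been added.
pub struct NonEmpty;

/// Hypothetical result type, only listing what was collected.
#[derive(Debug, PartialEq)]
pub struct Bundle {
    pub archives: Vec<String>,
    pub user_data: Option<String>,
}

pub struct BundleBuilder<State> {
    archives: Vec<String>,
    user_data: Option<String>,
    _state: PhantomData<State>,
}

impl BundleBuilder<Empty> {
    pub fn new() -> Self {
        BundleBuilder { archives: Vec::new(), user_data: None, _state: PhantomData }
    }
}

impl<State> BundleBuilder<State> {
    /// Adding an archive always yields a `NonEmpty` builder.
    pub fn add(mut self, archive: impl Into<String>) -> BundleBuilder<NonEmpty> {
        self.archives.push(archive.into());
        BundleBuilder {
            archives: self.archives,
            user_data: self.user_data,
            _state: PhantomData,
        }
    }

    /// Attaching user data does not change the state.
    pub fn user_data(mut self, data: impl Into<String>) -> Self {
        self.user_data = Some(data.into());
        self
    }
}

impl BundleBuilder<NonEmpty> {
    /// Only exists once `add` has been called at least once.
    pub fn build(self) -> Bundle {
        Bundle { archives: self.archives, user_data: self.user_data }
    }
}
```

With this sketch, `BundleBuilder::new().build()` is rejected by the compiler because `build` is not defined for `BundleBuilder<Empty>`, while `BundleBuilder::new().add("relay.tgz").build()` is accepted.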

### Example

```rust
// Produce: snapshot a relay validator + a collator, bundle them.
network.pause().await?;
let relay = network.get_node("alice")?.snapshot_db("relay.tgz").await?;
let para  = network.get_node("collator")?.snapshot_db("para.tgz").await?;
network.resume().await?;

let bundle = BundleBuilder::new()
    .add(relay)
    .add(para)
    .user_data(serde_json::json!({ "para_best_block": 42 }))
    .build("bundle.tar.gz")?;          // bundle.path / .sha256 / .size

////////////////////////////////////////////////////////////////////////////////

// Consume: same builder, snapshot optional.
fn network(snapshot: Option<&Path>) -> NetworkConfig {
    NetworkConfigBuilder::new()
        .with_relaychain(|r| {
            r.with_chain("rococo-local")
                .with_default_command("polkadot")
                .with_optional_default_db_snapshot(snapshot)   // None = fresh, Some = resume
                .with_validator(|n| n.with_name("alice"))
        })
        /* … */
        .build().unwrap()
}
```

### Tests

- Unit tests in `crates/sdk/src/snapshot.rs`: bundle build → unpack →
manifest round-trip (no network, runs in CI).
- `crates/configuration`: tests for the three
`with_optional_*db_snapshot` methods.
- `crates/sdk/tests/snapshot_roundtrip.rs` (`#[ignore]`, needs polkadot
binaries): snapshot a live network, bundle it, re-spawn from the
snapshot, and check both chains advance and finalize past the snapshot
height.

### Notes

- `snapshot_db` reads the node's local base dir, so it's native-provider
only.
- Caller must pause the node first, since snapshotting a running node risks a
torn RocksDB state.
- A graceful `terminate()` (instead of `pause`) before snapshotting is
left for a follow-up.
- Open question: does `SIGSTOP` allow a safe DB copy? So far it has worked
well.

---------

Co-authored-by: Javier Viola <javier@parity.io>