Robust Syncing Protocol for Dynamic Primary and Replica Node Topologies #65

@mudigal

Problem Statement

As the web3-storage network scales, the current architecture faces significant challenges in maintaining data consistency during topology changes. Specifically, when the set of Primary or Replica nodes is dynamic (e.g., adding nodes, migrating providers, or recovering from downtime), there is no formalized protocol to ensure that state and data remain synchronized across the cluster.

Without a robust syncing mechanism, new nodes enter a "data gap" state where they are unable to participate in consensus or fulfill retrieval requests for historical data.

Syncing Scenarios & Requirements

We have identified four critical sync vectors that need to be addressed:

1. Primary $\leftrightarrow$ Primary Synchronization

When a new Primary node is introduced to increase write-throughput or redundancy:

  • Challenge: The new node must ingest the current state of storage proofs and indexing metadata without halting the network's ability to process new incoming writes.
  • Goal: Achieve state-parity among all authoritative write-nodes.

2. Primary $\rightarrow$ Replica Synchronization

When scaling the read-layer or adding a new replica:

  • Challenge: Replicas need to pull large blobs of data from Primaries. This creates high bandwidth pressure on the Primary nodes.
  • Goal: Implement a streaming transfer protocol that supports resume-on-failure.
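A minimal sketch of what resume-on-failure could look like, assuming the Primary can serve a blob by byte offset. `FlakySource`, `resumable_fetch`, and `CHUNK_SIZE` are hypothetical names used only for illustration and are not part of any existing API:

```python
CHUNK_SIZE = 4  # tiny for illustration; a real transfer would use far larger chunks


class FlakySource:
    """Hypothetical stand-in for a Primary that serves a blob by byte offset."""

    def __init__(self, blob, fail_at_calls):
        self.blob = blob
        self.fail_at_calls = set(fail_at_calls)  # call numbers that simulate a drop
        self.calls = 0

    def read(self, offset, size):
        self.calls += 1
        if self.calls in self.fail_at_calls:
            raise ConnectionError("simulated network drop")
        return self.blob[offset:offset + size]


def resumable_fetch(source, total_size, max_retries=5):
    """Pull a blob chunk by chunk, resuming from the last received offset."""
    received = bytearray()
    retries = 0
    while len(received) < total_size:
        try:
            chunk = source.read(len(received), CHUNK_SIZE)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # retry from len(received), not from zero
        if not chunk:
            raise EOFError("source ended before total_size bytes arrived")
        received.extend(chunk)
    return bytes(received)


blob = b"storage-proof-segment"
source = FlakySource(blob, fail_at_calls={2, 4})
result = resumable_fetch(source, len(blob))
```

The point of the sketch is that a failed read costs one chunk of rework rather than the whole blob, which is what keeps bandwidth pressure on the Primary bounded under flaky links.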

3. Intra-Replica (P2P) Synchronization

To optimize network health:

  • Challenge: Replicas should be able to sync from other nearby Replicas rather than always hitting the Primary layer.
  • Goal: Reduce the sync load on Primary nodes, which otherwise grows as $O(n)$ where $n$ is the number of replicas.
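One possible source-selection rule, sketched under the assumption that each node knows its peers' roles, a latency estimate, and whether they hold the wanted segment. The `Peer` fields and `choose_sync_source` are hypothetical and only illustrate the "prefer a nearby Replica, fall back to a Primary" idea:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    peer_id: str
    role: str          # "primary" or "replica" (assumed labels)
    latency_ms: float  # assumed to come from some measurement layer
    has_segment: bool  # whether the peer advertises the wanted segment


def choose_sync_source(peers):
    """Pick the closest Replica holding the segment; use a Primary only as fallback."""
    candidates = [p for p in peers if p.has_segment]
    if not candidates:
        return None
    # Replicas sort before Primaries (False < True); ties break on latency.
    return min(candidates, key=lambda p: (p.role == "primary", p.latency_ms))


peers = [
    Peer("primary-1", "primary", 5.0, True),
    Peer("replica-1", "replica", 40.0, True),
    Peer("replica-2", "replica", 12.0, True),
    Peer("replica-3", "replica", 3.0, False),
]
chosen = choose_sync_source(peers)
```

With this ordering a Primary is contacted only when no Replica holds the segment, even if the Primary is closer.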

4. Provider Migration

When a provider node is decommissioned or replaced:

  • Challenge: Handing over "ownership" of data segments and their associated cryptographic proofs to a new node.

Technical Analysis & Proposed Approach

Data Discovery and Identification

We should leverage libp2p's discovery mechanisms to identify "sync-capable" peers. Data should be identified via Content Identifiers (CIDs) to ensure that the data received is exactly what was requested.
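To illustrate why content addressing guarantees that received data matches the request, here is a deliberately simplified sketch. It uses a bare hex SHA-256 digest as a stand-in identifier; a real CID additionally encodes a version, a codec, and a multihash prefix, none of which are modeled here. `content_id` and `verify_block` are hypothetical helper names:

```python
import hashlib


def content_id(data: bytes) -> str:
    """Simplified stand-in for a CID: the hex SHA-256 digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_block(requested_id: str, data: bytes) -> bool:
    """Accept the block only if its digest matches the identifier that was requested."""
    return content_id(data) == requested_id


original = b"indexing-metadata-block"
wanted = content_id(original)
ok = verify_block(wanted, original)
tampered_ok = verify_block(wanted, original + b"!")
```

Because the identifier is derived from the bytes themselves, a syncing node does not need to trust the peer that served the data, only the identifier it asked for.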

Incremental State Transfer

Instead of full state downloads, we propose a "Delta-Sync" approach:

  1. Snapshots: Primary nodes maintain periodic snapshots of the state.
  2. Gossip Logs: Use libp2p-gossipsub to broadcast recent changes that occurred after the last snapshot.
  3. Catch-up: New nodes download the latest snapshot and then replay the gossip logs.
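The three steps above could be sketched as follows, assuming every change carries a monotonically increasing sequence number and each snapshot records the last sequence number it includes. Both assumptions, and the `Snapshot`, `Change`, and `catch_up` names, are hypothetical and not taken from the proposal:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    seq: int     # sequence number of the last change folded into the snapshot
    state: dict


@dataclass
class Change:
    seq: int
    key: str
    value: Optional[str]  # None means the key was deleted


def catch_up(snapshot, log):
    """Replay logged changes newer than the snapshot, in order, rejecting gaps."""
    state = dict(snapshot.state)
    applied = snapshot.seq
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq <= applied:
            continue  # already covered by the snapshot, or a duplicate delivery
        if change.seq != applied + 1:
            raise ValueError(f"gap in log: expected {applied + 1}, got {change.seq}")
        if change.value is None:
            state.pop(change.key, None)
        else:
            state[change.key] = change.value
        applied = change.seq
    return state, applied


snapshot = Snapshot(seq=2, state={"a": "1", "b": "2"})
log = [Change(4, "b", None), Change(3, "a", "9"), Change(2, "a", "1"), Change(3, "a", "9")]
state, applied = catch_up(snapshot, log)
```

The duplicate and gap checks are there because gossiped messages may arrive out of order, more than once, or not at all; on a detected gap the node would need to request the missing range or wait for a newer snapshot.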

Verification Logic

Every synced segment must be verified against the on-chain (or shared ledger) root hash.

  • Proof-of-Sync: A mechanism where a node can cryptographically prove it has completed a sync before it is marked as "Active" in the node registry.
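As a rough sketch of the verification gate, the check below recomputes a Merkle root over the synced segments and compares it with the root taken from the ledger. The tree layout (SHA-256, last node duplicated on odd levels) is an assumption made for illustration; the proposal does not specify how the root hash is built, and `merkle_root` and `may_activate` are hypothetical names:

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(segments):
    """Compute a binary Merkle root; an odd level duplicates its last node (assumed layout)."""
    if not segments:
        raise ValueError("no segments to hash")
    level = [_h(s) for s in segments]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def may_activate(segments, ledger_root: bytes) -> bool:
    """A node is eligible to be marked "Active" only if its local root matches the ledger's."""
    return merkle_root(segments) == ledger_root


segments = [b"segment-0", b"segment-1", b"segment-2"]
ledger_root = merkle_root(segments)  # stands in for the value read from the ledger
eligible = may_activate(segments, ledger_root)
```

This only shows the local check. A full Proof-of-Sync would also need a way for other nodes to verify the claim, for example a signed attestation or a challenge over randomly chosen segments, which is left open here.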

Questions for Maintainers

  1. Is there an architectural preference between a pull-based (new node requests data) and a push-based (Primary broadcasts to new nodes) sync model?
  2. Should the sync logic reside within the core pallet logic or as a separate networking service?
  3. Are there existing benchmarks for the expected volume of data per provider that the sync protocol should be optimized for?
