Problem Statement
The current Storage Provider uses a single RocksDB instance with 3 column families (CF_NODES, CF_BUCKETS, CF_ROOT_TO_BUCKET) for all buckets. This creates a structural scaling bottleneck driven primarily by how MMR state is stored and rebuilt per bucket.
The Core Problem: MMR Rebuild Is O(n) Per Operation
BucketState stores all MMR leaves as a Vec<MmrLeaf> serialized into a single value in CF_BUCKETS. Every commit() and get_mmr_proof() call:
- Deserializes the entire bucket state (all leaves) from one RocksDB key
- Rebuilds the entire MMR from scratch by replaying all leaves
- Appends new leaves (for commits)
- Reserializes everything back to disk
// This runs on EVERY commit and EVERY proof generation (disk.rs:351-356, 436-438)
let mut mmr = crate::mmr::Mmr::new();
for leaf in &bucket.leaves {
    // Re-hash every stored leaf and push it into a fresh in-memory MMR
    mmr.push(blake2_256(&leaf.encode()));
}
For a bucket with 100,000 leaves, every single upload re-hashes all 100,000 leaves and recomputes every parent node above them. The serialized BucketState blob grows linearly and without bound; at scale it becomes a multi-MB single-key value that is deserialized and reserialized on every operation.
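To put rough numbers on the replay cost, here is a minimal sketch that counts hash invocations. It assumes Mmr::push behaves like a standard MMR append (one hash for the leaf, plus one merge for every parent node that becomes complete). The function names are hypothetical and not taken from the codebase.

```rust
/// Hash invocations needed to rebuild an MMR with `leaves` leaves from scratch:
/// one per leaf, plus one per parent node. A standard MMR with n leaves has
/// n - popcount(n) parent nodes.
fn replay_hashes(leaves: u64) -> u64 {
    leaves + (leaves - u64::from(leaves.count_ones()))
}

/// Hash invocations needed to append ONE leaf when `leaves` leaves already
/// exist and every node is persisted: the leaf itself, plus one merge per
/// trailing 1-bit of `leaves` (the same carry pattern as a binary counter).
fn append_hashes(leaves: u64) -> u64 {
    1 + u64::from(leaves.trailing_ones())
}
```

Under these assumptions, replay_hashes(100_000) is 199,994, while append_hashes(100_000) is 1 and the worst case for any bucket below 2^17 leaves is 18.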
Secondary Problems
- Cross-bucket contention on the write path: All bucket commits go through the same RocksDB write buffer. A burst of commits to Bucket A blocks Bucket B's checkpoint signature response, which has a 48-hour challenge deadline.
- Challenge response latency: get_mmr_proof() also rebuilds the entire MMR just to generate one proof. Under load, generating proofs for large buckets competes with all other bucket operations for the same DB resources.
- No tenant isolation: Storage and query spikes from one client's bucket affect all other buckets on the same provider.
- Expensive bucket deletion: Deleting an expired bucket requires writing individual tombstone markers for every key, triggering RocksDB compaction cycles that compete with active storage operations.
- Migration difficulty: Moving a single bucket to another provider requires scanning and extracting its data from the shared database; there is no self-contained unit to transfer.
Directional Solution: One Database Per Bucket
Architecture
Each bucket gets its own isolated database instance alongside a node-wide global index:
Storage Provider
├── Global Index (single lightweight DB)
│ ├── Bucket registry (bucket_id → DB path, status, stats)
│ ├── Deduplication index (content_hash → reference count)
│ └── Node-wide metrics and telemetry
│
├── Bucket DB: bucket_001/ (self-contained)
│ ├── MMR nodes (one key per node position — no replay needed)
│ ├── MMR leaves (individually addressable)
│ ├── Chunk data (content-addressed)
│ └── Bucket metadata (root, seq, config)
│
├── Bucket DB: bucket_002/ (self-contained)
│ └── ...
└── ...
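As an illustration of what the Global Index might hold, the sketch below models the bucket registry and the deduplication reference counts in memory. All type, field, and method names are hypothetical, the HashMaps stand in for tables in the lightweight global DB, and the bucket_NNN directory naming simply mirrors the tree above.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BucketStatus {
    Active,
    Expired,
}

/// One registry row: where the bucket's own DB lives and its lifecycle state.
struct BucketEntry {
    db_path: PathBuf,
    status: BucketStatus,
}

/// Hypothetical in-memory model of the node-wide Global Index.
#[derive(Default)]
struct GlobalIndex {
    buckets: HashMap<u64, BucketEntry>,
    chunk_refs: HashMap<[u8; 32], u32>,
}

impl GlobalIndex {
    /// Registers a bucket and derives its self-contained DB directory.
    fn register_bucket(&mut self, bucket_id: u64, root: &Path) {
        let db_path = root.join(format!("bucket_{:03}", bucket_id));
        self.buckets.insert(bucket_id, BucketEntry { db_path, status: BucketStatus::Active });
    }

    fn db_path(&self, bucket_id: u64) -> Option<&Path> {
        self.buckets.get(&bucket_id).map(|entry| entry.db_path.as_path())
    }

    /// Marks a bucket as expired; returns false if the bucket is unknown.
    fn expire(&mut self, bucket_id: u64) -> bool {
        match self.buckets.get_mut(&bucket_id) {
            Some(entry) => {
                entry.status = BucketStatus::Expired;
                true
            }
            None => false,
        }
    }

    /// Adds a reference to a chunk; returns true if this is the first reference.
    fn add_chunk_ref(&mut self, content_hash: [u8; 32]) -> bool {
        let count = self.chunk_refs.entry(content_hash).or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Drops a reference; returns true when the last reference is gone.
    fn release_chunk_ref(&mut self, content_hash: &[u8; 32]) -> bool {
        let remaining = match self.chunk_refs.get_mut(content_hash) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return false,
        };
        if remaining == 0 {
            self.chunk_refs.remove(content_hash);
        }
        remaining == 0
    }
}
```

The boolean results of add_chunk_ref and release_chunk_ref are one way a caller could decide whether chunk bytes must be written or may be reclaimed. Where those bytes physically live is left open here.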
Key Benefit: Persistent MMR Structure
With per-bucket DBs, MMR nodes can be stored individually (one key per node position) instead of replaying from a serialized leaf vec:
- commit() becomes O(log n) per appended leaf: write the new leaf and the parent nodes it completes, no replay
- get_mmr_proof() becomes O(log n): walk the tree by position, no rebuild
- get_mmr_peaks() becomes O(log n): read peak positions directly
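A minimal sketch of a node-per-key MMR follows, to show why none of these operations needs a replay. Several things here are assumptions made for illustration: nodes are keyed by a (height, index) coordinate rather than a linear position, a HashMap stands in for the per-bucket key-value store, and a 64-bit std hasher stands in for blake2_256. NodeStore and its methods are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in 64-bit digest; the real code hashes with blake2_256.
type Hash = u64;

fn hash_leaf(data: &[u8]) -> Hash {
    let mut h = DefaultHasher::new();
    h.write_u8(0); // domain tag: leaf
    h.write(data);
    h.finish()
}

fn hash_pair(left: Hash, right: Hash) -> Hash {
    let mut h = DefaultHasher::new();
    h.write_u8(1); // domain tag: parent
    h.write_u64(left);
    h.write_u64(right);
    h.finish()
}

/// Hypothetical per-bucket node store: each (height, index) pair would be
/// one database key.
#[derive(Default)]
struct NodeStore {
    nodes: HashMap<(u32, u64), Hash>,
    leaf_count: u64,
}

impl NodeStore {
    /// Appends one leaf plus only the parents it completes.
    /// Returns the number of nodes written.
    fn append(&mut self, data: &[u8]) -> usize {
        let (mut height, mut index) = (0u32, self.leaf_count);
        let mut node = hash_leaf(data);
        self.nodes.insert((height, index), node);
        let mut written = 1;
        // An odd index means the left sibling exists, so the parent is complete.
        while index % 2 == 1 {
            let left = self.nodes[&(height, index - 1)];
            node = hash_pair(left, node);
            height += 1;
            index /= 2;
            self.nodes.insert((height, index), node);
            written += 1;
        }
        self.leaf_count += 1;
        written
    }

    /// Peak coordinates, tallest first: one per set bit of leaf_count.
    fn peaks(&self) -> Vec<(u32, u64)> {
        (0..64u32)
            .rev()
            .filter(|&bit| (self.leaf_count >> bit) & 1 == 1)
            .map(|bit| (bit, (self.leaf_count >> bit) - 1))
            .collect()
    }

    /// Sibling hashes from a leaf up to the peak that covers it.
    fn proof(&self, leaf_index: u64) -> Option<Vec<Hash>> {
        if leaf_index >= self.leaf_count {
            return None;
        }
        let (mut height, mut index) = (0u32, leaf_index);
        let mut path = Vec::new();
        // A stored parent implies both children exist, so the sibling lookup is safe.
        while self.nodes.contains_key(&(height + 1, index / 2)) {
            path.push(self.nodes[&(height, index ^ 1)]);
            height += 1;
            index /= 2;
        }
        Some(path)
    }
}

/// Recomputes the covering peak from a leaf and its sibling path.
fn fold_proof(data: &[u8], leaf_index: u64, path: &[Hash]) -> Hash {
    let (mut node, mut index) = (hash_leaf(data), leaf_index);
    for &sibling in path {
        node = if index % 2 == 0 { hash_pair(node, sibling) } else { hash_pair(sibling, node) };
        index /= 2;
    }
    node
}
```

A full proof would also carry the remaining peaks so a verifier can recompute the bucket root. That step is omitted to keep the sketch short.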
Other Benefits
- Compaction-free bucket deletion: Delete the entire database directory instead of writing tombstones
- Complete tenant isolation: One bucket's I/O spike cannot affect another bucket's checkpoint or challenge response
- Easy bucket migration: Copy the self-contained database directory to move a bucket between providers
- Independent tuning: Hot buckets can have different cache/buffer settings than cold ones
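Directory-level deletion could look roughly like the sketch below. It assumes the bucket's DB handle has already been closed and its registry entry removed. The function name and the ".deleting" suffix are hypothetical choices, not existing code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Deletes a bucket by removing its whole database directory.
/// Assumes no open handle still points at `bucket_dir`.
fn delete_bucket_dir(bucket_dir: &Path) -> io::Result<()> {
    // Rename first so a crash mid-delete never leaves a partially deleted
    // directory under the live bucket name.
    let doomed = bucket_dir.with_extension("deleting");
    fs::rename(bucket_dir, &doomed)?;
    fs::remove_dir_all(&doomed)
}
```

Migration would be the mirror image: copy the closed directory to the target provider and register it there.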
Known Scaling Gaps
- File descriptor exhaustion: Thousands of open databases can exceed the OS ulimit -n limit
- Memory overhead: Each DB instance has its own memory-mapped buffers
- Loss of global deduplication: Identical chunks across buckets are stored once per bucket instead of once per node
Proposed Mitigations
- LRU connection pool: Cap concurrent open DBs. Evict least-recently-used bucket DBs from the pool. Reopen on next access. This bounds file descriptors and memory.
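- Lightweight engines: Evaluate SQLite (WAL mode) or Sled per bucket for a smaller footprint. SQLite keeps one main database file (plus -wal and -shm sidecar files in WAL mode); Sled is native Rust but stores each database as a directory rather than a single file.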
- Dual-DB architecture: Global Index handles deduplication references and node-wide metadata; per-bucket DBs handle bucket-specific MMR state and chunk mappings.
- OS limit tuning: Adjust ulimit -n and kernel parameters for providers running at scale.
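The LRU pool mitigation can be sketched as follows. BucketDb is a hypothetical stand-in for a real database handle, the buckets/ path prefix is an assumption, and the opens counter exists only to make reopen behavior visible.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for an open per-bucket database handle.
struct BucketDb {
    path: String,
}

/// Caps the number of simultaneously open bucket DBs.
struct DbPool {
    capacity: usize,
    open: HashMap<String, BucketDb>,
    recency: VecDeque<String>, // front = least recently used
    opens: usize,              // how many times a DB was (re)opened
}

impl DbPool {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "pool must allow at least one open DB");
        DbPool { capacity, open: HashMap::new(), recency: VecDeque::new(), opens: 0 }
    }

    /// Returns the handle for `bucket_id`, opening it (and evicting the
    /// least-recently-used handle if the pool is full) when necessary.
    fn get(&mut self, bucket_id: &str) -> &BucketDb {
        if self.open.contains_key(bucket_id) {
            // Already open: remove the old recency entry; it is re-added below.
            self.recency.retain(|id| id.as_str() != bucket_id);
        } else {
            if self.open.len() == self.capacity {
                if let Some(victim) = self.recency.pop_front() {
                    // Dropping the handle is where a real DB would be closed.
                    self.open.remove(&victim);
                }
            }
            let db = BucketDb { path: format!("buckets/{}", bucket_id) };
            self.open.insert(bucket_id.to_string(), db);
            self.opens += 1;
        }
        self.recency.push_back(bucket_id.to_string());
        &self.open[bucket_id]
    }

    fn is_open(&self, bucket_id: &str) -> bool {
        self.open.contains_key(bucket_id)
    }
}
```

The recency scan is linear in the pool size, which is acceptable for a pool of a few hundred handles. A production pool would also need to avoid evicting a handle that is in use by an in-flight commit or proof.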
Investigation Scope
- commit() and get_mmr_proof() latency as leaf count grows (100, 1K, 10K, 100K leaves)
Current Implementation Reference
- MMR implementation: provider-node/src/mmr.rs (in-memory Vec<H256>, rebuilt from leaves)
- Disk storage: provider-node/src/storage/disk.rs (RocksDB 0.22, 3 column families)
- BucketState: provider-node/src/storage/disk.rs:22-29 (stores Vec<MmrLeaf> as single serialized blob)
- Commit path: provider-node/src/storage/disk.rs:318-383 (full MMR replay on every commit)
- Proof generation: provider-node/src/storage/disk.rs:420-450 (full MMR replay for each proof)