Skip to content

Evaluate per-bucket database isolation on Storage Provider #100

@mudigal

Description

@mudigal

Problem Statement

The current Storage Provider uses a single RocksDB instance with 3 column families (CF_NODES, CF_BUCKETS, CF_ROOT_TO_BUCKET) for all buckets. This creates a structural scaling bottleneck driven primarily by how MMR state is stored and rebuilt per bucket.

The Core Problem: MMR Rebuild Is O(n) Per Operation

BucketState stores all MMR leaves as a Vec<MmrLeaf> serialized into a single value in CF_BUCKETS. Every commit() and get_mmr_proof() call:

  1. Deserializes the entire bucket state (all leaves) from one RocksDB key
  2. Rebuilds the entire MMR from scratch by replaying all leaves
  3. Appends new leaves (for commits)
  4. Reserializes everything back to disk
// This runs on EVERY commit and EVERY proof generation (disk.rs:351-356, 436-438)
let mut mmr = crate::mmr::Mmr::new();
for leaf in &bucket.leaves {
    mmr.push(blake2_256(&leaf.encode()));
}

For a bucket with 100,000 leaves, every single upload replays 100,000 hashes. The serialized BucketState blob grows linearly and unboundedly — at scale this becomes a multi-MB single-key value that gets deserialized/reserialized on every operation.

Secondary Problems

Cross-bucket contention on the write path: All bucket commits go through the same RocksDB write buffer. A burst of commits to Bucket A blocks Bucket B's checkpoint signature response, which has a 48-hour challenge deadline.

Challenge response latency: get_mmr_proof() also rebuilds the entire MMR just to generate one proof. Under load, generating proofs for large buckets competes with all other bucket operations for the same DB resources.

No tenant isolation: Storage and query spikes from one client's bucket affect all other buckets on the same provider.

Expensive bucket deletion: Deleting an expired bucket requires writing individual tombstone markers for every key, triggering RocksDB compaction cycles that compete with active storage operations.

Migration difficulty: Moving a single bucket to another provider requires scanning and extracting its data from the shared database — there's no self-contained unit to transfer.

Directional Solution: One Database Per Bucket

Architecture

Each bucket gets its own isolated database instance alongside a node-wide global index:

Storage Provider
├── Global Index (single lightweight DB)
│   ├── Bucket registry (bucket_id → DB path, status, stats)
│   ├── Deduplication index (content_hash → reference count)
│   └── Node-wide metrics and telemetry
│
├── Bucket DB: bucket_001/ (self-contained)
│   ├── MMR nodes (one key per node position — no replay needed)
│   ├── MMR leaves (individually addressable)
│   ├── Chunk data (content-addressed)
│   └── Bucket metadata (root, seq, config)
│
├── Bucket DB: bucket_002/ (self-contained)
│   └── ...
└── ...

Key Benefit: Persistent MMR Structure

With per-bucket DBs, MMR nodes can be stored individually (one key per node position) instead of replaying from a serialized leaf vec:

  • commit() becomes O(log n) — append new leaves and their parent nodes, no replay
  • get_mmr_proof() becomes O(log n) — walk the tree by position, no rebuild
  • get_mmr_peaks() becomes O(log n) — read peak positions directly

Other Benefits

  • Compaction-free bucket deletion: Delete the entire database directory instead of writing tombstones
  • Complete tenant isolation: One bucket's I/O spike cannot affect another bucket's checkpoint or challenge response
  • Easy bucket migration: Copy the self-contained database directory to move a bucket between providers
  • Independent tuning: Hot buckets can have different cache/buffer settings than cold ones

Known Scaling Gaps

  • File descriptor exhaustion: Thousands of open databases can exceed OS ulimit -n
  • Memory overhead: Each DB instance has its own memory-mapped buffers
  • Loss of global deduplication: Identical chunks across buckets get stored twice

Proposed Mitigations

  1. LRU connection pool: Cap concurrent open DBs. Evict least-recently-used bucket DBs from the pool. Reopen on next access. This bounds file descriptors and memory.
  2. Lightweight engines: Evaluate SQLite (WAL mode) or Sled per bucket — single-file, minimal footprint, native Rust (Sled).
  3. Dual-DB architecture: Global Index handles deduplication references and node-wide metadata; per-bucket DBs handle bucket-specific MMR state and chunk mappings.
  4. OS limit tuning: Adjust ulimit -n and kernel parameters for providers running at scale.

Investigation Scope

  • Benchmark current single-RocksDB performance: measure commit() and get_mmr_proof() latency as leaf count grows (100, 1K, 10K, 100K leaves)
  • Prototype persistent per-node MMR storage (one key per MMR position) — measure O(log n) vs current O(n) replay
  • Prototype per-bucket SQLite/Sled instances with LRU connection pool
  • Measure file descriptor usage and memory overhead at 100, 500, 1000 bucket scale
  • Evaluate deduplication loss (how much storage overhead from duplicate chunks across buckets?)
  • Define the Global Index schema (what metadata is node-wide vs. bucket-local?)
  • Compare deletion performance: single-DB tombstone compaction vs. per-bucket directory removal
  • Determine migration path from current single-RocksDB to per-bucket architecture

Current Implementation Reference

  • MMR implementation: provider-node/src/mmr.rs (in-memory Vec<H256>, rebuilt from leaves)
  • Disk storage: provider-node/src/storage/disk.rs (RocksDB 0.22, 3 column families)
  • BucketState: provider-node/src/storage/disk.rs:22-29 (stores Vec<MmrLeaf> as single serialized blob)
  • Commit path: provider-node/src/storage/disk.rs:318-383 (full MMR replay on every commit)
  • Proof generation: provider-node/src/storage/disk.rs:420-450 (full MMR replay for each proof)

Metadata

Metadata

Assignees

No one assigned
    No fields configured for Feature.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions