Evaluate database engine for Storage Provider and Blockchain Provider #101

@mudigal

Context

The Storage Provider and Blockchain Provider have different database workload profiles but both need production-grade database selection and tuning. This issue tracks the evaluation of database engines and configuration for both components.

Related: Issue #100 (per-bucket database isolation) — the engine choice here directly informs the per-bucket architecture.

Current State

  • Storage Provider: RocksDB via the Rust rocksdb crate, version 0.22, with 3 column families (CF_NODES, CF_BUCKETS, CF_ROOT_TO_BUCKET). See provider-node/src/storage/disk.rs.
  • Blockchain Provider: Uses Substrate's default backend (RocksDB via sc-client-db).

Database Trade-Off Matrix

| Architectural Vector | RocksDB (LSM-Tree) | ParityDB (Hash-Based) | Sled (B-Tree) | SQLite (WAL) |
|---|---|---|---|---|
| Primary Structure | Log-Structured Merge-Trees (SSTs) | Fixed-size value tables with hash-indexed mmap | Lock-free B-Tree | B-Tree with WAL journaling |
| Ideal Workloads | High-volume sequential reads/writes | High-frequency state trie lookups (32-byte keys) | Mixed read/write with transactional safety | Single-writer, many-reader with complex queries |
| Write Amplification | High (compaction rewrites data across levels) | Low (append-only) | Low-moderate | Low (WAL append) |
| Memory Management | Explicit app-level block caches | Implicit (OS page cache) | Implicit (OS page cache) | Configurable page cache |
| Rust Native | No (FFI to C++) | Yes | Yes | No (FFI to C) |
| Deletion Cost | High (tombstones trigger compaction) | Moderate | Moderate | Low (single file delete for per-bucket) |
| Maturity | Very high (Facebook, production-proven) | Moderate (Parity-maintained) | Moderate (community) | Very high (ubiquitous) |

Scalability Bottlenecks to Evaluate

RocksDB Bottlenecks

  • SSD Compaction Wear: Continuous background compaction generates intense disk writes, reducing SSD lifespans and competing for write bandwidth.
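  • FFI Call Overhead: Every operation crosses the Rust/C++ boundary, which prevents inlining and adds argument-marshalling cost. This is ordinary call overhead rather than an OS context switch, but it becomes noticeable during high-frequency operations.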
  • Deletion Latency Spikes: Clearing expired agreements writes massive volumes of tombstones, triggering compactions that can delay other operations.
  • Single-blob MMR state: Current BucketState serializes all MMR leaves as one value, so every access costs O(n) in the number of leaves (see Issue #100, "Evaluate per-bucket database isolation on Storage Provider").
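
To make the single-blob cost concrete, here is a minimal sketch that counts bytes written per append under two layouts. It does not reflect the actual BucketState code: the `Kv` alias is an in-memory stand-in for an ordered key-value store, and `node_key`, `append_blob`, `append_keyed`, and `compare_layouts` are hypothetical names. The key layout (32-byte bucket id followed by a big-endian position) is an assumption for illustration.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an ordered key-value store (assumption, not the real engine).
type Kv = BTreeMap<Vec<u8>, Vec<u8>>;

/// Assumed key layout: 32-byte bucket id followed by a big-endian u64 node position.
/// Big-endian keeps positions in numeric order under byte-wise key comparison.
fn node_key(bucket_id: &[u8; 32], position: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(40);
    key.extend_from_slice(bucket_id);
    key.extend_from_slice(&position.to_be_bytes());
    key
}

/// Blob layout: read the whole value, append one leaf, write the whole value back.
/// Returns the number of value bytes written by this append.
fn append_blob(kv: &mut Kv, bucket_id: &[u8; 32], leaf: &[u8; 32]) -> usize {
    let mut blob = kv.get(bucket_id.as_slice()).cloned().unwrap_or_default();
    blob.extend_from_slice(leaf);
    let written = blob.len();
    kv.insert(bucket_id.to_vec(), blob);
    written
}

/// Keyed layout: one small value per node position, so an append writes one leaf.
fn append_keyed(kv: &mut Kv, bucket_id: &[u8; 32], position: u64, leaf: &[u8; 32]) -> usize {
    kv.insert(node_key(bucket_id, position), leaf.to_vec());
    leaf.len()
}

/// Appends `n` leaves under both layouts and returns (blob_bytes, keyed_bytes) written.
pub fn compare_layouts(n: u64) -> (usize, usize) {
    let bucket = [7u8; 32];
    let (mut blob_kv, mut keyed_kv) = (Kv::new(), Kv::new());
    let (mut blob_bytes, mut keyed_bytes) = (0usize, 0usize);
    for pos in 0..n {
        let leaf = [pos as u8; 32];
        blob_bytes += append_blob(&mut blob_kv, &bucket, &leaf);
        keyed_bytes += append_keyed(&mut keyed_kv, &bucket, pos, &leaf);
    }
    (blob_bytes, keyed_bytes)
}
```

Under these assumptions the blob layout's total write volume grows quadratically with the number of leaves, while the keyed layout grows linearly. The same key shape is what the "per-node-position key writes" benchmark criterion would exercise.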

ParityDB Considerations

  • OS Page Cache Starvation: Delegates caching to OS — competing processes can purge DB index from RAM.
  • Value Fragmentation: Records >32KB split across tables, increasing seek operations for proofs/metadata.
  • Index Expansion Stalls: Hash index resizing can briefly pause transactions at high load factors.

Sled Considerations

  • Pure Rust: Zero FFI overhead, native compilation, good fit for per-bucket isolation.
  • Lock-free reads: Excellent for concurrent read workloads (multiple downloads while committing).
  • Maturity concern: Less battle-tested than RocksDB at extreme scale.

SQLite (WAL mode) Considerations

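  • Single-file per database: Ideal for per-bucket isolation, since one main file corresponds to one bucket. In WAL mode each open database also has -wal and -shm sidecar files, which can linger after an unclean shutdown, so bucket cleanup must remove those as well.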
  • Excellent tooling: Inspectable, debuggable, well-understood.
  • Single-writer limitation: Only one writer at a time (fine for per-bucket where one provider writes).

Evaluation for Each Component

Storage Provider (per-bucket DBs — Issue #100)

The per-bucket architecture (Issue #100) changes the requirements: instead of one large DB, we need many small, independent DBs. This favors:

  • Low per-instance memory footprint
  • Fast open/close (for LRU connection pool)
  • Efficient single-file cleanup (bucket deletion = file deletion)
  • Native Rust (avoid FFI overhead across hundreds of instances)
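
The LRU connection pool mentioned above could look roughly like the following sketch. It is an illustration under stated assumptions, not a proposed implementation: `HandlePool`, `BucketId`, and `get_or_open` are hypothetical names, `H` stands for whatever handle type the chosen engine returns, and dropping a handle is assumed to close its database. Recency tracking uses a `VecDeque` with an O(n) scan, which is acceptable for a sketch but would want a proper LRU structure at 1000 instances.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bucket identifier; the real type lives in the provider code.
type BucketId = u64;

/// Bounded pool of open per-bucket handles with least-recently-used eviction.
pub struct HandlePool<H> {
    capacity: usize,
    handles: HashMap<BucketId, H>,
    recency: VecDeque<BucketId>, // front = least recently used
}

impl<H> HandlePool<H> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "pool needs room for at least one handle");
        Self { capacity, handles: HashMap::new(), recency: VecDeque::new() }
    }

    /// Returns the handle for `bucket`, opening it with `open` on a miss and
    /// evicting the least recently used handle first if the pool is full.
    pub fn get_or_open(&mut self, bucket: BucketId, open: impl FnOnce(BucketId) -> H) -> &mut H {
        if self.handles.contains_key(&bucket) {
            // Hit: remove the old recency entry; it is re-added at the back below.
            self.recency.retain(|b| *b != bucket);
        } else {
            if self.handles.len() == self.capacity {
                if let Some(victim) = self.recency.pop_front() {
                    // Dropping the handle is assumed to close the evicted database.
                    self.handles.remove(&victim);
                }
            }
            self.handles.insert(bucket, open(bucket));
        }
        self.recency.push_back(bucket);
        self.handles
            .get_mut(&bucket)
            .expect("handle is present: it was either found or just inserted")
    }

    pub fn is_open(&self, bucket: BucketId) -> bool {
        self.handles.contains_key(&bucket)
    }

    pub fn open_count(&self) -> usize {
        self.handles.len()
    }
}
```

With a pool like this, the open/close latency of the engine is paid on every eviction and reload, which is why that metric appears in the benchmark criteria.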

Candidates to benchmark: Sled, SQLite (WAL), RocksDB (lightweight config)

Benchmark criteria:

  • Open/close latency (cold start for LRU pool eviction/reload)
  • Memory overhead per instance at rest and under load
  • Write throughput for MMR append operations (per-node-position key writes)
  • Read latency for MMR proof generation (random key lookups by position)
  • File descriptor usage at 100, 500, 1000 instances
  • Disk space efficiency (overhead per instance)
  • Bulk deletion speed (drop entire bucket DB)
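
One way to keep the comparison fair is to run every candidate through the same harness behind a small trait. The sketch below is an assumption about how that might be structured: `BucketDb`, `MemDb`, `Timings`, and `bench_bucket` are hypothetical names, and `MemDb` is an in-memory stand-in so the harness runs without external crates. Its timings say nothing about any real engine; adapters for Sled, SQLite, and RocksDB would each implement the trait.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Minimal surface a candidate engine exposes to the benchmark (hypothetical trait).
pub trait BucketDb: Sized {
    fn open(bucket: u64) -> Self;
    fn put(&mut self, position: u64, value: &[u8]);
    fn get(&self, position: u64) -> Option<Vec<u8>>;
    fn close(self);
}

/// In-memory stand-in used only to exercise the harness.
pub struct MemDb(BTreeMap<u64, Vec<u8>>);

impl BucketDb for MemDb {
    fn open(_bucket: u64) -> Self { MemDb(BTreeMap::new()) }
    fn put(&mut self, position: u64, value: &[u8]) { self.0.insert(position, value.to_vec()); }
    fn get(&self, position: u64) -> Option<Vec<u8>> { self.0.get(&position).cloned() }
    fn close(self) {}
}

/// Wall-clock timings for one bucket's lifecycle.
#[derive(Debug, Default)]
pub struct Timings {
    pub open: Duration,
    pub append: Duration,
    pub read: Duration,
    pub close: Duration,
    pub reads_found: usize,
}

/// Opens a bucket DB, appends `leaves` 32-byte values, reads every 7th position, then closes.
pub fn bench_bucket<D: BucketDb>(bucket: u64, leaves: u64) -> Timings {
    let t = Instant::now();
    let mut db = D::open(bucket);
    let open = t.elapsed();

    let t = Instant::now();
    for pos in 0..leaves {
        db.put(pos, &[0u8; 32]);
    }
    let append = t.elapsed();

    // Strided lookups are a cheap stand-in for the scattered positions a proof touches.
    let t = Instant::now();
    let mut reads_found = 0;
    let mut pos = 0;
    while pos < leaves {
        if db.get(pos).is_some() {
            reads_found += 1;
        }
        pos += 7;
    }
    let read = t.elapsed();

    let t = Instant::now();
    db.close();
    let close = t.elapsed();

    Timings { open, append, read, close, reads_found }
}
```

A real run would repeat `bench_bucket` across 100, 500, and 1000 buckets and record memory, file descriptor, and disk usage alongside the timings, since those are not visible from inside the process timer.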

Blockchain Provider (single instance, state trie)

The Blockchain Provider uses Substrate's storage backend for the state trie. Options:

  • RocksDB (current Substrate default): Mature and proven at scale, but carries compaction overhead.
  • ParityDB: Substrate-native alternative optimized for state trie access patterns. Lower write amplification.

Benchmark criteria:

  • Block import time under full state load
  • State trie read latency for storage maps (Providers, Buckets, StorageAgreements)
  • Compaction impact on block production
  • Memory usage under sustained testnet load
  • Pruning efficiency for historical state

Strategic Mitigation Plans (Applicable Regardless of Engine Choice)

Step 1: Physical Data Isolation

Decouple the Blockchain Provider's state tracking database from raw file storage. The consensus database must store only 32-byte hash pointers and state metrics, never file contents.

Step 2: System Memory Safeguards

Isolate the Storage Provider's HTTP engine and file-sharing tasks within bounded OS containers (e.g., Linux cgroups). Prevent file transfer operations from purging the Blockchain Provider's consensus database indexes from RAM.

Step 3: Compaction & Buffer Tuning (if staying with RocksDB)

Configure leveled compaction with dynamic targets, expand active write buffers to cushion transaction spikes, and cap background write rates to protect disk I/O.
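
Expanding write buffers trades memory for burst absorption, so it helps to bound the cost before tuning. As a rough worked example, memtable memory is at most write_buffer_size × max_write_buffer_number × column families × instances. The sketch below encodes that arithmetic; `WriteBufferConfig` and `memtable_upper_bound` are hypothetical names, and the defaults shown (64 MiB buffers, 2 buffers per column family) are RocksDB's stock values. The bound ignores block cache and any global cap set through a write buffer manager.

```rust
/// Write-buffer settings that drive memtable memory (field names mirror RocksDB option names).
pub struct WriteBufferConfig {
    pub write_buffer_size: u64,       // bytes per memtable
    pub max_write_buffer_number: u64, // memtables kept per column family
    pub column_families: u64,
}

impl WriteBufferConfig {
    /// RocksDB's stock defaults: 64 MiB write buffers, 2 per column family.
    pub fn rocksdb_defaults(column_families: u64) -> Self {
        Self {
            write_buffer_size: 64 * 1024 * 1024,
            max_write_buffer_number: 2,
            column_families,
        }
    }
}

/// Worst-case memtable bytes across `instances` databases, or None on overflow.
pub fn memtable_upper_bound(cfg: &WriteBufferConfig, instances: u64) -> Option<u64> {
    cfg.write_buffer_size
        .checked_mul(cfg.max_write_buffer_number)?
        .checked_mul(cfg.column_families)?
        .checked_mul(instances)
}
```

With the three column families listed under Current State, the defaults allow up to 384 MiB of memtables per instance. That is tolerable for a single Blockchain Provider database but not for hundreds of per-bucket instances, which is why a "lightweight config" is a precondition for benchmarking RocksDB in the per-bucket role.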

Step 4: Key Prefix Restructuring

Structure on-chain key pathways so entries sharing a common parent (like Bucket ID) are stored contiguously. Enables bulk deletions in a single pass.

Note: The original plan attributed key-prefix restructuring to Issue #65, but actual Issue #65 is "Robust Syncing Protocol for Dynamic Primary and Replica Node Topologies" — a different topic entirely.
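
The contiguous layout from Step 4 can be sketched as follows. This is an illustration, not the on-chain key scheme: `Kv` is an in-memory ordered map standing in for the engine, and `prefixed_key`, `prefix_upper_bound`, and `delete_bucket` are hypothetical helpers. The idea is that when every key for a bucket starts with the bucket id, all of them fall inside one half-open key range, which an ordered store can remove in a single pass (RocksDB, for example, offers a range-delete operation for this).

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an ordered key-value store (assumption, not the real engine).
type Kv = BTreeMap<Vec<u8>, Vec<u8>>;

/// Builds a key whose first 32 bytes are the parent bucket id.
fn prefixed_key(bucket_id: &[u8; 32], suffix: &[u8]) -> Vec<u8> {
    let mut key = bucket_id.to_vec();
    key.extend_from_slice(suffix);
    key
}

/// Smallest key that sorts after every key starting with `prefix`.
/// Returns None when the prefix is empty or all 0xFF bytes, so no such bound exists.
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.pop() {
        if last < 0xFF {
            end.push(last + 1);
            return Some(end);
        }
        // Trailing 0xFF bytes cannot be incremented; drop them and carry left.
    }
    None
}

/// Removes every entry under `bucket_id` and returns how many were deleted.
pub fn delete_bucket(kv: &mut Kv, bucket_id: &[u8; 32]) -> usize {
    let start = bucket_id.to_vec();
    // All keys with this prefix lie in [start, end), or [start, ..) if no bound exists.
    let doomed: Vec<Vec<u8>> = match prefix_upper_bound(bucket_id) {
        Some(end) => kv.range(start..end).map(|(k, _)| k.clone()).collect(),
        None => kv.range(start..).map(|(k, _)| k.clone()).collect(),
    };
    for key in &doomed {
        kv.remove(key);
    }
    doomed.len()
}
```

Without the shared prefix, the same cleanup would need a full scan or a secondary index to find a bucket's entries.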

Alternative Engines (Evaluate Only If Primary Candidates Fail)

  • PebblesDB (Fragmented LSM Tree): Groups keys into fragments, reportedly reducing write amplification by about 70%. Alternative for Blockchain Provider if RocksDB compaction thrashes disks. Written in C++, so it would still require FFI.
  • BadgerDB (Key-Value Separated LSM): Separates keys from values and compacts only the key index. Optimized for variable-sized metadata and proof structures. Written in Go, so using it from Rust would need a cgo bridge or a separate process.

Deliverables

  1. Benchmark report comparing Sled vs SQLite vs RocksDB for per-bucket Storage Provider DBs
  2. Benchmark report comparing RocksDB vs ParityDB for Blockchain Provider state trie
  3. Recommendation with justification for each component
  4. Configuration guide for the chosen engines (memory limits, compaction settings, OS tuning)
  5. Migration plan from current single-RocksDB to chosen architecture
