Evaluate database engine for Storage Provider and Blockchain Provider #101

## Context

The Storage Provider and Blockchain Provider have different database workload profiles, but both need a production-grade engine and tuning. This issue tracks the evaluation of database engines and their configuration for both components.

Related: Issue #100 (per-bucket database isolation) — the engine choice here directly informs the per-bucket architecture.

## Current State

- **Storage Provider**: RocksDB 0.22 with 3 column families (`CF_NODES`, `CF_BUCKETS`, `CF_ROOT_TO_BUCKET`). See `provider-node/src/storage/disk.rs`.
- **Blockchain Provider**: Uses Substrate's default backend (RocksDB via `sc-client-db`).

## Database Trade-Off Matrix

| Architectural Vector | RocksDB (LSM-Tree) | ParityDB (Hash-Based) | Sled (B-Tree) | SQLite (WAL) |
|---------------------|--------------------|-----------------------|---------------|--------------|
| Primary Structure | Log-Structured Merge-Trees (SSTs) | Fixed-size value tables with hash-indexed mmap | Lock-free B-Tree | B-Tree with WAL journaling |
| Ideal Workloads | Write-heavy workloads and ordered range scans | High-frequency state trie lookups (32-byte keys) | Mixed read/write with transactional safety | Single-writer, many-reader with complex queries |
| Write Amplification | High (compaction rewrites data across levels) | Low (append-only) | Low-moderate | Low (WAL append) |
| Memory Management | Explicit app-level block caches | Implicit (OS page cache) | Implicit (OS page cache) | Configurable page cache |
| Rust Native | No (FFI to C++) | Yes | Yes | No (FFI to C) |
| Deletion Cost | High (tombstones trigger compaction) | Moderate | Moderate | Low (single file delete for per-bucket) |
| Maturity | Very high (Facebook, production-proven) | Moderate (Parity-maintained) | Moderate (community) | Very high (ubiquitous) |

## Scalability Bottlenecks to Evaluate

### RocksDB Bottlenecks
- **SSD Compaction Wear**: Continuous background compaction generates intense disk writes, reducing SSD lifespans and competing for write bandwidth.
- **FFI Call Overhead**: Every operation crosses the Rust-to-C++ boundary, which prevents inlining and adds per-call CPU overhead that can become noticeable during high-frequency operations.
- **Deletion Latency Spikes**: Clearing expired agreements writes one tombstone per deleted key. Large batches of tombstones trigger compactions that can delay other operations.
- **Single-blob MMR state**: Current `BucketState` serializes all MMR leaves as one value — O(n) on every access (see Issue #100).
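
To make the single-blob cost concrete, the sketch below compares bytes written per MMR append under the two layouts. It is a back-of-the-envelope model, not the code in `disk.rs`: the 32-byte node size and all function names are assumptions for illustration, and the blob model counts only leaf hashes.

```rust
/// Size of one MMR node hash in bytes (assumption: 32-byte hashes).
const NODE_BYTES: u64 = 32;

/// Total number of nodes in an MMR that holds `leaves` leaves.
pub fn mmr_size(leaves: u64) -> u64 {
    2 * leaves - u64::from(leaves.count_ones())
}

/// Nodes created when appending the leaf with 0-based index `leaf_index`:
/// the leaf itself plus one parent for every subtree merge it completes.
pub fn nodes_written_on_append(leaf_index: u64) -> u64 {
    1 + u64::from(leaf_index.trailing_ones())
}

/// Bytes rewritten per append when all leaves live in one serialized value.
pub fn blob_bytes_per_append(leaves_after_append: u64) -> u64 {
    leaves_after_append * NODE_BYTES
}

/// Bytes written per append when each node is stored under its own position key.
pub fn per_node_bytes_per_append(leaf_index: u64) -> u64 {
    nodes_written_on_append(leaf_index) * NODE_BYTES
}
```

Under these assumptions the blob layout grows linearly with the bucket, while the per-position layout writes at most a logarithmic number of nodes per append.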

### ParityDB Considerations
- **OS Page Cache Starvation**: Caching is delegated to the OS page cache, so competing processes can evict the DB index from RAM.
- **Value Fragmentation**: Records larger than 32 KB are split across tables, which increases seek operations for proofs and metadata.
- **Index Expansion Stalls**: Resizing the hash index can briefly pause transactions at high load factors.

### Sled Considerations
- **Pure Rust**: Zero FFI overhead, native compilation, good fit for per-bucket isolation.
- **Lock-free reads**: Excellent for concurrent read workloads (multiple downloads while committing).
- **Maturity concern**: Still pre-1.0 and less battle-tested than RocksDB at extreme scale.

### SQLite (WAL mode) Considerations
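- **One main file per database**: A good fit for per-bucket isolation, since each bucket maps to one database file. In WAL mode SQLite also keeps `-wal` and `-shm` sidecar files next to the main file while the database is open.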
- **Excellent tooling**: Inspectable, debuggable, well-understood.
- **Single-writer limitation**: Only one writer at a time (fine for per-bucket where one provider writes).
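
Because a WAL-mode database can leave `-wal` and `-shm` files beside the main file, dropping a bucket should remove all three. The following is a minimal sketch using only the standard library; the function names and the per-bucket file naming are hypothetical, and it assumes every connection to the database has already been closed.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Paths SQLite may create for a database in WAL mode: the main file plus
/// the `-wal` and `-shm` sidecars, which are named by appending a suffix.
pub fn sqlite_bucket_files(db_path: &Path) -> Vec<PathBuf> {
    let mut files = vec![db_path.to_path_buf()];
    for suffix in ["-wal", "-shm"] {
        let mut name = db_path.as_os_str().to_os_string();
        name.push(suffix);
        files.push(PathBuf::from(name));
    }
    files
}

/// Removes every file that belongs to one bucket database.
/// Files that do not exist are skipped; returns how many were removed.
/// Assumes all connections to the database are already closed.
pub fn delete_bucket_db(db_path: &Path) -> io::Result<usize> {
    let mut removed = 0;
    for file in sqlite_bucket_files(db_path) {
        match fs::remove_file(&file) {
            Ok(()) => removed += 1,
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(removed)
}
```

The bulk deletion benchmark for SQLite should time this whole operation rather than a single `remove_file` call.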

## Evaluation for Each Component

### Storage Provider (per-bucket DBs — Issue #100)

The per-bucket architecture (Issue #100) changes the requirements: instead of one large DB, we need many small, independent DBs. This favors:
- Low per-instance memory footprint
- Fast open/close (for LRU connection pool)
- Efficient single-file cleanup (bucket deletion = file deletion)
- Native Rust (avoid FFI overhead across hundreds of instances)
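
The open/close requirement comes from how an LRU pool behaves: every miss on a full pool closes one database and opens another. Below is a minimal sketch of such a pool, generic over the handle type so it is independent of the engine chosen. `LruPool` and its methods are hypothetical names, and it assumes that dropping a handle closes the underlying database.

```rust
use std::collections::{HashMap, VecDeque};

/// A bounded pool of open per-bucket handles with least-recently-used eviction.
/// `H` stands in for whatever handle type the chosen engine exposes.
pub struct LruPool<H> {
    capacity: usize,
    handles: HashMap<u64, H>,
    /// Bucket IDs ordered from least to most recently used.
    order: VecDeque<u64>,
}

impl<H> LruPool<H> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "pool capacity must be at least 1");
        Self { capacity, handles: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns the handle for `bucket_id`, calling `open` only on a miss.
    /// On a miss with a full pool, the least recently used handle is dropped
    /// first (assumption: dropping a handle closes the database).
    pub fn get_or_open<F>(&mut self, bucket_id: u64, open: F) -> &mut H
    where
        F: FnOnce(u64) -> H,
    {
        if self.handles.contains_key(&bucket_id) {
            // Hit: take the ID out of the queue so it can be re-added at the back.
            // This linear scan is acceptable for pools of a few hundred entries.
            self.order.retain(|id| *id != bucket_id);
        } else {
            if self.handles.len() == self.capacity {
                if let Some(victim) = self.order.pop_front() {
                    self.handles.remove(&victim);
                }
            }
            self.handles.insert(bucket_id, open(bucket_id));
        }
        self.order.push_back(bucket_id);
        self.handles
            .get_mut(&bucket_id)
            .expect("handle was just inserted or found")
    }

    pub fn is_open(&self, bucket_id: u64) -> bool {
        self.handles.contains_key(&bucket_id)
    }

    pub fn len(&self) -> usize {
        self.handles.len()
    }
}
```

With a pool like this, cold-open latency is paid on every miss, which is why open/close cost is weighted so heavily in the benchmark criteria.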

**Candidates to benchmark**: Sled, SQLite (WAL), RocksDB (lightweight config)

**Benchmark criteria**:
- [ ] Open/close latency (cold start for LRU pool eviction/reload)
- [ ] Memory overhead per instance at rest and under load
- [ ] Write throughput for MMR append operations (per-node-position key writes)
- [ ] Read latency for MMR proof generation (random key lookups by position)
- [ ] File descriptor usage at 100, 500, 1000 instances
- [ ] Disk space efficiency (overhead per instance)
- [ ] Bulk deletion speed (drop entire bucket DB)
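
For the latency criteria, tail percentiles are usually more informative than averages, since compaction and cold opens show up as outliers. A minimal, engine-agnostic timing helper might look like the sketch below; the function names are hypothetical and the percentile uses the nearest-rank method.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `iterations` times and records the wall-clock time of each call.
pub fn time_iterations<F: FnMut()>(iterations: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}

/// Nearest-rank percentile of `samples`, with `p` in the range 0.0 to 100.0.
/// Returns `None` when there are no samples.
pub fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank is 1-based; clamp so p = 0 maps to the smallest sample.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let index = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[index])
}
```

Each candidate engine can then be wrapped in a closure passed to `time_iterations`, so the same reporting code is reused for open/close, append, and lookup measurements.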

### Blockchain Provider (single instance, state trie)

The Blockchain Provider uses Substrate's storage backend for the state trie. Options:
- **RocksDB** (current Substrate default): Mature, proven at scale, but compaction overhead.
- **ParityDB**: Substrate-native alternative optimized for state trie access patterns. Lower write amplification.

**Benchmark criteria**:
- [ ] Block import time under full state load
- [ ] State trie read latency for storage maps (Providers, Buckets, StorageAgreements)
- [ ] Compaction impact on block production
- [ ] Memory usage under sustained testnet load
- [ ] Pruning efficiency for historical state

## Strategic Mitigation Plans (Applicable Regardless of Engine Choice)

### Step 1: Physical Data Isolation
Decouple the Blockchain Provider's state tracking database from raw file storage. The consensus database should store only 32-byte hash pointers and state metrics.

### Step 2: System Memory Safeguards
Isolate the Storage Provider's HTTP engine and file-sharing tasks within bounded OS containers (e.g., Linux cgroups). This prevents file transfer operations from evicting the Blockchain Provider's consensus database indexes from RAM.

### Step 3: Compaction & Buffer Tuning (if staying with RocksDB)
Configure leveled compaction with dynamic level targets, enlarge the active write buffers to absorb transaction spikes, and rate-limit background writes to protect foreground disk I/O.

### Step 4: Key Prefix Restructuring
Structure on-chain keys so that entries sharing a common parent (such as a Bucket ID) are stored contiguously. This enables bulk deletion in a single pass.
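
The idea can be sketched with an ordered map standing in for any key-ordered store. The key layout here (8-byte big-endian bucket ID followed by 8-byte big-endian position) and the function names are assumptions for illustration, not the project's actual encoding.

```rust
use std::collections::BTreeMap;

/// Builds a key as bucket ID followed by node position, both big-endian
/// (assumed layout). Big-endian makes byte order match numeric order,
/// so all keys of one bucket sort next to each other.
pub fn node_key(bucket_id: u64, position: u64) -> [u8; 16] {
    let mut key = [0u8; 16];
    key[..8].copy_from_slice(&bucket_id.to_be_bytes());
    key[8..].copy_from_slice(&position.to_be_bytes());
    key
}

/// Removes every entry of `bucket_id` using one contiguous range scan.
/// Returns the number of entries removed.
pub fn delete_bucket(store: &mut BTreeMap<[u8; 16], Vec<u8>>, bucket_id: u64) -> usize {
    let start = node_key(bucket_id, 0);
    let end = node_key(bucket_id, u64::MAX);
    let doomed: Vec<[u8; 16]> = store.range(start..=end).map(|(k, _)| *k).collect();
    for key in &doomed {
        store.remove(key);
    }
    doomed.len()
}
```

In an engine with native range deletion, the same `start` and `end` bounds would be handed to that operation instead of collecting keys first.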

Note: The original plan attributed key-prefix restructuring to Issue #65, but actual Issue #65 is "Robust Syncing Protocol for Dynamic Primary and Replica Node Topologies" — a different topic entirely.

## Alternative Engines (Evaluate Only If Primary Candidates Fail)

- **PebblesDB (Fragmented LSM Tree)**: Groups keys into fragments, reducing write amplification ~70%. Alternative for Blockchain Provider if RocksDB compaction thrashes disks.
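- **BadgerDB (Key-Value Separated LSM)**: Separates keys from values and compacts only the key index. Optimized for variable-sized metadata and proof structures. Note that BadgerDB is a Go library, so using it from Rust would require a cross-language bridge.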

## Deliverables

1. Benchmark report comparing Sled vs SQLite vs RocksDB for per-bucket Storage Provider DBs
2. Benchmark report comparing RocksDB vs ParityDB for Blockchain Provider state trie
3. Recommendation with justification for each component
4. Configuration guide for the chosen engines (memory limits, compaction settings, OS tuning)
5. Migration plan from current single-RocksDB to chosen architecture
