fix(xmtp_db): make TestInstrumentation database-lock panic opt-in#3767
fix(xmtp_db): make TestInstrumentation database-lock panic opt-in#3767xmtp-coder-agent wants to merge 3 commits into
Conversation
TestInstrumentation panicked on every `database is locked` query error whenever the `test` cfg or `test-utils` feature was active. Because the node-sdk Vitest suite enables `test-utils`, transient SQLite lock contention between concurrent fork workers triggered this panic on `r2d2-worker`/`tokio-rt-worker` threads, aborting the worker processes and surfacing as flaky `Worker exited unexpectedly` CI failures. `database is locked` is a retryable condition in this codebase (SQLite is configured with `PRAGMA busy_timeout` and the error is classified retryable). Panicking short-circuits that recovery machinery. Gate the panic behind an opt-in `XMTP_PANIC_ON_DB_LOCK` env var (accepts "1"/"true", matching the existing SQLCIPHER_LOG convention). Default (unset) logs the error and returns normally, restoring the retry behavior. Developers debugging lock contention can still opt in. Resolves xmtp#3765 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`test_stream_messages` opened a message stream after `add_members` and asserted the first streamed message was the application payload "hello". The `group_updated` membership-change commit from adding bob can race into the stream ahead of the application messages, so the assertion intermittently saw the commit message instead of "hello" and failed (observed in CI on this PR). Filter the stream to `GroupMessageKind::Application` messages so the assertions are deterministic regardless of whether the commit message is delivered through the stream. Errors are passed through unchanged so the existing error handling in `assert_msg!` still applies. Verified deterministic: the test passed 10/10 consecutive runs against a live node. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI fix: deflaked
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3767 +/- ##
==========================================
+ Coverage 84.44% 84.47% +0.02%
==========================================
Files 408 408
Lines 59807 59830 +23
==========================================
+ Hits 50506 50540 +34
+ Misses 9301 9290 -11 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| /// Returns true when the caller has explicitly opted in to panicking on | ||
| /// `database is locked` via the `XMTP_PANIC_ON_DB_LOCK` env var. | ||
| fn panic_on_db_lock_enabled() -> bool { | ||
| panic_on_db_lock_from_env(std::env::var("XMTP_PANIC_ON_DB_LOCK")) |
There was a problem hiding this comment.
this makes it default off, so libxmtp workspace test and others which used to have panic on lock now have it off by default. i still want those to panic. can you turn it on for CI please? @xmtp-coder-agent it can stay off for nodesdk tests ONLY
There was a problem hiding this comment.
Done — turned the panic back on for the Rust workspace tests while keeping it off for node-sdk only.
Since both the workspace tests and node-sdk compile xmtp_db with test-utils, the split is made at the test-runner environment rather than at compile time:
- CI workspace tests: set
XMTP_PANIC_ON_DB_LOCK=trueinnix/package/nextest.nix(covers the v3 + d14ntest-workspacejob). Verified withnix eval .#nextest.x86_64-linux.v3.XMTP_PANIC_ON_DB_LOCK→true, so it reaches the derivation build env. - Local
just test: set it in the_test-v3/_test-d14nrecipes (theci/ci-d14nnextest profiles), so local workspace runs panic too. - node-sdk / node-bindings: run via
just node test*(Vitest) and never set the var → stay off by default, so no more crashed fork workers.
Commit: 36e833a. Let me know if you'd also like it on for any other CI job (e.g. test-node).
|
@xmtp-coder-agent maybe instead of this env var, node sdk should use an in memory database for tests wherever possible? is the node sdk currently reusing the same db for each of its tests rather than creating a new one? then we can keep test using the panic. the panic is very useful for debuggin race conditions and should be default on. |
…r node-sdk Per maintainer feedback on xmtp#3767: the `database is locked` panic is a useful diagnostic for the Rust workspace test suites and should keep firing there — it must only stay off for the node-sdk Vitest suite, where it crashes fork workers (the original bug). Both paths compile xmtp_db with `test-utils`, so the distinction is made at the test-runner environment rather than at compile time: - Set XMTP_PANIC_ON_DB_LOCK=true for the workspace nextest run in CI (nix/package/nextest.nix → the v3 + d14n test-workspace job). Verified via `nix eval` that the var reaches the derivation build environment. - Set it for the local `just test` path (justfile _test-v3 / _test-d14n, which use the ci / ci-d14n nextest profiles). The node-sdk / node-bindings JS suites run via `just node test*` (Vitest) and never set the var, so they keep the default-off behavior. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves #3765
Problem
TestInstrumentation(crates/xmtp_db/src/encrypted_store/database/instrumentation.rs) calledpanic!("database locked")on any query that failed withdatabase is locked. This instrumentation is installed on every SQLite connection whenever thetestcfg or thetest-utilsfeature is active — and the node-sdk Vitest suite (bindings_node) enablestest-utils.Under concurrent test load (CI node-sdk shard 2), Vitest fork workers transiently contend on SQLite locks. Each
database is lockederror panicked on anr2d2-worker/tokio-rt-workerthread, aborting the worker process and surfacing as flakyWorker exited unexpectedlyfailures with a non-zero Vitest exit code.database is lockedis not fatal in this codebase: SQLite is configured withPRAGMA busy_timeout = 5000and the error is explicitly classified as retryable (PlatformStorageError::DatabaseLocked => true;ClientErrorretries on"database is locked"). The panic short-circuited that designed retry/backoff machinery, turning a recoverable transient error into a hard crash.Fix
Gate the panic behind an opt-in
XMTP_PANIC_ON_DB_LOCKenvironment variable (accepts1/true, matching the existingSQLCIPHER_LOGconvention):XMTP_PANIC_ON_DB_LOCK=1: preserve the legacy panic for developers explicitly debugging lock contention.The env read is split from a pure decision helper (
panic_on_db_lock_from_env) so the opt-in logic is unit-tested without mutating the process environment (Rust 2024set_varis unsafe/racy).Unchanged behavior
DatabaseLockedremains classified as retryable.Testing
instrumentation.rscover: panic disabled by default (unset env), opt-in values (1/true), and that other values (0,false, empty,TRUE, etc.) do not enable it.cargo test -p xmtp_db --lib instrumentation→ 3 passed.cargo clippy -p xmtp_db --all-features --tests→ clean.cargo fmt -p xmtp_db --check→ clean.🤖 Generated with Claude Code
Note
Make
TestInstrumentationdatabase-lock panic opt-in viaXMTP_PANIC_ON_DB_LOCKenv varTestInstrumentationnow requiresXMTP_PANIC_ON_DB_LOCK=1orXMTP_PANIC_ON_DB_LOCK=true; previously any occurrence would panic unconditionally.panic_on_db_lock_enabledandpanic_on_db_lock_from_envhelpers in instrumentation.rs to read and evaluate the env var.XMTP_PANIC_ON_DB_LOCK=truein thejustfiletest recipes and the nextest.nix package so CI and local test runs still opt in.Macroscope summarized 36e833a.