Skip to content

fix(xmtp_db): make TestInstrumentation database-lock panic opt-in#3767

Draft
xmtp-coder-agent wants to merge 3 commits into
xmtp:mainfrom
xmtp-coder-agent:fix/issue-3765
Draft

fix(xmtp_db): make TestInstrumentation database-lock panic opt-in#3767
xmtp-coder-agent wants to merge 3 commits into
xmtp:mainfrom
xmtp-coder-agent:fix/issue-3765

Conversation

@xmtp-coder-agent

@xmtp-coder-agent xmtp-coder-agent commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Resolves #3765

Problem

TestInstrumentation (crates/xmtp_db/src/encrypted_store/database/instrumentation.rs) called panic!("database locked") on any query that failed with database is locked. This instrumentation is installed on every SQLite connection whenever the test cfg or the test-utils feature is active — and the node-sdk Vitest suite (bindings_node) enables test-utils.

Under concurrent test load (CI node-sdk shard 2), Vitest fork workers transiently contend on SQLite locks. Each database is locked error panicked on an r2d2-worker/tokio-rt-worker thread, aborting the worker process and surfacing as flaky Worker exited unexpectedly failures with a non-zero Vitest exit code.

database is locked is not fatal in this codebase: SQLite is configured with PRAGMA busy_timeout = 5000 and the error is explicitly classified as retryable (PlatformStorageError::DatabaseLocked => true; ClientError retries on "database is locked"). The panic short-circuited that designed retry/backoff machinery, turning a recoverable transient error into a hard crash.

Fix

Gate the panic behind an opt-in XMTP_PANIC_ON_DB_LOCK environment variable (accepts 1/true, matching the existing SQLCIPHER_LOG convention):

  • Default (unset): log the query error at error level and return normally — the existing busy_timeout/retry machinery recovers. No worker crashes.
  • XMTP_PANIC_ON_DB_LOCK=1: preserve the legacy panic for developers explicitly debugging lock contention.

The env read is split from a pure decision helper (panic_on_db_lock_from_env) so the opt-in logic is unit-tested without mutating the process environment (Rust 2024 set_var is unsafe/racy).

Unchanged behavior

  • Query-error logging (message/table/column/hint) is preserved.
  • DatabaseLocked remains classified as retryable.
  • Begin/Commit/Rollback transaction tracing is unchanged.

Testing

  • New unit tests in instrumentation.rs cover: panic disabled by default (unset env), opt-in values (1/true), and that other values (0, false, empty, TRUE, etc.) do not enable it.
  • cargo test -p xmtp_db --lib instrumentation → 3 passed.
  • cargo clippy -p xmtp_db --all-features --tests → clean.
  • cargo fmt -p xmtp_db --check → clean.

🤖 Generated with Claude Code

Note

Make TestInstrumentation database-lock panic opt-in via XMTP_PANIC_ON_DB_LOCK env var

  • Panicking on SQLite "database is locked" errors in TestInstrumentation now requires XMTP_PANIC_ON_DB_LOCK=1 or XMTP_PANIC_ON_DB_LOCK=true; previously any occurrence would panic unconditionally.
  • Adds panic_on_db_lock_enabled and panic_on_db_lock_from_env helpers in instrumentation.rs to read and evaluate the env var.
  • Sets XMTP_PANIC_ON_DB_LOCK=true in the justfile test recipes and the nextest.nix package so CI and local test runs still opt in.
  • Fixes a flaky test in stream_messages.rs by filtering the message stream to application messages only, avoiding races with membership-change commits.
  • Behavioral Change: the default behavior no longer panics on lock contention; environments outside the configured test runners must explicitly set the env var to preserve the old behavior.

Macroscope summarized 36e833a.

xmtp-coder-agent and others added 2 commits June 12, 2026 15:40
TestInstrumentation panicked on every `database is locked` query error
whenever the `test` cfg or `test-utils` feature was active. Because the
node-sdk Vitest suite enables `test-utils`, transient SQLite lock
contention between concurrent fork workers triggered this panic on
`r2d2-worker`/`tokio-rt-worker` threads, aborting the worker processes
and surfacing as flaky `Worker exited unexpectedly` CI failures.

`database is locked` is a retryable condition in this codebase (SQLite is
configured with `PRAGMA busy_timeout` and the error is classified
retryable). Panicking short-circuits that recovery machinery.

Gate the panic behind an opt-in `XMTP_PANIC_ON_DB_LOCK` env var (accepts
"1"/"true", matching the existing SQLCIPHER_LOG convention). Default
(unset) logs the error and returns normally, restoring the retry
behavior. Developers debugging lock contention can still opt in.

Resolves xmtp#3765

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`test_stream_messages` opened a message stream after `add_members` and
asserted the first streamed message was the application payload "hello".
The `group_updated` membership-change commit from adding bob can race
into the stream ahead of the application messages, so the assertion
intermittently saw the commit message instead of "hello" and failed
(observed in CI on this PR).

Filter the stream to `GroupMessageKind::Application` messages so the
assertions are deterministic regardless of whether the commit message is
delivered through the stream. Errors are passed through unchanged so the
existing error handling in `assert_msg!` still applies.

Verified deterministic: the test passed 10/10 consecutive runs against a
live node.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xmtp-coder-agent

Copy link
Copy Markdown
Contributor Author

CI fix: deflaked test_stream_messages

The Rust Workspace CI run surfaced a failure in subscriptions::stream_messages::tests::test_stream_messagespre-existing flakiness unrelated to the instrumentation change (this PR's only other diff is in xmtp_db).

Root cause: the test opened a message stream after add_members(bob) and asserted the first streamed message equaled "hello". The group_updated membership-change commit can race into the stream ahead of the application message, so the assertion intermittently saw the commit message (left: group_updated…, right: "hello"). The test even had a commented-out assert_msg_exists! acknowledging this commit message.

Fix: filter the stream to GroupMessageKind::Application messages so the assertions are deterministic regardless of whether the commit is delivered through the stream (errors pass through unchanged).

Verification: test_stream_messages passed 10/10 consecutive runs against a live node; cargo check --tests, clippy --tests, and fmt --check all clean for xmtp_mls.

This is in-scope with the PR's flaky-CI theme.

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.47%. Comparing base (578abea) to head (36e833a).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...db/src/encrypted_store/database/instrumentation.rs 84.21% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3767      +/-   ##
==========================================
+ Coverage   84.44%   84.47%   +0.02%     
==========================================
  Files         408      408              
  Lines       59807    59830      +23     
==========================================
+ Hits        50506    50540      +34     
+ Misses       9301     9290      -11     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

/// Returns true when the caller has explicitly opted in to panicking on
/// `database is locked` via the `XMTP_PANIC_ON_DB_LOCK` env var.
fn panic_on_db_lock_enabled() -> bool {
panic_on_db_lock_from_env(std::env::var("XMTP_PANIC_ON_DB_LOCK"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes it default off, so libxmtp workspace test and others which used to have panic on lock now have it off by default. i still want those to panic. can you turn it on for CI please? @xmtp-coder-agent it can stay off for nodesdk tests ONLY

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — turned the panic back on for the Rust workspace tests while keeping it off for node-sdk only.

Since both the workspace tests and node-sdk compile xmtp_db with test-utils, the split is made at the test-runner environment rather than at compile time:

  • CI workspace tests: set XMTP_PANIC_ON_DB_LOCK=true in nix/package/nextest.nix (covers the v3 + d14n test-workspace job). Verified with nix eval .#nextest.x86_64-linux.v3.XMTP_PANIC_ON_DB_LOCKtrue, so it reaches the derivation build env.
  • Local just test: set it in the _test-v3 / _test-d14n recipes (the ci / ci-d14n nextest profiles), so local workspace runs panic too.
  • node-sdk / node-bindings: run via just node test* (Vitest) and never set the var → stay off by default, so no more crashed fork workers.

Commit: 36e833a. Let me know if you'd also like it on for any other CI job (e.g. test-node).

@insipx

insipx commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@xmtp-coder-agent maybe instead of this env var, node sdk should use an in memory database for tests wherever possible? is the node sdk currently reusing the same db for each of its tests rather than creating a new one? then we can keep test using the panic. the panic is very useful for debuggin race conditions and should be default on.

…r node-sdk

Per maintainer feedback on xmtp#3767: the `database is locked` panic is a
useful diagnostic for the Rust workspace test suites and should keep
firing there — it must only stay off for the node-sdk Vitest suite, where
it crashes fork workers (the original bug).

Both paths compile xmtp_db with `test-utils`, so the distinction is made
at the test-runner environment rather than at compile time:

- Set XMTP_PANIC_ON_DB_LOCK=true for the workspace nextest run in CI
  (nix/package/nextest.nix → the v3 + d14n test-workspace job). Verified
  via `nix eval` that the var reaches the derivation build environment.
- Set it for the local `just test` path (justfile _test-v3 / _test-d14n,
  which use the ci / ci-d14n nextest profiles).

The node-sdk / node-bindings JS suites run via `just node test*` (Vitest)
and never set the var, so they keep the default-off behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky CI Failure: node-sdk shard 2 — database locked panics crash Vitest workers

2 participants