Retry user-initiated splices across restarts and disconnects#930
Retry user-initiated splices across restarts and disconnects#930jkczyz wants to merge 7 commits into
Conversation
|
👋 Hi! I see this is a draft PR. |
3ae2507 to
6098d0e
Compare
joostjager
left a comment
There was a problem hiding this comment.
The PR description explains that LDK currently does not durably record a splice until signature exchange, but I don’t think it explains why ldk-node should therefore become the owner of that durability?
This seems similar to the functionality added in #882: common enough, and close enough to channel/protocol state, that it feels like it should live in LDK if we want the behavior to be durable. Persisting it in ldk-node means we now have another store tracking protocol state alongside ChannelManager/ChannelMonitor, plus reconciliation logic to infer whether LDK still has or no longer has the splice. That creates desync risk between persistence layers.
I can see ldk-node owning product policy around retries or how/when to surface failures, but the durable record of an accepted splice contribution or in-flight splice intent feels like it should be owned by LDK.
|
This will now need a (likely rather considerable) rebase now that #888 and a few related PRs landed. |
Funding broadcasts are classified into payment records off the broadcaster's queue, which can run after wallet sync has already recorded the transaction -- for instance when LDK re-broadcasts a still-pending funding transaction on restart. In that case the classification overwrote a record wallet sync had already advanced, downgrading a confirmed or graduated funding payment back to unconfirmed/pending. Merge only the classification and our contribution figures into an existing record, leaving the confirmation state that the wallet-sync events own in place. Raised by Codex in the review of lightningdevkit#888. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A funding payment's id is anchored to its first negotiated candidate, but once it confirms the record is stamped with the candidate that actually confirmed. After it graduates and its pending-store entry is removed, a later reorg event carrying the confirmed candidate's txid could no longer be resolved to the payment, so it was recorded as a duplicate generic on-chain payment rather than reverting the splice payment. RBF splices, whose confirmed candidate differs from the first, are the reachable case. Resolve such an event against the payment store by on-chain txid once the pending store no longer holds it, and revert the funding payment to pending so wallet sync re-graduates it when it reconfirms. Raised by Codex in the review of lightningdevkit#888. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Retrying a user-initiated splice across restarts requires persisting the splice intent before handing it to LDK, which happens before negotiation and therefore before any funding transaction exists. The pending-payment record was built around an on-chain PaymentDetails carrying a txid, which cannot represent a splice that has not been broadcast yet. Reshape PendingPaymentDetails into an enum: a PendingSplice variant that holds only the generated PaymentId and the splice intent, and a Tracked variant that is the previous record plus an optional intent retained until the splice locks. Add the SpliceIntent and SpliceKind types the intent needs to resubmit or rebuild the contribution. This is groundwork; nothing constructs a PendingSplice yet. The classify, retry, and wiring that use it follow in subsequent commits. Generated with assistance from Claude Code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A user-initiated splice will be keyed by a PaymentId generated at splice time rather than derived from a candidate's txid, so its retry intent, funding payment, and candidate history all share one record. Teach the classifier to find a pre-broadcast splice intent by its channel and reuse that id, promoting the intent record to a tracked funding payment while preserving the intent until the splice locks. Splices we did not originate (counterparty-initiated or V2 dual-funded opens) keep deriving the id from the first candidate's txid via a fallback. Also map any candidate txid back to the record in find_payment_by_txid, since a splice under a generated id is no longer found by the txid-derived lookup and an earlier RBF candidate may be the one that confirms. No splice intents are created yet, so behavior is unchanged; the splice entry points that persist them follow. Generated with assistance from Claude Code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LDK abandons an in-progress splice negotiation whenever the peer disconnects -- which includes stopping the node -- and only durably records a splice once its negotiation reaches signing. A splice dropped before then, after splice_in, splice_out, or bump_channel_funding_fee returned Ok, is therefore silently lost across a restart or an ill-timed disconnect. Persist a splice intent before handing the contribution to LDK, keyed by a PaymentId generated at splice time and reusing the channel's existing intent record when one is present, so a splice and its fee bumps share one id and at most one intent exists per channel. At startup a reconciler probes each intent against LDK's live channel state and resubmits any LDK dropped -- including those lost to a crash before LDK persisted anything -- surfacing SpliceNegotiationFailed only when the channel is gone, a fee bump has nothing left to replace, or the resubmission budget is exhausted. Resubmitting does not require the peer to be connected: LDK holds the contribution and initiates quiescence on reconnect. A payment-tracking merge (e.g. from wallet sync) must leave a live intent untouched. The splice tests now locate a funding payment by its candidate txid, since its id is generated rather than derived from the funding txid. Generated with assistance from Claude Code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A user-initiated splice can fail mid-negotiation while the node is running -- the peer disconnects, or the contribution goes stale behind a competing negotiation -- and LDK reports each such round via SpliceNegotiationFailed. Drive those events through the splice retrier: resubmit the same contribution when the peer merely disconnected, rebuild a fresh one when it went stale, and give up (surfacing the failure) only for a non-retriable reason or once the resubmission budget is exhausted, using LDK's own is_retriable classification. Clear a splice's intent once the channel locks its new funding or the channel closes. Event::SpliceNegotiationFailed is now emitted only when a splice is finally abandoned, not for every failed negotiation round, since a recoverable failure is retried transparently. Generated with assistance from Claude Code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add integration coverage for resuming a dropped splice: splice_resumed_after_restart initiates a splice-out while disconnected, restarts the node before anything is negotiated, and asserts the reconciler resumes and completes the splice -- and that a second restart does not resubmit the now-locked splice. splice_rbf_resumed_after_restart does the same for a fee bump. Document on splice_in, splice_out, and bump_channel_funding_fee that the splice is retried automatically across restarts until it completes or is given up on. Generated with assistance from Claude Code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6098d0e to
c005623
Compare
LDK abandons an in-progress splice negotiation whenever the peer disconnects (which includes stopping the node) and only durably records a splice once it reaches signing. Between calling
splice_in/splice_out/bump_channel_funding_feeand that point, a restart or an ill-timed disconnect silently drops the splice with no way to recover it.This makes those splices durable and self-healing: the intent is persisted before the contribution is handed to LDK, and a startup reconciler plus a
SpliceNegotiationFailedhandler resubmit it (gated on LDK'sNegotiationFailureReason::is_retriable) until the splice locks or is genuinely unrecoverable. As a resultSpliceNegotiationFailedis now emitted only once a splice is given up on, rather than for every failed negotiation round.Rather than adding a dedicated store, the intent lives in the existing
PendingPaymentStoreunder aPaymentIdgenerated at splice time — which becomes the splice's payment id, replacing the previous first-candidate-txid derivation. That required modeling the pending record as an enum, since a not-yet-broadcast splice has no funding transaction, and therefore noPaymentDetails, yet. The change is scoped to user-initiated splices; counterparty-initiated splices and V2 opens are untouched and keep the txid-derived id.Restart resumption is covered by two integration tests (a splice-out and an RBF fee-bump), alongside unit coverage of the retry-decision matrix; the existing splice/funding/on-chain suites continue to pass.
Based on #962.