
[bug]: sweeper batched 4 force-close outputs into invalid tx, then removed all inputs as State=Fatal #10840

Description

@openoms

Pre-Submission Checklist

  • I have searched the existing issues and believe this is a new bug.
  • I am not asking a question about how to use lnd, but reporting a bug (otherwise open a discussion).

LND Version

0.20.0-beta

LND Configuration

restlisten=0.0.0.0:8080
rpclisten=0.0.0.0:10009
listen=0.0.0.0:9735
prometheus.listen=0.0.0.0:9092

bitcoin.active=1
bitcoin.mainnet=1
bitcoin.node=bitcoind
bitcoin.basefee=<configured>
bitcoin.feerate=<configured>
bitcoin.timelockdelta=60
bitcoind.rpchost=<internal-bitcoind-service>:8332
bitcoind.rpcuser=<redacted>
bitcoind.rpcpass=<redacted>
bitcoind.zmqpubrawblock=tcp://<internal-bitcoind-service>:28332
bitcoind.zmqpubrawtx=tcp://<internal-bitcoind-service>:28333

db.backend=postgres
db.use-native-sql=true
db.postgres.dsn=<redacted>
db.postgres.timeout=<configured>
db.postgres.maxconnections=<configured>

accept-keysend=1
allow-circular-route=1
stagger-initial-reconnect=1
protocol.wumbo-channels=1
maxchansize=500000000
max-commit-fee-rate-anchors=100
default-remote-max-htlcs=50
debuglevel=info
prometheus.enable=1
gc-canceled-invoices-on-the-fly=true
gc-canceled-invoices-on-startup=true
tor.active=true
tor.v3=true
tor.skip-proxy-for-clearnet-targets=true
minchansize=<configured>
alias=<redacted>

Backend Version

Bitcoin Core v28

Backend Configuration

debug=mempool
debug=rpc
shrinkdebugfile=1
server=1
txindex=1
blockfilterindex=1
printtoconsole=1
rpcuser=<redacted>
rpcpassword=<redacted>
zmqpubrawtx=tcp://0.0.0.0:28333
zmqpubrawblock=tcp://0.0.0.0:28332
bind=<redacted>
rpcbind=<redacted>
rpcallowip=<redacted>

OS/Distribution

Kubernetes, deployed with https://github.com/blinkbitcoin/charts/tree/main/charts/lnd

Bug Details & Steps to Reproduce

On a mainnet LND node using Postgres for channel state, three force-closed
channels remained in limbo. After restart, LND re-registered the pending sweep
inputs. A later lncli wallet bumpfee --sat_per_vbyte 2 on one HTLC
second-level output caused the sweeper to build a single 4-input transaction
covering:

  • one COMMITMENT_TIME_LOCK output
  • two COMMITMENT_TO_REMOTE_CONFIRMED outputs
  • one HTLC_ACCEPTED_SUCCESS_SECOND_LEVEL output

Bitcoin Core rejected the transaction with:

mempool rejection: mandatory script verify flag failed

LND then removed all four inputs from the sweeper as State=Fatal. I am trying
to understand whether one invalid input poisoned the full batch, whether these
input types should have been isolated, and what the safest recovery path is for
a Postgres-backed node.

Environment

  • LND: 0.20.0-beta commit=v0.20.0-beta
  • LND commit hash: b9ea7070c20ad2ca8514a47d9b4d560a501f0487
  • Network: Bitcoin mainnet
  • Chain backend: Bitcoin Core
  • Bitcoin Core: /Satoshi:28.0.0/
  • Bitcoin Core relay fee: 0.00001000
  • Bitcoin Core incremental fee: 0.00001000
  • LND channel state backend: Postgres
  • Node alias/pubkey: redacted

Force-close channels in limbo

According to lncli pendingchannels, the total limbo balance was about 0.676 BTC.

Channel   | Closing tx | Limbo balance   | Anchor    | Notes
Channel A | redacted   | about 0.072 BTC | LOST      | no pending HTLCs
Channel B | redacted   | about 0.551 BTC | RECOVERED | one incoming stage 2 HTLC for 8601 sat
Channel C | redacted   | about 0.053 BTC | LOST      | no pending HTLCs

Pending sweep inputs after restart

After restart, lncli wallet pendingsweeps showed four pending inputs:

Input   | Witness type                       | Amount
Input A | COMMITMENT_TO_REMOTE_CONFIRMED     | about 0.072 BTC
Input B | COMMITMENT_TO_REMOTE_CONFIRMED     | about 0.053 BTC
Input C | COMMITMENT_TIME_LOCK               | about 0.551 BTC
Input D | HTLC_ACCEPTED_SUCCESS_SECOND_LEVEL | 8601 sat
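
As a quick cross-check that these four inputs account for the limbo balance reported by pendingchannels, here is a minimal sketch that sums both sets of figures. The amounts are the rounded "about" values from this report, so the comparison is only approximate, and the variable names are illustrative.

```python
from decimal import Decimal

SATS_PER_BTC = Decimal(100_000_000)

# Rounded limbo balances per channel, as reported by pendingchannels.
channel_limbo_btc = {
    "Channel A": Decimal("0.072"),
    "Channel B": Decimal("0.551"),
    "Channel C": Decimal("0.053"),
}

# Rounded amounts per pending sweep input, as reported by pendingsweeps.
sweep_inputs_btc = {
    "Input A": Decimal("0.072"),
    "Input B": Decimal("0.053"),
    "Input C": Decimal("0.551"),
    "Input D": Decimal(8601) / SATS_PER_BTC,  # 8601 sat HTLC output
}

limbo_total = sum(channel_limbo_btc.values(), Decimal(0))
inputs_total = sum(sweep_inputs_btc.values(), Decimal(0))
diff_sats = int((inputs_total - limbo_total) * SATS_PER_BTC)

print(f"limbo total:  {limbo_total} BTC")
print(f"inputs total: {inputs_total} BTC")
print(f"difference:   {diff_sats} sat")
```

With the rounded figures, the two totals differ by exactly the 8601 sat HTLC amount. Because the channel balances are approximate, this does not show whether the HTLC is already included in Channel B's limbo figure.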

Triggering action

I bumped the fee for the HTLC second-level output:

lncli wallet bumpfee --sat_per_vbyte 2 <htlc-second-level-outpoint>

LND accepted it:

"Successfully registered rbf-tx with sweeper"

On the next block, the sweeper attempted a 4-input batch.
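
For reference, the 2 sat/vbyte passed to bumpfee and the 500 sat/kw that appears in the sweeper log are the same fee rate in different units, since one vbyte is four weight units. A minimal sketch of the conversion (the function names are illustrative and not part of LND):

```python
# 1 vbyte = 4 weight units, so 1 kw (1000 weight units) = 250 vbytes.
WEIGHT_UNITS_PER_VBYTE = 4


def sat_per_vbyte_to_sat_per_kw(sat_per_vbyte: float) -> float:
    """Convert sat/vbyte (the lncli flag unit) to sat/kw (the sweeper log unit)."""
    return sat_per_vbyte * 1000 / WEIGHT_UNITS_PER_VBYTE


def sat_per_kw_to_sat_per_vbyte(sat_per_kw: float) -> float:
    """Convert sat/kw back to sat/vbyte."""
    return sat_per_kw * WEIGHT_UNITS_PER_VBYTE / 1000


print(sat_per_vbyte_to_sat_per_kw(2))    # 500.0
print(sat_per_kw_to_sat_per_vbyte(500))  # 2.0
```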

Relevant log excerpt

The full outpoints are redacted here, but the important sequence was:

[INF] WLKT: [BumpFee]: bumping fee for existing input=<htlc-second-level-outpoint>, new params=startingFeeRate={true 500}
[DBG] SWPR: Received new block: height=<height>, attempt sweeping 4 inputs:
<commitment-time-lock-outpoint> (CommitmentTimeLock)
<commitment-to-remote-confirmed-outpoint-a> (CommitmentToRemoteConfirmed)
<commitment-to-remote-confirmed-outpoint-b> (CommitmentToRemoteConfirmed)
<htlc-second-level-outpoint> (HtlcAcceptedSuccessSecondLevel)
[DBG] SWPR: Creating sweep tx for 4 inputs (...) using 500 sat/kw
[DBG] SWPR: Created sweep tx <sweep-txid> for inputs:
<commitment-time-lock-outpoint> (CommitmentTimeLock)
<commitment-to-remote-confirmed-outpoint-a> (CommitmentToRemoteConfirmed)
<commitment-to-remote-confirmed-outpoint-b> (CommitmentToRemoteConfirmed)
<htlc-second-level-outpoint> (HtlcAcceptedSuccessSecondLevel)
[DBG] SWPR: Failed to create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[ERR] SWPR: Initial broadcast failed: create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[DBG] SWPR: Sending result [Event=Fatal] for requestID=1
[ERR] SWPR: Failed to sweep input: <commitment-time-lock-outpoint> (CommitmentTimeLock), error: create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[ERR] SWPR: Failed to sweep input: <commitment-to-remote-confirmed-outpoint-a> (CommitmentToRemoteConfirmed), error: create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[ERR] SWPR: Failed to sweep input: <commitment-to-remote-confirmed-outpoint-b> (CommitmentToRemoteConfirmed), error: create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[ERR] SWPR: Failed to sweep input: <htlc-second-level-outpoint> (HtlcAcceptedSuccessSecondLevel), error: create RBF-compliant tx: tx=<sweep-txid> failed mempool check: mempool rejection: mandatory script verify flag failed
[DBG] SWPR: Removing input(State=Fatal) <commitment-to-remote-confirmed-outpoint-a> from sweeper
[DBG] SWPR: Removing input(State=Fatal) <commitment-to-remote-confirmed-outpoint-b> from sweeper
[DBG] SWPR: Removing input(State=Fatal) <commitment-time-lock-outpoint> from sweeper
[DBG] SWPR: Removing input(State=Fatal) <htlc-second-level-outpoint> from sweeper

After this, the next block log showed:

[DBG] SWPR: Received new block: height=<height>, attempt sweeping 0 inputs:
[DBG] SWPR: Sweeping 0 inputs
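
To pull the affected inputs out of a longer log, a small script along these lines may help. It is a sketch that assumes the SWPR line formats shown in the excerpt above; the regular expressions and the function name are my own and may need adjusting for other LND versions.

```python
import re
from collections import defaultdict

# Patterns based on the SWPR log lines quoted in this report.
FAILED_RE = re.compile(r"Failed to sweep input: (\S+) \((\w+)\)")
REMOVED_RE = re.compile(r"Removing input\(State=(\w+)\) (\S+) from sweeper")


def summarize_sweeper_log(log_text):
    """Return (failed, removed).

    failed maps outpoint -> witness type for every "Failed to sweep input" line.
    removed maps state -> list of outpoints for every "Removing input" line.
    """
    failed = {}
    removed = defaultdict(list)
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            failed[match.group(1)] = match.group(2)
            continue
        match = REMOVED_RE.search(line)
        if match:
            removed[match.group(1)].append(match.group(2))
    return failed, dict(removed)


sample = """\
[ERR] SWPR: Failed to sweep input: <htlc-second-level-outpoint> (HtlcAcceptedSuccessSecondLevel), error: create RBF-compliant tx
[DBG] SWPR: Removing input(State=Fatal) <htlc-second-level-outpoint> from sweeper
"""
failed, removed = summarize_sweeper_log(sample)
print(failed)
print(removed)
```

Comparing the two results shows whether every input that failed in the batch was also removed, which is what happened here for all four inputs.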

Actual behavior

The sweeper created a 4-input transaction. Bitcoin Core rejected it with:

mandatory script verify flag failed

All four inputs were then marked fatal and removed from the sweeper, including
the three large commitment outputs. The limbo channels remained listed by
lncli pendingchannels, but lncli wallet pendingsweeps became empty.

Expected Behavior

I expected one of these outcomes:

  • LND constructs a valid sweep transaction for the mature commitment outputs.
  • If one input fails script verification, LND isolates that input or avoids
    marking unrelated inputs fatal.
  • If HTLC_ACCEPTED_SUCCESS_SECOND_LEVEL is unsafe to batch with commitment
    outputs, LND keeps it separate from COMMITMENT_TIME_LOCK and
    COMMITMENT_TO_REMOTE_CONFIRMED outputs.
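
To make the second outcome concrete, here is a minimal sketch of the isolation I have in mind: when a batch is rejected, check each input on its own and only treat the ones that fail individually as fatal. This is purely illustrative and is not how LND's sweeper is implemented. The `accepts` callback is a hypothetical stand-in for a mempool acceptance check, and the input names in the example are made up.

```python
def isolate_failing_inputs(inputs, accepts):
    """Split inputs into (ok, failing) by testing each one in its own batch.

    accepts is a hypothetical callback that takes a list of inputs and
    returns True if a transaction spending exactly those inputs would be
    accepted.
    """
    ok, failing = [], []
    for inp in inputs:
        if accepts([inp]):
            ok.append(inp)
        else:
            failing.append(inp)
    return ok, failing


# Example with a stub check in which one made-up input is invalid.
batch = ["input-1", "input-2", "input-3", "input-bad"]
ok, failing = isolate_failing_inputs(batch, lambda b: "input-bad" not in b)
print(ok)       # ['input-1', 'input-2', 'input-3']
print(failing)  # ['input-bad']
```

With this approach the valid inputs could be retried in a new batch, and only the input that fails on its own would be removed.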

Questions

  1. Does this look like one bad input poisoning a sweep batch, or does the log
    indicate all four witnesses were invalid?
  2. Should an HTLC_ACCEPTED_SUCCESS_SECOND_LEVEL input be batched with
    COMMITMENT_TIME_LOCK and COMMITMENT_TO_REMOTE_CONFIRMED inputs?
  3. Is it expected that all inputs in a batch are removed as State=Fatal after
    one batched transaction fails script verification?
  4. For a Postgres-backed node where lncli wallet pendingsweeps is now empty
    but lncli pendingchannels still shows the limbo channels, is there a safe
    way to make LND retry these sweeps?
  5. If LND cannot retry, which manual recovery path is recommended for each
    witness type? I am considering Chantools dry-runs only, without --publish,
    and then validating any raw tx with bitcoin-cli testmempoolaccept before
    any broadcast.
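
For the validation step in question 5, this is roughly the check I plan to run on any raw transaction before broadcasting. It is a sketch that assumes bitcoin-cli is on the PATH and already configured for the node. The helper names are my own, and the sample result is made up to show the shape of the testmempoolaccept output.

```python
import json
import subprocess


def interpret_testmempoolaccept(result_json):
    """Return (allowed, reject_reason) for the first tx in the RPC result."""
    entry = json.loads(result_json)[0]
    return bool(entry.get("allowed", False)), entry.get("reject-reason")


def check_raw_tx(raw_tx_hex, bitcoin_cli="bitcoin-cli"):
    """Run testmempoolaccept on one raw transaction without broadcasting it."""
    completed = subprocess.run(
        [bitcoin_cli, "testmempoolaccept", json.dumps([raw_tx_hex])],
        capture_output=True,
        text=True,
        check=True,
    )
    return interpret_testmempoolaccept(completed.stdout)


# Made-up sample result, only to illustrate the parsing.
sample_result = '[{"txid": "<sweep-txid>", "allowed": false, "reject-reason": "example-reason"}]'
print(interpret_testmempoolaccept(sample_result))  # (False, 'example-reason')
```

Only a transaction for which `allowed` is true would be considered for broadcast, and each recovery transaction would be checked separately.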

Debug Information

I can provide sanitized logs and exact outpoints privately on Slack or through another suitable channel.

