Skip to content

htlcswitch: Add HTLC state machine fuzz tests#7

Open
NishantBansal2003 wants to merge 1 commit into
masterfrom
fuzz-htlcswitch
Open

htlcswitch: Add HTLC state machine fuzz tests#7
NishantBansal2003 wants to merge 1 commit into
masterfrom
fuzz-htlcswitch

Conversation

@NishantBansal2003

@NishantBansal2003 NishantBansal2003 commented Nov 15, 2025

Copy link
Copy Markdown
Owner

Change Description

Topology: Local (ChannelLink) <---> Remote (ChannelLink)

This PR adds HTLC state machine fuzz tests using a 2-hop network. Based on the fuzz input, the test executes various state transitions such as Local sending HTLC adds to Remote, Remote sending adds to Local, and exchanging revoke, commit-sig, fulfill/fail, and other messages between peers.
Message ordering is controlled via a mockPeer instance, allowing us to simulate unordered commitment dance behavior and precisely manage message flow.
The primary goal is to ensure that after all HTLC operations, the channels never fall out of sync or force-close unexpectedly.

Steps to Test

go test -run="FuzzHTLCStates/seed#0" -v -race -count=10

@coveralls

coveralls commented Nov 15, 2025

Copy link
Copy Markdown

Coverage Report for CI Build 25013897283

Coverage increased (+0.009%) to 62.214%

Details

  • Coverage increased (+0.009%) from the base build.
  • Patch coverage: 30 uncovered changes across 2 files (46 of 76 lines covered, 60.53%).
  • 83 coverage regressions across 22 files.

Uncovered Changes

File Changed Covered %
htlcswitch/mock.go 55 27 49.09%
channeldb/channel.go 7 5 71.43%

Coverage Regressions

83 previously-covered lines in 22 files lost coverage.

Top 10 Files by Coverage Loss Lines Losing Coverage Coverage
contractcourt/chain_watcher.go 18 76.12%
discovery/syncer.go 9 85.6%
peer/brontide.go 7 79.15%
htlcswitch/link.go 6 77.04%
watchtower/wtclient/queue.go 6 85.92%
lnwallet/test/test_interface.go 4 78.14%
lnrpc/wtclientrpc/wtclient.go 3 62.61%
lnwire/closing_signed.go 3 76.79%
lnwire/dyn_propose.go 3 82.76%
lnwire/pong.go 3 84.21%

Coverage Stats

Coverage Status
Relevant Lines: 230856
Covered Lines: 143624
Line Coverage: 62.21%
Coverage Strength: 19126.57 hits per line

💛 - Coveralls

@NishantBansal2003

NishantBansal2003 commented Nov 16, 2025

Copy link
Copy Markdown
Owner Author

The only drawback of these fuzz tests is that they can be a bit flaky (on restart). The htlcswitch relies on many goroutines, so there are cases where some goroutines may not finish in time, and the Sleep duration we set may not be enough. One way to fix this is to increase the Sleep delay, but that trades off against the overall fuzzing timeout (we might get: fuzzing process hung or terminated unexpectedly: exit status 2). So the flakiness is something we need to reconsider, balancing sleep duration and fuzz time.

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

One more thing -- fuzzing is only effective with a lot of CPU time, so the reason I’m hitting test flake sometimes might be due to the limited runtime resources on my machine. It would be better to validate this on a different machine with more CPU capacity to ensure correctness.

@morehouse

Copy link
Copy Markdown

I tried running the fuzz test and encountered a big issue: it's too slow! Running a single input can take ~1 minute, which is way too slow for fuzzing to be useful. Ideally we like to see thousands of executions per second, but we often tolerate 10s or 100s per second for more complex targets. And the target often stops altogether with "fuzzing process hung or terminated unexpectedly" errors.

I noticed a couple 100 ms sleeps in the target, which we should eliminate if possible. It would also be very helpful to profile the target and figure out where most of the time is getting spent. Perhaps there's some restructuring we can do to improve performance.

I suspect we'd have a much better time if we try to fuzz the channel link directly instead of through mock nodes. That would eliminate a lot of setup and teardown time and reduce latency from passing messages around. I'm not sure exactly what that would look like, but we'd probably need to mock lots of stuff the channel link depends on, and we might need to refactor the link a bit to facilitate testing/fuzzing.

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

I tried running the fuzz test and encountered a big issue: it's too slow! Running a single input can take ~1 minute, which is way too slow for fuzzing to be useful. Ideally we like to see thousands of executions per second, but we often tolerate 10s or 100s per second for more complex targets. And the target often stops altogether with "fuzzing process hung or terminated unexpectedly" errors.

Yes, I was concerned about this issue. The main reason I tried such a heavy setup was to exercise restart behaviour in the fuzz state machine, but now I think that will only add more startup overhead to the fuzz tests.
The biggest factor reducing execs/sec is the large number of goroutines involved in link setup. For each HTLC state transition, the main process has to iterate through all goroutines, which naturally slows things down.

I suspect we'd have a much better time if we try to fuzz the channel link directly instead of through mock nodes. That would eliminate a lot of setup and teardown time and reduce latency from passing messages around. I'm not sure exactly what that would look like, but we'd probably need to mock lots of stuff the channel link depends on, and we might need to refactor the link a bit to facilitate testing/fuzzing.

Yes, to reduce setup overhead, I switched to a simple 2-hop network with only channelLink instances. I also mocked the mailbox, since passing messages through the real mailbox introduces additional delays (goroutines) during fuzzing, So now the functions are invoked directly.
Even with these changes, I’m still seeing around 1–10 execs/sec. I suspect the remaining overhead comes from DB writes and the initial setup, which I measured to be roughly ~100ms. I’m not sure of the best way to eliminate that. But with the current setup, removing most asynchronous calls and switching to direct synchronous calls has definitely fixed the occasional “fuzzing process hung or terminated unexpectedly” issue.

Can you check now and see if there are any other areas which can be fixed?

@morehouse

Copy link
Copy Markdown

This is a good start. The timeouts are gone, and I'm getting ~20 execs/s. I think this is a good direction to move in.

The next step I would take would be to profile everything -- CPU, goroutine, and heap, looking for ways to increase speed, cut wait times, and reduce memory usage. With 16 parallel workers, the fuzz target was peaking around 90 GB of memory on my machine. I can't run with all 32 cores on my machine or it runs out of memory. Maybe we won't be able to reduce that too much, but it's worth checking to see if there's any easy wins.

Also, it's important to check the code coverage after you've built up a decent corpus, to make sure the fuzz target is executing the expected code paths.

I think there's a lot of other improvements we can make to this fuzz target -- currently it is quite limited in the sequences it can generate and the corner cases it can trigger. But I think first we should optimize the general approach.

@morehouse

Copy link
Copy Markdown

After running for a while, I got this crash:

--- FAIL: FuzzHTLCStates (0.00s)
    --- FAIL: FuzzHTLCStates/ea357960e1bde8a0 (0.00s)
        fuzz_test.go:555: 
            	Error Trace:	/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:337
            	            				/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:348
            	            				/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:387
            	            				/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:495
            	            				/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:555
            	            				/usr/lib/golang/src/reflect/value.go:584
            	            				/usr/lib/golang/src/reflect/value.go:368
            	            				/usr/lib/golang/src/testing/fuzz.go:340
            	Error:      	Target error should be in err chain:
            	            	expected: "invoice with payment hash already exists"
            	            	in chain: "cannot use hash of all-zeroes preimage"
            	Test:       	FuzzHTLCStates/ea357960e1bde8a0
FAIL

Input:

$ cat testdata/fuzz/FuzzHTLCStates/ea357960e1bde8a0
go test fuzz v1
[]byte("0100\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")

It appears we're attempting to create an invalid invoice on the "victim" node, which gets caught. Probably we should just avoid doing this, since an attacker wouldn't get to choose the preimage anyway.

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

The next step I would take would be to profile everything -- CPU, goroutine, and heap, looking for ways to increase speed, cut wait times, and reduce memory usage. With 16 parallel workers, the fuzz target was peaking around 90 GB of memory on my machine. I can't run with all 32 cores on my machine or it runs out of memory. Maybe we won't be able to reduce that too much, but it's worth checking to see if there's any easy wins.

I profiled the target, and from this we can see that the most resource consuming operations are the db updates (we are using the actual bolt db in our tests). For most of the other functions, the majority of the time is also spent in db update calls:

(pprof) top -cum    
Showing nodes accounting for 27.68s, 22.85% of 121.12s total
Dropped 847 nodes (cum <= 0.61s)
Showing top 20 nodes out of 332
      flat  flat%   sum%        cum   cum%
         0     0%     0%     81.26s 67.09%  testing.tRunner
         0     0%     0%     78.37s 64.70%  github.com/lightningnetwork/lnd/htlcswitch.FuzzHTLCStates.func1
         0     0%     0%     78.37s 64.70%  reflect.Value.Call
         0     0%     0%     78.37s 64.70%  reflect.Value.call
         0     0%     0%     78.37s 64.70%  testing.(*F).Fuzz.func1.1
         0     0%     0%     55.11s 45.50%  github.com/lightningnetwork/lnd/htlcswitch.runHTLCFuzzStateMachine
         0     0%     0%     44.66s 36.87%  github.com/lightningnetwork/lnd/htlcswitch.(*fuzzNetwork).exchangeUpdates
         0     0%     0%     44.64s 36.86%  github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg
         0     0%     0%     34.48s 28.47%  github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update
         0     0%     0%     34.48s 28.47%  github.com/lightningnetwork/lnd/kvdb.Update (inline)
     0.04s 0.033% 0.033%     28.50s 23.53%  runtime.mallocgc
    27.63s 22.81% 22.85%     27.63s 22.81%  runtime.memclrNoHeapPointers
         0     0% 22.85%     25.24s 20.84%  github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).processRemoteRevokeAndAck
         0     0% 22.85%     23.26s 19.20%  github.com/lightningnetwork/lnd/htlcswitch.setupFuzzNetwork
         0     0% 22.85%     21.31s 17.59%  runtime.systemstack
         0     0% 22.85%     20.61s 17.02%  github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTx
         0     0% 22.85%     20.61s 17.02%  github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTxOrFail
     0.01s 0.0083% 22.85%     20.30s 16.76%  github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).SignNextCommitment
         0     0% 22.85%     19.29s 15.93%  github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).processRemoteCommitSig
         0     0% 22.85%     18.86s 15.57%  github.com/lightningnetwork/lnd/htlcswitch.createTestChannel

I haven’t looked into it yet, but would using an in-memory db for these operations help reduce this overhead?

@NishantBansal2003

NishantBansal2003 commented Dec 9, 2025

Copy link
Copy Markdown
Owner Author

I wrote a basic mock in-memory database to avoid disk backed DB calls. This has significantly improved execs/sec and also reduced CPU wait time and memory usage. I think we can discuss whether this approach makes sense overall.
The reason I implemented this mock DB is that based on profiling most of the time-consuming calls were coming from database writes. Replacing them with an in-memory version helped a lot.
I also verified with fuzz seed #0 and the full HTLC dance runs correctly with this DB. Here is the log:

=== RUN   FuzzHTLCStates
=== RUN   FuzzHTLCStates/seed#0
    fuzz_test.go:559: Remote -> Local: UpdateAddHTLC
    fuzz_test.go:559: Remote -> Local: CommitSig
    fuzz_test.go:559: Local -> Remote: RevokeAndAck
    fuzz_test.go:559: Local -> Remote: CommitSig
    fuzz_test.go:559: Remote -> Local: RevokeAndAck
    fuzz_test.go:559: Local -> Remote: UpdateFulfillHTLC
    fuzz_test.go:559: Local -> Remote: CommitSig
    fuzz_test.go:559: Remote -> Local: RevokeAndAck
    fuzz_test.go:559: Remote -> Local: CommitSig
    fuzz_test.go:559: Local -> Remote: RevokeAndAck
    fuzz_test.go:559: Local -> Remote: UpdateAddHTLC
    fuzz_test.go:559: Local -> Remote: CommitSig
    fuzz_test.go:559: Remote -> Local: RevokeAndAck
    fuzz_test.go:559: Remote -> Local: CommitSig
    fuzz_test.go:559: Local -> Remote: RevokeAndAck
    fuzz_test.go:559: Remote -> Local: UpdateFailHTLC
    fuzz_test.go:559: Remote -> Local: CommitSig
    fuzz_test.go:559: Local -> Remote: RevokeAndAck
    fuzz_test.go:559: Local -> Remote: CommitSig
    fuzz_test.go:559: Remote -> Local: RevokeAndAck
--- PASS: FuzzHTLCStates (0.07s)
    --- PASS: FuzzHTLCStates/seed#0 (0.03s)
PASS
ok      github.com/lightningnetwork/lnd/htlcswitch      0.362s

Can you take a look at this approach now?

@morehouse

Copy link
Copy Markdown

Could you share web views of the CPU and memory profiles? I think the web views are easier to understand.

I haven't looked super close, but this in-memory database looks pretty buggy to me. The transaction functionality looks completely wrong (e.g., Rollback basically does nothing), and there seems to be an unsafe locking strategy that can cause races and DB inconsistencies.

Unless we're 100% sure that implementing an in-memory DB provides worthwhile gains, I would stay away.

@NishantBansal2003

NishantBansal2003 commented Dec 12, 2025

Copy link
Copy Markdown
Owner Author

I haven't looked super close, but this in-memory database looks pretty buggy to me. The transaction functionality looks completely wrong (e.g., Rollback basically does nothing), and there seems to be an unsafe locking strategy that can cause races and DB inconsistencies.

Unless we're 100% sure that implementing an in-memory DB provides worthwhile gains, I would stay away.

This is just a PoC implementation, the in memory db design will likely change significantly later. I implemented only the basic functionality to evaluate whether an in memory approach is beneficial and to measure its CPU and memory usage. This version is not close to the final design, it’s purely for validating the concept.

The web views are too large, so I’m sharing the SVGs for the CPU and memory profiles generated from 2872 inputs.
cpu.svg:
https://github.com/user-attachments/assets/4ac18d1a-d4e0-413e-9b77-44a5d74a2bb1

mem.svg:
https://github.com/user-attachments/assets/bda20f2b-a51e-4739-a003-f1fdb4c69d43

EDIT: Now thinking of it, If we leave aside the design of the in-memory DB could there be any fundamental issues during fuzzing because of this in-memory DB? for example: could loading everything into memory cause false OOM errors? Although high memory usage in a disk based DB would also indicate a bug, it might still be worth discussing these points before proceeding.

@morehouse

Copy link
Copy Markdown

cpu.svg:
https://github.com/user-attachments/assets/4ac18d1a-d4e0-413e-9b77-44a5d74a2bb1

mem.svg:
https://github.com/user-attachments/assets/bda20f2b-a51e-4739-a003-f1fdb4c69d43

These profiles are with the in-memory DB. Can you also upload ones using the default DB? It would be good to see exactly how much overhead is caused by the default DB.

The main thing I see from this CPU profile is lots of cycles spent during garbage collection. This is not surprising since the fuzz target creates and destroys lots of structures on each iteration. I don't think it's worth trying to optimize that too much at this point.

EDIT: Now thinking of it, If we leave aside the design of the in-memory DB could there be any fundamental issues during fuzzing because of this in-memory DB? for example: could loading everything into memory cause false OOM errors? Although high memory usage in a disk based DB would also indicate a bug, it might still be worth discussing these points before proceeding.

Yes, an in-memory DB makes a CPU-RAM tradeoff -- less CPU and more RAM usage. I tried running the current code and could only manage 4 parallel runners due to the high memory usage, and I still eventually got an OOM crash. The original code was slower but I could use 16 parallel runners, which ended being similar execs/sec overall.

This is just a PoC implementation, the in memory db design will likely change significantly later. I implemented only the basic functionality to evaluate whether an in memory approach is beneficial and to measure its CPU and memory usage. This version is not close to the final design, it’s purely for validating the concept.

My main concern is that the current in-memory DB likely introduces new data race bugs since DB transactions are basically ignored and can't be rolled back. If we want to go this direction, we really need to implement the full interface, and I'm not sure it's worth the effort.

@NishantBansal2003

NishantBansal2003 commented Dec 13, 2025

Copy link
Copy Markdown
Owner Author

Profiling data for the current fuzz tests using the default database:
cpu.svg: https://github.com/user-attachments/assets/c858d5d7-1df5-4a91-8544-0b6ece47ecd1

mem.svg: https://github.com/user-attachments/assets/2992f067-0e49-4a5e-872f-505b7ccfb474

As you can see, most of the time and memory usage is caused by disk related operations in the fuzz tests. This improves significantly when using an in-memory database.
In the current commit I attempted a few optimizations, for example: reducing the message buffer size, capping the size of the fuzz input, and cleaning up goroutines at the end of the tests to improve resource usage. While these changes do not result in improvements comparable to the in-memory database, they do provide a small improvement.
Beyond these changes, I’m not sure how we can further reduce memory usage, since the profiling data shows that the majority of memory consumption comes from the database itself. Do you have any suggestions for improving memory usage in this case? I haven’t looked into it completely yet, but I was thinking how other LN implementations manage memory in this case.

@morehouse

Copy link
Copy Markdown

Thanks. I think this is good enough for now.

I'd suggest focusing on code coverage next. The general process looks like this:

  1. Make sure the expected paths are being executed by the fuzz target.
  2. Look for new paths that could be hit if the fuzz target is improved.
  3. Improve the fuzz target.
  4. Repeat from Step 1.

Unfortunately I haven't been able to get Go to provide coverage of lines in the fuzz_test.go file itself, but that coverage information is very useful for finding bugs in the fuzz target. I've worked around this in the past by basically moving all the test helper functions to a separate file.

I suspect there's a lot of interesting coverage (and bugs) to find by making the target less restricted. It looks like currently it only really tests a partial happy path -- send some number of update_add_htlc messages, then commitment_signed. But there are more messages that can be sent: update_fail_htlc, update_fulfill_htlc, update_fail_malformed_htlc, revoke_and_ack, update_fee, stfu, shutdown, etc. Each of these messages can either be valid (e.g., failing an existing HTLC) or invalid (e.g., failing a non-existant HTLC). A good fuzz target could send all these messages in the correct orders, or in incorrect orders, with valid fields or invalid fields. There's a fair amount of complexity in the protocol, so my gut feeling is that there has to be bugs we could uncover this way.

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

I suspect there's a lot of interesting coverage (and bugs) to find by making the target less restricted. It looks like currently it only really tests a partial happy path -- send some number of update_add_htlc messages, then commitment_signed.

In the fuzz tests, I am either adding the preimage to the invoice registry so that once the HTLC dance is completed, it will automatically check whether the preimage of the hash exists in the registry using

event, err := l.cfg.Registry.NotifyExitHopHtlc(
.
If the invoice is present, an update_fulfill_htlc will be sent, otherwise, an update_fail_htlc will be sent automatically (i.e., added to the message queue, which is then sent to the peer through the exchangeUpdates function). This registry check is triggered as soon as the revoke_and_ack is received from the peer:
l.processRemoteAdds(fwdPkg)
.
For revoke_and_ack, it is automatically sent once the other peer sends a commit_sig, which follows BOLT 2. This can be seen here:
err = l.cfg.Peer.SendMessage(false, nextRevocation)
.

But there are more messages that can be sent: update_fail_htlc, update_fulfill_htlc, update_fail_malformed_htlc, revoke_and_ack, update_fee, stfu, shutdown, etc. Each of these messages can either be valid (e.g., failing an existing HTLC) or invalid (e.g., failing a non-existant HTLC). A good fuzz target could send all these messages in the correct orders, or in incorrect orders, with valid fields or invalid fields.

I have now added support for update_fail_malformed_htlc, update_fee, and stfu. Support for sending valid messages in the correct order already existed, but I have now also added support for sending invalid messages and sending messages in an incorrect order. Some of these cases may cause LND to force-close the channel, which I am currently ignoring and proceeding with the tests normally on the remote (attacker) side, so that I can also exercise edge cases involving failed channel links.

Regarding coverage, these fuzz tests are achieving a good amount of coverage and are able to exercise most of the relevant code paths. However, there are still some paths that are executed less frequently, and many combinations that could potentially reveal bugs. For this, I am running the fuzzing continuously so that coverage can further improve in those areas as well.

Can you check whether this message mutation and reordering approach makes sense?

@morehouse morehouse left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good approach.

I left a lot of inline comments; here's the highlights:

  • There's currently an OOM bug in the fuzz harness.
  • We might find more bugs if we give the fuzzer more freedom to choose invalid values.
  • Ideally we would also be able to test for some invariants after each iteration and validate link failures to determine if they were expected or not.

A couple other ideas that might uncover some bugs:

  • We could vary the initial states, allowing the fuzzer to choose some of the channel parameters and initial balances. This might uncover bugs that only happen in specific channel configs (e.g., balances near the dust limit, unusual fees or CLTV deltas).
  • We could add an action that simulates a node restart. We may be able to borrow some code from the link unit test:

    lnd/htlcswitch/link_test.go

    Lines 270 to 281 in 0cc84d0

    // Restart Bob as well by calling NewLightningChannel.
    bobSigner := harness.bobChannel.Signer
    signerMock := lnwallet.NewDefaultAuxSignerMock(t)
    bobPool := lnwallet.NewSigPool(runtime.NumCPU(), bobSigner)
    bobChannel, err := lnwallet.NewLightningChannel(
    bobSigner, harness.bobChannel.State(), bobPool,
    lnwallet.WithLeafStore(&lnwallet.MockAuxLeafStore{}),
    lnwallet.WithAuxSigner(signerMock),
    )
    require.NoError(t, err)
    err = bobPool.Start()
    require.NoError(t, err)

Comment thread htlcswitch/fuzz_test.go Outdated
Comment on lines +292 to +294
amount: lnwire.NewMSatFromSatoshis(
btcutil.Amount(data[offset+3]) *
btcutil.SatoshiPerBitcent,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be bugs that only show up when a channel is nearly depleted, or only for very large HTLCs. Perhaps we should allow amounts up to the channel capacity here.

Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go
network := setupFuzzNetwork(t, data)

// Execute the HTLC state machine with fuzz input.
network.runHTLCFuzzStateMachine()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might find some bugs if we check channel invariants at the end of the test.

Some ideas:

  • local and remote balances should sum to channel capacity
  • commitment numbers should be synced, within 1 of each other
  • number of pending updates should match between nodes

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

This PR is ready for review. Sorry for taking a lot of time, I was trying to fix all the loose ends and better understand the flow. In this PR, unlike some of my previous comments, there are a few points to highlight:

  • The remote link can fail if it sends an invalid stfu message that is not persisted in its system. In that case, the local link may send a correct message that the remote side does not expect, causing the link to fail.
  • Pending updates might not be matched because one side of the link may have already sent a message, but it is not received by the other side for a couple of messages. This can lead to inconsistencies in pending updates.
  • LND accepts an empty commit_sig as well (https://github.com/lightningnetwork/lnd/blob/70dfbd284b2ffb1ad40c4e67f79d243bedf181e4/lnwallet/channel.go#L5267-L5281). In this case, it still updates the commit height on one side, since the other side does not actually owe anything. Because of this, there can be a delta of more than 1 between the commit heights of the remote and local peers.

@morehouse morehouse left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run the fuzz target and got a failure:

$ cat testdata/fuzz/FuzzHTLCStates/558b8118ff00983c 
go test fuzz v1
[]byte("\x00\x00000000\x00\x000000000000000000000000\x00\x000000 0\x00\x000000002\x00\x00\x000000 \x00\x00\x00\x00\x00\x00\x00000000000\x00\x00\x00\x0000\x00000000000000000000000000000000000000000000000000000000101\xcd0000000000000000000000000000000000000000000")

$ go test -run=FuzzHTLCStates/558b8118ff00983c
--- FAIL: FuzzHTLCStates (0.00s)
    --- FAIL: FuzzHTLCStates/558b8118ff00983c (0.00s)
        fuzz_test.go:200: 
            	Error Trace:	htlcswitch/fuzz_test.go:200
            	            				htlcswitch/link.go:3736
            	            				htlcswitch/link.go:4047
            	            				htlcswitch/link.go:1802
            	            				htlcswitch/fuzz_test.go:1285
            	            				htlcswitch/fuzz_test.go:1345
            	            				htlcswitch/fuzz_test.go:1363
            	            				/usr/lib/golang/src/reflect/value.go:584
            	            				/usr/lib/golang/src/reflect/value.go:368
            	            				/usr/lib/golang/src/testing/fuzz.go:340
            	Error:      	Should be true
            	Test:       	FuzzHTLCStates/558b8118ff00983c
FAIL
exit status 1
FAIL	github.com/lightningnetwork/lnd/htlcswitch	0.015s

Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment on lines +1165 to +1193
// We can receive an invalid commitment from the remote if the
// remote is sending empty HTLC sigs. This is very subtle and
// can happen when the remote sends an empty sig but it is never
// delivered to the local due to message reordering. As a result
// ,the remote has moved the local's commit chain tip ht. When a
// restart happens, the remote checks the local's next commit ht
// against its local tip ht (on the remote side) and again tries
// to send the commit sig. Channel sync does not error out here
// because the remote never shared the empty commit sig, it only
// added it to its own local's commit chain.
//
// During this process, some normal HTLC operations may already
// have been shared between both peers, so the commitment now
// has some HTLCs. If the remote again sends the empty commit
// sig, the local will error out since it wants updated HTLC/
// commitment sigs for all the HTLCs in its commitment, but it
// instead receives the empty sig that was carried over from the
// start.
//
// The easiest way to reproduce this issue is: the remote
// generates an empty commit sig and queues it. The local sends
// add HTLC and commit sig, the remote generates revoke_ack,
// then the queue is reordered and the revoke_ack message is
// sent. After both peers restart, the remote automatically
// generates a new empty commit sig message and sends it.
if isRemoteSender && (len(m.HtlcSigs) == 0) {
network.localExpectedErrs.invalidCommitment = true
}
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to be very thorough while explaining this. Let me know if it makes sense, otherwise, I will try to explain it in a more detailed manner.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?

If the reordering is the issue, then shouldn't we set invalidCommitment during the reordering rather than here?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?

I think this diagram might help:

sequenceDiagram
    participant Remote
    participant Local

    Note over Remote: Generate empty CommitSig
    Note over Remote: Queue = [CommitSig]

    Local->>Remote: AddHTLC
    Local->>Remote: CommitSig

    Note over Remote: Generate RevokeAndAck
    Note over Remote: Queue = [CommitSig, RevokeAndAck]

    Note over Remote: Reorder Queue
    Note over Remote: Queue = [RevokeAndAck, CommitSig]

    Remote->>Local: RevokeAndAck

    Note over Local,Remote: Both peers restart (queue resets)

    Note over Remote: Regenerate the previous empty CommitSig<br/>(because of ht mismatch between remote and local chain)
    Remote->>Local: empty CommitSig
Loading

This commit sig is sent again because of ProcessChanSyncMsg (in resumeLink). I explained the reason for this in the 1st para of the comments(let me know if that is unclear?).

If the reordering is the issue, then shouldn't we set invalidCommitment during the reordering rather than here?

Yes, I am already setting that, but, during the restartNode state, we reset the errors with network.localExpectedErrs = localExpectedErrs, because of this, the expected invalidCommitment error gets wiped out.
If I do not reset the expected errors during restart, then previous expected errors (which were not actual errors e.g. stfu and therefore were not flagged) might be falsely treated as expected errors. As a result, there is a chance that a real error occurring after the restart could be missed

seed, if you would like to check the logs:

func addEmptyCommitSigReorderSeed(f *testing.F) {
        data, preimage, htlcAmt, cltvDelta := buildHTLCFuzzSetup()
        emptyCommitSigReorder := make([]byte, 0)

        // Base setup.
        emptyCommitSigReorder = append(emptyCommitSigReorder, data...)

        // Remote generates and queues an empty commit sig (but doesn't send it yet).
        emptyCommitSigReorder = append(emptyCommitSigReorder, byte(sendCommitSig), 1)

        // Add HTLC.
        emptyCommitSigReorder = append(emptyCommitSigReorder, byte(sendAddHTLC), 0, 0)
        emptyCommitSigReorder = append(emptyCommitSigReorder, htlcAmt...)
        emptyCommitSigReorder = append(emptyCommitSigReorder, cltvDelta...)
        emptyCommitSigReorder = append(emptyCommitSigReorder, preimage...)

        // State transitions.
        emptyCommitSigReorder = append(emptyCommitSigReorder,
                byte(exchangeStateUpdates), 0,
                byte(sendCommitSig), 0,
                byte(exchangeStateUpdates), 0)

        // We exchange with reordering enabled so that RevokeAndAck is sent
        // before the queued empty CommitSig.
        emptyCommitSigReorder = append(emptyCommitSigReorder,
                byte(exchangeStateUpdates), 1, 1, 0, 0,
        )

        // Restart both nodes. After restart, remote will regenerate the empty
        // CommitSig because it never received a RevokeAndAck for it.
        emptyCommitSigReorder = append(emptyCommitSigReorder, byte(restartNode), 0)

        f.Add(emptyCommitSigReorder)
}

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

I tried to run the fuzz target and got a failure:

I ran the fuzz tests continuously for 5 days, and they seem to be holding good. But, I’m still expecting bugs in the fuzz tests since we are now exercising a huge number of combinations. Some of these may trigger link failures much later in the execution. Because of this, there are many cases where link failures are expected, but we aren’t currently handling them, mostly due to the expectedErr state being reset on restart. For that reason, I’m still running the fuzz tests side by side to make sure the fuzz target itself is stable.

One thing I wanted to discuss is while it’s good that we are checking for link failures in the fuzz tests, it’s quite challenging to check and add all possible expected error cases. In practice, this means we have to run the fuzz tests, observe a link error, and then determine whether it’s a real bug in the code or an issue with the fuzz logic. That’s not inherently bad, but I worry it could make the fuzz target quite unstable. Another approach I can think of is to simply track whether any link fail can occur, doesn’t matter what kind. If the remote peer tries to act smart at any point, we just set it to true for both sides. WDYT?

@morehouse morehouse left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another failure after a few minutes of fuzzing:

$ cat testdata/fuzz/FuzzHTLCStates/36a25783f8520f6e 
go test fuzz v1
[]byte("070010170121A0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000A1\x00\x00\x00\x00\x00000A1bX000000")

$ go test -run=FuzzHTLCStates/36a25783f8520f6e
--- FAIL: FuzzHTLCStates (0.01s)
    --- FAIL: FuzzHTLCStates/36a25783f8520f6e (0.01s)
        fuzz_test.go:223: 
            	Error Trace:	htlcswitch/fuzz_test.go:223
            	            				htlcswitch/link.go:3736
            	            				htlcswitch/link.go:1967
            	            				htlcswitch/link.go:4494
            	            				htlcswitch/link.go:1817
            	            				htlcswitch/fuzz_test.go:1195
            	            				htlcswitch/fuzz_test.go:1386
            	            				htlcswitch/fuzz_test.go:1450
            	            				htlcswitch/fuzz_test.go:1470
            	            				/usr/lib/golang/src/reflect/value.go:584
            	            				/usr/lib/golang/src/reflect/value.go:368
            	            				/usr/lib/golang/src/testing/fuzz.go:340
            	Error:      	Should be true
            	Test:       	FuzzHTLCStates/36a25783f8520f6e
FAIL
exit status 1
FAIL	github.com/lightningnetwork/lnd/htlcswitch	0.021s

Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment on lines +1165 to +1193
// We can receive an invalid commitment from the remote if the
// remote is sending empty HTLC sigs. This is very subtle and
// can happen when the remote sends an empty sig but it is never
// delivered to the local due to message reordering. As a result
// ,the remote has moved the local's commit chain tip ht. When a
// restart happens, the remote checks the local's next commit ht
// against its local tip ht (on the remote side) and again tries
// to send the commit sig. Channel sync does not error out here
// because the remote never shared the empty commit sig, it only
// added it to its own local's commit chain.
//
// During this process, some normal HTLC operations may already
// have been shared between both peers, so the commitment now
// has some HTLCs. If the remote again sends the empty commit
// sig, the local will error out since it wants updated HTLC/
// commitment sigs for all the HTLCs in its commitment, but it
// instead receives the empty sig that was carried over from the
// start.
//
// The easiest way to reproduce this issue is: the remote
// generates an empty commit sig and queues it. The local sends
// add HTLC and commit sig, the remote generates revoke_ack,
// then the queue is reordered and the revoke_ack message is
// sent. After both peers restart, the remote automatically
// generates a new empty commit sig message and sends it.
if isRemoteSender && (len(m.HtlcSigs) == 0) {
network.localExpectedErrs.invalidCommitment = true
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?

If the reordering is the issue, then shouldn't we set invalidCommitment during the reordering rather than here?

Comment thread htlcswitch/fuzz_test.go Outdated
@morehouse

Copy link
Copy Markdown

One thing I wanted to discuss is while it’s good that we are checking for link failures in the fuzz tests, it’s quite challenging to check and add all possible expected error cases. In practice, this means we have to run the fuzz tests, observe a link error, and then determine whether it’s a real bug in the code or an issue with the fuzz logic. That’s not inherently bad, but I worry it could make the fuzz target quite unstable. Another approach I can think of is to simply track whether any link fail can occur, doesn’t matter what kind. If the remote peer tries to act smart at any point, we just set it to true for both sides. WDYT?

Yeah, this is the tradeoff. We can put in more effort to identify false positives and fix the error-checking logic, with the benefit of potentially finding more kinds of bugs (especially unnecessary force closes). Or we can ignore that class of bugs entirely and put less effort into getting error-checking correct.

It would be great to find a real bug with this fuzz target, which is why I suggested adding the error validation. But if you just want to get something merged first, we could rip that out and add it back in future PRs.

@MPins MPins left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @NishantBansal2003, I’m also working on the HTLC state machine fuzz test (MPins#1). I’m still at an early stage, and after reviewing your PR, I’m thinking about how to improve mine. Thanks!

Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go Outdated
Comment thread htlcswitch/fuzz_test.go
@NishantBansal2003

Copy link
Copy Markdown
Owner Author

Hello @NishantBansal2003, I’m also working on the HTLC state machine fuzz test (MPins#1). I’m still at an early stage, and after reviewing your PR, I’m thinking about how to improve mine. Thanks!

Hi @MPins, thanks for the review. I think this PR is almost complete (apart from the unresolved comments). I'm currently facing some false-positive link errors, so I've been continuously running fuzz tests and fixing them.

I was planning to open this upstream soon once the initial review is done, so we can continue the discussion there and avoid duplicated effort. I'd really appreciate further reviews here, or I can also provide one if you're testing broader scenarios in your fuzzing.

Overall, I think https://github.com/morehouse/smite is very efficient/less work for these kinds of fuzz tests.

@NishantBansal2003 NishantBansal2003 force-pushed the fuzz-htlcswitch branch 2 times, most recently from c17a8fc to 73e90e6 Compare March 20, 2026 19:27
@NishantBansal2003

Copy link
Copy Markdown
Owner Author

In the last two commits, I did some code cleanup also, so it should now be a bit easier to iterate.

Regarding the link failures, I've added support to check them only on the local side (the victim). The remote side can get link errors (which is expected ;)), especially if it tries to operate on a state that is not in sync with the victim node.

Previously, I was clearing the expected link errors on restart, but that could lead to false positives, since some error state might still be carried forward. Now, we don't reset that state. However, we do reset the state for stfu, since after a restart there is no state restoration for stfu messages.

I've been running the fuzz tests continuously, and for quite a while I haven't seen any errors (though I'm still not fully confident). I'm opening these changes for review so we can move toward finalizing the PR. We can continue improving and fixing any remaining false positives in parallel.

PS: CI failures can be ignored for now, they're likely due to breaking changes between commits. I’ll squash the commits once the PR is finalized.

@morehouse morehouse left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to run the fuzzer. Hit a deadlock during startup:

$ cat testdata/fuzz/FuzzHTLCStates/1b1169501cbe101c 
go test fuzz v1
[]byte("\x1a0\x1c\x1d\x01\x1f\"%(\xf5\x028CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\tCC\t\t\tCC\tCCCCC\tCCCC\tC\tC\t\t\t\tC\tC\t\t\t\tC\tC\tC\t\tCCC\t\tC\tC\t\t\tC\t\t\t\t\t\tC\tCCC\t\tC\t\t\tC\t\t\t\t\tCC\t\tC\t\tC\t\tCCCC\t\tC\tCC\t\tC\t\t\tCC\t\t\tCCCCC\tCC\tCC\tC\t\tC\t\tCCCCC\t\tCCCCCC\t\tCC\t\tC\t\t\t\t\t\t\tCC\t\t\tCC\t\t\t\tCCCC\t\tCCC\tCCC\tCCC\tCCCCC\t\t\t\t\t\tCCCC\tCC\tCC\tC\tCC\t\tCCC\tC\t\t\tCCC\tC\tC\tC\t\t\tCC\tC\tCC\tCC\t\t\t\tC\t\t\tC\t\t\t\tCCC\tC\tCC\tC\t\tC\t\t\tCC\tC\tCC\t\t\tC\t\tC\tCC\t\t\t\tC\t\t\tCC\t\t\tC\t\tCC\tCCC\tCCCC\tC\t\tC\t\tCC\t\t\t\tCCC\t\t\tCCCCC\tC\tCCC\t\tCCC\t\tC\tCCCCC\tC\tC\t\t\tCCC\tC\tCCC\t\tC\t\tC\t\t\t\t\tCCCC\tCCC\tCCC\t\t\t\t\tC\tCC\tCC\t\t\t\tCCCCC\t\tC\tC\tC\tC\tCCCCC\xc2\t\tCC\t\tCC\tCCCC\tCCC\t\tC\t\t\t\tCC\tC\t\tC\tCC\t\t\t\tC\t\tCCCCC\t\t\tCC\t\t\t\tCCC\t\tCCCCCC\tCC\t\tC\t\t\t\t\tC\tC\t\t\t\tCCC\tCC\t\tCC\tCCCCC\t\tCCCCCC\t\tC\tCC\tC\tC\tC\t\t\t\t\t\t\t\tC\tCCCC\tCCCCC\tCC\t\tC\t\tCC\tCCC\tC\tCCCC\tC\tCC\tC\tCCC\t\t\tCC\tCCC\tC\tC\tCCCCCC\t\t\t\t\t\t\t\tC\tCCCCCCC\t\tCCC\tCC\t\t\t\t\tC\t\tC\t\tC\tCCCCCC\t\t\tC\tC\t\tCCC\tCCC\t\tC\tC\t\tC\tC\t\t\t\tC\t\tC\tCC\t\tC\tC\t\t\t\tC\tC\t\tC\tCC\tC\tC\t\tCCC\tC\t\t\t\t\tC\tC\tC\tCCC\tC\tC\tCCC\t\tCCCCC\xc2CC\t\t\tC\tC\t\tCCCC\tC\tC\t\tCC\t\tC\tCC\tCC\t\tC\tCC\tC\t\t\t\tCC\tC\t\tC\tC\tC\tCCCCC\t\tCCC\tCC\tC\tC\t\tC\t\t\tC\t\t\tCC\tCCCCC\t\xc2CCCCCC\t\tCC\tCC\tC\tCC\t\t\t\tC\t\tC\t\tC\tCCCC\tCCCCCCC\tC\t\t\tCC\t\tC\tCCCCC\t\tC\tC\tC\t\t\tCC\tCCCC\t\tC\t\tCC\t\t\tC\tCCC\t\t\t\tCCCCC\t\tC\tCC\tC\t\t\t\tCC\t\tCC\tCCCCC\tCCC\t\t\tCCCCCC\t\t\tC\t\tCC\tC\t\t\t\tC\t\tCC\tCCCCCC\tC\tC\tC\tC\t\tCCCC\tC\t\t\tCC\t\tCC\tC\t\t\tCCCC\t\tCCC\tCCC\t\tCCC\tC\tC\t\t\t\tCC\t\t\tC\tCCCC\t\t\tC\tCC\t\t\t\tC\tCCC\tC\tC\t\tCCCCCCCCCCC\tCCCC\t\tCCC\tC\t\t\t\tC\tC\t\tC\tC\tCC\tCC\tCCC\t\tCCCCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\t\tC\tCC\tCCCC\tCC\tCCCCCCC\tCC\tC\tC\t\t\tCCCCC\t\tCC\tC\t\t\t\t\t\t\tC\t\t\tCCCCC\tC\tC\t\t\t\tCC\t\tC\t\tCCC\tC\t\t\tC\tCCC\t\tC\t\t\t\t\t\t\tC\tCC\tC\tCC\t\t\tCC\tCCCC\t\t\t\tCC\tC\tC\tC\t\tCCCCCC\tCCCCCC\t\tC\t\tC\t\t\tC\tCCC\t\t\tCCCC\t\tCC\t\t\tCCCC\tCC\tC\tCCCC\t\tCC\t\t\t\t\tCCC\t\t\tCC\t\tC\tC\tC\tC\t\tCCCC\tC\tCC\t\t\t\tCC\tCCC\t\tCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\tCCC\tCC\t\t\tC\t\tC\tC\tC\t\tCCCC\t\tC\t\t\t\tC\tCCCCC\tCCCCC\t\tCCCC\tCCCCC\t\t\tC\tCCC\t\tC\tC\tC\t\t\tCC\t\t\tCCCC\t\tC\tCC\t\tCCCCCC!!!!!!!\tCCCCCCC\tCCC\t\t\t\t\t\t\tCC\tC\x03\tCCCCCC\tCCC\tCCCC\t\tCCC\tC\tC\t\t\tCCCCCC\tC\t\tC\t\t\tCCCCC\tCC\t\tCCCCC\tCCC\t\tCCCC\tC\t\t\tC\t\t\t\tCCCCCCCC\tC\tC\tCCC\tC\tCCC\tCCC\tC\tC\t\tCCC\t\tCC\t\tC\t\tCCCC\t\t\tC\t\tCCC\t\tCCC\tCC\tC\tCCCCCCC\t\tC\t\tCC5\xbdI\xfd\tC\tCCC\tCCC\tC\tCCC\t\t\t\t\tCCCC\t\tC\tCCCC\t\tC\t\tCCC\tCC\tC\tCCC\tC\tC\t\tCCCCCCC\tC\t\tC\tC\tC\tCC\tCC\t\tCC\tC\tC\tC\tCC\t\tCCC\t\t\tCC\t\tC\tC\t\t\tC\t\tC\tC\t\tC\tCC\tC\tC\t\t\tCC\tC\tC\tCC\tC\tC\tC\t\t\tCCC\tC\t\t\tCCCCC\tCC\tCCC\tCC\t\tCC\t\t\tC\t\t\tCCCCC\tCC\t\tCC\t\t\tC\tCCC\t\t\t\tCCC\t\tCCC\t\tC\t\t\t\tC\tCC\tC\tC\tC\t\t\t\t\t\t\t\tC\tCCCC\tCCCCC\tCC\t\tC\t\tCC\tCCC\tC\tCCCC\tC\tCC\tC\tCCC\t\t\tCC\tCCC\tC\tC\tCCCCCC\t\t\t\t\t\t\t\tC\tCCCCCCC\t\tCCC\tCC\t\t\t\t\tC\t\tC\t\tC\tCCCCCC\t\t\tC\tC\t\tCCC\tCCC\t\tC\tC\t\tC\tC\t\t\t\tC\t\tC\tCC\t\tC\tC\t\t\t\tC\tC\t\tC\tCC\tC\tC\t\tCCC\tC\t\t\t\t\tC\tC\tC\tCCC\tC\tC\tCCC\t\tCCCCC\xc2CC\t\t\tC\tC\t\tCCCC\tC\tC\t\tCC\t\tC\tCC\tCC\t\tC\tCC\tC\t\t\t\tCC\tC\t\tC\tC\tC\tCCCCC\t\tCCC\tCC\tC\tC\t\tC\t\t\tC\t\t\tCC\tCCCCC\t\xc2CCCCCC\t\tCC\tCC\tC\tCC\t\t\t\tC\t\tC\t\tC\tCCCC\tCCCCCCC\tC\t\t\tCC\t\tC\tCCCCC\t\tC\tC\tC\t\t\tCC\tCCCC\t\tC\t\tCC\t\t\tC\tCCC\t\t\t\tCCCCC\t\tC\tCC\tC\t\t\t\tCC\t\tCC\tCCCCC\tCCC\t\t\tCCCCCC\t\t\tC\t\tCC\tC\t\t\t\tC\t\tCC\tCCCCCC\tC\tC\tC\tC\t\tCCCC\tC\t\t\tCC\t\tCC\tC\t\t\tCCCC\t\tCCC\tCCC\t\tCCC\tC\tC\t\t\t\tCC\t\t\tC\tCCCC\t\t\tC\tCC\t\t\t\tC\tCCC\tC\tC\t\tCCCCCCCCCCC\tCCCC\t\tCCC\tC\t\t\t\tC\tC\t\tC\tC\tCC\tCC\tCCC\t\tCCCCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\t\tC\tCC\tCCCC\tCC\tCCCCCCC\tCC\tC\tC\t\t\tCCCCC\t\tCC\tC\t\t\t\t\t\t\tC\t\t\tCCCCC\tC\tC\t\t\t\tCC\t\tC\t\tCCC\tC\t\t\tC\tCCC\t\tC\t\t\t\t\t\t\tC\tCC\tC\tCC\t\t\tCC\tCCCC\t\t\t\tCC\tC\tC\tC\t\tCCCCCC\tCCCCCC\t\tC\t\tC\t\t\tC\tCCC\t\t\tCCCC\t\tCC\t\t\tCCCC\tCC\tC\tCCCC\t\tCC\t\t\t\t\tCCC\t\t\tCC\t\tC\tC\tC\tC\t\tCCCC\tC\tCC\t\t\t\tCC\tCCC\t\tCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\tCCC\tCC\t\t\tC\t\tC\tC\tC\t\tCCCC\t\tC\t\t\t\tC\tCCCCC\tCCCCC\t\tCCCC\tCCCCC\t\t\tC\tCCC\t\tC\tC\tC\t\t\tCC\t\t\tCCCC\t\tC\tCC\t\tCCCCCC\tCCCCCCC\tCCC\t\t\t\t\t\t\tCC\tC\x03\tCCCCCC\tCCC\tCCCC\t\tCCC\tC\tC\t\t\tCCCCCC\tC\t\tC\t\t\tCCCCC\tCC\t\tCCCCC\tCCC\t\tCCCC\tC\t\t\tC\t\t\t\tCCCCCCCC\tC\tC\tCCC\tC\tCCC\tCCC\tC\tC\t\tCCC\t\tCC\t\tC\t\tCCCC\t\t\tC\t\tCCC\t\tCCC\tCC\tC\tCCCCCCC\t\tC\t\tCC\tC\tCCC\tCCC\tC\tCCC\t\t\t\t\tCCCC\t\tC\tCCCC\t\tC\t\tCCC\tCC\tC\tCCC\tC\tC\t\tCCCCCCC\tC\t\tC\tC\tC\tCC\tCC\t\tCC\tC\tC\tC\tCC\t\tCCC\t\t\tCC\t\tC\tC\t\t\tC\t\tC\tC\t\tC\tCC\tC\tC\t\t\tCC\tC\tC\tCC\tC\tC\tC\t\t\tCCC\tC\t\t\tCCCCC\tCC\tCCC\tCC\t\tCC\t\t\tC\t\t\tCCCCC\tCC\t\tCC\t\t\tC\tCCC\t\t\t\tCCC\t\tCCC\t\tC\t\t\tCCC\tCC\t\t\tCCCCC\t\t\t\tCCCCC\tC\t\tC\tCCC\tCCC\t\xc2CC\tC\t\tC\tCC\t\tC\t\t\t\tC\t\t\tCCCCC\tC\tC\t\tC\t\tC\t\tCC\t\xc2\tCCC\tC\t\tCCC\tCCCC\tC\t\tCCC\t\tC\t\tC\tCC\t\tCCCC\t\t\t\t\t\t\t\t\t\tCC\tC\tCCCCC\t\tCC\t\tCCCCCC\t\t\tC\tC\tC\t\tCC\t\tCC\tCC\t\t\tC\t\t\tCCCCCCC\tCC\t\t\tC\tC\t\t\tCC\tC\tCCCC\tCCCCC\tC\tCC\tC\tCCC\t\tCCC\t\tCC\tC\tC\tC\tCCCC\tCCCC\t\t\t\tC\t(\t\tC\t\t\tCC\tC\t\tC\t\t\tC\t\tCCC\tC\tCCCC\t\tC\t\tCCCCCC\t\tCC\t\t\tCC\tCC\t\t\tC\t\tC\t\tCCCC\t\t\tCC\t\tCCC\t\t\tC\t\tCCC\t\tCC\tCCC\t\t\tC\t\tCCC\tC\t\tC\t\t\tCC\t\t\t\tCCC\t\tCCC\t\t\tC\tCCCC\tCCCCC\t\tCCCCCC\t\tC\tC\tC\t\t\t\tCC\t\tC\tCCCCCCC\t\tC\tCC\tC\tC\tC\t\tCC\t\tCC\t\tC\t\t\t\tC\tC\t\t\t\tCCCC\t\tC\t\tCC\t\t\tCC\tCC\t\tC\tCC\t\tC\t\t\tC\t\t\t\t\tCC\t\t\tCC\tCCCCC\tC\tCC\t\tCCCCCC\tCCCCCC\t\tCCCCC\t\tC\tC\t\t\tCCC\tCCCC\tC\t\tCCC\tCCCC\t\t\t\tC\tCCCCCCC\t\t\tCC\tCC\t\tCCCCCCCC\tCCCCCCCCC\t\tC\t\tCCC\tCCC\tC\tC\tCCC\t\tC\tC\tCCCC\t\t\xc2\xc2CC\t\tC\t\t\t\tC\tCCC\t\tC\t\t\t\tC\tC\t\tCC\tCCC\tCCCCCC\t\t\t\tC\t\t\tC\tC\t\t\tC\t\t\t\t\t\t\tCCCC\tCCC\tCCC\t\xc2C\t\t\t\t\t\t\tCCCC\t\t\t\tN\t\t\tC\t\tCCC\tCCCCC\t\tCC\t\tC\tC\tCCCCCCCCCC\t\tC\t\tCC\tCCC\t\tCCCCCC\tCCCC\t\tCC\t\t\t\tC\tCCCCCCCC\tCC\t\t\tCCCCC\t\t\t\tCCCCC\tC\t\tC\tCCC\tCCC\t\xc2CC\tC\t\tC\tCC\t\tC\t\t\t\tC\t\t\tCCCCC\tC\tC\t\tC\t\tC\t\tCC\t\xc2\tCCC\tC\t\tCCC\tCCCC\tC\t\tCCC\t\tC\t\tC\tCC\t\tCCCC\t\t\t\t\t\t\t\t\t\tCC\tC\tCCCCC\t\tCC\t\tCCCCCC\t\t\tC\tC\tC\t\tCC\t\tCC\tCC\t\t\tC\t\t\tCCCCCCC\tCC\t\t\tC\tC\t\t\tCC\tC\tCCCC\tCCCCC\tC\tCC\tC\tCCC\t\tCCC\t\tCC\tC\tC\tC\tCCCC\tCCCC\t\t\t\tC\t(\t\tC\t\t\tCC\tC\t\tC\t\t\tC\t\tCCC\tC\tCCCC\t\tC\t\tCCCCCC\t\tCC\t\t\tCC\tCC\t\t\tC\t\tC\t\tCCCC\t\t\tCC\t\tCCC\t\t\tC\t\tCCC\t\tCC\tCCC\t\t\tC\t\tCCC\tC\t\tC\t\t\tCC\t\t\t\tCCC\t\tCCC\t\t\tC\tCCCC\tCCCCC\t\tCCCCCC\t\tC\tC\tC\t\t\t\tCC\t\tC\tCCCCCCC\t\tC\tCC\tC\tC\tC\t\tCC\t\tCC\t\tC\t\t\t\tC\tC\t\t\t\tCCCC\t\tC\t\tCC\t\t\tCC\tCC\t\tC\tCC\t\tC\t\t\tC\t\t\t\t\tCC\t\t\tCC\tCCCCC\tC\tCC\t\tCCCCCC\tCCCCCC\t\tCCCCC\t\tC\tC\t\t\tCCC\tCCCC\tC\t\tCCC\tCCCC\t\t\t\tC\tCCCCCCC\t\t\tCC\tCC\t\tCCCCCCCC\tCCCCCCCCC\t\tC\t\tCCC\tCCC\tC\tC\tCCC\t\tC\tC\tCCCC\t\t\xc2\xc2CC\t\tC\t\t\t\tC\tCCC\t\tC\t\t\t\tC\tC\t\tCC\tCCC\tCCCCCC\t\t\t\tC\t\t\tC\tC\t\t\tC\t\t\t\t\t\t\tCCCC\tCCC\tCCC\t\xc2C\t\t\t\t\t\t\tCCCC\t\t\t\tN\t\t\tC\t\tCCC\tCCCCC\t\tCC\t\tC\tC\tCCCCCCCCCC\t\tC\t\tCC\tCCC\t\tCCCCCC\tCCCC\t\tCC\t\t\t\tC\tCCCCC\t\tCCC\tC\t\tC\tC\tCC\tC\tCC\tCCCCCCC\t\tCCC\t\tC\t\t\tC\t\t\tCCCCCCCCC\tC\tC\t\tCCCCCCCC\tCCC\tCC\tC\tCCC\tC\t\t\tCC\tCCC\t\tCCCCC\t\tCC\x15CC\tC\t\tCCCCC\t\tCC\t\tC\x7fC\tC\t\tC\t\t\t\tCC\tCC\tC\t\tC\tC\t\tC\t\tCC\tCCC\t\t\t\tCCCCC\tCC\tCC\t\tC\t\t\t\tCCC\tCCCC\tC\tC\t\t\tCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\x12CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xb8\xb8\xb8CCCCC\tCCC\tCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xc3CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xad\xc9<d\x06\xe7CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\tCC\tCCCC\x1d$\x1b&\x17\x162x$X'\x10(y.B\v\xf1\xf1\xf1\xf1\xf1\xf1\xf1\xf1+\t\b\aB\x05087\x01\x00\x02\x00\x01\x00\x028y\x01y08a\x02\xff\xe8\x01\x01\x01#\x01 \x00$\x00!")

$ go test -run=FuzzHTLCStates/1b1169501cbe101c -timeout=60s
panic: test timed out after 1m0s
...
goroutine 59 [select]:
github.com/lightningnetwork/lnd/htlcswitch.(*mockPeer).SendMessage(0xc001154ed0?, 0x68?, {0xc0011b2380?, 0xc000fe3680?, 0x1?})
        htlcswitch/link_test.go:2091 +0x87
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).syncChanStates(0xc000173408, {0xf660d0, 0xc0001d4a00})
        htlcswitch/link.go:1048 +0xe43
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resumeLink(0xc000173408, {0xf660d0, 0xc0001d4a00})
        htlcswitch/link.go:3896 +0x45
github.com/lightningnetwork/lnd/htlcswitch.setupSide(0xc000582c40, 0xc00042b1a0, {0x3, 0x8d, 0x9a, 0x42, 0xb8, 0x9c, 0x7e, 0xdf, ...}, ...)
        htlcswitch/fuzz_test.go:355 +0x255
github.com/lightningnetwork/lnd/htlcswitch.(*fuzzNetwork).restartNode(0xc0001b5e30)
        htlcswitch/fuzz_test.go:1230 +0x2eb
github.com/lightningnetwork/lnd/htlcswitch.(*fuzzNetwork).runHTLCFuzzStateMachine(0xc0001b5e30)
        htlcswitch/fuzz_test.go:1405 +0x1f7
github.com/lightningnetwork/lnd/htlcswitch.FuzzHTLCStates.func1(0x0?, {0xc00053f980?, 0x0?, 0x480893?})
        htlcswitch/fuzz_test.go:1433 +0x25
reflect.Value.call({0xc2b0a0?, 0xe87de8?, 0x13?}, {0xd900ca, 0x4}, {0xc000434b10, 0x2, 0x2?})
        /usr/lib/golang/src/reflect/value.go:581 +0xcc6
reflect.Value.Call({0xc2b0a0?, 0xe87de8?, 0x1498740?}, {0xc000434b10?, 0xd8e160?, 0x1357e08?})
        /usr/lib/golang/src/reflect/value.go:365 +0xb9
testing.(*F).Fuzz.func1.1(0xc000582c40?)
        /usr/lib/golang/src/testing/fuzz.go:341 +0x32a
testing.tRunner(0xc000582c40, 0xc00026e000)
        /usr/lib/golang/src/testing/testing.go:1934 +0xea
created by testing.(*F).Fuzz.func1 in goroutine 58
        /usr/lib/golang/src/testing/fuzz.go:328 +0x637
...

Comment thread channeldb/channel.go
Comment on lines 3399 to 3405

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this block looks suspicious. Previously we returned immediately if updateBytes == nil; now we fall through to several code blocks below, potentially even doing DB operations.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there was a subtle issue where we shouldn’t update the DB for remote updates that we have acked (I’ve raised this in the upstream review: lightningnetwork#10622 (review)). However, we still need to update the DB for local updates that the remote party has acked, since that’s where things were failing for update_fee (the remote sends the revoke, so we have an update to store in the DB)

Comment thread htlcswitch/fuzz_test.go Outdated
Comment on lines +131 to +133
// We will revoke the STFU violation after the restart, since the
// message queue has been reset.
e.revoke(ErrStfuViolation)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A casual reader wouldn't expect merge to also revoke STFU violations. The revoke should be done by the caller of this function instead.

Comment thread htlcswitch/fuzz_test.go

require.True(t, expectedErrs.allows(linkErr.code),
"unexpected link error: %v", linkErr.code)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the channel be borked? If it shouldn't happen in prod, perhaps it is a real bug?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, the only case where the channel might be borked is:

if err := l.channel.MarkBorked(); err != nil {

And, since we mark this as expected, the tests will continue. Later, when another message is processed, the borked check may be triggered internally within lnwallet, and it could result in an error of any kind

Comment thread htlcswitch/fuzz_test.go Outdated
Comment on lines +1266 to +1267
localErrors.merge(network.localErrors)
network.localErrors = localErrors

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we can simplify this pattern to:

Suggested change
localErrors.merge(network.localErrors)
network.localErrors = localErrors
network.localErrors.merge(localErrors)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, onChannelFailure is tied to localErrors (new) and not network.localErrors (old). So, we need to carry the old data into the new variable and then reassign the old reference to the new variable

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

I wasn't able to run the fuzzer. Hit a deadlock during startup:

May I know after how much time you got into this, or was it in the first pass only?

@morehouse

Copy link
Copy Markdown

May I know after how much time you got into this, or was it in the first pass only?

It occurred while loading the previous corpus. No actual fuzzing was done.

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

It occurred while loading the previous corpus. No actual fuzzing was done.

Hmm, I'm a bit confused about why I wasn't able to catch all of these. I ran fuzzing for a long time, even on two systems. Not sure if this is a good approach, but I also cleared the fuzz corpus multiple times since many mutations are exercised, and sometimes running with an empty corpus helped me find bugs in my fuzz tests. However, that also didn't happen for a long time.

Is this something that is fully random, or is it possible that certain mutations were triggered on one system but not on the other? Any suggestions to avoid this (fuzzing continuously for 2–3 weeks non-stop?) I really don't want to hinder the review because of these oversights

@morehouse

Copy link
Copy Markdown

Is this something that is fully random, or is it possible that certain mutations were triggered on one system but not on the other? Any suggestions to avoid this (fuzzing continuously for 2–3 weeks non-stop?) I really don't want to hinder the review because of these oversights

I'm not sure. I ran previous versions of this PR on 30 cores for a while (until I hit the previous crashes). I think all previous crashes triggered in under 2 hours. I haven't deleted the corpus at all -- currently theres 3k inputs there.

I'm not using any unusual flags or anything:

$ go test -fuzz=FuzzHTLCStates -parallel=30 -tags=dev
fuzz: elapsed: 1s, gathering baseline coverage: 0/3028 completed
fuzz: elapsed: 4s, gathering baseline coverage: 479/3028 completed
fuzz: elapsed: 7s, gathering baseline coverage: 983/3028 completed
fuzz: elapsed: 10s, gathering baseline coverage: 1433/3028 completed
fuzz: elapsed: 13s, gathering baseline coverage: 1888/3028 completed
fuzz: elapsed: 14s, gathering baseline coverage: 1888/3028 completed
--- FAIL: FuzzHTLCStates (14.37s)
    fuzzing process hung or terminated unexpectedly: exit status 2
    Failing input written to testdata/fuzz/FuzzHTLCStates/1b1169501cbe101c

@MPins

MPins commented Mar 30, 2026

Copy link
Copy Markdown

Hello @NishantBansal2003, I think you could significantly improve executions per second by mocking signature generation/verification and key derivation, and running the DB in memory. I’ve already done this in my approach to link fuzzing—feel free to take a look if it’s useful: MPins#1

@NishantBansal2003

Copy link
Copy Markdown
Owner Author

running the DB in memory

We discussed using in-memory DB in the above PR thread itself, but there was a lot of memory usage because of it, so we dropped that idea (See: #7 (comment))

mocking signature generation/verification and key derivation

Yes, maybe we can do this. But my thought is that mocking should be done at the test package level, right? Is it a good idea to change the critical lnwallet code just for fuzzing. That would mean we’d have to test that part separately, which would significantly increase the effort.

@MPins

MPins commented Apr 9, 2026

Copy link
Copy Markdown

I read through all the historical comments here 😉, it was a great way to learn from your approach.
That said, in my setup I’m running the DB on a RAM disk available on Linux machines, it’s a simple change and worth trying.

This, combined with mocked signing, verification, and key derivation, yields hundreds of executions per second.

Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants