htlcswitch: Add HTLC state machine fuzz tests#7
Conversation
Coverage Report for CI Build 25013897283Coverage increased (+0.009%) to 62.214%Details
Uncovered Changes
Coverage Regressions83 previously-covered lines in 22 files lost coverage.
Coverage Stats
💛 - Coveralls |
385bf2e to
375eed4
Compare
|
The only drawback of these fuzz tests is that they can be a bit flaky (on restart). The htlcswitch relies on many goroutines, so there are cases where some goroutines may not finish in time, and the Sleep duration we set may not be enough. One way to fix this is to increase the Sleep delay, but that trades off against the overall fuzzing timeout (we might get: |
4b38cb7 to
08bd03c
Compare
|
One more thing -- fuzzing is only effective with a lot of CPU time, so the reason I’m hitting test flake sometimes might be due to the limited runtime resources on my machine. It would be better to validate this on a different machine with more CPU capacity to ensure correctness. |
f0a61a9 to
02230e3
Compare
|
I tried running the fuzz test and encountered a big issue: it's too slow! Running a single input can take ~1 minute, which is way too slow for fuzzing to be useful. Ideally we like to see thousands of executions per second, but we often tolerate 10s or 100s per second for more complex targets. And the target often stops altogether with "fuzzing process hung or terminated unexpectedly" errors. I noticed a couple 100 ms sleeps in the target, which we should eliminate if possible. It would also be very helpful to profile the target and figure out where most of the time is getting spent. Perhaps there's some restructuring we can do to improve performance. I suspect we'd have a much better time if we try to fuzz the channel link directly instead of through mock nodes. That would eliminate a lot of setup and teardown time and reduce latency from passing messages around. I'm not sure exactly what that would look like, but we'd probably need to mock lots of stuff the channel link depends on, and we might need to refactor the link a bit to facilitate testing/fuzzing. |
02230e3 to
e441822
Compare
Yes, I was concerned about this issue. The main reason I tried such a heavy setup was to exercise restart behaviour in the fuzz state machine, but now I think that will only add more startup overhead to the fuzz tests.
Yes, to reduce setup overhead, I switched to a simple 2-hop network with only channelLink instances. I also mocked the mailbox, since passing messages through the real mailbox introduces additional delays (goroutines) during fuzzing, So now the functions are invoked directly. Can you check now and see if there are any other areas which can be fixed? |
|
This is a good start. The timeouts are gone, and I'm getting ~20 execs/s. I think this is a good direction to move in. The next step I would take would be to profile everything -- CPU, goroutine, and heap, looking for ways to increase speed, cut wait times, and reduce memory usage. With 16 parallel workers, the fuzz target was peaking around 90 GB of memory on my machine. I can't run with all 32 cores on my machine or it runs out of memory. Maybe we won't be able to reduce that too much, but it's worth checking to see if there's any easy wins. Also, it's important to check the code coverage after you've built up a decent corpus, to make sure the fuzz target is executing the expected code paths. I think there's a lot of other improvements we can make to this fuzz target -- currently it is quite limited in the sequences it can generate and the corner cases it can trigger. But I think first we should optimize the general approach. |
|
After running for a while, I got this crash: --- FAIL: FuzzHTLCStates (0.00s)
--- FAIL: FuzzHTLCStates/ea357960e1bde8a0 (0.00s)
fuzz_test.go:555:
Error Trace: /home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:337
/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:348
/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:387
/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:495
/home/matt/Code/NishantBansal2003/lnd/htlcswitch/fuzz_test.go:555
/usr/lib/golang/src/reflect/value.go:584
/usr/lib/golang/src/reflect/value.go:368
/usr/lib/golang/src/testing/fuzz.go:340
Error: Target error should be in err chain:
expected: "invoice with payment hash already exists"
in chain: "cannot use hash of all-zeroes preimage"
Test: FuzzHTLCStates/ea357960e1bde8a0
FAILInput: $ cat testdata/fuzz/FuzzHTLCStates/ea357960e1bde8a0
go test fuzz v1
[]byte("0100\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")It appears we're attempting to create an invalid invoice on the "victim" node, which gets caught. Probably we should just avoid doing this, since an attacker wouldn't get to choose the preimage anyway. |
I profiled the target, and from this we can see that the most resource consuming operations are the db updates (we are using the actual bolt db in our tests). For most of the other functions, the majority of the time is also spent in db update calls: I haven’t looked into it yet, but would using an in-memory db for these operations help reduce this overhead? |
364a81b to
4c89921
Compare
|
I wrote a basic mock in-memory database to avoid disk backed DB calls. This has significantly improved execs/sec and also reduced CPU wait time and memory usage. I think we can discuss whether this approach makes sense overall. Can you take a look at this approach now? |
4c89921 to
64f6b9b
Compare
|
Could you share web views of the CPU and memory profiles? I think the web views are easier to understand. I haven't looked super close, but this in-memory database looks pretty buggy to me. The transaction functionality looks completely wrong (e.g., Unless we're 100% sure that implementing an in-memory DB provides worthwhile gains, I would stay away. |
This is just a PoC implementation, the in memory db design will likely change significantly later. I implemented only the basic functionality to evaluate whether an in memory approach is beneficial and to measure its CPU and memory usage. This version is not close to the final design, it’s purely for validating the concept. The web views are too large, so I’m sharing the SVGs for the CPU and memory profiles generated from 2872 inputs. mem.svg: EDIT: Now thinking of it, If we leave aside the design of the in-memory DB could there be any fundamental issues during fuzzing because of this in-memory DB? for example: could loading everything into memory cause false OOM errors? Although high memory usage in a disk based DB would also indicate a bug, it might still be worth discussing these points before proceeding. |
These profiles are with the in-memory DB. Can you also upload ones using the default DB? It would be good to see exactly how much overhead is caused by the default DB. The main thing I see from this CPU profile is lots of cycles spent during garbage collection. This is not surprising since the fuzz target creates and destroys lots of structures on each iteration. I don't think it's worth trying to optimize that too much at this point.
Yes, an in-memory DB makes a CPU-RAM tradeoff -- less CPU and more RAM usage. I tried running the current code and could only manage 4 parallel runners due to the high memory usage, and I still eventually got an OOM crash. The original code was slower but I could use 16 parallel runners, which ended being similar execs/sec overall.
My main concern is that the current in-memory DB likely introduces new data race bugs since DB transactions are basically ignored and can't be rolled back. If we want to go this direction, we really need to implement the full interface, and I'm not sure it's worth the effort. |
|
Profiling data for the current fuzz tests using the default database: mem.svg: https://github.com/user-attachments/assets/2992f067-0e49-4a5e-872f-505b7ccfb474 As you can see, most of the time and memory usage is caused by disk related operations in the fuzz tests. This improves significantly when using an in-memory database. |
64f6b9b to
5c136d6
Compare
|
Thanks. I think this is good enough for now. I'd suggest focusing on code coverage next. The general process looks like this:
Unfortunately I haven't been able to get Go to provide coverage of lines in the I suspect there's a lot of interesting coverage (and bugs) to find by making the target less restricted. It looks like currently it only really tests a partial happy path -- send some number of |
5c136d6 to
cab9f62
Compare
In the fuzz tests, I am either adding the preimage to the invoice registry so that once the HTLC dance is completed, it will automatically check whether the preimage of the hash exists in the registry using Line 3452 in 91423ee If the invoice is present, an update_fulfill_htlc will be sent, otherwise, an update_fail_htlc will be sent automatically (i.e., added to the message queue, which is then sent to the peer through the exchangeUpdates function). This registry check is triggered as soon as the revoke_and_ack is received from the peer: Line 4475 in 91423ee For revoke_and_ack, it is automatically sent once the other peer sends a commit_sig, which follows BOLT 2. This can be seen here: Line 4339 in 91423ee
I have now added support for update_fail_malformed_htlc, update_fee, and stfu. Support for sending valid messages in the correct order already existed, but I have now also added support for sending invalid messages and sending messages in an incorrect order. Some of these cases may cause LND to force-close the channel, which I am currently ignoring and proceeding with the tests normally on the remote (attacker) side, so that I can also exercise edge cases involving failed channel links. Regarding coverage, these fuzz tests are achieving a good amount of coverage and are able to exercise most of the relevant code paths. However, there are still some paths that are executed less frequently, and many combinations that could potentially reveal bugs. For this, I am running the fuzzing continuously so that coverage can further improve in those areas as well. Can you check whether this message mutation and reordering approach makes sense? |
morehouse
left a comment
There was a problem hiding this comment.
I think this is a good approach.
I left a lot of inline comments; here's the highlights:
- There's currently an OOM bug in the fuzz harness.
- We might find more bugs if we give the fuzzer more freedom to choose invalid values.
- Ideally we would also be able to test for some invariants after each iteration and validate link failures to determine if they were expected or not.
A couple other ideas that might uncover some bugs:
- We could vary the initial states, allowing the fuzzer to choose some of the channel parameters and initial balances. This might uncover bugs that only happen in specific channel configs (e.g., balances near the dust limit, unusual fees or CLTV deltas).
- We could add an action that simulates a node restart. We may be able to borrow some code from the link unit test:
Lines 270 to 281 in 0cc84d0
| amount: lnwire.NewMSatFromSatoshis( | ||
| btcutil.Amount(data[offset+3]) * | ||
| btcutil.SatoshiPerBitcent, |
There was a problem hiding this comment.
There may be bugs that only show up when a channel is nearly depleted, or only for very large HTLCs. Perhaps we should allow amounts up to the channel capacity here.
| network := setupFuzzNetwork(t, data) | ||
|
|
||
| // Execute the HTLC state machine with fuzz input. | ||
| network.runHTLCFuzzStateMachine() |
There was a problem hiding this comment.
We might find some bugs if we check channel invariants at the end of the test.
Some ideas:
- local and remote balances should sum to channel capacity
- commitment numbers should be synced, within 1 of each other
- number of pending updates should match between nodes
cab9f62 to
7ac92fd
Compare
|
This PR is ready for review. Sorry for taking a lot of time, I was trying to fix all the loose ends and better understand the flow. In this PR, unlike some of my previous comments, there are a few points to highlight:
|
morehouse
left a comment
There was a problem hiding this comment.
I tried to run the fuzz target and got a failure:
$ cat testdata/fuzz/FuzzHTLCStates/558b8118ff00983c
go test fuzz v1
[]byte("\x00\x00000000\x00\x000000000000000000000000\x00\x000000 0\x00\x000000002\x00\x00\x000000 \x00\x00\x00\x00\x00\x00\x00000000000\x00\x00\x00\x0000\x00000000000000000000000000000000000000000000000000000000101\xcd0000000000000000000000000000000000000000000")
$ go test -run=FuzzHTLCStates/558b8118ff00983c
--- FAIL: FuzzHTLCStates (0.00s)
--- FAIL: FuzzHTLCStates/558b8118ff00983c (0.00s)
fuzz_test.go:200:
Error Trace: htlcswitch/fuzz_test.go:200
htlcswitch/link.go:3736
htlcswitch/link.go:4047
htlcswitch/link.go:1802
htlcswitch/fuzz_test.go:1285
htlcswitch/fuzz_test.go:1345
htlcswitch/fuzz_test.go:1363
/usr/lib/golang/src/reflect/value.go:584
/usr/lib/golang/src/reflect/value.go:368
/usr/lib/golang/src/testing/fuzz.go:340
Error: Should be true
Test: FuzzHTLCStates/558b8118ff00983c
FAIL
exit status 1
FAIL github.com/lightningnetwork/lnd/htlcswitch 0.015s
| // We can receive an invalid commitment from the remote if the | ||
| // remote is sending empty HTLC sigs. This is very subtle and | ||
| // can happen when the remote sends an empty sig but it is never | ||
| // delivered to the local due to message reordering. As a result | ||
| // ,the remote has moved the local's commit chain tip ht. When a | ||
| // restart happens, the remote checks the local's next commit ht | ||
| // against its local tip ht (on the remote side) and again tries | ||
| // to send the commit sig. Channel sync does not error out here | ||
| // because the remote never shared the empty commit sig, it only | ||
| // added it to its own local's commit chain. | ||
| // | ||
| // During this process, some normal HTLC operations may already | ||
| // have been shared between both peers, so the commitment now | ||
| // has some HTLCs. If the remote again sends the empty commit | ||
| // sig, the local will error out since it wants updated HTLC/ | ||
| // commitment sigs for all the HTLCs in its commitment, but it | ||
| // instead receives the empty sig that was carried over from the | ||
| // start. | ||
| // | ||
| // The easiest way to reproduce this issue is: the remote | ||
| // generates an empty commit sig and queues it. The local sends | ||
| // add HTLC and commit sig, the remote generates revoke_ack, | ||
| // then the queue is reordered and the revoke_ack message is | ||
| // sent. After both peers restart, the remote automatically | ||
| // generates a new empty commit sig message and sends it. | ||
| if isRemoteSender && (len(m.HtlcSigs) == 0) { | ||
| network.localExpectedErrs.invalidCommitment = true | ||
| } | ||
| } |
There was a problem hiding this comment.
I tried to be very thorough while explaining this. Let me know if it makes sense, otherwise, I will try to explain it in a more detailed manner.
There was a problem hiding this comment.
It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?
If the reordering is the issue, then shouldn't we set invalidCommitment during the reordering rather than here?
There was a problem hiding this comment.
It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?
I think this diagram might help:
sequenceDiagram
participant Remote
participant Local
Note over Remote: Generate empty CommitSig
Note over Remote: Queue = [CommitSig]
Local->>Remote: AddHTLC
Local->>Remote: CommitSig
Note over Remote: Generate RevokeAndAck
Note over Remote: Queue = [CommitSig, RevokeAndAck]
Note over Remote: Reorder Queue
Note over Remote: Queue = [RevokeAndAck, CommitSig]
Remote->>Local: RevokeAndAck
Note over Local,Remote: Both peers restart (queue resets)
Note over Remote: Regenerate the previous empty CommitSig<br/>(because of ht mismatch between remote and local chain)
Remote->>Local: empty CommitSig
This commit sig is sent again because of ProcessChanSyncMsg (in resumeLink). I explained the reason for this in the 1st para of the comments(let me know if that is unclear?).
If the reordering is the issue, then shouldn't we set
invalidCommitmentduring the reordering rather than here?
Yes, I am already setting that, but, during the restartNode state, we reset the errors with network.localExpectedErrs = localExpectedErrs, because of this, the expected invalidCommitment error gets wiped out.
If I do not reset the expected errors during restart, then previous expected errors (which were not actual errors e.g. stfu and therefore were not flagged) might be falsely treated as expected errors. As a result, there is a chance that a real error occurring after the restart could be missed
seed, if you would like to check the logs:
func addEmptyCommitSigReorderSeed(f *testing.F) {
data, preimage, htlcAmt, cltvDelta := buildHTLCFuzzSetup()
emptyCommitSigReorder := make([]byte, 0)
// Base setup.
emptyCommitSigReorder = append(emptyCommitSigReorder, data...)
// Remote generates and queues an empty commit sig (but doesn't send it yet).
emptyCommitSigReorder = append(emptyCommitSigReorder, byte(sendCommitSig), 1)
// Add HTLC.
emptyCommitSigReorder = append(emptyCommitSigReorder, byte(sendAddHTLC), 0, 0)
emptyCommitSigReorder = append(emptyCommitSigReorder, htlcAmt...)
emptyCommitSigReorder = append(emptyCommitSigReorder, cltvDelta...)
emptyCommitSigReorder = append(emptyCommitSigReorder, preimage...)
// State transitions.
emptyCommitSigReorder = append(emptyCommitSigReorder,
byte(exchangeStateUpdates), 0,
byte(sendCommitSig), 0,
byte(exchangeStateUpdates), 0)
// We exchange with reordering enabled so that RevokeAndAck is sent
// before the queued empty CommitSig.
emptyCommitSigReorder = append(emptyCommitSigReorder,
byte(exchangeStateUpdates), 1, 1, 0, 0,
)
// Restart both nodes. After restart, remote will regenerate the empty
// CommitSig because it never received a RevokeAndAck for it.
emptyCommitSigReorder = append(emptyCommitSigReorder, byte(restartNode), 0)
f.Add(emptyCommitSigReorder)
}
I ran the fuzz tests continuously for 5 days, and they seem to be holding good. But, I’m still expecting bugs in the fuzz tests since we are now exercising a huge number of combinations. Some of these may trigger link failures much later in the execution. Because of this, there are many cases where link failures are expected, but we aren’t currently handling them, mostly due to the expectedErr state being reset on restart. For that reason, I’m still running the fuzz tests side by side to make sure the fuzz target itself is stable. One thing I wanted to discuss is while it’s good that we are checking for link failures in the fuzz tests, it’s quite challenging to check and add all possible expected error cases. In practice, this means we have to run the fuzz tests, observe a link error, and then determine whether it’s a real bug in the code or an issue with the fuzz logic. That’s not inherently bad, but I worry it could make the fuzz target quite unstable. Another approach I can think of is to simply track whether any link fail can occur, doesn’t matter what kind. If the remote peer tries to act smart at any point, we just set it to true for both sides. WDYT? |
morehouse
left a comment
There was a problem hiding this comment.
Another failure after a few minutes of fuzzing:
$ cat testdata/fuzz/FuzzHTLCStates/36a25783f8520f6e
go test fuzz v1
[]byte("070010170121A0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000A1\x00\x00\x00\x00\x00000A1bX000000")
$ go test -run=FuzzHTLCStates/36a25783f8520f6e
--- FAIL: FuzzHTLCStates (0.01s)
--- FAIL: FuzzHTLCStates/36a25783f8520f6e (0.01s)
fuzz_test.go:223:
Error Trace: htlcswitch/fuzz_test.go:223
htlcswitch/link.go:3736
htlcswitch/link.go:1967
htlcswitch/link.go:4494
htlcswitch/link.go:1817
htlcswitch/fuzz_test.go:1195
htlcswitch/fuzz_test.go:1386
htlcswitch/fuzz_test.go:1450
htlcswitch/fuzz_test.go:1470
/usr/lib/golang/src/reflect/value.go:584
/usr/lib/golang/src/reflect/value.go:368
/usr/lib/golang/src/testing/fuzz.go:340
Error: Should be true
Test: FuzzHTLCStates/36a25783f8520f6e
FAIL
exit status 1
FAIL github.com/lightningnetwork/lnd/htlcswitch 0.021s
| // We can receive an invalid commitment from the remote if the | ||
| // remote is sending empty HTLC sigs. This is very subtle and | ||
| // can happen when the remote sends an empty sig but it is never | ||
| // delivered to the local due to message reordering. As a result | ||
| // ,the remote has moved the local's commit chain tip ht. When a | ||
| // restart happens, the remote checks the local's next commit ht | ||
| // against its local tip ht (on the remote side) and again tries | ||
| // to send the commit sig. Channel sync does not error out here | ||
| // because the remote never shared the empty commit sig, it only | ||
| // added it to its own local's commit chain. | ||
| // | ||
| // During this process, some normal HTLC operations may already | ||
| // have been shared between both peers, so the commitment now | ||
| // has some HTLCs. If the remote again sends the empty commit | ||
| // sig, the local will error out since it wants updated HTLC/ | ||
| // commitment sigs for all the HTLCs in its commitment, but it | ||
| // instead receives the empty sig that was carried over from the | ||
| // start. | ||
| // | ||
| // The easiest way to reproduce this issue is: the remote | ||
| // generates an empty commit sig and queues it. The local sends | ||
| // add HTLC and commit sig, the remote generates revoke_ack, | ||
| // then the queue is reordered and the revoke_ack message is | ||
| // sent. After both peers restart, the remote automatically | ||
| // generates a new empty commit sig message and sends it. | ||
| if isRemoteSender && (len(m.HtlcSigs) == 0) { | ||
| network.localExpectedErrs.invalidCommitment = true | ||
| } | ||
| } |
There was a problem hiding this comment.
It's quite confusing. What is the actual order of messages that are sent/received (after reordering)?
If the reordering is the issue, then shouldn't we set invalidCommitment during the reordering rather than here?
Yeah, this is the tradeoff. We can put in more effort to identify false positives and fix the error-checking logic, with the benefit of potentially finding more kinds of bugs (especially unnecessary force closes). Or we can ignore that class of bugs entirely and put less effort into getting error-checking correct. It would be great to find a real bug with this fuzz target, which is why I suggested adding the error validation. But if you just want to get something merged first, we could rip that out and add it back in future PRs. |
MPins
left a comment
There was a problem hiding this comment.
Hello @NishantBansal2003, I’m also working on the HTLC state machine fuzz test (MPins#1). I’m still at an early stage, and after reviewing your PR, I’m thinking about how to improve mine. Thanks!
Hi @MPins, thanks for the review. I think this PR is almost complete (apart from the unresolved comments). I'm currently facing some false-positive link errors, so I've been continuously running fuzz tests and fixing them. I was planning to open this upstream soon once the initial review is done, so we can continue the discussion there and avoid duplicated effort. I'd really appreciate further reviews here, or I can also provide one if you're testing broader scenarios in your fuzzing. Overall, I think https://github.com/morehouse/smite is very efficient/less work for these kinds of fuzz tests. |
c17a8fc to
73e90e6
Compare
|
In the last two commits, I did some code cleanup also, so it should now be a bit easier to iterate. Regarding the link failures, I've added support to check them only on the local side (the victim). The remote side can get link errors (which is expected ;)), especially if it tries to operate on a state that is not in sync with the victim node. Previously, I was clearing the expected link errors on restart, but that could lead to false positives, since some error state might still be carried forward. Now, we don't reset that state. However, we do reset the state for stfu, since after a restart there is no state restoration for stfu messages. I've been running the fuzz tests continuously, and for quite a while I haven't seen any errors (though I'm still not fully confident). I'm opening these changes for review so we can move toward finalizing the PR. We can continue improving and fixing any remaining false positives in parallel. PS: CI failures can be ignored for now, they're likely due to breaking changes between commits. I’ll squash the commits once the PR is finalized. |
There was a problem hiding this comment.
I wasn't able to run the fuzzer. Hit a deadlock during startup:
$ cat testdata/fuzz/FuzzHTLCStates/1b1169501cbe101c
go test fuzz v1
[]byte("\x1a0\x1c\x1d\x01\x1f\"%(\xf5\x028CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\tCC\t\t\tCC\tCCCCC\tCCCC\tC\tC\t\t\t\tC\tC\t\t\t\tC\tC\tC\t\tCCC\t\tC\tC\t\t\tC\t\t\t\t\t\tC\tCCC\t\tC\t\t\tC\t\t\t\t\tCC\t\tC\t\tC\t\tCCCC\t\tC\tCC\t\tC\t\t\tCC\t\t\tCCCCC\tCC\tCC\tC\t\tC\t\tCCCCC\t\tCCCCCC\t\tCC\t\tC\t\t\t\t\t\t\tCC\t\t\tCC\t\t\t\tCCCC\t\tCCC\tCCC\tCCC\tCCCCC\t\t\t\t\t\tCCCC\tCC\tCC\tC\tCC\t\tCCC\tC\t\t\tCCC\tC\tC\tC\t\t\tCC\tC\tCC\tCC\t\t\t\tC\t\t\tC\t\t\t\tCCC\tC\tCC\tC\t\tC\t\t\tCC\tC\tCC\t\t\tC\t\tC\tCC\t\t\t\tC\t\t\tCC\t\t\tC\t\tCC\tCCC\tCCCC\tC\t\tC\t\tCC\t\t\t\tCCC\t\t\tCCCCC\tC\tCCC\t\tCCC\t\tC\tCCCCC\tC\tC\t\t\tCCC\tC\tCCC\t\tC\t\tC\t\t\t\t\tCCCC\tCCC\tCCC\t\t\t\t\tC\tCC\tCC\t\t\t\tCCCCC\t\tC\tC\tC\tC\tCCCCC\xc2\t\tCC\t\tCC\tCCCC\tCCC\t\tC\t\t\t\tCC\tC\t\tC\tCC\t\t\t\tC\t\tCCCCC\t\t\tCC\t\t\t\tCCC\t\tCCCCCC\tCC\t\tC\t\t\t\t\tC\tC\t\t\t\tCCC\tCC\t\tCC\tCCCCC\t\tCCCCCC\t\tC\tCC\tC\tC\tC\t\t\t\t\t\t\t\tC\tCCCC\tCCCCC\tCC\t\tC\t\tCC\tCCC\tC\tCCCC\tC\tCC\tC\tCCC\t\t\tCC\tCCC\tC\tC\tCCCCCC\t\t\t\t\t\t\t\tC\tCCCCCCC\t\tCCC\tCC\t\t\t\t\tC\t\tC\t\tC\tCCCCCC\t\t\tC\tC\t\tCCC\tCCC\t\tC\tC\t\tC\tC\t\t\t\tC\t\tC\tCC\t\tC\tC\t\t\t\tC\tC\t\tC\tCC\tC\tC\t\tCCC\tC\t\t\t\t\tC\tC\tC\tCCC\tC\tC\tCCC\t\tCCCCC\xc2CC\t\t\tC\tC\t\tCCCC\tC\tC\t\tCC\t\tC\tCC\tCC\t\tC\tCC\tC\t\t\t\tCC\tC\t\tC\tC\tC\tCCCCC\t\tCCC\tCC\tC\tC\t\tC\t\t\tC\t\t\tCC\tCCCCC\t\xc2CCCCCC\t\tCC\tCC\tC\tCC\t\t\t\tC\t\tC\t\tC\tCCCC\tCCCCCCC\tC\t\t\tCC\t\tC\tCCCCC\t\tC\tC\tC\t\t\tCC\tCCCC\t\tC\t\tCC\t\t\tC\tCCC\t\t\t\tCCCCC\t\tC\tCC\tC\t\t\t\tCC\t\tCC\tCCCCC\tCCC\t\t\tCCCCCC\t\t\tC\t\tCC\tC\t\t\t\tC\t\tCC\tCCCCCC\tC\tC\tC\tC\t\tCCCC\tC\t\t\tCC\t\tCC\tC\t\t\tCCCC\t\tCCC\tCCC\t\tCCC\tC\tC\t\t\t\tCC\t\t\tC\tCCCC\t\t\tC\tCC\t\t\t\tC\tCCC\tC\tC\t\tCCCCCCCCCCC\tCCCC\t\tCCC\tC\t\t\t\tC\tC\t\tC\tC\tCC\tCC\tCCC\t\tCCCCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\t\tC\tCC\tCCCC\tCC\tCCCCCCC\tCC\tC\tC\t\t\tCCCCC\t\tCC\tC\t\t\t\t\t\t\tC\t\t\tCCCCC\tC\tC\t\t\t\tCC\t\tC\t\tCCC\tC\t\t\tC\tCCC\t\tC\t\t\t\t\t\t\tC\tCC\tC\tCC\t\t\tCC\tCCCC\t\t\t\tCC\tC\tC\tC\t\tCCCCCC\tCCCCCC\t\tC\t\tC\t\t\tC\tCCC\t\t\tCCCC\t\tCC\t\t\tCCCC\tCC\tC\tCCCC\t\tCC\t\t\t\t\tCCC\t\t\tCC\t\tC\tC\tC\tC\t\tCCCC\tC\tCC\t\t\t\tCC\tCCC\t\tCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\tCCC\tCC\t\t\tC\t\tC\tC\tC\t\tCCCC\t\tC\t\t\t\tC\tCCCCC\tCCCCC\t\tCCCC\tCCCCC\t\t\tC\tCCC\t\tC\tC\tC\t\t\tCC\t\t\tCCCC\t\tC\tCC\t\tCCCCCC!!!!!!!\tCCCCCCC\tCCC\t\t\t\t\t\t\tCC\tC\x03\tCCCCCC\tCCC\tCCCC\t\tCCC\tC\tC\t\t\tCCCCCC\tC\t\tC\t\t\tCCCCC\tCC\t\tCCCCC\tCCC\t\tCCCC\tC\t\t\tC\t\t\t\tCCCCCCCC\tC\tC\tCCC\tC\tCCC\tCCC\tC\tC\t\tCCC\t\tCC\t\tC\t\tCCCC\t\t\tC\t\tCCC\t\tCCC\tCC\tC\tCCCCCCC\t\tC\t\tCC5\xbdI\xfd\tC\tCCC\tCCC\tC\tCCC\t\t\t\t\tCCCC\t\tC\tCCCC\t\tC\t\tCCC\tCC\tC\tCCC\tC\tC\t\tCCCCCCC\tC\t\tC\tC\tC\tCC\tCC\t\tCC\tC\tC\tC\tCC\t\tCCC\t\t\tCC\t\tC\tC\t\t\tC\t\tC\tC\t\tC\tCC\tC\tC\t\t\tCC\tC\tC\tCC\tC\tC\tC\t\t\tCCC\tC\t\t\tCCCCC\tCC\tCCC\tCC\t\tCC\t\t\tC\t\t\tCCCCC\tCC\t\tCC\t\t\tC\tCCC\t\t\t\tCCC\t\tCCC\t\tC\t\t\t\tC\tCC\tC\tC\tC\t\t\t\t\t\t\t\tC\tCCCC\tCCCCC\tCC\t\tC\t\tCC\tCCC\tC\tCCCC\tC\tCC\tC\tCCC\t\t\tCC\tCCC\tC\tC\tCCCCCC\t\t\t\t\t\t\t\tC\tCCCCCCC\t\tCCC\tCC\t\t\t\t\tC\t\tC\t\tC\tCCCCCC\t\t\tC\tC\t\tCCC\tCCC\t\tC\tC\t\tC\tC\t\t\t\tC\t\tC\tCC\t\tC\tC\t\t\t\tC\tC\t\tC\tCC\tC\tC\t\tCCC\tC\t\t\t\t\tC\tC\tC\tCCC\tC\tC\tCCC\t\tCCCCC\xc2CC\t\t\tC\tC\t\tCCCC\tC\tC\t\tCC\t\tC\tCC\tCC\t\tC\tCC\tC\t\t\t\tCC\tC\t\tC\tC\tC\tCCCCC\t\tCCC\tCC\tC\tC\t\tC\t\t\tC\t\t\tCC\tCCCCC\t\xc2CCCCCC\t\tCC\tCC\tC\tCC\t\t\t\tC\t\tC\t\tC\tCCCC\tCCCCCCC\tC\t\t\tCC\t\tC\tCCCCC\t\tC\tC\tC\t\t\tCC\tCCCC\t\tC\t\tCC\t\t\tC\tCCC\t\t\t\tCCCCC\t\tC\tCC\tC\t\t\t\tCC\t\tCC\tCCCCC\tCCC\t\t\tCCCCCC\t\t\tC\t\tCC\tC\t\t\t\tC\t\tCC\tCCCCCC\tC\tC\tC\tC\t\tCCCC\tC\t\t\tCC\t\tCC\tC\t\t\tCCCC\t\tCCC\tCCC\t\tCCC\tC\tC\t\t\t\tCC\t\t\tC\tCCCC\t\t\tC\tCC\t\t\t\tC\tCCC\tC\tC\t\tCCCCCCCCCCC\tCCCC\t\tCCC\tC\t\t\t\tC\tC\t\tC\tC\tCC\tCC\tCCC\t\tCCCCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\t\tC\tCC\tCCCC\tCC\tCCCCCCC\tCC\tC\tC\t\t\tCCCCC\t\tCC\tC\t\t\t\t\t\t\tC\t\t\tCCCCC\tC\tC\t\t\t\tCC\t\tC\t\tCCC\tC\t\t\tC\tCCC\t\tC\t\t\t\t\t\t\tC\tCC\tC\tCC\t\t\tCC\tCCCC\t\t\t\tCC\tC\tC\tC\t\tCCCCCC\tCCCCCC\t\tC\t\tC\t\t\tC\tCCC\t\t\tCCCC\t\tCC\t\t\tCCCC\tCC\tC\tCCCC\t\tCC\t\t\t\t\tCCC\t\t\tCC\t\tC\tC\tC\tC\t\tCCCC\tC\tCC\t\t\t\tCC\tCCC\t\tCCCCCC\tC\t\t\tC\t\t\t\tC\t\t\tCCC\tCC\t\t\tC\t\tC\tC\tC\t\tCCCC\t\tC\t\t\t\tC\tCCCCC\tCCCCC\t\tCCCC\tCCCCC\t\t\tC\tCCC\t\tC\tC\tC\t\t\tCC\t\t\tCCCC\t\tC\tCC\t\tCCCCCC\tCCCCCCC\tCCC\t\t\t\t\t\t\tCC\tC\x03\tCCCCCC\tCCC\tCCCC\t\tCCC\tC\tC\t\t\tCCCCCC\tC\t\tC\t\t\tCCCCC\tCC\t\tCCCCC\tCCC\t\tCCCC\tC\t\t\tC\t\t\t\tCCCCCCCC\tC\tC\tCCC\tC\tCCC\tCCC\tC\tC\t\tCCC\t\tCC\t\tC\t\tCCCC\t\t\tC\t\tCCC\t\tCCC\tCC\tC\tCCCCCCC\t\tC\t\tCC\tC\tCCC\tCCC\tC\tCCC\t\t\t\t\tCCCC\t\tC\tCCCC\t\tC\t\tCCC\tCC\tC\tCCC\tC\tC\t\tCCCCCCC\tC\t\tC\tC\tC\tCC\tCC\t\tCC\tC\tC\tC\tCC\t\tCCC\t\t\tCC\t\tC\tC\t\t\tC\t\tC\tC\t\tC\tCC\tC\tC\t\t\tCC\tC\tC\tCC\tC\tC\tC\t\t\tCCC\tC\t\t\tCCCCC\tCC\tCCC\tCC\t\tCC\t\t\tC\t\t\tCCCCC\tCC\t\tCC\t\t\tC\tCCC\t\t\t\tCCC\t\tCCC\t\tC\t\t\tCCC\tCC\t\t\tCCCCC\t\t\t\tCCCCC\tC\t\tC\tCCC\tCCC\t\xc2CC\tC\t\tC\tCC\t\tC\t\t\t\tC\t\t\tCCCCC\tC\tC\t\tC\t\tC\t\tCC\t\xc2\tCCC\tC\t\tCCC\tCCCC\tC\t\tCCC\t\tC\t\tC\tCC\t\tCCCC\t\t\t\t\t\t\t\t\t\tCC\tC\tCCCCC\t\tCC\t\tCCCCCC\t\t\tC\tC\tC\t\tCC\t\tCC\tCC\t\t\tC\t\t\tCCCCCCC\tCC\t\t\tC\tC\t\t\tCC\tC\tCCCC\tCCCCC\tC\tCC\tC\tCCC\t\tCCC\t\tCC\tC\tC\tC\tCCCC\tCCCC\t\t\t\tC\t(\t\tC\t\t\tCC\tC\t\tC\t\t\tC\t\tCCC\tC\tCCCC\t\tC\t\tCCCCCC\t\tCC\t\t\tCC\tCC\t\t\tC\t\tC\t\tCCCC\t\t\tCC\t\tCCC\t\t\tC\t\tCCC\t\tCC\tCCC\t\t\tC\t\tCCC\tC\t\tC\t\t\tCC\t\t\t\tCCC\t\tCCC\t\t\tC\tCCCC\tCCCCC\t\tCCCCCC\t\tC\tC\tC\t\t\t\tCC\t\tC\tCCCCCCC\t\tC\tCC\tC\tC\tC\t\tCC\t\tCC\t\tC\t\t\t\tC\tC\t\t\t\tCCCC\t\tC\t\tCC\t\t\tCC\tCC\t\tC\tCC\t\tC\t\t\tC\t\t\t\t\tCC\t\t\tCC\tCCCCC\tC\tCC\t\tCCCCCC\tCCCCCC\t\tCCCCC\t\tC\tC\t\t\tCCC\tCCCC\tC\t\tCCC\tCCCC\t\t\t\tC\tCCCCCCC\t\t\tCC\tCC\t\tCCCCCCCC\tCCCCCCCCC\t\tC\t\tCCC\tCCC\tC\tC\tCCC\t\tC\tC\tCCCC\t\t\xc2\xc2CC\t\tC\t\t\t\tC\tCCC\t\tC\t\t\t\tC\tC\t\tCC\tCCC\tCCCCCC\t\t\t\tC\t\t\tC\tC\t\t\tC\t\t\t\t\t\t\tCCCC\tCCC\tCCC\t\xc2C\t\t\t\t\t\t\tCCCC\t\t\t\tN\t\t\tC\t\tCCC\tCCCCC\t\tCC\t\tC\tC\tCCCCCCCCCC\t\tC\t\tCC\tCCC\t\tCCCCCC\tCCCC\t\tCC\t\t\t\tC\tCCCCCCCC\tCC\t\t\tCCCCC\t\t\t\tCCCCC\tC\t\tC\tCCC\tCCC\t\xc2CC\tC\t\tC\tCC\t\tC\t\t\t\tC\t\t\tCCCCC\tC\tC\t\tC\t\tC\t\tCC\t\xc2\tCCC\tC\t\tCCC\tCCCC\tC\t\tCCC\t\tC\t\tC\tCC\t\tCCCC\t\t\t\t\t\t\t\t\t\tCC\tC\tCCCCC\t\tCC\t\tCCCCCC\t\t\tC\tC\tC\t\tCC\t\tCC\tCC\t\t\tC\t\t\tCCCCCCC\tCC\t\t\tC\tC\t\t\tCC\tC\tCCCC\tCCCCC\tC\tCC\tC\tCCC\t\tCCC\t\tCC\tC\tC\tC\tCCCC\tCCCC\t\t\t\tC\t(\t\tC\t\t\tCC\tC\t\tC\t\t\tC\t\tCCC\tC\tCCCC\t\tC\t\tCCCCCC\t\tCC\t\t\tCC\tCC\t\t\tC\t\tC\t\tCCCC\t\t\tCC\t\tCCC\t\t\tC\t\tCCC\t\tCC\tCCC\t\t\tC\t\tCCC\tC\t\tC\t\t\tCC\t\t\t\tCCC\t\tCCC\t\t\tC\tCCCC\tCCCCC\t\tCCCCCC\t\tC\tC\tC\t\t\t\tCC\t\tC\tCCCCCCC\t\tC\tCC\tC\tC\tC\t\tCC\t\tCC\t\tC\t\t\t\tC\tC\t\t\t\tCCCC\t\tC\t\tCC\t\t\tCC\tCC\t\tC\tCC\t\tC\t\t\tC\t\t\t\t\tCC\t\t\tCC\tCCCCC\tC\tCC\t\tCCCCCC\tCCCCCC\t\tCCCCC\t\tC\tC\t\t\tCCC\tCCCC\tC\t\tCCC\tCCCC\t\t\t\tC\tCCCCCCC\t\t\tCC\tCC\t\tCCCCCCCC\tCCCCCCCCC\t\tC\t\tCCC\tCCC\tC\tC\tCCC\t\tC\tC\tCCCC\t\t\xc2\xc2CC\t\tC\t\t\t\tC\tCCC\t\tC\t\t\t\tC\tC\t\tCC\tCCC\tCCCCCC\t\t\t\tC\t\t\tC\tC\t\t\tC\t\t\t\t\t\t\tCCCC\tCCC\tCCC\t\xc2C\t\t\t\t\t\t\tCCCC\t\t\t\tN\t\t\tC\t\tCCC\tCCCCC\t\tCC\t\tC\tC\tCCCCCCCCCC\t\tC\t\tCC\tCCC\t\tCCCCCC\tCCCC\t\tCC\t\t\t\tC\tCCCCC\t\tCCC\tC\t\tC\tC\tCC\tC\tCC\tCCCCCCC\t\tCCC\t\tC\t\t\tC\t\t\tCCCCCCCCC\tC\tC\t\tCCCCCCCC\tCCC\tCC\tC\tCCC\tC\t\t\tCC\tCCC\t\tCCCCC\t\tCC\x15CC\tC\t\tCCCCC\t\tCC\t\tC\x7fC\tC\t\tC\t\t\t\tCC\tCC\tC\t\tC\tC\t\tC\t\tCC\tCCC\t\t\t\tCCCCC\tCC\tCC\t\tC\t\t\t\tCCC\tCCCC\tC\tC\t\t\tCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\x12CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xb8\xb8\xb8CCCCC\tCCC\tCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xc3CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\xad\xc9<d\x06\xe7CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\tCC\tCCCC\x1d$\x1b&\x17\x162x$X'\x10(y.B\v\xf1\xf1\xf1\xf1\xf1\xf1\xf1\xf1+\t\b\aB\x05087\x01\x00\x02\x00\x01\x00\x028y\x01y08a\x02\xff\xe8\x01\x01\x01#\x01 \x00$\x00!")
$ go test -run=FuzzHTLCStates/1b1169501cbe101c -timeout=60s
panic: test timed out after 1m0s
...
goroutine 59 [select]:
github.com/lightningnetwork/lnd/htlcswitch.(*mockPeer).SendMessage(0xc001154ed0?, 0x68?, {0xc0011b2380?, 0xc000fe3680?, 0x1?})
htlcswitch/link_test.go:2091 +0x87
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).syncChanStates(0xc000173408, {0xf660d0, 0xc0001d4a00})
htlcswitch/link.go:1048 +0xe43
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resumeLink(0xc000173408, {0xf660d0, 0xc0001d4a00})
htlcswitch/link.go:3896 +0x45
github.com/lightningnetwork/lnd/htlcswitch.setupSide(0xc000582c40, 0xc00042b1a0, {0x3, 0x8d, 0x9a, 0x42, 0xb8, 0x9c, 0x7e, 0xdf, ...}, ...)
htlcswitch/fuzz_test.go:355 +0x255
github.com/lightningnetwork/lnd/htlcswitch.(*fuzzNetwork).restartNode(0xc0001b5e30)
htlcswitch/fuzz_test.go:1230 +0x2eb
github.com/lightningnetwork/lnd/htlcswitch.(*fuzzNetwork).runHTLCFuzzStateMachine(0xc0001b5e30)
htlcswitch/fuzz_test.go:1405 +0x1f7
github.com/lightningnetwork/lnd/htlcswitch.FuzzHTLCStates.func1(0x0?, {0xc00053f980?, 0x0?, 0x480893?})
htlcswitch/fuzz_test.go:1433 +0x25
reflect.Value.call({0xc2b0a0?, 0xe87de8?, 0x13?}, {0xd900ca, 0x4}, {0xc000434b10, 0x2, 0x2?})
/usr/lib/golang/src/reflect/value.go:581 +0xcc6
reflect.Value.Call({0xc2b0a0?, 0xe87de8?, 0x1498740?}, {0xc000434b10?, 0xd8e160?, 0x1357e08?})
/usr/lib/golang/src/reflect/value.go:365 +0xb9
testing.(*F).Fuzz.func1.1(0xc000582c40?)
/usr/lib/golang/src/testing/fuzz.go:341 +0x32a
testing.tRunner(0xc000582c40, 0xc00026e000)
/usr/lib/golang/src/testing/testing.go:1934 +0xea
created by testing.(*F).Fuzz.func1 in goroutine 58
/usr/lib/golang/src/testing/fuzz.go:328 +0x637
...There was a problem hiding this comment.
Removing this block looks suspicious. Previously we returned immediately if updateBytes == nil; now we fall through to several code blocks below, potentially even doing DB operations.
There was a problem hiding this comment.
Yes, there was a subtle issue where we shouldn’t update the DB for remote updates that we have acked (I’ve raised this in the upstream review: lightningnetwork#10622 (review)). However, we still need to update the DB for local updates that the remote party has acked, since that’s where things were failing for update_fee (the remote sends the revoke, so we have an update to store in the DB)
| // We will revoke the STFU violation after the restart, since the | ||
| // message queue has been reset. | ||
| e.revoke(ErrStfuViolation) |
There was a problem hiding this comment.
A casual reader wouldn't expect merge to also revoke STFU violations. The revoke should be done by the caller of this function instead.
|
|
||
| require.True(t, expectedErrs.allows(linkErr.code), | ||
| "unexpected link error: %v", linkErr.code) | ||
| } |
There was a problem hiding this comment.
Why would the channel be borked? If it shouldn't happen in prod, perhaps it is a real bug?
There was a problem hiding this comment.
Nope, the only case where the channel might be borked is:
Line 1259 in 392d4c8
And, since we mark this as expected, the tests will continue. Later, when another message is processed, the borked check may be triggered internally within
lnwallet, and it could result in an error of any kind
| localErrors.merge(network.localErrors) | ||
| network.localErrors = localErrors |
There was a problem hiding this comment.
IIUC, we can simplify this pattern to:
| localErrors.merge(network.localErrors) | |
| network.localErrors = localErrors | |
| network.localErrors.merge(localErrors) |
There was a problem hiding this comment.
So, onChannelFailure is tied to localErrors (new) and not network.localErrors (old). So, we need to carry the old data into the new variable and then reassign the old reference to the new variable
May I know after how much time you got into this, or was it in the first pass only? |
It occurred while loading the previous corpus. No actual fuzzing was done. |
Hmm, I'm a bit confused about why I wasn't able to catch all of these. I ran fuzzing for a long time, even on two systems. Not sure if this is a good approach, but I also cleared the fuzz corpus multiple times since many mutations are exercised, and sometimes running with an empty corpus helped me find bugs in my fuzz tests. However, that also didn't happen for a long time. Is this something that is fully random, or is it possible that certain mutations were triggered on one system but not on the other? Any suggestions to avoid this (fuzzing continuously for 2–3 weeks non-stop?) I really don't want to hinder the review because of these oversights |
I'm not sure. I ran previous versions of this PR on 30 cores for a while (until I hit the previous crashes). I think all previous crashes triggered in under 2 hours. I haven't deleted the corpus at all -- currently theres 3k inputs there. I'm not using any unusual flags or anything: $ go test -fuzz=FuzzHTLCStates -parallel=30 -tags=dev
fuzz: elapsed: 1s, gathering baseline coverage: 0/3028 completed
fuzz: elapsed: 4s, gathering baseline coverage: 479/3028 completed
fuzz: elapsed: 7s, gathering baseline coverage: 983/3028 completed
fuzz: elapsed: 10s, gathering baseline coverage: 1433/3028 completed
fuzz: elapsed: 13s, gathering baseline coverage: 1888/3028 completed
fuzz: elapsed: 14s, gathering baseline coverage: 1888/3028 completed
--- FAIL: FuzzHTLCStates (14.37s)
fuzzing process hung or terminated unexpectedly: exit status 2
Failing input written to testdata/fuzz/FuzzHTLCStates/1b1169501cbe101c |
|
Hello @NishantBansal2003, I think you could significantly improve executions per second by mocking signature generation/verification and key derivation, and running the DB in memory. I’ve already done this in my approach to link fuzzing—feel free to take a look if it’s useful: MPins#1 |
We discussed using in-memory DB in the above PR thread itself, but there was a lot of memory usage because of it, so we dropped that idea (See: #7 (comment))
Yes, maybe we can do this. But my thought is that mocking should be done at the test package level, right? Is it a good idea to change the critical lnwallet code just for fuzzing. That would mean we’d have to test that part separately, which would significantly increase the effort. |
|
I read through all the historical comments here 😉, it was a great way to learn from your approach. This, combined with mocked signing, verification, and key derivation, yields hundreds of executions per second. |
73e90e6 to
f412fda
Compare
Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
f412fda to
1f3b2af
Compare
Change Description
Topology:
Local (ChannelLink) <---> Remote (ChannelLink)This PR adds HTLC state machine fuzz tests using a 2-hop network. Based on the fuzz input, the test executes various state transitions such as Local sending HTLC adds to Remote, Remote sending adds to Local, and exchanging revoke, commit-sig, fulfill/fail, and other messages between peers.
Message ordering is controlled via a mockPeer instance, allowing us to simulate unordered commitment dance behavior and precisely manage message flow.
The primary goal is to ensure that after all HTLC operations, the channels never fall out of sync or force-close unexpectedly.
Steps to Test