rpcclient: fix several synchronization bugs#2500
Conversation
|
hello @wydengyre @seeforschauer @jcvernaleo would you consider reviewing this one? |
seeforschauer
left a comment
There was a problem hiding this comment.
LGTM — all three bugs are real and the fixes are correct. I've been working in rpcclient/infrastructure.go recently (#2506, #2505), so I have direct context on these paths.
Commit structure is clean (one bug, one test, one commit). Each test fails when the fix is reverted — solid regression coverage.
One suggestion on sendPostRequest (inline) for tighter shutdown determinism using the priority-select pattern already in addRequest. Two minor notes on failBatchRequests.
| // Atomically either queue the request or fail it due to shutdown. | ||
| // | ||
| // This avoids delivering two terminal responses to the same request, which | ||
| // can otherwise block shutdown cleanup on the second send. | ||
| select { | ||
| case <-c.shutdown: | ||
| jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown} | ||
| default: | ||
| } | ||
| return | ||
|
|
||
| select { | ||
| case c.sendPostChan <- jReq: | ||
| log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id) | ||
|
|
||
| case <-c.shutdown: | ||
| return | ||
| } |
There was a problem hiding this comment.
suggestion (non-blocking): Consider the priority-select pattern here.
When shutdown is already closed (its permanent state after Shutdown()), the single select randomly picks between responding and enqueueing (~50/50 per Go spec). If sendPostHandler has already exited cleanup, the enqueued request's future is never resolved.
A non-blocking shutdown guard first makes the common post-shutdown path deterministic — addRequest (line 213) already uses this exact pattern for the same reason. The remaining race (shutdown closing between the two selects) has a much narrower window.
| // Atomically either queue the request or fail it due to shutdown. | |
| // | |
| // This avoids delivering two terminal responses to the same request, which | |
| // can otherwise block shutdown cleanup on the second send. | |
| select { | |
| case <-c.shutdown: | |
| jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown} | |
| default: | |
| } | |
| return | |
| select { | |
| case c.sendPostChan <- jReq: | |
| log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id) | |
| case <-c.shutdown: | |
| return | |
| } | |
| // Prefer shutdown: if already closed, fail the request immediately. | |
| // This avoids a random race between shutdown and enqueue when both | |
| // channels are ready, consistent with the guard in addRequest. | |
| select { | |
| case <-c.shutdown: | |
| jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown} | |
| return | |
| default: | |
| } | |
| // Normal path: enqueue or fail on shutdown. Exactly one outcome. | |
| select { | |
| case c.sendPostChan <- jReq: | |
| log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id) | |
| case <-c.shutdown: | |
| jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown} | |
| } |
There was a problem hiding this comment.
This is effectively how the code already was.
@starius isn't it better to prioritize the shutdown path?
There was a problem hiding this comment.
Right, the structure is two selects like the original — the critical difference is the return after the first shutdown case. The original fell through into the second select even after sending ErrClientShutdown, which allowed the double-resolve.
With the return, it becomes the standard priority-select: if shutdown is already closed, respond deterministically and exit. The second select is only reached when shutdown wasn't closed at check time.
There was a problem hiding this comment.
Fixed! Now it prioritizes the shutdown path:
// sendPostRequest sends the passed HTTP request to the RPC server using the
// HTTP client associated with the client. It is backed by a buffered channel,
// so it will not block until the send channel is full.
func (c *Client) sendPostRequest(jReq *jsonRequest) {
// Prefer shutdown when it is already closed so this path is
// deterministic. This mirrors addRequest and avoids post-shutdown
// enqueueing.
select {
case <-c.shutdown:
jReq.responseChan <- &Response{
result: nil,
err: ErrClientShutdown,
}
return
default:
}
// Normal path: either enqueue, or fail if shutdown closes in the race
// window after the guard above.
select {
case c.sendPostChan <- jReq:
log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
case <-c.shutdown:
jReq.responseChan <- &Response{
result: nil,
err: ErrClientShutdown,
}
}
}|
|
||
| // Resolve all pending futures on the first batch-level failure so | ||
| // callers waiting on Receive don't block indefinitely. | ||
| req.responseChan <- &Response{err: err} |
There was a problem hiding this comment.
nit: This send is safe because in batch mode sendRequest only calls addRequest (never sendPostRequest), so individual responseChan buffers (size 1) are guaranteed unwritten at this point. A brief comment documenting this invariant would help future readers — e.g.:
// Safe: batch-mode responseChan buffers are unwritten here,
// so this send won't block while locks are held.
req.responseChan <- &Response{err: err}There was a problem hiding this comment.
Added a comment:
// Resolve all pending futures on the first batch-level failure
// so callers waiting on Receive don't block indefinitely.
// Safe: batch-mode responseChan buffers are unwritten here,
// so this send won't block while locks are held. Batch-mode
// requests only use addRequest (not sendPostRequest), so each
// responseChan buffer is still empty.
req.responseChan <- &Response{err: err}| } | ||
|
|
||
| c.requestMap = make(map[uint64]*list.Element) | ||
| c.requestList.Init() |
There was a problem hiding this comment.
question: In batch mode, addRequest pushes to batchList, never requestList, so this should already be empty. Intentional defensive reset, or leftover? If intentional, a quick comment would clarify.
There was a problem hiding this comment.
It is a defensive reset. Added a comment:
// Batch-mode requests are tracked in batchList, so requestList should
// already be empty. Keep this defensive reset for invariants and future
// call paths.
c.requestList.Init()
Roasbeef
left a comment
There was a problem hiding this comment.
Change looks good, only comment is why we'd move away from the pattern that prioritizes a cancel path.
| // Atomically either queue the request or fail it due to shutdown. | ||
| // | ||
| // This avoids delivering two terminal responses to the same request, which | ||
| // can otherwise block shutdown cleanup on the second send. | ||
| select { | ||
| case <-c.shutdown: | ||
| jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown} | ||
| default: | ||
| } | ||
| return | ||
|
|
||
| select { | ||
| case c.sendPostChan <- jReq: | ||
| log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id) | ||
|
|
||
| case <-c.shutdown: | ||
| return | ||
| } |
There was a problem hiding this comment.
This is effectively how the code already was.
@starius isn't it better to prioritize the shutdown path?
264b120 to
5da6fa2
Compare
1acd9a8 to
54be134
Compare
|
I addressed the remaining comments in #2451 and added the updated version of it here as commit "rpcclient: support canceling in-flight http requests". Functionally it is the same, but refactored. I simplified the code by making a function that returns |
|
I read the diff and ran the tests, LGTM. Nice work @starius. |
seeforschauer
left a comment
There was a problem hiding this comment.
LGTM, thanks for the clean fixes!
| } | ||
|
|
||
| // handleSendPostMessageWithRetry runs handleSendPostMessage using the provided | ||
| // retry count. It exists so tests can exercise retry edge cases quickly. |
There was a problem hiding this comment.
nit: this might read as handleSendPostMessageWithRetry is for internal tests.
There was a problem hiding this comment.
I renamed it to sendPostRequestWithRetry so it does not sound like test-specific.
|
|
||
| // TestHandleSendPostMessageShutdownDuringRetryBackoff ensures shutdown | ||
| // cancellation interrupts retry backoff and remaps to ErrClientShutdown. | ||
| func TestHandleSendPostMessageShutdownDuringRetryBackoff(t *testing.T) { |
There was a problem hiding this comment.
These three tests:
TestHandleSendPostMessageShutdownDuringRetryBackoff
TestHandleSendPostMessageShutdownOnFinalRetry
TestHandleSendPostMessageShutdownDuringBodyRead
iiuc, these are all just trying to assert that the err returns as ErrClientShutdown? They exist to test like this one branch and is ridiculously verbose. imo these need to be cleaned up into one table driven test for future maintainability.
There was a problem hiding this comment.
I replaced them with a single table-driven test that reuses the same shutdown scenarios as the helper-level test.
these are all just trying to assert that the err returns as ErrClientShutdown?
Yes, every shutdown-triggered path must receive a ErrClientShutdown on responseChan. The tests check this.
| // TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff ensures | ||
| // that handleSendPostMessageWithRetry returns context cancellation from the | ||
| // retry-backoff path. | ||
| func TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff( |
There was a problem hiding this comment.
The tests TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff, TestHandleSendPostMessageWithRetryShutdownOnFinalRetry, TestHandleSendPostMessageWithRetryShutdownOnBodyRead should also really be a single table driven test.
There was a problem hiding this comment.
I collapsed those three helper-level shutdown tests into one table-driven test and kept the success case separate:
- retry-backoff cancellation
- final-retry cancellation
- cancellation during body read.
I also tightened the final-retry case so it depends on request-context propagation rather than a transport stub returning context.Canceled directly, which keeps it a stronger regression test.
|
The actual code looks good but the tests... they're rough. Like it looks like completely claude generated in a single go. I left some reviews but even the other tests don't look great. |
I cleaned up the shutdown-path tests to reduce copy-paste and converted them into table tests. I also did other small changes, described in new commits. I kept the changes in fixup commits which can be applied automatically by |
Use a shutdown-aware context for HTTP POST handling so shutdown can interrupt in-flight requests. Centralize shutdown error remapping in sendPostRequestAndRespond so all error exits consistently return ErrClientShutdown when shutdown causes a context cancellation. Move the retrying HTTP POST path into sendPostRequestWithRetry and cover it with shutdown regression tests.
When shutdown races with sendPostRequest, a request could be marked as ErrClientShutdown and still be enqueued. The sendPostHandler cleanup loop would then try to send a second terminal response and could block forever on a full response channel. Fix this by prioritizing the shutdown path. First check shutdown with a non-blocking select and return immediately when it is already closed. Then use a second select to choose between enqueue and shutdown for the remaining race window. A regression test verifies a shutdown request is failed immediately and never enqueued.
Batch requests were only clearing batchList on Send() errors. The per-request futures remained unresolved, so callers waiting on Receive could block forever after a failed batch round trip. Add failBatchRequests to fan out the Send() error to every queued batch request and clear tracking state in one place. A regression test now verifies queued futures complete with the same error returned by Send().
NewBatch called New() and then called start() again. In HTTP POST mode that created a second sendPostHandler and another shutdown-cancel goroutine, which broke the expected single-flight serialization of POST sends. Keep NewBatch as a semantic toggle only: rely on New() to start handlers once, then set batch=true. A regression test now checks that batch POST requests stay serialized through one active transport call.
|
Rebased and formatted code (pre-existing not formatted code). CI is failing because of pre-existing issues in other packages:
Why this is happening:
Local reproduction:
|
PR btcsuite#2467 changed regtest to match Core's BIP34/65/66 activation rules, and the merged stack carries that (commit cd4e542 "regtest: align activations with Bitcoin Core"). Height-1 regtest blocks now need a BIP34-compliant coinbase height and a post-BIP66 block version. The failing tests came from commits added after PR btcsuite#2467 was opened on December 25, 2025 but before it merged on April 30, 2026: - c1a4612 ("blockchain: add ProcessBlockHeader") - f9645f0 ("blockchain: reuse existing header node in maybeAcceptBlock") - dc6e096 ("netsync: add TestSyncStateMachine for end-to-end IBD sync flow") - ce09426 ("netsync: add TestStartSyncBlockFallback for block-only sync path") - 2aae8a6 ("netsync: add TestStartSyncChainCurrent for chain-current noop path") Because those tests landed later, they kept the old regtest assumptions even though btcsuite#2467 had already been authored and tested against the older tree. Once btcsuite#2467 finally merged, these newer tests started building invalid regtest blocks and headers. Fix them by setting the genesis tip height to 0 before generating descendants, using Version 4 in the regtest block/header helpers, and encoding the test coinbase height with a minimal BIP34 push plus padding for the generic coinbase script-length rule.
|
Fixed the existing tests failures in a separate commit. |
Change Description
Incorporated #2451 :
Modify the rpcclient http POST call to ensure that a shutdown immediately interrupts in-flight requests, which otherwise would have to wait until timeout.
Fix 3 other problems in rpcclient:
Steps to Test
Each test is a regression test. It fails if the patch is reverted.
Pull Request Checklist
Testing
Code Style and Documentation
📝 Please see our Contribution Guidelines for further guidance.