rpcclient: fix several synchronization bugs by starius · Pull Request #2500 · btcsuite/btcd

starius · 2026-03-20T23:40:24Z

Change Description

Incorporated #2451 :

Modify the rpcclient http POST call to ensure that a shutdown immediately interrupts in-flight requests, which otherwise would have to wait until timeout.

Fix 3 other problems in rpcclient:

HTTP POST shutdown can deadlock WaitForShutdown due to double-response race
Batch-mode Send() error path leaves queued per-request futures unresolved
NewBatch starts HTTP POST handlers twice, increasing concurrency surface

Steps to Test

go test ./rpcclient -count=1

Each test is a regression test. It fails if the patch is reverted.

Pull Request Checklist

Testing

Your PR passes all CI checks.
Tests covering the positive and negative (error paths) are included.
Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

The change is not insubstantial. Typo fixes are not accepted to fight bot spam.
The change obeys the Code Documentation and Commenting guidelines, and lines wrap at 80.
Commits follow the Ideal Git Commit Structure.
Any new logging statements use an appropriate subsystem and logging level.

📝 Please see our Contribution Guidelines for further guidance.

saubyk · 2026-03-23T15:13:16Z

hello @wydengyre @seeforschauer @jcvernaleo would you consider reviewing this one?

seeforschauer

LGTM — all three bugs are real and the fixes are correct. I've been working in rpcclient/infrastructure.go recently (#2506, #2505), so I have direct context on these paths.

Commit structure is clean (one bug, one test, one commit). Each test fails when the fix is reverted — solid regression coverage.

One suggestion on sendPostRequest (inline) for tighter shutdown determinism using the priority-select pattern already in addRequest. Two minor notes on failBatchRequests.

seeforschauer · 2026-03-23T19:53:30Z

+	// Atomically either queue the request or fail it due to shutdown.
+	//
+	// This avoids delivering two terminal responses to the same request, which
+	// can otherwise block shutdown cleanup on the second send.
 	select {
 	case <-c.shutdown:
 		jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
-	default:
-	}
+		return

-	select {
 	case c.sendPostChan <- jReq:
 		log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
-
-	case <-c.shutdown:
-		return
 	}


suggestion (non-blocking): Consider the priority-select pattern here.

When shutdown is already closed (its permanent state after Shutdown()), the single select randomly picks between responding and enqueueing (~50/50 per Go spec). If sendPostHandler has already exited cleanup, the enqueued request's future is never resolved.

A non-blocking shutdown guard first makes the common post-shutdown path deterministic — addRequest (line 213) already uses this exact pattern for the same reason. The remaining race (shutdown closing between the two selects) has a much narrower window.

Suggested change

// Atomically either queue the request or fail it due to shutdown.

//

// This avoids delivering two terminal responses to the same request, which

// can otherwise block shutdown cleanup on the second send.

select {

case <-c.shutdown:

jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}

default:

}

return

select {

case c.sendPostChan <- jReq:

log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

case <-c.shutdown:

return

}

// Prefer shutdown: if already closed, fail the request immediately.

// This avoids a random race between shutdown and enqueue when both

// channels are ready, consistent with the guard in addRequest.

select {

case <-c.shutdown:

jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}

return

default:

}

// Normal path: enqueue or fail on shutdown. Exactly one outcome.

select {

case c.sendPostChan <- jReq:

log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

case <-c.shutdown:

jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}

}

This is effectively how the code already was.

@starius isn't it better to prioritize the shutdown path?

Right, the structure is two selects like the original — the critical difference is the return after the first shutdown case. The original fell through into the second select even after sending ErrClientShutdown, which allowed the double-resolve.

With the return, it becomes the standard priority-select: if shutdown is already closed, respond deterministically and exit. The second select is only reached when shutdown wasn't closed at check time.

Fixed! Now it prioritizes the shutdown path:

// sendPostRequest sends the passed HTTP request to the RPC server using the // HTTP client associated with the client. It is backed by a buffered channel, // so it will not block until the send channel is full. func (c *Client) sendPostRequest(jReq *jsonRequest) { // Prefer shutdown when it is already closed so this path is // deterministic. This mirrors addRequest and avoids post-shutdown // enqueueing. select { case <-c.shutdown: jReq.responseChan <- &Response{ result: nil, err: ErrClientShutdown, } return default: } // Normal path: either enqueue, or fail if shutdown closes in the race // window after the guard above. select { case c.sendPostChan <- jReq: log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id) case <-c.shutdown: jReq.responseChan <- &Response{ result: nil, err: ErrClientShutdown, } } }

seeforschauer · 2026-03-23T19:53:30Z

+
+		// Resolve all pending futures on the first batch-level failure so
+		// callers waiting on Receive don't block indefinitely.
+		req.responseChan <- &Response{err: err}


nit: This send is safe because in batch mode sendRequest only calls addRequest (never sendPostRequest), so individual responseChan buffers (size 1) are guaranteed unwritten at this point. A brief comment documenting this invariant would help future readers — e.g.:

// Safe: batch-mode responseChan buffers are unwritten here, // so this send won't block while locks are held. req.responseChan <- &Response{err: err}

Added a comment:

// Resolve all pending futures on the first batch-level failure // so callers waiting on Receive don't block indefinitely. // Safe: batch-mode responseChan buffers are unwritten here, // so this send won't block while locks are held. Batch-mode // requests only use addRequest (not sendPostRequest), so each // responseChan buffer is still empty. req.responseChan <- &Response{err: err}

seeforschauer · 2026-03-23T19:53:30Z

+	}
+
+	c.requestMap = make(map[uint64]*list.Element)
+	c.requestList.Init()


question: In batch mode, addRequest pushes to batchList, never requestList, so this should already be empty. Intentional defensive reset, or leftover? If intentional, a quick comment would clarify.

It is a defensive reset. Added a comment:

// Batch-mode requests are tracked in batchList, so requestList should // already be empty. Keep this defensive reset for invariants and future // call paths. c.requestList.Init()

Roasbeef

Change looks good, only comment is why we'd move away from the pattern that prioritizes a cancel path.

Roasbeef · 2026-03-24T00:00:53Z

+	// Atomically either queue the request or fail it due to shutdown.
+	//
+	// This avoids delivering two terminal responses to the same request, which
+	// can otherwise block shutdown cleanup on the second send.
 	select {
 	case <-c.shutdown:
 		jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
-	default:
-	}
+		return

-	select {
 	case c.sendPostChan <- jReq:
 		log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
-
-	case <-c.shutdown:
-		return
 	}


This is effectively how the code already was.

@starius isn't it better to prioritize the shutdown path?

starius · 2026-03-25T04:52:11Z

I addressed the remaining comments in #2451 and added the updated version of it here as commit "rpcclient: support canceling in-flight http requests". Functionally it is the same, but refactored. I simplified the code by making a function that returns ([]byte, error); error remapping and channel sending done is done on the call site.

CC @Roasbeef @seeforschauer @wydengyre

guibressan · 2026-04-07T17:50:43Z

I read the diff and ran the tests, LGTM. Nice work @starius.

seeforschauer

LGTM, thanks for the clean fixes!

kcalvinalvin · 2026-05-04T17:02:43Z

+}
+
+// handleSendPostMessageWithRetry runs handleSendPostMessage using the provided
+// retry count. It exists so tests can exercise retry edge cases quickly.


nit: this might read as handleSendPostMessageWithRetry is for internal tests.

I renamed it to sendPostRequestWithRetry so it does not sound like test-specific.

kcalvinalvin · 2026-05-04T17:10:04Z

+
+// TestHandleSendPostMessageShutdownDuringRetryBackoff ensures shutdown
+// cancellation interrupts retry backoff and remaps to ErrClientShutdown.
+func TestHandleSendPostMessageShutdownDuringRetryBackoff(t *testing.T) {


These three tests:

TestHandleSendPostMessageShutdownDuringRetryBackoff
TestHandleSendPostMessageShutdownOnFinalRetry
TestHandleSendPostMessageShutdownDuringBodyRead

iiuc, these are all just trying to assert that the err returns as ErrClientShutdown? They exist to test like this one branch and is ridiculously verbose. imo these need to be cleaned up into one table driven test for future maintainability.

I replaced them with a single table-driven test that reuses the same shutdown scenarios as the helper-level test.

these are all just trying to assert that the err returns as ErrClientShutdown?

Yes, every shutdown-triggered path must receive a ErrClientShutdown on responseChan. The tests check this.

kcalvinalvin · 2026-05-04T17:29:26Z

+// TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff ensures
+// that handleSendPostMessageWithRetry returns context cancellation from the
+// retry-backoff path.
+func TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff(


The tests TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff, TestHandleSendPostMessageWithRetryShutdownOnFinalRetry, TestHandleSendPostMessageWithRetryShutdownOnBodyRead should also really be a single table driven test.

I collapsed those three helper-level shutdown tests into one table-driven test and kept the success case separate:

retry-backoff cancellation

final-retry cancellation

cancellation during body read.

I also tightened the final-retry case so it depends on request-context propagation rather than a transport stub returning context.Canceled directly, which keeps it a stronger regression test.

kcalvinalvin · 2026-05-04T17:36:35Z

The actual code looks good but the tests... they're rough.

Like it looks like completely claude generated in a single go. I left some reviews but even the other tests don't look great.

starius · 2026-05-05T04:26:47Z

The actual code looks good but the tests... they're rough.

Like it looks like completely claude generated in a single go. I left some reviews but even the other tests don't look great.

I cleaned up the shutdown-path tests to reduce copy-paste and converted them into table tests. I also did other small changes, described in new commits.

I kept the changes in fixup commits which can be applied automatically by git rebase --autosquash. I'm keeping them separate for now to make it clear what changed.

Use a shutdown-aware context for HTTP POST handling so shutdown can interrupt in-flight requests. Centralize shutdown error remapping in sendPostRequestAndRespond so all error exits consistently return ErrClientShutdown when shutdown causes a context cancellation. Move the retrying HTTP POST path into sendPostRequestWithRetry and cover it with shutdown regression tests.

When shutdown races with sendPostRequest, a request could be marked as ErrClientShutdown and still be enqueued. The sendPostHandler cleanup loop would then try to send a second terminal response and could block forever on a full response channel. Fix this by prioritizing the shutdown path. First check shutdown with a non-blocking select and return immediately when it is already closed. Then use a second select to choose between enqueue and shutdown for the remaining race window. A regression test verifies a shutdown request is failed immediately and never enqueued.

Batch requests were only clearing batchList on Send() errors. The per-request futures remained unresolved, so callers waiting on Receive could block forever after a failed batch round trip. Add failBatchRequests to fan out the Send() error to every queued batch request and clear tracking state in one place. A regression test now verifies queued futures complete with the same error returned by Send().

NewBatch called New() and then called start() again. In HTTP POST mode that created a second sendPostHandler and another shutdown-cancel goroutine, which broke the expected single-flight serialization of POST sends. Keep NewBatch as a semantic toggle only: rely on New() to start handlers once, then set batch=true. A regression test now checks that batch POST requests stay serialized through one active transport call.

starius · 2026-05-06T21:46:49Z

Rebased and formatted code (pre-existing not formatted code).

CI is failing because of pre-existing issues in other packages:

blockchain/TestMaybeAcceptBlockReusesHeaderNode fails with coinbase height enforcement.
blockchain/TestProcessBlockHeader fails because it still builds regtest headers with Version: 1.
netsync/TestStartSyncChainCurrent, TestStartSyncBlockFallback, and the TestSyncStateMachine cases fail for the same reason, and the expected 20, actual 0 assertions are downstream from those headers being rejected.

Why this is happening:

Regtest now enforces BIP34/BIP65/BIP66 from height 1 and rejects old block versions.
But these older tests still construct invalid regtest blocks/headers:
- blockchain/process_test.go:39 uses Version: 1
- netsync/manager_test.go:1190 uses Version: 1
- blockchain/accept_test.go:29 goes through addBlocks(...), which now trips the coinbase-height rule in this setup

Local reproduction:

go test ./blockchain -run 'TestMaybeAcceptBlockReusesHeaderNode|TestProcessBlockHeader|TestRegtestBlockVersions|TestRegtestRejectsCoinbaseMissingHeight' -count=1
go test ./netsync -run 'TestStartSyncChainCurrent|TestStartSyncBlockFallback|TestSyncStateMachine|TestIsSyncCandidateRegtest' -count=1

PR btcsuite#2467 changed regtest to match Core's BIP34/65/66 activation rules, and the merged stack carries that (commit cd4e542 "regtest: align activations with Bitcoin Core"). Height-1 regtest blocks now need a BIP34-compliant coinbase height and a post-BIP66 block version. The failing tests came from commits added after PR btcsuite#2467 was opened on December 25, 2025 but before it merged on April 30, 2026: - c1a4612 ("blockchain: add ProcessBlockHeader") - f9645f0 ("blockchain: reuse existing header node in maybeAcceptBlock") - dc6e096 ("netsync: add TestSyncStateMachine for end-to-end IBD sync flow") - ce09426 ("netsync: add TestStartSyncBlockFallback for block-only sync path") - 2aae8a6 ("netsync: add TestStartSyncChainCurrent for chain-current noop path") Because those tests landed later, they kept the old regtest assumptions even though btcsuite#2467 had already been authored and tested against the older tree. Once btcsuite#2467 finally merged, these newer tests started building invalid regtest blocks and headers. Fix them by setting the genesis tip height to 0 before generating descendants, using Version 4 in the regtest block/header helpers, and encoding the test coinbase height with a minimal BIP34 push plus padding for the generic coinbase script-length rule.

starius · 2026-05-06T22:18:07Z

Fixed the existing tests failures in a separate commit.

Roasbeef

LGTM 🐴

saubyk assigned starius Mar 23, 2026

seeforschauer approved these changes Mar 23, 2026

View reviewed changes

saubyk added this to the v0.25.1 milestone Mar 23, 2026

Roasbeef requested changes Mar 24, 2026

View reviewed changes

starius force-pushed the rpcclient-fixes branch from 264b120 to 5da6fa2 Compare March 24, 2026 17:53

starius requested review from Roasbeef and seeforschauer March 24, 2026 17:57

starius mentioned this pull request Mar 24, 2026

rpcclient: make shutdown interrupt in-flight POSTs #2451

Closed

7 tasks

starius force-pushed the rpcclient-fixes branch 2 times, most recently from 1acd9a8 to 54be134 Compare March 25, 2026 03:48

seeforschauer approved these changes Apr 10, 2026

View reviewed changes

saubyk requested a review from kcalvinalvin April 14, 2026 00:17

kcalvinalvin reviewed May 4, 2026

View reviewed changes

starius requested a review from kcalvinalvin May 5, 2026 04:27

wydengyre and others added 5 commits May 6, 2026 16:29

make fmt

51944b2

starius force-pushed the rpcclient-fixes branch from 837a18c to 51944b2 Compare May 6, 2026 21:33

allocz mentioned this pull request May 11, 2026

blockchain+netsync: fix broken unit tests #2528

Closed

Roasbeef approved these changes May 11, 2026

View reviewed changes

Roasbeef merged commit edaf284 into btcsuite:master May 11, 2026
3 checks passed

starius deleted the rpcclient-fixes branch May 14, 2026 04:08

Conversation

starius commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Description

Steps to Test

Pull Request Checklist

Testing

Code Style and Documentation

Uh oh!

saubyk commented Mar 23, 2026

Uh oh!

seeforschauer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Roasbeef left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

starius commented Mar 25, 2026

Uh oh!

guibressan commented Apr 7, 2026

Uh oh!

seeforschauer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kcalvinalvin commented May 4, 2026

Uh oh!

starius commented May 5, 2026

Uh oh!

starius commented May 6, 2026

Uh oh!

starius commented May 6, 2026

Uh oh!

Roasbeef left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

starius commented Mar 20, 2026 •

edited

Loading