Skip to content

rpcclient: fix several synchronization bugs#2500

Merged
Roasbeef merged 6 commits into
btcsuite:masterfrom
starius:rpcclient-fixes
May 11, 2026
Merged

rpcclient: fix several synchronization bugs#2500
Roasbeef merged 6 commits into
btcsuite:masterfrom
starius:rpcclient-fixes

Conversation

@starius

@starius starius commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Change Description

Incorporated #2451 :

Modify the rpcclient http POST call to ensure that a shutdown immediately interrupts in-flight requests, which otherwise would have to wait until timeout.

Fix 3 other problems in rpcclient:

  • HTTP POST shutdown can deadlock WaitForShutdown due to double-response race
  • Batch-mode Send() error path leaves queued per-request futures unresolved
  • NewBatch starts HTTP POST handlers twice, increasing concurrency surface

Steps to Test

go test ./rpcclient -count=1

Each test is a regression test. It fails if the patch is reverted.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

@saubyk

saubyk commented Mar 23, 2026

Copy link
Copy Markdown
Collaborator

hello @wydengyre @seeforschauer @jcvernaleo would you consider reviewing this one?

@seeforschauer seeforschauer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — all three bugs are real and the fixes are correct. I've been working in rpcclient/infrastructure.go recently (#2506, #2505), so I have direct context on these paths.

Commit structure is clean (one bug, one test, one commit). Each test fails when the fix is reverted — solid regression coverage.

One suggestion on sendPostRequest (inline) for tighter shutdown determinism using the priority-select pattern already in addRequest. Two minor notes on failBatchRequests.

Comment thread rpcclient/infrastructure.go Outdated
Comment on lines 931 to 942
// Atomically either queue the request or fail it due to shutdown.
//
// This avoids delivering two terminal responses to the same request, which
// can otherwise block shutdown cleanup on the second send.
select {
case <-c.shutdown:
jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
default:
}
return

select {
case c.sendPostChan <- jReq:
log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

case <-c.shutdown:
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): Consider the priority-select pattern here.

When shutdown is already closed (its permanent state after Shutdown()), the single select randomly picks between responding and enqueueing (~50/50 per Go spec). If sendPostHandler has already exited cleanup, the enqueued request's future is never resolved.

A non-blocking shutdown guard first makes the common post-shutdown path deterministic — addRequest (line 213) already uses this exact pattern for the same reason. The remaining race (shutdown closing between the two selects) has a much narrower window.

Suggested change
// Atomically either queue the request or fail it due to shutdown.
//
// This avoids delivering two terminal responses to the same request, which
// can otherwise block shutdown cleanup on the second send.
select {
case <-c.shutdown:
jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
default:
}
return
select {
case c.sendPostChan <- jReq:
log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
case <-c.shutdown:
return
}
// Prefer shutdown: if already closed, fail the request immediately.
// This avoids a random race between shutdown and enqueue when both
// channels are ready, consistent with the guard in addRequest.
select {
case <-c.shutdown:
jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
return
default:
}
// Normal path: enqueue or fail on shutdown. Exactly one outcome.
select {
case c.sendPostChan <- jReq:
log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
case <-c.shutdown:
jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively how the code already was.

@starius isn't it better to prioritize the shutdown path?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the structure is two selects like the original — the critical difference is the return after the first shutdown case. The original fell through into the second select even after sending ErrClientShutdown, which allowed the double-resolve.

With the return, it becomes the standard priority-select: if shutdown is already closed, respond deterministically and exit. The second select is only reached when shutdown wasn't closed at check time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Now it prioritizes the shutdown path:

// sendPostRequest sends the passed HTTP request to the RPC server using the
// HTTP client associated with the client.  It is backed by a buffered channel,
// so it will not block until the send channel is full.
func (c *Client) sendPostRequest(jReq *jsonRequest) {
	// Prefer shutdown when it is already closed so this path is
	// deterministic. This mirrors addRequest and avoids post-shutdown
	// enqueueing.
	select {
	case <-c.shutdown:
		jReq.responseChan <- &Response{
			result: nil,
			err:    ErrClientShutdown,
		}

		return

	default:
	}

	// Normal path: either enqueue, or fail if shutdown closes in the race
	// window after the guard above.
	select {
	case c.sendPostChan <- jReq:
		log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

	case <-c.shutdown:
		jReq.responseChan <- &Response{
			result: nil,
			err:    ErrClientShutdown,
		}
	}
}


// Resolve all pending futures on the first batch-level failure so
// callers waiting on Receive don't block indefinitely.
req.responseChan <- &Response{err: err}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This send is safe because in batch mode sendRequest only calls addRequest (never sendPostRequest), so individual responseChan buffers (size 1) are guaranteed unwritten at this point. A brief comment documenting this invariant would help future readers — e.g.:

// Safe: batch-mode responseChan buffers are unwritten here,
// so this send won't block while locks are held.
req.responseChan <- &Response{err: err}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment:

		// Resolve all pending futures on the first batch-level failure
		// so callers waiting on Receive don't block indefinitely.
		// Safe: batch-mode responseChan buffers are unwritten here,
		// so this send won't block while locks are held. Batch-mode
		// requests only use addRequest (not sendPostRequest), so each
		// responseChan buffer is still empty.
		req.responseChan <- &Response{err: err}

Comment thread rpcclient/infrastructure.go Outdated
}

c.requestMap = make(map[uint64]*list.Element)
c.requestList.Init()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: In batch mode, addRequest pushes to batchList, never requestList, so this should already be empty. Intentional defensive reset, or leftover? If intentional, a quick comment would clarify.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a defensive reset. Added a comment:

	// Batch-mode requests are tracked in batchList, so requestList should
	// already be empty. Keep this defensive reset for invariants and future
	// call paths.
	c.requestList.Init()

@saubyk saubyk added this to the v0.25.1 milestone Mar 23, 2026

@Roasbeef Roasbeef left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good, only comment is why we'd move away from the pattern that prioritizes a cancel path.

Comment thread rpcclient/infrastructure.go Outdated
Comment on lines 931 to 942
// Atomically either queue the request or fail it due to shutdown.
//
// This avoids delivering two terminal responses to the same request, which
// can otherwise block shutdown cleanup on the second send.
select {
case <-c.shutdown:
jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
default:
}
return

select {
case c.sendPostChan <- jReq:
log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

case <-c.shutdown:
return
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively how the code already was.

@starius isn't it better to prioritize the shutdown path?

@starius starius force-pushed the rpcclient-fixes branch 2 times, most recently from 1acd9a8 to 54be134 Compare March 25, 2026 03:48
@starius

starius commented Mar 25, 2026

Copy link
Copy Markdown
Contributor Author

I addressed the remaining comments in #2451 and added the updated version of it here as commit "rpcclient: support canceling in-flight http requests". Functionally it is the same, but refactored. I simplified the code by making a function that returns ([]byte, error); error remapping and channel sending done is done on the call site.

CC @Roasbeef @seeforschauer @wydengyre

@guibressan

Copy link
Copy Markdown

I read the diff and ran the tests, LGTM. Nice work @starius.

@seeforschauer seeforschauer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the clean fixes!

@saubyk saubyk requested a review from kcalvinalvin April 14, 2026 00:17
Comment thread rpcclient/infrastructure.go Outdated
}

// handleSendPostMessageWithRetry runs handleSendPostMessage using the provided
// retry count. It exists so tests can exercise retry edge cases quickly.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this might read as handleSendPostMessageWithRetry is for internal tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to sendPostRequestWithRetry so it does not sound like test-specific.

Comment thread rpcclient/infrastructure_test.go Outdated

// TestHandleSendPostMessageShutdownDuringRetryBackoff ensures shutdown
// cancellation interrupts retry backoff and remaps to ErrClientShutdown.
func TestHandleSendPostMessageShutdownDuringRetryBackoff(t *testing.T) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three tests:

TestHandleSendPostMessageShutdownDuringRetryBackoff
TestHandleSendPostMessageShutdownOnFinalRetry
TestHandleSendPostMessageShutdownDuringBodyRead

iiuc, these are all just trying to assert that the err returns as ErrClientShutdown? They exist to test like this one branch and is ridiculously verbose. imo these need to be cleaned up into one table driven test for future maintainability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced them with a single table-driven test that reuses the same shutdown scenarios as the helper-level test.

these are all just trying to assert that the err returns as ErrClientShutdown?

Yes, every shutdown-triggered path must receive a ErrClientShutdown on responseChan. The tests check this.

Comment thread rpcclient/infrastructure_test.go Outdated
// TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff ensures
// that handleSendPostMessageWithRetry returns context cancellation from the
// retry-backoff path.
func TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests TestHandleSendPostMessageWithRetryShutdownDuringRetryBackoff, TestHandleSendPostMessageWithRetryShutdownOnFinalRetry, TestHandleSendPostMessageWithRetryShutdownOnBodyRead should also really be a single table driven test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I collapsed those three helper-level shutdown tests into one table-driven test and kept the success case separate:

  • retry-backoff cancellation
  • final-retry cancellation
  • cancellation during body read.

I also tightened the final-retry case so it depends on request-context propagation rather than a transport stub returning context.Canceled directly, which keeps it a stronger regression test.

@kcalvinalvin

Copy link
Copy Markdown
Collaborator

The actual code looks good but the tests... they're rough.

Like it looks like completely claude generated in a single go. I left some reviews but even the other tests don't look great.

@starius

starius commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

The actual code looks good but the tests... they're rough.

Like it looks like completely claude generated in a single go. I left some reviews but even the other tests don't look great.

I cleaned up the shutdown-path tests to reduce copy-paste and converted them into table tests. I also did other small changes, described in new commits.

I kept the changes in fixup commits which can be applied automatically by git rebase --autosquash. I'm keeping them separate for now to make it clear what changed.

@starius starius requested a review from kcalvinalvin May 5, 2026 04:27
wydengyre and others added 5 commits May 6, 2026 16:29
Use a shutdown-aware context for HTTP POST handling so shutdown can
interrupt in-flight requests.

Centralize shutdown error remapping in sendPostRequestAndRespond so all
error exits consistently return ErrClientShutdown when shutdown causes a
context cancellation. Move the retrying HTTP POST path into
sendPostRequestWithRetry and cover it with shutdown regression tests.
When shutdown races with sendPostRequest, a request could be marked
as ErrClientShutdown and still be enqueued. The sendPostHandler cleanup
loop would then try to send a second terminal response and could block
forever on a full response channel.

Fix this by prioritizing the shutdown path. First check shutdown with a
non-blocking select and return immediately when it is already closed.
Then use a second select to choose between enqueue and shutdown for the
remaining race window.

A regression test verifies a shutdown request is failed immediately and
never enqueued.
Batch requests were only clearing batchList on Send() errors. The
per-request futures remained unresolved, so callers waiting on Receive
could block forever after a failed batch round trip.

Add failBatchRequests to fan out the Send() error to every queued batch
request and clear tracking state in one place. A regression test now
verifies queued futures complete with the same error returned by Send().
NewBatch called New() and then called start() again. In HTTP POST mode that
created a second sendPostHandler and another shutdown-cancel goroutine, which
broke the expected single-flight serialization of POST sends.

Keep NewBatch as a semantic toggle only: rely on New() to start handlers
once, then set batch=true. A regression test now checks that batch POST
requests stay serialized through one active transport call.
@starius starius force-pushed the rpcclient-fixes branch from 837a18c to 51944b2 Compare May 6, 2026 21:33
@starius

starius commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Rebased and formatted code (pre-existing not formatted code).

CI is failing because of pre-existing issues in other packages:
  • blockchain/TestMaybeAcceptBlockReusesHeaderNode fails with coinbase height enforcement.
  • blockchain/TestProcessBlockHeader fails because it still builds regtest headers with Version: 1.
  • netsync/TestStartSyncChainCurrent, TestStartSyncBlockFallback, and the TestSyncStateMachine cases fail for the same reason, and the expected 20, actual 0 assertions are downstream from those headers being rejected.

Why this is happening:

  • Regtest now enforces BIP34/BIP65/BIP66 from height 1 and rejects old block versions.
  • But these older tests still construct invalid regtest blocks/headers:
    • blockchain/process_test.go:39 uses Version: 1
    • netsync/manager_test.go:1190 uses Version: 1
    • blockchain/accept_test.go:29 goes through addBlocks(...), which now trips the coinbase-height rule in this setup

Local reproduction:

  • go test ./blockchain -run 'TestMaybeAcceptBlockReusesHeaderNode|TestProcessBlockHeader|TestRegtestBlockVersions|TestRegtestRejectsCoinbaseMissingHeight' -count=1
  • go test ./netsync -run 'TestStartSyncChainCurrent|TestStartSyncBlockFallback|TestSyncStateMachine|TestIsSyncCandidateRegtest' -count=1

PR btcsuite#2467 changed regtest to match Core's BIP34/65/66 activation rules, and
the merged stack carries that (commit cd4e542 "regtest: align activations
with Bitcoin Core"). Height-1 regtest blocks now need a BIP34-compliant
coinbase height and a post-BIP66 block version.

The failing tests came from commits added after PR btcsuite#2467 was opened on
December 25, 2025 but before it merged on April 30, 2026:

- c1a4612 ("blockchain: add ProcessBlockHeader")
- f9645f0 ("blockchain: reuse existing header node in maybeAcceptBlock")
- dc6e096 ("netsync: add TestSyncStateMachine for end-to-end IBD sync flow")
- ce09426 ("netsync: add TestStartSyncBlockFallback for block-only sync path")
- 2aae8a6 ("netsync: add TestStartSyncChainCurrent for chain-current noop path")

Because those tests landed later, they kept the old regtest assumptions even
though btcsuite#2467 had already been authored and tested against the older tree.
Once btcsuite#2467 finally merged, these newer tests started building invalid regtest
blocks and headers.

Fix them by setting the genesis tip height to 0 before generating descendants,
using Version 4 in the regtest block/header helpers, and encoding the test
coinbase height with a minimal BIP34 push plus padding for the generic coinbase
script-length rule.
@starius

starius commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Fixed the existing tests failures in a separate commit.

@Roasbeef Roasbeef left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐴

@Roasbeef Roasbeef merged commit edaf284 into btcsuite:master May 11, 2026
3 checks passed
@starius starius deleted the rpcclient-fixes branch May 14, 2026 04:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants