
Refactor big mutex to reduce contention #917

Draft
seanmonstar wants to merge 2 commits into master from sean/bye-bye-big-mutex

@seanmonstar

Member

This refactor puts the big internal shared mutex on a diet. The goal is to reduce contention, providing performance improvements when h2 is used on multi-threaded runtimes.

Warning

This is a large internal refactor, and it hasn't yet been sufficiently tested in a production deployment. While I tried to keep the behavior exactly the same, it's possible there are subtle changes or bugs.

If you do test this, let me know. I hope to have this tested thoroughly before possibly landing.

Architectural changes

There is now unique ownership of much of the connection state, owned solely by the "connection task".

Most conceptual things were split between an owned handle and shared values such as counts::Counts and counts::Shared. The Store is now uniquely owned by the connection, and the Streams stored in it contain an Arc<stream::Shared> for the parts that need to be shared with a "stream handle".
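The ownership split described above can be pictured with a minimal sketch. Only the names `Store`, `stream::Shared`, and `pending_ops` come from the description; every field, the `StreamHandle` type, and the `insert` method are hypothetical stand-ins, not the PR's actual definitions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `stream::Shared`: only the state that a
/// stream handle running on another task needs to reach.
#[derive(Default)]
struct Shared {
    // Operations requested by the handle, applied later by the connection.
    pending_ops: Mutex<Vec<&'static str>>,
}

/// Per-stream state. It lives in the store, so no lock guards `state`.
struct Stream {
    state: &'static str,
    shared: Arc<Shared>,
}

/// Uniquely owned by the connection task: a plain map, not `Arc<Mutex<..>>`.
#[derive(Default)]
struct Store {
    streams: HashMap<u32, Stream>,
}

/// What user code holds: only the shared half of one stream.
struct StreamHandle {
    shared: Arc<Shared>,
}

impl Store {
    /// Creates a stream and returns a handle that shares its `Shared` part.
    fn insert(&mut self, id: u32) -> StreamHandle {
        let shared = Arc::new(Shared::default());
        let stream = Stream { state: "open", shared: Arc::clone(&shared) };
        self.streams.insert(id, stream);
        StreamHandle { shared }
    }
}
```

In this shape the connection task can mutate `Stream::state` through `&mut Store` without taking any lock, and only the small `Shared` part is reachable from other tasks.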

Stream handle operations now do very little. To receive, a handle locks the shared connection recv buffer, pops a frame, and unlocks.
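As a rough illustration of that receive path, here is a sketch that assumes a single `Mutex<VecDeque<..>>`. The `Frame` and `RecvBuffer` types are hypothetical, and the sketch ignores how frames are keyed per stream in the real buffer.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical frame type; real frames carry headers, data, and so on.
#[derive(Debug, PartialEq)]
struct Frame(Vec<u8>);

/// Hypothetical shared receive buffer, filled by the connection task.
#[derive(Default)]
struct RecvBuffer {
    frames: Mutex<VecDeque<Frame>>,
}

impl RecvBuffer {
    /// Connection side: queue a frame that was read off the socket.
    fn push(&self, frame: Frame) {
        self.frames.lock().unwrap().push_back(frame);
    }

    /// Handle side: lock, pop one frame, unlock.
    /// The guard is dropped at the end of the `let` statement.
    fn pop(&self) -> Option<Frame> {
        let frame = self.frames.lock().unwrap().pop_front();
        frame
    }
}
```

The lock is held only for the `pop_front` call, so whatever the caller does with the frame afterwards happens with the lock released.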

Likewise, when sending data (or performing any other status-updating operation), it briefly locks a connection pending_streams queue, inserts itself, and unlocks. All actual changes are written on the stream's pending_ops field. The connection task includes a loop to briefly lock the queue, pop a stream, unlock, and then process the stream's pending ops.

So, while this does introduce more locks, they are finer-grained, and should only be held very briefly. Longer "work" is no longer done while holding any lock.

This should result in tasks needing to wait less time for another task that previously was holding the world-lock.
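The send-side flow above can be sketched as follows. The `pending_ops` field name comes from the description; the `Op`, `StreamShared`, and `PendingStreams` types and the `request` and `drain` functions are assumptions made for illustration, not the PR's code.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical operations a stream handle can request.
#[derive(Debug, PartialEq)]
enum Op {
    SendData(Vec<u8>),
    Reset,
}

/// Hypothetical shared half of a stream.
#[derive(Default)]
struct StreamShared {
    pending_ops: Mutex<Vec<Op>>,
}

/// Hypothetical connection-wide queue of streams that have pending ops.
#[derive(Default)]
struct PendingStreams {
    queue: Mutex<VecDeque<Arc<StreamShared>>>,
}

/// Handle side: record the op on the stream, then briefly lock the queue
/// to enqueue the stream. No protocol work happens under either lock.
fn request(pending: &PendingStreams, stream: &Arc<StreamShared>, op: Op) {
    stream.pending_ops.lock().unwrap().push(op);
    pending.queue.lock().unwrap().push_back(Arc::clone(stream));
}

/// Connection side: pop one stream per lock acquisition, release the lock,
/// then process that stream's ops. Returns the ops in processing order.
fn drain(pending: &PendingStreams) -> Vec<Op> {
    let mut processed = Vec::new();
    loop {
        // The queue guard is dropped at the end of this statement.
        let next = pending.queue.lock().unwrap().pop_front();
        let Some(stream) = next else { break };
        // Take the ops out so the per-stream lock is also held only briefly.
        let ops = std::mem::take(&mut *stream.pending_ops.lock().unwrap());
        processed.extend(ops); // stand-in for the longer protocol work
    }
    processed
}
```

A stream that is enqueued twice before the connection task runs is simply popped twice; the second pop finds its ops already taken and does nothing.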

cc #531

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch 3 times, most recently from f6131ee to a1a3c07 Compare June 16, 2026 16:25
@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch from a1a3c07 to 170914e Compare June 16, 2026 16:48
@howardjohn

This comment was marked as resolved.

@seanmonstar

This comment was marked as resolved.

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch 2 times, most recently from 167cbdf to 1c670de Compare June 16, 2026 19:33
@seanmonstar

This comment was marked as resolved.

@howardjohn

This comment was marked as resolved.

@howardjohn

This comment was marked as resolved.

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch from 1c670de to e159e34 Compare June 22, 2026 18:21
@seanmonstar

This comment was marked as resolved.

@howardjohn

This comment was marked as resolved.

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch from e159e34 to d58e405 Compare June 22, 2026 20:54
@seanmonstar

This comment was marked as resolved.

@howardjohn

This comment was marked as resolved.

@seanmonstar

This comment was marked as resolved.

@howardjohn

This comment was marked as resolved.

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch 3 times, most recently from 51155df to b9b304a Compare June 23, 2026 19:56
@seanmonstar

This comment was marked as resolved.

@seanmonstar

This comment was marked as resolved.

@seanmonstar

This comment was marked as resolved.

@howardjohn

Contributor

In my testing, all tests now pass and I cannot reproduce any deadlocks!

Here is the result of proxying N streams of iperf traffic. "Sharing" uses 1 H2 connection with N CONNECT streams over it, while "no sharing" uses N H2 connections with 1 CONNECT stream each.

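| Test | 1 stream | 2 streams | 4 streams | 16 streams |
| --- | --- | --- | --- | --- |
| No proxy | 157 | 283 | 400 | 122 |
| With patch | 30 | 40 | 39 | 31 |
| Without patch | 26 | 30 | 30 | 27 |
| With patch, no sharing | 32 | 48 | 50 | 36 |
| Without patch, no sharing | 24 | 36 | 44 | 31 |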

Overall, great results!

@0x676e67

0x676e67 commented Jun 29, 2026

Contributor

The benchmark script for h2 runs a lot slower on this branch than on master. In case this info helps, I'm running Windows 11 with an AMD 9950X (16 cores, 32 threads).

  1. master
win@Z10 E:\....\h2  master $ cargo bench

test result: ok. 0 passed; 0 failed; 425 ignored; 0 measured; 0 filtered out; finished in 0.05s

     Running benches\main.rs (F:\Cargo\release\deps\main-b2bece125f3c4420.exe)
H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 341ms.
Fastest: 273ms
Slowest: 315ms
Avg    : 278ms
H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 346ms.
Fastest: 262ms
Slowest: 323ms
Avg    : 282ms
  2. sean/bye-bye-big-mutex
win@Z10 E:\....\h2  sean/bye-bye-big-mutex $ cargo bench


test result: ok. 0 passed; 0 failed; 425 ignored; 0 measured; 0 filtered out; finished in 0.05s

     Running benches\main.rs (F:\Cargo\release\deps\main-b2bece125f3c4420.exe)
H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 658ms.
Fastest: 504ms
Slowest: 626ms
Avg    : 549ms
H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 591ms.
Fastest: 399ms
Slowest: 562ms
Avg    : 476ms

@seanmonstar

Member Author

Yea, I was looking at that benchmark myself at first. But I realized it's not representative of a real load. It sends into the connection as hard as possible without yielding, so the task is imbalanced. When I add in a yield every N requests, the numbers improve. Still not as good as master, though.

But, if real world testing shows it to be better everywhere else, I'm fine with just killing that benchmark.

@seanaye

seanaye commented Jul 2, 2026


Hello!

We benchmarked this PR against our production stack (axum servers + reqwest clients, both on hyper 1.x) and found some regressions for smaller request/responses, some hangs, but improvements for large body streaming.

Setup: [patch.crates-io] pinning head 92e3789 vs merge-base 21211d0, release builds, Intel i7-14700F (28 threads), Linux 7.1.1 nixos 26.11.

~40 ms per-request stall on the server at low per-connection stream concurrency. Measured with h2load (nghttp2, so client side is unaffected) against a plain axum server serving 1 KiB responses:

| Shape | base req/s | PR req/s |
| --- | --- | --- |
| 1 conn, m=1 | 25,590 | 24 |
| 1 conn, m=8 | 2,305 | 221 |
| 1 conn, m=64 | 85,619 | 57,815 |
| 16 conns, m=16 | 452,819 | 7,095 |
| 4 conns, m=64 | 341,794 | 363,428 (+6%) |
| 1 conn, m=64, 64 KiB | 33,748 | 45,931 (+36%) |
| 1 conn, m=8, 1 MiB | 3,595 | 4,143 (+15%) |

Every request takes almost exactly 40 ms, which smells like the Linux delayed-ACK timer: it looks like the connection task loses a wakeup and only makes progress when the peer's TCP stack eventually sends something. High multiplexing (m≥64) masks it (new requests keep the connection task busy), and large bodies mask it (window updates provide wakeups). This would explain why the iperf-over-CONNECT results above look great while request-oriented benchmarks do not.
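As a sanity check on the 40 ms reading, the affected rows can be compared against a rough estimate. The sketch below assumes every request waits out exactly one 40 ms stall and that all in-flight requests stall in parallel; the function name is made up for illustration, and real runs will deviate from this simple model.

```rust
/// Rough throughput estimate if each request waits out one stall of
/// `stall_ms`, with `in_flight` requests stalling in parallel.
fn stalled_req_per_sec(in_flight: u32, stall_ms: f64) -> f64 {
    f64::from(in_flight) * 1000.0 / stall_ms
}
```

Under those assumptions, `stalled_req_per_sec(1, 40.0)` gives 25 req/s against the measured 24, `stalled_req_per_sec(8, 40.0)` gives 200 against 221, and 16 connections at m=16 (256 in flight) gives 6,400 against 7,095. The estimates land in the same range as the measurements, which is consistent with a fixed per-request stall.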

The same ~41 ms value shows up as rare max-latency outliers on master, so this race may pre-exist and the PR just made it deterministic.

We also found two hangs by cross-pairing h2 versions between client and server:

  • PR client (reqwest/hyper) hangs at ≥8 concurrent streams on one connection, even against a healthy master server (serial works fine).
  • PR server hangs receiving request bodies (POST echo, 8 KiB bodies) even from a healthy master client.

I think this suggests there is considerable improvement, provided the issues with smaller requests are addressed.

> Yea, I was looking at that benchmark myself at first. But I realized it's not representative of a real load. It sends into the connection as hard as possible without yielding, so the task is imbalanced. When I add in a yield every N requests, the numbers improve. Still not as good as master, though.
>
> But, if real world testing shows it to be better everywhere else, I'm fine with just killing that benchmark.

I think our tests might suggest something different. h2load -m 1 is fully synchronized request/response with no send pressure at all, and it's the most affected shape in our testing. The workers in our test await each response before sending the next request. The pattern of being worst at low concurrency, fine at high multiplexing or with large flowing bodies points at a missing wakeup rather than task imbalance.

@seanmonstar seanmonstar force-pushed the sean/bye-bye-big-mutex branch from 92e3789 to c970de5 Compare July 2, 2026 21:01
@seanmonstar

Member Author

@seanaye that was extremely useful, thank you! I'm excited about the significant performance gains when streaming more data, and your description of what was happening when things got slower was also very helpful.

It was difficult to build a unit test triggering exactly what you described, since it was hard to find exactly where a wakeup might have been missed. I did eventually get one test that seemed to miss a wakeup.

The fix was quite simple and, if it was related, possibly caught the other things too: I moved registering the connection task waker to the top of the poll method, instead of at the bottom "if nothing else happened". Otherwise, another stream handle or something might enqueue work between the connection task draining the queue and deciding to register, and the task wouldn't be woken for that work.
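The ordering issue can be reproduced in miniature. The sketch below is not the PR's code: `Shared`, `CountingWaker`, and `poll_once` are hypothetical, and the "race" is simulated deterministically by placing the enqueue right after the drain.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Counts wakeups so the example can observe whether one was delivered.
#[derive(Default)]
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical shared connection state: a work queue plus a waker slot.
#[derive(Default)]
struct Shared {
    queue: Mutex<VecDeque<u32>>,
    waker: Mutex<Option<Waker>>,
}

impl Shared {
    /// Handle side: enqueue work, then wake whichever waker is registered.
    fn enqueue(&self, item: u32) {
        self.queue.lock().unwrap().push_back(item);
        let waker = self.waker.lock().unwrap().take();
        if let Some(w) = waker {
            w.wake();
        }
    }

    /// Connection side: store the task's waker for later wakeups.
    fn register(&self, waker: &Waker) {
        *self.waker.lock().unwrap() = Some(waker.clone());
    }

    /// Connection side: remove all queued work, returning how much there was.
    fn drain(&self) -> usize {
        let n = self.queue.lock().unwrap().drain(..).count();
        n
    }
}

/// One poll, with a handle's enqueue landing right after the drain.
/// Returns how many wakeups the connection task received.
fn poll_once(register_first: bool) -> usize {
    let shared = Shared::default();
    let counter = Arc::new(CountingWaker::default());
    let waker = Waker::from(Arc::clone(&counter));

    if register_first {
        shared.register(&waker); // fixed order: register at the top
    }
    shared.drain();
    shared.enqueue(7); // lands between the drain and the late register
    if !register_first {
        shared.register(&waker); // old order: register at the bottom
    }
    counter.0.load(Ordering::SeqCst)
}
```

With `register_first` set, the enqueue finds a waker and the task is woken. Without it, the enqueue finds an empty slot, and the task then registers and goes to sleep with work still queued.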

Does the latest commit fix up your run?

@seanaye

seanaye commented Jul 2, 2026


@seanmonstar I will check tomorrow to see if the fix is working. I can also push up the test harness to a branch in our repo so you can take a look if that would be helpful.

@seanaye

seanaye commented Jul 3, 2026


I re-ran the tests on the new commits. This helps, but it doesn't fully resolve the problem. I think the race might depend on CPU idle state. It's possible to get really poor results on the exact same test if you run it cold (from CPU idle) vs warm.

Results for the same test, h2load -n 300 -c 1 -m 1:

| CPU state | req/s |
| --- | --- |
| warm (immediately after 8 s of all-core spin loop) | 5,897 |
| cold (after 45 s of idle) | 24 |

This reproduces 100% reliably in both directions, and probably explains why it's hard to reproduce on your end: CI runners and machines running test suites keep their cores warm, so the connection task's thread wins the race.
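For anyone who wants to approximate the "warm" condition locally, an all-core spin could look like the sketch below. This is not the harness referenced in this comment; `spin_all_cores` is a hypothetical helper built only on the standard library.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Busy-spins one thread per available core until `duration` has passed.
/// Returns the number of threads that were spun up.
fn spin_all_cores(duration: Duration) -> usize {
    // Fall back to a single thread if the core count cannot be determined.
    let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let deadline = Instant::now() + duration;
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                while Instant::now() < deadline {
                    std::hint::spin_loop();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("spin thread panicked");
    }
    threads
}
```

Calling `spin_all_cores(Duration::from_secs(8))` immediately before the h2load run would mimic the warm case, while sleeping for 45 seconds first would mimic the cold one.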

You can try this out for yourself here. Full disclosure: the test harness was written with AI.

4 participants