Refactor big mutex to reduce contention by seanmonstar · Pull Request #917 · hyperium/h2

seanmonstar · 2026-06-16T15:41:37Z

This refactor puts the big internal shared mutex on a diet. The goal is to reduce contention, providing performance improvements when h2 is used on multi-threaded runtimes.

Warning

This is a large internal refactor, and it hasn't yet been sufficiently tested in a production deployment. While I tried to keep the behavior exactly the same, it's possible there are subtle changes or bugs.

If you do test this, let me know. I hope to have this tested thoroughly before possibly landing.

Architectural changes

There is now unique ownership of much of the connection state, owned solely by the "connection task".

Most conceptual things were split between an owned handle and shared values such as counts::Counts and counts::Shared. The Store is now uniquely owned by the connection, and the Streams stored in it contain an Arc<stream::Shared> for the parts that need to be shared with a "stream handle".
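A minimal sketch of that ownership split, not the actual h2 code. Only the names `Store` and `Shared` come from the description above; `Stream`, `StreamHandle`, `recv_closed`, `send_window`, and `open` are assumptions made up for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for `stream::Shared`: only the state a stream
// handle needs to see lives behind the Arc.
#[derive(Default)]
struct Shared {
    recv_closed: Mutex<bool>,
}

// Per-stream state kept in the store. Everything except `shared` is private
// to the connection task and needs no lock.
struct Stream {
    send_window: i32,
    shared: Arc<Shared>,
}

// Uniquely owned by the connection task, so `&mut` access is lock-free.
#[derive(Default)]
struct Store {
    streams: HashMap<u32, Stream>,
}

// What user code would hold: a stream id plus the shared part.
struct StreamHandle {
    id: u32,
    shared: Arc<Shared>,
}

impl Store {
    fn open(&mut self, id: u32) -> StreamHandle {
        let shared = Arc::new(Shared::default());
        let stream = Stream { send_window: 65_535, shared: Arc::clone(&shared) };
        self.streams.insert(id, stream);
        StreamHandle { id, shared }
    }
}

fn main() {
    let mut store = Store::default();
    let handle = store.open(1);

    // The connection task mutates its own state without taking any lock.
    store.streams.get_mut(&1).unwrap().send_window -= 1_000;

    // Only the shared part is synchronized, and only briefly.
    *store.streams[&1].shared.recv_closed.lock().unwrap() = true;
    assert!(*handle.shared.recv_closed.lock().unwrap());
    println!("stream {} window {}", handle.id, store.streams[&1].send_window);
}
```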

Stream handle operations now do very little. To receive, a handle locks the shared connection recv buffer, pops a frame, and unlocks.

Likewise, when sending data (or performing any other status-updating operation), a handle briefly locks the connection's pending_streams queue, inserts itself, and unlocks. All actual changes are written to the stream's pending_ops field. The connection task includes a loop that briefly locks the queue, pops a stream, unlocks, and then processes the stream's pending ops.

So, while this does introduce more locks, they are finer-grained, and should only be held very briefly. Longer "work" is no longer done while holding any lock.

This should result in tasks needing to wait less time for another task that previously was holding the world-lock.
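The queue pattern described above can be sketched roughly like this. It is an illustration under assumptions, not the code in this PR: `Op`, `StreamShared`, `PendingStreams`, `submit`, and `process_pending` are hypothetical names, and only `pending_streams` and `pending_ops` echo the description.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical operation recorded by a stream handle.
#[derive(Debug, PartialEq)]
enum Op {
    SendData(Vec<u8>),
    Reset,
}

// Per-stream state shared between a handle and the connection task.
struct StreamShared {
    id: u32,
    pending_ops: Mutex<Vec<Op>>,
}

// Connection-wide queue of streams that have pending work.
#[derive(Default)]
struct PendingStreams {
    queue: Mutex<VecDeque<Arc<StreamShared>>>,
}

impl PendingStreams {
    // Handle side: record the op on the stream, then enqueue the stream.
    // Each lock is released at the end of its statement.
    fn submit(&self, stream: &Arc<StreamShared>, op: Op) {
        stream.pending_ops.lock().unwrap().push(op);
        self.queue.lock().unwrap().push_back(Arc::clone(stream));
    }

    // Connection side: lock, pop, unlock.
    fn pop(&self) -> Option<Arc<StreamShared>> {
        self.queue.lock().unwrap().pop_front()
    }
}

// Connection task loop: the queue lock is never held while ops are processed.
fn process_pending(pending: &PendingStreams) -> Vec<(u32, Op)> {
    let mut processed = Vec::new();
    while let Some(stream) = pending.pop() {
        // Take all ops in one short critical section. A stream that was
        // enqueued twice simply yields an empty list the second time.
        let ops = std::mem::take(&mut *stream.pending_ops.lock().unwrap());
        for op in ops {
            processed.push((stream.id, op)); // the longer "work" would go here
        }
    }
    processed
}

fn main() {
    let pending = PendingStreams::default();
    let s1 = Arc::new(StreamShared { id: 1, pending_ops: Mutex::new(Vec::new()) });
    pending.submit(&s1, Op::SendData(b"hello".to_vec()));
    pending.submit(&s1, Op::Reset);
    let done = process_pending(&pending);
    assert_eq!(done, vec![(1, Op::SendData(b"hello".to_vec())), (1, Op::Reset)]);
}
```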

cc #531

howardjohn · 2026-06-29T17:37:15Z

In my testing, all tests now pass and I cannot trigger any deadlocks!

Here are the results of proxying N streams of iperf traffic. "Sharing" uses 1 H2 connection with N CONNECT streams over it, while "no sharing" uses N H2 connections with 1 CONNECT stream each.

Test	1 stream	2 streams	4 streams	16 streams
No proxy	157	283	400	122
With patch	30	40	39	31
Without patch	26	30	30	27
With patch, no sharing	32	48	50	36
Without patch, no sharing	24	36	44	31

Overall, great results!

0x676e67 · 2026-06-29T23:15:34Z

The h2 benchmark script runs a lot slower on this branch than on master. In case it helps: I'm running Windows 11 on an AMD 9950X (16 cores, 32 threads).

master

win@Z10 E:\....\h2  master $ cargo bench

test result: ok. 0 passed; 0 failed; 425 ignored; 0 measured; 0 filtered out; finished in 0.05s

     Running benches\main.rs (F:\Cargo\release\deps\main-b2bece125f3c4420.exe)
H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 341ms.
Fastest: 273ms
Slowest: 315ms
Avg    : 278ms
H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 346ms.
Fastest: 262ms
Slowest: 323ms
Avg    : 282ms

sean/bye-bye-big-mutex

win@Z10 E:\....\h2  sean/bye-bye-big-mutex $ cargo bench


test result: ok. 0 passed; 0 failed; 425 ignored; 0 measured; 0 filtered out; finished in 0.05s

     Running benches\main.rs (F:\Cargo\release\deps\main-b2bece125f3c4420.exe)
H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 658ms.
Fastest: 504ms
Slowest: 626ms
Avg    : 549ms
H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 591ms.
Fastest: 399ms
Slowest: 562ms
Avg    : 476ms

seanmonstar · 2026-06-29T23:43:06Z

Yea, I was looking at that benchmark myself at first. But I realized it's not representative of a real load. It sends into the connection as hard as possible without yielding, so the task is imbalanced. When I add in a yield every N requests, the numbers improve, though they're still not as good as master.

But, if real world testing shows it to be better everywhere else, I'm fine with just killing that benchmark.

seanaye · 2026-07-02T16:29:32Z

Hello!

We benchmarked this PR against our production stack (axum servers + reqwest clients, both on hyper 1.x) and found some regressions for smaller requests/responses and some hangs, but improvements for large body streaming.

Setup: [patch.crates-io] pinning head 92e3789 vs merge-base 21211d0, release builds, Intel i7-14700F (28 threads), Linux 7.1.1 nixos 26.11.

~40 ms per-request stall on the server at low per-connection stream concurrency. Measured with h2load (nghttp2, so client side is unaffected) against a plain axum server serving 1 KiB responses:

Shape	base req/s	PR req/s
1 conn, m=1	25,590	24
1 conn, m=8	2,305	221
1 conn, m=64	85,619	57,815
16 conns, m=16	452,819	7,095
4 conns, m=64	341,794	363,428 (+6%)
1 conn, m=64, 64 KiB	33,748	45,931 (+36%)
1 conn, m=8, 1 MiB	3,595	4,143 (+15%)

Every request takes almost exactly 40 ms, which smells like the Linux delayed-ACK timer: it looks like the connection task loses a wakeup and only makes progress when the peer's TCP stack eventually sends something. High multiplexing (m≥64) masks it (new requests keep the connection task busy), and large bodies mask it (window updates provide wakeups). This would explain why the iperf-over-CONNECT results above look great while request-oriented benchmarks do not.

The same ~41 ms value shows up as rare max-latency outliers on master, so this race may pre-exist and the PR just made it deterministic.
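As a back-of-the-envelope check on the 40 ms reading, the sketch below assumes every in-flight stream completes exactly one request per stall, which is a simplification rather than a measured model. The helper name `stalled_req_per_sec` is made up for this example. Under that assumption the three worst rows come out at 25, 200, and 6,400 req/s, close to the observed 24, 221, and 7,095.

```rust
// Predicted requests/second if each request on each stream waits out one
// full stall before the next one can start (an assumption, see above).
fn stalled_req_per_sec(conns: u32, streams_per_conn: u32, stall_ms: f64) -> f64 {
    f64::from(conns * streams_per_conn) * 1000.0 / stall_ms
}

fn main() {
    // (shape, connections, m, observed PR req/s taken from the table above)
    let rows = [
        ("1 conn, m=1", 1, 1, 24.0),
        ("1 conn, m=8", 1, 8, 221.0),
        ("16 conns, m=16", 16, 16, 7095.0),
    ];
    for (shape, conns, m, observed) in rows {
        let predicted = stalled_req_per_sec(conns, m, 40.0);
        println!("{shape}: predicted {predicted:.0} req/s, observed {observed:.0} req/s");
    }
}
```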

We also found two hangs by cross-pairing h2 versions between client and server:

PR client (reqwest/hyper) hangs at ≥8 concurrent streams on one connection, even against a healthy master server (serial works fine).
PR server hangs receiving request bodies (POST echo, 8 KiB bodies) even from a healthy master client.

I think this suggests there is considerable improvement to be had, provided the issues with smaller requests are addressed.

> Yea, I was looking at that benchmark myself at first. But I realized it's not representative of a real load. It sends into the connection as hard as possible without yielding, so the task is imbalanced. When I add in a yield every N requests, the numbers improve, though they're still not as good as master.
>
> But, if real world testing shows it to be better everywhere else, I'm fine with just killing that benchmark.

I think our tests might suggest something different. h2load -m 1 is fully synchronized request/response with no send pressure at all, and it's the most affected shape in our testing. The workers in our test await each response before sending the next request. The pattern of being worst at low concurrency, fine at high multiplexing or with large flowing bodies points at a missing wakeup rather than task imbalance.

seanmonstar · 2026-07-02T21:05:23Z

@seanaye that was extremely useful, thank you! I'm excited about the significant performance gains when streaming more data, and your description of what was happening when things got slower was also very helpful.

It was difficult to build a unit test triggering exactly what you described, since it was hard to find exactly where a wakeup might have been missed. I did eventually get one test that seemed to miss a wakeup.

The fix was quite simple and, if it was related, possibly caught the other issues too: I moved registering the connection task waker to the top of the poll method, instead of at the bottom "if nothing else happened". Otherwise, another stream handle (or something else) might enqueue work between the connection task draining the queue and deciding to register its waker, and the task would never be woken for that work.

Does the latest commit fix up your run?
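The ordering change described above can be sketched with the standard library alone. This is a simplified illustration, not the code in the commit: `Shared`, `conn_waker`, `enqueue`, `poll_connection`, and `CountingWaker` are hypothetical names. The point is that once the waker is stored before the drain, any work enqueued during or after the drain finds a waker to fire.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical state shared between stream handles and the connection task.
#[derive(Default)]
struct Shared {
    queue: Mutex<VecDeque<u32>>,
    conn_waker: Mutex<Option<Waker>>,
}

impl Shared {
    // Handle side: enqueue work, then wake the connection task if a waker is
    // registered. The waker is taken out first so no lock is held while waking.
    fn enqueue(&self, item: u32) {
        self.queue.lock().unwrap().push_back(item);
        let waker = self.conn_waker.lock().unwrap().take();
        if let Some(waker) = waker {
            waker.wake();
        }
    }
}

// Connection side: register first, then drain. With the opposite order, an
// `enqueue` landing between the last empty pop and the registration would
// find no waker, and the task would sleep with work still queued.
fn poll_connection(shared: &Shared, cx: &mut Context<'_>, processed: &mut Vec<u32>) -> Poll<()> {
    *shared.conn_waker.lock().unwrap() = Some(cx.waker().clone());
    loop {
        // The queue lock is released before the item is handled.
        let next = shared.queue.lock().unwrap().pop_front();
        match next {
            Some(item) => processed.push(item),
            None => return Poll::Pending,
        }
    }
}

// Test waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let shared = Shared::default();
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(Arc::clone(&counter));
    let mut cx = Context::from_waker(&waker);
    let mut processed = Vec::new();

    shared.enqueue(1); // no waker registered yet, so nothing is woken
    assert!(poll_connection(&shared, &mut cx, &mut processed).is_pending());

    shared.enqueue(2); // arrives after the drain, but the waker is in place
    assert_eq!(counter.0.load(Ordering::SeqCst), 1);

    assert!(poll_connection(&shared, &mut cx, &mut processed).is_pending());
    assert_eq!(processed, vec![1, 2]);
}
```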

seanaye · 2026-07-02T21:56:38Z

@seanmonstar I will check tomorrow to see if the fix is working. I can also push up the test harness to a branch in our repo so you can take a look if that would be helpful.

seanaye · 2026-07-03T15:18:49Z

I re-ran the tests on the new commits. This helps, but it doesn't fully resolve the problem. I think the race might depend on CPU idle state. It's possible to get really poor results on the exact same test depending on whether you run it cold (from CPU idle) or warm.

Results for the same test, h2load -n 300 -c 1 -m 1:

CPU state	req/s
warm (immediately after 8 s of all-core spin loop)	5,897
cold (after 45 s of idle)	24

This reproduces 100% reliably in both directions, and probably explains why it's hard to reproduce on your end: CI runners and machines running test suites keep their cores warm, so the connection task's thread wins the race.

You can try this out for yourself here. Full disclosure: the test harness was written with AI.
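For anyone who wants to reproduce the "warm" state without the harness, an all-core spin loop could look roughly like the sketch below. This is an assumption about what such a warm-up does, not the harness itself, and `warm_all_cores` is a made-up name. The table above used 8 s of spinning; the example uses a shorter duration so it finishes quickly.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

// Spin one busy thread per available core for `dur` and return the number of
// threads used. `black_box` keeps the loop from being optimized away.
fn warm_all_cores(dur: Duration) -> usize {
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let handles: Vec<_> = (0..n)
        .map(|_| {
            thread::spawn(move || {
                let start = Instant::now();
                let mut x = 0u64;
                while start.elapsed() < dur {
                    x = black_box(x.wrapping_add(1));
                }
                x
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    n
}

fn main() {
    // Use Duration::from_secs(8) to match the warm case in the table above.
    let threads = warm_all_cores(Duration::from_millis(200));
    println!("spun {threads} threads; start the benchmark immediately afterwards");
}
```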

seanmonstar force-pushed the sean/bye-bye-big-mutex branch 3 times, most recently from f6131ee to a1a3c07 · June 16, 2026 16:25

seanmonstar mentioned this pull request Jun 16, 2026: Library does not scale with multiple cores #531 (open)

seanmonstar force-pushed the sean/bye-bye-big-mutex branch from a1a3c07 to 170914e · June 16, 2026 16:48