Refactor big mutex to reduce contention#917
Conversation
f6131ee to
a1a3c07
Compare
a1a3c07 to
170914e
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
167cbdf to
1c670de
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
1c670de to
e159e34
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
e159e34 to
d58e405
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
51155df to
b9b304a
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
In my testing now all tests passing and cannot get any deadlocks! Here is the result of proxying N streams of iperf traffic. "Sharing" is using 1 H2 connnection with N CONNECT streams over it, while "no sharing" is N h2 connections with 1 CONNECT stream each
overall great results! |
|
The benchmark script for h2 runs a lot slower than the master branch. Just in case this info helps, I'm running Windows 11 with an AMD 9950X (16 cores, 32 threads).
|
|
Yea, I was looking at that benchmark myself at first. But I realized it's not representative of a real load. It sends into the connection as hard as possible without yielding, so the task is imbalanced. When I add in a yield every N requests, the numbers improve. Not as great a master still. But, if real world testing shows it to be better everywhere else, I'm fine with just killing that benchmark. |
|
Hello! We benchmarked this PR against our production stack (axum servers + reqwest clients, both on hyper 1.x) and found some regressions for smaller request/responses, some hangs, but improvements for large body streaming. Setup: ~40 ms per-request stall on the server at low per-connection stream concurrency. Measured with
Every request takes almost exactly ~40 ms, which smells like the Linux delayed-ACK timer: it looks like the connection task loses a wakeup and only makes progress when the peer's TCP stack eventually sends something. High multiplexing (m≥64) masks it (new requests keep the connection task busy), and large bodies mask it (window updates provide wakeups) this would explain why the iperf-over-CONNECT results above look great while request-oriented benchmarks do not. The same ~41 ms value shows up as rare max-latency outliers on master, so this race may pre-exist and the PR just made it deterministic. We also found two hangs by cross-pairing h2 versions between client and server:
I think this suggests there is considerable improvement provided the issues with smaller requests are addressed
I think our tests might suggest something different. h2load -m 1 is fully synchronized request/response with no send pressure at all, and it's the most affected shape in our testing. The workers in our test await each response before sending the next request. The pattern of being worst at low concurrency, fine at high multiplexing or with large flowing bodies points at a missing wakeup rather than task imbalance. |
92e3789 to
c970de5
Compare
|
@seanaye that was extremely useful, thank you! I'm excited at the significant performance when streaming more data, and also your description of what was happening when things got slower was very helpful. It was difficult to build a unit test triggering exactly what you described, since it was hard to find exactly where a wakeup might have been missed. I did eventually get one test that seemed to miss a wakeup. The fix was quite simple and if it was related, possibly caught the other things too: moved registering the connection task waker to the top of the poll method, instead of at the bottom "if nothing else happened". Since, another stream handle or something might have enqueued work between the connection task draining the queue, and deciding to register, and then it wouldn't be woken from that. Does the latest commit fix up your run? |
|
@seanmonstar I will check tomorrow to see if the fix is working. I can also push up the test harness to a branch in our repo so you can take a look if that would be helpful. |
|
I re-ran the tests on the new commits, this helps but it doesn't fully resolve the problem. I think the race might depend on CPU idle state. Its possible to get really poor results on the exact same test if you run the test cold (from cpu idle) vs warm. Results for the same test
This reproduces 100% reliably in both directions, and probably explains why it's hard to reproduce on your end: CI runners and machines running test suites keep their cores warm, so the connection task's thread wins the race. You can try this out for yourself here. Full disclosure the test harness was written with AI |
This refactor puts the big internal shared mutex on a diet. The goal is to reduce contention, providing performance improvements when
h2is used on multi-threaded runtimes.Warning
This is a large internal refactor, and it hasn't yet been sufficiently tested in a production deployment. While I tried to keep the behavior exactly the same, it's possible there's subtle changes or bugs.
If you do test this, let me know. I hope to have this tested thoroughly before possibly landing.
Architectural changes
There is now unique ownership of much of the connection state, owned solely by the "connection task".
Most conceptual things were split between an owned handle and shared values such as
counts::Countsandcounts::Shared. TheStoreis now uniquely owned by the connection, and theStreams stored in it contain anArc<stream::Shared>for the parts that need to be shared with a "stream handle".Stream handle operations now do very little. For receiving, it locks the shared connection
recvbuffer, pops a frame, and unlocks.Likewise, when sending data, (or any other status-updating operation), it briefly locks a connection
pending_streamsqueue, inserts itself, and unlocks. All actual changes are written on the stream'spending_opsfield. The connection task includes a loop to briefly lock the queue, pop a stream, unlock, and then process the streams pending ops.So, while this does introduce more locks, they are finer-grained, and should only be held very briefly. Longer "work" is no longer done while holding any lock.
This should result in tasks needing to wait less time for another task that previously was holding the world-lock.
cc #531