fix(webrtx): ICE issue + substream shutdown procedure by gab8i · Pull Request #586 · paritytech/litep2p

gab8i · 2026-05-18T11:01:30Z

This PR contains a big rework that addresses the comments left by @lexnv on #574.

Essentially:

Address ICE issue during negotiation
Add tracing/debug
Move the FIN/FIN_ACK half-close handshake from Substream::poll_shutdown into SubstreamHandle::poll_next (half_close). Now there are two clear halves that can close separately.
Introduce a dedicated State::Reset variant (previously folded into SendClosed). Also send RESET_STREAM to the peer when the FIN_ACK timeout expires, instead of silently transitioning to FinAcked.
Clean up stale SubstreamHandle entries from SubstreamHandleSet when a channel is closed or an outbound write fails, so exhausted handles don't sit forever in the round-robin set.
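
To make the half-close flow above easier to follow, here is a minimal sketch of the write-side state machine. It is not the code from this PR: the `HalfClose` struct, the `Outgoing` enum and the `start` / `on_fin_ack` / `on_tick` methods are hypothetical names chosen for illustration, and time is passed in explicitly instead of using a timer. Only the state names and the 10-second timeout come from the discussion.

```rust
use std::time::{Duration, Instant};

/// Write-side states. The names mirror the ones mentioned in this PR,
/// but the transitions below are a simplified illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Open,
    SendClosed,
    FinSent,
    FinAcked,
    Reset,
}

/// Flags the handle may need to put on the wire (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outgoing {
    Fin,
    ResetStream,
}

pub const FIN_ACK_TIMEOUT: Duration = Duration::from_secs(10);

pub struct HalfClose {
    pub state: State,
    pub fin_sent_at: Option<Instant>,
}

impl HalfClose {
    pub fn new() -> Self {
        Self { state: State::Open, fin_sent_at: None }
    }

    /// Called when the substream is shut down or dropped: send FIN once.
    pub fn start(&mut self, now: Instant) -> Option<Outgoing> {
        if self.state == State::Open {
            self.state = State::FinSent;
            self.fin_sent_at = Some(now);
            Some(Outgoing::Fin)
        } else {
            None
        }
    }

    /// Called when the peer acknowledges our FIN.
    pub fn on_fin_ack(&mut self) {
        if self.state == State::FinSent {
            self.state = State::FinAcked;
        }
    }

    /// Called periodically: if the FIN_ACK did not arrive in time, move to
    /// `Reset` and tell the peer, instead of pretending the FIN was acked.
    pub fn on_tick(&mut self, now: Instant) -> Option<Outgoing> {
        match (self.state, self.fin_sent_at) {
            (State::FinSent, Some(sent)) if now.duration_since(sent) >= FIN_ACK_TIMEOUT => {
                self.state = State::Reset;
                Some(Outgoing::ResetStream)
            }
            _ => None,
        }
    }

    /// Condition under which a pending shutdown can resolve.
    pub fn shutdown_complete(&self) -> bool {
        matches!(self.state, State::FinAcked | State::Reset)
    }
}
```

In this sketch a shutdown resolves either through `on_fin_ack` or through the timeout branch of `on_tick`, so a waiting caller is never left pending forever.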

lexnv · 2026-05-18T19:16:32Z

+                .protocol_set
+                .report_substream_open_failure(
+                    context.protocol,
+                    context.substream_id,


Is this a different substream ID from the one at pending_outbound?

What is stored within pending_outbound differs from what is stored within the channels field. A channel is initially opened and ends up in pending_outbound with a substream_id. Once the WebRTC data channel is created, it is moved into the channels field, where the state OutboundOpening implies that the multistream-select protocol still needs to happen. These are the same channel in two different phases.

Couldn't we end up calling report_substream_open_failure twice here, from the higher-level protocol's perspective?

Why so?

on_open_substream -> inserts pending_outbound item
on_channel_opened -> remove item from pending_outbound and inserts it as ChannelState::OutboundOpening within the channels field

report_substream_open_failure is only called while negotiating the multistream-select protocol and when a channel is closed. In neither case does it seem to be called twice for the same substream ID.
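
The two phases described in this thread can be sketched as follows. This is a simplified model, not the litep2p code: the `Connection` struct, the integer IDs and the `reported_failures` vector (standing in for `report_substream_open_failure`) are assumptions made for illustration. The point is that an entry lives in exactly one of the two maps at a time, so a failure is reported at most once per substream ID.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum ChannelState {
    /// Data channel exists, multistream-select still has to run.
    OutboundOpening { substream_id: u64 },
    /// Negotiation finished.
    Open,
}

#[derive(Default)]
pub struct Connection {
    /// Phase 1: substream requested, data channel not yet created.
    pub pending_outbound: HashMap<u16, u64>,
    /// Phase 2: data channel created.
    pub channels: HashMap<u16, ChannelState>,
    /// Stand-in for calls to `report_substream_open_failure`.
    pub reported_failures: Vec<u64>,
}

impl Connection {
    pub fn on_open_substream(&mut self, channel_id: u16, substream_id: u64) {
        self.pending_outbound.insert(channel_id, substream_id);
    }

    /// Moves the entry from `pending_outbound` into `channels`.
    pub fn on_channel_opened(&mut self, channel_id: u16) {
        if let Some(substream_id) = self.pending_outbound.remove(&channel_id) {
            self.channels
                .insert(channel_id, ChannelState::OutboundOpening { substream_id });
        }
    }

    /// Reports a failure only if the channel was still negotiating. Removing
    /// the entry first means a second close event cannot report again.
    pub fn on_channel_closed(&mut self, channel_id: u16) {
        if let Some(ChannelState::OutboundOpening { substream_id }) =
            self.channels.remove(&channel_id)
        {
            self.reported_failures.push(substream_id);
        }
    }
}
```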

lexnv · 2026-05-18T19:16:58Z

-/// Matches go-libp2p's 5 second stream close timeout.
-const FIN_ACK_TIMEOUT: Duration = Duration::from_secs(5);
+/// Matches go-libp2p and js-libp2p's 10-second stream close timeout.
+const FIN_ACK_TIMEOUT: Duration = Duration::from_secs(10);


Nice! Thanks for aligning this 🙏

lexnv · 2026-05-19T12:46:47Z

@@ -206,7 +240,7 @@ impl SubstreamHandle {
                        flag: Some(Flag::FinAck),
                    }) {
                        tracing::warn!(


IIUC, we cannot send the FinAck to the peer because of backpressure.
Then we return Ok(()) here and ignore this error, and this forces the remote peer to wait a full 10s.

Could we instead reserve a permit on the channel to always guarantee we can send out a FinAck?
Similarly, could this fail because the outbound tx has been dropped? In that case, should we tackle the graceful shutdown at higher levels?

The only reason this could fail is backpressure: the Substream AsyncWriter could have already written many messages, leaving no space for the FIN_ACK to be sent.

Could we instead reserve a permit on the channel to always guarantee we can send out a FinAck?

Yes, I think this should be done. Reading the code and the comment more carefully, I realized that it doesn't align with the spec: waiting 10s does not result in a graceful shutdown but instead forces the peer to send a ResetStream flag. So the way to go is to remove this outbound_tx and simply add a flag that preempts anything else that needs to be sent over the channel.
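
A rough sketch of the "flag that preempts" idea, under the assumption that the handle owns a bounded queue of outgoing data. The `Outbound` struct and its methods are hypothetical and not taken from the PR. The FIN_ACK is tracked as a boolean rather than as a queue entry, so it can never be rejected by a full queue.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum Frame {
    Data(Vec<u8>),
    FinAck,
}

pub struct Outbound {
    queue: VecDeque<Vec<u8>>,
    capacity: usize,
    /// Set when a FIN was received; not subject to the queue capacity.
    fin_ack_pending: bool,
}

impl Outbound {
    pub fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity, fin_ack_pending: false }
    }

    /// Regular data is subject to backpressure and is handed back when full.
    pub fn try_push(&mut self, data: Vec<u8>) -> Result<(), Vec<u8>> {
        if self.queue.len() >= self.capacity {
            return Err(data);
        }
        self.queue.push_back(data);
        Ok(())
    }

    /// Always succeeds, regardless of how full the data queue is.
    pub fn request_fin_ack(&mut self) {
        self.fin_ack_pending = true;
    }

    /// The pending FIN_ACK is emitted before any queued data.
    pub fn next_frame(&mut self) -> Option<Frame> {
        if self.fin_ack_pending {
            self.fin_ack_pending = false;
            return Some(Frame::FinAck);
        }
        self.queue.pop_front().map(Frame::Data)
    }
}
```

Compared with reserving a channel permit, this keeps the acknowledgement out of the data path entirely, which is one way to guarantee that the remote does not have to wait for its timeout.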

lexnv · 2026-05-19T12:51:05Z


                    self.rtc.direct_api().close_data_channel(channel_id);
                    self.channels.insert(channel_id, ChannelState::Closing);
+                    self.handles.remove(&channel_id);


Do we need to advance state: Arc<Mutex<State>> into Reset here?

Otherwise the poll_shutdown would always return Poll::Pending:

if matches!(*self.state.lock(), State::FinAcked | State::Reset) { Poll::Ready(Ok(())) } else { Poll::Pending }

And this could stall further substream.close().await calls indefinitely

Maybe the easiest fix would be to set the state inside SubstreamHandle::drop

Oh yes, to be fair, I honestly thought that poll_shutdown would automatically stop being called once the Substream was dropped, but now that I think about it, the other way around is more plausible: the Substream doesn't get dropped until poll_shutdown completes.

As you said, there could be an additional flag, something like ForceStop which just implies that something higher in the stack is forcing this substream to close.
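
A minimal sketch of the "set the state inside SubstreamHandle::drop" suggestion. It uses `std::sync::Mutex` as a stand-in for the mutex used in the crate, omits the waker that a real implementation would also need to wake, and the `shutdown_ready` helper is a hypothetical name for the check quoted above.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Open,
    FinSent,
    FinAcked,
    Reset,
}

pub struct SubstreamHandle {
    pub state: Arc<Mutex<State>>,
}

impl Drop for SubstreamHandle {
    fn drop(&mut self) {
        // When the connection removes the handle, nobody will ever deliver a
        // FIN_ACK, so force a terminal state. A poisoned lock is ignored to
        // avoid panicking inside `drop`.
        if let Ok(mut state) = self.state.lock() {
            if *state != State::FinAcked {
                *state = State::Reset;
            }
        }
        // A real implementation would also wake the shutdown waker here.
    }
}

/// Mirrors the readiness check in `poll_shutdown` from the review comment.
pub fn shutdown_ready(state: &Arc<Mutex<State>>) -> bool {
    match state.lock() {
        Ok(guard) => matches!(*guard, State::FinAcked | State::Reset),
        Err(_) => true,
    }
}
```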

lexnv · 2026-05-19T12:59:12Z

-        }
+        // Let str0m handle input validation internally, similar to how the initial STUN packet is
+        // handled
+        self.rtc.handle_input(message).map_err(|error| {


IIUC, the rtc accepts method was too restrictive, including for ICE. Now, we rely on str0m to handle input validation internally.
This also means we are redirecting every STUN packet into the connection. Does str0m handle ICE credential validation under the hood?

accepts should be used to demultiplex multiple Rtc instances, while we are already doing so by tracking the source. Initially the rtc gets created, and then yes, handle_input manages the ICE + DTLS + Noise + SCTP handshakes
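
As an illustration of "demultiplexing by tracking the source", here is a small sketch that routes datagrams by their source address. The `Demux` type is hypothetical, and a `Vec<Vec<u8>>` stands in for a per-connection `Rtc` instance. It does not use the str0m API.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

pub struct Demux {
    /// One entry per remote address. In the real transport the value would be
    /// a connection owning its own `Rtc` instance.
    pub connections: HashMap<SocketAddr, Vec<Vec<u8>>>,
}

impl Demux {
    pub fn new() -> Self {
        Self { connections: HashMap::new() }
    }

    /// Routes a datagram to the connection for `source`, creating it on first
    /// contact. Returns `true` when a new connection was created.
    pub fn route(&mut self, source: SocketAddr, datagram: Vec<u8>) -> bool {
        let is_new = !self.connections.contains_key(&source);
        self.connections.entry(source).or_default().push(datagram);
        is_new
    }
}
```

Because each connection only ever sees packets from its own source, validation of the packet contents can be left to whatever consumes them, which is the role `handle_input` plays in the discussion above.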

lexnv · 2026-05-19T13:05:01Z

@@ -171,6 +199,11 @@ impl SubstreamHandle {
        // This ensures that if a FIN message contains data, we deliver it before closing.


The remote peer might send us a FIN frame containing data.
However, in the meantime we have dropped the substream.
In this case, the self.inbound_tx.send would return an error and we'd never handle the Fin code path below, again leaving the remote to wait 10s until they reset the substream on timeout

Why have we dropped the substream in the meantime?

In a request-response protocol, there can be a case where a peer issues a request and attaches the FIN flag to it. In that case, the message is delivered to inbound_tx first, followed by Event::RecvClosed triggered by the FIN flag, the order is preserved. As noted in the previous comment (#586 (comment)), a follow-up refactor will send the FIN_ACK immediately, the substream remains alive until the shutdown procedure terminates, either by receiving a FIN_ACK or by exceeding the timeout. What I describe here is the most eager case from a peer: attaching a FIN together with the request. If this is well covered also everything else should be (?)
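
The ordering described here, plus the reviewer's concern about a dropped receiver, can be sketched like this. The `Frame` struct, `on_frame` and `new_inbound` are hypothetical, and `std::sync::mpsc` stands in for the async channel used in the crate. The payload is forwarded first, then `RecvClosed`, and the FIN is still processed when the receiving side is gone.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    Message(Vec<u8>),
    RecvClosed,
}

pub struct Frame {
    pub payload: Option<Vec<u8>>,
    pub fin: bool,
}

pub fn new_inbound() -> (Sender<Event>, Receiver<Event>) {
    channel()
}

/// Handles one incoming frame. Returns `true` when a FIN_ACK must be sent.
pub fn on_frame(inbound_tx: &Sender<Event>, frame: Frame) -> bool {
    if let Some(payload) = frame.payload {
        // A send error only means the substream was dropped. It must not
        // short-circuit the flag handling below.
        let _ = inbound_tx.send(Event::Message(payload));
    }
    if frame.fin {
        let _ = inbound_tx.send(Event::RecvClosed);
        return true;
    }
    false
}
```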

lexnv · 2026-05-19T13:14:45Z

                Flag::StopSending => {
-                    *self.state.lock() = State::SendClosed;
+                    let mut current_state = self.state.lock();
+                    if !matches!(*current_state, State::FinSent | State::FinAcked) {


This changes state::Reset into State::SendClosed:

if we enter this path after we've encountered an error (ie half close sets reset) we are nuking the state back to SendClosed

Then any poll_shutdown is going to wait indefinitely again

T0: we've encountered an error, our state is Reset
T1: on_message moves back our state to SendClosed
T2: poll_shutdown never returns Ok() and instead blocks the whole execution:

if matches!(*self.state.lock(), State::FinAcked | State::Reset) { Poll::Ready(Ok(())) } else { Poll::Pending }

If I'm not mistaken, reading your comment, I would say that you pointed out a nice bug, which should be solved by adding State::Reset to the matches, right?
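
The proposed fix amounts to treating `Reset` as one more state that `StopSending` must not overwrite. A small sketch, written as a pure function for clarity (the function name `on_stop_sending` is hypothetical, and the real code mutates the shared state in place):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Open,
    SendClosed,
    FinSent,
    FinAcked,
    Reset,
}

/// Computes the next state after receiving `Flag::StopSending`.
/// States that are part of, or past, the shutdown handshake are preserved,
/// so a terminal `Reset` is never downgraded back to `SendClosed`.
pub fn on_stop_sending(current: State) -> State {
    if matches!(current, State::FinSent | State::FinAcked | State::Reset) {
        current
    } else {
        State::SendClosed
    }
}
```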

lexnv · 2026-05-19T13:35:35Z

@@ -240,23 +278,97 @@ impl SubstreamHandle {
                    // (matching go-libp2p behavior)
                    // Close the read side
                    let _ = self.inbound_tx.try_send(Event::RecvClosed);


In this code path couldn't we actually send the RecvClosed twice?

For example, in the path above, we ensure that we send RecvClosed only once if we received Fin:

if self.read_closed.swap(true, Ordering::SeqCst) {

However, here we send unconditionally the RecvClosed regardless if we sent it above or vice-versa

T0: We receive Flag::ResetStream and send Event::RecvClosed
T1: A delayed Fin arrives if self.read_closed.swap(true, Ordering::SeqCst) was previously false. We queue another RecvClosed and also send a FinAck to a stream that was reset at T0.

Logically yes, practically no, but it remains a problem.

I didn't notice this before: the reset_stream_sent flag stops the SubstreamHandle Stream, preventing FIN_ACK from being sent. It's still a problem though, because even if inbound_tx should be dropped at some point, the on_message function could send another RecvClosed on the channel.

To solve this, either the state or the reset_stream_sent flag could be checked at the beginning of on_message.
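
One way to read that suggestion, as a sketch: check a reset flag first, and gate `RecvClosed` behind the same `read_closed.swap` for every flag. The `Handle` struct, the `Action` enum and the `on_flag` method are hypothetical. Only the `read_closed` swap pattern comes from the quoted code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Flag {
    Fin,
    ResetStream,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    RecvClosed,
    FinAck,
}

#[derive(Default)]
pub struct Handle {
    pub read_closed: AtomicBool,
    pub reset: AtomicBool,
}

impl Handle {
    /// Returns the actions to perform for an incoming flag.
    pub fn on_flag(&self, flag: Flag) -> Vec<Action> {
        // Nothing is processed once the stream has been reset.
        if self.reset.load(Ordering::SeqCst) {
            return Vec::new();
        }
        // `swap` returns the previous value, so this is true only once.
        let first_close = !self.read_closed.swap(true, Ordering::SeqCst);
        match flag {
            Flag::ResetStream => {
                self.reset.store(true, Ordering::SeqCst);
                if first_close { vec![Action::RecvClosed] } else { Vec::new() }
            }
            Flag::Fin => {
                if first_close {
                    vec![Action::RecvClosed, Action::FinAck]
                } else {
                    Vec::new()
                }
            }
        }
    }
}
```

With this shape, the T0/T1 sequence from the review (reset first, delayed FIN second) yields a single `RecvClosed` and no FIN_ACK.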

lexnv · 2026-05-19T14:05:19Z

Since the ufrag / pass are provided by the remote peer, it could be possible to craft a payload such that we panic here

I don't see how a maliciously formed ufrag / pass specifically can cause panics within make_rtc_client, but it does contain unwraps and panics that should not be there. It needs to be changed to handle these cases 'gracefully'.

lexnv · 2026-05-19T14:07:19Z

If the STUN packet here is malformed, or there is an ICE / fingerprint mismatch, we are also panicking

This expect doesn't make sense here. I'm sorry I didn't notice it before, it's entirely possible to receive a malformed message, which needs to be handled gracefully.
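
As a generic illustration of handling remote-controlled input without `unwrap` / `expect`, here is a sketch that splits a STUN USERNAME of the form `<ufrag>:<ufrag>` and returns an error instead of panicking. The `HandshakeError` type and `split_username` function are hypothetical and are not the code from `make_rtc_client`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HandshakeError {
    MissingSeparator,
    EmptyUfrag,
}

/// Splits a username of the form "<first ufrag>:<second ufrag>".
/// Every malformed input maps to an error the caller can log and drop,
/// rather than a panic triggered by data the remote peer controls.
pub fn split_username(username: &str) -> Result<(&str, &str), HandshakeError> {
    let (first, second) = username
        .split_once(':')
        .ok_or(HandshakeError::MissingSeparator)?;
    if first.is_empty() || second.is_empty() {
        return Err(HandshakeError::EmptyUfrag);
    }
    Ok((first, second))
}
```

The caller can then close the connection or ignore the packet on `Err`, which matches the later change of `make_rtc_client` to return a `Result`.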

Move the FIN/FIN_ACK half-close handshake from `Substream::poll_shutdown` into `SubstreamHandle::poll_next` (`half_close`). Now there are two clear halves that can close separately. Both `poll_shutdown` and `Drop` go through the same path, so dropping a substream still produces a graceful FIN. Introduce a dedicated `State::Reset` variant (previously folded into `SendClosed`), also send `RESET_STREAM` to the peer when the FIN_ACK timeout expires, instead of silently transitioning to `FinAcked`. Clean up stale `SubstreamHandle` entries from `SubstreamHandleSet` when a channel is closed or an outbound write fails, so exhausted handles don't sit forever in the round-robin set.

Refactor the code to remove a channel writing messages to the same struct, and respect the spec by immediately replying with FIN_ACK after receiving a FIN instead of occasionally waiting 10 seconds and letting the remote reset the connection.

lexnv · 2026-05-21T13:10:03Z

The state management between Substream (object received by higher-level protocols) and SubstreamHandle polled on webrtc task is getting a bit too convoluted:

We share a state: Arc<Mutex<State>> between both objects

This is altered by SubstreamHandle::on_message to transition the state on FinAck and wake up the shutdown wakers. Similarly, transition into SendClosed on receiving Flag::StopSending and wake up the poll_write waker. And similar upon receiving ResetStream
Then the Substream::poll_write rejects any state that's not State::Open with a broken pipe error.
Substream::poll_shutdown returns pending until the state transitions into FinAck / Reset

We have a write_waker atomic waker registered by the poll_write

In combination with state mutex, it is effectively signaling a write error upon Flag::StopSending or Flag::ResetStream (broken pipe)

We have a shutdown_waker atomic waker registered inside Substream::poll_shutdown

This is woken up on FinAck / ResetStream
Also from fn half_close on timeout if we didn't receive the FinAck in due time

We also got a substream_shutdown atomic bool with self.substream_shutdown_waker atomic waker

substream_shutdown_waker registered by SubstreamHandle::poll_next
Substream::poll_shutdown / on drop sets the bool to true and wakes the waker

Instead, could we simplify this by:

removing the mutex entirely and keeping the state private to the SubstreamHandle
Waker + atomicbool patterns could be replaced by channels

Ideally, the Substream could hold:

read_buffer and rx: Receiver<Event> similar to before
tx: Option<PollSender<Event>> signal when the poll_write should fail
done_signal: Option<oneshot sender<()>> signals the substreamHandle that shutdown was called / or substream dropped
shutdown_complete: Option<oneshot receiver<()>> awaited by shutdown to resolve when the fin ack/reset/timeout reached its state
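
A very rough sketch of the channel-only shape proposed in this comment, limited to the two shutdown signals. It uses `std::sync::mpsc` as a synchronous stand-in for async oneshot channels, and the method names (`pair`, `request_shutdown`, `is_shutdown_complete`, `shutdown_requested`, `finish`) are hypothetical. A dropped sender is treated the same as an explicit signal, which is what lets "shutdown called" and "substream dropped" share one path.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

pub struct Substream {
    /// Signals the handle that shutdown was requested (or, when dropped,
    /// that the substream is gone).
    done_signal: Option<Sender<()>>,
    /// Resolves once the handle reached FIN_ACK, reset or timeout.
    shutdown_complete: Receiver<()>,
}

pub struct SubstreamHandle {
    done: Receiver<()>,
    shutdown_complete: Option<Sender<()>>,
}

pub fn pair() -> (Substream, SubstreamHandle) {
    let (done_tx, done_rx) = channel();
    let (complete_tx, complete_rx) = channel();
    (
        Substream { done_signal: Some(done_tx), shutdown_complete: complete_rx },
        SubstreamHandle { done: done_rx, shutdown_complete: Some(complete_tx) },
    )
}

impl Substream {
    pub fn request_shutdown(&mut self) {
        if let Some(tx) = self.done_signal.take() {
            let _ = tx.send(());
        }
    }

    /// True once the handle signalled completion or was dropped.
    pub fn is_shutdown_complete(&self) -> bool {
        !matches!(self.shutdown_complete.try_recv(), Err(TryRecvError::Empty))
    }
}

impl SubstreamHandle {
    /// True once shutdown was requested or the substream was dropped.
    pub fn shutdown_requested(&self) -> bool {
        !matches!(self.done.try_recv(), Err(TryRecvError::Empty))
    }

    /// Called when the half-close handshake ends, for any reason.
    pub fn finish(&mut self) {
        if let Some(tx) = self.shutdown_complete.take() {
            let _ = tx.send(());
        }
    }
}
```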

gab8i · 2026-05-21T16:45:54Z

The latest commits address all the review comments except the last one, which would require refactoring the Substream/SubstreamHandle relationship (#593). One TODO remains in substream.rs (around the ReadingState::Fin transition), addressing it would mean adding or modifying wakers, but since the goal is to partially or fully remove them, it will be handled in a follow-up PR

gab8i · 2026-05-21T17:59:40Z

The last commit adds a simple if to apparently make things work, but it sweeps a bigger problem under the carpet, one related to the Substream <-> SubstreamHandle structure, which will be addressed in the refactor. The core issue:
once the Substream receives RecvClosed, poll_read returns Err(BrokenPipe), which causes a drop of the Substream, including the write side, which a half-close from the peer should not close.

lexnv

This is an improvement over what we had before 👍

I didn't look closely at the edge cases between the mutex states/atomic bools and wakers, so we still have some rough edges. However, most of the code would get simplified either way by: #586 (comment) 🙏

…600) Rework the communication mechanism between `Substream` and its `SubstreamHandle`. State is shared through `Mutex` and `AtomicWaker` abstracted behind a small helper, which makes them easy to use and ensures the relevant tasks are woken whenever the shared state changes. This also decouples the reading half from the writing half: a graceful close of either half no longer implies closing the other. An abrupt RESET_STREAM still tears down both, as required by the spec.
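
The decoupling of the two halves can be pictured with a small sketch. It is not the helper from the referenced rework: the `SharedState` type and its methods are hypothetical, wakers are left out, and it uses `AtomicU8` values (which a later review comment mentions the code eventually moved to). A graceful close touches one half only, while a reset overrides both.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const OPEN: u8 = 0;
const CLOSED: u8 = 1;
const RESET: u8 = 2;

pub struct SharedState {
    read: AtomicU8,
    write: AtomicU8,
}

impl SharedState {
    pub fn new() -> Self {
        Self { read: AtomicU8::new(OPEN), write: AtomicU8::new(OPEN) }
    }

    /// Graceful close of the read half (peer sent FIN). Only moves
    /// OPEN -> CLOSED, so it cannot overwrite a reset.
    pub fn close_read(&self) {
        let _ = self
            .read
            .compare_exchange(OPEN, CLOSED, Ordering::SeqCst, Ordering::SeqCst);
    }

    /// Graceful close of the write half (we sent FIN).
    pub fn close_write(&self) {
        let _ = self
            .write
            .compare_exchange(OPEN, CLOSED, Ordering::SeqCst, Ordering::SeqCst);
    }

    /// Abrupt reset tears down both halves.
    pub fn reset(&self) {
        self.read.store(RESET, Ordering::SeqCst);
        self.write.store(RESET, Ordering::SeqCst);
    }

    pub fn can_read(&self) -> bool {
        self.read.load(Ordering::SeqCst) == OPEN
    }

    pub fn can_write(&self) -> bool {
        self.write.load(Ordering::SeqCst) == OPEN
    }

    pub fn is_reset(&self) -> bool {
        self.read.load(Ordering::SeqCst) == RESET
            && self.write.load(Ordering::SeqCst) == RESET
    }
}
```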

dmitry-markin · 2026-05-28T13:05:40Z

+                                target: LOG_TARGET,
+                                peer = ?self.peer,
+                                destination = ?v.destination,
+                                "UDP send buffer full, dropping datagram (str0m will retransmit)",


As I see it, str0m will retransmit after the hole in the sequence numbers is detected, leading to unordered delivery of packets and broken application data. Instead, it's better to pause (backpressure) the producer and send all UDP packets in the order they are produced by str0m.

Can be a follow-up PR.

I'm not sure that what you described is really needed.
The flow of messages producer -> str0m is different from str0m -> udp socket.
The latter benefits from all the guarantees that str0m gives, especially given how str0m is created:

ordered: false

reliability: Default::default() → Reliability::Reliable

So the channel is reliable: dropping an outgoing UDP datagram cannot lose application data, since SCTP guarantees retransmission.

The producer backpressure implemented by #575 was related to the first flow, messages that have not yet entered str0m, where it is up to us to control the order of things.

If we shift the focus from correctness to efficiency, then yes, it would be better to have a queue of pending UDP packets, because relying on str0m to recover here could be pretty slow.

Reading now #586 (comment) I think I got confused about how the 'ordered' flag works within str0m

dropping an outgoing UDP datagram cannot lose application data

It doesn't "lose" it in the precise sense, but because the packets arrive not in order now, and we do not reorder them on the application protocol level, the actual stream data will be garbage.

Actually, I don't know why libp2p spec is using unordered delivery. This is not what we get with TCP/WSS connections.

Ok, if I'm not mistaken, to follow go-libp2p behavior, ordered should be set to true (which is also the default), so that the peer is expected to wait for the retransmission of a lost packet!

This would imply that the current implementation is expected to work fine, apart from a possible optimization: manually retransmitting the packet instead of waiting for str0m to detect the loss!

Sorry, the last message was sent without reading your replies, gh didn't show them up to now.

Actually, let's set ordered: true. The spec says the implementations MAY expose unordered, so we are free to use ordered as well:
https://github.com/libp2p/specs/blob/master/webrtc/README.md#ordering
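
To illustrate why ordering matters for a byte stream, here is a toy model that is independent of str0m and SCTP. Chunks carry a sequence number, one chunk is "retransmitted" and arrives late, and the receiver either concatenates in arrival order or buffers until the next expected sequence number is available. Both function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Unordered delivery without reassembly: bytes are appended as they arrive.
pub fn concat_in_arrival_order(arrivals: &[(u32, &[u8])]) -> Vec<u8> {
    arrivals
        .iter()
        .flat_map(|(_, chunk)| chunk.iter().copied())
        .collect()
}

/// Ordered delivery: a chunk is released only when all earlier sequence
/// numbers have been delivered, later chunks wait in a buffer.
pub fn concat_in_sequence_order(arrivals: &[(u32, &[u8])]) -> Vec<u8> {
    let mut buffered = BTreeMap::new();
    let mut next = 0u32;
    let mut out = Vec::new();
    for (seq, chunk) in arrivals {
        buffered.insert(*seq, *chunk);
        while let Some(ready) = buffered.remove(&next) {
            out.extend_from_slice(ready);
            next += 1;
        }
    }
    out
}
```

For example, with arrivals `[(0, "he"), (2, "o!"), (1, "ll")]` the first function produces `heo!ll` while the second produces `hello!`. With `ordered: true` this reordering is done by the transport, so the application does not need its own sequence numbers.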

dmitry-markin · 2026-05-28T13:34:00Z

+                        "str0m rejected timeout input, closing connection",
+                    );
+                    return self.on_connection_closed().await;
+                }


Reading the str0m docs, I don't think we need to pass Input::Timeout before trying to receive the incoming UDP packet. I.e., the entire special case can be removed.

Given the select! below is biased, I would put the handle_input(Input::Timeout(...)) arm second, right after the recv() arm, and this should be enough.

You are totally right, doc:

... poll_output, the function will only produce more output again when one of two things happen: - The polled timeout is reached. - New network input.

What this means is that we can safely remove this special case.

I just don't follow why we should move handle_input(Input::Timeout(...)) in second position just after the recv()?

I just don't follow why we should move handle_input(Input::Timeout(...)) in second position just after the recv()?

I would prioritize pumping the network traffic over processing the commands. This can be triggered under load when multiple futures resolve at the same time, and in this case we want to free/process the network buffers first before processing the user commands.

The current approach won't break anything, but we can try to process user commands only to discover we are backpressured by networking, then process networking, and only then process commands. With prioritizing networking we spare one poll cycle in such situation.
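
The prioritization argument can be shown with a synchronous stand-in for a biased `select!`. The `Sources` struct and `next_event` method are hypothetical and only model the ordering of the arms: when several sources are ready at once, network input is taken first, then the timeout, then user commands.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    Datagram(Vec<u8>),
    Timeout,
    Command(String),
}

pub struct Sources {
    pub datagrams: VecDeque<Vec<u8>>,
    pub timeout_due: bool,
    pub commands: VecDeque<String>,
}

impl Sources {
    /// Checks the sources in a fixed order, like the arms of a biased select:
    /// 1. incoming network traffic, 2. the timeout, 3. user commands.
    pub fn next_event(&mut self) -> Option<Event> {
        if let Some(datagram) = self.datagrams.pop_front() {
            return Some(Event::Datagram(datagram));
        }
        if self.timeout_due {
            self.timeout_due = false;
            return Some(Event::Timeout);
        }
        self.commands.pop_front().map(Event::Command)
    }
}
```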

lexnv · 2026-05-28T17:20:17Z

        source: SocketAddr,
        destination: SocketAddr,
-    ) -> (Rtc, ChannelId) {
+    ) -> crate::Result<(Rtc, ChannelId)> {


Nice! This fixes the git issue for the make_rtc_client panics!

dmitry-markin

Overall looks good! One question, though: why can't we do without locks and implement communication between Substream <-> SubstreamHandle purely over channels?

dmitry-markin · 2026-05-29T07:23:52Z

+        // In practice this should never happen because SCTP guarantees the order
+        // of messages, thus no other message is expected after a Reset.


SCTP is configured to unordered as per libp2p WebRTC spec:

litep2p/src/transport/webrtc/connection.rs

Line 936 in 475cb1b

ordered: false,

so there is no such guarantee.

Yes, I think this is a mistake and should be moved to ordered: true

Update Rtc ChannelConfig to use ordered messages and explicitly use `Reliability::Reliable` without relying on default one.

lexnv

Had another look! Thanks for switching to AtomicU8s 🚀

This was referenced May 18, 2026

fixes(webrtc): webrtc substream/ICE issues + request-response close ordering #574

Closed

fix(webrtc): support different ordering of protobuf flags #577

Closed

gab8i changed the title ~~[WIP] fix(webrtx): ICE issue + substream shutdown procedure~~ fix(webrtx): ICE issue + substream shutdown procedure May 18, 2026

gab8i mentioned this pull request May 18, 2026

feat(webrtc): multistream-select protocol implementation #573

Merged

lexnv reviewed May 18, 2026

lexnv reviewed May 19, 2026

This was referenced May 20, 2026

webrtc: Transport doesn't accept a flag-only frame #589

Open

webrtc: Stream implementation of WebRtcTransport allocates 16KiB per packet #588

Open

Base automatically changed from gab_webrtc_multistream_select to master May 20, 2026 09:09

timwu20 and others added 5 commits May 20, 2026 12:28

webrtc: Fix InputRejected error during ICE negotiation

d59cb4c

webrtc: Add diagnostic logging for notification handshake debugging

086ac4d

webrtc: report substream open failure when channel closes unexpectedly

624fccb

chore(webrtc): update trace target and fin_ack timeout

1dfbe6e

gab8i force-pushed the gab_webrtc_multiple_fixes_v2 branch from 4b5487c to 2be6862 Compare May 20, 2026 10:32

gab8i added 2 commits May 20, 2026 12:34

fix(webrtc): handle higher level drop SubstreamHandle

18a9173

lexnv mentioned this pull request May 21, 2026

webrtc: Simplify Substream / SubstreamHandle state machine #593

Open

gab8i added 3 commits May 21, 2026 18:24

fix(webrtc): don't downgrade Reset to SendClosed on StopSending

2908e57

fix(webrtc): do not panic within make_rtc_client

93e48b9

fix(webrtc): close connection, not panic and report pending open fail…

1f1d229

…ures

fix(webrtc): process flags even when Substream is dropped

9166876

fix(webrtc): drive pending shutdown

04dbf48

lexnv approved these changes May 22, 2026

lexnv mentioned this pull request May 25, 2026

libp2p-go: Add support for WebRTC testing using libp2p-go lexnv/litep2p-perf#4

Open

gab8i mentioned this pull request May 28, 2026

webrtc: Panics in run_event_loop can leak the connection to upper-layer #580

Open

gab8i and others added 3 commits May 28, 2026 15:47

Merge branch 'master' into gab_webrtc_multiple_fixes_v2

64a5c4b

test(webrtc): clippy

475cb1b

dmitry-markin reviewed May 28, 2026

lexnv reviewed May 28, 2026

Comment thread src/transport/webrtc/substream.rs Outdated

lexnv reviewed May 28, 2026

Comment thread src/transport/webrtc/mod.rs

dmitry-markin approved these changes May 29, 2026

gab8i added 6 commits May 29, 2026 10:46

fix(webrtc): avoid duration underflow

1b44a5f

refactor(webrtc): remove redundand writer_state check

e11314f

chore(webrtc): remove ice pass from debug trace

5b25dd9

fix(webrtc): rtc channel config, ordered and reliable messages

46259b1

Update Rtc ChannelConfig to use ordered messages and explicitly use `Reliability::Reliable` without relying on default one.

fix(webrtc): remove useless rct.handle_intput with timeout call

0494c1d

refactor(webrtc): lock free shared state

9db4a8e

lexnv approved these changes May 29, 2026
