Skip to content

Fix DCEP panics and SCTP stream ID reuse race condition#883

Merged
algesten merged 2 commits into
algesten:mainfrom
ChainSafe:tim/fix-dcep-panics-stream-reuse
Mar 7, 2026
Merged

Fix DCEP panics and SCTP stream ID reuse race condition#883
algesten merged 2 commits into
algesten:mainfrom
ChainSafe:tim/fix-dcep-panics-stream-reuse

Conversation

@timwu20
Copy link
Copy Markdown
Contributor

@timwu20 timwu20 commented Mar 4, 2026

Summary

  • Handle write_with_ppi errors gracefully instead of panicking — Replace .expect() calls on DCEP open/ack writes with match blocks that log a warning and close the stream on failure, preventing panics from ErrPayloadDataStateNotExist when the SCTP stream is in an unexpected state.
  • Prevent SCTP stream ID reuse before RE-CONFIG completion — Add a cooldown period (2s) for recently closed stream IDs so they are not immediately reallocated. This avoids a race condition where a new data channel is assigned a stream ID that is still pending RE-CONFIG acknowledgment from the remote peer, which causes ErrStreamAlreadyExist errors.

Problem

When data channels are rapidly closed and reopened, two issues can occur:

  1. Panic on DCEP write: If the SCTP stream enters an unexpected state (e.g., due to a concurrent close), write_with_ppi returns an error that was previously unwrapped with .expect(), causing a panic.

  2. Stream ID reuse race: When a data channel is closed, str0m sends an SCTP RE-CONFIG to reset the stream. However, the stream ID was immediately available for reuse. If a new channel was allocated the same stream ID before the remote peer processed the RE-CONFIG, sctp-proto would return ErrStreamAlreadyExist, leading to the error in (1).

Solution

  • Commit 1 (bfbce6f): Replaces panicking .expect() calls with proper error handling that logs the failure and cleanly closes the stream.
  • Commit 2 (1a762ec): Introduces a closed_stream_ids cooldown list in ChannelHandler. Closed stream IDs are excluded from allocation for STREAM_ID_COOLDOWN (2 seconds), giving the remote peer time to process the RE-CONFIG before the ID is reused.

Test plan

  • Existing unit tests updated and passing
  • New test stream_id_not_reused_during_cooldown verifies cooldown behavior
  • cargo test passes (all except data_channel_flood which is unrelated)
  • cargo fmt and cargo clippy clean

timwu20 added 2 commits March 3, 2026 11:16
Replace .expect() on DCEP open/ack writes with match blocks that
log a warning and close the stream on failure, preventing panics
from ErrPayloadDataStateNotExist.
@algesten
Copy link
Copy Markdown
Owner

algesten commented Mar 4, 2026

@timwu20 did you do a deeper investigation on this? Or is this more a shooting from the hip for something you observe?

Can we prove the problem and the fix with fuzz testing?

@timwu20
Copy link
Copy Markdown
Contributor Author

timwu20 commented Mar 5, 2026

@timwu20 did you do a deeper investigation on this? Or is this more a shooting from the hip for something you observe?

@algesten Yes, this came from a real investigation while testing with a Polkadot node, not shooting from the hip. Here's the full context:

Background

We're using str0m in litep2p to implement libp2p protocols over WebRTC. In the libp2p world, request/response protocols routinely open a data channel, send a request, receive a response, and immediately close the stream. This pattern of rapidly closing and reopening data channels is what triggers the stream ID reuse problem.

What I observed

On the browser side, I could see DCEP opens being rejected — the remote peer was receiving a DCEP open for a stream ID that it hadn't finished processing the RE-CONFIG close for yet. This led to ErrStreamAlreadyExist in sctp-proto, which then caused the panic via .expect() on the subsequent write_with_ppi call.

Investigation into sctp-proto

I looked into whether we could poll sctp-proto to know when a stream's RE-CONFIG has been fully acknowledged, so we'd know exactly when it's safe to reuse that stream ID. However, that would require:

  1. Changing the public API of sctp-proto to expose when streams have been internally removed (i.e., when the RE-CONFIG acknowledgment has been received from the remote peer).
  2. Implementing a cooldown within sctp-proto itself, because even after we send the RE-CONFIG, we need to give the remote peer time to receive and process our RE-CONFIG ack before we can reuse the stream ID.

Since both of those are more invasive changes to sctp-proto's API, the cooldown approach in str0m is a naive but pragmatic fix that solves the problem at the str0m level. The 2-second cooldown is conservative enough to cover the RE-CONFIG round-trip in practice.

Can we prove the problem and the fix with fuzz testing?

I'm open to adding fuzz testing for this. The scenario to fuzz would be rapid open/close/reopen cycles of data channels — that's what reliably triggers the issue in our libp2p usage. The unit test stream_id_not_reused_during_cooldown validates the cooldown mechanism itself, but a fuzz test exercising the full SCTP lifecycle could add more confidence.

@algesten algesten merged commit 3ddc3fc into algesten:main Mar 7, 2026
55 checks passed
@algesten
Copy link
Copy Markdown
Owner

algesten commented Mar 7, 2026

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants