Skip to content

gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123

Open
Roasbeef wants to merge 9 commits into
masterfrom
fix-gbn-transport-timeouts
Open

gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123
Roasbeef wants to merge 9 commits into
masterfrom
fix-gbn-transport-timeouts

Conversation

@Roasbeef

@Roasbeef Roasbeef commented Mar 5, 2026

Copy link
Copy Markdown
Member

In this PR, we fix a class of LNC transport failures where the GBN
keepalive mechanism was too aggressively timing out connections that
traverse the aperture mailbox relay. Users were hitting "cannot send, gbn
exited" errors in the Terminal Web UI, particularly on Pool RPCs, because
the 3s pong timeout couldn't accommodate the variable latency introduced
by the relay hop.

We address this with three changes: bumping the static timeout defaults to
more reasonable values, adding a dynamic pong timeout that adapts to
observed RTT, and fixing a pre-existing deserialize bug found via
property-based testing.

Dynamic Pong Timeout

The core idea is simple: rather than a fixed pong deadline, we compute it
as max(basePongTime, pongMultiplier * latestRTT), capped at
maxPongTime. The RTT is already tracked by the TimeoutManager from ACK
response measurements, so no additional round trips are needed. A new
WithDynamicPongTimeout config option controls the multiplier (default 3x)
and ceiling (default 15s). When dynamic mode is disabled or no RTT data
exists yet, we fall back to the static base pong time, so this is fully
backwards compatible with existing deployments.

In gbn_conn.go, the pong ticker now calls ResetWithInterval on each
ping send to pick up the latest dynamic value rather than reusing the
original static interval.

Updated Defaults

The mailbox client and server connections get new defaults tuned for the
relay transport:

  • Client ping: 7s -> 10s
  • Server ping: 5s -> 8s
  • Pong timeout: 3s -> 5s (base, scaling up to 15s via dynamic RTT)

These are backwards compatible since GBN timeouts are configured
independently per side with no negotiation. A new client connecting to an
old server (or vice versa) works fine because each side's pings arrive well
within the other's receive window.

Deserialize Bug Fix

While writing property-based tests with pgregory.net/rapid, we found an
off-by-one in Deserialize for DATA packets: it checked len(b) < 3 but
then accessed b[3] for the IsPing field, panicking on 3-byte inputs like
[]byte{0x02, 0x00, 0x00}. The fix is simply len(b) < 4.

Property-Based Tests

We add eight property test groups using rapid that exercise GBN invariants
under randomized inputs: message roundtrip, deserialize safety (which
caught the bug above), modular sequence arithmetic, queue size bounds after
random add/ACK/NACK sequences, dynamic pong timeout floor/ceiling
invariants, and timeout booster behavior. We also add two backwards
compatibility integration tests that verify bidirectional data transfer
between mixed old/new timeout configurations.

See each commit message for a detailed description w.r.t the incremental
changes.

In this commit, we fix an off-by-one error in the Deserialize function
for DATA packets. The previous bounds check required `len(b) >= 3` but
then accessed `b[3]` for the IsPing field, which would panic on exactly
3-byte inputs. The minimum length for a DATA packet is 4 bytes: type
byte, sequence number, FinalChunk flag, and IsPing flag (with payload
being optional and starting at `b[4:]`).

This bug was discovered via property-based testing with pgregory.net/rapid,
which generated the crashing input `[]byte{0x02, 0x00, 0x00}`.

@ViktorT-11 ViktorT-11 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this @Roasbeef 🔥!! I'm leaving a few comments below, which I believe are worth addressing.

Comment thread gbn/gbn_conn.go
// to pick up any dynamic pong timeout changes
// based on observed RTT.
pongTime := g.timeoutManager.GetPongTime()
g.pongTicker.ResetWithInterval(pongTime)

@ViktorT-11 ViktorT-11 Mar 18, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an issue here now as g.pongTicker can now exceed the value of g.pingTicker. If that happens, the <-g.pingTicker.Ticks() case in the outer select may be triggered prior to <-g.pongTicker.Ticks(). This could result in the connection remaining alive when it should have been closed.

If we’re dynamically increasing pongTicker, I believe we’ll need to apply the same logic to g.pingTicker. Otherwise, we may need to refactor the sendPacketsForever function to handle this more reliably.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to clamp this value, added some additional tests here as well.

Comment thread gbn/timeout_manager.go Outdated
m.hasSetDynamicTimeout = true

// Store the latest observed RTT for use by the dynamic pong timeout.
m.latestRTT = responseTime

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: we don’t update responseTime for every packet we receive. Instead, we only update it after receiving reachedFrequency packets, and we use the response time of that specific packet.

This means latestRTT can fluctuate significantly over time. If we’re unlucky, it could reflect an unusually short response time for that particular packet, even though the overall average response time is currently much higher.

For resendTimeout, this is less problematic as the worst-case scenario is that we resend the packet queue. However, for pongTimeout, the consequences of this is that the connection will be closed.

For pongTimeout (and RTT calculation in general), I think a better approach would be to use the average response time over the last X packets, rather than relying on the response time of the single packet that triggers updateResendTimeoutUnsafe.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh agreed better to smooth it out with some sort of aggregation metric.

Comment thread gbn/rapid_test.go
@@ -0,0 +1,424 @@
package gbn

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥!

@Roasbeef Roasbeef force-pushed the fix-gbn-transport-timeouts branch from 5d316be to eb1ddd3 Compare April 1, 2026 19:06
@Roasbeef

Roasbeef commented Apr 1, 2026

Copy link
Copy Markdown
Member Author

@ViktorT-11 addressed your review feedback — here's what changed:

1. Pong timeout can exceed ping interval (gbn_conn.go:431)

  • Added a pingTime cap in GetPongTime() so the dynamic pong timeout never exceeds the ping interval. This prevents the scenario where ping fires and resets the pong timer before it expires, keeping a dead connection alive indefinitely.

2. Single-sample RTT is unreliable (timeout_manager.go:438)

  • Replaced the single-sample latestRTT with an EWMA-smoothed RTT (alpha=0.25). This resists single-sample outliers — an unlucky low measurement can't make the pong timeout aggressive enough to kill a healthy connection.
  • The smoothed RTT is now updated on every ACK (not just every N packets), giving the pong timeout a much more responsive and stable signal.

New tests added:

  • TestDynamicPongCappedByPingTime — verifies the pingTime cap works correctly
  • TestEWMASmoothing — verifies EWMA convergence and outlier resistance
  • TestDefaultPongMultiplierAndMaxPongTime — regression test for constructor defaults
  • TestRapidEWMAProperties — property-based: smoothedRTT always bounded by [min,max] of samples; constant samples converge exactly
  • New rapid property: pong <= pingTime always holds under random inputs

Ready for re-review when you get a chance.

@ViktorT-11 ViktorT-11 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 🔥! Sorry for the delay in the response here. uTACK LGTM 🔥!

I'm leaving a few suggestions below that I think is worth addressing though, and especially the test coverage feedback.

Comment thread mailbox/client_conn.go Outdated
Comment thread gbn/gbn_conn.go Outdated
Comment thread gbn/gbn_conn_test.go Outdated
@lightninglabs-deploy

Copy link
Copy Markdown

@ellemouton: review reminder

Roasbeef added 7 commits June 19, 2026 01:35
In this commit, we introduce adaptive pong timeout behavior for the GBN
keepalive mechanism. Previously the pong timeout was a fixed static value,
which could be too aggressive for relay-based transports like the LNC
mailbox where round-trip times through the intermediary can vary
significantly with network conditions.

The dynamic pong timeout is computed as:

    pongTimeout = max(basePongTime, pongMultiplier * latestRTT)

capped at a configurable maximum (maxPongTime). The RTT is tracked by
the TimeoutManager from existing ACK response measurements, so no
additional round trips are needed. A new WithDynamicPongTimeout config
option controls the multiplier and ceiling. When dynamic mode is disabled
or no RTT data is available yet, the behavior falls back to the static
base pong time, preserving full backwards compatibility.

The pong ticker in gbn_conn.go now uses ResetWithInterval on each ping
send to pick up the latest dynamic value rather than reusing the original
static interval.
In this commit, we add four focused tests for the new dynamic pong
timeout behavior introduced in the TimeoutManager:

- TestDynamicPongTimeout: exercises the core computation verifying
  the floor (basePongTime), ceiling (maxPongTime), and RTT-proportional
  scaling via the pong multiplier.
- TestDynamicPongTimeoutDisabled: confirms static fallback behavior
  when dynamic mode is not enabled.
- TestDynamicPongTimeoutWithDataPackets: verifies that RTT measurements
  from data packet ACKs feed into the dynamic pong computation.
- TestGetLatestRTT: validates the RTT tracking accessor.
In this commit, we adjust the default GBN keepalive timeouts for the
mailbox client and server connections and wire up the new dynamic pong
timeout feature. The previous values were tuned for direct connections
but proved too aggressive for the relay-based LNC transport, where
messages traverse the aperture mailbox server, adding variable latency.

The new defaults are:
- Client ping: 7s -> 10s
- Server ping: 5s -> 8s
- Pong timeout: 3s -> 5s base, with dynamic scaling based on observed
RTT

Both client and server connections now pass WithDynamicPongTimeout with
a 3x RTT multiplier. The dynamic pong timeout is capped by the local
ping interval, so the effective cap is 10s on the client side and 8s on
the server side. This allows the pong deadline to adapt to observed
network conditions while still catching genuinely dead connections.

These changes are backwards compatible: GBN timeouts are configured
independently per side with no negotiation, so a new client connecting
to an old server (or vice versa) will function correctly since each
side's pings arrive well within the other's receive window.
Add mixed-version keepalive compatibility coverage for the timeout
changes. Beyond the original bidirectional mixed-version connection
tests, this now exercises sender-driven keepalive behavior under
injected latency, all-upgraded baseline behavior, and simultaneous
keepalive ping races.

The new tests verify that:
- old-side static pong timeouts still govern mixed-version failure
  when the old peer sends the keepalive ping
- upgraded peers tolerate RTT below their local dynamic-pong cap
- upgraded peers still fail once RTT exceeds that cap
- in simultaneous-ping races, the old-side peer still initiates
  closure first

The test helpers also now observe keepalive pings and FIN packets on
the transport so the assertions are tied to actual timing behavior,
not just eventual connection closure.
In this commit, we introduce property-based tests using pgregory.net/rapid
to exercise GBN protocol invariants under randomized inputs. These tests
complement the existing example-based suite by exploring the input space
more broadly, and in fact caught the DATA deserialize bounds-check bug
fixed in a prior commit.

The test suite covers eight property groups:

- MessageRoundTrip: any valid message survives serialize/deserialize.
- DeserializeNeverPanics: arbitrary byte slices never cause a panic.
- ContainsSequenceProperties: modular arithmetic invariants for the
  half-open interval [base, top) used by the send queue.
- ContainsSequenceBruteForce: validates containsSequence against a
  reference implementation via enumeration in small modular spaces.
- DynamicPongTimeoutProperties: the dynamic pong value always respects
  the floor (basePongTime) and ceiling (maxPongTime), falls back to
  static when disabled or when no RTT data exists.
- QueueSizeInvariants: the send queue size never exceeds the window
  after random sequences of add/ACK/NACK operations.
- TimeoutBoosterProperties: boosted timeouts are always >= original,
  and reset correctly returns to a new base value.
- PingPongZeroDisablesPing: zero ping time yields MaxInt64 (disabled).
In this commit, we set the pongMultiplier and maxPongTime fields to
their default constant values in NewTimeOutManager. Previously these
fields relied on the zero-value being safe (guarded by the
dynamicPongTime boolean), but initializing them explicitly provides
defense in depth against future code paths that might enable dynamic
mode without going through WithDynamicPongTimeout.
In this commit, we update the reference implementation used in the
brute-force property test to operate in the full uint8 space with
natural wrapping, matching the production containsSequence exactly.
The previous version used a mod-s space which happened to agree for
small values but was a non-obvious equivalence that could silently
break if the test ranges were widened.
@ViktorT-11 ViktorT-11 force-pushed the fix-gbn-transport-timeouts branch from eb1ddd3 to e0feaa6 Compare June 18, 2026 23:36
@ViktorT-11

Copy link
Copy Markdown
Contributor

Addressed my own feedback above with the latest push!

Protect IntervalAwareForceTicker lifecycle operations with a mutex so
Reset, ResetWithInterval, Stop, and ForceTick cannot replace or stop the
underlying time.Ticker concurrently.

The keepalive timeout compatibility tests exercise concurrent send and
receive paths that both reset the ping/pong tickers. In normal
operation, sendPacketsForever can reset the pong ticker when starting a
keepalive ping while receivePacketsForever resets the ping ticker when a
packet arrives. The interval-aware ticker implementation mutated its
ticker, quit channel, and interval fields without synchronization, so
the race detector could observe concurrent reads and writes in the reset
path.

This change adds a dedicated lock around ticker lifecycle updates and
uses a locked helper for interval resets. That preserves the existing
behavior while making the ticker safe under the concurrent reset pattern
used by the keepalive logic.

Why the fix was needed:

The new keepalive compatibility tests made two
goroutines hit the ticker harder:

- sendPacketsForever() resets the pong ticker when it sends a keepalive
  ping.
- receivePacketsForever() resets the ping ticker when packets arrive.

IntervalAwareForceTicker internally stops and recreates its time.Ticker,
replaces quit, and updates interval.
Before the fix, those fields were changed without a mutex, so
overlapping resets could race. go test -race caught exactly that in
ticker.go:143-162.

So the fix was needed because:

- the ticker was not safe under concurrent reset/stop operations
- the keepalive code legitimately does concurrent resets
- race CI was correctly flagging unsynchronized internal state mutation

The mutex makes those lifecycle operations atomic and keeps the
implementation behavior the same.
@ViktorT-11

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants