
Stability fixes for master 4505 publish port (#66282) #69478

Open
dwoz wants to merge 3 commits into saltstack:3006.x from dwoz:fix/issue-66282

dwoz commented Jun 18, 2026

What does this PR do?

Two narrowly-scoped stability fixes for the salt-master publisher
path that together address the symptoms reported in #66282 (the
4505 publish port becoming unresponsive under prolonged load).

  1. TCP PubServer broadcasts concurrently. publish_payload
    used to yield each client.stream.write(payload) serially, so
    a single slow subscriber stalled delivery to every other client.
    That matches the EventPublisher subprocess ballooning to hundreds
    of MB reported by tjyang. Writes are now scheduled on the
    IOLoop up-front and yielded in order; fast subscribers no longer
    wait behind slow ones.
  2. ZMTP heartbeats on the master PUB socket. Without
    ZMQ_HEARTBEAT_IVL / ZMQ_HEARTBEAT_TIMEOUT, dead SUB peers
    linger until the kernel TCP keepalive expires (~2 h 15 min on
    default Linux), during which the PUB socket buffers for them
    and netstat accumulates CLOSE_WAIT entries on 4505. The defaults
    are a 10 s interval and a 30 s timeout, tunable via
    zmq_heartbeat_ivl / zmq_heartbeat_timeout (both in milliseconds).

What issues does this PR fix or reference?

Fixes #66282

Previous Behavior

After hours to days of real production load on 3007.x and 3006.x,
the master's 4505 PUB port becomes unresponsive (nc -zv master 4505
times out). The EventPublisher subprocess grows to
hundreds of MB. netstat shows accumulating CLOSE_WAIT entries
against 4505. Restarting salt-master is the only remedy; in some
cases the port refuses to re-bind and the host needs a reboot.
Multiple users (#66282, see also #66288, #66715, #65265) reported
having to downgrade to 3006.7 / 3005.x to escape it.

New Behavior

  • Slow IPC subscribers no longer stall the TCP publisher loop;
    per-client write buffers stop growing in lockstep with the
    slowest peer.
  • Dead ZMQ SUB peers are reaped in seconds (heartbeat timeout)
    instead of hours (kernel TCP keepalive), so the PUB socket's
    per-peer state and the kernel's CLOSE_WAIT table stop
    accumulating.

Scope and what is NOT in this PR

Issue #66282 is a composite of symptoms. A third candidate cause,
the while True: loop in PublishServer.publisher() on 3007.x+ with
a broad except Exception: and no shutdown gate, does not exist on
3006.x (which uses callback-style pull_sock.on_recv()) and is
therefore out of scope here. If merge-forward to
3007.x/3008.x/master confirms it still applies, that's a
separate follow-up.

A fourth report (pub.bind() fails on restart, needs host reboot)
is likely lingering kernel socket state or a TIME_WAIT race; it
needs its own repro and is also out of scope.

Merge-forward

salt/transport/tcp.py::PubServer.publish_payload was rewritten on
3007.x+ to use async def / await; the fix translates cleanly
(kick off the awaitables, then await them in order). The
heartbeat helper port is 1:1.

Merge requirements satisfied?

  • Tests written/updated (functional test_pub_server_stability;
    unit test_zeromq_pub_stability)
  • Changelog (changelog/66282.fixed.md)
  • Docs — operator-facing knobs zmq_heartbeat_ivl /
    zmq_heartbeat_timeout could use entries in the master config
    reference once the approach is approved.

Commits signed with GPG?

No (matches recent 3006.x merge history).

dwoz requested a review from a team as a code owner June 18, 2026 06:09
dwoz added this to the Sulphur v3006.26 milestone Jun 18, 2026
dwoz added the test:full (Run the full test suite) label Jun 18, 2026
dwoz added 3 commits June 24, 2026 17:59
PubServer.publish_payload serially yielded each
client.stream.write(payload), so a single slow subscriber stalled
delivery to every other client. With dozens to thousands of minions
connected the event publisher loop would fall behind, per-client
write buffers would grow (matching reporter observations of the
EventPublisher subprocess ballooning to hundreds of MB before
restart), and the master would eventually appear wedged on its
publish port.

Schedule every write on the IOLoop first, then yield on the
resulting futures in order. tornado's @gen.coroutine runs the body
when called (not when awaited), so kicking off the writes up-front
lets the IOLoop interleave them: fast subscribers receive their
payload immediately even while a slow subscriber's write is still
draining.

A new regression test installs two fake subscribers with a 3 s slow
write and a 0 s fast write, then asserts the fast subscriber sees
its payload within 1 s of publish_payload being called. Without the
fix it does not.
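
The scheduling change and the regression scenario can be sketched with plain asyncio. This is an illustrative analogue, not the code in salt/transport/tcp.py: FakeSubscriber, publish_serial, publish_concurrent, and fast_latency are hypothetical names, and the 0.3 s delay stands in for the 3 s used by the real test. One difference from tornado's @gen.coroutine is assumed here: calling an asyncio coroutine function does not start its body, so the sketch wraps each write in asyncio.ensure_future to get it running before anything is awaited.

```python
import asyncio
import time


class FakeSubscriber:
    """Hypothetical stand-in for a client whose write() takes `delay` seconds."""

    def __init__(self, name, delay):
        self.name = name
        self.delay = delay
        self.received_at = None

    async def write(self, payload):
        await asyncio.sleep(self.delay)  # simulate a slow or fast socket
        self.received_at = time.monotonic()


async def publish_serial(clients, payload):
    # Old shape: each write must finish before the next one starts.
    for client in clients:
        await client.write(payload)


async def publish_concurrent(clients, payload):
    # New shape: start every write first, then wait on them in order.
    pending = [asyncio.ensure_future(c.write(payload)) for c in clients]
    for fut in pending:
        await fut


def fast_latency(publish, slow_delay=0.3):
    """Return seconds until the fast subscriber has received the payload."""
    slow = FakeSubscriber("slow", slow_delay)
    fast = FakeSubscriber("fast", 0.0)

    async def run():
        start = time.monotonic()
        await publish([slow, fast], b"payload")
        return fast.received_at - start

    return asyncio.run(run())


if __name__ == "__main__":
    print("serial:     %.3f s" % fast_latency(publish_serial))
    print("concurrent: %.3f s" % fast_latency(publish_concurrent))
```

With the slow subscriber listed first, the serial variant delivers to the fast subscriber only after the slow write completes, while the concurrent variant delivers almost immediately. The sketch does not cover error handling for a write that raises.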

Refs saltstack#66282

Without ZMQ_HEARTBEAT_IVL / ZMQ_HEARTBEAT_TIMEOUT configured, the
PUB socket only notices a SUB peer that vanished without sending
FIN (host reboot, kernel panic, dropped firewall rule) once kernel
TCP keepalive expires. On Linux that's ~2 h 15 min by default,
during which the PUB socket keeps buffering for the dead peer and
the kernel accumulates CLOSE_WAIT entries on port 4505. Eventually
the master stops accepting new connections — a state several users
have reported in issue saltstack#66282.

Add a _set_zmq_heartbeat helper, default it to 10 s interval / 30 s
timeout, and call it alongside _set_tcp_keepalive when the
PublishServer's PUB socket is set up. Operators can tune via
zmq_heartbeat_ivl / zmq_heartbeat_timeout (milliseconds, matching
the unit ZMQ uses).
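
A minimal sketch of what such a helper might look like, assuming a Salt-style opts dict and a pyzmq socket. It is not the PR's _set_zmq_heartbeat: apply_heartbeat and RecordingSocket are hypothetical names, and the numeric fallback constants are only there so the example runs without pyzmq installed. The option names and the millisecond defaults follow the description above.

```python
try:
    import zmq

    HEARTBEAT_IVL = zmq.HEARTBEAT_IVL
    HEARTBEAT_TIMEOUT = zmq.HEARTBEAT_TIMEOUT
except (ImportError, AttributeError):
    # Assumption for illustration only: the option numbers from zmq.h.
    HEARTBEAT_IVL, HEARTBEAT_TIMEOUT = 75, 77

DEFAULT_IVL_MS = 10_000      # send a ZMTP PING every 10 s
DEFAULT_TIMEOUT_MS = 30_000  # give up on a peer 30 s after an unanswered PING


def apply_heartbeat(sock, opts):
    """Set ZMTP heartbeat options on `sock`; values are in milliseconds."""
    ivl = int(opts.get("zmq_heartbeat_ivl", DEFAULT_IVL_MS))
    timeout = int(opts.get("zmq_heartbeat_timeout", DEFAULT_TIMEOUT_MS))
    sock.setsockopt(HEARTBEAT_IVL, ivl)
    sock.setsockopt(HEARTBEAT_TIMEOUT, timeout)
    return ivl, timeout


class RecordingSocket:
    """Hypothetical fake socket that records setsockopt calls."""

    def __init__(self):
        self.options = {}

    def setsockopt(self, option, value):
        self.options[option] = value


if __name__ == "__main__":
    fake = RecordingSocket()
    print(apply_heartbeat(fake, {"zmq_heartbeat_ivl": 5000}))
```

With a real socket, the call would presumably go before bind(), since ZeroMQ socket options generally apply only to connections made after they are set.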

Refs saltstack#66282