Stability fixes for master 4505 publish port (#66282) by dwoz · Pull Request #69478 · saltstack/salt

dwoz · 2026-06-18T06:09:11Z

What does this PR do?

Two narrowly-scoped stability fixes for the salt-master publisher
path that together address the symptoms reported in #66282 (the
4505 publish port becoming unresponsive under prolonged load).

TCP PubServer broadcasts concurrently. publish_payload
used to yield each client.stream.write(payload) serially, so
a single slow subscriber stalled delivery to every other client.
That matches the EventPublisher subprocess ballooning to hundreds
of MB reported by tjyang. Writes are now scheduled on the
IOLoop up-front and yielded in order; fast subscribers no longer
wait behind slow ones.

ZMTP heartbeats on the master PUB socket. Without
ZMQ_HEARTBEAT_IVL / ZMQ_HEARTBEAT_TIMEOUT, dead SUB peers
linger until the kernel TCP keepalive expires (~2 h 15 min on
default Linux), during which the PUB socket buffers for them
and netstat accumulates CLOSE_WAIT entries on 4505. The defaults
are a 10 s interval and a 30 s timeout, tunable via
zmq_heartbeat_ivl / zmq_heartbeat_timeout.
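
The keepalive figure quoted above can be sanity-checked with a little arithmetic. The sketch below assumes the stock Linux values for the net.ipv4.tcp_keepalive_* sysctls (7200 s idle, 75 s between probes, 9 probes); a host or application that overrides them will see a different number. The function name is illustrative only.

```python
# Stock Linux values for the net.ipv4.tcp_keepalive_* sysctls (assumed untuned).
TCP_KEEPALIVE_TIME = 7200   # seconds of idle time before the first probe
TCP_KEEPALIVE_INTVL = 75    # seconds between unanswered probes
TCP_KEEPALIVE_PROBES = 9    # unanswered probes before the peer is declared dead


def keepalive_detection_seconds(idle, interval, probes):
    """Seconds from the last traffic until the kernel gives up on a silent peer."""
    return idle + interval * probes


total = keepalive_detection_seconds(
    TCP_KEEPALIVE_TIME, TCP_KEEPALIVE_INTVL, TCP_KEEPALIVE_PROBES
)
hours, remainder = divmod(total, 3600)
print("kernel keepalive: %d s (about %d h %d min)" % (total, hours, remainder // 60))
print("heartbeat timeout proposed here: 30 s")
```

With those assumed defaults the kernel needs a little over two hours to notice a silent peer, which is in line with the rough figure above and several hundred times longer than the 30 s heartbeat timeout.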

What issues does this PR fix or reference?

Fixes #66282

Previous Behavior

After hours-to-days of real production load on 3007.x and 3006.x,
the master's 4505 PUB port becomes unresponsive (nc -zv master 4505 → TIMEOUT). The EventPublisher subprocess grows to
hundreds of MB. netstat shows accumulating CLOSE_WAIT entries
against 4505. Restarting salt-master is the only remedy; in some
cases the port refuses to re-bind and the host needs a reboot.
Multiple users (#66282, see also #66288, #66715, #65265) reported
having to downgrade to 3006.7 / 3005.x to escape it.

New Behavior

Slow IPC subscribers no longer stall the TCP publisher loop;
per-client write buffers stop growing in lockstep with the
slowest peer.

Dead ZMQ SUB peers are reaped in seconds (heartbeat timeout)
instead of hours (kernel TCP keepalive), so the PUB socket's
per-peer state and the kernel's CLOSE_WAIT table stop
accumulating.

Scope and what is NOT in this PR

Issue #66282 is a composite of symptoms. A third candidate —
PublishServer.publisher() on 3007.x+ has a while True: loop
with a bare except Exception: and no shutdown gate — does not
exist on 3006.x (which uses callback-style pull_sock.on_recv())
and is therefore out of scope here. If merge-forward to
3007.x/3008.x/master confirms it still applies, that's a
separate follow-up.

A fourth report (pub.bind() fails on restart, needs host reboot)
is likely lingering kernel socket state or a TIME_WAIT race —
needs its own repro and is also out of scope.

Merge-forward

salt/transport/tcp.py::PubServer.publish_payload was rewritten on
3007.x+ to use async def / await; the fix translates cleanly
(kick off the awaitables, then await them in order). The
heartbeat helper port is 1:1.
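
As a rough illustration of the "kick off, then await in order" shape, here is a minimal sketch that uses only asyncio. It is not the code from salt/transport/tcp.py: FakeSubscriber, publish_serial, publish_concurrent, and fast_latency are hypothetical names, and the 0.3 s delay is an arbitrary stand-in for a slow client. One difference from Tornado worth noting: in plain asyncio a coroutine does not start running until it is wrapped in a task, which is why the sketch calls ensure_future().

```python
import asyncio
import time


class FakeSubscriber:
    """Hypothetical stand-in for a connected client; not a Salt class."""

    def __init__(self, name, delay):
        self.name = name
        self.delay = delay        # seconds the simulated write takes
        self.received_at = None   # monotonic timestamp of delivery

    async def write(self, payload):
        await asyncio.sleep(self.delay)
        self.received_at = time.monotonic()


async def publish_serial(clients, payload):
    # Old shape: each write must finish before the next one starts.
    for client in clients:
        await client.write(payload)


async def publish_concurrent(clients, payload):
    # New shape: start every write first, then await them in order.
    pending = [asyncio.ensure_future(c.write(payload)) for c in clients]
    for future in pending:
        await future


def fast_latency(publish):
    """Return how long the fast subscriber waited for its payload."""
    slow = FakeSubscriber("slow", 0.3)
    fast = FakeSubscriber("fast", 0.0)
    start = time.monotonic()
    asyncio.run(publish([slow, fast], b"payload"))
    return fast.received_at - start


if __name__ == "__main__":
    print("serial:     %.2f s" % fast_latency(publish_serial))
    print("concurrent: %.2f s" % fast_latency(publish_concurrent))
```

In the serial variant the fast subscriber waits for the slow write to finish; in the concurrent variant it is served almost immediately, even though the slow subscriber sits first in the list.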

Merge requirements satisfied?

Tests written/updated (functional test_pub_server_stability;
unit test_zeromq_pub_stability)
Changelog (changelog/66282.fixed.md)
Docs — operator-facing knobs zmq_heartbeat_ivl /
zmq_heartbeat_timeout could use entries in the master config
reference once the approach is approved.

Commits signed with GPG?

No (matches recent 3006.x merge history).

PubServer.publish_payload serially yielded each client.stream.write(payload), so a single slow subscriber stalled delivery to every other client. With dozens to thousands of minions connected the event publisher loop would fall behind, per-client write buffers would grow (matching reporter observations of the EventPublisher subprocess ballooning to hundreds of MB before restart), and the master would eventually appear wedged on its publish port.

Schedule every write on the IOLoop first, then yield on the resulting futures in order. tornado's @gen.coroutine runs the body when called (not when awaited), so kicking off the writes up-front lets the IOLoop interleave them: fast subscribers receive their payload immediately even while a slow subscriber's write is still draining.

A new regression test installs two fake subscribers with a 3 s slow write and a 0 s fast write, then asserts the fast subscriber sees its payload within 1 s of publish_payload being called. Without the fix it does not.

Refs saltstack#66282

Without ZMQ_HEARTBEAT_IVL / ZMQ_HEARTBEAT_TIMEOUT configured, the PUB socket only notices a SUB peer that vanished without sending FIN (host reboot, kernel panic, dropped firewall rule) once kernel TCP keepalive expires. On Linux that's ~2 h 15 min by default, during which the PUB socket keeps buffering for the dead peer and the kernel accumulates CLOSE_WAIT entries on port 4505. Eventually the master stops accepting new connections, a state several users have reported in issue saltstack#66282.

Add a _set_zmq_heartbeat helper, default it to 10 s interval / 30 s timeout, and call it alongside _set_tcp_keepalive when the PublishServer's PUB socket is set up. Operators can tune via zmq_heartbeat_ivl / zmq_heartbeat_timeout (milliseconds, matching the unit ZMQ uses).

Refs saltstack#66282
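
For readers unfamiliar with the ZMTP heartbeat options, a helper along these lines might look roughly like the sketch below. This is an assumption about the general shape, not the PR's actual _set_zmq_heartbeat: heartbeat_settings and apply_heartbeat are hypothetical names, while the opts keys and the 10 s / 30 s defaults (expressed in milliseconds) come from the description above. zmq.HEARTBEAT_IVL and zmq.HEARTBEAT_TIMEOUT are pyzmq's names for the corresponding libzmq socket options.

```python
DEFAULT_HEARTBEAT_IVL_MS = 10_000      # 10 s interval, as described above
DEFAULT_HEARTBEAT_TIMEOUT_MS = 30_000  # 30 s timeout, as described above


def heartbeat_settings(opts):
    """Resolve heartbeat values (milliseconds) from a master opts dict."""
    return {
        "ivl": int(opts.get("zmq_heartbeat_ivl", DEFAULT_HEARTBEAT_IVL_MS)),
        "timeout": int(
            opts.get("zmq_heartbeat_timeout", DEFAULT_HEARTBEAT_TIMEOUT_MS)
        ),
    }


def apply_heartbeat(sock, opts):
    """Hypothetical helper: set ZMTP heartbeat options on a pyzmq socket."""
    import zmq  # imported lazily so heartbeat_settings() works without pyzmq

    settings = heartbeat_settings(opts)
    # Guarded with hasattr because older builds may not expose these constants.
    if hasattr(zmq, "HEARTBEAT_IVL"):
        sock.setsockopt(zmq.HEARTBEAT_IVL, settings["ivl"])
    if hasattr(zmq, "HEARTBEAT_TIMEOUT"):
        sock.setsockopt(zmq.HEARTBEAT_TIMEOUT, settings["timeout"])


if __name__ == "__main__":
    # An operator override of the interval only; the timeout keeps its default.
    print(heartbeat_settings({"zmq_heartbeat_ivl": 5000}))
```

Socket options of this kind are normally applied before the socket is bound, which fits the description of calling the helper while the PUB socket is being set up.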

Fixes saltstack#66282

dwoz requested a review from a team as a code owner June 18, 2026 06:09

dwoz added this to the Sulphur v3006.26 milestone Jun 18, 2026

dwoz added the test:full Run the full test suite label Jun 18, 2026

dwoz mentioned this pull request Jun 18, 2026

Guard PublishClient.recv against torn-down stream socket (#66435) #69479

Open

3 tasks

dwoz added 3 commits June 24, 2026 17:59

Add changelog entry for saltstack#66282

f934a3a

Fixes saltstack#66282

dwoz force-pushed the fix/issue-66282 branch from e292eef to f934a3a June 25, 2026 01:01

dwoz wants to merge 3 commits into saltstack:3006.x from dwoz:fix/issue-66282
