Stability fixes for master 4505 publish port (#66282)#69478
Open
dwoz wants to merge 3 commits into
Open
Conversation
3 tasks
PubServer.publish_payload serially yielded each client.stream.write(payload), so a single slow subscriber stalled delivery to every other client. With dozens to thousands of minions connected the event publisher loop would fall behind, per-client write buffers would grow (matching reporter observations of the EventPublisher subprocess ballooning to hundreds of MB before restart), and the master would eventually appear wedged on its publish port. Schedule every write on the IOLoop first, then yield on the resulting futures in order. tornado's @gen.coroutine runs the body when called (not when awaited), so kicking off the writes up-front lets the IOLoop interleave them: fast subscribers receive their payload immediately even while a slow subscriber's write is still draining. A new regression test installs two fake subscribers with a 3 s slow write and a 0 s fast write, then asserts the fast subscriber sees its payload within 1 s of publish_payload being called. Without the fix it does not. Refs saltstack#66282
Without ZMQ_HEARTBEAT_IVL / ZMQ_HEARTBEAT_TIMEOUT configured, the PUB socket only notices a SUB peer that vanished without sending FIN (host reboot, kernel panic, dropped firewall rule) once kernel TCP keepalive expires. On Linux that's ~2 h 15 min by default, during which the PUB socket keeps buffering for the dead peer and the kernel accumulates CLOSE_WAIT entries on port 4505. Eventually the master stops accepting new connections — a state several users have reported in issue saltstack#66282. Add a _set_zmq_heartbeat helper, default it to 10 s interval / 30 s timeout, and call it alongside _set_tcp_keepalive when the PublishServer's PUB socket is set up. Operators can tune via zmq_heartbeat_ivl / zmq_heartbeat_timeout (milliseconds, matching the unit ZMQ uses). Refs saltstack#66282
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Two narrowly-scoped stability fixes for the salt-master publisher
path that together address the symptoms reported in #66282 (the
4505 publish port becoming unresponsive under prolonged load).
PubServerbroadcasts concurrently.publish_payloadused to
yieldeachclient.stream.write(payload)serially, soa single slow subscriber stalled delivery to every other client.
That matches the EventPublisher subprocess ballooning to hundreds
of MB reported by tjyang. Writes are now scheduled on the
IOLoop up-front and yielded in order; fast subscribers no longer
wait behind slow ones.
ZMQ_HEARTBEAT_IVL/ZMQ_HEARTBEAT_TIMEOUT, dead SUB peerslinger until the kernel TCP keepalive expires (~2 h 15 min on
default Linux), during which the PUB socket buffers for them
and
netstataccumulatesCLOSE_WAITentries on 4505. Defaultto 10 s interval / 30 s timeout; tunable via
zmq_heartbeat_ivl/
zmq_heartbeat_timeout.What issues does this PR fix or reference?
Fixes #66282
Previous Behavior
After hours-to-days of real production load on 3007.x and 3006.x,
the master's 4505 PUB port becomes unresponsive (
nc -zv master 4505→TIMEOUT). TheEventPublishersubprocess grows tohundreds of MB.
netstatshows accumulatingCLOSE_WAITentriesagainst 4505. Restarting
salt-masteris the only remedy; in somecases the port refuses to re-bind and the host needs a reboot.
Multiple users (#66282, see also #66288, #66715, #65265) reported
having to downgrade to 3006.7 / 3005.x to escape it.
New Behavior
per-client write buffers stop growing in lockstep with the
slowest peer.
instead of hours (kernel TCP keepalive), so the PUB socket's
per-peer state and the kernel's
CLOSE_WAITtable stopaccumulating.
Scope and what is NOT in this PR
Issue #66282 is a composite of symptoms. A third candidate —
PublishServer.publisher()on 3007.x+ has awhile True:loopwith a bare
except Exception:and no shutdown gate — does notexist on 3006.x (which uses callback-style
pull_sock.on_recv())and is therefore out of scope here. If merge-forward to
3007.x/3008.x/masterconfirms it still applies, that's aseparate follow-up.
A fourth report (
pub.bind()fails on restart, needs host reboot)is likely lingering kernel socket state or a TIME_WAIT race —
needs its own repro and is also out of scope.
Merge-forward
salt/transport/tcp.py::PubServer.publish_payloadwas rewritten on3007.x+ to use
async def/await; the fix translates cleanly(kick off the awaitables, then await them in order). The
heartbeat helper port is 1:1.
Merge requirements satisfied?
test_pub_server_stability;unit
test_zeromq_pub_stability)changelog/66282.fixed.md)zmq_heartbeat_ivl/zmq_heartbeat_timeoutcould use entries in the master configreference once the approach is approved.
Commits signed with GPG?
No (matches recent 3006.x merge history).