
Guard PublishClient.recv against torn-down stream socket (#66435) #69479

Open
dwoz wants to merge 2 commits into saltstack:3007.x from dwoz:fix/issue-66435


@dwoz dwoz commented Jun 18, 2026

Contributor

What does this PR do?

Adds a small None-check in salt.transport.tcp.PublishClient.recv(timeout=0) so that a stream whose underlying socket has been concurrently torn down no longer crashes the caller. The non-blocking peek now treats a missing socket as "no events pending" and returns None, letting the existing reconnect loop take over.

What issues does this PR fix or reference?

Fixes #66435

Related (different bugs in the same module, do not fix this one):

Previous Behavior

Under load — and reliably on FreeBSD 14 with the 3007.x packages and on RHEL 9.2 / Debian 12 with ipc_mode: ipc — every salt and salt-master invocation crashed out of PublishClient.recv with one of:

TypeError: argument must be an int, or have a fileno() method.

(in 3007.0, from select.select([self._stream.socket], [], [], 0)) or, after #68136 swapped in selectors.DefaultSelector,

ValueError: Invalid file object: None

from selectors._fileobj_to_fd. In both cases the root cause is the same: between the while self._stream is None: await self.connect() check at the top of recv() and the selector peek a few lines later, the Tornado IOStream for the publish IPC socket can be closed by another task. Tornado sets IOStream.socket to None on close, so the peek tries to register None with the selector and dies with an unhandled exception. The error escaped all the way through salt.utils.asynchronous.SyncWrapper to the salt CLI, breaking every command.
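
The two failure modes can be reproduced with the standard library alone by handing None to each style of peek, which is what happens once the stream has been closed. This is a minimal sketch, not Salt's code: peek_with_select, peek_with_selector and describe_failure are hypothetical names used only for illustration.

```python
import select
import selectors


def peek_with_select(sock):
    """Non-blocking readability peek in the style of the 3007.0 code path."""
    readable, _, _ = select.select([sock], [], [], 0)
    return bool(readable)


def peek_with_selector(sock):
    """Non-blocking readability peek in the style of the selectors-based path."""
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        return bool(sel.select(timeout=0))


def describe_failure(peek):
    """Call peek(None), as if the stream's socket had already been cleared."""
    try:
        peek(None)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__, str(exc)
    return None


if __name__ == "__main__":
    print(describe_failure(peek_with_select))
    print(describe_failure(peek_with_selector))
```

Neither call ever reaches the kernel: the argument is rejected while it is being converted to a file descriptor, which is why a simple None-check before the peek is enough to avoid both exceptions.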

New Behavior

recv(timeout=0) snapshots self._stream.socket once. If it's None, the method returns None immediately — the same return value the caller already handles when no events are pending — and the existing reconnect path takes over without crashing.
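
The shape of the guard can be sketched roughly as follows. This is an illustration under stated assumptions rather than the patch itself: peek_readable is a hypothetical helper, and stream stands for any object with a .socket attribute, such as the Tornado IOStream held by the client.

```python
import selectors
import socket
from types import SimpleNamespace


def peek_readable(stream):
    """Return True if data is pending on stream.socket, otherwise None.

    `stream` is a hypothetical stand-in for the client's IOStream.
    """
    sock = stream.socket  # snapshot once; another task may clear the attribute
    if sock is None:
        # Stream was torn down: report "no events pending" instead of raising.
        return None
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        return True if sel.select(timeout=0) else None


if __name__ == "__main__":
    left, right = socket.socketpair()
    try:
        print(peek_readable(SimpleNamespace(socket=None)))  # closed stream
        print(peek_readable(SimpleNamespace(socket=left)))  # open, nothing sent
        right.sendall(b"x")
        print(peek_readable(SimpleNamespace(socket=left)))  # open, data pending
    finally:
        left.close()
        right.close()
```

Reading the attribute into a local variable matters: checking self._stream.socket and then reading it again for the selector would reopen the same window the fix is meant to close.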

A regression test (tests/pytests/unit/transport/test_publish_client.py::test_recv_timeout_zero_stream_socket_none) constructs a PublishClient, sets its _stream to a mock whose .socket is None, and asserts recv(timeout=0) returns None without raising. It fails on unmodified 3007.x with ValueError: Invalid file object: None and passes with the fix.
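
For readers without a Salt checkout, the structure of such a test can be sketched against a stand-in class. StandInClient below is hypothetical and models only the timeout=0 guard; it is not salt.transport.tcp.PublishClient, and the real test in the PR exercises the real class.

```python
import asyncio
from unittest.mock import MagicMock


class StandInClient:
    """Hypothetical stand-in that models only the timeout=0 None-guard."""

    def __init__(self):
        self._stream = None

    async def recv(self, timeout=None):
        if timeout == 0:
            sock = self._stream.socket  # snapshot the socket once
            if sock is None:
                return None  # torn-down stream: no events pending
        raise NotImplementedError("only the timeout=0 guard is modelled here")


def test_recv_timeout_zero_stream_socket_none():
    client = StandInClient()
    client._stream = MagicMock()
    client._stream.socket = None  # simulate a concurrently closed stream
    assert asyncio.run(client.recv(timeout=0)) is None


if __name__ == "__main__":
    test_recv_timeout_zero_stream_socket_none()
```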

3008.x and master are unaffected: that recv path has already been rewritten to use asyncio.ensure_future(self._read_into_unpacker()) instead of a kernel-level socket peek, so no merge-forward port is needed beyond what the merge bots will do.

Merge requirements satisfied?

  • Docs (no documented behavior changes)
  • Changelog (changelog/66435.fixed.md)
  • Tests written/updated (tests/pytests/unit/transport/test_publish_client.py)

Commits signed with GPG?

No (matches surrounding non-merge commits on this branch).

PublishClient.recv(timeout=0) was passing self._stream.socket straight
to its non-blocking readability peek without checking whether the
IOStream's underlying socket had been concurrently torn down. Tornado
sets IOStream.socket to None once the stream is closed, and the peek
would then raise

    TypeError: argument must be an int, or have a fileno() method.

from select.select() (or, after the fd>1023 cleanup in saltstack#68136
swapped in selectors.DefaultSelector, a ValueError from the selectors
backend), escaping all the way out to the salt CLI and breaking every
salt and salt-master invocation on hosts where the publisher-side
stream closed underneath the client.

Treat a missing socket as "no events pending" and return None so the
caller re-enters its connect/reconnect loop instead of crashing.

Fixes saltstack#66435
@dwoz dwoz requested a review from a team as a code owner June 18, 2026 07:49
@dwoz dwoz added this to the Chlorine v3007.15 milestone Jun 18, 2026
@dwoz dwoz added the test:full Run the full test suite label Jun 18, 2026
