fix(sandbox): make interactive connect resilient on stopped/resumed sandboxes #215

Draft
marc-vercel wants to merge 1 commit into main from marc-vercel/fix-interactive-connect-resume

Conversation

@marc-vercel
Collaborator

Problem

`sandbox connect` (and interactive `sandbox exec`) could hang indefinitely on "Waiting for connection...", or fail in a confusing way, when run against a stopped sandbox that has to be resumed. It worked reliably against an already-running sandbox, which is why the problem only showed up intermittently after a stop/resume.

Several independent issues combined to produce this:

  1. Real connection errors were swallowed. Once the connection handshake landed, the abort signal that stops the "did the command exit early?" check was also used to filter errors from attach(). So any failure that happened after the handshake (for example, the resumed session not yet exposing a route for the interactive port) was silently discarded instead of surfaced.

  2. The spinner kept the process alive. The progress spinner's teardown called ora.clear(), which only erases the current frame but leaves its render interval running. That timer keeps Node's event loop alive, so on any early teardown the CLI would sit forever on the spinner instead of exiting.

  3. Early server exits were opaque. When the in-sandbox interactive server exited before connecting, the CLI showed a generic "may have timed out" hint with no detail.

  4. The in-sandbox server trusted a stale config. pty-tunnel-server decided whether a server was already running purely from a leftover config file and a liveness check on its recorded PID. Across a snapshot/resume that config is restored from the snapshot while the original process is gone, so a coincidentally-reused PID made it connect to a dead socket and exit.
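To make issue 1 concrete, here is a minimal sketch of how an abort-based error filter can swallow a real failure. It is not the CLI's actual code: `ignoreAfterAbort` and `failingAttach` are hypothetical names, and the error message is invented for illustration.

```typescript
// Hypothetical helper: drops any error once `signal` has been aborted.
async function ignoreAfterAbort<T>(
  task: Promise<T>,
  signal: AbortSignal,
): Promise<T | undefined> {
  try {
    return await task;
  } catch (err) {
    if (signal.aborted) return undefined; // the error is silently discarded
    throw err;
  }
}

// Hypothetical stand-in for attach() failing after the handshake.
async function failingAttach(): Promise<void> {
  throw new Error("no route for interactive port");
}

async function demo(): Promise<{ filtered: string; direct: string }> {
  const connected = new AbortController();
  connected.abort(); // handshake landed, so the early-exit check is cancelled

  // Routed through the filter: the failure disappears.
  const filtered = await ignoreAfterAbort(failingAttach(), connected.signal).then(
    () => "swallowed",
    (e: Error) => e.message,
  );
  // Awaited directly: the failure propagates to the caller.
  const direct = await failingAttach().then(
    () => "ok",
    (e: Error) => e.message,
  );
  return { filtered, direct };
}
```

Reusing one signal for both "stop the early-exit check" and "ignore errors" is what makes every later `attach()` failure invisible in this sketch.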

Solution

  • Stop funneling attach() through the connection-established abort filter, so genuine connection failures propagate instead of being swallowed.
  • Always stop() the spinner on teardown (not just clear()), so a failure before the connection is established can no longer hang the process.
  • Include the in-sandbox server's stderr in the error when it exits before connecting, so the real cause is visible.
  • Have pty-tunnel-server health-check a server before reusing it, and remove any leftover config before spawning a new one, so a stale config restored from a snapshot can no longer cause a connection to a dead socket.
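The spinner change can be illustrated with a small stand-in. This is a sketch under assumptions, not ora's implementation: `SketchSpinner` and `teardown` are hypothetical, and the only point is that an active `setInterval` keeps Node's event loop alive until it is cleared.

```typescript
// Hypothetical stand-in for a terminal spinner; not ora's real implementation.
class SketchSpinner {
  private timer: ReturnType<typeof setInterval> | undefined;
  frame = 0;

  start(): void {
    if (this.timer === undefined) {
      // An active interval is a live handle, so Node will not exit while it runs.
      this.timer = setInterval(() => {
        this.frame += 1;
      }, 80);
    }
  }

  // Resets the current frame only; the interval keeps ticking.
  clear(): void {
    this.frame = 0;
  }

  // Cancels the interval so the event loop can drain.
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  get isSpinning(): boolean {
    return this.timer !== undefined;
  }
}

// Teardown that always stops the timer, whatever state the connection is in.
function teardown(spinner: SketchSpinner): void {
  spinner.stop();
}
```

Calling `clear()` on a started spinner leaves `isSpinning` true, while `teardown()` makes it false, which is the difference between a hung process and a clean exit.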

Together these turn the previous silent hang into either a working connection or a fast, legible error.
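For the stale-config fix, the decision logic might look roughly like the sketch below. The `ServerConfig` shape, `chooseServer`, and the synchronous `isHealthy` callback are assumptions made for illustration; the real health check in pty-tunnel-server is not shown here and would likely be asynchronous.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical config shape; the real file's fields are not shown in this PR.
interface ServerConfig {
  pid: number;
  socketPath: string;
}

// Signal 0 sends nothing; it only checks whether the PID exists.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Reuse only a server that passes the health check; otherwise clear the
// leftover config so a fresh server starts from a clean state.
function chooseServer(
  configPath: string,
  isHealthy: (config: ServerConfig) => boolean,
): "reuse" | "spawn" {
  let config: ServerConfig | null = null;
  try {
    config = JSON.parse(fs.readFileSync(configPath, "utf8")) as ServerConfig | null;
  } catch {
    config = null; // missing or unreadable config
  }
  if (config !== null && isPidAlive(config.pid) && isHealthy(config)) {
    return "reuse";
  }
  fs.rmSync(configPath, { force: true });
  return "spawn";
}

function demo(): { decision: string; configRemoved: boolean } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "pty-sketch-"));
  const configPath = path.join(dir, "config.json");
  // Our own PID is alive, mimicking a coincidentally reused PID after resume.
  const stale = { pid: process.pid, socketPath: path.join(dir, "dead.sock") };
  fs.writeFileSync(configPath, JSON.stringify(stale));
  const decision = chooseServer(configPath, () => false); // socket is dead
  const configRemoved = !fs.existsSync(configPath);
  fs.rmSync(dir, { recursive: true, force: true });
  return { decision, configRemoved };
}
```

A live PID alone is not enough to choose "reuse" in this sketch, which is the property that protects against a config restored from a snapshot.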

🤖 Generated with Claude Code

@vercel
Contributor

vercel Bot commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| sandbox | Ready | Preview, Comment, Open in v0 | Jun 1, 2026 7:55pm |
| sandbox-cli | Ready | Preview, Comment | Jun 1, 2026 7:55pm |
| sandbox-sdk | Ready | Preview, Comment | Jun 1, 2026 7:55pm |
| sandbox-sdk-ai-example | Ready | Preview, Comment | Jun 1, 2026 7:55pm |
| workflow-code-runner | Ready | Preview, Comment | Jun 1, 2026 7:55pm |


…andboxes

`sandbox connect` could hang on "Waiting for connection..." or fail when run
against a stopped/resumed sandbox. Four independent issues:

- The CLI swallowed real `attach()` failures: once the connection handshake
  landed, the same abort signal used to stop the premature-exit check also
  discarded any later `attach()` error, so failures were never surfaced.
- The spinner's disposer called `ora.clear()` instead of `stop()`, leaving the
  render interval running and keeping the event loop (and the CLI) alive
  indefinitely on teardown.
- When the interactive server exited early, the generic error hid the actual
  cause; we now include the server's stderr.
- The in-sandbox server (pty-tunnel-server) trusted a leftover
  /tmp/vercel/interactive/config.json restored from a snapshot whenever its
  recorded PID happened to be alive, connecting to a dead socket. It now
  health-checks a reused server and removes the stale config before spawning a
  fresh one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>