Skip to content

fix: balance temporary sandbox leases for uploads#3261

Open
BXL1015 wants to merge 3 commits into
bytedance:mainfrom
BXL1015:fix/upload-sandbox-lifecycle
Open

fix: balance temporary sandbox leases for uploads#3261
BXL1015 wants to merge 3 commits into
bytedance:mainfrom
BXL1015:fix/upload-sandbox-lifecycle

Conversation

@BXL1015
Copy link
Copy Markdown

@BXL1015 BXL1015 commented May 26, 2026

What

  • Release the temporary sandbox lease acquired by the HTTP upload route after file sync completes or fails.
  • Track active AIO sandbox acquire leases so a temporary upload/Feishu sync release cannot move a sandbox to the warm pool while another in-process caller is still using it.
  • Balance the same temporary acquire pattern in Feishu file sync.

Why

In remote/provisioner deployments, acquiring a sandbox may create or reuse a Kubernetes Pod plus NodePort Service. The upload route temporarily acquires the thread sandbox so it can sync uploaded files into non-mounted sandboxes, but the lease was not released on success, 413, or failure paths.

A simple unconditional finally: release() is risky because the same thread may already have an active sandbox from the agent runtime. In that case, releasing the temporary upload acquire should only decrement that temporary lease, not park the actively used sandbox in the warm pool.

How

  • Add per-sandbox lease counts in AioSandboxProvider.
  • Increment the lease count when an active in-process sandbox is reused.
  • Initialize lease counts when sandboxes are created, discovered, or reclaimed from the warm pool.
  • Make release() decrement first and only move the sandbox to the warm pool when the final lease is released.
  • Add focused regression coverage for upload, Feishu sync, provider lease reuse, and orphan reconciliation compatibility.

Testing

  • uv run ruff check app/gateway/routers/uploads.py app/channels/feishu.py packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py tests/test_uploads_router.py tests/test_aio_sandbox_provider.py tests/test_sandbox_orphan_reconciliation.py tests/test_feishu_parser.py
  • uv run ruff format --check app/gateway/routers/uploads.py app/channels/feishu.py packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py tests/test_uploads_router.py tests/test_aio_sandbox_provider.py tests/test_sandbox_orphan_reconciliation.py tests/test_feishu_parser.py
  • git diff --check
  • uv run pytest tests/test_uploads_router.py::test_upload_files_acquires_non_local_sandbox_before_writing tests/test_uploads_router.py::test_upload_files_fails_before_writing_when_non_local_sandbox_unavailable tests/test_uploads_router.py::test_upload_files_releases_non_local_sandbox_when_get_fails tests/test_uploads_router.py::test_upload_files_does_not_sync_non_local_sandbox_when_total_size_exceeds_limit tests/test_uploads_router.py::test_upload_files_does_not_sync_non_local_sandbox_when_conversion_fails tests/test_uploads_router.py::test_upload_files_releases_non_local_sandbox_when_sync_fails tests/test_aio_sandbox_provider.py::test_reused_active_sandbox_requires_matching_releases tests/test_feishu_parser.py::test_feishu_receive_single_file_releases_sandbox_after_sync tests/test_feishu_parser.py::test_feishu_receive_single_file_releases_sandbox_when_sync_fails
  • uv run pytest tests/test_uploads_router.py -k "not symlink"
  • uv run pytest tests/test_aio_sandbox_provider.py tests/test_sandbox_orphan_reconciliation.py tests/test_feishu_parser.py
  • uv run pytest tests/test_sandbox_middleware.py
  • Production-like k3d validation with a real provisioner-backed AIO sandbox: created a Pod + NodePort Service, uploaded a file through the upload route, verified the upload lease was released into the warm pool, verified repeated acquire requires matching releases, downloaded the synced file from the sandbox API, and verified shutdown removed the provisioner sandbox.

Copilot AI review requested due to automatic review settings May 26, 2026 15:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR ensures non-local sandboxes are properly released after file upload / sync flows, and adds lease counting so repeated acquires of the same sandbox require matching releases.

Changes:

  • Add sandbox release in upload_files() and Feishu file-receive sync paths (including failure paths).
  • Introduce _lease_counts in AioSandboxProvider to prevent a “temporary” acquire/release from tearing down another caller’s active sandbox.
  • Expand/adjust tests to validate release/lease behavior across success and failure scenarios.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backend/app/gateway/routers/uploads.py Releases acquired sandboxes in more error/success paths during upload + sync.
backend/app/channels/feishu.py Ensures sandbox acquired for file sync is released via finally.
backend/packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py Adds lease counting to make sandbox lifecycle safe across multiple acquires.
backend/tests/test_uploads_router.py Adds assertions/tests to verify sandbox release behavior for uploads.
backend/tests/test_feishu_parser.py Adds tests to verify Feishu file receive releases sandboxes on success/failure.
backend/tests/test_aio_sandbox_provider.py Adds test coverage for lease-counted acquire/release behavior.
backend/tests/test_sandbox_orphan_reconciliation.py Updates provider test fixture initialization to include _lease_counts.

Comment thread backend/app/channels/feishu.py
Comment thread backend/app/gateway/routers/uploads.py Outdated
@WillemJiang
Copy link
Copy Markdown
Collaborator

@BXL1015, thanks for your contribution. Please create an issue to track the idea of this change, it looks more like a feature than a bug fix for me.

@BXL1015
Copy link
Copy Markdown
Author

BXL1015 commented May 28, 2026

Thanks for the guidance. Issue #3296 has been created.

@WillemJiang WillemJiang added the reviewing The PR is in reviewing status label May 29, 2026
@WillemJiang
Copy link
Copy Markdown
Collaborator

@BXL1015, here are more comments for the code

  1. Race condition: _lease_counts mutated without self._lock in _reuse_in_process_sandbox (Medium)

The Copilot review flagged this correctly. In _reuse_in_process_sandbox, the lease count increment happens outside self._lock:

# _reuse_in_process_sandbox — no lock held here
lease_counts = self._lease_counts_map()
lease_counts[existing_id] = lease_counts.get(existing_id, 0) + 1

Butrelease()decrements under self._lock. This is a read-modify-write without synchronization — concurrent threads calling acquire() for the same thread could lose increments.

The existing code already mutates _last_activity and reads _sandboxes without self._lock in this same method, so this may be intentionally lock-free (relying on the per-thread _thread_locks for serialization). However, the PR should clarify this intent, and if the per-thread lock already serializes access, a comment would help future readers.

  1. release() fallback logic is fragile (Low)

active_leases = lease_counts.get(sandbox_id, 1 if sandbox_id in self._sandboxes else 0)

When a sandbox_id isn't in _lease_counts but is in _sandboxes, the code assumes a count of 1 and tears down the sandbox. This is a reasonable fallback for legacy sandboxes created before this PR, but it means a missing _lease_counts entry is silently treated as "safe to release."
Consider logging when this fallback fires, so it's detectable if the counting logic has bugs.

  1. _lease_counts_map() lazy initialization (Low)

The _lease_counts_map() method lazily creates the dict for test-constructed providers. This is fine for backward compatibility, but every other dict (_sandboxes, _sandbox_infos) is initialized in init — the _lease_counts is too. This method adds complexity to a case that could be caught by a simpler assertion. Minor, but worth noting.

@WillemJiang WillemJiang added the question Further information is requested label May 30, 2026
@WillemJiang WillemJiang added this to the 2.0.0 milestone May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

question Further information is requested reviewing The PR is in reviewing status

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants