benchmarks: Harbor Terminal-Bench harness for Buzz agent-team orchestration (harbor-buzz-orchestra)#1504

Open: tlongwell-block wants to merge 25 commits into main from wren/harbor-buzz-orchestra

@tlongwell-block (Collaborator)

What

Adds benchmarks/harbor-buzz-orchestra/ — a Harbor (Terminal-Bench) benchmark harness that runs a team of Buzz agents (one orchestrator + N workers, defined entirely by a manifest YAML) against TB tasks, using the real production Buzz stack: the shipped relay compose bundle + a single Postgres, with per-trial isolated channels/keys and the pinned production buzz CLI for all agent communication.

Purpose: benchmark the "LLM auto-switching" orchestration strategy — a frontier-model orchestrator coordinating cheaper/faster worker models — measuring reward, cost, speed, and coordination overhead against a frontier-only baseline.

Structure

| Piece | Lines |
|---|---|
| Harbor adapter (src/: agent, manifest, subprocess runtime, terminal broker/MCP, verifier prep) | ~1,650 py |
| Testbed (testbed/src/: trial provisioner, keys, CLI wrapper) | ~380 py |
| Tests | ~1,330 py |
| Personas, manifests, SQL schema, endpoint/compose config | ~220 |

(Remaining insertions are generated: hash-locked verifier wheel manifests, vendored .whls for network-locked grading, and uv.lock.)

Coupling

  • Harbor is unmodified. Consumed as a plain pip dependency (harbor>=0.16.1,<0.18) via its documented custom-agent interface (BaseAgent/BaseEnvironment/AgentContext, ~10 imports across 4 files). TB graders/verifiers are byte-untouched; verifier prep only pre-bakes hash-locked wheels into the task image for offline grading.
  • Zero coupling into the product. No Rust crate, relay code, or migration is touched — the diff is 100% additive under benchmarks/. The benchmark consumes Buzz as a black box through the shipped binaries, exactly like a customer.

Validation

M1 gate closed with a valid scored result on this tip (2a23014d): run m1-laptop-advisory-20260703T161605Z, TB hello-world, reward 1.0, 1/1 passed, 48.48s wall, 25,675 prompt / 915 gen tokens, 0 recoveries/errors — full receipt with endpoint-side capture cross-check in the team work logs. Grader ran unchanged; offline-grading attestation via docker --network none.

Built by @wren (adapter/runtime/verifier-prep) and Eva (provisioner/personas/measurement lane).

npub12gtutshhh76rx0jx697f32f9tffd4hhp3hx58fp4x6u4uemkm7sqf8f757 and others added 25 commits July 3, 2026 08:14
Co-authored-by: npub12gtutshhh76rx0jx697f32f9tffd4hhp3hx58fp4x6u4uemkm7sqf8f757 <5217c5c2f7bfb4333e46d17c98a9255a52dadee18dcd43a43536b95e6776dfa0@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub12gtutshhh76rx0jx697f32f9tffd4hhp3hx58fp4x6u4uemkm7sqf8f757 <5217c5c2f7bfb4333e46d17c98a9255a52dadee18dcd43a43536b95e6776dfa0@sprout-oss.stage.blox.sqprod.co>
Implements the TrialHandle v1.1 contract against a live local Buzz stack:
channel-per-trial private provisioning, fresh per-agent Nostr keys with
NIP-OA owner attestation, advisory-locked idempotency on (run_id,
trial_id), archive-only teardown, and a compose override publishing
relay metrics and Postgres for the benchmark harness. Isolation
(cross-trial reads blocked by membership) is asserted by a live test
suite gated on BUZZ_TESTBED_LIVE=1.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
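
The advisory-locked idempotency on (run_id, trial_id) described above could be keyed roughly as follows. This is a minimal sketch, not the provisioner's code: `advisory_lock_key` is a hypothetical helper and the hashing scheme is an assumption, chosen only because PostgreSQL advisory locks take a signed 64-bit (bigint) key.

```python
import hashlib
import struct


def advisory_lock_key(run_id: str, trial_id: str) -> int:
    """Map (run_id, trial_id) to a signed 64-bit integer (hypothetical scheme)."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    material = b"".join(
        struct.pack(">I", len(part)) + part
        for part in (run_id.encode("utf-8"), trial_id.encode("utf-8"))
    )
    digest = hashlib.sha256(material).digest()
    # Interpret the first 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]


# Illustrative identifiers only; real run and trial ids come from the harness.
key = advisory_lock_key("example-run", "example-trial")
```

Under this assumption the key would be handed to something like `pg_advisory_xact_lock` inside the provisioning transaction, so a retried provision of the same trial waits for the first attempt instead of racing it.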
- personas/: orchestrator + worker prompts for the M1 hello-world gate
- manifests/m1-hello-world.yaml: 1+1 roster with pinned prompt hashes,
  local placeholder endpoints, zero prices (wiring proof, not accounting)
- testbed/sql/benchmark_schema.sql: idempotent harness-owned schema —
  trial_manifest, llm_receipts (post-run gateway ingestion, unique on
  (source, request_id)), and spans with queue-wait recorded separately
  from execution per the M1 serialized-broker policy

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
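
The uniqueness of llm_receipts on (source, request_id) is what makes post-run ingestion safe to replay. A small sketch of that behavior follows, using an in-memory SQLite database as a stand-in for Postgres. Every column other than `source` and `request_id` is hypothetical, and the real schema in testbed/sql/benchmark_schema.sql may differ.

```python
import sqlite3

# Stand-in schema: only source and request_id are taken from the commit message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_receipts (
    source        TEXT NOT NULL,
    request_id    TEXT NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    gen_tokens    INTEGER NOT NULL,
    UNIQUE (source, request_id)
);
"""


def ingest(conn: sqlite3.Connection, receipts) -> int:
    """Insert receipts, skipping already-ingested ones; return rows added."""
    count_sql = "SELECT COUNT(*) FROM llm_receipts"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT INTO llm_receipts (source, request_id, prompt_tokens, gen_tokens) "
        "VALUES (?, ?, ?, ?) ON CONFLICT (source, request_id) DO NOTHING",
        receipts,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)  # CREATE ... IF NOT EXISTS keeps the schema idempotent
batch = [("gateway", "req-1", 120, 8), ("gateway", "req-2", 95, 4)]
first = ingest(conn, batch)
second = ingest(conn, batch)  # replaying the same batch adds nothing
```

The same `ON CONFLICT ... DO NOTHING` clause is valid PostgreSQL, which is why it is used here rather than SQLite's `INSERT OR IGNORE`.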
Deployment-time mapping (outside the immutable manifest) from both
M1 manifest endpoint names to one local OpenAI-compatible llama-server
at 127.0.0.1:8091, using OPENAI_COMPAT_API_KEY / OPENAI_COMPAT_BASE_URL
per the pinned buzz-agent env contract at 6bb5208.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
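
As an illustration of keeping the endpoint mapping outside the manifest, a deployment-time lookup might look like the sketch below. The endpoint names, the `/v1` path suffix, and the placeholder key are assumptions; only the host/port and the two environment variable names come from the commit message.

```python
# Assumption: the llama-server serves its OpenAI-compatible API under /v1.
LOCAL_BASE_URL = "http://127.0.0.1:8091/v1"

# Hypothetical endpoint names; the real ones are defined in the M1 manifest.
ENDPOINT_MAP = {
    "orchestrator-llm": LOCAL_BASE_URL,
    "worker-llm": LOCAL_BASE_URL,
}


def agent_env(endpoint_name: str, api_key: str = "placeholder-key") -> dict[str, str]:
    """Return the env vars an agent process would receive for one endpoint."""
    if endpoint_name not in ENDPOINT_MAP:
        raise KeyError(f"no deployment mapping for endpoint {endpoint_name!r}")
    return {
        "OPENAI_COMPAT_BASE_URL": ENDPOINT_MAP[endpoint_name],
        "OPENAI_COMPAT_API_KEY": api_key,
    }


orchestrator_env = agent_env("orchestrator-llm")
worker_env = agent_env("worker-llm")
```

Because the manifest only names endpoints, pointing both names at one server changes nothing that is hashed or pinned.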
Tyler's directive: agents must use the real buzz CLI, not a bespoke
messaging tool. Both M1 personas now name their exec surfaces —
orchestrator gets buzz_exec only; workers get exec (Harbor task
container) plus buzz_exec (host-side, per-agent identity) — and state
that a turn is not complete until the message is published via
'messages send'. Channel id arrives in the task seed, per Wren's
runtime boundary.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Follow-up to 63f4497, which changed both persona bodies without
re-pinning them. Pins now match the buzz_exec persona texts
(orchestrator 8c263914…, worker 78ffff9e…), verified with shasum
against the working tree.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
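
A pin check of the kind described above can be sketched in a few lines. This assumes the pins are hex SHA-256 digests of the raw persona file bytes (a later commit describes the personas as sha256-pinned); note that `shasum` defaults to SHA-1 unless `-a 256` is passed. The persona text and helper names here are illustrative, not the repository's.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def pin_matches(persona_bytes: bytes, pinned_hex: str) -> bool:
    """True when the persona body still hashes to the manifest pin."""
    return sha256_hex(persona_bytes) == pinned_hex.strip().lower()


# Illustrative persona text, not the real prompt.
persona_v1 = b"You are the orchestrator.\n"
pin = sha256_hex(persona_v1)

# Editing the body without re-pinning is exactly the drift this commit fixes.
persona_v2 = persona_v1 + b"Use buzz_exec only.\n"
stale = not pin_matches(persona_v2, pin)
```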
M1 run m1-buzz-cli-20260703T135823Z proved the full real-CLI path
(delegation published, task executed, report published) but stalled
because the worker's report mentioned nobody — the mentions-only
orchestrator never woke to verify and publish DONE. Worker persona now
requires every report to open with an @mention of the assigning agent
and to thread via --reply-to when the assignment event id is visible.
Manifest worker pin updated in the same commit (2c7fac21…);
orchestrator persona and pin unchanged.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
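
The stall is easy to reproduce in a toy model of a mentions-only agent. The sketch below is hypothetical throughout: `Message`, `wakes`, and `make_report` are illustrative names, and the real relay's mention mechanism is not shown in this PR.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    author: str
    body: str
    mentions: frozenset = field(default_factory=frozenset)


def wakes(agent: str, message: Message) -> bool:
    """A mentions-only agent reacts only to messages that mention it."""
    return agent in message.mentions


def make_report(worker: str, assigner: str, text: str) -> Message:
    """Build a report that opens with an @mention of the assigning agent."""
    return Message(
        author=worker,
        body=f"@{assigner} {text}",
        mentions=frozenset({assigner}),
    )


silent_report = Message(author="worker", body="hello.txt written")
fixed_report = make_report("worker", "orchestrator", "hello.txt written")
```

In this model the silent report never wakes the orchestrator, while the fixed report does, which matches the failure and the persona change described above.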
…task workdir

Trial e0f4ee58 (run m1-laptop-advisory-20260703T150426Z) failed
independently of the endpoint outage: the orchestrator invented a
host-shaped absolute path for hello.txt, and the worker's mkdir -p
masked the mismatch, landing the file where the grader never looks.

Orchestrator: reference files by bare relative name unless the task
names a path. Worker: create files in the terminal working directory
and report suspicious absolute paths instead of mkdir -p'ing them.
Manifest pins re-hashed in the same commit (atomic persona+pin rule).

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
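
The worker-side rule (create files in the working directory, flag suspicious absolute paths) is enforced through persona text in this PR, but the same check is simple to express in code. The sketch below is an assumption-laden illustration: `/app` is a made-up working directory and `resolve_in_workdir` is a hypothetical helper. It covers only the default case where the task names no path.

```python
from pathlib import PurePosixPath


class SuspiciousPathError(ValueError):
    """Raised instead of silently creating directories for an odd path."""


def resolve_in_workdir(workdir: str, requested: str) -> PurePosixPath:
    """Resolve a bare relative name inside the task working directory."""
    path = PurePosixPath(requested)
    if path.is_absolute():
        raise SuspiciousPathError(f"absolute path not named by the task: {requested}")
    if ".." in path.parts:
        raise SuspiciousPathError(f"path escapes the working directory: {requested}")
    return PurePosixPath(workdir) / path


target = resolve_in_workdir("/app", "hello.txt")
```

Raising here surfaces the mismatch that `mkdir -p` had previously masked.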
Run m1-laptop-advisory-20260703T152743Z failed the exact-byte probe
because the orchestrator added '(no trailing newline)' to its
delegation — a constraint the task never stated — and the worker
obeyed. hello-world's grader wants 'Hello, world!\n'; printf without
newline lost to echo's default.

Generalize the path rule into a fabrication rule: relay the task's
requirements verbatim, add no invented constraints (paths, encodings,
byte-level rules), and let standard tool defaults apply where the task
is silent. Orchestrator pin re-hashed in the same commit.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
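
The one-byte difference behind this failure can be shown directly. The expected bytes come from the commit message; the comparison function is a hypothetical stand-in for the exact-byte probe, not the grader itself.

```python
# From the commit message: the hello-world grader wants 'Hello, world!\n'.
EXPECTED = b"Hello, world!\n"

# What a newline-free write (printf-style) puts in the file.
printf_style = "Hello, world!".encode("utf-8")

# What a default echo-style write produces: the text plus a trailing newline.
echo_style = ("Hello, world!" + "\n").encode("utf-8")


def exact_match(actual: bytes, expected: bytes = EXPECTED) -> bool:
    """Byte-for-byte comparison, with no whitespace normalization."""
    return actual == expected


printf_ok = exact_match(printf_style)
echo_ok = exact_match(echo_style)
```

Only the echo-style bytes match, so the invented "(no trailing newline)" constraint was enough to fail an otherwise correct trial.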
Preparing cobol-modernization surfaced four hello-world assumptions in the
verifier prep pass: the FROM line is not always first, tasks may omit the
[verifier.env] table, the uv install shim was pinned to one version, and a
prebuilt docker_image pin silently made Harbor skip the prepared Dockerfile
entirely. Fix all four and commit the cobol-modernization wheel lock
alongside hello-world's.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
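
The first of the four assumptions (FROM is not always the first line) comes up because a Dockerfile may open with comments, parser directives, or ARG instructions. A minimal sketch of locating the first FROM follows; `first_from_index` is a hypothetical helper, not the verifier prep code, and it does not handle backslash line continuations.

```python
def first_from_index(dockerfile_text: str) -> int:
    """Return the 0-based line index of the first FROM instruction."""
    for index, line in enumerate(dockerfile_text.splitlines()):
        stripped = line.strip()
        # Skip blank lines, comments, and parser directives such as "# syntax=".
        if not stripped or stripped.startswith("#"):
            continue
        # Instruction names are case-insensitive; ARG may legally precede FROM.
        if stripped.split(None, 1)[0].upper() == "FROM":
            return index
    raise ValueError("no FROM instruction found")


example = (
    "# syntax=docker/dockerfile:1\n"
    "ARG BASE=python:3.12-slim\n"
    "\n"
    "FROM ${BASE}\n"
    "RUN true\n"
)
from_line = first_from_index(example)
```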
Manifest, sha256-pinned Terminal-Bench personas, and an Anthropic endpoint
config for the 1x claude-sonnet-4-6 orchestrator + 2x claude-haiku-4-5
worker team. The orchestrator must assign verification to a different
worker than the one whose work is being verified: independent review, and
it keeps every roster member engaged. Scored 1.0 on cobol-modernization
end-to-end over the live Anthropic API.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
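
In this PR the independent-review rule lives in the orchestrator persona, so the model applies it rather than code. Purely as an illustration of the rule, and assuming a roster of unique worker names, a deterministic version could look like this (`pick_verifier` is hypothetical):

```python
def pick_verifier(author: str, workers: list[str]) -> str:
    """Pick a worker other than the author, starting with the next in roster order."""
    start = workers.index(author) + 1 if author in workers else 0
    # Rotate the roster so the search begins just after the author.
    rotated = workers[start:] + workers[:start]
    for candidate in rotated:
        if candidate != author:
            return candidate
    raise ValueError("independent review needs a worker other than the author")


roster = ["worker-1", "worker-2"]
verifier_for_1 = pick_verifier("worker-1", roster)
verifier_for_2 = pick_verifier("worker-2", roster)
```

With the two-worker roster from this commit, each worker verifies the other's output, which also keeps both engaged.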
scripts/run_leaderboard.py takes a problem set (registry dataset or local
path), attempts per problem, and a team manifest, and produces a
leaderboard-ready job directory: it does not accept or forward any timeout
or resource override that Harbor's static validation rejects, derives a
schema-valid metadata.yaml from the manifest roster, and prints the
upload/submit commands. Tests pin the no-overrides invariant and the
metadata schema against Harbor's own loader.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
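
One way to get a no-overrides invariant is to make the command-line surface too small to express an override. The sketch below is not the actual scripts/run_leaderboard.py: the flag names are assumptions and the values are illustrative. It shows only that a strict parser rejects an unexpected `--timeout`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the three documented inputs (flag names assumed)."""
    parser = argparse.ArgumentParser(prog="run_leaderboard.py", allow_abbrev=False)
    parser.add_argument("--problem-set", required=True)
    parser.add_argument("--attempts", type=int, required=True)
    parser.add_argument("--manifest", required=True)
    return parser


def parse_strict(argv: list[str]) -> argparse.Namespace:
    # parse_args (unlike parse_known_args) exits on any unrecognized flag.
    return build_parser().parse_args(argv)


base_args = [
    "--problem-set", "example-dataset",
    "--attempts", "3",
    "--manifest", "manifests/example.yaml",
]
accepted = parse_strict(base_args)
```

A test can then pin the invariant by asserting that any extra flag, for example a timeout, raises `SystemExit`.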