benchmarks: Harbor Terminal-Bench harness for Buzz agent-team orchestration (harbor-buzz-orchestra) #1504
Open
tlongwell-block wants to merge 25 commits into
Co-authored-by: npub12gtutshhh76rx0jx697f32f9tffd4hhp3hx58fp4x6u4uemkm7sqf8f757 <5217c5c2f7bfb4333e46d17c98a9255a52dadee18dcd43a43536b95e6776dfa0@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub12gtutshhh76rx0jx697f32f9tffd4hhp3hx58fp4x6u4uemkm7sqf8f757 <5217c5c2f7bfb4333e46d17c98a9255a52dadee18dcd43a43536b95e6776dfa0@sprout-oss.stage.blox.sqprod.co>
Implements the TrialHandle v1.1 contract against a live local Buzz stack: channel-per-trial private provisioning, fresh per-agent Nostr keys with NIP-OA owner attestation, advisory-locked idempotency on (run_id, trial_id), archive-only teardown, and a compose override publishing relay metrics and Postgres for the benchmark harness. Isolation (cross-trial reads blocked by membership) is asserted by a live test suite gated on BUZZ_TESTBED_LIVE=1. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
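A minimal sketch of how an advisory-lock key for (run_id, trial_id) could be derived. The commit does not show the provisioner's actual derivation, so the function name `advisory_lock_key` and the hashing scheme are assumptions for illustration. The only outside fact relied on is that Postgres advisory-lock functions such as `pg_advisory_xact_lock` accept a signed 64-bit key.

```python
import hashlib
import struct


def advisory_lock_key(run_id: str, trial_id: str) -> int:
    """Map (run_id, trial_id) to a signed 64-bit integer (hypothetical helper)."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") produce different payloads.
    payload = b"".join(
        struct.pack(">I", len(part)) + part
        for part in (run_id.encode("utf-8"), trial_id.encode("utf-8"))
    )
    digest = hashlib.sha256(payload).digest()
    # Advisory locks take a signed bigint, so read the first 8 bytes as one.
    return struct.unpack(">q", digest[:8])[0]
```

A provisioner could take the lock on this key inside the transaction that creates the trial, so a retried call with the same pair waits for the first call and then finds the existing trial instead of creating a second one.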
- personas/: orchestrator + worker prompts for the M1 hello-world gate
- manifests/m1-hello-world.yaml: 1+1 roster with pinned prompt hashes, local placeholder endpoints, zero prices (wiring proof, not accounting)
- testbed/sql/benchmark_schema.sql: idempotent harness-owned schema with trial_manifest, llm_receipts (post-run gateway ingestion, unique on (source, request_id)), and spans with queue-wait recorded separately from execution per the M1 serialized-broker policy
Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
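The idempotency described for llm_receipts can be sketched as follows. This uses an in-memory SQLite database as a stand-in for Postgres. Only the table name and the (source, request_id) uniqueness come from the commit message; the token columns and the `ingest` helper are hypothetical.

```python
import sqlite3

# Re-runnable schema: IF NOT EXISTS makes applying it twice a no-op.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_receipts (
    source        TEXT NOT NULL,
    request_id    TEXT NOT NULL,
    prompt_tokens INTEGER NOT NULL,  -- hypothetical column
    gen_tokens    INTEGER NOT NULL,  -- hypothetical column
    UNIQUE (source, request_id)
);
"""


def count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM llm_receipts").fetchone()[0]


def ingest(conn: sqlite3.Connection, receipts) -> int:
    """Insert receipts, skipping stored (source, request_id) pairs; return rows added."""
    before = count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO llm_receipts VALUES (?, ?, ?, ?)", receipts
    )
    conn.commit()
    return count(conn) - before


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript(SCHEMA)  # applying the schema a second time changes nothing
rows = [("gateway", "req-1", 100, 10), ("gateway", "req-2", 200, 20)]
first = ingest(conn, rows)
second = ingest(conn, rows)  # replaying the same batch adds no rows
```

In Postgres the equivalent of `INSERT OR IGNORE` would be `INSERT ... ON CONFLICT DO NOTHING`; whether the harness uses that form is not stated here.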
Deployment-time mapping (outside the immutable manifest) from both M1 manifest endpoint names to one local OpenAI-compatible llama-server at 127.0.0.1:8091, using OPENAI_COMPAT_API_KEY / OPENAI_COMPAT_BASE_URL per the pinned buzz-agent env contract at 6bb5208. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
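As a rough illustration of such a deployment-time mapping, the sketch below resolves a manifest endpoint name to the two environment variables named above. The endpoint names, the `/v1` path suffix, and the `agent_env` helper are assumptions; the real names live in the M1 manifest and endpoint config.

```python
import os

# Hypothetical manifest endpoint names, both resolved to the one local server.
# The "/v1" suffix is an assumption about the OpenAI-compatible route.
ENDPOINT_MAP = {
    "orchestrator-endpoint": "http://127.0.0.1:8091/v1",
    "worker-endpoint": "http://127.0.0.1:8091/v1",
}


def agent_env(endpoint_name: str, api_key: str, base_env=None) -> dict:
    """Build a child-process environment for one agent (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # An unknown endpoint name raises KeyError instead of falling back silently.
    env["OPENAI_COMPAT_BASE_URL"] = ENDPOINT_MAP[endpoint_name]
    env["OPENAI_COMPAT_API_KEY"] = api_key
    return env
```

Keeping the map outside the manifest means the manifest hash stays stable when the same roster is pointed at a different server.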
Tyler's directive: agents must use the real buzz CLI, not a bespoke messaging tool. Both M1 personas now name their exec surfaces — orchestrator gets buzz_exec only; workers get exec (Harbor task container) plus buzz_exec (host-side, per-agent identity) — and state that a turn is not complete until the message is published via 'messages send'. Channel id arrives in the task seed, per Wren's runtime boundary. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Follow-up to 63f4497, which changed both persona bodies without re-pinning them. Pins now match the buzz_exec persona texts (orchestrator 8c263914…, worker 78ffff9e…), verified with shasum against the working tree. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
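The pin check can be approximated in a few lines. This sketch assumes the pins are SHA-256 hex digests of the persona file contents, which matches the "sha256-pinned" wording used elsewhere in this PR; the helper names are hypothetical.

```python
import hashlib


def persona_digest(text: str) -> str:
    """SHA-256 hex digest of the persona text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pin_matches(text: str, pinned_hex: str) -> bool:
    """True only if the current persona text hashes to the manifest pin."""
    return persona_digest(text) == pinned_hex.strip().lower()
```

Any edit to a persona body, even added whitespace, changes the digest, which is why a persona change without a re-pin shows up as a mismatch.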
M1 run m1-buzz-cli-20260703T135823Z proved the full real-CLI path (delegation published, task executed, report published) but stalled because the worker's report mentioned nobody — the mentions-only orchestrator never woke to verify and publish DONE. Worker persona now requires every report to open with an @mention of the assigning agent and to thread via --reply-to when the assignment event id is visible. Manifest worker pin updated in the same commit (2c7fac21…); orchestrator persona and pin unchanged. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
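The routing requirement can be stated as a small predicate. This is only a sketch: the rule itself lives in the worker persona prompt, the handle "orchestrator" is a placeholder, and the exact mention syntax Buzz recognizes is not shown in this commit.

```python
def report_is_routable(report: str, assigner: str) -> bool:
    """True if the report opens with an @mention of the assigning agent."""
    words = report.split(None, 1)
    # Tolerate trailing punctuation such as "@orchestrator:" or "@orchestrator,".
    return bool(words) and words[0].rstrip(",:") == f"@{assigner}"
```

A report that mentions the assigner later in the text, or not at all, fails the check; in the stalled run that is the case where a mentions-only orchestrator never woke.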
…task workdir
Trial e0f4ee58 (run m1-laptop-advisory-20260703T150426Z) failed independently of the endpoint outage: the orchestrator invented a host-shaped absolute path for hello.txt, and the worker's mkdir -p masked the mismatch, landing the file where the grader never looks. Orchestrator: reference files by bare relative name unless the task names a path. Worker: create files in the terminal working directory and report suspicious absolute paths instead of mkdir -p'ing them. Manifest pins re-hashed in the same commit (atomic persona+pin rule).
Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
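The worker-side rule can be illustrated as a path check. In the harness the rule is enforced by the persona prompt, not by code, so this is a sketch of the intended behavior; the `/app` working directory and the helper name are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_in_workdir(workdir: str, requested: str) -> PurePosixPath:
    """Resolve a requested file path, refusing anything outside the working directory."""
    path = PurePosixPath(requested)
    if ".." in path.parts:
        raise ValueError(f"suspicious path, report it: {requested}")
    if path.is_absolute():
        try:
            # relative_to raises ValueError when path is not under workdir.
            path.relative_to(workdir)
        except ValueError:
            raise ValueError(f"suspicious absolute path, report it: {requested}") from None
        return path
    # Bare relative names land in the terminal working directory.
    return PurePosixPath(workdir) / path
```

With a check like this, a host-shaped path such as `/Users/someone/hello.txt` is reported back to the assigner instead of being created with `mkdir -p`.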
Run m1-laptop-advisory-20260703T152743Z failed the exact-byte probe because the orchestrator added '(no trailing newline)' to its delegation — a constraint the task never stated — and the worker obeyed. hello-world's grader wants 'Hello, world!\n'; printf without newline lost to echo's default. Generalize the path rule into a fabrication rule: relay the task's requirements verbatim, add no invented constraints (paths, encodings, byte-level rules), and let standard tool defaults apply where the task is silent. Orchestrator pin re-hashed in the same commit. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
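The byte-level difference behind this failure is small enough to show directly. The expected content is taken from the commit message; the comparison function is a stand-in, not the grader's actual code.

```python
EXPECTED = b"Hello, world!\n"  # content the hello-world grader wants


def exact_match(produced: bytes) -> bool:
    """Byte-for-byte comparison, as an exact-byte probe would do."""
    return produced == EXPECTED


# echo appends a newline by default; printf writes only what it is given.
echo_output = b"Hello, world!" + b"\n"
printf_output = b"Hello, world!"
```

Here `exact_match(echo_output)` is true and `exact_match(printf_output)` is false, so a single invented "no trailing newline" constraint is enough to fail the probe.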
Preparing cobol-modernization surfaced four hello-world assumptions in the verifier prep pass: the FROM line is not always first, tasks may omit the [verifier.env] table, the uv install shim was pinned to one version, and a prebuilt docker_image pin silently made Harbor skip the prepared Dockerfile entirely. Fix all four and commit the cobol-modernization wheel lock alongside hello-world's. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
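Two of the four assumptions can be sketched in a few lines: locating the FROM instruction without assuming it is on the first line, and treating a missing [verifier.env] table as empty. Both helpers are hypothetical illustrations, not the verifier prep code, and the FROM scan deliberately ignores backslash line continuations.

```python
def first_from_index(dockerfile_text: str) -> int:
    """Return the 0-based line index of the first FROM instruction."""
    for index, line in enumerate(dockerfile_text.splitlines()):
        tokens = line.strip().split(None, 1)
        # Comments, parser directives, and ARG lines may precede FROM.
        if tokens and tokens[0].upper() == "FROM":
            return index
    raise ValueError("no FROM instruction found")


def verifier_env(task_config: dict) -> dict:
    """Return the [verifier.env] table from a parsed task config, or {} if absent."""
    return dict(task_config.get("verifier", {}).get("env", {}))
```

For a Dockerfile that opens with a syntax directive and an ARG, `first_from_index` returns the index of the third line, which is where an injected step would need to go.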
Manifest, sha256-pinned Terminal-Bench personas, and an Anthropic endpoint config for the 1x claude-sonnet-4-6 orchestrator + 2x claude-haiku-4-5 worker team. The orchestrator must assign verification to a different worker than the one whose work is being verified: independent review, and it keeps every roster member engaged. Scored 1.0 on cobol-modernization end-to-end over the live Anthropic API. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
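The independent-review rule amounts to a simple invariant, sketched below. In the harness the orchestrator follows this rule from its persona prompt; the function and the worker names are hypothetical and only state the invariant in code.

```python
def pick_verifier(workers: list, author: str) -> str:
    """Choose a verifier that is not the worker whose output is being checked."""
    candidates = [worker for worker in workers if worker != author]
    if not candidates:
        raise ValueError("independent review needs at least two workers")
    return candidates[0]
```

With the two-worker roster described above, the verifier is always the other worker, which also keeps both roster members engaged.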
scripts/run_leaderboard.py takes a problem set (registry dataset or local path), attempts per problem, and a team manifest, and produces a leaderboard-ready job directory: it does not accept or forward any timeout or resource override that Harbor's static validation rejects, derives a schema-valid metadata.yaml from the manifest roster, and prints the upload/submit commands. Tests pin the no-overrides invariant and the metadata schema against Harbor's own loader. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
What
Adds benchmarks/harbor-buzz-orchestra/, a Harbor (Terminal-Bench) benchmark harness that runs a team of Buzz agents (one orchestrator + N workers, defined entirely by a manifest YAML) against TB tasks, using the real production Buzz stack: the shipped relay compose bundle plus a single Postgres, with per-trial isolated channels/keys and the pinned production buzz CLI for all agent communication.
Purpose: benchmark the "LLM auto-switching" orchestration strategy, in which a frontier-model orchestrator coordinates cheaper/faster worker models, measuring reward, cost, speed, and coordination overhead against a frontier-only baseline.
Structure
src/: agent, manifest, subprocess runtime, terminal broker/MCP, verifier prep)
testbed/src/: trial provisioner, keys, CLI wrapper)
(Remaining insertions are generated: hash-locked verifier wheel manifests, vendored .whls for network-locked grading, and uv.lock.)
Coupling
harbor>=0.16.1,<0.18) via its documented custom-agent interface (BaseAgent/BaseEnvironment/AgentContext, ~10 imports across 4 files). TB graders/verifiers are byte-untouched; verifier prep only pre-bakes hash-locked wheels into the task image for offline grading.
benchmarks/. The benchmark consumes Buzz as a black box through the shipped binaries, exactly like a customer.
Validation
M1 gate closed with a valid scored result on this tip (2a23014d): run m1-laptop-advisory-20260703T161605Z, TB hello-world, reward 1.0, 1/1 passed, 48.48s wall, 25,675 prompt / 915 gen tokens, 0 recoveries/errors. The full receipt, with endpoint-side capture cross-check, is in the team work logs. Grader ran unchanged; offline-grading attestation via docker --network none.
Built by @wren (adapter/runtime/verifier-prep) and Eva (provisioner/personas/measurement lane).