Skip to content

feat(agents): tool loader Part 2 — explicit escape hatch + tuning (#1450)#1762

Open
alexey-tyurin wants to merge 4 commits into
amd:mainfrom
alexey-tyurin:feat/tool-loader-escape-hatch
Open

feat(agents): tool loader Part 2 — explicit escape hatch + tuning (#1450)#1762
alexey-tyurin wants to merge 4 commits into
amd:mainfrom
alexey-tyurin:feat/tool-loader-escape-hatch

Conversation

@alexey-tyurin

Copy link
Copy Markdown
Contributor

Summary

Adds the explicit escape hatch to the dynamic tool loader: a native tool-calling model can now recover a tool the per-turn selector didn't surface by calling an always-available load_tools(bundle) meta-tool, and the loader emits an escape-hatch activation-rate signal for tuning the semantic threshold. This is Part 2 of the tool-loader milestone (#688), builds on Part 1 (#1449), and is off by default.

Why

Part 1 trims each turn's tool prompt to a semantically-matched subset to cut first-turn TTFT — but that left one real gap. A native tool-calling model (the default doc model, Gemma-4-E4B) is physically limited to the schemas in tools=, so if the selector didn't surface a tool the model literally cannot call it, and a task can dead-end because a needed tool is permanently unreachable. Concretely: "How many PTO days?" with nothing indexed scores the search/index tools below the threshold, so before this change the native model had no way to reach them and the task failed; after it, the model loads the bundle it needs and calls the tool on its next step — hard recall failures drop to zero (that scenario now recovers and PASSes 9.7/10 with the loader on). Part 1's free-recovery net only ever helped non-tool-calling models.

Linked issue

Closes #1450

Changes

  • Always-on load_tools(bundle) escape hatch for native models — added to the doc profile's CORE set so it renders in both the text prompt and the native tools= schema every active turn; registered only when the loader is active, so the default-off doc path stays byte-identical.
  • Native-only bundle menu in the stable system-prompt prefix (no KV-cache thrash) listing the loadable bundles + one-line descriptions; non-native models already self-recover, so they aren't taxed with it.
  • Recall merge gate now counts native recoverygaia.eval.tool_recall unions the mid-loop load_tools line into its turn and drops Part 1's native "known gap" exemption, so a genuine miss fails the gate on every model.
  • Escape-hatch activation-rate (τ-tuning) signal — the loader counts both recovery paths; the recall gate reports a per-turn rate derived from raw log events (so it works on eval runs), and a per-session TOOL_LOADER_SESSION summary also emits on the gaia chat/CLI path.
  • Removed the now-obsolete "no escape hatch until Part 2" warning (Part 2 closes the gap it announced).

Test plan

  • python util/lint.py --all → all checks PASS (Black/isort/Pylint/Flake8/imports/dependabot).
  • python -m pytest tests/unit/test_tool_loader_selection.py tests/unit/test_tool_recall.py tests/unit/test_chat_dynamic_tools.py tests/unit/test_chat_tool_bundles.py tests/unit/test_tool_loader_token_budget.py86 passed.
  • Off-state is byte-identical (default): with the toggle off, the doc agent registers no load_tools and renders the 37-tool prompt unchanged — pinned by test_tool_loader_token_budget.py and asserted by test_load_tools_registered_only_when_loader_active.
  • Loader-on e2e (needs Lemonade + the doc model): start the backend with GAIA_DYNAMIC_TOOLS=1 on :4200 (… ui.server … 2>&1 | tee /tmp/server.log), then gaia eval agent --scenario smart_discovery --agent-type doc --timeout 1400. Confirm grep '"event": "load_tools"' /tmp/server.log shows the recovery, and python -m gaia.eval.tool_recall <run_dir> --log /tmp/server.log prints recall 100% + the τ-rate.

Tool Loader Part 2: Verification & Proof Report

Condensed; full evidence in GAIA-1450-verification.md. Capstone: one PASSing loader-on run (eval-20260619-010110, smart_discovery, 9.7/10, 484s) where the native model recovered a missed tool via load_tools and completed the task — SC1 + SC3 + SC4 all fall out of it.

# Success criterion Status
SC1 Native model recovers a semantically-missed tool via load_tools within one extra turn (e2e) ✅ recovery + task PASS 9.7/10
SC2 Non-tool-calling free-recovery path verified (unlisted named tool still executes) ✅ mechanism proven (invariant + real-loader log); live demo needs a non-tool-calling model
SC3 Hard recall failures = 0 ✅ recall gate 100%, exit 0
SC4 Escape-hatch activation rate logged per session, usable as the τ-tuning signal ✅ emitted by the loader; recall gate derives it from eval logs

SC1 evidence (the PASS run): turn 0 agent_tools = ['load_tools', 'search_file', 'query_specific_file'] — the model hit a miss (file-search tools not in the turn-1 loaded set), called load_tools("file_search"), the loader admitted the bundle cap-aware (LRU-evicted 3 never-called tools; held max_tools=14), then searched/indexed/queried and answered correctly.

SC3/SC4 evidence (recall gate on that run): recall: 100.0% — All called tools were loaded when called ✅; τ rate/turn: 0.500 (load_tools: 1).

Bot comment #4625281483 anchors all resolved: load_tools registered on the native path (via CORE); free-recovery left on the prompt-inlined path; escape-hatch rate via a dedicated counter logged alongside τ (not gaia stats).

Deviations from the approved sketch (flagged per CLAUDE.md)

  1. load_tools schema presence — delivered via CORE membership (no separate always-on meta-tool mechanism), registered only when the loader is active so default-off doc stays byte-identical.
  2. Bundle menu — new ToolLoader.format_bundle_menu(), injected into the stable prompt prefix, native-models only (protects the non-native TTFT path).
  3. Recovery timing — lands on the next model step (stronger than the sketch's "next turn"): the handler swaps the filter + recomputes the prompt immediately via a new base Agent._apply_tool_filter.
  4. Native "known gap" exemption removed — but only after teaching the recall parser to union mid-loop load_tools lines; removing it alone would have false-failed every successful recovery.
  5. Escape-hatch rate — dedicated per-session loader counter + TOOL_LOADER_SESSION, aggregated from logs (no gaia stats/UI-DB touch).
  6. Cap stays 14 = 11 CORE + 3 dynamic slots (Part 1's rationale was 10 CORE + 4). Open question for the eval: whether 3 dynamic slots suffice or the default should bump to 15 — gated on recall/escape-hatch-rate, not changed here.

Additional divergences worth a reviewer's eye:

  • Menu lists DOC_BUNDLES, not the global KNOWN_TOOLS map the bot comment suggested — because load_tools loads those bundles, so the menu must list exactly what's loadable.
  • τ-signal derivation (a fix): the plan assumed the eval aggregates TOOL_LOADER_SESSION lines, but the eval/UI-server path never calls reset_session(), so those lines don't appear in eval logs. The recall gate now derives the rate from the raw per-turn events (the source of truth), so SC4 works from eval runs as intended. Wiring reset_session() into the UI runtime is intentionally left out of scope.
  • test_tool_loader_token_budget.py left unchanged — the filtered pins don't move because the loader-off fixture never registers load_tools; the unfiltered 37-tool baseline is untouched.

SC2 / Part 1 note

SC2 verifies a deliberate Part-1 deliverable ("leave _execute_tool on the full registry"), not a Part-1 miss — _execute_tool is untouched in this diff. Part 2 adds the per-session counter on top of Part 1's existing TOOL_LOADER_ESCAPE_HATCH log.

Deferred / follow-ups (out of scope)

  • Full category task-success vs. baseline is hardware-confounded locally (baseline is AMD Ryzen; this machine is Metal at ~8–21 min/scenario) — run gaia eval agent --category tool_selection|rag_quality --compare … on target hardware before relying on the score. The canonical recovery scenario already PASSes here.
  • Pre-existing search_past_conversations TypeError (int/str) surfaced during the eval — unrelated to Tool loader — Part 2: explicit escape hatch + tuning #1450, worth its own issue.
  • UI-server reset_session() wiring — only needed if a per-session summary is wanted from the Agent UI runtime; the τ-rate is already derivable from per-turn events.

Checklist

  • Linked a GitHub issue above (Closes #1450).
  • Described why (before → after impact), not just what changed.
  • Ran linting and tests locally (python util/lint.py --all → PASS; 86 unit tests pass).
  • Updated documentation — docs/plans/tool-loader.mdx Part 2 marked shipped with an implementation reference.

@github-actions github-actions Bot added documentation Documentation changes eval Evaluation framework changes tests Test changes performance Performance-critical changes agents labels Jun 19, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Code Review — Tool loader Part 2 (#1762)

Summary

Clean, well-scoped, and unusually well-tested — this is ready to merge with at most a couple of cosmetic touch-ups. The escape hatch is wired exactly the way the architecture wants it: load_tools is CORE-only and registered only when the loader is active, so the default-off doc path stays byte-identical (verified by test_load_tools_registered_only_when_loader_active + the unchanged token-budget pins). load_bundle faithfully mirrors select's cap-aware LRU admission, so max_tools holds during mid-loop recovery, and the recall parser was correctly taught to union mid-loop load_tools lines before the native exemption was dropped — the right order, since dropping it alone would have false-failed every successful recovery. The single most important thing: the LLM-affecting surfaces here (tool registration + doc system prompt) only activate behind GAIA_DYNAMIC_TOOLS=1, which is what makes the deferred full-category eval acceptable (see below).

Issues

🟡 Full-category eval deferred to target hardware — non-blocking, but gate it before the loader ever ships on by default. CLAUDE.md requires an gaia eval agent category run vs. the committed baseline whenever tool registration or a system prompt changes, and both changed here. The PR runs only the smart_discovery scenario (PASS 9.7/10) and defers the category --compare citing Metal-vs-Ryzen hardware confounding. That's honestly disclosed and genuinely low-risk because the change is off by default — the shipped default path is byte-identical and pinned, so default users can't regress. The ask is just: run gaia eval agent --category tool_selection|rag_quality --compare … on AMD hardware before dynamic_tools is flipped on by default in any later PR, not now.

🟢 load_bundle log records the bare tool name, not the resolved bundle (tool_loader.py:375). resolved_name = bundle is never updated when _resolve_bundle_members resolves a bare tool name to its owning bundle(s) — so a load_tools("search_file") call logs "bundle": "search_file" rather than the bundle it actually pulled. Harmless to behavior, but the variable name implies a resolution that doesn't happen, and it makes the τ-tuning log slightly misleading. Either rename it to bundle or have _resolve_bundle_members also return the owning bundle name(s) for the log line.

🟢 Non-native models carry load_tools in CORE but get no menu (chat/agent.py:881). The menu is native-only (correct — non-native models self-recover for free), but load_tools is in DOC_CORE_TOOLS for every model, so a non-native model sees the tool with a docstring that points at a "Loadable tool bundles" menu it was never given. Purely cosmetic — those models never need it — but worth a one-line note in the docstring or a follow-up.

🟢 count_recovery_events_from_log returns a bare tuple (tool_recall.py). Minor type-hint precision — prefer Tuple[int, int] (already importable from typing) so the (free, loads) shape is documented at the signature.

Strengths

  • Invariant centralized correctly. Extracting Agent._apply_tool_filter so the "filter and cached prompt move together" rule lives in exactly one place — shared by the per-turn refresh and the mid-loop handler — is the clean way to make recovery land on the next step rather than the next turn. test_apply_tool_filter_swaps_filter_and_recomputes_prompt pins it.
  • Parser correctness is genuinely well-guarded. _SESSION_RE vs _TOOL_LOADER_RE can't cross-contaminate (the TOOL_LOADER_ vs TOOL_LOADER space boundary), event-less lines alone move the scenario cursor, and there are explicit tests for consecutive single-turn scenarios and recovered-vs-unrecovered misses. The τ-rate derivation from raw per-turn events (because the eval/UI path never calls reset_session()) is a real bug caught and fixed, not a paper deviation.
  • Fail-loudly compliant. Unknown bundle raises KeyError internally and is translated at the tool boundary into a structured, actionable error that lists the valid bundle names — exactly the allowed boundary-translation pattern, no silent fallback.
  • Deviations from the sketch are each flagged with a why, and the docs plan (tool-loader.mdx) is updated to match the shipped reality.

Verdict

Approve with suggestions. No blocking issues — the default path is byte-identical and well-pinned, and the recovery mechanism is correct and thoroughly tested. The 🟡 is advisory (run the category eval on AMD hardware before this is ever defaulted on); the 🟢 items are cosmetic and safe to fold in or defer. Nice work.

@alexey-tyurin

Copy link
Copy Markdown
Contributor Author

Addressed the three 🟢 review suggestions and fixed the failing Unit Tests checks; branch is updated with main (conflict-free). Notes on the API Tests check below.

Review suggestions

  • load_bundle logged the bare tool name, not the resolved bundle. _resolve_bundle_members now returns (members, resolved_name), so load_tools("search_file") logs "bundle": "file_search" instead of "search_file" — the τ-tuning log is no longer misleading. Added an assertion for it.
  • Non-native models saw load_tools with a docstring pointing at a menu they never get. Reworded the docstring to be menu-optional: "If a 'Loadable tool bundles' menu is shown in your instructions, pick from it; otherwise pass the specific tool name."
  • count_recovery_events_from_log returned a bare tuple → now Tuple[int, int].

Why the Unit Tests checks were red (and the fix)

CI runs the full tests/unit/ suite; two Part-2 changes broke 3 tests outside the loader files:

  • Extracting Agent._apply_tool_filtertest_dynamic_tool_filtering's _SpyAgent double borrowed _refresh_active_tool_filter but not the new method it delegates to, so it raised AttributeError. Fixed by binding both real methods on the spy.
  • The doc-prompt menu read self.tool_loader directly — test_chat_system_prompt_budget[doc-1] builds a partial ChatAgent without that attribute. Switched to getattr(self, "tool_loader", None) (the same defensive pattern already used for model_id).

API Tests check

This failure is not caused by this PR's diff:

  • API Tests passed on the pre-fix commit ecf62c44; it only went red after the main merge. The paths it gates on (src/gaia/api/**, src/gaia/llm/**) changed in main, not in this PR.
  • The job warns No files were found … gaia.api.log, i.e. the API/Lemonade server didn't start — an infra/startup issue, not a test assertion.
  • tests/test_api.py collects cleanly against this branch (89 tests, no import errors).

Re-running the job should clear it if it's a flake. If it fails deterministically, the cause is in the merged main commits and should be tracked separately from this change.

Verification

  • python util/lint.py --all → all checks PASS.
  • Previously-failing unit tests + the full loader suite: 102 passed (on the merged HEAD).
  • Branch merged with current main; none of main's commits touch files in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents documentation Documentation changes eval Evaluation framework changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tool loader — Part 2: explicit escape hatch + tuning

1 participant