feat(agents): tool loader Part 2 — explicit escape hatch + tuning (#1450)#1762
feat(agents): tool loader Part 2 — explicit escape hatch + tuning (#1450)#1762alexey-tyurin wants to merge 4 commits into
Conversation
Code Review — Tool loader Part 2 (#1762)SummaryClean, well-scoped, and unusually well-tested — this is ready to merge with at most a couple of cosmetic touch-ups. The escape hatch is wired exactly the way the architecture wants it: Issues🟡 Full-category eval deferred to target hardware — non-blocking, but gate it before the loader ever ships on by default. CLAUDE.md requires an 🟢 🟢 Non-native models carry 🟢 Strengths
VerdictApprove with suggestions. No blocking issues — the default path is byte-identical and well-pinned, and the recovery mechanism is correct and thoroughly tested. The 🟡 is advisory (run the category eval on AMD hardware before this is ever defaulted on); the 🟢 items are cosmetic and safe to fold in or defer. Nice work. |
|
Addressed the three 🟢 review suggestions and fixed the failing Unit Tests checks; branch is updated with Review suggestions
Why the Unit Tests checks were red (and the fix)CI runs the full
API Tests checkThis failure is not caused by this PR's diff:
Re-running the job should clear it if it's a flake. If it fails deterministically, the cause is in the merged Verification
|
Summary
Adds the explicit escape hatch to the dynamic tool loader: a native tool-calling model can now recover a tool the per-turn selector didn't surface by calling an always-available
load_tools(bundle)meta-tool, and the loader emits an escape-hatch activation-rate signal for tuning the semantic threshold. This is Part 2 of the tool-loader milestone (#688), builds on Part 1 (#1449), and is off by default.Why
Part 1 trims each turn's tool prompt to a semantically-matched subset to cut first-turn TTFT — but that left one real gap. A native tool-calling model (the default doc model, Gemma-4-E4B) is physically limited to the schemas in
tools=, so if the selector didn't surface a tool the model literally cannot call it, and a task can dead-end because a needed tool is permanently unreachable. Concretely: "How many PTO days?" with nothing indexed scores the search/index tools below the threshold, so before this change the native model had no way to reach them and the task failed; after it, the model loads the bundle it needs and calls the tool on its next step — hard recall failures drop to zero (that scenario now recovers and PASSes 9.7/10 with the loader on). Part 1's free-recovery net only ever helped non-tool-calling models.Linked issue
Closes #1450
Changes
load_tools(bundle)escape hatch for native models — added to the doc profile's CORE set so it renders in both the text prompt and the nativetools=schema every active turn; registered only when the loader is active, so the default-off doc path stays byte-identical.gaia.eval.tool_recallunions the mid-loopload_toolsline into its turn and drops Part 1's native "known gap" exemption, so a genuine miss fails the gate on every model.TOOL_LOADER_SESSIONsummary also emits on thegaia chat/CLI path.Test plan
python util/lint.py --all→ all checks PASS (Black/isort/Pylint/Flake8/imports/dependabot).python -m pytest tests/unit/test_tool_loader_selection.py tests/unit/test_tool_recall.py tests/unit/test_chat_dynamic_tools.py tests/unit/test_chat_tool_bundles.py tests/unit/test_tool_loader_token_budget.py→ 86 passed.load_toolsand renders the 37-tool prompt unchanged — pinned bytest_tool_loader_token_budget.pyand asserted bytest_load_tools_registered_only_when_loader_active.GAIA_DYNAMIC_TOOLS=1on :4200 (… ui.server … 2>&1 | tee /tmp/server.log), thengaia eval agent --scenario smart_discovery --agent-type doc --timeout 1400. Confirmgrep '"event": "load_tools"' /tmp/server.logshows the recovery, andpython -m gaia.eval.tool_recall <run_dir> --log /tmp/server.logprints recall 100% + the τ-rate.Tool Loader Part 2: Verification & Proof Report
load_toolswithin one extra turn (e2e)SC1 evidence (the PASS run):
turn 0 agent_tools = ['load_tools', 'search_file', 'query_specific_file']— the model hit a miss (file-search tools not in the turn-1 loaded set), calledload_tools("file_search"), the loader admitted the bundle cap-aware (LRU-evicted 3 never-called tools; heldmax_tools=14), then searched/indexed/queried and answered correctly.SC3/SC4 evidence (recall gate on that run):
recall: 100.0% — All called tools were loaded when called ✅;τ rate/turn: 0.500 (load_tools: 1).Bot comment #4625281483 anchors all resolved:
load_toolsregistered on the native path (via CORE); free-recovery left on the prompt-inlined path; escape-hatch rate via a dedicated counter logged alongside τ (notgaia stats).Deviations from the approved sketch (flagged per CLAUDE.md)
load_toolsschema presence — delivered via CORE membership (no separate always-on meta-tool mechanism), registered only when the loader is active so default-off doc stays byte-identical.ToolLoader.format_bundle_menu(), injected into the stable prompt prefix, native-models only (protects the non-native TTFT path).Agent._apply_tool_filter.load_toolslines; removing it alone would have false-failed every successful recovery.TOOL_LOADER_SESSION, aggregated from logs (nogaia stats/UI-DB touch).Additional divergences worth a reviewer's eye:
DOC_BUNDLES, not the globalKNOWN_TOOLSmap the bot comment suggested — becauseload_toolsloads those bundles, so the menu must list exactly what's loadable.TOOL_LOADER_SESSIONlines, but the eval/UI-server path never callsreset_session(), so those lines don't appear in eval logs. The recall gate now derives the rate from the raw per-turn events (the source of truth), so SC4 works from eval runs as intended. Wiringreset_session()into the UI runtime is intentionally left out of scope.test_tool_loader_token_budget.pyleft unchanged — the filtered pins don't move because the loader-off fixture never registersload_tools; the unfiltered 37-tool baseline is untouched.SC2 / Part 1 note
SC2 verifies a deliberate Part-1 deliverable ("leave
_execute_toolon the full registry"), not a Part-1 miss —_execute_toolis untouched in this diff. Part 2 adds the per-session counter on top of Part 1's existingTOOL_LOADER_ESCAPE_HATCHlog.Deferred / follow-ups (out of scope)
gaia eval agent --category tool_selection|rag_quality --compare …on target hardware before relying on the score. The canonical recovery scenario already PASSes here.search_past_conversationsTypeError(int/str) surfaced during the eval — unrelated to Tool loader — Part 2: explicit escape hatch + tuning #1450, worth its own issue.reset_session()wiring — only needed if a per-session summary is wanted from the Agent UI runtime; the τ-rate is already derivable from per-turn events.Checklist
Closes #1450).python util/lint.py --all→ PASS; 86 unit tests pass).docs/plans/tool-loader.mdxPart 2 marked shipped with an implementation reference.