Skip to content

fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives#240

Merged
poshinchen merged 2 commits into
strands-agents:mainfrom
poshinchen:fix-tool-results-implementer
Jun 6, 2026
Merged

fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives#240
poshinchen merged 2 commits into
strands-agents:mainfrom
poshinchen:fix-tool-results-implementer

Conversation

@poshinchen

@poshinchen poshinchen commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Closes #235

Problem

FaithfulnessEvaluator produced false negatives because both mappers read only content[0] from Bedrock-style toolResult.content lists, silently discarding blocks at index 1+.

Solution

  • Add join_tool_result_content() helper in src/strands_evals/mappers/utils.py that joins every block type: text → passthrough (with None/non-str coercion), json → json.dumps(sort_keys=True) (with serialization error guard), image/document/video → placeholder marker, unknown → silently skipped with a debug log. Empty-string text blocks are filtered from the join.
  • Extract _response_to_text() module-level helper in strands_in_memory_session_mapper.py to eliminate the duplicated list-vs-string ternary.
  • Replace all content[0]-only reads in strands_in_memory_session_mapper.py (_process_tool_results, _convert_inference_messages, _convert_tool_execution_span) and cloudwatch_session_mapper.py (_extract_tool_result_text).

Testing

All 198 mapper tests pass. Regression tests added for: legacy multi-text, legacy text+json, latest-convention inference span multi-text, latest-convention tool-execution span multi-text, and CloudWatch multi-block tool result. Direct unit tests in TestJoinToolResultContent cover all block types, edge cases (None, empty list, falsy non-None, empty-string text block, non-serializable json, bare dict, unknown keys, multi-key precedence).

…atives in FaithfulnessEvaluator

Multi-part Bedrock toolResult.content lists (text, json, image, document, video blocks) were
silently truncated to content[0], making values in subsequent blocks invisible to evaluators.

Adds join_tool_result_content() helper in mappers/utils.py and wires it into every affected
read site: _process_tool_results (legacy), _convert_inference_messages tool_call_response
branch (latest), _convert_tool_execution_span latest branch (both in
StrandsInMemorySessionMapper), and _extract_tool_result_text (CloudWatchSessionMapper).

Closes strands-agents#235

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/strands_evals/mappers/strands_in_memory_session_mapper.py Outdated
Comment thread tests/strands_evals/mappers/test_strands_in_memory_mapper.py Outdated
Comment thread src/strands_evals/mappers/cloudwatch_session_mapper.py
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Assessment: Comment (Approve with minor suggestions)

Solid bug fix that correctly identifies and resolves the root cause of faithfulness false negatives. The centralized join_tool_result_content helper is well-designed with good edge-case handling.

Review Categories
  • DRY: The response-to-text ternary pattern is duplicated in two places within strands_in_memory_session_mapper.py — consider extracting a small helper.
  • Dead code: _extract_tool_result_text in cloudwatch mapper is now a trivial wrapper that could be inlined.
  • Style: Mid-file import in test file; minor but easy to fix.

Testing is thorough with good coverage across all affected code paths.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Assessment: Approve

The mid-file import issue has been addressed. The addition of try/except around json.dumps and the explicit None/empty-list guards (avoiding false coercion of 0/False) are nice defensive improvements.

Outstanding suggestions (non-blocking)
  • DRY: The duplicated ternary at lines 326-330 and 377-381 in strands_in_memory_session_mapper.py still exists — extracting a small helper would clean this up.
  • Trivial wrapper: _extract_tool_result_text in cloudwatch_session_mapper.py is still a pass-through to join_tool_result_content — consider inlining when convenient.

Clean fix with thorough testing. Good to merge as-is.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Assessment: Approve

All substantive feedback from the previous review has been addressed: the duplicated ternary is now the clean _response_to_text helper, the f-string logger calls are fixed to %s style, and the test import is properly at the top.

Minor non-blocking observation
  • _response_to_text is defined before logger = logging.getLogger(__name__) (line 30 vs 36). Conventional module ordering is imports → logger → helpers → classes. Not a functional issue but worth noting for consistency.

Well-structured fix with thorough edge-case testing. Good to merge.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Assessment: Approve

Clean bugfix with comprehensive testing. All prior review feedback has been addressed — the DRY ternary is extracted to _response_to_text, f-string loggers are fixed to %s style, and imports are properly ordered.

Review Notes
  • Correctness: All 3 content[0]-only read sites are fixed across both mappers; the legacy gen_ai.choice path (line 392) correctly remains unchanged since it reads model response text, not toolResult content.
  • Testing: 73 tests pass; regression coverage hits all affected code paths (legacy, latest convention inference, latest convention tool execution, CloudWatch).
  • Style compliance: Logging uses structured %s format per STYLE_GUIDE.md; type annotations use built-in generics.
  • Non-blocking observation: _extract_tool_result_text in cloudwatch_session_mapper.py (line 319-321) is now a trivial pass-through — can be inlined in a follow-up cleanup.

Good to merge.

@strands-agents strands-agents deleted a comment from github-actions Bot Jun 2, 2026
@poshinchen

Copy link
Copy Markdown
Contributor Author

/strands which implementation would you prefer, "#239" or this.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

PR #240 vs PR #239 — Comparison & Recommendation

I've reviewed both PRs side-by-side. They solve the same bug (#235) with similar approaches (shared join_tool_result_content helper in utils.py), but differ in important ways:

Aspect PR #239 PR #240 (this)
Block separator \n " " (space)
Public API surface 2 functions (serialize_tool_result_block + join_tool_result_content) 1 function (join_tool_result_content)
json serialization json.dumps(default=str) — no crash but non-deterministic key order json.dumps(sort_keys=True) + try/except — deterministic & crash-safe
Unit tests for utility 0 17 (covers every block type, edge case, coercion path)
Unknown block types Returns "" silently Returns "" + debug log
CloudWatch refactor Removes _extract_tool_result_text entirely Keeps as thin wrapper
DRY in strands mapper Inlines directly (duplicated list-vs-str pattern stays) Extracts _response_to_text helper

I'd recommend PR #240 for the following reasons:

  1. Test coverage — The 17 unit tests in TestJoinToolResultContent exercise every code path (None, empty, non-dict, text coercion, json sort_keys, non-serializable json, image/doc/video placeholders, unknown keys, mixed blocks, empty-string filtering). PR fix: joined tool_results to prevent context lost #239 has zero direct utility tests — a regression in serialization logic would go undetected.

  2. Deterministic outputsort_keys=True matters because evaluators do string matching/comparison. Non-deterministic key ordering can cause flaky evaluations.

  3. Better encapsulationserialize_tool_result_block is an implementation detail; exposing it as public API (fix: joined tool_results to prevent context lost #239) creates unnecessary surface area to maintain.

  4. Crash safety — The try/except around json.dumps prevents unexpected TypeError if a tool returns a non-serializable value (e.g., custom object). PR fix: joined tool_results to prevent context lost #239 uses default=str which avoids crashes but produces less readable output.

One improvement to consider for #240:

Switch the separator from " " to "\n". PR #239's choice of newline is arguably better for multi-paragraph tool outputs — joining with space can create hard-to-read walls of text that make it harder for LLM judges to parse distinct tool result sections.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Assessment: Approve

Well-structured fix with comprehensive edge-case testing. All prior review feedback has been incorporated cleanly.

Review Details

Verified:

  • All 73 mapper tests pass
  • ruff check passes on all modified source files
  • Logging uses structured %s format per STYLE_GUIDE.md
  • Type annotations use built-in generics (list[str])
  • Google-style docstrings present on public utility
  • Tests mirror src/ structure correctly
  • No new dependencies, no litellm, no Jinja2

Non-blocking follow-up observation:
src/strands_evals/providers/langfuse_provider.py line ~424 in _parse_tool_result_content has the same c[0].get("text", "") pattern this PR fixes in the other two mappers. Consider applying join_tool_result_content there as well in a follow-up PR.

Good to merge.

@poshinchen poshinchen merged commit bcb1e8f into strands-agents:main Jun 6, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] FaithfulnessEvaluator drops multi-part toolResult.content → false negatives

2 participants