fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives by poshinchen · Pull Request #240 · strands-agents/evals

poshinchen · 2026-06-02T18:24:32Z

Closes #235

Problem

FaithfulnessEvaluator produced false negatives because both mappers read only content[0] from Bedrock-style toolResult.content lists, silently discarding blocks at index 1+.

Solution

Add join_tool_result_content() helper in src/strands_evals/mappers/utils.py that joins every block type: text → passthrough (with None/non-str coercion), json → json.dumps(sort_keys=True) (with serialization error guard), image/document/video → placeholder marker, unknown → silently skipped with a debug log. Empty-string text blocks are filtered from the join.
Extract _response_to_text() module-level helper in strands_in_memory_session_mapper.py to eliminate the duplicated list-vs-string ternary.
Replace all content[0]-only reads in strands_in_memory_session_mapper.py (_process_tool_results, _convert_inference_messages, _convert_tool_execution_span) and cloudwatch_session_mapper.py (_extract_tool_result_text).

Testing

All 198 mapper tests pass. Regression tests added for: legacy multi-text, legacy text+json, latest-convention inference span multi-text, latest-convention tool-execution span multi-text, and CloudWatch multi-block tool result. Direct unit tests in TestJoinToolResultContent cover all block types, edge cases (None, empty list, falsy non-None, empty-string text block, non-serializable json, bare dict, unknown keys, multi-key precedence).

…atives in FaithfulnessEvaluator Multi-part Bedrock toolResult.content lists (text, json, image, document, video blocks) were silently truncated to content[0], making values in subsequent blocks invisible to evaluators. Adds join_tool_result_content() helper in mappers/utils.py and wires it into every affected read site: _process_tool_results (legacy), _convert_inference_messages tool_call_response branch (latest), _convert_tool_execution_span latest branch (both in StrandsInMemorySessionMapper), and _extract_tool_result_text (CloudWatchSessionMapper). Closes strands-agents#235 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-02T18:28:14Z

Assessment: Comment (Approve with minor suggestions)

Solid bug fix that correctly identifies and resolves the root cause of faithfulness false negatives. The centralized join_tool_result_content helper is well-designed with good edge-case handling.

Review Categories

DRY: The response-to-text ternary pattern is duplicated in two places within strands_in_memory_session_mapper.py — consider extracting a small helper.
Dead code: _extract_tool_result_text in cloudwatch mapper is now a trivial wrapper that could be inlined.
Style: Mid-file import in test file; minor but easy to fix.

Testing is thorough with good coverage across all affected code paths.

github-actions · 2026-06-02T18:39:32Z

Assessment: Approve

The mid-file import issue has been addressed. The addition of try/except around json.dumps and the explicit None/empty-list guards (avoiding false coercion of 0/False) are nice defensive improvements.

Outstanding suggestions (non-blocking)

DRY: The duplicated ternary at lines 326-330 and 377-381 in strands_in_memory_session_mapper.py still exists — extracting a small helper would clean this up.
Trivial wrapper: _extract_tool_result_text in cloudwatch_session_mapper.py is still a pass-through to join_tool_result_content — consider inlining when convenient.

Clean fix with thorough testing. Good to merge as-is.

github-actions · 2026-06-02T18:56:24Z

Assessment: Approve

All substantive feedback from the previous review has been addressed: the duplicated ternary is now the clean _response_to_text helper, the f-string logger calls are fixed to %s style, and the test import is properly at the top.

Minor non-blocking observation

_response_to_text is defined before logger = logging.getLogger(__name__) (line 30 vs 36). Conventional module ordering is imports → logger → helpers → classes. Not a functional issue but worth noting for consistency.

Well-structured fix with thorough edge-case testing. Good to merge.

github-actions · 2026-06-02T19:49:55Z

Assessment: Approve

Clean bugfix with comprehensive testing. All prior review feedback has been addressed — the DRY ternary is extracted to _response_to_text, f-string loggers are fixed to %s style, and imports are properly ordered.

Review Notes

Correctness: All 3 content[0]-only read sites are fixed across both mappers; the legacy gen_ai.choice path (line 392) correctly remains unchanged since it reads model response text, not toolResult content.
Testing: 73 tests pass; regression coverage hits all affected code paths (legacy, latest convention inference, latest convention tool execution, CloudWatch).
Style compliance: Logging uses structured %s format per STYLE_GUIDE.md; type annotations use built-in generics.
Non-blocking observation: _extract_tool_result_text in cloudwatch_session_mapper.py (line 319-321) is now a trivial pass-through — can be inlined in a follow-up cleanup.

Good to merge.

poshinchen · 2026-06-02T19:55:04Z

/strands which implementation would you prefer, "#239" or this.

github-actions · 2026-06-02T19:58:00Z

PR #240 vs PR #239 — Comparison & Recommendation

I've reviewed both PRs side-by-side. They solve the same bug (#235) with similar approaches (shared join_tool_result_content helper in utils.py), but differ in important ways:

Aspect	PR #239	PR #240 (this)
Block separator	`\n`	`" "` (space)
Public API surface	2 functions (`serialize_tool_result_block` + `join_tool_result_content`)	1 function (`join_tool_result_content`)
json serialization	`json.dumps(default=str)` — no crash but non-deterministic key order	`json.dumps(sort_keys=True)` + try/except — deterministic & crash-safe
Unit tests for utility	0	17 (covers every block type, edge case, coercion path)
Unknown block types	Returns `""` silently	Returns `""` + debug log
CloudWatch refactor	Removes `_extract_tool_result_text` entirely	Keeps as thin wrapper
DRY in strands mapper	Inlines directly (duplicated list-vs-str pattern stays)	Extracts `_response_to_text` helper

I'd recommend PR #240 for the following reasons:

Test coverage — The 17 unit tests in TestJoinToolResultContent exercise every code path (None, empty, non-dict, text coercion, json sort_keys, non-serializable json, image/doc/video placeholders, unknown keys, mixed blocks, empty-string filtering). PR fix: joined tool_results to prevent context lost #239 has zero direct utility tests — a regression in serialization logic would go undetected.
Deterministic output — sort_keys=True matters because evaluators do string matching/comparison. Non-deterministic key ordering can cause flaky evaluations.
Better encapsulation — serialize_tool_result_block is an implementation detail; exposing it as public API (fix: joined tool_results to prevent context lost #239) creates unnecessary surface area to maintain.
Crash safety — The try/except around json.dumps prevents unexpected TypeError if a tool returns a non-serializable value (e.g., custom object). PR fix: joined tool_results to prevent context lost #239 uses default=str which avoids crashes but produces less readable output.

One improvement to consider for #240:

Switch the separator from " " to "\n". PR #239's choice of newline is arguably better for multi-paragraph tool outputs — joining with space can create hard-to-read walls of text that make it harder for LLM judges to parse distinct tool result sections.

github-actions · 2026-06-02T20:07:48Z

Assessment: Approve

Well-structured fix with comprehensive edge-case testing. All prior review feedback has been incorporated cleanly.

Review Details

Verified:

All 73 mapper tests pass
ruff check passes on all modified source files
Logging uses structured %s format per STYLE_GUIDE.md
Type annotations use built-in generics (list[str])
Google-style docstrings present on public utility
Tests mirror src/ structure correctly
No new dependencies, no litellm, no Jinja2

Non-blocking follow-up observation:
src/strands_evals/providers/langfuse_provider.py line ~424 in _parse_tool_result_content has the same c[0].get("text", "") pattern this PR fixes in the other two mappers. Consider applying join_tool_result_content there as well in a follow-up PR.

Good to merge.

poshinchen had a problem deploying to auto-approve June 2, 2026 18:24 — with GitHub Actions Failure

poshinchen temporarily deployed to auto-approve June 2, 2026 18:24 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/strands_evals/mappers/strands_in_memory_session_mapper.py Outdated

github-actions Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread tests/strands_evals/mappers/test_strands_in_memory_mapper.py Outdated

github-actions Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/strands_evals/mappers/cloudwatch_session_mapper.py

github-actions Bot removed the strands-running label Jun 2, 2026

poshinchen temporarily deployed to auto-approve June 2, 2026 18:36 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot removed the strands-running label Jun 2, 2026

poshinchen temporarily deployed to auto-approve June 2, 2026 18:52 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot removed the strands-running label Jun 2, 2026

poshinchen temporarily deployed to auto-approve June 2, 2026 19:45 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot removed the strands-running label Jun 2, 2026

strands-agents deleted a comment from github-actions Bot Jun 2, 2026

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot removed the strands-running label Jun 2, 2026

poshinchen mentioned this pull request Jun 2, 2026

fix: joined tool_results to prevent context lost #239

Closed

7 tasks

fix: fix missing tool_results aggregation

1cf9af0

poshinchen force-pushed the fix-tool-results-implementer branch from 2903b0e to 1cf9af0 Compare June 2, 2026 20:04

poshinchen temporarily deployed to auto-approve June 2, 2026 20:04 — with GitHub Actions Inactive

poshinchen temporarily deployed to auto-approve June 2, 2026 20:05 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 2, 2026

github-actions Bot removed the strands-running label Jun 2, 2026

jjbuck approved these changes Jun 5, 2026

View reviewed changes

poshinchen merged commit bcb1e8f into strands-agents:main Jun 6, 2026
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives#240

fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives#240
poshinchen merged 2 commits into
strands-agents:mainfrom
poshinchen:fix-tool-results-implementer

poshinchen commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

poshinchen commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

poshinchen commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Testing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

poshinchen commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

PR #240 vs PR #239 — Comparison & Recommendation

I'd recommend PR #240 for the following reasons:

One improvement to consider for #240:

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

poshinchen commented Jun 2, 2026 •

edited

Loading