fix(mappers): join all toolResult.content blocks to fix faithfulness false negatives #240
…atives in FaithfulnessEvaluator
Multi-part Bedrock toolResult.content lists (text, json, image, document, video blocks) were silently truncated to content[0], making values in subsequent blocks invisible to evaluators.
Adds join_tool_result_content() helper in mappers/utils.py and wires it into every affected read site: _process_tool_results (legacy), _convert_inference_messages tool_call_response branch (latest), _convert_tool_execution_span latest branch (both in StrandsInMemorySessionMapper), and _extract_tool_result_text (CloudWatchSessionMapper).
Closes strands-agents#235
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Assessment: Comment (Approve with minor suggestions)
Solid bug fix that correctly identifies and resolves the root cause of faithfulness false negatives. The centralized
Review Categories
Testing is thorough with good coverage across all affected code paths.
Assessment: Approve
The mid-file import issue has been addressed. The addition of
Outstanding suggestions (non-blocking)
Clean fix with thorough testing. Good to merge as-is.
Assessment: Approve
All substantive feedback from the previous review has been addressed: the duplicated ternary is now the clean
Minor non-blocking observation
Well-structured fix with thorough edge-case testing. Good to merge.
Assessment: Approve
Clean bugfix with comprehensive testing. All prior review feedback has been addressed: the DRY ternary is extracted to
Review Notes
Good to merge.
/strands which implementation would you prefer, "#239" or this?
PR #240 vs PR #239: Comparison & Recommendation
I've reviewed both PRs side-by-side. They solve the same bug (#235) with similar approaches (shared
I'd recommend PR #240 for the following reasons:
One improvement to consider for #240:
Switch the separator from
2903b0e to 1cf9af0
Assessment: Approve
Well-structured fix with comprehensive edge-case testing. All prior review feedback has been incorporated cleanly.
Review Details
Verified:
Non-blocking follow-up observation:
Good to merge.
Closes #235
Problem
FaithfulnessEvaluator produced false negatives because both mappers read only content[0] from Bedrock-style toolResult.content lists, silently discarding blocks at index 1+.
Solution
- join_tool_result_content() helper in src/strands_evals/mappers/utils.py that joins every block type: text → passthrough (with None/non-str coercion), json → json.dumps(sort_keys=True) (with serialization error guard), image/document/video → placeholder marker, unknown → silently skipped with a debug log. Empty-string text blocks are filtered from the join.
- _response_to_text() module-level helper in strands_in_memory_session_mapper.py to eliminate the duplicated list-vs-string ternary.
- Replaces the content[0]-only reads in strands_in_memory_session_mapper.py (_process_tool_results, _convert_inference_messages, _convert_tool_execution_span) and cloudwatch_session_mapper.py (_extract_tool_result_text) with the new helper.
Testing
All 198 mapper tests pass. Regression tests added for: legacy multi-text, legacy text+json, latest-convention inference span multi-text, latest-convention tool-execution span multi-text, and CloudWatch multi-block tool result. Direct unit tests in TestJoinToolResultContent cover all block types and edge cases (None, empty list, falsy non-None, empty-string text block, non-serializable json, bare dict, unknown keys, multi-key precedence).
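For readers without the diff open, here is a minimal sketch of the joining behaviour described under Solution. It is not the code from this PR: the function name join_tool_result_content_sketch, the newline separator, the "[image]"-style placeholder format, the str() fallback for non-serializable json, and the text-before-json precedence are all assumptions made for illustration.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Assumed values: the PR text shows neither the real separator
# nor the real placeholder format.
SEPARATOR = "\n"
MEDIA_KEYS = ("image", "document", "video")


def join_tool_result_content_sketch(content: Any) -> str:
    """Join every block of a Bedrock-style toolResult.content list."""
    if content is None:
        return ""
    if isinstance(content, dict):
        # Treat a bare dict as a single block.
        content = [content]
    if not isinstance(content, list):
        # Non-list, non-dict input is coerced to a string.
        return str(content)

    parts = []
    for block in content:
        if not isinstance(block, dict):
            continue
        if "text" in block:
            # Text passes through; None and non-str values are coerced.
            text = block["text"]
            parts.append("" if text is None else str(text))
        elif "json" in block:
            try:
                parts.append(json.dumps(block["json"], sort_keys=True))
            except (TypeError, ValueError):
                # Serialization guard (assumed fallback: str()).
                parts.append(str(block["json"]))
        else:
            media = next((k for k in MEDIA_KEYS if k in block), None)
            if media is not None:
                parts.append(f"[{media}]")
            else:
                logger.debug("Skipping unknown block keys: %s", list(block))

    # Empty strings are filtered out before joining.
    return SEPARATOR.join(p for p in parts if p)


# Usage: a value in the second block is no longer dropped.
blocks = [{"text": "order found"}, {"json": {"total": 42, "id": "A1"}}]
print(join_tool_result_content_sketch(blocks))
```

With a content[0]-only read, the total of 42 in the second block would never reach the evaluator, which is the false-negative scenario this PR addresses.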