Skip to content

Redact internal details from analytics engine 500 error responses#22233

Draft
finnegancarroll wants to merge 1 commit into
opensearch-project:mainfrom
finnegancarroll:fix/redact-internal-500-details
Draft

Redact internal details from analytics engine 500 error responses#22233
finnegancarroll wants to merge 1 commit into
opensearch-project:mainfrom
finnegancarroll:fix/redact-internal-500-details

Conversation

@finnegancarroll

@finnegancarroll finnegancarroll commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Description

Unrecognized exceptions from the analytics engine (those not converted by NativeErrorConverter to 400/429) currently leak internal details in the error response: cause and wrapped exception types.

Change

At the top-level convertingListener in DefaultPlanExecutor.doExecute():

  • If the exception was NOT converted to a recognized type (400/429), replace it with a generic message: Internal error [task_id=X, query_id=Y]
  • Log the full original exception at ERROR level server-side for operator debugging

User sees (after)

{
  "error": {
    "type": "runtime_exception",
    "reason": "Internal error [task_id=12345, query_id=8ffee8b8-944b-4fcd-9f71-6f4601f8034b]"
  },
  "status": 500
}

Operators grep server logs for the task/query ID to find the full stack trace.

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit b1fb99d)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

When converted == e and isInternalError(converted) is true, the code calls listener.onFailure() with a new RuntimeException. However, the convertingListener itself is already wrapping the original listener, so this creates a double-wrap scenario. The outer convertingListener.onFailure(e) will be invoked first (line 412), which will then invoke the redaction logic again. This causes the redaction block to execute twice for the same error, potentially logging the error twice and creating nested RuntimeException wrappers.

ActionListener<AnalyticsQueryResponse> convertingListener = ActionListener.wrap(listener::onResponse, e -> {
    Exception converted = e instanceof Exception ex ? contextProvider.convertException(ex) : new RuntimeException(e);
    // If convertException returned unrecognized 500 — redact internal details and log the original.
    if (converted == e && isInternalError(converted)) {
        AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
        String queryId = queryTask.getQueryId();
        String identifier = "unassigned".equals(queryId)
            ? "task_id=" + task.getId()
            : "task_id=" + task.getId() + ", query_id=" + queryId;
        logger.error(
            new org.apache.logging.log4j.message.ParameterizedMessage("[analytics-engine] internal error [{}]", identifier),
            converted
        );
        listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
    } else {
        listener.onFailure(converted);
    }
Unassigned Query ID

The code checks if queryId equals the string literal "unassigned" to determine whether to include it in the error message. However, there is no evidence in the diff that getQueryId() ever returns this specific string. If getQueryId() returns null or a different sentinel value when unassigned, the condition will fail and the error message will incorrectly include "query_id=null" or another unexpected value.

String identifier = "unassigned".equals(queryId)
    ? "task_id=" + task.getId()
    : "task_id=" + task.getId() + ", query_id=" + queryId;

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to b1fb99d

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Preserve exception cause for debugging

The redacted error message loses the original exception's stack trace, making
debugging difficult. Consider wrapping the original exception as the cause while
still providing a sanitized message to preserve diagnostic information.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [425]

-listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
+listener.onFailure(new RuntimeException("Internal error [" + identifier + "]", converted));
Suggestion importance[1-10]: 8

__

Why: This is a valuable suggestion that addresses a significant issue. The current code creates a new RuntimeException without preserving the original exception as a cause, which loses the stack trace and makes debugging much harder. Including converted as the cause maintains diagnostic information while still providing the sanitized message.

Medium
Possible issue
Validate task type before casting

Add a type check before casting task to AnalyticsQueryTask to prevent potential
ClassCastException. The task parameter is of type Task, and casting without
verification could fail if an unexpected task type is passed.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [415-417]

 if (converted == e && isInternalError(converted)) {
+    if (!(task instanceof AnalyticsQueryTask)) {
+        logger.error("[analytics-engine] internal error [task_id=" + task.getId() + "]", converted);
+        listener.onFailure(new RuntimeException("Internal error [task_id=" + task.getId() + "]"));
+        return;
+    }
     AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
     String queryId = queryTask.getQueryId();
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential ClassCastException risk when casting task to AnalyticsQueryTask without type checking. However, since this is within the doExecute method that likely expects AnalyticsQueryTask by design, the risk may be theoretical rather than practical.

Medium

Previous suggestions

Suggestions up to commit b28f4e3
CategorySuggestion                                                                                                                                    Impact
Security
Handle wrapped 500 exceptions

The condition converted == e uses reference equality which may fail if
contextProvider.convertException wraps the exception. This could cause internal
errors to leak when the exception is wrapped but still represents a 500 error.
Consider checking isInternalError on both the original and converted exceptions.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [412-428]

 ActionListener<AnalyticsQueryResponse> convertingListener = ActionListener.wrap(listener::onResponse, e -> {
     Exception converted = e instanceof Exception ex ? contextProvider.convertException(ex) : new RuntimeException(e);
     // If convertException returned unrecognized 500 — redact internal details and log the original.
     if (converted == e && isInternalError(converted)) {
         AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
         String queryId = queryTask.getQueryId();
         String identifier = "unassigned".equals(queryId)
             ? "task_id=" + task.getId()
             : "task_id=" + task.getId() + ", query_id=" + queryId;
         logger.error(
             new org.apache.logging.log4j.message.ParameterizedMessage("[analytics-engine] internal error [{}]", identifier),
             converted
         );
         listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
+    } else if (converted != e && isInternalError(converted)) {
+        AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
+        String queryId = queryTask.getQueryId();
+        String identifier = "unassigned".equals(queryId)
+            ? "task_id=" + task.getId()
+            : "task_id=" + task.getId() + ", query_id=" + queryId;
+        logger.error(
+            new org.apache.logging.log4j.message.ParameterizedMessage("[analytics-engine] internal error [{}]", identifier),
+            e
+        );
+        listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
     } else {
         listener.onFailure(converted);
     }
 });
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that converted == e uses reference equality, which may miss cases where contextProvider.convertException wraps the exception but still returns a 500 error. However, the improved code duplicates logic and could be refactored more elegantly. The security concern is valid but the implementation could be cleaner.

Medium
Possible issue
Validate task type before casting

The unsafe cast to AnalyticsQueryTask could throw ClassCastException if the task
type is unexpected. Add a type check before casting to prevent potential crashes
when accessing queryId.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [416]

+if (!(task instanceof AnalyticsQueryTask)) {
+    logger.error("[analytics-engine] internal error [task_id=" + task.getId() + "]", converted);
+    listener.onFailure(new RuntimeException("Internal error [task_id=" + task.getId() + "]"));
+    return;
+}
 AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a potential ClassCastException risk when casting task to AnalyticsQueryTask. While this is a valid concern for defensive programming, the context suggests that doExecute is called with the correct task type by the framework. The suggestion adds safety but may be addressing an unlikely scenario.

Low
Suggestions up to commit b28f4e3
CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent unsafe task type cast

The unsafe cast to AnalyticsQueryTask can throw ClassCastException if the task type
is unexpected. Add a type check before casting or handle the potential exception to
prevent crashes.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [416]

-AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
+if (!(task instanceof AnalyticsQueryTask queryTask)) {
+    logger.error("[analytics-engine] internal error [task_id={}]", task.getId(), converted);
+    listener.onFailure(new RuntimeException("Internal error [task_id=" + task.getId() + "]"));
+    return;
+}
Suggestion importance[1-10]: 7

__

Why: Valid concern about unsafe casting. The doExecute method receives a Task parameter that could theoretically be a different type. Adding a type check before casting to AnalyticsQueryTask would prevent potential ClassCastException and improve robustness.

Medium
Fix reference equality check logic

The condition converted == e uses reference equality which may fail if
contextProvider.convertException() wraps the exception. Consider checking if the
converted exception is still an internal error instead of relying on reference
equality to determine if conversion occurred.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [412-428]

 ActionListener<AnalyticsQueryResponse> convertingListener = ActionListener.wrap(listener::onResponse, e -> {
     Exception converted = e instanceof Exception ex ? contextProvider.convertException(ex) : new RuntimeException(e);
-    // If convertException returned unrecognized 500 — redact internal details and log the original.
-    if (converted == e && isInternalError(converted)) {
+    // If the converted exception is still an unrecognized 500 — redact internal details and log the original.
+    if (isInternalError(converted)) {
         AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
         String queryId = queryTask.getQueryId();
         String identifier = "unassigned".equals(queryId)
             ? "task_id=" + task.getId()
             : "task_id=" + task.getId() + ", query_id=" + queryId;
         logger.error(
             new org.apache.logging.log4j.message.ParameterizedMessage("[analytics-engine] internal error [{}]", identifier),
-            converted
+            e
         );
         listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
     } else {
         listener.onFailure(converted);
     }
 });
Suggestion importance[1-10]: 3

__

Why: The suggestion misunderstands the intent. The converted == e check specifically detects when contextProvider.convertException() returns the original exception unchanged (unrecognized error). Removing this check would redact all 500 errors, not just unrecognized ones. The suggested change to log e instead of converted is correct but minor.

Low
Suggestions up to commit b28f4e3
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add type check before task casting

The cast to AnalyticsQueryTask may fail if the task is not of that type. Add a type
check before casting to prevent ClassCastException when accessing queryId.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [416-417]

 if (converted == e && isInternalError(converted)) {
-    AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
-    String queryId = queryTask.getQueryId();
+    String queryId = task instanceof AnalyticsQueryTask queryTask 
+        ? queryTask.getQueryId() 
+        : "unknown";
Suggestion importance[1-10]: 7

__

Why: This is a valid safety improvement. The code assumes task is always an AnalyticsQueryTask but doesn't verify this before casting, which could cause a ClassCastException. Adding a type check with a fallback to "unknown" prevents potential runtime errors and makes the code more defensive.

Medium
Use class comparison for exception identity

The condition converted == e uses reference equality which may fail if
convertException wraps the exception. Consider checking if the exception type is
unchanged instead, or verify that convertException returns the same instance for
unrecognized exceptions.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [415]

 ActionListener<AnalyticsQueryResponse> convertingListener = ActionListener.wrap(listener::onResponse, e -> {
     Exception converted = e instanceof Exception ex ? contextProvider.convertException(ex) : new RuntimeException(e);
     // If convertException returned unrecognized 500 — redact internal details and log the original.
-    if (converted == e && isInternalError(converted)) {
+    if (converted.getClass() == e.getClass() && isInternalError(converted)) {
         AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
         String queryId = queryTask.getQueryId();
         String identifier = "unassigned".equals(queryId)
             ? "task_id=" + task.getId()
             : "task_id=" + task.getId() + ", query_id=" + queryId;
         logger.error(
             new org.apache.logging.log4j.message.ParameterizedMessage("[analytics-engine] internal error [{}]", identifier),
             converted
         );
         listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
     } else {
         listener.onFailure(converted);
     }
 });
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid concern about reference equality, but the proposed solution using converted.getClass() == e.getClass() may not correctly identify when convertException returns the same instance unchanged. The current implementation using == is likely intentional to detect when no conversion occurred. The suggestion's approach could miss cases where the exception is wrapped in the same type.

Low
Suggestions up to commit 9b841fb
CategorySuggestion                                                                                                                                    Impact
General
Preserve exception cause chain

Creating a new RuntimeException without preserving the original exception as a cause
loses the stack trace and makes debugging difficult. Consider passing the original
exception as the cause parameter to maintain the exception chain.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [425]

-listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
+listener.onFailure(new RuntimeException("Internal error [" + identifier + "]", converted));
Suggestion importance[1-10]: 7

__

Why: Valid suggestion. Preserving the original exception as a cause would maintain the stack trace for debugging while still redacting the message. However, the current implementation intentionally logs the full exception before redacting, so the information isn't completely lost.

Medium
Possible issue
Add null safety check

The queryId variable may be null, which would cause a NullPointerException when
passed to the equals method in the ternary expression. Use null-safe comparison or
add an explicit null check before constructing the identifier string.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [418-420]

-String identifier = "unassigned".equals(queryId)
+String identifier = (queryId == null || "unassigned".equals(queryId))
     ? "task_id=" + task.getId()
     : "task_id=" + task.getId() + ", query_id=" + queryId;
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a potential NullPointerException if queryId is null. The null-safe comparison using "unassigned".equals(queryId) already handles null correctly (returns false), but adding an explicit null check improves code clarity and defensive programming.

Low
Use class equality check

The identity check converted == e may fail if convertException wraps the original
exception in a new instance with the same type. Consider using
converted.getClass().equals(e.getClass()) or checking if converted is an instance of
a generic internal error type to ensure proper detection of unhandled exceptions.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [415]

-if (converted == e && isInternalError(converted)) {
+if (converted.getClass().equals(e.getClass()) && isInternalError(converted)) {
Suggestion importance[1-10]: 3

__

Why: The suggestion misunderstands the intent. The identity check converted == e is specifically designed to detect when convertException returns the same instance (unrecognized exception). Changing to class equality would break this logic and incorrectly flag recognized exceptions that happen to have the same class.

Low
Suggestions up to commit 57bde0f
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add type check before casting

The unsafe cast to AnalyticsQueryTask will throw ClassCastException if the task is
not of that type. Add a type check before casting to prevent potential crashes when
handling errors.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [415-422]

 if (converted == e && isInternalError(converted)) {
-    AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
-    String queryId = queryTask.getQueryId();
-    String identifier = "unassigned".equals(queryId)
-        ? "task_id=" + task.getId()
-        : "task_id=" + task.getId() + ", query_id=" + queryId;
+    String identifier;
+    if (task instanceof AnalyticsQueryTask) {
+        AnalyticsQueryTask queryTask = (AnalyticsQueryTask) task;
+        String queryId = queryTask.getQueryId();
+        identifier = "unassigned".equals(queryId)
+            ? "task_id=" + task.getId()
+            : "task_id=" + task.getId() + ", query_id=" + queryId;
+    } else {
+        identifier = "task_id=" + task.getId();
+    }
     logger.error("[analytics-engine] internal error [{}]", identifier, converted);
     listener.onFailure(new RuntimeException("Internal error [" + identifier + "]"));
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential ClassCastException risk when casting task to AnalyticsQueryTask. Adding a type check before casting is a defensive programming practice that prevents crashes during error handling, though the actual likelihood depends on the codebase's guarantees about task types.

Medium

@finnegancarroll finnegancarroll force-pushed the fix/redact-internal-500-details branch from 94fd5c6 to fd80fc2 Compare June 18, 2026 04:18
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit fd80fc2

@finnegancarroll finnegancarroll force-pushed the fix/redact-internal-500-details branch from fd80fc2 to 57bde0f Compare June 18, 2026 05:12
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 57bde0f

@github-actions

Copy link
Copy Markdown
Contributor

❌ Gradle check result for 57bde0f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 9b841fb

1 similar comment
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 9b841fb

@github-actions

Copy link
Copy Markdown
Contributor

✅ Gradle check result for 9b841fb: SUCCESS

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.75%. Comparing base (5a41747) to head (b28f4e3).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #22233      +/-   ##
============================================
+ Coverage     73.38%   73.75%   +0.36%     
- Complexity    75904    76271     +367     
============================================
  Files          6069     6069              
  Lines        344869   344869              
  Branches      49623    49623              
============================================
+ Hits         253081   254353    +1272     
+ Misses        71570    70567    -1003     
+ Partials      20218    19949     -269     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@finnegancarroll finnegancarroll force-pushed the fix/redact-internal-500-details branch from 9b841fb to b28f4e3 Compare June 18, 2026 19:14
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b28f4e3

@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b28f4e3

@github-actions

Copy link
Copy Markdown
Contributor

✅ Gradle check result for b28f4e3: SUCCESS

@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b28f4e3

@github-actions

Copy link
Copy Markdown
Contributor

✅ Gradle check result for b28f4e3: SUCCESS

Unrecognized exceptions (those not converted by NativeErrorConverter to
400/429) now return a generic 'Internal error [task_id=X, query_id=Y]'
message to the user instead of leaking internal details (stage IDs, shard
routing, gRPC metadata, native error messages, planner internals).

The full stack trace is logged at ERROR level server-side so operators
can still diagnose issues using the task/query ID as a correlator.

Exceptions with well-defined HTTP semantics (IllegalArgumentException
-> 400, CircuitBreakingException -> 429, RejectedExecution -> 429) are
passed through unchanged since their messages are user-facing by design.

Signed-off-by: Finn Carroll <carrofin@amazon.com>
@finnegancarroll finnegancarroll force-pushed the fix/redact-internal-500-details branch from b28f4e3 to b1fb99d Compare June 18, 2026 22:45
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b1fb99d

@github-actions

Copy link
Copy Markdown
Contributor

❌ Gradle check result for b1fb99d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant