Skip to content

KAFKA-20602: Bound DLQ topic creation, fix log level and restore interrupt in DeadLetterQueueReporter#22424

Open
PavelZeger wants to merge 2 commits into
apache:trunkfrom
PavelZeger:KAFKA-20602
Open

KAFKA-20602: Bound DLQ topic creation, fix log level and restore interrupt in DeadLetterQueueReporter#22424
PavelZeger wants to merge 2 commits into
apache:trunkfrom
PavelZeger:KAFKA-20602

Conversation

@PavelZeger
Copy link
Copy Markdown

@PavelZeger PavelZeger commented May 30, 2026

DeadLetterQueueReporter.createAndSetup verifies/creates the dead letter queue topic during sink task startup. The current implementation has three issues:

  1. Indefinite hang on broker outage. Both admin.listTopics().names().get() and admin.createTopics(...).all().get() are unbounded blocking calls. When the broker is unreachable, each call blocks until the AdminClient's internal default.api.timeout.ms elapses (default 60s). Because tasks are started serially, a worker with N DLQ-enabled connectors can stall for up to N x 60s on a cold-broker scenario (DR failover, outage recovery), which surfaces to operators as "all connectors stuck Initializing".

  2. Wrong log level. The expected "topic does not exist yet" case on first startup was logged at ERROR, even though the very next line creates the topic. This produces spurious ERROR alerts on every first start of a connector with DLQ enabled.

  3. Interrupt status not restored. The InterruptedException handler threw a ConnectException without calling Thread.currentThread().interrupt(), clearing the interrupt signal for any upstream blocking call.

Changes:

  • Bound both admin calls with get(30, TimeUnit.SECONDS) so startup fails fast on an unreachable broker regardless of the AdminClient's configured api timeout, and add a catch (TimeoutException) that surfaces a clear ConnectException.
  • Downgrade the "topic does not exist" message from ERROR to INFO and include the partition count and replication factor.
  • Restore the thread interrupt flag before re-throwing on InterruptedException.

The 30s timeout is hardcoded rather than exposed as a new config to avoid a KIP; it covers the overwhelming majority of clusters and can be revisited if a user reports needing longer.

Testing strategy:

Added three unit tests to ErrorReporterTest. createAndSetup constructs its own Admin and KafkaProducer internally, so the tests use mockStatic(Admin.class) to inject a mock admin (and mockConstruction for
the producer), then stub the returned KafkaFuture.get(timeout, unit) to drive each path:

  • createAndSetupTimesOutWhenBrokerUnreachable - stubs a TimeoutException and asserts a ConnectException is thrown and the admin client is closed. Done with a mocked timeout rather than a real unreachable broker so the test is deterministic and does not wait the full 30s.
  • createAndSetupRestoresInterruptOnInterruptedException - stubs an InterruptedException, runs on a dedicated thread (so the restored interrupt flag does not leak into other tests), and asserts the flag is set after the ConnectException propagates.
  • firstTimeDlqTopicCreationLogsAtInfoNotError - uses LogCaptureAppender to assert the first-time creation path logs at INFO and not ERROR.

No new integration or system tests: the change is a localized fix to an existing code path with no new public surface or cross-component behavior.

@github-actions github-actions Bot added the triage PRs from the community label May 30, 2026
@PavelZeger PavelZeger changed the title KAFKA-20602: Bound DLQ topic creation, fix log level and restore interrupt in DeadLetterQueueReporter Body: KAFKA-20602: Bound DLQ topic creation, fix log level and restore interrupt in DeadLetterQueueReporter May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connect triage PRs from the community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant