KAFKA-10317: Global thread should honor shutdown signal during bootstrapping#22417
KAFKA-10317: Global thread should honor shutdown signal during bootstrapping#22417lucliu1108 wants to merge 8 commits into
Conversation
chickenchickenlove
left a comment
There was a problem hiding this comment.
Thanks for your hard work!
I left a comment.
When you get a chance, please take a look 🙇♂️
| if (inErrorStateSupplier.getAsBoolean()) { | ||
| logBootstrapInterrupted(storeMetadata); | ||
| return; |
There was a problem hiding this comment.
Could we make the shutdown-interrupted bootstrap path explicit instead of returning normally from GlobalStateManagerImpl?
Currently, when inErrorStateSupplier.getAsBoolean() is true, restoreState() / reprocessState() just return, so GlobalStateManagerImpl#initialize() can also return as if bootstrap completed successfully. As a result, GlobalStateUpdateTask#initialize() may continue into initTopology(), processorContext.initialize(), and flushState() even though shutdown has already been requested.
Since initTopology() can invoke user-provided Processor#init(), this could unnecessarily open external resources during shutdown. Maybe this should use an explicit internal signal, such as a dedicated bootstrap-interrupted exception caught only on the clean shutdown path, or return an initialize status like completed/interrupted so the follow-up initialization can be skipped.
What do you think?
There was a problem hiding this comment.
Hi @chickenchickenlove , thanks for the review!
Good point, i refactored to make the interrupted path explicit by:
GlobalStateManager.initialize()now returnsOptional<Set<String>>intead ofSet<String>.Optional.empty()is the explicit "bootstrap was interrupted by shutdown" signal — set both when the supplier-check fires between polls and when aWakeupExceptionis caught during shutdown.- GlobalStateUpdateTask.initialize() checks the Optional first; if empty, it returns
Collections.emptyMap()immediately and skips the rest of the process.
UladzislauBlok
left a comment
There was a problem hiding this comment.
Made initial pass
| if (inErrorStateSupplier.getAsBoolean()) { | ||
| logBootstrapInterrupted(storeMetadata); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Looks like a lot of code duplication. Can we move it to dedicated method, or keep it on high level?
AFAIU, when restoration will be completed (on current store) and we'll move to next one we'll interrupt it anyway. Kinda trade-off to not check same condition n-times
for (final StateStoreMetadata metadata : storeMetadata.values()) {
if (inErrorStateSupplier.getAsBoolean()) {
log.info("Global store bootstrap interrupted by shutdown before starting {}", metadata.stateStore.name());
break;
}
...
}There was a problem hiding this comment.
Thanks for the advice!
I have removed the check inside the restoreState or reprocessState and instead just have outer supplier check at per-store look in GlobalStateManagerImpl.initialize(), returning empty directly instead of continuing to the next store.
| try { | ||
| stateConsumer.pollAndUpdate(); | ||
| } catch (final WakeupException e) { | ||
| if (!inErrorState()) { | ||
| throw e; | ||
| } | ||
| } |
There was a problem hiding this comment.
I think this is not part of bootstrapping, is it?
There was a problem hiding this comment.
Good catch! The steady-state catch wasn't really about bootstrap. I removed it, and also tightened shutdown() to only triggering globalConsumer.wakeup() when the thread is still in CREATED state (i.e., still bootstrapping). The steady-state main loop is now untouched by this PR.
| if (inErrorState()) { | ||
| closeStateConsumer(stateConsumer, false); | ||
| return null; | ||
| } | ||
|
|
||
| setState(RUNNING); | ||
| return stateConsumer; | ||
| } catch (final WakeupException e) { | ||
| closeStateConsumer(stateConsumer, false); | ||
| if (inErrorState()) { | ||
| log.info("Global thread initialization interrupted by shutdown"); | ||
| } else { | ||
| startupException = new StreamsException( | ||
| "Unexpected wakeup during initialization of GlobalStreamThread", e); | ||
| } |
There was a problem hiding this comment.
I think this part should be enough:
if (inErrorState()) {
closeStateConsumer(stateConsumer, false);
return null;
}Do we need to catch WakeupException?
UPD: Overall idea is to break execution of GlobalStateManagerImpl and verify if it was interrupted (check inErrorState())
There was a problem hiding this comment.
Thanks for flagging this!
For the catch {} block, it was originally used for covering the other blocking consumer calls during bootstrap that aren't inside the local poll-catch: GlobalStateManagerImpl#partitionsFor(), endOffsets() and position() (also part of the bootstrapping path). If shutdown() is fired during these paths, WakeupException` will propagate up directly.
Right now after applying the suggestion of @chickenchickenlove , this catch is no longer needed and has been removed. All WakeupException during bootstrap are now caught inside GlobalStateManagerImpl.initialize() and converted to Optional.empty(), so no WakeupExceiption reaches the GlobalStreamThread.initialize().
UladzislauBlok
left a comment
There was a problem hiding this comment.
Left minor comments, but LGTM overall
| LegacyCheckpointingStateStore.migrateLegacyOffsets(logPrefix, stateDirectory, null, wrappedStores); | ||
|
|
||
| for (final StateStoreMetadata metadata : storeMetadata.values()) { | ||
| if (shouldStopBootstrappingSupplier.getAsBoolean()) { |
There was a problem hiding this comment.
Just to confirm: so we ether catch the "interrupted bootstrapping" fact on next iteration or when handling WakeupException. That makes sense
There was a problem hiding this comment.
Yes. First path is supplier check between stores, the other one is wakeupException thrown by methods like poll/partitionsFor/...
| if (inErrorState()) { | ||
| closeStateConsumer(stateConsumer, false); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
for which scenario we need this check?
There was a problem hiding this comment.
The previous try{ stateConsumer.initialize() } catch(...) block calls GlobalStateUpdateTask and will return directly if the shutdown signal is called. In that case, stateConsumer.initialize() returns normally with no exceptions thrown.
This follow-up checks the situation where shutdown is already requested, and routes to cleanup, causes the run() loop to go to the early-exit path.
There was a problem hiding this comment.
right, I see it now. thanks for clarification
| final Thread shutdownThread = new Thread(() -> { | ||
| try { | ||
| TestUtils.waitForCondition( | ||
| () -> stateRestoreListener.storeNameCalledStates.containsKey(MockStateRestoreListener.RESTORE_START), | ||
| 10 * 1000L, | ||
| "Bootstrap restore never started."); | ||
| } catch (final Exception e) { | ||
| throw new RuntimeException(e); | ||
| } | ||
| globalStreamThread.shutdown(); | ||
| }); | ||
| shutdownThread.start(); |
There was a problem hiding this comment.
Runnable + Executors is better imo, but this is super minor. You can ignore this comment if you like current approach more
There was a problem hiding this comment.
Good point. Seems that the raw thread path would not surface the exceptions thrown inside the lambda to the main test thread and returns normally on shutdownThead.join(). I switched to the executor + runnable path.
|
|
||
| startAndSwallowError(); | ||
| shutdownThread.join(); | ||
| globalStreamThread.join(5_000); |
There was a problem hiding this comment.
there is already timeout on test, so we can remove this one I guess
Summary
This PR introduces a shutdown-aware bootstrap loop in
GlobalStateManagerImpland aconsumer.wakeup()call duringGlobalStreamThread.shutdown()that together letKafkaStreams#close()interrupt global-store restoration in progress, instead of waiting for the entire changelog to be replayed.Ticket: https://issues.apache.org/jira/browse/KAFKA-10317
Implementation
The global thread passes its
inErrorState()predicate to the state manager, which checks it before each batch in the bootstrap poll loop and exits cleanly when shutdown is requested. Thewakeup()call additionally interrupts any in-flightpoll()so shutdown takes effect right away, even if the loop is currently blocked on a fetch. A matching WakeupException catch in the main update loop ensures clean shutdowns aren't reported through the uncaught-exception handler.Tests
Added unit tests in
GlobalStateManagerImplTestcovering the supplier check andWakeupExceptionhandling in bothrestoreStateandreprocessState, and end-to-end tests inGlobalStreamThreadTestfor the close-during-bootstrap scenario.