CLU-95: investigation notes + repro harness #36844
Draft
antiguru wants to merge 3 commits into
Add a continuation write-up (CLU-95-CONTINUATION.md) capturing the root-cause theory for the 'failed to apply hard as-of constraint' bootstrap panic, and an mzcompose repro harness (test/clu-95-repro) that stresses the read-only/0dt leased-read-hold lifecycle and single-env ungraceful restarts. No product code changes; this is the understanding/repro phase.
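A harness like this ultimately needs a pass/fail signal, and the panic text quoted above is the obvious one. The following is a minimal sketch of such a check, not the code in `test/clu-95-repro`: the function name `find_panic_lines` and the sample lines are hypothetical placeholders, and only the marker string comes from the commit message.

```python
from typing import Iterable, List, Tuple

# Panic text quoted in the commit message; matched as a plain substring.
PANIC_MARKER = "failed to apply hard as-of constraint"


def find_panic_lines(
    log_lines: Iterable[str], marker: str = PANIC_MARKER
) -> List[Tuple[int, str]]:
    """Return (1-based line number, line without newline) for each matching line."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]


if __name__ == "__main__":
    # Synthetic placeholder lines, not real service output.
    sample = [
        "placeholder: service started\n",
        "placeholder: failed to apply hard as-of constraint\n",
        "placeholder: service exited\n",
    ]
    for number, line in find_panic_lines(sample):
        print(f"line {number}: {line}")
```

Because the function takes any iterable of strings, it can be fed an open file handle directly, which avoids loading a large services log into memory.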
Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping per-replica read hold without equivalent global read hold" WARN added in PR #35937) into a pure helper `find_unprotected_replica_holds`, and add four unit tests that exercise the hold-asymmetry condition tracked under incidents-and-escalations#39. The tests are the first deterministic specification of the bug-class shape and pin down the regression contract for the eventual fix.

Also extend the CLU-95 repro harness with two new workflows targeting the build 1248 manifestation more directly:

* `cancelled-peek-reconnect`: slow-path SELECT (via mz_unsafe.mz_sleep) on an unmanaged cluster pinned to a standalone Clusterd, cancelled mid-render, then clusterd force-killed to provoke reconnect.
* `replica-removal-under-load`: writer cluster MV + concurrent dataflow churn on a separate compute cluster, then DROP CLUSTER REPLICA on the compute side to drive `Instance::remove_replica` under load.

Both workflows accumulate perturbations under one long-lived envd and then do a single ungraceful restart, mirroring the workload-replay sanity_restart sequence from build 1248. Neither reproduces the bootstrap panic over 30/40 iterations, but the harness now scans for the diagnostic WARN as a secondary signal.

CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log findings, rule out the leased-expiry framing for that build, and lay out the three-pronged fix direction: upstream hold-accounting fix (#39), bootstrap report-don't-panic safety net (the CLU-95-specific recovery), and render-time report-don't-panic (the moral successor to the now-canceled CLU-34).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
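To make the hold-asymmetry condition concrete, here is a small Python model of what a pure helper of this kind might compute. It is a sketch under stated assumptions, not a port of the Rust helper: the real signature of `find_unprotected_replica_holds` is not shown here, frontiers are simplified to single integers, and "equivalent global read hold" is assumed to mean a global hold on the same collection at a frontier no later than the per-replica hold. The name `find_unprotected_holds` and the collection ids in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a frontier is a single integer timestamp, and a hold at
# frontier t keeps all times >= t readable.
Frontier = int


def find_unprotected_holds(
    replica_holds: Dict[str, Frontier],
    global_holds: Dict[str, Frontier],
) -> List[Tuple[str, Frontier]]:
    """Return per-replica holds that no global hold covers.

    A per-replica hold counts as covered only if a global hold exists on the
    same collection at a frontier <= the per-replica hold's frontier.
    Results are ordered by collection id so the output is deterministic.
    """
    unprotected = []
    for collection_id, replica_frontier in sorted(replica_holds.items()):
        global_frontier = global_holds.get(collection_id)
        if global_frontier is None or global_frontier > replica_frontier:
            unprotected.append((collection_id, replica_frontier))
    return unprotected


if __name__ == "__main__":
    replica = {"c1": 10, "c2": 20, "c3": 5}
    global_ = {"c1": 10, "c2": 25}  # c2 is held too late, c3 not at all
    print(find_unprotected_holds(replica, global_))
```

Keeping the check free of logging and side effects is what makes it unit-testable: a caller can decide separately whether each returned entry becomes a WARN.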
What
Investigation/understanding phase for CLU-95 — the recurring bootstrap panic:
No product code changes. This PR adds:
* `CLU-95-CONTINUATION.md`: full root-cause write-up covering the decoded panic, the runtime invariant `input.since <= step_back(mv.upper)` and where it's enforced, the read-only/0dt leased read-hold theory (lease expiry + the deliberately-disabled `update_since` in `expire_leased_reader`, database-issues#6885), the tension that makes naive repros fail, open questions, a prioritized repro plan, candidate fixes with risk notes, and a code map with file:line references.
* `test/clu-95-repro/mzcompose.py`: a repro hunt (not yet a proven reproducer) with `zdt-soak` (leader + read-only follower, short `persist_reader_lease_duration`, MV chain + REFRESH MV, reboot follower under load, scan logs) and `restart-soak` (single-env ungraceful kill+restart, mirroring workload-replay's `sanity_restart`).

TL;DR theory
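The invariant `input.since <= step_back(mv.upper)` named in the write-up summary can be read as a simple predicate. The sketch below assumes integer timestamps and assumes `step_back(t)` means "the largest timestamp strictly before `t`", saturating at 0; both are simplifications for illustration, and the function name `as_of_is_satisfiable` and the sample numbers are hypothetical rather than taken from the product code or from any log.

```python
def step_back(ts: int) -> int:
    """Assumed semantics: the latest timestamp strictly before ts, saturating at 0."""
    return max(ts - 1, 0)


def as_of_is_satisfiable(input_since: int, mv_upper: int) -> bool:
    """Check the invariant input.since <= step_back(mv.upper).

    When this returns False, the input can no longer be read at the time
    just before the MV's write frontier.
    """
    return input_since <= step_back(mv_upper)


if __name__ == "__main__":
    mv_upper = 1_000_000  # illustrative value, in milliseconds

    # Input still readable just before the MV's upper: invariant holds.
    print(as_of_is_satisfiable(999_999, mv_upper))

    # Input since equal to the MV's upper: already one step too far.
    print(as_of_is_satisfiable(1_000_000, mv_upper))

    # Input since 41 seconds past the MV's upper: invariant violated.
    print(as_of_is_satisfiable(mv_upper + 41_000, mv_upper))
```

Note the boundary case: a `since` exactly equal to the MV's `upper` already violates the predicate, because the comparison is against the stepped-back frontier.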
`u732` is a real, durably-written user MV; the panic means a storage input's read frontier (`since`) advanced ~41s past the MV's durable write frontier (`upper`). In a single read-write env the input is held by a persist critical handle (never expires), so this shouldn't happen. In read-only/0dt the follower holds inputs with persist leased handles, and on lease expiry `update_since` is disabled (#6885), so the leader's next `compare_and_downgrade_since` can jump the input `since` forward past a dependent MV's `upper`. Persist `since` never regresses, so the bad state is durable and every later bootstrap panics.

How to run
Notes / caveats
`check-mzcompose-files.sh` will flag it as unused.

https://claude.ai/code/session_01G3SvtMjZaSAzqW1dGropWn
Generated by Claude Code