CEP-45: Lost witness marker race by bdeggleston · Pull Request #4892 · apache/cassandra

bdeggleston · 2026-06-17T20:17:49Z

Don't truncate journal segments until the witnessed offsets they contain are flushed. Also moves MutationTrackingService startup to after the commit log is replayed.

frankgh

I've added a couple of comments, but the patch looks good in general.

frankgh · 2026-06-18T17:42:48Z

        );
+        pendingClearReplaySize = Metrics.register(
+                factory.createMetricName("PendingClearReplaySize"),
+                () -> MutationJournal.instance().pendingClearReplaySize()


Do we want to worry about the case where MT is disabled and maybe handle the IllegalStateException thrown when the instance is null?
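One possible shape for the guard asked about here, as a minimal sketch. `SafeGauge` and `orDefault` are hypothetical names, not part of Cassandra; the only assumption taken from the comment above is that the instance accessor throws IllegalStateException when mutation tracking is disabled.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of the patch.
final class SafeGauge
{
    private SafeGauge() {}

    // Wraps a gauge source so that an IllegalStateException (for example,
    // from an accessor for a service that was never initialized) is reported
    // as a fallback value instead of propagating to the metrics reporter.
    static Supplier<Long> orDefault(Supplier<Long> source, long fallback)
    {
        return () -> {
            try
            {
                return source.get();
            }
            catch (IllegalStateException e)
            {
                return fallback;
            }
        };
    }
}
```

Under that assumption, the gauge lambda in the hunk could be passed through `SafeGauge.orDefault(..., 0L)` so the metric reads 0 while MT is disabled. Whether 0 is the right value to report in that case is a separate decision.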

frankgh · 2026-06-18T17:43:28Z

+    // opaque / immutable list of segments that we should clear the needs-replay flag on
+    public static class PendingClearReplay
+    {
+        private ImmutableSet<Long> segments;


NIT: can we make this final?

Suggested change

-        private ImmutableSet<Long> segments;
+        private final ImmutableSet<Long> segments;

frankgh · 2026-06-18T21:30:00Z

        executor.awaitTermination(1, TimeUnit.MINUTES);
+        // attempt to persist offsets and mark segments as
+        // not needing replay one last time before shutdown
+        if (started)


Suggested change

-        if (started)
+        if (wasStarted)

frankgh · 2026-06-18T21:30:49Z

@@ -323,6 +337,10 @@ private void shutdownBlocking() throws InterruptedException
        activeReconciler.shutdownBlocking();
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);


Should we log if we fail to shut down here?

Suggested change

-        executor.awaitTermination(1, TimeUnit.MINUTES);
+        if (!executor.awaitTermination(1, TimeUnit.MINUTES))
+        {
+            logger.warn("Mutation tracking executor did not terminate within 1 minute; forcing shutdown");
+        }
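For reference, a self-contained sketch of the same pattern using only java.util.concurrent. `ShutdownSketch` and `shutdownAndAwait` are hypothetical names, and the call to `shutdownNow()` is an assumption about what "forcing shutdown" in the log message would mean; the suggestion above only logs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of the patch.
final class ShutdownSketch
{
    private ShutdownSketch() {}

    // Returns true if the executor terminated within the timeout. Otherwise
    // logs a warning, interrupts running tasks via shutdownNow(), and
    // returns false.
    static boolean shutdownAndAwait(ExecutorService executor, long timeout, TimeUnit unit)
    {
        executor.shutdown();
        try
        {
            if (executor.awaitTermination(timeout, unit))
                return true;
        }
        catch (InterruptedException e)
        {
            // Preserve the interrupt for the caller and fall through to the forced path.
            Thread.currentThread().interrupt();
        }
        System.err.println("executor did not terminate within " + timeout + ' ' + unit + "; forcing shutdown");
        executor.shutdownNow();
        return false;
    }
}
```

Returning a boolean lets the caller decide whether a missed deadline should also skip the final persist step that follows in `shutdownBlocking`.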

frankgh · 2026-06-18T21:46:24Z

+     * To improve startup, we periodically save our view of mutation ids that we've witnessed to disk as part of this
+     * class. Any ids witnessed since the last time this class was run are reconstructed by replaying the journal.
+     *
+     * However, if an sstable is flushed is after the most recent LogStatePersister run, AND it marks a segment as no


NIT:

Suggested change

-     * However, if an sstable is flushed is after the most recent LogStatePersister run, AND it marks a segment as no
+     * However, if an sstable is flushed after the most recent LogStatePersister run, AND it marks a segment as no

frankgh · 2026-06-18T21:51:59Z

+            TableMetadata table = Schema.instance.getTableMetadata(keyspaceName, tableName);
+            DecoratedKey dk = Murmur3Partitioner.instance.decorateKey(ByteBufferUtil.bytes(key));
+            MutationSummary summary = MutationTrackingService.instance().createSummaryForKey(dk, table.id, false);
+            if (summary.size() == 0)


NIT:

Suggested change

-            if (summary.size() == 0)
+            if (summary.isEmpty())

frankgh

+1 looks good to me

bdeggleston · 2026-06-20T05:12:19Z

Just pushed up a small test fix.

Unlike normal writes, mutation tracking noops a write if we’ve already seen it, which is a reasonable optimization when we’re tracking each write. Unfortunately, this can bite us on startup. If witnessed offsets are flushed to disk before the memtable containing those offsets is also flushed to sstables, then on startup mutation tracking will think it’s already seen all of those mutations and won’t write them to the memtable, losing a bunch of data in the process.

The fix is pretty simple and just does what commit log replay does. Commit log replay applies its mutations with makeDurable set to false, which means apply to the memtable but not the commit log. So I updated applyInternalTracked to also take a makeDurable flag. If this is false, we now skip making the mutation durable, and we also apply the mutation to the memtable whether we’ve seen it before or not.
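A toy model of the behavior described in this comment, to make the race concrete. `TrackedStoreSketch` and its fields are hypothetical stand-ins, not the real classes; in particular, writing the journal only for new, durable writes is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model for illustration only; names and structure are hypothetical.
final class TrackedStoreSketch
{
    final Set<Long> witnessed = new HashSet<>();      // ids mutation tracking has seen
    final Map<Long, String> memtable = new HashMap<>(); // in-memory data, lost on restart
    final List<Long> journal = new ArrayList<>();     // stand-in for the durable log

    void apply(long id, String value, boolean makeDurable)
    {
        // false when the id was already witnessed (the normal dedup case)
        boolean started = witnessed.add(id);

        // Assumption: only new, durable writes are appended to the journal.
        if (makeDurable && started)
            journal.add(id);

        // On replay (makeDurable == false) apply to the memtable even if the
        // id was already witnessed, since the memtable may never have been flushed.
        if (started || !makeDurable)
            memtable.put(id, value);
    }
}
```

Starting from a store whose `witnessed` set already contains an id but whose `memtable` is empty (the post-restart state), `apply(id, value, false)` repopulates the memtable, whereas a check on `started` alone would drop the write.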

frankgh · 2026-06-20T13:31:38Z

            started = MutationTrackingService.instance().startWriting(mutation);

-            if (started)
+            if (started || !makeDurable)


Do we need a durable MT journal when makeDurable=false? I'm thinking mostly of the case where the schema is created with durable_writes=false.

So basically, the question is whether, on line 633, we need to write the journal when we don't need durable writes.

That's a good point, I'd forgotten you can turn off log durability for keyspaces. I think the answer is that MT doesn't work without the mutation journal, so we need to make this path only reachable on replay and add a check to schema changes that fails validation if durable writes are turned off.
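A rough sketch of the kind of schema check discussed here. `TrackedKeyspaceValidation` and `validate` are hypothetical, and IllegalArgumentException stands in for whatever exception the real validation path throws.

```java
// Hypothetical validation helper for illustration only; not the real schema code.
final class TrackedKeyspaceValidation
{
    private TrackedKeyspaceValidation() {}

    // Rejects the one combination that cannot work: a tracked keyspace
    // depends on the mutation journal, so it cannot disable durable writes.
    static void validate(String keyspace, boolean tracked, boolean durableWrites)
    {
        if (tracked && !durableWrites)
            throw new IllegalArgumentException("Keyspace " + keyspace +
                                               " uses tracked replication and cannot set durable_writes = false");
    }
}
```

The same check would need to run on CREATE and on both ALTER directions (turning off durable writes on a tracked keyspace, and switching an existing non-durable keyspace to tracked).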

frankgh · 2026-06-21T02:04:01Z

        int writesPerKey = 2;
        int pks = 100;
-        withRandom(rng -> {
+        withRandom(1509900183613458L, rng -> {


do we want to reset this seed?

I'll remove it on commit. Thanks!

frankgh · 2026-06-21T02:04:38Z

+        // CREATE: tracked + durable_writes=false should be rejected
+        String createKs = nextKsName();
+        Throwable createFailure = expectFailure(() ->
+            schemaChange("CREATE KEYSPACE " + createKs +


👍 yeah, this makes sense

frankgh · 2026-06-22T14:03:26Z

+            assertEquals("Pre-bounce witness count must equal write count", writes, preBounceOffsetCount);
+
+            // Flush so notifyFlushed marks the active segment's interval clean.
+            cluster.get(1).nodetoolResult("flush", KEYSPACE).asserts().success();


should we flush system.coordinator_logs here as well?

cluster.get(1).nodetoolResult("flush", "system", "coordinator_logs").asserts().success();

frankgh · 2026-06-22T14:16:18Z

+        Throwable createFailure = expectFailure(() ->
+            schemaChange("CREATE KEYSPACE " + createKs +
+                         " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}" +
+                         " AND replication_type = 'tracked'" +
+                         " AND durable_writes = false")
+        );
+        assertTrue("Expected ConfigurationException root cause, got: " + createFailure,
+                   rootCause(createFailure) instanceof ConfigurationException);


Suggested change

-        Throwable createFailure = expectFailure(() ->
-            schemaChange("CREATE KEYSPACE " + createKs +
-                         " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}" +
-                         " AND replication_type = 'tracked'" +
-                         " AND durable_writes = false")
-        );
-        assertTrue("Expected ConfigurationException root cause, got: " + createFailure,
-                   rootCause(createFailure) instanceof ConfigurationException);
+        assertThatThrownBy(() ->
+            schemaChange("CREATE KEYSPACE " + createKs +
+                         " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}" +
+                         " AND replication_type = 'tracked'" +
+                         " AND durable_writes = false")
+        ).hasRootCauseInstanceOf(ConfigurationException.class);

frankgh · 2026-06-22T14:16:34Z

+        Throwable alterTrackedFailure = expectFailure(() ->
+            schemaChange("ALTER KEYSPACE " + alterKs + " WITH durable_writes = false")
+        );
+        assertTrue("Expected ConfigurationException root cause, got: " + alterTrackedFailure,
+                   rootCause(alterTrackedFailure) instanceof ConfigurationException);


Suggested change

-        Throwable alterTrackedFailure = expectFailure(() ->
-            schemaChange("ALTER KEYSPACE " + alterKs + " WITH durable_writes = false")
-        );
-        assertTrue("Expected ConfigurationException root cause, got: " + alterTrackedFailure,
-                   rootCause(alterTrackedFailure) instanceof ConfigurationException);
+        assertThatThrownBy(() -> schemaChange("ALTER KEYSPACE " + alterKs + " WITH durable_writes = false")).hasRootCauseInstanceOf(ConfigurationException.class);

frankgh · 2026-06-22T14:16:51Z

+        Throwable alterToTrackedFailure = expectFailure(() ->
+            schemaChange("ALTER KEYSPACE " + migratedKs + " WITH replication_type = 'tracked'")
+        );
+        assertTrue("Expected ConfigurationException root cause, got: " + alterToTrackedFailure,
+                   rootCause(alterToTrackedFailure) instanceof ConfigurationException);


Suggested change

-        Throwable alterToTrackedFailure = expectFailure(() ->
-            schemaChange("ALTER KEYSPACE " + migratedKs + " WITH replication_type = 'tracked'")
-        );
-        assertTrue("Expected ConfigurationException root cause, got: " + alterToTrackedFailure,
-                   rootCause(alterToTrackedFailure) instanceof ConfigurationException);
+        assertThatThrownBy(() -> schemaChange("ALTER KEYSPACE " + migratedKs + " WITH replication_type = 'tracked'")).hasRootCauseInstanceOf(ConfigurationException.class);

frankgh · 2026-06-22T14:17:01Z

+    private static Throwable expectFailure(Runnable r)
+    {
+        try
+        {
+            r.run();
+        }
+        catch (Throwable t)
+        {
+            return t;
+        }
+        fail("Expected exception but none was thrown");
+        return null;
+    }
+
+    private static Throwable rootCause(Throwable t)
+    {
+        Throwable cause = t;
+        while (cause.getCause() != null && cause.getCause() != cause)
+            cause = cause.getCause();
+        return cause;
+    }


Suggested change

-    private static Throwable expectFailure(Runnable r)
-    {
-        try
-        {
-            r.run();
-        }
-        catch (Throwable t)
-        {
-            return t;
-        }
-        fail("Expected exception but none was thrown");
-        return null;
-    }
-
-    private static Throwable rootCause(Throwable t)
-    {
-        Throwable cause = t;
-        while (cause.getCause() != null && cause.getCause() != cause)
-            cause = cause.getCause();
-        return cause;
-    }

frankgh · 2026-06-22T14:17:19Z

Can we instead use import static org.apache.cassandra.utils.AssertionUtils.assertThatThrownBy; here?

CEP-45: Lost witness marker race

6a4f54f

bdeggleston requested a review from frankgh June 17, 2026 20:17

frankgh reviewed Jun 18, 2026

review feedback

6f9cc80

frankgh approved these changes Jun 18, 2026

fix startup when witnessed mutations are ahead of journal replay

e8b3075

frankgh reviewed Jun 20, 2026

don't allow disabling durable writes for tracked keyspaces

b998fad

frankgh reviewed Jun 21, 2026

frankgh reviewed Jun 22, 2026
