
[Feature][Zeta] Support sink partition strategy routing#10964

Open
yzeng1618 wants to merge 5 commits into apache:dev from yzeng1618:dev-rag-api

Conversation

@yzeng1618
Member

Purpose of this pull request

This PR adds a sink-declared partition routing contract for RAG / Knowledge Sync scenarios.
Related issues:

Main changes:

  • Add SinkPartitionStrategy API with NONE and HASH_BY_FIELDS modes.
  • Add default SeaTunnelSink#getPartitionStrategy() so existing sinks keep current behavior.
  • Add routing field validation against sink input SeaTunnelRowType.
  • Preserve SeaTunnelRow.options when rows are copied.
  • Add Zeta sink partition routing:
    • route data rows by declared fields such as document_id;
    • ensure rows with the same routing value are delivered to the same sink writer;
    • use batched Hazelcast task operation delivery between sink writer tasks;
    • flush data before checkpoint barriers;
    • align checkpoint barriers on receiver side;
    • keep schema/control records out of data hash routing.
  • Add Flink/Spark fail-fast guards for non-empty sink partition strategies until those engines support equivalent routing.
  • Update English and Chinese sink connector developer docs.

This is the PR-B0 style foundation for Knowledge Sync: sink partition strategy API plus Zeta-first routing support. It does not implement Qdrant/Milvus document lifecycle behavior in this PR.

Does this PR introduce any user-facing change?

Yes.

This PR introduces a new sink developer API, SeaTunnelSink#getPartitionStrategy(), that allows a sink to declare field-based routing requirements.

Existing sinks are not affected because the default implementation returns an empty strategy.

For Zeta, when a sink declares HASH_BY_FIELDS(["document_id"]), data rows with the same document_id are routed to the same sink writer.

For Flink and Spark, non-empty sink partition strategies now fail fast with an explicit unsupported-engine error instead of being silently ignored.
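
To make the contract above concrete, here is a minimal sketch of what a sink-declared strategy with NONE and HASH_BY_FIELDS modes, field validation, and an engine fail-fast guard could look like. It is not the PR's actual code: the class `PartitionStrategySketch` and all of its methods are hypothetical names used only for illustration, and plain field-name lists stand in for SeaTunnelRowType.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the PR's SinkPartitionStrategy; names and shape are assumptions.
final class PartitionStrategySketch {
    enum Mode { NONE, HASH_BY_FIELDS }

    final Mode mode;
    final List<String> fields;

    private PartitionStrategySketch(Mode mode, List<String> fields) {
        this.mode = mode;
        this.fields = Collections.unmodifiableList(fields);
    }

    // Default for existing sinks: no routing requirement, current behavior is kept.
    static PartitionStrategySketch none() {
        return new PartitionStrategySketch(Mode.NONE, Collections.<String>emptyList());
    }

    static PartitionStrategySketch hashByFields(String... fields) {
        if (fields.length == 0) {
            throw new IllegalArgumentException("HASH_BY_FIELDS needs at least one field");
        }
        return new PartitionStrategySketch(Mode.HASH_BY_FIELDS, Arrays.asList(fields.clone()));
    }

    // Routing fields must exist in the sink input schema (modeled here as field names).
    void validate(List<String> sinkInputFieldNames) {
        for (String field : fields) {
            if (!sinkInputFieldNames.contains(field)) {
                throw new IllegalArgumentException("Routing field not in sink input: " + field);
            }
        }
    }

    // Engines without equivalent routing reject non-empty strategies instead of ignoring them.
    static void failFastIfUnsupported(String engine, PartitionStrategySketch strategy) {
        if (strategy.mode != Mode.NONE) {
            throw new UnsupportedOperationException(
                    engine + " does not support sink partition strategy " + strategy.mode);
        }
    }
}
```

In this sketch a sink that needs document-scoped routing would return `hashByFields("document_id")`, while every other sink keeps the `none()` default.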

How was this patch tested?

Added tests

Check list

Contributor

@DanielLeens DanielLeens left a comment


Thanks for working on this. I re-reviewed the latest head from scratch against the actual Zeta runtime path, and the overall partition-routing direction makes sense. The blocker I still see is in the schema-evolution path: once a sink declares a partition strategy, SeaTunnelRow records can be routed to remote sink writers, but SchemaChangeEvent is still consumed only by the local writer.

Runtime chain I checked:

Source / transform emits SchemaChangeEvent
  -> SeaTunnelSourceCollector.collect(event)
      -> sendRecordToNext(new Record<>(event))
  -> SinkPartitionExchange.received(record)
      -> if Barrier: broadcast to all sink writers
      -> if SchemaChangeEvent: drain batches, then consumeLocally(record) only

Subsequent SeaTunnelRow records
  -> SinkPartitionRouter.route(record)
      -> hash partition fields to a remote sink writer
  -> dispatcher.dispatch(batch)
  -> remote writer handles the row in SinkFlowLifeCycle.handleRecord()
      -> writer.write(row)
      -> but that writer may never have received applySchemaChange(event)

The key paths are:

  • seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SeaTunnelSourceCollector.java:119-133
  • seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkPartitionExchange.java:104-129
  • seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkPartitionRouter.java:64-80
  • seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkFlowLifeCycle.java:327-343

Issue 1: SchemaChangeEvent is not broadcast to all participating sink writers

  • Location: seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkPartitionExchange.java:112
  • Why this is a real blocker:
    the new exchange layer correctly broadcasts barriers, but schema changes stay local. After that, normal data rows may still hash to a remote writer. That remote writer then writes post-DDL data before it has applied the schema change.
  • Risk:
    this can break the main write path for any sink that combines partition routing with schema evolution.
  • Better fix:
    either broadcast SchemaChangeEvent to every partitionTargetTask, or explicitly reject the combination of partition strategy + schema evolution until the runtime support is complete.
  • Severity: High
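
The suggested fix can be illustrated with a small in-memory model. The sketch below is not the SinkPartitionExchange implementation; `ExchangeSketch`, its `Kind` enum, and the per-writer lists are hypothetical. It only shows the ordering rule under discussion: data rows are batched per hashed target, while a schema change or barrier first drains pending batches and is then delivered to every writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the exchange ordering rule; not the PR's classes.
final class ExchangeSketch {
    enum Kind { DATA, SCHEMA_CHANGE, BARRIER }

    final int writers;
    final List<List<String>> delivered = new ArrayList<>(); // what each writer has seen, in order
    final List<List<String>> pending = new ArrayList<>();   // per-writer batch not yet sent

    ExchangeSketch(int writers) {
        this.writers = writers;
        for (int i = 0; i < writers; i++) {
            delivered.add(new ArrayList<String>());
            pending.add(new ArrayList<String>());
        }
    }

    void received(Kind kind, String routingKey, String payload) {
        if (kind == Kind.DATA) {
            // Data rows: hash the routing key to exactly one target writer.
            int target = Math.floorMod(routingKey.hashCode(), writers);
            pending.get(target).add(payload);
            return;
        }
        // Control events: flush earlier data first, then broadcast to all writers.
        drain();
        for (List<String> log : delivered) {
            log.add(kind + ":" + payload);
        }
    }

    void drain() {
        for (int i = 0; i < writers; i++) {
            delivered.get(i).addAll(pending.get(i));
            pending.get(i).clear();
        }
    }
}
```

With this rule, a writer that receives post-DDL rows has always seen the schema change first, which is the property the missing regression test needs to assert.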

Issue 2: the new tests still miss the schema-change + remote-routing regression case

  • Location: seatunnel-engine/seatunnel-engine-server/src/test/java/org/apache/seatunnel/engine/server/task/flow/SinkPartitionExchangeTest.java:44
  • Why it matters:
    the current tests cover same-key routing and barrier alignment, but not the failing path above, so Issue 1 can slip through with all tests green.
  • Severity: Medium

Compatibility:

  • This is backward compatible for sinks that do not declare a partition strategy.
  • For sinks that do declare one, the current runtime contract is incomplete once schema evolution is involved.

Performance / side effects:

  • The extra hashing and writer-to-writer dispatch cost looks acceptable.
  • The bigger concern here is correctness, not CPU or memory.

Test stability:

  • The newly added UTs are structurally stable: they are in-memory tests without Thread.sleep, fixed ports, or container timing races.
  • The gap is coverage, not flakiness.

CI:

  • I did not run this locally.
  • The current Build check was still queued when I reviewed, but Issue 1 is already a source-level merge blocker independent of CI.

Merge conclusion: can merge after fixes

  1. Blocking items
  • Issue 1 must be fixed first, because the current schema-evolution path is incorrect once rows are routed to remote sink writers.
  2. Suggested follow-up
  • Add the missing regression test from Issue 2 so this runtime contract stays protected.

Overall, the feature direction is good and the planner/runtime split is clean. The one thing we should not merge as-is is the local-only schema-change handling.

@nzw921rx
Collaborator

This seems to conflict somewhat with STIP‑23. I think we should first come up with a design document. The backpressure PR also needs to be taken into consideration

@DanielLeens
Contributor

Thanks for raising the STIP-23 angle. I agree this needs to line up with the broader partition-routing and backpressure design rather than being patched in isolation. From my side, the current blocker from the latest review is still the same runtime gap: SchemaChangeEvent is consumed only locally while partitioned SeaTunnelRow records can be routed to remote sink writers. Since there is no new commit after that review yet, I am not starting another full re-review of this exact revision in this round. Once there is either a design write-up or a follow-up commit that closes the runtime path, I am happy to take another pass.

Contributor

@DanielLeens DanielLeens left a comment


Thanks for the update. I re-reviewed the latest head from scratch against the real Zeta sink routing path.

What this PR solves

  • User pain: today sink partitioning is mostly local to a writer task, so users cannot declare connector-level partition routing that redistributes rows across sink writers.
  • Fix approach: add planner/runtime support for SinkPartitionStrategy, remote writer exchange, and translator guards for unsupported engines.
  • One-line summary: the previous schema-change blocker is fixed on the latest head, and my remaining merge concern is CI rather than a new source-level correctness issue.

Runtime path I checked

upstream row / schema event / barrier
  -> SinkFlowLifeCycle.received(...)
      -> SinkPartitionExchange.received(...)
          -> data row: route to target writer
          -> schema change: drain batches, broadcast to all target writers
          -> barrier: drain batches, broadcast, align once per target writer
  -> remote target writer
      -> receivedSinkPartitionExchange(...)
      -> handleRecord(...)

Re-review result

  • The earlier blocker is fixed: SchemaChangeEvent is now broadcast through createBroadcastEnvelopes(...), and the new SinkPartitionExchangeTest.schemaChangeIsBroadcastBeforeRemoteData() covers the exact runtime path that was missing before.
  • I did not find a new source-level blocker in the current partition-routing logic.

CI

  • The current GitHub Build check is still red for this PR.
  • From the accessible metadata here I can confirm the red status, but I could not retrieve a detailed failing job log from the fork run via the available GitHub API surface in this environment, so I do not want to invent a root cause.

Conclusion: can merge after fixes

  1. Blocking items
  • Please bring the current Build back to green, or provide the failing job details if the red status is from unrelated fork/infra noise.
  2. Suggested non-blocking follow-up
  • None more important than closing the CI loop now that the source-level schema-change gap is fixed.

@yzeng1618
Member Author

yzeng1618 commented May 31, 2026

This seems to conflict somewhat with STIP‑23. I think we should first come up with a design document. The backpressure PR also needs to be taken into consideration

I rechecked this with STIP-23 in mind. For this PR, hash routing is limited to data rows. Barriers and schema changes are treated as control events, so they are flushed and broadcast instead of being routed by hash. I also updated the design notes to make this explicit. The backpressure/timer signal part is not handled here and should remain in the backpressure PR.

@yzeng1618 yzeng1618 requested a review from DanielLeens May 31, 2026 01:20
@DanielLeens
Contributor

Thanks for circling back with the STIP-23 clarification and the updated design note.

That explanation lines up with my latest re-review on head 5a789c128a387726aedc9972bc7aa37b217d3cbb: treating barriers and schema changes as control events, rather than hashing them like data rows, is exactly the distinction I needed to see called out explicitly.

Since there is still no new commit after that review, I am keeping this as a reply-only follow-up on the unchanged head instead of reopening a full source review. On my side the remaining merge gate is still CI rather than a new runtime-path blocker.

@nzw921rx
Collaborator

nzw921rx commented Jun 1, 2026

Apologies, my earlier comments may not have been clear enough. Let me reorganize my thoughts here.

I've identified a few points in the current approach that I think are worth discussing:

  • Architectural consistency

    The sink -> sink point-to-point shuffle effectively introduces an additional data channel outside the engine's existing data exchange, checkpoint alignment, and back-pressure paths, which may not fully align with Zeta's current unidirectional DAG architecture. Additionally, this approach may have limited reusability: if future data flows need to introduce scenarios such as primary-key-based routing, it could be challenging to extend from this foundation.

  • Cluster stability risk

    The current hash routing lacks a back-pressure feedback loop, which could make it easier for hot keys to overload a single writer. Combined with the synchronous blocking RPC (the dispatch side's .get() blocks the task thread), local skew could potentially escalate into a broader stall. On the receiving side, although SinkPartitionExchangeOperation obtains the target task through TaskExecutionService, the actual write executes synchronously on the HZ operation thread — if downstream responses slow down, it could occupy HZ operation threads. Worth noting is that this overhead scales as a product of the number of jobs and parallelism. When the cluster runs many concurrent jobs (e.g. 100), HZ operation threads can become heavily saturated. Since this thread pool is shared across the entire cluster (heartbeats, checkpoints, cluster management all depend on it), if heavily saturated, there could be a risk of heartbeat timeouts, false node-down detection, and cascading failover.

  • Failover data loss

    Because the sink -> sink point-to-point channel is outside the engine's checkpoint scope, data that has been sent but not yet persisted could potentially be lost during failover, and the existing recovery mechanism may not be able to replay it. If exactly-once or at-least-once recovery needs to be supported in the future, the room for retrofitting might be quite limited.

  • A thought on direction

    I'm personally leaning towards not adding a point-to-point channel on the Sink side, and perhaps holding off on introducing a standalone exchange operator at this stage. A direction worth considering is to have SourceSplitEnumerator handle routing and split strategies on the Source side, which would align better with the engine's existing mechanisms.

Based on the above considerations, my suggestion is to produce a detailed design first, covering at least the following:

  • The complete document_id data flow (assignment, routing, delivery, landing), including the addressing relationship between each parallelism instance and vector database partitions
  • Checkpoint alignment can be considered as a follow-up iteration (state boundaries, snapshot timing, recovery consistency)

The above is solely my personal opinion and I welcome any corrections @yzeng1618 @davidzollo @zhangshenghang .

@DanielLeens
Contributor

Thanks for reorganizing the concerns here, and I appreciate the level of detail. These are thoughtful architecture questions, especially around how a Sink->Sink side channel relates to checkpoint alignment, backpressure propagation, and shared HZ operation threads.

I want to be careful not to imply that a new full re-review happened on the same unchanged head, though. The current PR head is still 5a789c128a387726aedc9972bc7aa37b217d3cbb, which is the same revision I already re-reviewed on May 31. So my full-review conclusion on that exact head is still the one from that review, rather than a newly expanded blocker list on the same code version.

That said, I do think your discussion points are important input for the next revision. If the author decides to address them with either a design write-up or a code update, please push that as a new commit and I will re-review the latest head from scratch again against the full Zeta runtime path. Until then, from my side this stays in reply-only / awaiting-update state rather than a new full review on the unchanged head.

@yzeng1618
Member Author

I re-checked the current implementation against these concerns.

The current SinkPartitionExchange path does introduce a sink-side point-to-point channel: remote delivery goes through Hazelcast task operations, the dispatch side waits on the operation future, and the receive side can synchronously enter the target SinkWriter path from the operation call stack. The exchange also has local buffering/alignment state that is not part of checkpoint state. So I agree this should not be treated as the production Knowledge Sync routing direction without a fuller design.

I have drafted a design document first, focusing on the complete document_id flow: assignment, routing, delivery, landing, and the relationship between route buckets, parallel subtasks, and vector-store partitions/logical buckets. The proposed direction is to move the first production routing path toward SourceSplitEnumerator/source-side document assignment, and defer both the sink-side point-to-point channel and a standalone keyed exchange operator until checkpoint, back-pressure, and recovery semantics are designed as first-class engine behavior.

Checkpoint alignment and recovery consistency are called out as a follow-up iteration with explicit state boundaries.

@yzeng1618
Member Author

Background

Knowledge Sync requires document-scoped lifecycle decisions. A vector lifecycle sink needs to read the old chunks for one document, compare hashes, delete stale chunks, upsert changed chunks, and skip unchanged chunks. If chunks for the same document_id are processed by multiple sink writers, those writers can make conflicting refresh/delete decisions.

The current implementation direction added a sink-declared partition strategy and a Zeta runtime component named SinkPartitionExchange. The mechanism is useful as a focused experiment, but the current runtime path introduces a sink-to-sink point-to-point data channel:

upstream flow -> local sink writer -> SinkPartitionExchange -> remote sink writer

That path is outside the normal Zeta source/transform/sink data exchange boundary. It uses Hazelcast task operations for cross-task delivery, blocks the sending task while waiting for remote dispatch completion, and executes the receiving write path synchronously inside the Hazelcast operation call stack. It also does not snapshot or restore exchange-local buffers and sequence state.

This design proposes a safer first production direction: keep document routing on the source side through SourceSplitEnumerator and normal engine dataflow, and defer a standalone engine exchange operator until its checkpoint and back-pressure semantics are designed as first-class runtime behavior.

Goals

  • Define the complete document_id data flow: assignment, routing, delivery, and landing.
  • Keep Knowledge Sync routing aligned with existing source split assignment, task execution, checkpoint, and back-pressure mechanisms.
  • Ensure one document belongs to one lifecycle decision domain.
  • Make the addressing relationship explicit between document routing buckets, source readers, sink writers, and vector database partitions or logical buckets.
  • Treat checkpoint alignment and recovery as a follow-up design area with clear state boundaries from the start.

Non-Goals

  • Do not extend the current sink-side point-to-point exchange for production lifecycle sync in this design.
  • Do not introduce a standalone keyed exchange operator in the first iteration.
  • Do not implement Qdrant or Milvus lifecycle behavior here.
  • Do not promise exactly-once vector-store visibility until checkpoint-driven sink lifecycle semantics are designed and verified.
  • Do not require every existing source to support document-level routing. Sources that cannot produce complete document splits should fail fast for lifecycle sync.

Current Sink-Side Exchange Assessment

The current sink-side exchange has already addressed some local ordering issues: checkpoint barriers and schema-change events are flushed and broadcast instead of being hashed as ordinary data records. This is important, but it does not make the exchange equivalent to an engine-native data channel.

Observed risks:

  • Architecture drift: routing happens inside the sink lifecycle after data has already reached a sink task. This creates an extra data plane that is not represented as a normal Zeta DAG edge.
  • Shared operation pressure: remote delivery uses Hazelcast operations. The same operation service is also used by worker heartbeats, checkpoint ACKs, task status, and cluster management paths.
  • Blocking dispatch: remote dispatch waits for the operation future. A slow target writer can block the sending task thread.
  • Synchronous receive path: the operation handler calls the target task and can reach SinkWriter.write(...) synchronously in the operation thread.
  • Recovery gap: exchange-local pending batches, per-target sequence numbers, and barrier alignment bookkeeping are not part of checkpoint state.
  • Limited reuse: future keyed routing scenarios, such as primary-key based routing, would need a more general engine data exchange model rather than a sink-specific transport.

Therefore this design treats sink-side point-to-point routing as unsuitable for the first production Knowledge Sync routing path.

Proposed Direction

Use source-side document routing controlled by SourceSplitEnumerator.

The core rule is:

A document_id is assigned to exactly one route bucket, and all rows derived from that document are emitted by the reader that owns that bucket.

For the first lifecycle implementation, the job should use a pointwise topology where the source reader subtask, transform subtask chain, and sink writer subtask for the same route bucket stay aligned. If the planner cannot guarantee this alignment, the lifecycle sink must fail fast instead of falling back to best-effort routing.

SourceSplitEnumerator
        |
        | document_id -> route bucket
        v
SourceReader[route bucket]
        |
        | normal Zeta dataflow
        v
Transform chain[route bucket]
        |
        | normal Zeta dataflow
        v
Lifecycle SinkWriter[route bucket]
        |
        v
Vector database partition or logical bucket

This keeps routing in the same place where SeaTunnel already reasons about parallelism, split assignment, checkpointed source state, and failover.

Routing Model

Routing Key

The primary routing key is the physical field document_id.

The document_id must be stable across job restarts for the same logical document. For file-backed documents, a recommended default is:

document_id = "doc_" + sha256(canonical_source_uri)
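
As a minimal sketch of the recommended default, the formula can be written with the JDK's SHA-256 implementation. The class name `DocumentIds` is hypothetical, the hex encoding of the digest is an assumption (the design does not specify one), and the caller is assumed to have already canonicalized the URI, since the canonicalization rules are not defined here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; assumes the URI passed in is already canonical.
final class DocumentIds {
    static String documentId(String canonicalSourceUri) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonicalSourceUri.getBytes(StandardCharsets.UTF_8));
            StringBuilder id = new StringBuilder("doc_");
            for (byte b : digest) {
                // Lower-case hex, two characters per byte (encoding choice is an assumption).
                id.append(String.format("%02x", b & 0xff));
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the input is only the canonical URI, the same logical document yields the same document_id across job restarts, which is the stability property required above.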

Connectors may provide stronger source-native identities when available, such as database primary keys, object version IDs, or external document IDs. The selected identity must be included in the source/enumerator checkpoint state when it affects pending work assignment.

Route Bucket

The route bucket is deterministic:

route_bucket = floorMod(stableHash(document_id), route_parallelism)

route_parallelism is fixed when the job is planned and must be persisted or reconstructible during recovery. Changing route parallelism across restore is an incompatible state migration unless a separate rebalance protocol is introduced.
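
A minimal sketch of the bucket computation follows. The design does not name a concrete stableHash, so CRC32 over the UTF-8 bytes is used here purely as an assumption; `RouteBuckets` is a hypothetical class name. The point of `Math.floorMod` is that the result stays in `[0, route_parallelism)` even when the hash is negative as a Java int.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper; CRC32 is an assumed choice of stable hash, not part of the design.
final class RouteBuckets {
    static int stableHash(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        // Narrowing to int may produce a negative value; floorMod below handles that.
        return (int) crc.getValue();
    }

    static int routeBucket(String documentId, int routeParallelism) {
        if (routeParallelism <= 0) {
            throw new IllegalArgumentException("route parallelism must be positive");
        }
        return Math.floorMod(stableHash(documentId), routeParallelism);
    }
}
```

Using the plain `%` operator instead of `floorMod` would return negative buckets for roughly half of all inputs, so the choice is not cosmetic.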

Assignment

The enumerator discovers documents and assigns each document-level split to the reader that owns the route bucket.

document split:
  source_uri
  source_version
  document_id
  document_hash candidate
  route_bucket
  delete marker, if the document disappeared

The split must represent a whole document or a whole document lifecycle event. Chunk-level split assignment is not allowed in lifecycle mode because it can split one document across multiple decision domains.

Complete Data Flow

1. Discovery

The source enumerator discovers source objects, files, or source records. It normalizes the source address and resolves the document identity.

Required output from discovery:

  • source_uri
  • document_id
  • optional source_version
  • optional source-native modified timestamp
  • delete marker for removed documents when the source supports deletion

2. Routing Assignment

The enumerator computes route_bucket from document_id and assigns the document split to the reader mapped to that bucket.

If no reader is registered for a bucket, the split remains pending in enumerator state until the reader is available. On failover, pending and assigned-but-not-checkpointed splits are reconstructed from the latest checkpoint and reassigned using the same deterministic rule.
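
The pending-until-registered rule can be sketched as follows. This is an in-memory illustration, not SourceSplitEnumerator code: `EnumeratorSketch` and its methods are hypothetical, document IDs stand in for full document splits, and reader `i` is assumed to own bucket `i` (the pointwise mapping described earlier).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of bucket-owned assignment; assumes reader i owns route bucket i.
final class EnumeratorSketch {
    private final Map<Integer, Deque<String>> pendingByBucket = new HashMap<>();
    private final Map<Integer, List<String>> assignedByReader = new HashMap<>();
    private final Set<Integer> registeredReaders = new HashSet<>();

    // A discovered document is queued under its bucket and assigned only if the owner is present.
    void discover(String documentId, int routeBucket) {
        pendingByBucket.computeIfAbsent(routeBucket, b -> new ArrayDeque<>()).add(documentId);
        assignPending(routeBucket);
    }

    void registerReader(int readerId) {
        registeredReaders.add(readerId);
        assignPending(readerId);
    }

    private void assignPending(int bucket) {
        if (!registeredReaders.contains(bucket)) {
            return; // stays pending in enumerator state
        }
        Deque<String> queue = pendingByBucket.remove(bucket);
        if (queue != null) {
            assignedByReader.computeIfAbsent(bucket, b -> new ArrayList<>()).addAll(queue);
        }
    }

    List<String> assignedTo(int readerId) {
        List<String> assigned = assignedByReader.get(readerId);
        return assigned == null ? Collections.<String>emptyList() : assigned;
    }

    int pendingCount(int bucket) {
        Deque<String> queue = pendingByBucket.get(bucket);
        return queue == null ? 0 : queue.size();
    }
}
```

A real enumerator would also snapshot the pending and assigned-but-unconfirmed sets; that part is left out of this sketch.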

3. Source Delivery

The source reader reads the whole document split and emits one or more rows. Every emitted row for that document must contain the physical routing and lifecycle fields required downstream:

  • document_id
  • document_hash
  • source_uri
  • source_version, if available
  • deleted
  • chunk_id
  • chunk_hash
  • chunk_index
  • content fields used by parse/chunk/embedding

For a delete event, the reader may emit a compact tombstone row with deleted = true, document_id, and enough target context for the lifecycle sink to delete all existing chunks for that document.

4. Transform Delivery

Transforms must preserve the physical routing fields. Document parse, chunking, metadata projection, and embedding transforms may add fields, but they must not drop or rename document_id unless the job explicitly maps it to another physical field before lifecycle routing is enabled.

If a transform can expand one document into many rows, all expanded rows remain inside the same route bucket because expansion happens after the source reader has been assigned the document split.

5. Sink Landing

The lifecycle sink writer receives all rows for one document through the route bucket assigned to its subtask. The writer performs document-scoped lifecycle logic:

read old chunks for document_id
compare chunk_hash and document_hash
delete stale chunks
upsert changed or new chunks
skip unchanged chunks
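
The steps above can be sketched as a pure planning function over chunk hashes. This is an illustration only: `LifecyclePlanSketch` is a hypothetical name, both maps go from chunk_id to chunk_hash for a single document_id, and the actual reads and writes against the vector store are out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical planner for one document: maps are chunk_id -> chunk_hash.
final class LifecyclePlanSketch {
    final List<String> delete = new ArrayList<>();
    final List<String> upsert = new ArrayList<>();
    final List<String> skip = new ArrayList<>();

    static LifecyclePlanSketch plan(Map<String, String> oldChunks, Map<String, String> newChunks) {
        LifecyclePlanSketch plan = new LifecyclePlanSketch();
        // TreeMap gives a deterministic iteration order, which keeps replays comparable.
        for (Map.Entry<String, String> chunk : new TreeMap<>(newChunks).entrySet()) {
            if (chunk.getValue().equals(oldChunks.get(chunk.getKey()))) {
                plan.skip.add(chunk.getKey());   // unchanged chunk
            } else {
                plan.upsert.add(chunk.getKey()); // new or changed chunk
            }
        }
        for (String chunkId : new TreeMap<>(oldChunks).keySet()) {
            if (!newChunks.containsKey(chunkId)) {
                plan.delete.add(chunkId);        // stale chunk
            }
        }
        return plan;
    }
}
```

Replanning after the plan has been applied yields only skips, which is the idempotent behavior a replayed document needs. The plan is only safe when a single writer sees all chunks of the document, which is why the routing invariant matters.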

The sink writer must use idempotent target identifiers:

point_id = chunk_id
payload.document_id = document_id
payload.chunk_hash = chunk_hash

For target systems with physical partitions, such as Milvus, the writer may map the route bucket to a configured partition:

vector_partition = partition_prefix + "_" + route_bucket

For target systems without a matching physical partition concept, such as a single Qdrant collection, the route bucket is a SeaTunnel ownership boundary and document_id remains a payload filter/delete key. In both cases, the addressing contract is the same from SeaTunnel's perspective.

Addressing Relationship

Example with route_parallelism = 4:

| Document | Route bucket | Source reader | Transform chain | Sink writer | Vector target |
| --- | --- | --- | --- | --- | --- |
| doc_A | hash(doc_A) % 4 = 1 | reader-1 | subtask-1 | writer-1 | partition/logical-bucket-1 |
| doc_B | hash(doc_B) % 4 = 3 | reader-3 | subtask-3 | writer-3 | partition/logical-bucket-3 |
| doc_C | hash(doc_C) % 4 = 1 | reader-1 | subtask-1 | writer-1 | partition/logical-bucket-1 |

The invariant is not that each bucket has only one document. The invariant is that each document has only one bucket.

Back-Pressure and Hot Keys

This design relies on the existing dataflow back-pressure path instead of adding Hazelcast data operations from sink writer to sink writer. A slow lifecycle sink writer slows its normal upstream chain.

Hot keys are still possible. A single large document or a small number of very large documents can overload one bucket. The first production version should handle this as an explicit lifecycle trade-off:

  • do not split one document across buckets in lifecycle mode;
  • expose per-bucket document count, row count, bytes, and lifecycle latency metrics;
  • optionally fail fast or warn when one document exceeds configured size or chunk-count thresholds;
  • allow users to increase route parallelism before the job starts.

Dynamic hot-key migration is out of scope because moving a live document between writers would require a checkpointed rebalance protocol.

Checkpoint and Recovery Boundaries

Checkpoint alignment can be implemented in a follow-up iteration, but the state boundaries should be fixed now.

Source Enumerator State

The enumerator owns:

  • discovered but unassigned document splits;
  • assigned splits that have not been checkpoint-confirmed;
  • route parallelism and route bucket mapping inputs;
  • source version cursors or listing progress;
  • delete markers that still need delivery.

Source Reader State

Each reader owns:

  • currently processing document split;
  • read offset inside the document, if the source supports resumable reads;
  • emitted chunk progress if parsing/chunking happens inside the reader.

If the reader cannot resume inside a document, it may replay the whole document from the last checkpoint. The sink lifecycle path must therefore be idempotent by document_id and chunk_id.

Sink Writer State

The sink writer owns:

  • prepared but not yet committed lifecycle operations, if the target sink uses a checkpoint-gated commit model;
  • target transaction handles or staging metadata, if supported;
  • retryable commit/delete/upsert intent records when external operations are delayed until checkpoint completion.

Visible external side effects before checkpoint completion must be either idempotent or staged. A sink that cannot satisfy this must document at-least-once visibility and pass tests for duplicate replay.

Recovery Scenarios

  • Failure before checkpoint completion: source state rolls back to the last completed checkpoint; document splits may be replayed; sink writes must be idempotent or staged.
  • Failure after checkpoint completion but before cleanup: sink commit or cleanup may be retried; commit/delete/upsert operations must tolerate repeated attempts.
  • Reader failure during a large document: the document may restart from the last checkpointed reader offset, or from the beginning if no offset is available.
  • Parallelism change on restore: not supported in the first version for lifecycle routing because it changes route bucket ownership.

Planner Requirements

For a lifecycle sink that requires document routing, the planner should validate:

  • document_id exists as a physical field before lifecycle sink input.
  • Source supports document-level split assignment or declares that it cannot run lifecycle routing.
  • Source, transform, and sink route parallelism are aligned for the first implementation.
  • No transform between source and lifecycle sink performs repartitioning or drops routing fields.
  • Non-Zeta engines either preserve source split assignment to sink writer ownership or fail fast with a clear error.
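
A rough sketch of how these checks could be collected and turned into a fail-fast error is shown below. None of this is existing planner code: `LifecyclePlannerChecks`, its parameters, and the error messages are hypothetical, and the job topology is reduced to a few primitive inputs for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validation helper; inputs are a simplified stand-in for the planned job.
final class LifecyclePlannerChecks {
    static List<String> validate(
            List<String> sinkInputFields,
            boolean sourceSupportsDocumentSplits,
            int sourceParallelism,
            int transformParallelism,
            int sinkParallelism,
            boolean hasRepartitioningTransform) {
        List<String> errors = new ArrayList<>();
        if (!sinkInputFields.contains("document_id")) {
            errors.add("document_id is not a physical field of the lifecycle sink input");
        }
        if (!sourceSupportsDocumentSplits) {
            errors.add("source does not support document-level split assignment");
        }
        if (sourceParallelism != transformParallelism || transformParallelism != sinkParallelism) {
            errors.add("source, transform, and sink route parallelism are not aligned");
        }
        if (hasRepartitioningTransform) {
            errors.add("a transform between source and sink repartitions rows");
        }
        return errors;
    }

    // Fail fast with every violation listed, rather than falling back to best-effort routing.
    static void failFast(List<String> errors) {
        if (!errors.isEmpty()) {
            throw new IllegalStateException("Lifecycle routing rejected: " + String.join("; ", errors));
        }
    }
}
```

Reporting all violations at once, rather than stopping at the first, is a design choice of this sketch so that a user can fix the job configuration in one pass.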

Alternatives Considered

A. Keep Sink-Side Point-to-Point Exchange

This preserves the current implementation direction and can route rows even when source and sink parallelism differ. It is not recommended for production Knowledge Sync because it adds a sink-local data channel outside the normal DAG, uses Hazelcast operations for data transfer, and lacks exchange-level checkpoint state.

B. Add a Standalone Engine Keyed Exchange

This is the most general long-term engine solution. It could support document_id, primary keys, and other keyed flows. It should be designed as a first-class Zeta operator with explicit queueing, checkpoint, back-pressure, metrics, and failover semantics. It is deferred because introducing it now would expand the scope beyond the immediate Knowledge Sync design.

C. Source-Side Document Assignment

This is the recommended first production path. It reuses the existing SourceSplitEnumerator model, keeps routing before row fan-out, avoids sink-to-sink data transfer, and gives recovery a natural source-state boundary. Its main limitation is that it requires source/sink route alignment and whole document splits.

Implementation Slices

  1. ADR and contract update

    • Freeze lifecycle routing as source/enumerator-led for the first production path.
    • Mark sink-side exchange as experimental or not used by lifecycle sinks.
  2. Document metadata contract

    • Finalize DocumentId, DocumentHash, ChunkId, ChunkHash, and physical field mappings.
    • Ensure transforms preserve these fields.
  3. Document split assignment

    • Add or extend source-side document split strategy for file/Markdown first.
    • Add deterministic bucket assignment in the enumerator.
  4. Lifecycle sink validation

    • Validate required fields and aligned route parallelism.
    • Fail fast for unsupported sources or topologies.
  5. Qdrant lifecycle V1

    • Use document_id and chunk_id for read/compare/delete/upsert/skip.
    • Treat route bucket as ownership boundary even if Qdrant uses one collection.
  6. Checkpoint follow-up

    • Define checkpoint-gated lifecycle side effects.
    • Add failover and replay tests before claiming production recovery semantics.

Verification Plan

  • Unit test deterministic document_id -> route_bucket assignment.
  • Unit test enumerator restore reassigns pending documents to the same bucket.
  • Integration test same document chunks reach one lifecycle sink writer.
  • Integration test different documents can use different writers.
  • Delete-event test ensures tombstones route to the same bucket as the original document.
  • Failover test replays from checkpoint without losing document lifecycle events.
  • Duplicate replay test verifies idempotent chunk_id upsert/delete behavior.
  • Skew test records per-bucket metrics and validates warning/fail-fast behavior for oversized documents.

@DanielLeens
Contributor

Thanks for taking the time to write this up. The source-side document assignment direction is much clearer to me than continuing to harden the current sink-side point-to-point exchange as the first production path. In particular, spelling out the route-bucket ownership, checkpoint boundaries, and the reason to defer a standalone keyed exchange directly addresses the architecture concerns that were raised earlier.

Since the PR head is still the same 5a789c128a387726aedc9972bc7aa37b217d3cbb revision that I already reviewed, I do not want to imply that a new full re-review happened on unchanged code. So from my side this stays in reply-only / awaiting-update state rather than becoming a new full review on the current head.

If you decide to keep this PR as the implementation vehicle, the next step needs to be a code update that matches the design direction and makes the intended scope explicit in the PR itself. Once there is a new commit, I am happy to re-review the latest head from scratch against the full Zeta runtime path.


3 participants