Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
266 changes: 266 additions & 0 deletions docs/en/transforms/multi-table-transform-and-join-boundary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
---
sidebar_position: 16
---

# Multi-Table Transform Capability Boundary

## Overview

SeaTunnel's **multi-table transform** feature allows a single transform node to process multiple
tables flowing from an upstream source (typically a CDC connector) in one pipeline. This page
documents precisely what is supported, what is not, and what alternatives to use when you hit a
capability boundary.

---

## 1. What Is a Multi-Table Transform?

In a standard single-table pipeline, one Source feeds one Transform chain feeds one Sink.
In a multi-table pipeline, a single Source (e.g., MySQL-CDC) emits records from **many tables**
simultaneously, and each downstream Transform or Sink must declare which table(s) it applies to.

```
MySQL-CDC ──► FieldMapper (orders table) ──► Kafka Sink (orders topic)
├──► FieldMapper (users table) ──► Kafka Sink (users topic)
└──► (unmatched tables pass through) ──► Elasticsearch Sink
```

---

## 2. Capability Boundary Table

| Capability | Supported | Notes |
|---|---|---|
| Per-table field rename / map | ✅ Yes | Use `FieldMapper` with `plugin_input` and `table_match_regex` |
| Per-table column filtering | ✅ Yes | Use `Filter` with `plugin_input` and `table_match_regex` |
| Per-table type casting | ✅ Yes | Use `FieldMapper` with `define_sink_type` option |
| Per-table SQL transform (single table) | ✅ Yes | Use `SQL` transform with `plugin_input`; scope it with `table_match_regex` when needed |
| `TableMerge` — merge multiple tables into one | ✅ Yes | Tables must share compatible schema |
| `TableRename` — rename tables in the stream | ✅ Yes | Works well for routing to Sink by table name |
| Row-level filtering (filter by `rowkind`) | ✅ Yes | Use `FilterRowKind` transform |
| Cross-table SQL JOIN | ❌ Not supported | See Section 4 for alternatives |
| Aggregation across multiple CDC tables | ❌ Not supported | Aggregate downstream in OLAP engine |
| Generating new tables from a JOIN result | ❌ Not supported | Use a dedicated SQL engine |
| Applying one transform to ALL tables wildcard | ⚠️ Partial | `TableMerge` then single transform; schema must be compatible |
| Changing schema mid-stream (DDL events) | ⚠️ Limited | Depends on sink; some sinks handle schema evolution; transforms do not |
| Nested JSON field extraction per table | ✅ Yes | Use `JsonPath` transform with `plugin_input` and `table_match_regex` |

---

## 3. TableMerge vs SQL Join

These two are the most commonly confused features.

### 3.1 `TableMerge` Transform

`TableMerge` **merges the row streams of multiple tables into a single result table**. All
input tables must have the same (or compatible) schema. Use it to route all tables to one Sink.

```json
{
"plugin_name": "TableMerge",
"plugin_input": ["orders_2023", "orders_2024"],
"plugin_output": "all_orders",
"merge_by_field": true
}
```

**When to use**: Fan-in from multiple source tables that have the same structure (e.g., sharded
tables, multi-year partitions, or multi-database tables with identical schema).

**When NOT to use**: When tables have different schemas that you need to correlate or enrich
from each other — that requires a JOIN.

### 3.2 SQL JOIN (not natively supported in multi-table pipelines)

A SQL JOIN correlates rows from two different tables based on a key. **SeaTunnel's `SQL`
transform does NOT support cross-table JOIN inside a multi-table streaming pipeline.**

Attempting to JOIN records from two upstream tables inside a single SQL transform is not
supported and will result in a configuration error.

**Recommended alternatives**:
- Write both tables to a shared data lake or warehouse (e.g., Hudi, Iceberg, ClickHouse), then
run the JOIN there
- Use Apache Flink with SeaTunnel's Flink connector for stateful JOIN operations
- Materialise the "dimension" table into a lookup cache (e.g., Redis, RocksDB) and use a custom
transform for enrichment

---

## 4. Cross-Source JOIN Limitation

SeaTunnel does **not** support streaming JOINs where the two input sides come from **different
sources** (e.g., joining MySQL-CDC with a PostgreSQL-CDC stream).

| Scenario | Supported |
|---|---|
| Single-source multi-table pass-through | ✅ |
| Single-source TableMerge (same schema) | ✅ |
| Cross-source JOIN (MySQL-CDC + PG-CDC) | ❌ |
| Cross-source JOIN (CDC + JDBC batch) | ❌ |
| Same-source JOIN on two different tables | ❌ |

**Workaround**: Write both sources to a common sink (Kafka, Iceberg, etc.) and perform the JOIN
downstream in a dedicated SQL engine (Flink, Spark, ClickHouse, etc.).

---

## 5. Per-Table Transform Configuration Pattern

When you need different transforms for different tables from the same source, declare separate
transform blocks that share the same `plugin_input` and use different `table_match_regex`
rules:

```json
{
"env": {
"job.name": "cdc-multi-table",
"job.mode": "STREAMING"
},
"source": [
{
"plugin_name": "MySQL-CDC",
"plugin_output": "cdc_stream",
"base-url": "jdbc:mysql://localhost:3306/mydb",
"username": "cdc_user",
"password": "password",
"database-names": ["mydb"],
"table-names": ["mydb.orders", "mydb.users", "mydb.products"]
}
],
"transform": [
{
"plugin_name": "FieldMapper",
"plugin_input": ["cdc_stream"],
"plugin_output": "orders_mapped",
"field_mapper": { "order_id": "id", "order_amount": "amount" },
"table_match_regex": "mydb\\.orders"
},
{
"plugin_name": "FieldMapper",
"plugin_input": ["cdc_stream"],
"plugin_output": "users_mapped",
"field_mapper": { "user_id": "id", "user_email": "email" },
"table_match_regex": "mydb\\.users"
}
],
"sink": [
{
"plugin_name": "Kafka",
"plugin_input": ["orders_mapped"],
"topic": "orders"
},
{
"plugin_name": "Kafka",
"plugin_input": ["users_mapped"],
"topic": "users"
},
{
"plugin_name": "Kafka",
"plugin_input": ["cdc_stream"],
"topic": "products",
"table_match_regex": "mydb\\.products"
}
]
}
```

---

## 6. Common Fields Example (Shared Schema)

If multiple tables share a common set of fields, you can use `TableMerge` to combine them and
apply a single transform:

```json
"transform": [
{
"plugin_name": "TableMerge",
"plugin_input": ["cdc_stream"],
"plugin_output": "all_events",
"table_match_regex": "mydb\\.(orders|payments|refunds)"
},
{
"plugin_name": "FieldMapper",
"plugin_input": ["all_events"],
"plugin_output": "all_events_mapped",
"field_mapper": { "created_at": "event_time", "event_type": "type" }
}
]
```

This works only when all three tables (`orders`, `payments`, `refunds`) share `created_at` and
`event_type` fields. If schemas differ, `TableMerge` will fail at runtime.

---

## 7. EtLT Patterns with Multi-Table Transform

**EtLT** (Extract, light-transform, Load, then Transform in the warehouse) is the recommended
pattern when SeaTunnel's transform layer cannot fulfil the full transformation requirement:

```
CDC Source
Light transforms (field rename, type cast, row filter)
Data Lake / Warehouse (Hudi / Iceberg / ClickHouse)
Heavy transforms (JOINs, aggregations, complex SQL)
in dbt / Flink SQL / Spark SQL
```

Use SeaTunnel's transform layer for:
- Field rename / filtering
- Type normalisation
- Row-level filtering
- Schema routing (different tables → different topics/tables)

Offload to downstream:
- Cross-table JOINs
- Aggregations
- Pivot / unpivot
- Enrichment from dimension tables

---

## 8. FAQ

**Q: Can I apply one transform to ALL tables without specifying each one?**

Not directly. Use `TableMerge` to combine tables with compatible schemas first, then apply a
single transform to the merged result. If schemas differ, you must use separate transform blocks,
typically with different `table_match_regex` rules.

**Q: Does the `SQL` transform support `JOIN`?**

No. The `SQL` transform only supports single-table queries (SELECT, WHERE, expressions). For
JOINs, use an external SQL engine after loading the data into a sink.

**Q: What happens to tables that are not matched by any transform?**

Unmatched tables continue flowing through the pipeline and can be captured by a Sink that
uses the right `plugin_input` and `table_match_regex`.

**Q: Can I add a new table to an existing CDC pipeline without downtime?**

This depends on the connector. MySQL-CDC supports dynamic table discovery in some configurations,
but adding a transform for a new table requires a pipeline restart. Use `stop-with-savepoint`
to minimise data loss (see [REST API v2 Reference](../engines/zeta/rest-api-v2.md)).

---

## See Also

- [TableMerge Transform Reference](table-merge.md)
- [TableRename Transform Reference](table-rename.md)
- [transform-multi-table Reference](transform-multi-table.md)
- [Multi-Table Architecture Overview](../architecture/features/multi-table.md)
- [CDC Pipeline Architecture](../architecture/cdc-pipeline-architecture.md)
- [REST API v2 Reference](../engines/zeta/rest-api-v2.md)
3 changes: 2 additions & 1 deletion docs/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -262,7 +262,8 @@ const sidebars = {
"transforms/table-filter",
"transforms/table-merge",
"transforms/table-rename",
"transforms/transform-multi-table"
"transforms/transform-multi-table",
"transforms/multi-table-transform-and-join-boundary"
]
},
{
Expand Down
Loading
Loading