[Fix][E2E] Stabilize engine failover test and rebalance connector shards #10949

Open
DanielLeens wants to merge 3 commits into apache:dev from DanielLeens:david_fix_pr10925_ci_local

Conversation

@DanielLeens
Contributor

Why

This PR carries over the remaining actionable CI fixes from the previous CI follow-up PR line and keeps them separate from the DB2 work.

It addresses two concrete failures observed from the split CI PR line:

  1. engine-v2-it (11) could fail in ClusterFailureNoRestoreIT because the batch job sometimes finished before the test actually shut down the worker.
  2. all-connectors-it-2 (11) could lose the hosted runner heartbeat because the shard was too heavy, especially when connector-iceberg-e2e and connector-hbase-e2e ran together with the rest of part-2.

What is changed

  • Stabilize ClusterFailureNoRestoreIT by:
    • keeping the batch source busy longer with a larger row count per parallelism
    • waiting for the job to reach RUNNING
    • waiting for observable output progress before shutting down the worker
  • Rebalance connector CI shards by:
    • removing connector-iceberg-e2e and connector-hbase-e2e from all-connectors-it-2
    • adding a new dedicated all-connectors-it-8 job for those two heavier suites
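
The ordering described for ClusterFailureNoRestoreIT (job reaches RUNNING, output progress is visible, and only then the worker is shut down) can be sketched as a small polling gate. This is a minimal plain-Java illustration, not the code in this PR: `FailoverGate`, `waitUntil`, and `safeToKillWorker` are hypothetical names, and the job status and row count are assumed to come from whatever the test already observes.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper; names and signatures are illustrative assumptions.
final class FailoverGate {
    private FailoverGate() {}

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    /**
     * The worker should only be shut down once the job is RUNNING and has
     * produced at least one row; otherwise a fast batch job may already be done.
     */
    static boolean safeToKillWorker(String jobStatus, long rowsWritten) {
        return "RUNNING".equals(jobStatus) && rowsWritten > 0;
    }
}
```

A test would call `waitUntil(() -> safeToKillWorker(status, rows), ...)` and fail if it returns false, rather than shutting down the worker after a fixed sleep. The larger row count per parallelism widens the window in which this gate can be satisfied before the job finishes.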

Validation

Executed in the local checkout used for this PR:

  • ./mvnw spotless:apply -nsu -Dmaven.gitcommitid.skip=true -T 3C
  • git diff --check -- .github/workflows/backend.yml seatunnel-e2e/seatunnel-engine-e2e/connector-seatunnel-e2e-base/src/test/java/org/apache/seatunnel/engine/e2e/ClusterFailureNoRestoreIT.java
  • Recomputed the workflow split and verified the new layout:
    • all-connectors-it-2: :connector-assert-e2e,:connector-file-cos-e2e,:connector-rabbitmq-e2e,:connector-easysearch-e2e,:connector-qdrant-e2e,:connector-aerospike-e2e
    • all-connectors-it-8: :connector-iceberg-e2e,:connector-hbase-e2e
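
Shard module lists like the two above are comma-separated, so stripping a module by matching ",module" misses a module that happens to sit first in the list. The sketch below illustrates that pitfall and a position-independent alternative in Java. It is an illustration only: the workflow itself is shell-based, and `ShardModules` and both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; not part of backend.yml or the SeaTunnel code base.
final class ShardModules {
    private ShardModules() {}

    /**
     * Naive removal that only matches ",module". A module in first position
     * has no leading comma, so it is silently left in the list.
     */
    static String removeWithLeadingComma(String modules, String module) {
        return modules.replace("," + module, "");
    }

    /**
     * Position-independent removal: split on commas, drop the unwanted
     * modules (and any empty tokens), then join the rest back together.
     */
    static String remove(String modules, Set<String> toRemove) {
        return Arrays.stream(modules.split(","))
                .filter(m -> !m.isEmpty() && !toRemove.contains(m))
                .collect(Collectors.joining(","));
    }
}
```

For the list `:connector-hbase-e2e,:connector-assert-e2e`, the naive variant returns the input unchanged when asked to remove `:connector-hbase-e2e`, while the token-based variant returns `:connector-assert-e2e`.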

Not fully revalidated in this run

I also attempted a focused local runtime validation for ClusterFailureNoRestoreIT, but the module is currently blocked by an unrelated upstream compile issue in LocalModeIT (SeaTunnelClient#getHealthMetrics(String) is missing from the current API surface in this branch line). That issue is outside the scope of this PR, so this PR keeps the fix focused on the two current actionable CI failures only.

@davidzollo davidzollo marked this pull request as draft May 25, 2026 13:01
@davidzollo davidzollo marked this pull request as ready for review May 30, 2026 05:29
DanielLeens and others added 3 commits June 1, 2026 12:27
… cleanup

- backend.yml: add first-position removal patterns for connector-iceberg-e2e
  and connector-hbase-e2e so they are correctly stripped even when sorted
  first in the shard module list (previous //,module/ pattern silently missed
  first-position modules that have no leading comma)
- KafkaIT.java: replace hardcoded jobId "18696753645413" in
  testRestoreKafkaToKafkaExactlyOnceOnStreaming with dynamic nanoTime value,
  consistent with how topic/group names are already dynamized
- KafkaIT.java: track dynamically-created topics in a CopyOnWriteArrayList
  and delete them in tearDown() to prevent Kafka broker metadata bloat from
  accumulated retention.ms=-1 topics across CI runs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
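
The topic-tracking cleanup described in the commit message can be sketched roughly as follows. This is not the KafkaIT code: `TrackedTopics` is a hypothetical class, and the `deleter` callback is an assumption that stands in for whatever Kafka admin call the test uses to delete topics, so the sketch runs without a broker.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical helper; the deleter stands in for a real topic-deletion call.
final class TrackedTopics {
    // CopyOnWriteArrayList allows safe registration from concurrently running tests.
    private final List<String> created = new CopyOnWriteArrayList<>();
    private final Consumer<Collection<String>> deleter;

    TrackedTopics(Consumer<Collection<String>> deleter) {
        this.deleter = deleter;
    }

    /** Builds a topic name with a nanoTime suffix and records it for later cleanup. */
    String newTopic(String prefix) {
        String name = prefix + "_" + System.nanoTime();
        created.add(name);
        return name;
    }

    /** Deletes every tracked topic in one call; intended for a tearDown() hook. */
    void cleanUp() {
        if (created.isEmpty()) {
            return;
        }
        List<String> snapshot = new ArrayList<>(created);
        deleter.accept(snapshot);
        created.removeAll(snapshot); // forget only what was handed to the deleter
    }

    /** Returns a copy of the currently tracked topic names. */
    List<String> tracked() {
        return new ArrayList<>(created);
    }
}
```

If the deleter throws, the topics stay tracked, so a later cleanup attempt can retry them. Note that `System.nanoTime()` makes name collisions unlikely but does not strictly guarantee uniqueness.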
@DanielLeens DanielLeens force-pushed the david_fix_pr10925_ci_local branch from 065b2b3 to 5202424 Compare June 1, 2026 04:34