ddl: fail stale backfill task meta #68842

Open
zeminzhou wants to merge 1 commit into pingcap:master from zeminzhou:codex/fix-stale-backfill-task-meta

@zeminzhou (Contributor) commented Jun 1, 2026

What problem does this PR solve?

Issue Number: close #68828

Problem Summary:

Distributed add-index backfill can pick up an existing DXF task by task key and resume it without checking whether the persisted `BackfillTaskMeta` still matches the current DDL job and reorg elements. If the old task meta contains stale `EleIDs`, the executor returns "index info not found" on every attempt. Because that error is classified as retryable, the task can retry forever without making progress.

What changed and how does it work?

  • Add a dedicated "backfill task meta is outdated" error for stale DXF backfill task metadata.
  • Validate an existing backfill task's persisted metadata before resuming it. The validation checks the job ID, schema ID, table ID, element type, and element IDs against the current `reorgInfo`.
  • If validation fails for a non-terminal task, mark the DXF task as failed and notify the scheduler instead of resuming the stale task.
  • Treat the stale backfill task metadata error as non-retryable in both the DDL reorg retry classification and the backfill DXF executor.
  • Convert "index info not found" during distributed backfill executor setup into the same non-retryable stale-meta error.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Test commands:

./tools/check/failpoint-go-test.sh pkg/ddl -run 'TestOutdatedBackfillTaskMetaIsNonRetryable|TestValidateBackfillTaskMeta' -count=1
make bazel_prepare
make lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix a bug that could cause distributed `ADD INDEX` backfill to keep retrying forever when it resumed a stale DXF task whose metadata no longer matched the current table indexes.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved detection and handling of outdated backfill task metadata during distributed schema operations. The system now correctly identifies stale metadata and fails immediately instead of retrying unnecessarily, providing better error context (index, table, and job identifiers).
  • Tests

    • Added test coverage for outdated metadata error handling and validation logic.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Jun 1, 2026

pantheon-ai Bot commented Jun 1, 2026

@zeminzhou I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 1, 2026

ti-chi-bot Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangkeao for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


tiprow Bot commented Jun 1, 2026

Hi @zeminzhou. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


coderabbitai Bot commented Jun 1, 2026

Walkthrough

This PR adds validation for distributed backfill task metadata during task resumption to prevent infinite retry loops on stale metadata. It introduces a sentinel error type, validates task context before resuming, classifies stale metadata as non-retryable, and includes regression tests.

Changes

Stale Backfill Task Metadata Detection

| Layer | File(s) | Summary |
| --- | --- | --- |
| Sentinel error and detection infrastructure | `pkg/ddl/backfilling_dist_executor.go` | Add `strings` import and define `errBackfillTaskMetaOutdated` sentinel error with `isBackfillTaskMetaOutdatedErr` helper to detect staleness via `errors.Cause` equality or error message substring matching. |
| Task metadata validation on resumption | `pkg/ddl/index.go` | Introduce `validateBackfillTaskMeta(task, reorgInfo)` to compare persisted job/table/element identities and IDs; validate metadata in `executeDistTask` after unmarshalling, and if invalid, mark task as failed with non-retryable sentinel error before resuming. |
| Non-retryable classification for outdated metadata | `pkg/ddl/backfilling_dist_executor.go`, `pkg/ddl/index.go` | Update `IsRetryableError` in both files to return false immediately when metadata is outdated. Annotate "index info not found" errors with sentinel error to preserve stale metadata context. |
| Tests for outdated metadata detection and non-retryability | `pkg/ddl/backfilling_test.go` | Add import updates and unit tests verifying `validateBackfillTaskMeta` rejects mismatched element IDs, and that `isRetryableError` treats outdated metadata as non-retryable (both original and reconstructed-from-message errors). |

Sequence Diagram

sequenceDiagram
  participant executeDistTask
  participant validateBackfillTaskMeta
  participant IsRetryableError
  participant backfillDistExecutor

  executeDistTask->>executeDistTask: Unmarshal existing task metadata
  executeDistTask->>validateBackfillTaskMeta: Validate metadata vs. reorgInfo
  alt Metadata mismatch
    validateBackfillTaskMeta-->>executeDistTask: Return errBackfillTaskMetaOutdated
    executeDistTask->>executeDistTask: Mark task as failed
    executeDistTask-->>IsRetryableError: Return sentinel error
  else Metadata matches
    validateBackfillTaskMeta-->>executeDistTask: Return nil
    executeDistTask->>backfillDistExecutor: Resume task
    backfillDistExecutor->>backfillDistExecutor: Execute backfill steps
  end
  IsRetryableError->>IsRetryableError: Check isBackfillTaskMetaOutdatedErr
  alt Outdated metadata error
    IsRetryableError-->>backfillDistExecutor: Return false (non-retryable)
  else Other error
    IsRetryableError-->>backfillDistExecutor: Apply existing retry logic
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A stale task was looping endlessly in the night,
We add a sentinel guard and metadata validation tight,
Now outdated whispers are marked non-retryable, clear,
No more infinite retries—the backfill path is sincere! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: adding validation to fail stale backfill task metadata instead of retrying forever. |
| Description check | ✅ Passed | The PR description is comprehensive, following the template with Problem Summary, What Changed, test checklist completion, and detailed release notes provided. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #68828 are met: validation of persisted BackfillTaskMeta, non-retryable error for stale meta, executor setup conversion, and unit tests added. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: adding sentinel error, validation helper, error classification in reorg retry logic, metadata validation on task resumption, and two unit tests directly addressing the issue. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/ddl/index.go (1)

3107-3115: ⚡ Quick win

Consider adding a log message when stale task metadata is detected.

The code correctly validates and fails stale tasks, but a dedicated log message at this point would improve observability when debugging production incidents.

📋 Suggested log addition
 if err := validateBackfillTaskMeta(task, reorgInfo); err != nil {
+	logutil.DDLLogger().Warn("resuming task with stale metadata, marking as failed",
+		zap.Int64("taskID", task.ID), zap.String("taskKey", task.Key),
+		zap.Int64("jobID", reorgInfo.Job.ID), zap.Error(err))
 	if !task.TaskBase.IsDone() {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 3107 - 3115, When
validateBackfillTaskMeta(task, reorgInfo) returns an error and you proceed to
fail the task via taskManager.FailTask(w.workCtx, task.ID, task.State, err), add
a structured log entry before returning that records the stale metadata error
and task identifiers; specifically log the error value, task.ID, task.State and
whether task.TaskBase.IsDone() so operators can trace why
validateBackfillTaskMeta failed—place the log just after the
validateBackfillTaskMeta check and before calling
taskManager.FailTask/handle.NotifyTaskChange.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4ecb5d64-f8e6-4d7f-8071-045bc8b7e986

📥 Commits

Reviewing files that changed from the base of the PR and between 147980c and a84e38c.

📒 Files selected for processing (3)
  • pkg/ddl/backfilling_dist_executor.go
  • pkg/ddl/backfilling_test.go
  • pkg/ddl/index.go


ti-chi-bot Bot commented Jun 1, 2026

@zeminzhou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-build-next-gen | a84e38c | link | true | /test pull-build-next-gen |
| idc-jenkins-ci-tidb/build | a84e38c | link | true | /test build |
| idc-jenkins-ci-tidb/unit-test | a84e38c | link | true | /test unit-test |
| pull-unit-test-next-gen | a84e38c | link | true | /test pull-unit-test-next-gen |
| pull-unit-test-ddlv1 | a84e38c | link | true | /test pull-unit-test-ddlv1 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 8.82353% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.9060%. Comparing base (99e1c67) to head (a84e38c).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68842        +/-   ##
================================================
+ Coverage   76.3104%   76.9060%   +0.5955%     
================================================
  Files          2041       2051        +10     
  Lines        563452     571388      +7936     
================================================
+ Hits         429973     439432      +9459     
+ Misses       132563     130201      -2362     
- Partials        916       1755       +839     
| Flag | Coverage Δ |
| --- | --- |
| integration | 46.1622% <8.8235%> (+6.3837%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
| --- | --- |
| dumpling | 60.4610% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 65.8009% <ø> (+2.9699%) ⬆️ |

Development

Successfully merging this pull request may close these issues.

ddl: stale DXF backfill task meta can make add index retry forever

1 participant