ddl: fail stale backfill task meta#68842
Conversation
|
@zeminzhou I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @zeminzhou. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughThis PR adds validation for distributed backfill task metadata during task resumption to prevent infinite retry loops on stale metadata. It introduces a sentinel error type, validates task context before resuming, classifies stale metadata as non-retryable, and includes regression tests. ChangesStale Backfill Task Metadata Detection
Sequence DiagramsequenceDiagram
participant executeDistTask
participant validateBackfillTaskMeta
participant IsRetryableError
participant backfillDistExecutor
executeDistTask->>executeDistTask: Unmarshal existing task metadata
executeDistTask->>validateBackfillTaskMeta: Validate metadata vs. reorgInfo
alt Metadata mismatch
validateBackfillTaskMeta-->>executeDistTask: Return errBackfillTaskMetaOutdated
executeDistTask->>executeDistTask: Mark task as failed
executeDistTask-->>IsRetryableError: Return sentinel error
else Metadata matches
validateBackfillTaskMeta-->>executeDistTask: Return nil
executeDistTask->>backfillDistExecutor: Resume task
backfillDistExecutor->>backfillDistExecutor: Execute backfill steps
end
IsRetryableError->>IsRetryableError: Check isBackfillTaskMetaOutdatedErr
alt Outdated metadata error
IsRetryableError-->>backfillDistExecutor: Return false (non-retryable)
else Other error
IsRetryableError-->>backfillDistExecutor: Apply existing retry logic
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/ddl/index.go (1)
3107-3115: ⚡ Quick winConsider adding a log message when stale task metadata is detected.
The code correctly validates and fails stale tasks, but a dedicated log message at this point would improve observability when debugging production incidents.
📋 Suggested log addition
if err := validateBackfillTaskMeta(task, reorgInfo); err != nil { + logutil.DDLLogger().Warn("resuming task with stale metadata, marking as failed", + zap.Int64("taskID", task.ID), zap.String("taskKey", task.Key), + zap.Int64("jobID", reorgInfo.Job.ID), zap.Error(err)) if !task.TaskBase.IsDone() {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ddl/index.go` around lines 3107 - 3115, When validateBackfillTaskMeta(task, reorgInfo) returns an error and you proceed to fail the task via taskManager.FailTask(w.workCtx, task.ID, task.State, err), add a structured log entry before returning that records the stale metadata error and task identifiers; specifically log the error value, task.ID, task.State and whether task.TaskBase.IsDone() so operators can trace why validateBackfillTaskMeta failed—place the log just after the validateBackfillTaskMeta check and before calling taskManager.FailTask/handle.NotifyTaskChange.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/ddl/index.go`:
- Around line 3107-3115: When validateBackfillTaskMeta(task, reorgInfo) returns
an error and you proceed to fail the task via taskManager.FailTask(w.workCtx,
task.ID, task.State, err), add a structured log entry before returning that
records the stale metadata error and task identifiers; specifically log the
error value, task.ID, task.State and whether task.TaskBase.IsDone() so operators
can trace why validateBackfillTaskMeta failed—place the log just after the
validateBackfillTaskMeta check and before calling
taskManager.FailTask/handle.NotifyTaskChange.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4ecb5d64-f8e6-4d7f-8071-045bc8b7e986
📒 Files selected for processing (3)
pkg/ddl/backfilling_dist_executor.gopkg/ddl/backfilling_test.gopkg/ddl/index.go
|
@zeminzhou: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #68842 +/- ##
================================================
+ Coverage 76.3104% 76.9060% +0.5955%
================================================
Files 2041 2051 +10
Lines 563452 571388 +7936
================================================
+ Hits 429973 439432 +9459
+ Misses 132563 130201 -2362
- Partials 916 1755 +839
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
What problem does this PR solve?
Issue Number: close #68828
Problem Summary:
Distributed add-index backfill can pick up an existing DXF task by task key and resume it without checking whether the persisted
BackfillTaskMetastill matches the current DDL job and reorg elements. If the old task meta contains staleEleIDs, the executor repeatedly returnsindex info not found, and the error is treated as retryable, so the task can retry forever without making progress.What changed and how does it work?
backfill task meta is outdatederror for stale DXF backfill task metadata.reorgInfo.index info not foundduring distributed backfill executor setup into the same non-retryable stale-meta error.Check List
Tests
Test commands:
./tools/check/failpoint-go-test.sh pkg/ddl -run 'TestOutdatedBackfillTaskMetaIsNonRetryable|TestValidateBackfillTaskMeta' -count=1 make bazel_prepare make lintSide effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests