server: remove hot-path disconnect monitor#68685
Conversation
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughThis PR refactors connection liveness monitoring from a background goroutine architecture to a simpler probe-function model. The core mechanism now installs a reusable probe into ChangesConnection Liveness Probe Architecture
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Skipping CI for Draft Pull Request. |
|
@King-Dylan I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/server/conn.go (1)
2212-2246: ⚡ Quick winAdd a stale-tick regression test for the generation gate.
This loop now carries the key correctness contract of the refactor. A failpoint-backed test using
mockConnectionAliveMonitorIntervalthat lets statement N's tick arrive after statement N+1 becomes active would lock in thatgenerationreally prevents the older tick from cancelling the newer statement.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/server/conn.go` around lines 2212 - 2246, Add a failpoint-backed regression test that exercises monitorConnectionAlive's generation gate: use the failpoint "mockConnectionAliveMonitorInterval" to make the ticker tick slowly so a tick from statement N can arrive after statement N+1 becomes active, then drive connectionAliveMonitor.active and connectionAliveMonitor.generation across two sequenced statements (via methods that call cc.getCtx and mutate sessVars.SQLKiller.IsConnectionAlive) and assert that an older tick (captured using SQLKiller.IsConnectionAlive returning false) does not call sessVars.SQLKiller.SendKillSignal or cc.cancelDispatch for the newer generation; the test should explicitly toggle connectionAliveMonitor.generation between statements and verify only the matching-generation tick triggers cancelDispatch.pkg/session/tidb.go (1)
277-287: ⚡ Quick winAdd a boundary test for the 10ms fast path.
Please cover both sides of this cutoff: one case that stays below
10msand skips the probe, and one that crosses it and still performs the pre-commit check. This threshold is now part of the performance contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/session/tidb.go` around lines 277 - 287, Add two unit tests for shouldCheckConnectionAliveBeforeCommit that assert behavior on both sides of minConnectionAliveCheckBeforeCommitDuration: (1) "fast path" test: create a variable.SessionVars with autocommit=true, not in txn, set StartTime to time.Now() or time.Now().Add(-5*time.Millisecond) (i.e. < minConnectionAliveCheckBeforeCommitDuration) and verify shouldCheckConnectionAliveBeforeCommit returns false; (2) "slow path" test: same session vars but set StartTime to time.Now().Add(-minConnectionAliveCheckBeforeCommitDuration - 1*time.Millisecond) (i.e. > minConnectionAliveCheckBeforeCommitDuration) and verify the function returns true. Ensure tests call shouldCheckConnectionAliveBeforeCommit (and pass a minimal sqlexec.Statement or nil as appropriate) and use the exact symbols minConnectionAliveCheckBeforeCommitDuration and shouldCheckConnectionAliveBeforeCommit so they remain robust to line changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/server/conn.go`:
- Around line 2212-2246: Add a failpoint-backed regression test that exercises
monitorConnectionAlive's generation gate: use the failpoint
"mockConnectionAliveMonitorInterval" to make the ticker tick slowly so a tick
from statement N can arrive after statement N+1 becomes active, then drive
connectionAliveMonitor.active and connectionAliveMonitor.generation across two
sequenced statements (via methods that call cc.getCtx and mutate
sessVars.SQLKiller.IsConnectionAlive) and assert that an older tick (captured
using SQLKiller.IsConnectionAlive returning false) does not call
sessVars.SQLKiller.SendKillSignal or cc.cancelDispatch for the newer generation;
the test should explicitly toggle connectionAliveMonitor.generation between
statements and verify only the matching-generation tick triggers cancelDispatch.
In `@pkg/session/tidb.go`:
- Around line 277-287: Add two unit tests for
shouldCheckConnectionAliveBeforeCommit that assert behavior on both sides of
minConnectionAliveCheckBeforeCommitDuration: (1) "fast path" test: create a
variable.SessionVars with autocommit=true, not in txn, set StartTime to
time.Now() or time.Now().Add(-5*time.Millisecond) (i.e. <
minConnectionAliveCheckBeforeCommitDuration) and verify
shouldCheckConnectionAliveBeforeCommit returns false; (2) "slow path" test: same
session vars but set StartTime to
time.Now().Add(-minConnectionAliveCheckBeforeCommitDuration -
1*time.Millisecond) (i.e. > minConnectionAliveCheckBeforeCommitDuration) and
verify the function returns true. Ensure tests call
shouldCheckConnectionAliveBeforeCommit (and pass a minimal sqlexec.Statement or
nil as appropriate) and use the exact symbols
minConnectionAliveCheckBeforeCommitDuration and
shouldCheckConnectionAliveBeforeCommit so they remain robust to line changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d2084ec8-9421-426a-9333-d7795f8cc871
📒 Files selected for processing (2)
pkg/server/conn.gopkg/session/tidb.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #68685 +/- ##
================================================
- Coverage 76.3086% 75.2870% -1.0216%
================================================
Files 2041 2023 -18
Lines 563504 568013 +4509
================================================
- Hits 430002 427640 -2362
- Misses 132586 140316 +7730
+ Partials 916 57 -859
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest-required |
7b5560d to
c1af698
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/server/conn.go (2)
2242-2301: ⚡ Quick winAdd a regression test for the idle-timeout restart window.
The
started/notifystate machine is the riskiest part of this refactor. A targeted test for “statement activates monitoring while the idle timer is trying to exit” would make future cleanup here much safer.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/server/conn.go` around lines 2242 - 2301, Add a regression test that reproduces the "idle-timeout restart window" race between the idleTimer exit path and a concurrent statement activation: create a test that constructs a conn (cc) with a connectionAliveMonitor, sets active=false and started=true, starts the monitor goroutine (the loop that reads cc.connectionAliveMonitor.notify/started/active and uses connectionAliveMonitorIdleTimeout), then orchestrate timing so the idleTimer expiry branch runs while another goroutine sends on cc.connectionAliveMonitor.notify and flips active to true (or performs the same steps a statement would) right during the idle exit checks; assert that the monitor does not stop (started remains true or monitor goroutine continues) and that no unintended return occurs. Use the concrete symbols cc.connectionAliveMonitor, cc.connectionAliveMonitor.notify, cc.connectionAliveMonitor.started, cc.connectionAliveMonitor.active, and the monitor goroutine entry (the loop in conn.go) to locate code; make the test deterministic by controlling time (use a short timeout constant or a fake clock if available) and explicit synchronization between goroutines to reliably exercise the race window.
2242-2257: ⚡ Quick winDocument the
started/activehandoff here.This idle-timeout exit path relies on a subtle CAS protocol to avoid losing wake-ups when a new statement arrives while the goroutine is trying to stop. Please spell that invariant out inline; otherwise this is hard to safely maintain.
As per coding guidelines "Comments SHOULD explain non-obvious intent, constraints, invariants, concurrency guarantees, SQL/compatibility contracts, or important performance trade-offs; SHOULD NOT restate what the code already makes clear".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/server/conn.go` around lines 2242 - 2257, Add an inline comment at the idle-timeout exit path inside connectionAliveMonitor's goroutine (the block using idleTimer, connectionAliveMonitor.active, and connectionAliveMonitor.started.CompareAndSwap) that documents the CAS-based handoff invariant: explain that "started" is used to indicate the goroutine is running and that we must CAS it from true→false only if "active" remains false to avoid losing a concurrent wake-up; detail that a concurrent waiter will set active=true then set started true (or restart the loop) and therefore the sequence of active checks plus CompareAndSwap ensures no wake-up is missed — i.e., describe the exact ordering guarantees the code relies on so future maintainers understand why we check active before/after the CompareAndSwap and why we return or continue accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/server/conn.go`:
- Around line 2242-2301: Add a regression test that reproduces the "idle-timeout
restart window" race between the idleTimer exit path and a concurrent statement
activation: create a test that constructs a conn (cc) with a
connectionAliveMonitor, sets active=false and started=true, starts the monitor
goroutine (the loop that reads cc.connectionAliveMonitor.notify/started/active
and uses connectionAliveMonitorIdleTimeout), then orchestrate timing so the
idleTimer expiry branch runs while another goroutine sends on
cc.connectionAliveMonitor.notify and flips active to true (or performs the same
steps a statement would) right during the idle exit checks; assert that the
monitor does not stop (started remains true or monitor goroutine continues) and
that no unintended return occurs. Use the concrete symbols
cc.connectionAliveMonitor, cc.connectionAliveMonitor.notify,
cc.connectionAliveMonitor.started, cc.connectionAliveMonitor.active, and the
monitor goroutine entry (the loop in conn.go) to locate code; make the test
deterministic by controlling time (use a short timeout constant or a fake clock
if available) and explicit synchronization between goroutines to reliably
exercise the race window.
- Around line 2242-2257: Add an inline comment at the idle-timeout exit path
inside connectionAliveMonitor's goroutine (the block using idleTimer,
connectionAliveMonitor.active, and
connectionAliveMonitor.started.CompareAndSwap) that documents the CAS-based
handoff invariant: explain that "started" is used to indicate the goroutine is
running and that we must CAS it from true→false only if "active" remains false
to avoid losing a concurrent wake-up; detail that a concurrent waiter will set
active=true then set started true (or restart the loop) and therefore the
sequence of active checks plus CompareAndSwap ensures no wake-up is missed —
i.e., describe the exact ordering guarantees the code relies on so future
maintainers understand why we check active before/after the CompareAndSwap and
why we return or continue accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 065c2485-3779-4a0a-b90c-13fc58b43228
📒 Files selected for processing (2)
pkg/server/conn.gopkg/session/tidb.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/session/tidb.go
c32b8c9 to
8b9c82f
Compare
8b9c82f to
0a6b80b
Compare
|
⏳ @King-Dylan I've received your updated pull request and will continue the review. I'll update this comment when I have something to share. ℹ️ Learn more details on Pantheon AI. |
|
Actionable comments posted: 0 |
|
/test fast_test_tiprow |
|
@King-Dylan: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test fast_test_tiprow |
|
@King-Dylan: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test fast_test_tiprow |
|
@King-Dylan: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest |
|
/retest-required |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: bb7133, Defined2014 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
What problem does this PR solve?
Issue Number: close #68633
Problem Summary:
PR #68237 added active connection-liveness monitoring for autocommit
INSERT/UPDATE/DELETEso TiDB can notice some client disconnects before commit. That put fixed work on every eligible autocommit DML statement, including per-statement monitor goroutine/ticker/channel setup and an eager pre-commit socket probe. In the high-QPS sysbencholtp_deletecase, that hot-path cost caused the 3.7% regression reported in #68633.What changed and how does it work?
SQLKiller.IsConnectionAlivewhile the statement is executing, so existing interruption checkpoints can probe the socket synchronously with SQLKiller's normal throttle.INSERT ... SLEEP()disconnect regression coverage by keeping the kill signal for write statements whenSLEEP()observes the interruption; the statement then aborts and rolls back instead of committing partial rows.Check List
Tests
Manual test commands:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Bug Fixes
Performance
Tests