sessionctx: add cluster-wide read-only status variable#68853

Open
bb7133 wants to merge 4 commits into master from bb7133/tidb-is-read-only-status

Conversation

@bb7133
Member

@bb7133 bb7133 commented Jun 1, 2026

What problem does this PR solve?

Issue Number: close #68852

Problem Summary:

TiDB has tidb_restricted_read_only and tidb_super_read_only, but there is no SQL-visible status that tells operators whether every live TiDB instance has applied effective read-only state.

What changed and how does it work?

This PR adds a read-only global sysvar, tidb_is_read_only.

tidb_is_read_only returns ON only when every live TiDB instance publishes effective read-only state through the existing server info sync path:

effective_read_only = tidb_restricted_read_only || tidb_super_read_only

If there is no live TiDB info, or any live TiDB reports effective_read_only = false, the aggregate returns OFF.
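The aggregation rule above can be sketched in a few lines of Go. This is an illustration only, not the PR's code: `instanceStatus`, `effectiveReadOnly`, and `clusterReadOnly` are hypothetical names standing in for the fields added to serverinfo.DynamicInfo and the aggregation helper in infosync.

```go
package main

import "fmt"

// instanceStatus is a hypothetical stand-in for the per-instance
// read-only fields this PR adds to serverinfo.DynamicInfo.
type instanceStatus struct {
	RestrictedReadOnly bool
	SuperReadOnly      bool
}

// effectiveReadOnly mirrors the formula in the description:
// effective_read_only = tidb_restricted_read_only || tidb_super_read_only
func (s instanceStatus) effectiveReadOnly() bool {
	return s.RestrictedReadOnly || s.SuperReadOnly
}

// clusterReadOnly returns true only when at least one live instance is
// known and every live instance is effectively read-only.
func clusterReadOnly(live map[string]instanceStatus) bool {
	if len(live) == 0 {
		return false // no live TiDB info: report OFF
	}
	for _, s := range live {
		if !s.effectiveReadOnly() {
			return false // any writable instance: report OFF
		}
	}
	return true
}

func main() {
	allReadOnly := map[string]instanceStatus{
		"tidb-0": {RestrictedReadOnly: true},
		"tidb-1": {SuperReadOnly: true},
	}
	mixed := map[string]instanceStatus{
		"tidb-0": {RestrictedReadOnly: true},
		"tidb-1": {},
	}
	fmt.Println(clusterReadOnly(allReadOnly)) // true
	fmt.Println(clusterReadOnly(mixed))       // false
	fmt.Println(clusterReadOnly(nil))         // false
}
```

Returning OFF for an empty set is the conservative choice: an operator never sees ON unless at least one instance has positively reported read-only state.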

Implementation details:

  • Add tidb_is_read_only as a read-only global sysvar.
  • Add per-instance read-only status fields to serverinfo.DynamicInfo.
  • Publish local read-only status when tidb_restricted_read_only or tidb_super_read_only changes.
  • Derive the aggregate value from all live TiDB server info records.
  • Keep existing read-only enforcement behavior unchanged.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • No behavior change for existing tidb_restricted_read_only / tidb_super_read_only enforcement

Release note

Add the read-only system variable `tidb_is_read_only` to indicate whether all live TiDB instances are effectively running in read-only mode.

Summary by CodeRabbit

  • New Features

    • Cluster effective read-only is now computed from live instances and a local report; cluster is read-only only when all live instances are read-only.
    • Instances can report local read-only changes to trigger cluster status updates.
    • New system variable tidb_is_read_only to query effective cluster read-only state.
  • Tests

    • Added tests for cluster aggregation, instance reporting, and tidb_is_read_only behavior.
  • Chores

    • Build/test target adjustments to include the new read-only plumbing.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-tests-checked release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 1, 2026
@pantheon-ai

pantheon-ai Bot commented Jun 1, 2026

@bb7133 I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 1, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terry1purcell, xhebox for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Walkthrough

Adds cluster-wide read-only plumbing and a read-only SQL-visible variable by registering vardef callbacks and reporter, extending ServerInfo with read-only fields, persisting per-instance status via Syncer, aggregating effective read-only across instances, and wiring tests and sysvar handlers.

Changes

Cluster-wide Read-Only Status Variable

| Layer | File(s) | Summary |
|---|---|---|
| Read-only status plumbing framework | pkg/sessionctx/vardef/readonly.go, pkg/sessionctx/vardef/BUILD.bazel | Defines ClusterReadOnlyChecker and ReadOnlyStatusReporter, atomic storage for callbacks, LocalTiDBReadOnlyStatus(), getters/setters, and includes readonly.go in vardef build. |
| System variable integration | pkg/sessionctx/vardef/tidb_vars.go, pkg/sessionctx/variable/sysvar.go, pkg/sessionctx/variable/sysvar_test.go | Adds TiDBIsReadOnly constant, reportTiDBReadOnlyStatus(ctx) helper, updates TiDBRestrictedReadOnly and TiDBSuperReadOnly SetGlobal handlers to accept context.Context and trigger reporting; tests validate tidb_is_read_only. |
| Server info schema extension | pkg/domain/serverinfo/info.go | Adds TiDBRestrictedReadOnly, TiDBSuperReadOnly, and TiDBEffectiveReadOnly fields to DynamicInfo and updates Clone(). |
| Server info persistence and tests | pkg/domain/serverinfo/syncer.go, pkg/domain/serverinfo/syncer_test.go, pkg/domain/serverinfo/BUILD.bazel | Adds Syncer.UpdateServerReadOnlyStatus() to compute effective read-only, persist to etcd when session present, update in-memory DynamicInfo; GetAllServerInfo non-etcd path now includes local read-only fields; tests exercise non-etcd behavior; test shard_count updated. |
| Cluster aggregation and wiring | pkg/domain/infosync/info.go, pkg/domain/infosync/info_test.go, pkg/domain/infosync/BUILD.bazel | Registers cluster checker and reporter in init(), implements UpdateServerReadOnlyStatus(ctx), getClusterReadOnlyStatus, and clusterReadOnlyStatusFromServerInfo aggregation (true only when non-empty and all instances effective-read-only); tests cover aggregation scenarios; infosync test shard_count and deps updated. |

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant SysVar as tidb_is_read_only
  participant Vardef
  participant Checker as getClusterReadOnlyStatus
  participant Syncer
  participant Instances as LiveTiDBInstances

  User->>SysVar: SELECT @@global.tidb_is_read_only
  SysVar->>Vardef: GetClusterReadOnlyStatus(ctx)
  Vardef->>Checker: invoke checker(ctx)
  Checker->>Syncer: GetAllServerInfo()
  Syncer->>Instances: gather DynamicInfo per instance
  Instances-->>Syncer: map[string]ServerInfo{DynamicInfo{TiDBEffectiveReadOnly}}
  Syncer-->>Checker: serverInfoMap
  Checker->>Checker: clusterReadOnlyStatusFromServerInfo()
  alt all live instances EffectiveReadOnly==true
    Checker-->>Vardef: true
  else any missing/false
    Checker-->>Vardef: false
  end
  Vardef-->>SysVar: ON/OFF
  SysVar-->>User: ON/OFF

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

size/L, ok-to-test

Suggested reviewers

  • yudongusa
  • D3Hunter
  • tiancaiamao

Poem

🐰 I hop through nodes both near and far,

Checking flags beneath each cluster star,
When every burrow bids all writes stay,
I whisper "read-only" — the cluster's way,
A tiny rabbit confirms the bar.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.91% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a cluster-wide read-only status variable to the sessionctx package. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #68852 are met: read-only sysvar tidb_is_read_only is added, per-instance read-only fields in serverinfo.DynamicInfo are implemented, local status publishing on variable changes is added, and cluster-wide aggregation logic is derived from live server info. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the cluster-wide read-only status variable feature. No out-of-scope modifications detected; changes focus on the required infrastructure and system variable implementation. |
| Description check | ✅ Passed | The PR description clearly explains the problem, implementation, testing, and includes a release note following the required format. |


@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 1, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/domain/serverinfo/syncer.go (1)

241-241: 💤 Low value

Good optimization using cached info.

When etcd is unavailable, this now returns the cached info directly instead of reconstructing via getServerInfo(). This is more efficient and consistent with the already-loaded state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/domain/serverinfo/syncer.go` at line 241, The current change should
return the cached ServerInfo from the allInfo map instead of reconstructing it
via getServerInfo() when etcd is unavailable; update the code path that
currently calls getServerInfo() to fetch and return allInfo[info.ID] (the cached
value) and ensure callers of the sync routine use that cached ServerInfo,
referencing the allInfo map and the getServerInfo function to locate the logic
to modify.
pkg/domain/infosync/info.go (1)

351-366: 💤 Low value

Silent failure when info syncer unavailable is intentional.

updateServerReadOnlyStatus returns nil when getGlobalInfoSyncer() fails (line 358-360). This is correct: if the syncer isn't initialized (e.g., during startup or in tests), we can't publish status to etcd, but we shouldn't fail the SET operation. The local variable state will still be updated via the direct atomic stores in vardef.

Optional: add debug logging

If you want observability for this path:

 func updateServerReadOnlyStatus(ctx context.Context) error {
 	is, err := getGlobalInfoSyncer()
 	if err != nil {
+		logutil.BgLogger().Debug("info syncer not available, skipping read-only status publish", zap.Error(err))
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/domain/infosync/info.go` around lines 351 - 366, The current
updateServerReadOnlyStatus function intentionally returns nil when
getGlobalInfoSyncer() fails (silent on uninitialized syncer); preserve that
behavior but improve observability by logging the error at debug/info level
instead of dropping it: in updateServerReadOnlyStatus, after calling
getGlobalInfoSyncer(), if err != nil call the package logger (or processLogger)
to emit a contextual debug message including the error and the fact the info
syncer is uninitialized, then return nil; leave the subsequent call to
is.svrInfoSyncer.UpdateServerReadOnlyStatus and the use of
vardef.RestrictedReadOnly.Load()/vardef.VarTiDBSuperReadOnly.Load() unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ab2e1649-8775-4a17-8a63-b05967e64199

📥 Commits

Reviewing files that changed from the base of the PR and between 260d374 and 0a835be.

📒 Files selected for processing (8)
  • pkg/domain/infosync/info.go
  • pkg/domain/infosync/info_test.go
  • pkg/domain/serverinfo/info.go
  • pkg/domain/serverinfo/syncer.go
  • pkg/sessionctx/vardef/readonly.go
  • pkg/sessionctx/vardef/tidb_vars.go
  • pkg/sessionctx/variable/sysvar.go
  • pkg/sessionctx/variable/sysvar_test.go

Comment on lines +84 to +88
func reportTiDBReadOnlyStatus(ctx context.Context) {
	if err := vardef.ReportReadOnlyStatus(ctx); err != nil {
		logutil.BgLogger().Warn("update TiDB read-only status failed", zap.Error(err))
	}
}
⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the vardef API mismatch before merge.

Crossbuild is failing here because vardef.ReportReadOnlyStatus and vardef.GetClusterReadOnlyStatus are undefined. This PR will not compile until the plumbing layer exports these symbols or these call sites are updated to the actual API names.

Also applies to: 1014-1020

🧰 Tools
🪛 GitHub Check: Bazel Crossbuild (ubuntu-24.04-arm)

[failure] 85-85:
undefined: vardef.ReportReadOnlyStatus

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sessionctx/variable/sysvar.go` around lines 84 - 88, The build is failing
because reportTiDBReadOnlyStatus calls vardef.ReportReadOnlyStatus and
vardef.GetClusterReadOnlyStatus which do not exist in the current plumbing;
update the call sites (e.g., in reportTiDBReadOnlyStatus and the other
occurrences around lines shown) to use the actual exported API names from the
plumbing layer or export these symbols from vardef: either change
vardef.ReportReadOnlyStatus -> the real function name provided by the plumbing
(and likewise for GetClusterReadOnlyStatus), or add matching
ReportReadOnlyStatus/GetClusterReadOnlyStatus wrappers in the vardef package
that forward to the plumbing implementation so the references in
reportTiDBReadOnlyStatus compile.

@codecov

codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 59.61538% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.8546%. Comparing base (a9add5c) to head (e3eb19d).
⚠️ Report is 35 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68853        +/-   ##
================================================
+ Coverage   76.3085%   76.8546%   +0.5461%     
================================================
  Files          2041       2054        +13     
  Lines        563262     571445      +8183     
================================================
+ Hits         429817     439182      +9365     
+ Misses       132529     130515      -2014     
- Partials        916       1748       +832     
| Flag | Coverage Δ |
|---|---|
| integration | 46.1672% <59.6153%> (+6.3887%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
|---|---|
| dumpling | 60.4610% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 65.8365% <ø> (+3.0055%) ⬆️ |


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/domain/serverinfo/syncer.go (1)

184-214: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Confirm read-only and label updates can't race on s.info.

UpdateServerReadOnlyStatus and UpdateServerLabel both perform an unsynchronized read‑modify‑write on the atomic s.info pointer (clone current DynamicInfo → mutate → setDynamicServerInfo/Clone+store). Syncer has no mutex. If a label change and a read-only-status change run concurrently, one of them clones a snapshot taken before the other's store, so the later store can silently drop the other's update (lost update). The atomic pointer only guarantees a torn-free swap, not RMW atomicity.

If these callers are already serialized upstream (e.g. both flow through a single config/sysvar apply path), this is moot. Otherwise consider guarding the RMW sections with a small sync.Mutex.

#!/bin/bash
# Find call sites of the two RMW methods to check whether they can run concurrently.
rg -nP -C3 '\b(UpdateServerReadOnlyStatus|UpdateServerLabel)\s*\(' --type=go -g '!**/serverinfo/syncer.go'
# Check whether Syncer already has any lock field guarding s.info.
ast-grep --pattern 'type Syncer struct { $$$ }'
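
For reference, the mutex guard suggested above could look roughly like this. It is a reduced sketch, not the real Syncer: `dynamicInfo`, `syncer`, `updateLabel`, and `updateReadOnly` are hypothetical stand-ins, `infoMu` is the field name proposed in this comment, and the etcd write that the real methods perform is left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// dynamicInfo is a hypothetical, reduced stand-in for serverinfo.DynamicInfo.
type dynamicInfo struct {
	Labels            map[string]string
	EffectiveReadOnly bool
}

// clone returns a deep copy so the stored snapshot is never mutated in place.
func (d *dynamicInfo) clone() *dynamicInfo {
	c := &dynamicInfo{
		Labels:            make(map[string]string, len(d.Labels)),
		EffectiveReadOnly: d.EffectiveReadOnly,
	}
	for k, v := range d.Labels {
		c.Labels[k] = v
	}
	return c
}

type syncer struct {
	infoMu sync.Mutex                  // serializes clone -> mutate -> store
	info   atomic.Pointer[dynamicInfo] // readers stay lock-free
}

func newSyncer() *syncer {
	s := &syncer{}
	s.info.Store(&dynamicInfo{Labels: map[string]string{}})
	return s
}

func (s *syncer) updateLabel(key, value string) {
	s.infoMu.Lock()
	defer s.infoMu.Unlock()
	next := s.info.Load().clone()
	next.Labels[key] = value
	s.info.Store(next)
}

func (s *syncer) updateReadOnly(restricted, super bool) {
	s.infoMu.Lock()
	defer s.infoMu.Unlock()
	next := s.info.Load().clone()
	next.EffectiveReadOnly = restricted || super
	s.info.Store(next)
}

func main() {
	s := newSyncer()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); s.updateLabel("zone", fmt.Sprint(i)) }(i)
		go func() { defer wg.Done(); s.updateReadOnly(true, false) }()
	}
	wg.Wait()
	// With the mutex held across each read-modify-write, no label
	// update can overwrite the read-only flag with a stale snapshot.
	fmt.Println(s.info.Load().EffectiveReadOnly) // true
}
```

Only writers take the mutex; readers keep loading the atomic pointer, so the read path is unchanged.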
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/domain/serverinfo/syncer.go` around lines 184 - 214,
UpdateServerReadOnlyStatus and UpdateServerLabel perform unsynchronized
read-modify-write on the atomic s.info (via cloneDynamicServerInfo,
setDynamicServerInfo and Clone+store) which can cause lost updates; add a small
sync.Mutex (e.g., infoMu) to the Syncer struct and use it to serialize the RMW
critical sections in both UpdateServerReadOnlyStatus and UpdateServerLabel (lock
before cloning/mutating and unlock after setDynamicServerInfo/store) so updates
cannot race; ensure you only hold the mutex for the minimal span covering clone
-> mutate -> set/store to avoid contention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 294e6206-c2fd-44ab-9ffb-01a52c65860c

📥 Commits

Reviewing files that changed from the base of the PR and between 0e2e9e5 and 0ce5031.

📒 Files selected for processing (2)
  • pkg/domain/serverinfo/syncer.go
  • pkg/domain/serverinfo/syncer_test.go

@bb7133
Member Author

bb7133 commented Jun 1, 2026

/retest

1 similar comment
@bb7133
Member Author

bb7133 commented Jun 1, 2026

/retest

Contributor

@expxiaoli expxiaoli left a comment

There is a race bug.

}

// UpdateServerReadOnlyStatus updates the local TiDB read-only status in the info syncer.
func UpdateServerReadOnlyStatus(ctx context.Context) error {

There is a race here. UpdateServerLabel and UpdateServerReadOnlyStatus both clone DynamicInfo, change one field group, then write the whole struct back.

Bug case: label update clones read_only=false; read-only update writes read_only=true; label update then writes its old snapshot with new labels, restoring read_only=false. After that, tidb_is_read_only may return OFF even though this TiDB is already read-only.
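
The interleaving in the bug case can be replayed deterministically, without goroutines, to show the lost update. The sketch below uses a hypothetical two-field `dynamicInfo` in place of the real DynamicInfo and performs the three steps in the order described.

```go
package main

import "fmt"

// dynamicInfo is a hypothetical, reduced stand-in for serverinfo.DynamicInfo.
type dynamicInfo struct {
	Label    string
	ReadOnly bool
}

// replayLostUpdate performs the steps of the bug case one after another
// and returns what ends up stored.
func replayLostUpdate() dynamicInfo {
	stored := dynamicInfo{Label: "old", ReadOnly: false}

	// Step 1: the label update clones the current state (read_only=false).
	labelSnapshot := stored

	// Step 2: the read-only update clones, sets read_only=true, and stores.
	readOnlySnapshot := stored
	readOnlySnapshot.ReadOnly = true
	stored = readOnlySnapshot

	// Step 3: the label update writes back its stale snapshot with the
	// new label, which restores read_only=false.
	labelSnapshot.Label = "new"
	stored = labelSnapshot

	return stored
}

func main() {
	fmt.Printf("%+v\n", replayLostUpdate()) // {Label:new ReadOnly:false}
}
```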

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Add cluster-wide read-only status variable

3 participants