Skip to content

columnar: support bucket parallel read in region#10871

Open
yongman wants to merge 15 commits into
pingcap:masterfrom
yongman:bucket-read
Open

columnar: support bucket parallel read in region#10871
yongman wants to merge 15 commits into
pingcap:masterfrom
yongman:bucket-read

Conversation

@yongman

@yongman yongman commented May 26, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #10844

Problem Summary:
The read concurrency is limited in region level.

What is changed and how it works?


  1. Support buckets level concurrency read in one region.
  2. Optimize the streams dispatching for pipelines. Change read task pre-alloc to on-demand to avoid thread imbalance.
  3. Lazy create columnar reader & prefetch create reader in async thread.
  4. Reuse SnapAccess for buckets concurrent read in one request.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • PD-backed caching for region bucket keys and shared-snapshot de-duplication to reduce duplicate leader snapshot requests and improve read latency.
    • New engine interface hooks to fetch region bucket keys and clear shared-snapshot state.
  • Refactor

    • Disaggregated columnar read pipeline reworked to a shared reader-task pool with lazy, slot-based reader materialization for better parallelism and resource efficiency.
    • Server skips legacy read-thread initialization when columnar mode is enabled.
  • Tests

    • Unit test added for cache eviction behavior during in-flight loads.

Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels May 26, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jinhelin for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds FFI callbacks and hub caches (region bucket keys; shared snap access), exposes them via FFI, and refactors the C++ disaggregated columnar reader to use a shared serialized context, bucket-aware planning, and lazy, slot-based proxy reader materialization; also gates DeltaMerge thread init when columnar mode is active.

Changes

Columnar Storage Read Path Enhancement

Layer / File(s) Summary
FFI Contract Extension for Bucket Keys and Snapshot Clearing
contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h, contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs, contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs, contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Adds fn_get_region_bucket_keys and fn_clear_shared_snap_access_by_start_ts callbacks, exports ffi_get_region_bucket_keys and ffi_clear_shared_snap_access_by_start_ts, and wires them into the hub FFI helper.
Region Bucket Key Caching in PD Client
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Adds RegionBucketCacheEntry and PdClientWithCache::region_bucket_cache, initializes and evicts it with region cache, and exposes get_region_bucket_keys that validates region epoch.
Shared Snapshot Access Deduplication
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Introduces SharedSnapAccessCache, SharedSnapAccessGroup, and SharedSnapAccessKey, implements weak-ref reuse with per-key async loader mutexes, get_or_request_shared_snapshot, and exposes clear_shared_snap_access_by_start_ts.
Shared context and bucket helpers (C++)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp, dbms/src/Storages/StorageDisaggregatedColumnar.h
Adds RNProxyReaderSharedContext, bucket-boundary helpers, and a proxy-backed fetch for region bucket keys with Rust-GC cleanup; serializes shared inputs and normalizes datetime literals to UTC.
createProxyColumnarReader and RegionError handling
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Implements per-reader payload construction, calls fn_get_columnar_reader with buffer/GC guards, and refines RegionError handling (epoch vs not-found) and lock resolution via shared context.
Slot-based materialization and runtime
dbms/src/Storages/StorageDisaggregatedColumnar.cpp, dbms/src/Storages/StorageDisaggregatedColumnar.h
Implements RNProxyReaderPlan/slots, RNProxyReadTask slot lifecycle, getOrCreateReader with backoff, prefetchReader, tryAcquireReaderIndex, and lazy RNProxyInputStream creation and acquisition.
Bucket-aware planning and integration
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
During buildProxyReadTask, fetches region bucket keys, splits physical ranges by bucket boundaries, expands into multiple RNProxyReaderPlan entries when enabled, and creates a shared RNProxyReadTask.
Source IO loop and throughput metrics
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Rewrites RNProxyInputStream::readImpl and RNProxySourceOp::executeIOImpl to use lazy readers, prefetch next readers, handle drained streams, and log rows/sec and bytes/sec.
Header/API refactors for plan-based readers
dbms/src/Storages/StorageDisaggregatedColumnar.h
Refactors RNProxyReadTask constructor and public API, updates RNProxyInputStream::Options and RNProxySourceOp::Options, and changes runtime state types to support the new model.

DeltaMerge Pool Conditional Initialization

Layer / File(s) Summary
Columnar Mode Detection in Server Startup
dbms/src/Server/Server.cpp
Guards DeltaMerge read-thread pool initialization so it runs only when disagg_opt.use_columnar is false.

Sequence Diagram(s)

sequenceDiagram
  participant StorageDisaggregated as StorageDisaggregatedColumnar
  participant RNTask as RNProxyReadTask
  participant HubFFI as Hub ffi (fn_get_columnar_reader / fn_get_region_bucket_keys)
  participant PD as PD (Bucket lookup)
  StorageDisaggregated->>RNTask: buildProxyReadTask(reader_plans)
  RNTask->>HubFFI: fn_get_region_bucket_keys(region_id, region_ver)
  HubFFI->>PD: request bucket keys (PD API)
  PD-->>HubFFI: bucket keys
  RNTask->>HubFFI: fn_get_columnar_reader(serialized_payload)
  HubFFI-->>RNTask: ColumnarReader handle / error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • CalvinNeo
  • JaySon-Huang
  • JinheLin

Poem

🐰 I nibbled through buckets, keys in tow,
I cached the snaps where soft winds blow,
Readers wake only when they're due,
Shared hearts wait — one fetch, then two,
Hooray — the columnar streams now flow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main feature: adding bucket-level parallel read support within a region, which aligns with the primary objective of removing region-level read concurrency limits.
Description check ✅ Passed The PR description covers the problem (region-level read concurrency limits), four key changes implemented, and notes manual testing was performed. However, the commit-message section is empty and lacks detailed explanation of how the implementation works.
Linked Issues check ✅ Passed The PR directly addresses issue #10844 by implementing bucket-level concurrent reads within regions, eliminating the region-level concurrency bottleneck. The changes support columnar storage consumption and enable read-path optimization as required.
Out of Scope Changes check ✅ Passed All changes are scoped to columnar storage read path optimization: FFI interfaces for bucket keys and snapshot access management, cloud storage caching, shared snapshot reuse, lazy reader creation/prefetching, and conditional DeltaMerge initialization. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp

dbms/src/Storages/StorageDisaggregatedColumnar.cpp:15:10: fatal error: 'Common/config.h' file not found
15 | #include <Common/config.h> // for ENABLE_NEXT_GEN_COLUMNAR
| ^~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/ac3da5b319c6e4e7687f69efde5e6b2072ec2079-796714f7dd93d507/tmp/clang_command_.tmp.d3d896.txt
++Contents of '/tmp/coderabbit-infer/ac3da5b319c6e4e7687f69efde5e6b2072ec2079-796714f7dd93d507/tmp/clang_command_.tmp.d3d896.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86

... [truncated 1176 characters] ...

e"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/796714f7dd93d507/file.o" "-x" "c++"
"dbms/src/Storages/StorageDisaggregatedColumnar.cpp" "-O0" "-fno-builtin"
"-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 26, 2026
Signed-off-by: yongman <yming0221@gmail.com>
yongman added 8 commits June 2, 2026 22:27
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
@yongman yongman marked this pull request as ready for review June 9, 2026 03:48
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@pantheon-ai

pantheon-ai Bot commented Jun 9, 2026

Copy link
Copy Markdown

@yongman I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (2)

95-107: ⚡ Quick win

Log the original cleanup exception.

This broad catch (...) drops the exception details from clear_shared_snap_access_by_start_ts, so failures here become almost impossible to diagnose. tryLogCurrentException(log, "...") is the project-standard way to preserve that context in cleanup paths.

♻️ Suggested change
         catch (...)
         {
-            try
-            {
-                LOG_WARNING(log, "clear shared snapaccess cache failed, start_ts={}", start_ts);
-            }
-            catch (...)
-            {
-            }
+            tryLogCurrentException(log, fmt::format("clear shared snapaccess cache failed, start_ts={}", start_ts));
         }

As per coding guidelines, Use tryLogCurrentException(log, "context") in broad catch (...) paths to avoid duplicated exception-formatting code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 95 - 107,
The catch-all after calling clear_shared_snap_access_by_start_ts(start_ts,
proxy_ptr) drops the original exception; replace the inner LOG_WARNING/empty
catch with a call to tryLogCurrentException(log, "clear shared snapaccess cache
failed, start_ts={}", start_ts) (or at minimum tryLogCurrentException(log,
"clear shared snapaccess cache failed, start_ts=" + toString(start_ts))) so the
original exception context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.

Source: Coding guidelines


675-675: ⚡ Quick win

Use the fmt-style DB::Exception constructor here.

This new branch is still using the legacy (message, code) form while the surrounding paths already use the repository-standard Exception(ErrorCodes::..., "...") style.

♻️ Suggested change
-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");

As per coding guidelines, Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` at line 675, Replace the
legacy throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage
with the repository-standard fmt-style constructor: throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error"); update the throw
site in StorageDisaggregatedColumnar (the throw of Exception referencing
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the ErrorCodes-first signature
and follow the same fmt-style pattern as surrounding paths.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 974-981: The group is being recreated by insert/get_loader after
remove_by_start_ts tears it down, so make evictions sticky: add a
tombstone/generation field to SharedSnapAccessGroup (e.g., evicted: AtomicBool
or generation: AtomicU64) and update remove_by_start_ts to mark the group's
tombstone/generation before removing it from groups; then change group
creation/lookup logic in insert and get_loader to consult that
tombstone/generation (if evicted or generation mismatches, refuse to recreate or
insert into the group and return an error/ignore) so any in-flight loader
holding the group's lock cannot resurrect it after clear (also ensure
request_snapshot_from_leader paths check the group's generation/tombstone before
calling insert).

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 1036-1076: The block-stream path is currently allowed to create
one RNProxyInputStream per bucket unit because planned_reader_num (derived from
total_max_reader_num after splitRangesByBucketKeys) can exceed num_streams;
modify the logic so getInputStreams() does not instantiate more streams than
num_streams: compute an effective_reader_num = min(planned_reader_num,
num_streams) (or cap by num_streams where planned_reader_num is used) and change
distribution so region_reader_plans and their bucket_units are assigned/shared
among those effective_reader_num logical readers (i.e., allow a single
RNProxyInputStream to process multiple bucket units/region plans) instead of
creating a stream per bucket unit; ensure enable_bucket_parallel,
planned_reader_num, total_max_reader_num and splitRangesByBucketKeys usages stay
but the instantiation point that creates RNProxyInputStream honors the cap and
uses a shared/task-pool style assignment.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 95-107: The catch-all after calling
clear_shared_snap_access_by_start_ts(start_ts, proxy_ptr) drops the original
exception; replace the inner LOG_WARNING/empty catch with a call to
tryLogCurrentException(log, "clear shared snapaccess cache failed, start_ts={}",
start_ts) (or at minimum tryLogCurrentException(log, "clear shared snapaccess
cache failed, start_ts=" + toString(start_ts))) so the original exception
context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.
- Line 675: Replace the legacy throw Exception("lock error",
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage with the repository-standard
fmt-style constructor: throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR,
"lock error"); update the throw site in StorageDisaggregatedColumnar (the throw
of Exception referencing ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the
ErrorCodes-first signature and follow the same fmt-style pattern as surrounding
paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 44b8edea-875b-4b99-9507-46688bf65dd6

📥 Commits

Reviewing files that changed from the base of the PR and between 88bb9d7 and d0564d7.

📒 Files selected for processing (8)
  • contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h
  • contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
  • dbms/src/Server/Server.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h

Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Signed-off-by: yongman <yming0221@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)

471-485: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Handle the empty-plan path before reading the first stream header.

Line 1136 now allows buildProxyReadTask() to return no tasks, but the block-stream path still assumes at least one source exists and unconditionally calls pipeline.firstStream() on Line 480. An empty-range scan will therefore crash here instead of returning an empty result.

Also applies to: 1136-1142

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 471 - 485,
The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.
🧹 Nitpick comments (1)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)

668-680: ⚡ Quick win

Use the repo-standard DB::Exception constructor here.

These throws reverse the expected (error_code, format, ...) order. Please switch them to the fmt-style form so they do not depend on a legacy overload.

Suggested cleanup
-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");
...
-            throw Exception("read_block failed in tiflash-proxy", ErrorCodes::LOGICAL_ERROR);
+            throw Exception(ErrorCodes::LOGICAL_ERROR, "read_block failed in tiflash-proxy");

As per coding guidelines, **/*.cpp: Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);

Also applies to: 1250-1254

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 668 - 680,
Replace legacy Exception throws that pass message then code with the
repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 471-485: The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 668-680: Replace legacy Exception throws that pass message then
code with the repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3573fc15-b3ea-48c0-b058-892e56d64f1d

📥 Commits

Reviewing files that changed from the base of the PR and between d0564d7 and 7b19a1c.

📒 Files selected for processing (2)
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • dbms/src/Storages/StorageDisaggregatedColumnar.h

Signed-off-by: yongman <yming0221@gmail.com>
@yongman

yongman commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2026
yongman added 2 commits June 9, 2026 17:30
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDisaggregatedColumnar.h (1)

116-118: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore a shared-ownership-only construction path for RNProxyReadTask.

With the old factory gone, this public constructor now allows stack or non-shared_ptr allocation even though createInputStream() / createSharedInputStream() call shared_from_this() downstream in dbms/src/Storages/StorageDisaggregatedColumnar.cpp:1008-1035. That turns an ownership mistake into a runtime std::bad_weak_ptr.

Suggested direction
 class RNProxyReadTask
     : public boost::noncopyable
     , public std::enable_shared_from_this<RNProxyReadTask>
 {
 public:
+    static RNProxyReadTaskPtr create(
+        std::vector<RNProxyReaderPlan> readerPlans,
+        size_t sourceNum,
+        std::shared_ptr<RNProxyReaderSharedContext> sharedReaderContext);
+
+private:
     RNProxyReadTask(
-        std::vector<RNProxyReaderPlan> reader_plans,
-        size_t source_num,
-        std::shared_ptr<RNProxyReaderSharedContext> shared_reader_context);
+        std::vector<RNProxyReaderPlan> readerPlans,
+        size_t sourceNum,
+        std::shared_ptr<RNProxyReaderSharedContext> sharedReaderContext);

As per coding guidelines, **/*.{cpp,h,hpp}: Prefer std::shared_ptr and std::unique_ptr for smart pointers, using std::make_shared and std::make_unique.

Also applies to: 144-147

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.h` around lines 116 - 118, The
RNProxyReadTask should be constructible only as a shared_ptr to avoid
bad_weak_ptr from shared_from_this; change RNProxyReadTask's public constructors
to be non-public (protected or private) and add a static factory like
RNProxyReadTask::createShared(...) that returns std::shared_ptr<RNProxyReadTask>
(use std::make_shared) so callers (including createInputStream and
createSharedInputStream) must obtain instances via shared_ptr; ensure any
existing public overloads referenced around RNProxyReadTask are removed or
delegated to the factory so shared_from_this() is always valid.

Source: Coding guidelines

🧹 Nitpick comments (2)
dbms/src/Storages/StorageDisaggregatedColumnar.h (2)

67-85: 🏗️ Heavy lift

Please add automated coverage for the one-shot reader-slot lifecycle.

This header introduces a concurrency-sensitive state machine (NotStarted / Creating / Ready / Failed / Consumed) plus shared vs fixed reader acquisition, but the PR summary says validation was manual only. I’d want focused coverage for failure wakeups, double-consume rejection, and prefetch-vs-inline materialization before this leaves hold.

Also applies to: 114-126, 204-205

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.h` around lines 67 - 85, Add
automated unit tests that exercise the RNProxyReaderSlot lifecycle and its
RNProxyReaderMaterializeState transitions: write tests that (1) simulate a
materialization failure by setting slot.exception and state=Failed and verify
waiting threads are woken via the slot.cv and that the exception is propagated
to callers; (2) validate double-consume rejection by acquiring slot.reader once
(state==Ready -> consume) then attempting a second acquire and asserting it
fails or is rejected and state becomes Consumed; and (3) cover
prefetch-vs-inline materialization by driving both code paths that populate
slot.reader directly (inline) and that prefetch/async-create (Creating -> Ready)
and assert correct final states, proper reader ownership in slot.reader, and
that the destructor (~RNProxyReaderSlot) does not leak or leave threads blocked.
Use RNProxyReaderSlot, RNProxyReaderMaterializeState, reader, exception, mutex,
and cv symbols to locate and exercise the logic.

59-65: 🏗️ Heavy lift

Normalize the new header API to TiFlash naming/types now.

The added public contract exposes snake_case identifiers and platform-width size_t indices. Since this header is new surface area for the columnar read path, it is much cheaper to align it with the project conventions before more callers land. As per coding guidelines, **/*.{cpp,h,hpp}: Use explicit width types from dbms/src/Core/Types.h: UInt8, UInt32, Int64, Float64, String and Method and variable names should use camelCase.

Also applies to: 118-156, 176-186, 210-215, 269-273

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.h` around lines 59 - 65, The
public struct and related APIs must follow project naming and explicit-width
type conventions: rename RNProxyReaderPlan member identifiers from snake_case to
camelCase (e.g., region_id -> regionId, region_ver -> regionVersion,
region_conf_ver -> regionConfVer, physical_table_ranges -> physicalTableRanges)
and replace any platform-width/index types (e.g., size_t or plain int) with the
explicit types from dbms/src/Core/Types.h (UInt8/UInt32/Int64/UInt64/etc.) as
appropriate; update the RNProxyReaderPlan definition and all other affected
structs/functions referenced in the comment (the other blocks at 118-156,
176-186, 210-215, 269-273) to use the same camelCase names and fixed-width types
so public API matches TiFlash conventions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.h`:
- Around line 116-118: The RNProxyReadTask should be constructible only as a
shared_ptr to avoid bad_weak_ptr from shared_from_this; change RNProxyReadTask's
public constructors to be non-public (protected or private) and add a static
factory like RNProxyReadTask::createShared(...) that returns
std::shared_ptr<RNProxyReadTask> (use std::make_shared) so callers (including
createInputStream and createSharedInputStream) must obtain instances via
shared_ptr; ensure any existing public overloads referenced around
RNProxyReadTask are removed or delegated to the factory so shared_from_this() is
always valid.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.h`:
- Around line 67-85: Add automated unit tests that exercise the
RNProxyReaderSlot lifecycle and its RNProxyReaderMaterializeState transitions:
write tests that (1) simulate a materialization failure by setting
slot.exception and state=Failed and verify waiting threads are woken via the
slot.cv and that the exception is propagated to callers; (2) validate
double-consume rejection by acquiring slot.reader once (state==Ready -> consume)
then attempting a second acquire and asserting it fails or is rejected and state
becomes Consumed; and (3) cover prefetch-vs-inline materialization by driving
both code paths that populate slot.reader directly (inline) and that
prefetch/async-create (Creating -> Ready) and assert correct final states,
proper reader ownership in slot.reader, and that the destructor
(~RNProxyReaderSlot) does not leak or leave threads blocked. Use
RNProxyReaderSlot, RNProxyReaderMaterializeState, reader, exception, mutex, and
cv symbols to locate and exercise the logic.
- Around line 59-65: The public struct and related APIs must follow project
naming and explicit-width type conventions: rename RNProxyReaderPlan member
identifiers from snake_case to camelCase (e.g., region_id -> regionId,
region_ver -> regionVersion, region_conf_ver -> regionConfVer,
physical_table_ranges -> physicalTableRanges) and replace any
platform-width/index types (e.g., size_t or plain int) with the explicit types
from dbms/src/Core/Types.h (UInt8/UInt32/Int64/UInt64/etc.) as appropriate;
update the RNProxyReaderPlan definition and all other affected structs/functions
referenced in the comment (the other blocks at 118-156, 176-186, 210-215,
269-273) to use the same camelCase names and fixed-width types so public API
matches TiFlash conventions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0fbcd96b-8f2f-40a1-922d-baf5d17bad4d

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff0938 and ac3da5b.

📒 Files selected for processing (3)
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp

@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@yongman: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-sanitizer-tsan ac3da5b link false /test pull-sanitizer-tsan

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yongman

yongman commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/unhold

@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2026
@JaySon-Huang

Copy link
Copy Markdown
Contributor

/test pull-integration-next-gen-columnar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support new columnar storage as data source

2 participants