[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines by harelhuang · Pull Request #2342 · tile-ai/tilelang

harelhuang · 2026-06-05T12:05:55Z

Summary

Fix LayoutInference crash when two sibling T.Pipelined loops share the same alloc_shared buffer but use different num_stages
Extends and replaces [Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines #2337 with a correct fix that works on CUDA backend
Fixes [BUG][Fuzzer] LayoutInference crash for sibling T.Pipelined loops sharing alloc_shared buffers with different num_stages #2309

Root Cause

When two T.Pipelined loops share A_shared/B_shared buffers with different num_stages (e.g. 2 and 4):

First pipeline expands A_shared from (64,32) to (2,64,32) via RewriteAllocBuffer
Second pipeline's CollectUsedPipelineBuffers picks up the already-expanded (2,64,32) buffer
RewriteAllocBuffer inserts another dimension, producing (4,2,64,32) — a 4D buffer
LayoutInference crashes: ICHECK fails comparing (2,64,32) input shape with (4,64,32) expected shape

Fix (3 parts)

Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars (not by reverse Buffer object lookup, which fails on CUDA path)
Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion for the second pipeline
Reconcile old/new Buffer objects across the entire block body via a BufferReplacer (handles BufferLoad, BufferStore, DeclBuffer, AllocBuffer) so LayoutInference sees a single consistent buffer object per Var

Test plan

Reproducer from [BUG][Fuzzer] LayoutInference crash for sibling T.Pipelined loops sharing alloc_shared buffers with different num_stages #2309 compiles and runs correctly on CUDA (NVIDIA L20)
No regression on standard single-pipeline GEMM
make -j8 builds cleanly

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed buffer expansion inconsistencies in pipelined loops so buffers retain correct dimensionality when sibling pipelines interact.
- Ensured rewritten pipeline bodies consistently reference the updated buffer allocations, preventing mismatched buffer objects after transformation.
- Reconciled chained sibling pipeline expansions so successive pipeline transformations produce consistent buffer references across a block.
Tests
- Added a regression test that validates sibling pipelined loops with differing stage counts.

github-actions · 2026-06-05T12:06:06Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-06-05T12:10:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3eb0aa7-4a30-4ab3-83dd-6aaa57ab8600

📥 Commits

Reviewing files that changed from the base of the PR and between 7996c72 and 0a18521.

📒 Files selected for processing (1)

testing/python/issue/test_tilelang_issue_2309.py

📝 Walkthrough

Walkthrough

Detects and reconcile sibling-expanded shared buffers during pipeline injection: introduces a BufferReplacer, records original→old-expanded mappings when encountering previously expanded sibling buffers, rewrites pipelines to use consistent expanded buffers, and applies replacements at block scope.

Changes

Buffer reconciliation for sibling pipelined loops

Layer / File(s)	Summary
State extension and BufferReplacer utility `src/transform/inject_pipeline.cc`	`BufferReplacer` is introduced as a `StmtExprMutator` to rewrite `BufferLoad`, `BufferStore`, `DeclBuffer`, and `AllocBuffer` nodes and update `SBlockNode` alloc/reads/writes. PipelineInjector gains maps for tracking original→old-expanded and old-expanded→new-expanded buffers.
Pipeline buffer detection `src/transform/inject_pipeline.cc`	During For-node pipeline collection, buffers already expanded by sibling pipelined loops are detected via `pending_buffer_remap_`, recorded as original-2D → old-expanded, and `pipeline_allocs` entries are replaced with original 2D buffers before rewrite.
Post-RewritePipeline reconciliation `src/transform/inject_pipeline.cc`	After `RewritePipeline`, the pass builds an old-expanded → new-expanded mapping from `rewrite_result.buffer_remap` and applies `BufferReplacer` to the rewritten pipeline statement so references point to newly expanded buffers.
Block-level buffer reconciliation `src/transform/inject_pipeline.cc`	When visiting `SBlockNode`, chained replacements from `old_expanded_to_new_` are composed for multi-sibling cases, `BufferReplacer` is applied to the block body, and `children_modified` is set to true when replacements occur.
Sibling-pipeline regression test `testing/python/issue/test_tilelang_issue_2309.py`	Adds `run_asymmetric_pipeline()` and `test_sibling_pipeline_different_num_stages()` that compile and run a kernel with two T.Pipelined loops sharing `alloc_shared` buffers with different `num_stages`, validating results against a PyTorch reference.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

oraluben
LeiWang1999

Poem

🐰
Buffers tangled, siblings raced through night,
I hop in softly to set each mapping right,
Loads, stores, allocs — I tuck them into place,
Old into new, every reference finds its space,
The IR hums calmly, pipelines sleep tight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main fix: preventing double expansion of shared buffers across sibling pipelines, which directly addresses the core issue.
Linked Issues check	✅ Passed	The PR implements all coding requirements from issue `#2309`: detects already-expanded buffers, replaces them with originals for fresh expansion, reconciles Buffer objects via BufferReplacer, and includes a regression test.
Out of Scope Changes check	✅ Passed	All changes directly address the sibling pipeline buffer expansion issue; no unrelated modifications detected in the implementation or test files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/transform/inject_pipeline.cc`:
- Around line 754-767: old_expanded_to_new_ currently accumulates pairwise
mappings and can create transitive chains (e.g., A->B then B->C) so a single
BufferReplacer call leaves intermediate buffers in the block; when building
mappings from old_expanded_buffers_ and rewrite_result.buffer_remap you must
collapse/compose chains to the final target before calling BufferReplacer: for
each mapping created from old_expanded to new_expanded, follow any existing
mapping in old_expanded_to_new_ (and any new mappings from
rewrite_result.buffer_remap) until you reach the final buffer and store
old_expanded -> final_new_expanded (or alternatively clear/rebuild
old_expanded_to_new_ at the start of the parent loop and only insert the latest
mappings); apply the same composition/clear logic at the other occurrence
(around the 1084-1088 region) so BufferReplacer receives only direct final
mappings.
- Around line 45-82: BufferReplacer currently only rewrites
BufferLoad/BufferStore/DeclBuffer/AllocBuffer but leaves SBlock metadata
(alloc_buffers, reads, writes) pointing to old Buffer objects, causing mixed
Buffer instances for the same data; update BufferReplacer to also handle
SBlock-like nodes by adding/overriding VisitStmt_ for the node type that carries
block metadata (the SBlock container used in this IR) and, when
replacements_.Get(old_buf) is true, replace entries in alloc_buffers, reads, and
writes with the new Buffer objects and reconstruct the SBlock node (similar to
the approach in flatten_buffer.cc) so nested blocks are fully reconciled across
the subtree.
- Around line 51-63: When replacing BufferLoad/BufferStore in VisitExpr_(const
BufferLoadNode *op) and VisitStmt_(const BufferStoreNode *op) via
replacements_.Get, preserve and recursively mutate child nodes instead of
rebuilding from op with empty predicate/indices; specifically, for
BufferLoadNode use new_buf.value() but pass visited indices (visit each
op->indices via VisitExpr) and preserve/visit op->predicate (use VisitExpr on
predicate if present) and for BufferStoreNode pass visited indices (visit each
index) as well as the already-visited value; keep the original span. If not
replacing, continue to call the StmtExprMutator fallbacks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6d02361-323f-4e2e-bb9a-0683d5637127

📥 Commits

Reviewing files that changed from the base of the PR and between 750e0ca and 5e9d81a.

📒 Files selected for processing (1)

src/transform/inject_pipeline.cc

…CUDA) When two T.Pipelined loops share the same alloc_shared buffer but use different num_stages (e.g. 2 and 4), the first pipeline expands the buffer from (M,K) to (2,M,K). The second pipeline then tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference with a shape mismatch ICHECK. This fix extends the original PR tile-ai#2337 to work on CUDA backend: 1. Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars (not by reverse Buffer object lookup, which fails on CUDA path). 2. Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion for the second pipeline. 3. After RewritePipeline, reconcile old and new expanded Buffer objects across the entire block body via a BufferReplacer (handles BufferLoad, BufferStore, DeclBuffer, AllocBuffer), so LayoutInference sees a single consistent buffer object per Var. Tested on CUDA (NVIDIA L20) with asymmetric num_stages (2 and 4), and on the reproducer from issue tile-ai#2309. No regression on standard single-pipeline GEMM. Fixes tile-ai#2309. Closes tile-ai#2337.

LeiWang1999 · 2026-06-05T15:05:07Z

@harelhuang Thanks, would you mind providing a related test or a script to help us reproduce and review?

harelhuang · 2026-06-08T06:53:15Z

Sure! Here's a minimal reproducer that crashes without the fix on CUDA:

import tilelang
import tilelang.language as T

M, N, K = 512, 512, 512
BM, BN, BK = 64, 64, 32

@tilelang.jit
def asymmetric_stages(A, B):
    A: T.Tensor((M, K), T.float16)
    B: T.Tensor((K, N), T.float16)
    C = T.empty((M, N), T.float16)
    K_half = K // 2
    with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
        A_shared = T.alloc_shared((BM, BK), T.float16)
        B_shared = T.alloc_shared((BK, BN), T.float16)
        C_local = T.alloc_fragment((BM, BN), T.float32)
        T.clear(C_local)
        for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=2):
            T.copy(A[by * BM, ko * BK], A_shared)
            T.copy(B[ko * BK, bx * BN], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=4):
            T.copy(A[by * BM, K_half + ko * BK], A_shared)
            T.copy(B[K_half + ko * BK, bx * BN], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * BM, bx * BN])
    return C

kernel = asymmetric_stages.compile(M=M, N=N, K=K)

Without the fix, this crashes with:

tvm.error.InternalError: Check failed: (az->CanProveEqual(...)) is false:
InputShape() = (2, 64, 32) shape = (4, 64, 32), rescale_num = 16, rescale_den = 16

The two T.Pipelined loops share A_shared and B_shared but use different num_stages (2 vs 4). The first pipeline expands the buffer from (64,32) to (2,64,32), and the second pipeline tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference.

Tested on CUDA (NVIDIA L20) — compiles and runs correctly with the fix.

…ling pipelines (CUDA) When two T.Pipelined loops share the same alloc_shared buffer but use different num_stages (e.g. 2 and 4), the first pipeline expands the buffer from (M,K) to (2,M,K). The second pipeline then tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference with a shape mismatch ICHECK. Fix: 1. Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars. 2. Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion. 3. Reconcile old/new Buffer objects across the entire block body via BufferReplacer (handles BufferLoad/BufferStore/DeclBuffer/AllocBuffer and SBlock metadata). 4. Compose replacement chains to handle 3+ sibling pipelines sharing the same buffer. Fixes tile-ai#2309. Closes tile-ai#2337. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ile-ai#2309)

harelhuang · 2026-06-08T11:37:07Z

Hi @LeiWang1999, could you re-run the failed CI job? The only failure is test_reduce[sum-bfloat16-128x128-f2f-t32-b1] — a pre-existing flaky test unrelated to this fix.

Our regression test test_tilelang_issue_2309.py::test_sibling_pipeline_different_num_stages passed on both CUDA and Metal:

CUDA: PASSED (7.51s)
Metal: PASSED (4m28s)

All other checks (Quick Lint, pre-commit.ci, CodeRabbit) are green.

harelhuang · 2026-06-09T07:54:06Z

@LeiWang1999 Confirmed this is a pre-existing flaky test — the same test_reduce[sum-bfloat16-128x128-f2f-t32-b1] failure also appears on chore/bump-version-0.1.11 (run 27114895653) which doesn't touch pipeline code at all.

Our change only modifies src/transform/inject_pipeline.cc and adds testing/python/issue/test_tilelang_issue_2309.py. The reduce test is unrelated.

Could you re-run the failed CUDA job?

harelhuang · 2026-06-09T08:01:50Z

Correction: the bump-version PR is #2354 (not the branch name). It had the same test_reduce[sum-bfloat16-128x128-f2f-t32-b1] failure and was still merged, confirming this is a known non-blocking flaky test.

harelhuang force-pushed the fix/sibling-pipeline-shared-buffer-cuda branch from 2fa543f to 5e9d81a Compare June 5, 2026 12:08

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread src/transform/inject_pipeline.cc

Comment thread src/transform/inject_pipeline.cc Outdated

Comment thread src/transform/inject_pipeline.cc

harelhuang force-pushed the fix/sibling-pipeline-shared-buffer-cuda branch from 5e9d81a to a27d91c Compare June 5, 2026 12:20

harelhuang and others added 2 commits June 8, 2026 15:21

[Test] Add regression test for sibling pipeline shared buffer (issue t…

0a18521

…ile-ai#2309)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342

[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342
harelhuang wants to merge 3 commits into
tile-ai:mainfrom
harelhuang:fix/sibling-pipeline-shared-buffer-cuda

harelhuang commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LeiWang1999 commented Jun 5, 2026

Uh oh!

harelhuang commented Jun 8, 2026

Uh oh!

harelhuang commented Jun 8, 2026

Uh oh!

harelhuang commented Jun 9, 2026

Uh oh!

harelhuang commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

harelhuang commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix (3 parts)

Test plan

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LeiWang1999 commented Jun 5, 2026

Uh oh!

harelhuang commented Jun 8, 2026

Uh oh!

harelhuang commented Jun 8, 2026

Uh oh!

harelhuang commented Jun 9, 2026

Uh oh!

harelhuang commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

harelhuang commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading