Skip to content

[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342

Open
harelhuang wants to merge 3 commits into
tile-ai:mainfrom
harelhuang:fix/sibling-pipeline-shared-buffer-cuda
Open

[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342
harelhuang wants to merge 3 commits into
tile-ai:mainfrom
harelhuang:fix/sibling-pipeline-shared-buffer-cuda

Conversation

@harelhuang

@harelhuang harelhuang commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Root Cause

When two T.Pipelined loops share A_shared/B_shared buffers with different num_stages (e.g. 2 and 4):

  1. First pipeline expands A_shared from (64,32) to (2,64,32) via RewriteAllocBuffer
  2. Second pipeline's CollectUsedPipelineBuffers picks up the already-expanded (2,64,32) buffer
  3. RewriteAllocBuffer inserts another dimension, producing (4,2,64,32) — a 4D buffer
  4. LayoutInference crashes: ICHECK fails comparing (2,64,32) input shape with (4,64,32) expected shape

Fix (3 parts)

  1. Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars (not by reverse Buffer object lookup, which fails on CUDA path)
  2. Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion for the second pipeline
  3. Reconcile old/new Buffer objects across the entire block body via a BufferReplacer (handles BufferLoad, BufferStore, DeclBuffer, AllocBuffer) so LayoutInference sees a single consistent buffer object per Var

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed buffer expansion inconsistencies in pipelined loops so buffers retain correct dimensionality when sibling pipelines interact.
    • Ensured rewritten pipeline bodies consistently reference the updated buffer allocations, preventing mismatched buffer objects after transformation.
    • Reconciled chained sibling pipeline expansions so successive pipeline transformations produce consistent buffer references across a block.
  • Tests

    • Added a regression test that validates sibling pipelined loops with differing stage counts.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@harelhuang harelhuang force-pushed the fix/sibling-pipeline-shared-buffer-cuda branch from 2fa543f to 5e9d81a Compare June 5, 2026 12:08
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3eb0aa7-4a30-4ab3-83dd-6aaa57ab8600

📥 Commits

Reviewing files that changed from the base of the PR and between 7996c72 and 0a18521.

📒 Files selected for processing (1)
  • testing/python/issue/test_tilelang_issue_2309.py

📝 Walkthrough

Walkthrough

Detects and reconcile sibling-expanded shared buffers during pipeline injection: introduces a BufferReplacer, records original→old-expanded mappings when encountering previously expanded sibling buffers, rewrites pipelines to use consistent expanded buffers, and applies replacements at block scope.

Changes

Buffer reconciliation for sibling pipelined loops

Layer / File(s) Summary
State extension and BufferReplacer utility
src/transform/inject_pipeline.cc
BufferReplacer is introduced as a StmtExprMutator to rewrite BufferLoad, BufferStore, DeclBuffer, and AllocBuffer nodes and update SBlockNode alloc/reads/writes. PipelineInjector gains maps for tracking original→old-expanded and old-expanded→new-expanded buffers.
Pipeline buffer detection
src/transform/inject_pipeline.cc
During For-node pipeline collection, buffers already expanded by sibling pipelined loops are detected via pending_buffer_remap_, recorded as original-2D → old-expanded, and pipeline_allocs entries are replaced with original 2D buffers before rewrite.
Post-RewritePipeline reconciliation
src/transform/inject_pipeline.cc
After RewritePipeline, the pass builds an old-expanded → new-expanded mapping from rewrite_result.buffer_remap and applies BufferReplacer to the rewritten pipeline statement so references point to newly expanded buffers.
Block-level buffer reconciliation
src/transform/inject_pipeline.cc
When visiting SBlockNode, chained replacements from old_expanded_to_new_ are composed for multi-sibling cases, BufferReplacer is applied to the block body, and children_modified is set to true when replacements occur.
Sibling-pipeline regression test
testing/python/issue/test_tilelang_issue_2309.py
Adds run_asymmetric_pipeline() and test_sibling_pipeline_different_num_stages() that compile and run a kernel with two T.Pipelined loops sharing alloc_shared buffers with different num_stages, validating results against a PyTorch reference.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • oraluben
  • LeiWang1999

Poem

🐰
Buffers tangled, siblings raced through night,
I hop in softly to set each mapping right,
Loads, stores, allocs — I tuck them into place,
Old into new, every reference finds its space,
The IR hums calmly, pipelines sleep tight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main fix: preventing double expansion of shared buffers across sibling pipelines, which directly addresses the core issue.
Linked Issues check ✅ Passed The PR implements all coding requirements from issue #2309: detects already-expanded buffers, replaces them with originals for fresh expansion, reconciles Buffer objects via BufferReplacer, and includes a regression test.
Out of Scope Changes check ✅ Passed All changes directly address the sibling pipeline buffer expansion issue; no unrelated modifications detected in the implementation or test files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/transform/inject_pipeline.cc`:
- Around line 754-767: old_expanded_to_new_ currently accumulates pairwise
mappings and can create transitive chains (e.g., A->B then B->C) so a single
BufferReplacer call leaves intermediate buffers in the block; when building
mappings from old_expanded_buffers_ and rewrite_result.buffer_remap you must
collapse/compose chains to the final target before calling BufferReplacer: for
each mapping created from old_expanded to new_expanded, follow any existing
mapping in old_expanded_to_new_ (and any new mappings from
rewrite_result.buffer_remap) until you reach the final buffer and store
old_expanded -> final_new_expanded (or alternatively clear/rebuild
old_expanded_to_new_ at the start of the parent loop and only insert the latest
mappings); apply the same composition/clear logic at the other occurrence
(around the 1084-1088 region) so BufferReplacer receives only direct final
mappings.
- Around line 45-82: BufferReplacer currently only rewrites
BufferLoad/BufferStore/DeclBuffer/AllocBuffer but leaves SBlock metadata
(alloc_buffers, reads, writes) pointing to old Buffer objects, causing mixed
Buffer instances for the same data; update BufferReplacer to also handle
SBlock-like nodes by adding/overriding VisitStmt_ for the node type that carries
block metadata (the SBlock container used in this IR) and, when
replacements_.Get(old_buf) is true, replace entries in alloc_buffers, reads, and
writes with the new Buffer objects and reconstruct the SBlock node (similar to
the approach in flatten_buffer.cc) so nested blocks are fully reconciled across
the subtree.
- Around line 51-63: When replacing BufferLoad/BufferStore in VisitExpr_(const
BufferLoadNode *op) and VisitStmt_(const BufferStoreNode *op) via
replacements_.Get, preserve and recursively mutate child nodes instead of
rebuilding from op with empty predicate/indices; specifically, for
BufferLoadNode use new_buf.value() but pass visited indices (visit each
op->indices via VisitExpr) and preserve/visit op->predicate (use VisitExpr on
predicate if present) and for BufferStoreNode pass visited indices (visit each
index) as well as the already-visited value; keep the original span. If not
replacing, continue to call the StmtExprMutator fallbacks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6d02361-323f-4e2e-bb9a-0683d5637127

📥 Commits

Reviewing files that changed from the base of the PR and between 750e0ca and 5e9d81a.

📒 Files selected for processing (1)
  • src/transform/inject_pipeline.cc

Comment thread src/transform/inject_pipeline.cc
Comment thread src/transform/inject_pipeline.cc Outdated
Comment thread src/transform/inject_pipeline.cc
…CUDA)

When two T.Pipelined loops share the same alloc_shared buffer but use
different num_stages (e.g. 2 and 4), the first pipeline expands the
buffer from (M,K) to (2,M,K). The second pipeline then tries to expand
the already-3D buffer again, producing a 4D buffer that crashes
LayoutInference with a shape mismatch ICHECK.

This fix extends the original PR tile-ai#2337 to work on CUDA backend:
1. Detect already-expanded buffers in pipeline_allocs by matching against
   pending_buffer_remap_ entries via their data Vars (not by reverse
   Buffer object lookup, which fails on CUDA path).
2. Replace already-expanded 3D buffers with their original 2D form so
   RewritePipeline creates a fresh expansion for the second pipeline.
3. After RewritePipeline, reconcile old and new expanded Buffer objects
   across the entire block body via a BufferReplacer (handles BufferLoad,
   BufferStore, DeclBuffer, AllocBuffer), so LayoutInference sees a
   single consistent buffer object per Var.

Tested on CUDA (NVIDIA L20) with asymmetric num_stages (2 and 4),
and on the reproducer from issue tile-ai#2309. No regression on standard
single-pipeline GEMM.

Fixes tile-ai#2309.
Closes tile-ai#2337.
@harelhuang harelhuang force-pushed the fix/sibling-pipeline-shared-buffer-cuda branch from 5e9d81a to a27d91c Compare June 5, 2026 12:20
@LeiWang1999

Copy link
Copy Markdown
Member

@harelhuang Thanks, would you mind providing a related test or a script to help us reproduce and review?

@harelhuang

Copy link
Copy Markdown
Contributor Author

Sure! Here's a minimal reproducer that crashes without the fix on CUDA:

import tilelang
import tilelang.language as T

M, N, K = 512, 512, 512
BM, BN, BK = 64, 64, 32

@tilelang.jit
def asymmetric_stages(A, B):
    A: T.Tensor((M, K), T.float16)
    B: T.Tensor((K, N), T.float16)
    C = T.empty((M, N), T.float16)
    K_half = K // 2
    with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
        A_shared = T.alloc_shared((BM, BK), T.float16)
        B_shared = T.alloc_shared((BK, BN), T.float16)
        C_local = T.alloc_fragment((BM, BN), T.float32)
        T.clear(C_local)
        for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=2):
            T.copy(A[by * BM, ko * BK], A_shared)
            T.copy(B[ko * BK, bx * BN], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=4):
            T.copy(A[by * BM, K_half + ko * BK], A_shared)
            T.copy(B[K_half + ko * BK, bx * BN], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * BM, bx * BN])
    return C

kernel = asymmetric_stages.compile(M=M, N=N, K=K)

Without the fix, this crashes with:

tvm.error.InternalError: Check failed: (az->CanProveEqual(...)) is false:
InputShape() = (2, 64, 32) shape = (4, 64, 32), rescale_num = 16, rescale_den = 16

The two T.Pipelined loops share A_shared and B_shared but use different num_stages (2 vs 4). The first pipeline expands the buffer from (64,32) to (2,64,32), and the second pipeline tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference.

Tested on CUDA (NVIDIA L20) — compiles and runs correctly with the fix.

harelhuang and others added 2 commits June 8, 2026 15:21
…ling pipelines (CUDA)

When two T.Pipelined loops share the same alloc_shared buffer but use
different num_stages (e.g. 2 and 4), the first pipeline expands the
buffer from (M,K) to (2,M,K). The second pipeline then tries to expand
the already-3D buffer again, producing a 4D buffer that crashes
LayoutInference with a shape mismatch ICHECK.

Fix:
1. Detect already-expanded buffers in pipeline_allocs by matching against
   pending_buffer_remap_ entries via their data Vars.
2. Replace already-expanded 3D buffers with their original 2D form so
   RewritePipeline creates a fresh expansion.
3. Reconcile old/new Buffer objects across the entire block body via
   BufferReplacer (handles BufferLoad/BufferStore/DeclBuffer/AllocBuffer
   and SBlock metadata).
4. Compose replacement chains to handle 3+ sibling pipelines sharing
   the same buffer.

Fixes tile-ai#2309.
Closes tile-ai#2337.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@harelhuang

Copy link
Copy Markdown
Contributor Author

Hi @LeiWang1999, could you re-run the failed CI job? The only failure is test_reduce[sum-bfloat16-128x128-f2f-t32-b1] — a pre-existing flaky test unrelated to this fix.

Our regression test test_tilelang_issue_2309.py::test_sibling_pipeline_different_num_stages passed on both CUDA and Metal:

  • CUDA: PASSED (7.51s)
  • Metal: PASSED (4m28s)

All other checks (Quick Lint, pre-commit.ci, CodeRabbit) are green.

@harelhuang

Copy link
Copy Markdown
Contributor Author

@LeiWang1999 Confirmed this is a pre-existing flaky test — the same test_reduce[sum-bfloat16-128x128-f2f-t32-b1] failure also appears on chore/bump-version-0.1.11 (run 27114895653) which doesn't touch pipeline code at all.

Our change only modifies src/transform/inject_pipeline.cc and adds testing/python/issue/test_tilelang_issue_2309.py. The reduce test is unrelated.

Could you re-run the failed CUDA job?

@harelhuang

Copy link
Copy Markdown
Contributor Author

Correction: the bump-version PR is #2354 (not the branch name). It had the same test_reduce[sum-bfloat16-128x128-f2f-t32-b1] failure and was still merged, confirming this is a known non-blocking flaky test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG][Fuzzer] LayoutInference crash for sibling T.Pipelined loops sharing alloc_shared buffers with different num_stages

2 participants