[Metal] M5 Cooperative Tensor T.gemm by oraluben · Pull Request #2252 · tile-ai/tilelang

oraluben · 2026-05-23T11:38:55Z

Summary

This PR extends the existing Metal backend with a Metal 4 cooperative tensor path for T.gemm.

The Metal backend already supported simdgroup-based GEMM lowering. This PR adds cooperative tensor as a new fast path on supported Apple GPUs, while keeping simdgroup as the compatibility path for older devices and non-Metal-4 targets.

Motivation

Metal 4 cooperative tensor exposes Apple's tensor-core-like matrix compute path. On supported hardware, it provides a substantially faster GEMM implementation than the existing simdgroup path.

This PR adds that path to TileLang so Metal T.gemm can use the newer hardware capability while preserving the existing simdgroup implementation for compatibility.

Design Notes

Although cooperative tensor is conceptually the Metal-side counterpart of CUDA tensor core programming, the programming model is not a direct CUDA clone.

As a practical approximation for CUDA reviewers, Apple GPUs expose a less CUDA-like split between register and threadgroup storage; both are backed by a more hardware-managed on-chip memory system. Because of that, explicit threadgroup staging is not automatically a faster path than feeding cooperative tensor operands directly.

This PR therefore keeps CUDA-shaped shared staging as a compatibility path, but optimizes the direct cooperative-tensor path as the Metal fast path. T.gemm remains the frontend abstraction, and Metal-specific instruction choice stays inside the Metal backend.

What Changed

At a high level, this PR adds:

- cooperative tensor lowering for Metal GEMM;
- Metal codegen support for cooperative tensor source emission;
- target capability guarding so cooperative tensor source is only generated when supported/requested;
- preservation of the existing simdgroup path;
- pass-level compatibility so Metal-specific storage scopes do not leak into generic TVM assumptions;
- tests for both the new cooperative tensor path and the existing simdgroup path.

Detailed lowering rules and implementation notes are documented separately in the Metal compiler internals doc.

Impact on TileLang

The main TileLang-level impact is that Metal now has a dedicated high-performance GEMM path that reflects Metal's own matrix programming model.

In particular:

- T.gemm remains the user-facing abstraction.
- Existing simdgroup kernels continue to work.
- Cooperative tensor support is gated by target/runtime capability.
- CUDA-style shared staging remains supported for compatibility, but is not assumed to be the default Metal performance model.
- Future Metal schedules, such as shared-staging bypass or MLX-style variants, can be added behind the same backend boundary without changing the frontend API.

Compatibility

This PR is intended to be backward-compatible for existing Metal users.

The existing simdgroup path is still present and tested. The new cooperative tensor path is only used when the target/runtime capability allows it, so building TileLang with a newer SDK should not force all Metal kernels to require Metal 4.

One important compatibility check is that this PR has been validated on GitHub Actions with macOS 26 and M1 hardware. That environment exposes the newer SDK at build time but does not support cooperative tensor in hardware, so passing there verifies that the backend correctly falls back to the simdgroup path on unsupported devices.
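
The gating described above can be pictured with a small, self-contained sketch. This is not TileLang's actual selection code: `MetalTarget`, `select_gemm_path`, and the two boolean fields are hypothetical names, used only to illustrate that both the SDK and the GPU must support Metal 4 before the cooperative tensor path is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetalTarget:
    """Hypothetical stand-in for a Metal target description."""

    sdk_has_metal4: bool  # build time: the SDK exposes Metal 4
    gpu_has_metal4: bool  # run time: the GPU supports cooperative tensor


def select_gemm_path(target: MetalTarget) -> str:
    """Pick a GEMM lowering path; simdgroup is the compatibility default."""
    if target.sdk_has_metal4 and target.gpu_has_metal4:
        return "cooperative_tensor"
    return "simdgroup"


# The CI case from the text: newer SDK, M1 GPU without cooperative tensor.
ci_m1 = MetalTarget(sdk_has_metal4=True, gpu_has_metal4=False)
# A device where both the SDK and the GPU support Metal 4.
supported = MetalTarget(sdk_has_metal4=True, gpu_has_metal4=True)

print(select_gemm_path(ci_m1))      # simdgroup
print(select_gemm_path(supported))  # cooperative_tensor
```

Keeping the SDK check and the hardware check as separate inputs mirrors the point of the macOS 26 + M1 run: a newer SDK alone should not switch kernels to the Metal 4 path.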

Testing

Test coverage includes:

- Metal simdgroup fallback / old path;
- Metal cooperative tensor codegen;
- runtime cooperative tensor correctness where supported;
- source-level checks to avoid pulling cooperative tensor dependencies into simdgroup-only kernels.
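
A source-level check of the kind listed last could look roughly like the sketch below. It is an illustration, not the test code in this PR: the marker list, the helper names, and the two sample source strings are assumptions chosen for the example.

```python
# Substrings assumed to appear only in cooperative tensor (MPP) shader source.
MPP_MARKERS = ("mpp::tensor_ops", "matmul2d_descriptor", "cooperative_tensor")


def uses_cooperative_tensor(msl_source: str) -> bool:
    """Return True if the generated source references any MPP marker."""
    return any(marker in msl_source for marker in MPP_MARKERS)


def assert_simdgroup_only(msl_source: str) -> None:
    """Fail if a simdgroup-only kernel pulled in cooperative tensor symbols."""
    leaked = [m for m in MPP_MARKERS if m in msl_source]
    assert not leaked, f"simdgroup-only kernel references MPP symbols: {leaked}"


# Illustrative snippets, not real generated kernels.
simdgroup_src = "simdgroup_float8x8 c; simdgroup_multiply_accumulate(c, a, b, c);"
ct_src = "constexpr auto desc = mpp::tensor_ops::matmul2d_descriptor(16, 32, 16);"

assert_simdgroup_only(simdgroup_src)        # passes silently
print(uses_cooperative_tensor(ct_src))      # True
```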

Validated locally with:

pip install .
python -m pytest testing/python/metal/ -q -x
python -m pre_commit run --all-files

GitHub Actions additionally validated the macOS 26 + M1 fallback case described above.

Summary by CodeRabbit

Release Notes

New Features
- Added cooperative tensor support for Metal GPU kernels, enabling more efficient GEMM operations with new language primitives (cooperative_tensor_fill, cooperative_tensor_load, cooperative_tensor_store, cooperative_tensor_multiply_accumulate).
- Added Metal4 capability detection to enhance GPU compatibility and feature availability.
- Extended GEMM benchmarking with cooperative tensor variants and optional MLX reference comparisons.
Documentation
- Added comprehensive Metal TileLang development guide covering backend lowering, execution model, and implementation details.
Chores
- Updated TVM dependency to latest version.

Expose TileLang-owned cooperative tensor builtins so Metal MPP lowering does not depend on extra TVM fork APIs.

Add a shape-aware MPP instruction choice for shared-output Metal GEMM while preserving simdgroup fallback for fragments and unsupported tiles.

Generate Metal 4 MPP matmul2d code for cooperative tensor intrinsics and keep source-only codegen separate from runtime compilation.

Split Metal GEMM lowering into simdgroup and cooperative tensor emitters so M5 tiles use MPP while fragment accumulators keep the existing path.

Keep generic allocation and storage rewrites away from opaque Metal cooperative tensor scopes to avoid invalid scope analysis.

Add runtime and source-only coverage for non-square MPP GEMM so the new cooperative tensor path is reproducible in CI and on M5.

Point the submodule at the macOS SDK guarded Metal 4 runtime update used by cooperative tensor shaders.

Add a reference page covering the two Metal GEMM paths, selection rules, current limitations, and planned follow-up work.

coderabbitai · 2026-05-23T11:39:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4077ed8a-18bd-49c2-9683-5daa94124474

📥 Commits

Reviewing files that changed from the base of the PR and between 6bdd892 and daa8342.

📒 Files selected for processing (13)

docs/index.md
src/metal/op/copy.cc
src/metal/op/fill.cc
src/metal/op/utils.h
src/op/builtin.cc
src/op/gemm.cc
src/op/gemm.h
src/transform/storage_rewrite.cc
tilelang/metal/__init__.py
tilelang/metal/op/gemm/gemm_metal.py
tilelang/metal/utils.py
tilelang/transform/decouple_type_cast.py
tilelang/utils/language.py

💤 Files with no reviewable changes (2)

tilelang/utils/language.py
src/metal/op/utils.h

✅ Files skipped from review due to trivial changes (3)

docs/index.md
tilelang/metal/utils.py
tilelang/metal/__init__.py

🚧 Files skipped from review as they are similar to previous changes (6)

src/op/gemm.h
src/op/gemm.cc
src/op/builtin.cc
src/metal/op/copy.cc
src/metal/op/fill.cc
tilelang/metal/op/gemm/gemm_metal.py

📝 Walkthrough

This PR adds Metal 4 cooperative tensor GEMM support to TileLang. It introduces four new builtin intrinsics (cooperative_tensor_fill/load/store/multiply_accumulate), Metal4 target detection, Python frontend wrappers, a dual-path MPSIntrinEmitter, split GemmMetal/GemmMetalSimdGroup classes, C++ Metal op lowering (gemm/copy/fill), transform pass exemptions, Metal codegen with MPP lowering and MLX swizzle, tests, benchmarks, and documentation.

Changes

Metal 4 Cooperative Tensor GEMM

Layer / File(s)	Summary
Cooperative tensor builtin ops and Gemm offset contract `src/op/builtin.h`, `src/op/builtin.cc`, `src/op/gemm.h`, `src/op/gemm.cc`	Declares and registers `cooperative_tensor_fill/load/store/multiply_accumulate` TIR intrinsics; changes `GemmNode::offsetA_`/`offsetB_` from `int` to `PrimExpr`.
Metal4 target detection and normalization `tilelang/metal/target.py`, `src/metal/target_utils.cc`, `src/metal/target_utils.h`, `3rdparty/tvm`	Adds `check_metal4_availability` (SDK + GPU model check), `normalize_metal_target`, `target_metal_supports_metal4`; registers `tl.TargetMetalSupportsMetal4` FFI; bumps TVM submodule.
Python frontend: builtins, annotations, gemm op, layout map, buffer utils `tilelang/language/builtin.py`, `tilelang/language/annotations.py`, `tilelang/language/gemm_op.py`, `tilelang/cuda/intrinsics/layout/mma_layout.py`, `tilelang/metal/utils.py`, `tilelang/metal/__init__.py`, `tilelang/utils/language.py`, `tilelang/transform/decouple_type_cast.py`	Adds Python wrappers for cooperative tensor intrinsics; adds `mlx` swizzle order with `ValueError`; removes GEMM offset validation; adds `metal_ct_store_index_map`; adds `is_metal_cooperative_tensor`/`is_metal_simdgroup` helpers; removes old `is_metal_simdgroup` from `utils/language.py`.
MPSIntrinEmitter cooperative tensor mode `tilelang/metal/intrinsics/metal_macro_generator.py`	Extends `MPSIntrinEmitter` with `use_cooperative_tensor` flag, operand constants, 16x32x16 micro-tile sizing, stride overrides, and routes `ldmatrix_a/b`, `mma`, and `simdgroup_copy` through cooperative tensor intrinsics when enabled.
GemmMetal / GemmMetalSimdGroup Python lowering `tilelang/metal/op/gemm/gemm_metal.py`, `tilelang/metal/op/gemm/__init__.py`, `tilelang/metal/transform/metal_fragment_to_simdgroup.py`	Splits Metal GEMM into `GemmMetalSimdGroup` (legacy) and `GemmMetal` (cooperative tensor GG/SS); adds padded layout, warp-partition selection, C writeback; updates `MetalFragmentToSimdgroup` with `num_warps` inference; registers both instruction kinds.
Metal op implementations: gemm, copy, fill, utils `src/metal/op/gemm.cc`, `src/metal/op/copy.cc`, `src/metal/op/fill.cc`, `src/metal/op/utils.h`	Adds `CanUseCooperativeTensor` and 16x32 warp-partition sizing; adds `LowerCooperativeTensorCopy` with aspect-ratio warp-tiling heuristic; adds cooperative tensor fill lowering; replaces `IsRegisterBuffer` with `IsCooperativeTensorBuffer`.
TVM transform pass exemptions `src/transform/storage_rewrite.cc`, `src/transform/lower_thread_allreduce.cc`, `src/transform/lower_device_kernel_launch.cc`, `src/transform/plan_update_buffer_allocation_location.cc`, `src/transform/layout_inference.cc`	Guards `metal.cooperative_tensor` allocations from storage rewrite, thread allreduce, and device kernel launch; adds LCA sanitizer (`MetalCooperativeTensorLCASanitizer`) and `ShouldPreserveOriginalBlock` for allocation location planning; wraps `InferLayout`/`ParseOperator` with `bad_optional_access` error reporting.
Metal codegen: cooperative tensor lowering and MLX swizzle `src/metal/codegen/codegen_metal.cc`, `src/metal/codegen/codegen_metal.h`, `tilelang/engine/lower.py`, `tilelang/engine/callback.py`	Extends `CodeGenTileLangMetal` with cooperative tensor usage analysis, kernel attribute emission, no_alias/restrict handling, CT alloc/fill/load/store/MMA lowering with storage elision, MLX swizzle `blockIdx` rewriting, simdgroup index expression printing; adds `BuildTileLangMetalWithoutCompile`; renames Metal callback key; fixes `lower.py` backend selector.
Tests and benchmark updates `testing/python/metal/test_metal_gemm_v2.py`, `testing/python/metal/test_metal_gemm_v2_linux.py`, `testing/python/metal/test_metal_simdgroup_store.py`, `benchmark/matmul_metal/benchmark_matmul_metal.py`	Adds Metal4 gating, global-C cooperative tensor GEMM kernel, and new test cases; updates simdgroup store codegen assertion to exclude MPP; extends benchmark with `ct_shared`/`ct_global` configs and MLX comparison.
Metal backend development documentation `docs/compiler_internals/metal_tilelang_development.md`, `docs/index.md`	Adds documentation covering Metal GEMM lowering paths, concept maps, feature status, performance snapshots, and developer commands; adds toctree entry.

Sequence Diagram(s)

sequenceDiagram
    participant Frontend as TileLang Frontend (Python)
    participant Emitter as MPSIntrinEmitter
    participant GemmMetal as GemmMetal / GemmMetalSimdGroup
    participant MetalOp as src/metal/op/gemm.cc
    participant Transform as TVM Transforms
    participant Codegen as CodeGenTileLangMetal
    participant MPP as MetalPerformancePrimitives

    Frontend->>GemmMetal: T.gemm(A, B, C, clear_accum=True)
    GemmMetal->>MetalOp: SelectInst(target) → metal.cooperative_tensor or metal.simdgroup
    MetalOp-->>GemmMetal: instruction + warp partition
    GemmMetal->>Emitter: MPSIntrinEmitter(use_cooperative_tensor=True/False)
    Emitter-->>GemmMetal: ldmatrix_a/b, mma, simdgroup_copy calls
    GemmMetal-->>Transform: PrimFunc with cooperative_tensor scope buffers
    Transform->>Transform: exempt metal.cooperative_tensor from storage_rewrite/allreduce/LCA
    Transform-->>Codegen: lowered PrimFunc
    Codegen->>Codegen: CooperativeTensorUseCollector scans body
    Codegen->>MPP: emit matmul2d_descriptor + matmul2d objects
    Codegen->>MPP: emit cooperative_tensor_load / matmul2d.run() / cooperative_tensor_store
    Codegen-->>Frontend: Metal shader source (MSL)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

tile-ai/tilelang#1869: Directly related — extends the existing Metal simdgroup GEMM instruction selection and metal_macro_generator/GEMM op lowering that this PR now splits into dual simdgroup/cooperative-tensor paths.
tile-ai/tilelang#2323: Both PRs modify tilelang/metal/target.py Metal target detection/registration; this PR adds the metal4 capability and normalize_metal_target on top of the target-detector framework.

Suggested labels

metal

Suggested reviewers

LeiWang1999

Poem

🐰 Hop hop, the rabbit cheers with glee,
Cooperative tensors on Metal — what a spree!
MPP and simdgroups in a grand duet,
MSL shaders the fastest we've seen yet.
With swizzles and tiles all lined up neat,
This bunny declares: the GEMM is complete! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[Metal] M5 Cooperative Tensor T.gemm' clearly and specifically describes the primary change: adding Metal cooperative tensor support for T.gemm operations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

github-actions · 2026-05-23T11:39:05Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

# Conflicts:
#   3rdparty/tvm
#   src/metal/codegen/codegen_metal.cc
#   src/metal/op/copy.cc
#   src/metal/op/fill.cc
#   src/metal/op/gemm.cc
#   tilelang/cuda/intrinsics/layout/mma_layout.py
#   tilelang/metal/intrinsics/metal_macro_generator.py
#   tilelang/metal/op/gemm/__init__.py
#   tilelang/metal/op/gemm/gemm_metal.py
#   tilelang/metal/transform/__init__.py
#   tilelang/metal/transform/metal_fragment_to_simdgroup.py

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tilelang/metal/intrinsics/metal_macro_generator.py (1)
30-60: ⚠️ Potential issue | 🟡 Minor

Explicitly pass use_cooperative_tensor=True in GemmMetal instantiations for code clarity.

While GemmMetal is intentionally designed for cooperative tensor mode (evidenced by GEMM_INST_METAL_COOPERATIVE_TENSOR policy selection in _make_mps_emitter), the instantiations at lines 179 and 236 rely on the default parameter value instead of explicitly passing it. This makes the intent less obvious and could be confusing for maintainers. GemmMetalSimdGroup correctly sets use_cooperative_tensor=False explicitly; GemmMetal should do the same with use_cooperative_tensor=True at both instantiation sites.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/metal/intrinsics/metal_macro_generator.py` around lines 30 - 60,
Locate the two instantiations of the GemmMetal class (in the _make_mps_emitter
function) that currently do not explicitly pass the use_cooperative_tensor
parameter. Add use_cooperative_tensor=True as an explicit argument to both
GemmMetal instantiation calls to match the clarity and consistency pattern
already established by GemmMetalSimdGroup, which explicitly passes
use_cooperative_tensor=False. This makes the cooperative tensor design intent
clear to maintainers reading the code.

🧹 Nitpick comments (4)

tilelang/metal/intrinsics/metal_macro_generator.py (1)
95-145: 💤 Low value

Consider using tuple unpacking for cleaner indexing.

The cooperative tensor load intrinsic call correctly matches the upstream contract in builtin.py:1264-1293. The logic for transposed vs non-transposed row/col indexing is correct.

Minor style suggestion from static analysis: at line 119, consider buffer[(*extra, row_idx, col_idx)] instead of buffer[extra + (row_idx, col_idx)].
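
To see that the two indexing spellings are interchangeable, here is a minimal sketch using a toy buffer that simply returns the key it is indexed with. `RecordingBuffer` is a hypothetical helper for illustration; it is not a TileLang or TVM type.

```python
class RecordingBuffer:
    """Toy buffer whose indexing returns the key it received."""

    def __getitem__(self, key):
        return key


buffer = RecordingBuffer()
extra = (1,)  # leading indices; must be a tuple for the `+` form
row_idx, col_idx = 2, 3

concatenated = buffer[extra + (row_idx, col_idx)]
unpacked = buffer[(*extra, row_idx, col_idx)]
assert concatenated == unpacked == (1, 2, 3)

# Unpacking also accepts a list, where tuple concatenation would raise TypeError.
assert buffer[(*[1], row_idx, col_idx)] == (1, 2, 3)
```

Both forms build the same index tuple, so the change is purely stylistic when `extra` is already a tuple.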
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/metal/intrinsics/metal_macro_generator.py` around lines 95 - 145, In
the _warp_ldmatrix_a macro function, replace the tuple concatenation syntax for
buffer indexing with tuple unpacking for improved readability. Change the buffer
access from buffer[extra + (row_idx, col_idx)] to use the unpacking operator
syntax buffer[(*extra, row_idx, col_idx)] where the buffer is being accessed
with the extra, row_idx, and col_idx values.
Source: Linters/SAST tools
src/op/builtin.h (1)
368-372: ⚡ Quick win

Add doxygen documentation for the new cooperative tensor intrinsics.

The four new cooperative tensor Op declarations lack documentation comments, unlike most other intrinsics in this file (see lines 287–366 for TMA intrinsics). Adding brief doxygen comments describing the signature and purpose of each intrinsic would improve maintainability.

For example, based on usage in src/metal/op/fill.cc and src/metal/op/copy.cc, cooperative_tensor_fill appears to take (data, tile_idx, fill_value, tile_m, tile_n), while cooperative_tensor_store takes 11 parameters including destination pointer, stride, and tile dimensions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/op/builtin.h` around lines 368 - 372, Add Doxygen documentation comments
above each of the four cooperative tensor Op declarations
(cooperative_tensor_fill, cooperative_tensor_load, cooperative_tensor_store, and
cooperative_tensor_multiply_accumulate) following the same documentation style
used for the TMA intrinsics in the file. Each comment should briefly describe
the function's purpose and list its parameters and their types (for example,
cooperative_tensor_fill takes data, tile_idx, fill_value, tile_m, tile_n, while
cooperative_tensor_store takes destination pointer, stride, and tile dimensions
along with others). Ensure the documentation format matches the existing doxygen
comments in the file for consistency.
3rdparty/tvm (1)
1-1: Metal 4 shader compilation support in TVM submodule is legitimate.

The TVM commit 11c1968acf0e95f2ac1d76b0dd9ffd44c8072b30 is valid and from the active TileLang fork. The update modifies only src/runtime/metal/metal_module.mm (23 insertions, 4 deletions) to enable Metal 4 shader compilation, exactly as the PR objectives describe.

Consider documenting this submodule update in your CHANGELOG or PR description to clarify the Metal 4 feature enablement for future maintainers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/tvm` at line 1, The TVM submodule update that enables Metal 4 shader
compilation support is not documented in the project's CHANGELOG or PR
description, which could create confusion for future maintainers about the
purpose of this change. Add an entry to your CHANGELOG documenting the TVM
submodule update to commit 11c1968acf0e95f2ac1d76b0dd9ffd44c8072b30, clearly
explaining that this change enables Metal 4 shader compilation support by
modifying src/runtime/metal/metal_module.mm. Additionally, update your PR
description to reference this feature enablement and link to the corresponding
CHANGELOG entry for clarity.
src/metal/op/copy.cc (1)
181-189: ⚡ Quick win

Dead code in tile size computation.

The conditional block (lines 183-186) sets kTileN and kTileM to the exact same values they were just assigned on lines 181-182, making it a no-op. The subsequent check on line 187 (if (kTileN > warp_N)) can never be true since kTileN was just set to warp_N on line 181.
♻️ Proposed cleanup
 int kTileN = warp_N;
 int kTileM = kTileSize;
-if (warp_tiles > 0 && warp_M > kTileSize) {
-  kTileN = warp_N;
-  kTileM = kTileSize;
-}
-if (kTileN > warp_N) {
-  kTileN = warp_N;
-}
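
The dead-code claim can be checked mechanically. The sketch below transliterates the quoted C++ into Python and compares it with the cleaned-up version over a grid of inputs. The function names are hypothetical, and `K_TILE_SIZE = 16` is an assumption; the equivalence holds for any constant.

```python
from itertools import product

K_TILE_SIZE = 16  # assumed value of kTileSize


def tile_shape_original(warp_m: int, warp_n: int, warp_tiles: int):
    """Transliteration of the snippet before the proposed cleanup."""
    k_tile_n = warp_n
    k_tile_m = K_TILE_SIZE
    if warp_tiles > 0 and warp_m > K_TILE_SIZE:
        k_tile_n = warp_n          # same value as already assigned
        k_tile_m = K_TILE_SIZE     # same value as already assigned
    if k_tile_n > warp_n:          # never true: k_tile_n == warp_n
        k_tile_n = warp_n
    return k_tile_m, k_tile_n


def tile_shape_cleaned(warp_m: int, warp_n: int, warp_tiles: int):
    """Only the two unconditional assignments remain."""
    return K_TILE_SIZE, warp_n


for m, n, tiles in product(range(0, 129, 8), range(0, 129, 8), range(0, 5)):
    assert tile_shape_original(m, n, tiles) == tile_shape_cleaned(m, n, tiles)
```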
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/metal/op/copy.cc` around lines 181 - 189, The conditional block checking
if warp_tiles > 0 and warp_M > kTileSize is assigning the same values to kTileN
and kTileM that were just set unconditionally on the previous lines, making it
redundant dead code. Additionally, the subsequent if condition checking if
kTileN > warp_N can never be true since kTileN was just assigned to warp_N.
Remove the redundant conditional block (the one checking warp_tiles > 0 &&
warp_M > kTileSize) and the unreachable if condition that follows it, keeping
only the initial assignments of kTileN and kTileM unless there is additional
logic that should be applied based on the warp_tiles and warp_M conditions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/matmul_metal/benchmark_matmul_metal.py`:
- Line 247: The bare `except Exception as e:` catch at line 247 triggers Ruff's
BLE001 rule which flags blind exception catching. Since this is intentional to
keep the benchmark sweep running after bad configurations, either narrow the
exception type to catch only specific exceptions that could be raised by a bad
config, or add a local waiver comment like `# noqa: BLE001` followed by a
comment explaining the intentional broad catch is needed to continue the
benchmark sweep despite configuration errors.

In `@src/metal/codegen/codegen_metal.cc`:
- Around line 692-709: The persistent C tensor allocation generates fixed symbol
names (__pct_desc, __pct_op, __pct_cN) unconditionally, causing duplicate
definitions when multiple C buffers are marked for inlining. Additionally, the
direct GEMM path at lines 1554-1585 uses local descriptors with arbitrary
dimensions but still references the persistent __pct_c tensors that were created
with 16×32×16 shape, creating a mismatch when actual dimensions differ. Fix this
by: (1) generating unique symbol names per buffer allocation using a Var-keyed
prefix instead of hardcoded __pct names in all allocation sites (lines 692-709,
1326-1329, 1361-1365, 1472-1481), and (2) in the direct GEMM path, validate that
the descriptor dimensions match the persistent tensor shapes (16, 32, 16); if
dimensions don't match, skip the persistent tensor optimization and use
non-elided storage instead.

In `@src/metal/op/copy.cc`:
- Around line 123-127: The divisibility check in copy.cc is inconsistent with
fill.cc and the actual cooperative tensor GEMM micro tile dimension of 16×32.
Change the kTileSize constant and kTileElems calculation in the copy operation
to use the correct tile dimensions (16×32 = 512 elements instead of 16×16 = 256
elements) to match the tile size checks in fill.cc and ensure buffers that pass
the copy divisibility check will also pass the fill lowering requirements.

In `@testing/python/metal/test_metal_gemm_v2_linux.py`:
- Line 201: The assertion in
assert_metal_gemm_v2_global_cooperative_tensor_codegen currently hard-codes the
value 128 in the assertion check for max_total_threads_per_threadgroup, but the
function accepts a threads parameter that may differ from the default. Replace
the hard-coded 128 value with the threads parameter so that the assertion
correctly validates the requested thread count instead of always expecting 128,
allowing non-default callers to pass the assertion correctly.

In `@tilelang/metal/op/gemm/gemm_metal.py`:
- Around line 205-276: The c_bytes_per_thread calculation in the lower method
uses a hardcoded tile size of 64 bytes, but this doesn't match the actual
cooperative tensor micro-tile size being used. Move the c_bytes_per_thread
calculation to after the MPSIntrinEmitter is created (after line 239 where
mps_emitter is instantiated) and replace the hardcoded 64 value with the actual
micro-tile dimensions from the emitter: use micro_size_x * micro_size_y (which
are extracted from mps_emitter on lines 249-250) multiplied by the appropriate
element size in bytes to calculate the correct bytes per thread, which will
ensure the inner_k_steps heuristic is based on the actual tile size being used.
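
As a worked example of the `src/metal/op/copy.cc` item above, the sketch below shows how a 256-element divisibility check accepts sizes that a 512-element check rejects. The constant and function names are hypothetical; only the 16×16 and 16×32 tile sizes come from the comment.

```python
COPY_TILE_ELEMS = 16 * 16  # 256: the check the comment attributes to copy.cc
GEMM_TILE_ELEMS = 16 * 32  # 512: the cooperative tensor micro-tile


def divisible(num_elems: int, tile_elems: int) -> bool:
    """Return True if the buffer splits evenly into tiles of this size."""
    return num_elems % tile_elems == 0


for n in (256, 512, 768, 1024):
    print(n, divisible(n, COPY_TILE_ELEMS), divisible(n, GEMM_TILE_ELEMS))
# 256 True False
# 512 True True
# 768 True False
# 1024 True True
```

Sizes such as 256 and 768 are the mismatch case: they pass the 16×16 check but would not satisfy a 16×32 requirement.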


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c7a820b-faad-4a2a-8bb6-5fa7e9ed40ef

📥 Commits

Reviewing files that changed from the base of the PR and between 65dbc98 and c0c41f6.

📒 Files selected for processing (36)

3rdparty/tvm
benchmark/matmul_metal/benchmark_matmul_metal.py
docs/compiler_internals/metal_tilelang_development.md
docs/index.md
src/metal/codegen/codegen_metal.cc
src/metal/codegen/codegen_metal.h
src/metal/op/copy.cc
src/metal/op/fill.cc
src/metal/op/gemm.cc
src/metal/op/utils.h
src/metal/target_utils.cc
src/metal/target_utils.h
src/op/builtin.cc
src/op/builtin.h
src/op/gemm.cc
src/op/gemm.h
src/transform/layout_inference.cc
src/transform/lower_device_kernel_launch.cc
src/transform/lower_thread_allreduce.cc
src/transform/plan_update_buffer_allocation_location.cc
src/transform/storage_rewrite.cc
testing/python/metal/test_metal_gemm_v2.py
testing/python/metal/test_metal_gemm_v2_linux.py
testing/python/metal/test_metal_simdgroup_store.py
tilelang/cuda/intrinsics/layout/mma_layout.py
tilelang/engine/lower.py
tilelang/language/annotations.py
tilelang/language/builtin.py
tilelang/language/gemm_op.py
tilelang/metal/intrinsics/metal_macro_generator.py
tilelang/metal/op/gemm/__init__.py
tilelang/metal/op/gemm/gemm_metal.py
tilelang/metal/target.py
tilelang/metal/transform/__init__.py
tilelang/metal/transform/metal_fragment_to_simdgroup.py
tilelang/utils/language.py

💤 Files with no reviewable changes (1)

tilelang/language/gemm_op.py

coderabbitai · 2026-06-20T16:58:14Z

+                f"{mode:>10s} | {block_text:>16s} | {threads:>4d} | {swizzle_text:>8s} | "
+                f"{tl:>10.1f} TFLOPS | {torch_ratio:>7.0f}% | {mlx_text}"
+            )
        except Exception as e:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Satisfy Ruff for the intentional sweep catch.

This catch keeps the benchmark sweep running after a bad config, but Ruff BLE001 flags blind Exception; either narrow it or add a local waiver with intent.

Proposed fix

-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - keep benchmark sweeps running after per-config failures
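
For context, the pattern the waiver protects can be sketched as follows. This is not the benchmark script itself: `run_config`, `sweep`, and the failure condition are hypothetical, and the returned number is a placeholder rather than a measurement.

```python
def run_config(block_m: int) -> float:
    """Hypothetical per-config benchmark; rejects an unsupported tile size."""
    if block_m % 16 != 0:
        raise ValueError(f"unsupported block_m={block_m}")
    return block_m / 16.0  # placeholder score, not a real measurement


def sweep(configs):
    """Run every config, recording failures instead of aborting the sweep."""
    results, failures = {}, {}
    for cfg in configs:
        try:
            results[cfg] = run_config(cfg)
        except Exception as e:  # noqa: BLE001 - keep the sweep running after per-config failures
            failures[cfg] = repr(e)
    return results, failures


results, failures = sweep([16, 24, 32])
print(results)         # {16: 1.0, 32: 2.0}
print(list(failures))  # [24]
```

The broad catch is deliberate here: one bad configuration is recorded and the remaining configurations still run.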

🧰 Tools

🪛 Ruff (0.15.17)

[warning] 247-247: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmark/matmul_metal/benchmark_matmul_metal.py` at line 247, The bare `except Exception as e:` catch at line 247 triggers Ruff's BLE001 rule which flags blind exception catching. Since this is intentional to keep the benchmark sweep running after bad configurations, either narrow the exception type to catch only specific exceptions that could be raised by a bad config, or add a local waiver comment like `# noqa: BLE001` followed by a comment explaining the intentional broad catch is needed to continue the benchmark sweep despite configuration errors.

Source: Linters/SAST tools
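
As a rough illustration of the pattern this comment is about, the sketch below shows a sweep loop that records per-config failures and carries on. `run_sweep` and `fake_run` are hypothetical names and are not taken from benchmark_matmul_metal.py; only the `# noqa: BLE001` waiver mirrors the proposed fix.

```python
from typing import Callable, Iterable


def run_sweep(configs: Iterable[dict], run_one: Callable[[dict], float]):
    """Run every config; record failures instead of aborting the sweep."""
    results, failures = [], []
    for cfg in configs:
        try:
            results.append((cfg, run_one(cfg)))
        except Exception as e:  # noqa: BLE001 - keep the sweep running after per-config failures
            # Keep the exception type in the record so bad configs can be triaged later.
            failures.append((cfg, f"{type(e).__name__}: {e}"))
    return results, failures


def fake_run(cfg: dict) -> float:
    # Hypothetical stand-in for compiling and timing one kernel config.
    if cfg["threads"] > 256:
        raise ValueError("too many threads")
    return cfg["threads"] / 100.0


results, failures = run_sweep(
    [{"threads": 128}, {"threads": 512}, {"threads": 256}], fake_run
)
```

Recording the exception type alongside the config keeps the broad catch from hiding which configurations failed and why, which is the usual concern behind BLE001.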

coderabbitai · 2026-06-20T16:58:14Z

+      this->PrintIndent();
+      stream
+          << "constexpr auto __pct_desc = mpp::tensor_ops::matmul2d_descriptor("
+          << "16, 32, 16, false, false, true, "
+          << "mpp::tensor_ops::matmul2d_descriptor::mode::multiply_accumulate);"
+             "\n";
+      this->PrintIndent();
+      stream << "mpp::tensor_ops::matmul2d<__pct_desc, "
+                "metal::execution_simdgroup> __pct_op;\n";
+      for (int t = 0; t < num_c_tiles; t++) {
+        this->PrintIndent();
+        stream << "auto __pct_c" << t
+               << " = __pct_op.get_destination_cooperative_tensor<"
+               << "decltype(__pct_op.get_left_input_cooperative_tensor<half, "
+                  "half, float>()), "
+               << "decltype(__pct_op.get_right_input_cooperative_tensor<half, "
+                  "half, float>()), float>();\n";
+      }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and verify the mentioned lines exist
wc -l src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 100

🏁 Script executed:

# Check the specific lines mentioned in the review
sed -n '692,709p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 908

🏁 Script executed:

# Check the other mentioned line ranges
sed -n '1326,1329p' src/metal/codegen/codegen_metal.cc
sed -n '1361,1365p' src/metal/codegen/codegen_metal.cc
sed -n '1472,1481p' src/metal/codegen/codegen_metal.cc
sed -n '1554,1585p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 2865

🏁 Script executed:

# Search for all occurrences of __pct_ symbol generation patterns
rg '__pct_' src/metal/codegen/codegen_metal.cc -A 2 -B 2

Repository: tile-ai/tilelang

Length of output: 3115

🏁 Script executed:

# Look for the descriptor generation pattern
rg 'matmul2d_descriptor' src/metal/codegen/codegen_metal.cc -A 3 -B 3

Repository: tile-ai/tilelang

Length of output: 1544

🏁 Script executed:

# Search for symbol name generation with context to understand variable tracking
rg '__pct_desc|__pct_op|__pct_c' src/metal/codegen/codegen_metal.cc -n | head -50

Repository: tile-ai/tilelang

Length of output: 1289

🏁 Script executed:

# Look at the context around persistent C allocation to understand when it happens
sed -n '680,720p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 1814

🏁 Script executed:

# Search for how persistent C buffers are allocated and in what context
rg 'persistent.*cooperation|cooperation.*persistent' src/metal/codegen/codegen_metal.cc -B 5 -A 5

Repository: tile-ai/tilelang

Length of output: 42

🏁 Script executed:

# Look for the scope/function where these allocations happen to understand multi-buffer risk
sed -n '650,730p' src/metal/codegen/codegen_metal.cc | head -100

Repository: tile-ai/tilelang

Length of output: 3482

🏁 Script executed:

# Check if there's any validation or gating that prevents non-16x32x16 from reaching line 1585
sed -n '1545,1590p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 2510

Two cooperative C tensors in the same kernel scope will produce duplicate Metal symbol definitions; non-16x32x16 GEMM shapes reuse mismatched persistent tensors.

The persistent C allocation (lines 694–709) emits fixed __pct_desc, __pct_op, and __pct_cN names unconditionally whenever can_inline_c is true. If a kernel allocates multiple C buffers marked for inlining, each generates the same symbol names, causing linker/compiler errors from duplicate definitions.

Additionally, the direct GEMM path (lines 1554–1585) creates local __desc and __op with correct dimensions when can_reuse_pct_op is false (M, N, K ≠ 16, 32, 16), but still references the persistent __pct_c tensors created with the 16×32×16 descriptor at allocation time. This creates a shape/dtype mismatch between the temporary cooperative tensors and the destination buffer.

Assign a Var-keyed symbol prefix per buffer, and validate that direct paths use descriptors compatible with the persistent tensor shapes or fall back to non-elided storage.

Also applies to: 1326–1329, 1361–1365, 1472–1481, 1554–1585

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/metal/codegen/codegen_metal.cc` around lines 692 - 709, The persistent C tensor allocation generates fixed symbol names (__pct_desc, __pct_op, __pct_cN) unconditionally, causing duplicate definitions when multiple C buffers are marked for inlining. Additionally, the direct GEMM path at lines 1554-1585 uses local descriptors with arbitrary dimensions but still references the persistent __pct_c tensors that were created with 16×32×16 shape, creating a mismatch when actual dimensions differ. Fix this by: (1) generating unique symbol names per buffer allocation using a Var-keyed prefix instead of hardcoded __pct names in all allocation sites (lines 692-709, 1326-1329, 1361-1365, 1472-1481), and (2) in the direct GEMM path, validate that the descriptor dimensions match the persistent tensor shapes (16, 32, 16); if dimensions don't match, skip the persistent tensor optimization and use non-elided storage instead.
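
The two fixes requested above (a per-buffer symbol prefix and a shape gate on the persistent tensors) can be sketched in a few lines of Python. This is only a model of the idea, not the C++ codegen: `SymbolNamer`, `emit_persistent_c`, and `can_use_persistent_c` are hypothetical names, and buffers are keyed by a plain string here where the real code would key on the TVM `Var`.

```python
# Shape baked into the persistent descriptor in the quoted hunk (16, 32, 16).
PERSISTENT_SHAPE = (16, 32, 16)


class SymbolNamer:
    """Hands out one stable, unique prefix per buffer variable."""

    def __init__(self):
        self._prefix_by_var = {}

    def prefix(self, var_name: str) -> str:
        # The running index guarantees uniqueness even if two buffers share a name hint.
        if var_name not in self._prefix_by_var:
            index = len(self._prefix_by_var)
            self._prefix_by_var[var_name] = f"__pct{index}_{var_name}"
        return self._prefix_by_var[var_name]


def emit_persistent_c(namer: SymbolNamer, var_name: str, num_c_tiles: int) -> list:
    """Return the symbol names one persistent C allocation would define."""
    p = namer.prefix(var_name)
    return [f"{p}_desc", f"{p}_op"] + [f"{p}_c{t}" for t in range(num_c_tiles)]


def can_use_persistent_c(m: int, n: int, k: int) -> bool:
    """Only reuse persistent tensors when the GEMM shape matches their descriptor."""
    return (m, n, k) == PERSISTENT_SHAPE


namer = SymbolNamer()
first = emit_persistent_c(namer, "C_a", 2)
second = emit_persistent_c(namer, "C_b", 2)
```

With this scheme `first` and `second` share no names, and a caller that sees `can_use_persistent_c(...)` return false would fall back to non-elided storage, as the comment suggests.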

coderabbitai · 2026-06-20T16:58:14Z

+  constexpr int kTileSize = 16;
+  constexpr int kTileElems = kTileSize * kTileSize;
+  TVM_FFI_ICHECK(total_elements % kTileElems == 0)
+      << "cooperative_tensor buffer size must be multiple of " << kTileElems
+      << ", got " << total_elements;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check cooperative tensor buffer allocation sizes

# Search for cooperative tensor allocations
rg -nP --type=py 'alloc.*scope="metal\.cooperative_tensor"' -A 2 -B 2

# Search for cooperative tensor tile size constants
rg -n 'kTileM.*=.*16|kTileN.*=.*32' src/metal/op/

Repository: tile-ai/tilelang

Length of output: 1623

🏁 Script executed:

# Read fill.cc lines 30-45 to see the full divisibility requirement
sed -n '30,45p' src/metal/op/fill.cc

# Read gemm.cc lines 30-40 to check documentation
sed -n '30,40p' src/metal/op/gemm.cc

# Read copy.cc around the flagged lines for full context
sed -n '115,135p' src/metal/op/copy.cc

Repository: tile-ai/tilelang

Length of output: 1856

Align tile size divisibility checks across copy and fill operations.

The divisibility check in copy.cc requires kTileElems = 256 (16×16), but fill.cc (lines 36-41) requires divisibility by 512 (16×32). A buffer with 256 elements would pass the copy check but fail the fill lowering. The cooperative tensor GEMM micro tile is 16×32 (documented in gemm.cc), so both copy and fill operations should align on that tile size.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/metal/op/copy.cc` around lines 123 - 127, The divisibility check in copy.cc is inconsistent with fill.cc and the actual cooperative tensor GEMM micro tile dimension of 16×32. Change the kTileSize constant and kTileElems calculation in the copy operation to use the correct tile dimensions (16×32 = 512 elements instead of 16×16 = 256 elements) to match the tile size checks in fill.cc and ensure buffers that pass the copy divisibility check will also pass the fill lowering requirements.
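
To make the mismatch concrete, here is a small sketch that applies both divisibility checks to a few buffer sizes. The 256 figure comes from the quoted copy.cc hunk; the 512 figure is taken from the review's description of fill.cc and is not independently verified here. The helper names and the sample sizes are made up for illustration.

```python
COPY_TILE_ELEMS = 16 * 16  # 256, the check quoted from copy.cc
FILL_TILE_ELEMS = 16 * 32  # 512, the check the review attributes to fill.cc


def passes(total_elements: int, tile_elems: int) -> bool:
    """Mirror of the `total_elements % tile_elems == 0` check."""
    return total_elements % tile_elems == 0


def inconsistent_sizes(candidates):
    """Sizes that one lowering accepts and the other rejects."""
    return [
        n
        for n in candidates
        if passes(n, COPY_TILE_ELEMS) != passes(n, FILL_TILE_ELEMS)
    ]


sizes = [256, 512, 768, 1024]
mismatched = inconsistent_sizes(sizes)
```

Under these assumptions, 256 and 768 pass the copy check but fail the fill check, which is the inconsistency the comment describes.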

coderabbitai · 2026-06-20T16:58:14Z

+    assert "const device half* __src" in src_code
+    assert "[[simdgroup_index_in_threadgroup]]" in src_code
+    assert "__metal_get_thread_index_in_simdgroup" in src_code
+    assert "max_total_threads_per_threadgroup(128)" in src_code


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the requested thread count instead of hard-coding 128.

assert_metal_gemm_v2_global_cooperative_tensor_codegen accepts threads, but Line 201 always expects max_total_threads_per_threadgroup(128), so non-default callers would fail for the wrong reason.

Proposed fix

- assert "max_total_threads_per_threadgroup(128)" in src_code
+ assert f"max_total_threads_per_threadgroup({threads})" in src_code

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@testing/python/metal/test_metal_gemm_v2_linux.py` at line 201, The assertion in assert_metal_gemm_v2_global_cooperative_tensor_codegen currently hard-codes the value 128 in the assertion check for max_total_threads_per_threadgroup, but the function accepts a threads parameter that may differ from the default. Replace the hard-coded 128 value with the threads parameter so that the assertion correctly validates the requested thread count instead of always expecting 128, allowing non-default callers to pass the assertion correctly.
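
A minimal sketch of the parametrized check, assuming the generated source is available as a string. `assert_thread_attribute` and the `fake_src_256` string are hypothetical; they stand in for the real helper and the real codegen output.

```python
def assert_thread_attribute(src_code: str, threads: int) -> None:
    """Check that the kernel attribute reflects the requested thread count."""
    expected = f"max_total_threads_per_threadgroup({threads})"
    assert expected in src_code, f"missing {expected!r} in generated source"


# Hypothetical generated source for a kernel launched with 256 threads.
fake_src_256 = "[[max_total_threads_per_threadgroup(256)]] kernel void main_kernel() {}"

# Passes because the expectation follows the `threads` argument.
assert_thread_attribute(fake_src_256, 256)
```

With the hard-coded `128`, the same source would fail the check even though codegen honoured the requested thread count.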

coderabbitai · 2026-06-20T16:58:15Z

+    @staticmethod
+    def _get_padded_stride(buffer):
+        continuous = int(buffer.shape[-1])
+        element_bits = int(tvm.DataType(buffer.dtype).bits)
+        padded = continuous
+        if (element_bits * continuous) % 256 == 0:
+            padded += 128 // element_bits
+        return padded
+
+    def lower(
+        self,
+        layout_map: dict,
+        target: Target,
+        thread_bounds: Range,
+        thread_var: tir.Var,
+        mbar_phase_expr: tir.PrimExpr | None = None,
+    ):
+        thread_nums = thread_bounds.extent
+        _, m_warp, n_warp = self._make_mps_emitter(target, int(thread_nums))
+        warp_row_tiles = int(self.M // m_warp)
+        warp_col_tiles = int(self.N // n_warp)
+
+        from tilelang.metal.intrinsics.metal_macro_generator import MPSIntrinEmitter

-                @T.prim_func
-                def _gemm_ss_shared() -> None:
-                    A_local = T.alloc_local((warp_rows * 64), a_dtype, scope="metal.simdgroup")
-                    B_local = T.alloc_local((warp_cols * 64), b_dtype, scope="metal.simdgroup")
-                    C_simd = T.alloc_local((num_simd_c * 64), accum_dtype, scope="metal.simdgroup")
-                    if clear_accum:
-                        for _i in T.serial(num_simd_c):
-                            T.make_filled_simdgroup_matrix(C_simd.data, _i, T.cast(0, accum_dtype))
-                    else:
-                        mps_emitter.simd_load(C_simd, C_buf)
-                    for ki in T.serial(0, (block_K // micro_size_k)):
-                        mps_emitter.ldmatrix_a(A_local, A_region, ki)
-                        mps_emitter.ldmatrix_b(B_local, B_region, ki)
-                        mps_emitter.mma(A_local, B_local, C_simd)
-
-                    mps_emitter.simd_store(C_simd, C_buf)
-
-                return _Simplify(_gemm_ss_shared, inline_let=True)
-        else:
+        a_stride = self._get_padded_stride(self.A) if self.is_gemm_ss() else None
+        b_stride = self._get_padded_stride(self.B) if self.is_gemm_ss() else None
+
+        c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64
+        inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1
+        output_dtype = self.accum_dtype
+        accum_dtype = T.float32 if self.is_gemm_gg() and str(output_dtype) in ("float16", "bfloat16") else output_dtype
+        mps_emitter = MPSIntrinEmitter(
+            a_dtype=self.a_dtype,
+            b_dtype=self.b_dtype,
+            accum_dtype=accum_dtype,
+            a_transposed=self.trans_A,
+            b_transposed=self.trans_B,
+            block_row_warps=m_warp,
+            block_col_warps=n_warp,
+            warp_row_tiles=warp_row_tiles,
+            warp_col_tiles=warp_col_tiles,
+            chunk=self.chunk,
+            thread_var=thread_var,
+            a_stride_override=a_stride,
+            b_stride_override=b_stride,
+            inner_k_steps=inner_k_steps,
+        )
+
+        a_dtype = self.a_dtype
+        b_dtype = self.b_dtype
+        warp_rows = mps_emitter.warp_rows
+        warp_cols = mps_emitter.warp_cols
+        num_simd_c = warp_rows * warp_cols
+        block_K = mps_emitter.chunk
+        micro_size_x = mps_emitter.micro_size_x
+        micro_size_y = mps_emitter.micro_size_y
+        micro_size_k = mps_emitter.micro_size_k
+        inner_k_steps = mps_emitter.inner_k_steps
+        a_tile_elems = micro_size_x * micro_size_k
+        b_tile_elems = micro_size_k * micro_size_y
+        c_tile_elems = micro_size_x * micro_size_y
+
+        A_region = self.ARegion
+        B_region = self.BRegion
+        C_region = self.CRegion
+        C_buf = C_region.buffer
+        clear_accum = self.clear_accum
+        c_in_cooperative_tensor = is_metal_cooperative_tensor(C_buf) or is_fragment(C_buf)
+        assert block_K >= micro_size_k, f"block_K ({block_K}) must be >= micro_size_k ({micro_size_k})"
+
+        if not (self.is_gemm_ss() or self.is_gemm_gg()):
            raise ValueError(f"Unsupported gemm combination, A: {self.A.scope()}, B: {self.B.scope()}")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

c_bytes_per_thread calculation uses simdgroup tile size (64) instead of cooperative tensor tile size.

At line 232, c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64 uses the fixed value 64, which corresponds to the simdgroup 8x8 micro-tile. However, GemmMetal uses cooperative tensor mode with 16x32=512 element micro-tiles.

This affects the inner_k_steps heuristic on line 233. If the intent is to measure register pressure per thread, the calculation should account for the actual tile size being used.

🔧 Suggested fix

- c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64
+ # Cooperative tensor micro-tile is 16x32 = 512 elements
+ ct_micro_elems = 16 * 32
+ c_bytes_per_thread = warp_row_tiles * warp_col_tiles * ct_micro_elems
  inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1

Or alternatively, compute this after creating the emitter to use consistent values:

+ micro_size_x_ct = 16
+ micro_size_y_ct = 32
+ c_tile_elems_approx = micro_size_x_ct * micro_size_y_ct
+ c_bytes_per_thread = warp_row_tiles * warp_col_tiles * c_tile_elems_approx
  inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/metal/op/gemm/gemm_metal.py` around lines 205 - 276, The c_bytes_per_thread calculation in the lower method uses a hardcoded tile size of 64 bytes, but this doesn't match the actual cooperative tensor micro-tile size being used. Move the c_bytes_per_thread calculation to after the MPSIntrinEmitter is created (after line 239 where mps_emitter is instantiated) and replace the hardcoded 64 value with the actual micro-tile dimensions from the emitter: use micro_size_x * micro_size_y (which are extracted from mps_emitter on lines 249-250) multiplied by the appropriate element size in bytes to calculate the correct bytes per thread, which will ensure the inner_k_steps heuristic is based on the actual tile size being used.
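
The effect of the tile factor on the heuristic is easy to check by hand. The sketch below evaluates the quoted formula with both factors; the `inner_k_steps` function wrapper and the sample `(warp_row_tiles, warp_col_tiles)` values are illustrative and not taken from the PR.

```python
def inner_k_steps(warp_row_tiles: int, warp_col_tiles: int, tile_factor: int) -> int:
    """Quoted heuristic: two inner K steps only while the C footprint stays small."""
    c_bytes_per_thread = warp_row_tiles * warp_col_tiles * tile_factor
    return 2 if c_bytes_per_thread <= 128 else 1


# Illustrative inputs only; real values depend on M, N and the warp partition.
samples = [(1, 1), (1, 2), (2, 2)]

with_64 = [inner_k_steps(r, c, 64) for r, c in samples]
with_512 = [inner_k_steps(r, c, 16 * 32) for r, c in samples]
```

One consequence worth noting: with a factor of 512 and the threshold left at 128, the product exceeds the threshold for any positive tile counts, so the heuristic always returns 1. Whether that is intended depends on the unit the threshold is meant to be in, which the thread does not state, so a fix along these lines would probably need to revisit the threshold as well.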

oraluben added 8 commits May 23, 2026 18:47

[Metal] Add cooperative tensor intrinsics

881c693

Expose TileLang-owned cooperative tensor builtins so Metal MPP lowering does not depend on extra TVM fork APIs.

[Metal] Select cooperative tensor GEMM lowering

0e39193

Add a shape-aware MPP instruction choice for shared-output Metal GEMM while preserving simdgroup fallback for fragments and unsupported tiles.

[Metal] Emit MPP cooperative tensor shaders

97bffed

Generate Metal 4 MPP matmul2d code for cooperative tensor intrinsics and keep source-only codegen separate from runtime compilation.

[Metal] Lower GEMM through MPP tensor ops

788c4f7

Split Metal GEMM lowering into simdgroup and cooperative tensor emitters so M5 tiles use MPP while fragment accumulators keep the existing path.

[Metal] Guard generic passes for cooperative tensors

90067da

Keep generic allocation and storage rewrites away from opaque Metal cooperative tensor scopes to avoid invalid scope analysis.

[Metal] Test cooperative tensor GEMM coverage

b93a5e1

Add runtime and source-only coverage for non-square MPP GEMM so the new cooperative tensor path is reproducible in CI and on M5.

[Metal] Update TVM Metal 4 runtime guard

19ffe6b

Point the submodule at the macOS SDK guarded Metal 4 runtime update used by cooperative tensor shaders.

[Metal] Document cooperative tensor GEMM

1f959b7

Add a reference page covering the two Metal GEMM paths, selection rules, current limitations, and planned follow-up work.

oraluben added 7 commits May 23, 2026 20:04

lint

67f1273

Improve Metal cooperative tensor GEMM

068c3b8

Optimize Metal cooperative tensor GEMM

31e8876

Update TVM Metal 4 support

6f952ed

Add Metal backend internals documentation

9d88fa6

Harden Metal cooperative tensor lowering

c0c41f6

oraluben marked this pull request as ready for review June 20, 2026 16:43

coderabbitai Bot reviewed Jun 20, 2026

oraluben added 3 commits June 21, 2026 09:08

test older impl with new framework

ffb882c

Clean up Metal transform exports

6bdd892

Use macos-latest for Metal CI

c227b6f

oraluben requested review from LeiWang1999 and SiriusNEO June 21, 2026 04:41

oraluben mentioned this pull request Jun 21, 2026

[Metal] Enable Metal 4 shader compilation apache/tvm#19595

Merged

SiriusNEO reviewed Jun 21, 2026

Comment thread src/transform/storage_rewrite.cc Outdated

Comment thread tilelang/utils/language.py Outdated

oraluben added 2 commits June 22, 2026 22:13

Merge branch 'main' into metal-gemm-perf

048af4d

resolve comments and lint

daa8342
