[Metal] M5 Cooperative Tensor T.gemm#2252

Open
oraluben wants to merge 20 commits into tile-ai:main from oraluben:metal-gemm-perf

Conversation

@oraluben oraluben commented May 23, 2026

Collaborator

Needs tile-ai/tvm#44

Summary

This PR extends the existing Metal backend with a Metal 4 cooperative tensor path for T.gemm.

The Metal backend already supported simdgroup-based GEMM lowering. This PR adds cooperative tensor as a new fast path on supported Apple GPUs, while keeping simdgroup as the compatibility path for older devices and non-Metal-4 targets.

Motivation

Metal 4 cooperative tensor exposes Apple's tensor-core-like matrix compute path. On supported hardware, it provides a substantially faster GEMM implementation than the existing simdgroup path.

This PR adds that path to TileLang so Metal T.gemm can use the newer hardware capability while preserving the existing simdgroup implementation for compatibility.

Design Notes

Although cooperative tensor is conceptually the Metal-side counterpart of CUDA tensor core programming, the programming model is not a direct CUDA clone.

As a practical approximation for CUDA reviewers: Apple GPUs do not separate register and threadgroup storage as sharply as CUDA does; both are backed by a more hardware-managed on-chip memory system. As a result, explicit threadgroup staging is not automatically faster than feeding cooperative tensor operands directly.

This PR therefore keeps CUDA-shaped shared staging as a compatibility path, but optimizes the direct cooperative-tensor path as the Metal fast path. T.gemm remains the frontend abstraction, and Metal-specific instruction choice stays inside the Metal backend.

What Changed

At a high level, this PR adds:

  • cooperative tensor lowering for Metal GEMM;
  • Metal codegen support for cooperative tensor source emission;
  • target capability guarding so cooperative tensor source is only generated when supported/requested;
  • preservation of the existing simdgroup path;
  • pass-level compatibility so Metal-specific storage scopes do not leak into generic TVM assumptions;
  • tests for both the new cooperative tensor path and the existing simdgroup path.

Detailed lowering rules and implementation notes are documented separately in the Metal compiler internals doc.

Impact on TileLang

The main TileLang-level impact is that Metal now has a dedicated high-performance GEMM path that reflects Metal's own matrix programming model.

In particular:

  • T.gemm remains the user-facing abstraction.
  • Existing simdgroup kernels continue to work.
  • Cooperative tensor support is gated by target/runtime capability.
  • CUDA-style shared staging remains supported for compatibility, but is not assumed to be the default Metal performance model.
  • Future Metal schedules, such as shared-staging bypass or MLX-style variants, can be added behind the same backend boundary without changing the frontend API.

Compatibility

This PR is intended to be backward-compatible for existing Metal users.

The existing simdgroup path is still present and tested. The new cooperative tensor path is only used when the target/runtime capability allows it, so building TileLang with a newer SDK should not force all Metal kernels to require Metal 4.

One important compatibility check is that this PR has been validated on GitHub Actions with macOS 26 and M1 hardware. That environment exposes the newer SDK at build time but does not support cooperative tensor in hardware, so passing there verifies that the backend correctly falls back to the simdgroup path on unsupported devices.
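
The fallback behaviour described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `MetalCapability` record and `select_gemm_inst` function are hypothetical names, not the PR's API; only the two instruction-kind strings come from the PR discussion.

```python
from dataclasses import dataclass

# Instruction kinds as named in the PR discussion.
COOPERATIVE_TENSOR = "metal.cooperative_tensor"
SIMDGROUP = "metal.simdgroup"


@dataclass(frozen=True)
class MetalCapability:
    """Hypothetical capability record; the field names are illustrative."""

    sdk_has_metal4: bool              # build time: the SDK exposes Metal 4
    gpu_has_cooperative_tensor: bool  # run time: the GPU supports the path


def select_gemm_inst(cap: MetalCapability) -> str:
    """Pick the fast path only when both the SDK and the GPU allow it."""
    if cap.sdk_has_metal4 and cap.gpu_has_cooperative_tensor:
        return COOPERATIVE_TENSOR
    return SIMDGROUP


# The CI case from the text: newer SDK, but hardware without cooperative tensor.
new_sdk_old_gpu = MetalCapability(sdk_has_metal4=True, gpu_has_cooperative_tensor=False)
new_sdk_new_gpu = MetalCapability(sdk_has_metal4=True, gpu_has_cooperative_tensor=True)

print(select_gemm_inst(new_sdk_old_gpu))
print(select_gemm_inst(new_sdk_new_gpu))
```

The point of the sketch is that a newer SDK alone never selects the cooperative tensor path; the hardware capability has to agree.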

Testing

Test coverage includes:

  • Metal simdgroup fallback / old path;
  • Metal cooperative tensor codegen;
  • runtime cooperative tensor correctness where supported;
  • source-level checks to avoid pulling cooperative tensor dependencies into simdgroup-only kernels.

Validated locally with:

  • pip install .
  • python -m pytest testing/python/metal/ -q -x
  • python -m pre_commit run --all-files

GitHub Actions additionally validated the macOS 26 + M1 fallback case described above.
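
A source-level check of the kind listed above might look roughly like the following sketch. The helper name `uses_mpp`, the marker strings, and the shader fragments are assumptions made for illustration; the actual assertions live in the PR's test files and may differ.

```python
# Substrings treated as evidence of an MPP dependency (assumed markers).
MPP_MARKERS = ("MetalPerformancePrimitives", "mpp::")


def uses_mpp(msl_source: str) -> bool:
    """Return True if the generated shader source references MPP."""
    return any(marker in msl_source for marker in MPP_MARKERS)


# Illustrative shader fragments, not real compiler output.
SIMDGROUP_ONLY_SRC = """
#include <metal_stdlib>
using namespace metal;
// ... simdgroup_float8x8 based GEMM body ...
"""

COOPERATIVE_TENSOR_SRC = """
#include <metal_stdlib>
using namespace metal;
// ... constexpr auto desc = mpp::tensor_ops::matmul2d_descriptor(...) ...
"""

# A simdgroup-only kernel must not pull in cooperative tensor dependencies.
assert not uses_mpp(SIMDGROUP_ONLY_SRC)
assert uses_mpp(COOPERATIVE_TENSOR_SRC)
```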

Summary by CodeRabbit

Release Notes

  • New Features

    • Added cooperative tensor support for Metal GPU kernels, enabling more efficient GEMM operations with new language primitives (cooperative_tensor_fill, cooperative_tensor_load, cooperative_tensor_store, cooperative_tensor_multiply_accumulate).
    • Added Metal4 capability detection to enhance GPU compatibility and feature availability.
    • Extended GEMM benchmarking with cooperative tensor variants and optional MLX reference comparisons.
  • Documentation

    • Added comprehensive Metal TileLang development guide covering backend lowering, execution model, and implementation details.
  • Chores

    • Updated TVM dependency to latest version.

oraluben added 8 commits May 23, 2026 18:47
Expose TileLang-owned cooperative tensor builtins so Metal MPP lowering does not depend on extra TVM fork APIs.
Add a shape-aware MPP instruction choice for shared-output Metal GEMM while preserving simdgroup fallback for fragments and unsupported tiles.
Generate Metal 4 MPP matmul2d code for cooperative tensor intrinsics and keep source-only codegen separate from runtime compilation.
Split Metal GEMM lowering into simdgroup and cooperative tensor emitters so M5 tiles use MPP while fragment accumulators keep the existing path.
Keep generic allocation and storage rewrites away from opaque Metal cooperative tensor scopes to avoid invalid scope analysis.
Add runtime and source-only coverage for non-square MPP GEMM so the new cooperative tensor path is reproducible in CI and on M5.
Point the submodule at the macOS SDK guarded Metal 4 runtime update used by cooperative tensor shaders.
Add a reference page covering the two Metal GEMM paths, selection rules, current limitations, and planned follow-up work.

coderabbitai Bot commented May 23, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4077ed8a-18bd-49c2-9683-5daa94124474

📥 Commits

Reviewing files that changed from the base of the PR and between 6bdd892 and daa8342.

📒 Files selected for processing (13)
  • docs/index.md
  • src/metal/op/copy.cc
  • src/metal/op/fill.cc
  • src/metal/op/utils.h
  • src/op/builtin.cc
  • src/op/gemm.cc
  • src/op/gemm.h
  • src/transform/storage_rewrite.cc
  • tilelang/metal/__init__.py
  • tilelang/metal/op/gemm/gemm_metal.py
  • tilelang/metal/utils.py
  • tilelang/transform/decouple_type_cast.py
  • tilelang/utils/language.py
💤 Files with no reviewable changes (2)
  • tilelang/utils/language.py
  • src/metal/op/utils.h
✅ Files skipped from review due to trivial changes (3)
  • docs/index.md
  • tilelang/metal/utils.py
  • tilelang/metal/__init__.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/op/gemm.h
  • src/op/gemm.cc
  • src/op/builtin.cc
  • src/metal/op/copy.cc
  • src/metal/op/fill.cc
  • tilelang/metal/op/gemm/gemm_metal.py

📝 Walkthrough


This PR adds Metal 4 cooperative tensor GEMM support to TileLang. It introduces four new builtin intrinsics (cooperative_tensor_fill/load/store/multiply_accumulate), Metal4 target detection, Python frontend wrappers, a dual-path MPSIntrinEmitter, split GemmMetal/GemmMetalSimdGroup classes, C++ Metal op lowering (gemm/copy/fill), transform pass exemptions, Metal codegen with MPP lowering and MLX swizzle, tests, benchmarks, and documentation.

Changes

Metal 4 Cooperative Tensor GEMM

| Layer | File(s) | Summary |
|---|---|---|
| Cooperative tensor builtin ops and Gemm offset contract | src/op/builtin.h, src/op/builtin.cc, src/op/gemm.h, src/op/gemm.cc | Declares and registers cooperative_tensor_fill/load/store/multiply_accumulate TIR intrinsics; changes GemmNode::offsetA_/offsetB_ from int to PrimExpr. |
| Metal4 target detection and normalization | tilelang/metal/target.py, src/metal/target_utils.cc, src/metal/target_utils.h, 3rdparty/tvm | Adds check_metal4_availability (SDK + GPU model check), normalize_metal_target, target_metal_supports_metal4; registers tl.TargetMetalSupportsMetal4 FFI; bumps TVM submodule. |
| Python frontend: builtins, annotations, gemm op, layout map, buffer utils | tilelang/language/builtin.py, tilelang/language/annotations.py, tilelang/language/gemm_op.py, tilelang/cuda/intrinsics/layout/mma_layout.py, tilelang/metal/utils.py, tilelang/metal/__init__.py, tilelang/utils/language.py, tilelang/transform/decouple_type_cast.py | Adds Python wrappers for cooperative tensor intrinsics; adds mlx swizzle order with ValueError; removes GEMM offset validation; adds metal_ct_store_index_map; adds is_metal_cooperative_tensor/is_metal_simdgroup helpers; removes old is_metal_simdgroup from utils/language.py. |
| MPSIntrinEmitter cooperative tensor mode | tilelang/metal/intrinsics/metal_macro_generator.py | Extends MPSIntrinEmitter with use_cooperative_tensor flag, operand constants, 16x32x16 micro-tile sizing, stride overrides, and routes ldmatrix_a/b, mma, and simdgroup_copy through cooperative tensor intrinsics when enabled. |
| GemmMetal / GemmMetalSimdGroup Python lowering | tilelang/metal/op/gemm/gemm_metal.py, tilelang/metal/op/gemm/__init__.py, tilelang/metal/transform/metal_fragment_to_simdgroup.py | Splits Metal GEMM into GemmMetalSimdGroup (legacy) and GemmMetal (cooperative tensor GG/SS); adds padded layout, warp-partition selection, C writeback; updates MetalFragmentToSimdgroup with num_warps inference; registers both instruction kinds. |
| Metal op implementations: gemm, copy, fill, utils | src/metal/op/gemm.cc, src/metal/op/copy.cc, src/metal/op/fill.cc, src/metal/op/utils.h | Adds CanUseCooperativeTensor and 16x32 warp-partition sizing; adds LowerCooperativeTensorCopy with aspect-ratio warp-tiling heuristic; adds cooperative tensor fill lowering; replaces IsRegisterBuffer with IsCooperativeTensorBuffer. |
| TVM transform pass exemptions | src/transform/storage_rewrite.cc, src/transform/lower_thread_allreduce.cc, src/transform/lower_device_kernel_launch.cc, src/transform/plan_update_buffer_allocation_location.cc, src/transform/layout_inference.cc | Guards metal.cooperative_tensor allocations from storage rewrite, thread allreduce, and device kernel launch; adds LCA sanitizer (MetalCooperativeTensorLCASanitizer) and ShouldPreserveOriginalBlock for allocation location planning; wraps InferLayout/ParseOperator with bad_optional_access error reporting. |
| Metal codegen: cooperative tensor lowering and MLX swizzle | src/metal/codegen/codegen_metal.cc, src/metal/codegen/codegen_metal.h, tilelang/engine/lower.py, tilelang/engine/callback.py | Extends CodeGenTileLangMetal with cooperative tensor usage analysis, kernel attribute emission, no_alias/restrict handling, CT alloc/fill/load/store/MMA lowering with storage elision, MLX swizzle blockIdx rewriting, simdgroup index expression printing; adds BuildTileLangMetalWithoutCompile; renames Metal callback key; fixes lower.py backend selector. |
| Tests and benchmark updates | testing/python/metal/test_metal_gemm_v2.py, testing/python/metal/test_metal_gemm_v2_linux.py, testing/python/metal/test_metal_simdgroup_store.py, benchmark/matmul_metal/benchmark_matmul_metal.py | Adds Metal4 gating, global-C cooperative tensor GEMM kernel, and new test cases; updates simdgroup store codegen assertion to exclude MPP; extends benchmark with ct_shared/ct_global configs and MLX comparison. |
| Metal backend development documentation | docs/compiler_internals/metal_tilelang_development.md, docs/index.md | Adds documentation covering Metal GEMM lowering paths, concept maps, feature status, performance snapshots, and developer commands; adds toctree entry. |

Sequence Diagram(s)

sequenceDiagram
    participant Frontend as TileLang Frontend (Python)
    participant Emitter as MPSIntrinEmitter
    participant GemmMetal as GemmMetal / GemmMetalSimdGroup
    participant MetalOp as src/metal/op/gemm.cc
    participant Transform as TVM Transforms
    participant Codegen as CodeGenTileLangMetal
    participant MPP as MetalPerformancePrimitives

    Frontend->>GemmMetal: T.gemm(A, B, C, clear_accum=True)
    GemmMetal->>MetalOp: SelectInst(target) → metal.cooperative_tensor or metal.simdgroup
    MetalOp-->>GemmMetal: instruction + warp partition
    GemmMetal->>Emitter: MPSIntrinEmitter(use_cooperative_tensor=True/False)
    Emitter-->>GemmMetal: ldmatrix_a/b, mma, simdgroup_copy calls
    GemmMetal-->>Transform: PrimFunc with cooperative_tensor scope buffers
    Transform->>Transform: exempt metal.cooperative_tensor from storage_rewrite/allreduce/LCA
    Transform-->>Codegen: lowered PrimFunc
    Codegen->>Codegen: CooperativeTensorUseCollector scans body
    Codegen->>MPP: emit matmul2d_descriptor + matmul2d objects
    Codegen->>MPP: emit cooperative_tensor_load / matmul2d.run() / cooperative_tensor_store
    Codegen-->>Frontend: Metal shader source (MSL)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • tile-ai/tilelang#1869: Directly related — extends the existing Metal simdgroup GEMM instruction selection and metal_macro_generator/GEMM op lowering that this PR now splits into dual simdgroup/cooperative-tensor paths.
  • tile-ai/tilelang#2323: Both PRs modify tilelang/metal/target.py Metal target detection/registration; this PR adds the metal4 capability and normalize_metal_target on top of the target-detector framework.

Suggested labels

metal

Suggested reviewers

  • LeiWang1999

Poem

🐰 Hop hop, the rabbit cheers with glee,
Cooperative tensors on Metal — what a spree!
MPP and simdgroups in a grand duet,
MSL shaders the fastest we've seen yet.
With swizzles and tiles all lined up neat,
This bunny declares: the GEMM is complete! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[Metal] M5 Cooperative Tensor T.gemm' clearly and specifically describes the primary change: adding Metal cooperative tensor support for T.gemm operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

oraluben added 7 commits May 23, 2026 20:04
# Conflicts:
#	3rdparty/tvm
#	src/metal/codegen/codegen_metal.cc
#	src/metal/op/copy.cc
#	src/metal/op/fill.cc
#	src/metal/op/gemm.cc
#	tilelang/cuda/intrinsics/layout/mma_layout.py
#	tilelang/metal/intrinsics/metal_macro_generator.py
#	tilelang/metal/op/gemm/__init__.py
#	tilelang/metal/op/gemm/gemm_metal.py
#	tilelang/metal/transform/__init__.py
#	tilelang/metal/transform/metal_fragment_to_simdgroup.py
@oraluben oraluben marked this pull request as ready for review June 20, 2026 16:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/metal/intrinsics/metal_macro_generator.py (1)

30-60: ⚠️ Potential issue | 🟡 Minor

Explicitly pass use_cooperative_tensor=True in GemmMetal instantiations for code clarity.

While GemmMetal is intentionally designed for cooperative tensor mode (evidenced by GEMM_INST_METAL_COOPERATIVE_TENSOR policy selection in _make_mps_emitter), the instantiations at lines 179 and 236 rely on the default parameter value instead of explicitly passing it. This makes the intent less obvious and could be confusing for maintainers. GemmMetalSimdGroup correctly sets use_cooperative_tensor=False explicitly; GemmMetal should do the same with use_cooperative_tensor=True at both instantiation sites.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/metal/intrinsics/metal_macro_generator.py` around lines 30 - 60,
Locate the two instantiations of the GemmMetal class (in the _make_mps_emitter
function) that currently do not explicitly pass the use_cooperative_tensor
parameter. Add use_cooperative_tensor=True as an explicit argument to both
GemmMetal instantiation calls to match the clarity and consistency pattern
already established by GemmMetalSimdGroup, which explicitly passes
use_cooperative_tensor=False. This makes the cooperative tensor design intent
clear to maintainers reading the code.
🧹 Nitpick comments (4)
tilelang/metal/intrinsics/metal_macro_generator.py (1)

95-145: 💤 Low value

Consider using tuple unpacking for cleaner indexing.

The cooperative tensor load intrinsic call correctly matches the upstream contract in builtin.py:1264-1293. The logic for transposed vs non-transposed row/col indexing is correct.

Minor style suggestion from static analysis: at line 119, consider buffer[(*extra, row_idx, col_idx)] instead of buffer[extra + (row_idx, col_idx)].

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/metal/intrinsics/metal_macro_generator.py` around lines 95 - 145, In
the _warp_ldmatrix_a macro function, replace the tuple concatenation syntax for
buffer indexing with tuple unpacking for improved readability. Change the buffer
access from buffer[extra + (row_idx, col_idx)] to use the unpacking operator
syntax buffer[(*extra, row_idx, col_idx)] where the buffer is being accessed
with the extra, row_idx, and col_idx values.

Source: Linters/SAST tools
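
The two indexing spellings in this nitpick build the same index tuple, which a small stand-in makes easy to confirm. `RecordingBuffer` below is a hypothetical helper, not a TileLang type; it simply returns whatever index it receives.

```python
class RecordingBuffer:
    """Stand-in for a buffer: returns the index tuple it is subscripted with."""

    def __getitem__(self, index):
        return index


buffer = RecordingBuffer()
extra = (0, 1)           # leading (batch-like) indices, illustrative values
row_idx, col_idx = 4, 7  # illustrative values

by_concat = buffer[extra + (row_idx, col_idx)]  # tuple concatenation
by_unpack = buffer[(*extra, row_idx, col_idx)]  # tuple unpacking

# Both forms produce the same index.
assert by_concat == by_unpack == (0, 1, 4, 7)

# Unpacking also accepts non-tuple iterables, where concatenation would raise
# a TypeError (list + tuple is not defined).
assert buffer[(*[0, 1], row_idx, col_idx)] == (0, 1, 4, 7)
```

The change is therefore purely stylistic when `extra` is already a tuple.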

src/op/builtin.h (1)

368-372: ⚡ Quick win

Add doxygen documentation for the new cooperative tensor intrinsics.

The four new cooperative tensor Op declarations lack documentation comments, unlike most other intrinsics in this file (see lines 287–366 for TMA intrinsics). Adding brief doxygen comments describing the signature and purpose of each intrinsic would improve maintainability.

For example, based on usage in src/metal/op/fill.cc and src/metal/op/copy.cc, cooperative_tensor_fill appears to take (data, tile_idx, fill_value, tile_m, tile_n), while cooperative_tensor_store takes 11 parameters including destination pointer, stride, and tile dimensions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/op/builtin.h` around lines 368 - 372, Add Doxygen documentation comments
above each of the four cooperative tensor Op declarations
(cooperative_tensor_fill, cooperative_tensor_load, cooperative_tensor_store, and
cooperative_tensor_multiply_accumulate) following the same documentation style
used for the TMA intrinsics in the file. Each comment should briefly describe
the function's purpose and list its parameters and their types (for example,
cooperative_tensor_fill takes data, tile_idx, fill_value, tile_m, tile_n, while
cooperative_tensor_store takes destination pointer, stride, and tile dimensions
along with others). Ensure the documentation format matches the existing doxygen
comments in the file for consistency.
3rdparty/tvm (1)

1-1: Metal 4 shader compilation support in TVM submodule is legitimate.

The TVM commit 11c1968acf0e95f2ac1d76b0dd9ffd44c8072b30 is valid and from the active TileLang fork. The update modifies only src/runtime/metal/metal_module.mm (23 insertions, 4 deletions) to enable Metal 4 shader compilation, exactly as the PR objectives describe.

Consider documenting this submodule update in your CHANGELOG or PR description to clarify the Metal 4 feature enablement for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/tvm` at line 1, The TVM submodule update that enables Metal 4 shader
compilation support is not documented in the project's CHANGELOG or PR
description, which could create confusion for future maintainers about the
purpose of this change. Add an entry to your CHANGELOG documenting the TVM
submodule update to commit 11c1968acf0e95f2ac1d76b0dd9ffd44c8072b30, clearly
explaining that this change enables Metal 4 shader compilation support by
modifying src/runtime/metal/metal_module.mm. Additionally, update your PR
description to reference this feature enablement and link to the corresponding
CHANGELOG entry for clarity.
src/metal/op/copy.cc (1)

181-189: ⚡ Quick win

Dead code in tile size computation.

The conditional block (lines 183-186) sets kTileN and kTileM to the exact same values they were just assigned on lines 181-182, making it a no-op. The subsequent check on line 187 (if (kTileN > warp_N)) can never be true since kTileN was just set to warp_N on line 181.

♻️ Proposed cleanup
 int kTileN = warp_N;
 int kTileM = kTileSize;
-if (warp_tiles > 0 && warp_M > kTileSize) {
-  kTileN = warp_N;
-  kTileM = kTileSize;
-}
-if (kTileN > warp_N) {
-  kTileN = warp_N;
-}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/metal/op/copy.cc` around lines 181 - 189, The conditional block checking
if warp_tiles > 0 and warp_M > kTileSize is assigning the same values to kTileN
and kTileM that were just set unconditionally on the previous lines, making it
redundant dead code. Additionally, the subsequent if condition checking if
kTileN > warp_N can never be true since kTileN was just assigned to warp_N.
Remove the redundant conditional block (the one checking warp_tiles > 0 &&
warp_M > kTileSize) and the unreachable if condition that follows it, keeping
only the initial assignments of kTileN and kTileM unless there is additional
logic that should be applied based on the warp_tiles and warp_M conditions.
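
The claim that the removed branches are a no-op can be checked by brute force. The sketch below transcribes the C++ snippet from the proposed cleanup into Python (variable names adapted to snake_case) and compares it with the simplified form over a small grid of inputs; the input ranges are arbitrary choices for illustration.

```python
from itertools import product


def tile_dims_original(warp_n, warp_m, warp_tiles, k_tile_size):
    """Python transcription of the snippet before the proposed cleanup."""
    k_tile_n = warp_n
    k_tile_m = k_tile_size
    if warp_tiles > 0 and warp_m > k_tile_size:
        k_tile_n = warp_n       # same value as already assigned above
        k_tile_m = k_tile_size  # same value as already assigned above
    if k_tile_n > warp_n:       # never true: k_tile_n == warp_n here
        k_tile_n = warp_n
    return k_tile_n, k_tile_m


def tile_dims_simplified(warp_n, k_tile_size):
    """The two assignments that remain after the cleanup."""
    return warp_n, k_tile_size


values = range(0, 65, 8)
mismatches = [
    (n, m, t, k)
    for n, m, t, k in product(values, repeat=4)
    if tile_dims_original(n, m, t, k) != tile_dims_simplified(n, k)
]
assert mismatches == []
```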
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/matmul_metal/benchmark_matmul_metal.py`:
- Line 247: The bare `except Exception as e:` catch at line 247 triggers Ruff's
BLE001 rule which flags blind exception catching. Since this is intentional to
keep the benchmark sweep running after bad configurations, either narrow the
exception type to catch only specific exceptions that could be raised by a bad
config, or add a local waiver comment like `# noqa: BLE001` followed by a
comment explaining the intentional broad catch is needed to continue the
benchmark sweep despite configuration errors.

In `@src/metal/codegen/codegen_metal.cc`:
- Around line 692-709: The persistent C tensor allocation generates fixed symbol
names (__pct_desc, __pct_op, __pct_cN) unconditionally, causing duplicate
definitions when multiple C buffers are marked for inlining. Additionally, the
direct GEMM path at lines 1554-1585 uses local descriptors with arbitrary
dimensions but still references the persistent __pct_c tensors that were created
with 16×32×16 shape, creating a mismatch when actual dimensions differ. Fix this
by: (1) generating unique symbol names per buffer allocation using a Var-keyed
prefix instead of hardcoded __pct names in all allocation sites (lines 692-709,
1326-1329, 1361-1365, 1472-1481), and (2) in the direct GEMM path, validate that
the descriptor dimensions match the persistent tensor shapes (16, 32, 16); if
dimensions don't match, skip the persistent tensor optimization and use
non-elided storage instead.

In `@src/metal/op/copy.cc`:
- Around line 123-127: The divisibility check in copy.cc is inconsistent with
fill.cc and the actual cooperative tensor GEMM micro tile dimension of 16×32.
Change the kTileSize constant and kTileElems calculation in the copy operation
to use the correct tile dimensions (16×32 = 512 elements instead of 16×16 = 256
elements) to match the tile size checks in fill.cc and ensure buffers that pass
the copy divisibility check will also pass the fill lowering requirements.

In `@testing/python/metal/test_metal_gemm_v2_linux.py`:
- Line 201: The assertion in
assert_metal_gemm_v2_global_cooperative_tensor_codegen currently hard-codes the
value 128 in the assertion check for max_total_threads_per_threadgroup, but the
function accepts a threads parameter that may differ from the default. Replace
the hard-coded 128 value with the threads parameter so that the assertion
correctly validates the requested thread count instead of always expecting 128,
allowing non-default callers to pass the assertion correctly.

In `@tilelang/metal/op/gemm/gemm_metal.py`:
- Around line 205-276: The c_bytes_per_thread calculation in the lower method
uses a hardcoded tile size of 64 bytes, but this doesn't match the actual
cooperative tensor micro-tile size being used. Move the c_bytes_per_thread
calculation to after the MPSIntrinEmitter is created (after line 239 where
mps_emitter is instantiated) and replace the hardcoded 64 value with the actual
micro-tile dimensions from the emitter: use micro_size_x * micro_size_y (which
are extracted from mps_emitter on lines 249-250) multiplied by the appropriate
element size in bytes to calculate the correct bytes per thread, which will
ensure the inner_k_steps heuristic is based on the actual tile size being used.
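
The tile-size mismatch flagged for src/metal/op/copy.cc in the list above comes down to simple arithmetic. The sketch below assumes both checks test the total element count against the tile element count, as the review comment suggests; the real checks in copy.cc and fill.cc are not shown in this thread and may be structured differently.

```python
COPY_TILE_ELEMS = 16 * 16  # 256: tile size the review says copy.cc checks against
FILL_TILE_ELEMS = 16 * 32  # 512: cooperative tensor micro-tile per the review


def divisible(shape, tile_elems):
    """Assumed check: total element count must be a multiple of the tile."""
    rows, cols = shape
    return (rows * cols) % tile_elems == 0


shape = (16, 48)  # 768 elements, an illustrative buffer shape

# 768 is a multiple of 256 but not of 512, so the buffer passes the copy-side
# check yet fails the fill-side one.
assert divisible(shape, COPY_TILE_ELEMS)
assert not divisible(shape, FILL_TILE_ELEMS)

# A shape aligned to the 16x32 tile satisfies both checks.
assert divisible((16, 64), COPY_TILE_ELEMS) and divisible((16, 64), FILL_TILE_ELEMS)
```

Any element count that is an odd multiple of 256 shows the same inconsistency.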


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c7a820b-faad-4a2a-8bb6-5fa7e9ed40ef

📥 Commits

Reviewing files that changed from the base of the PR and between 65dbc98 and c0c41f6.

📒 Files selected for processing (36)
  • 3rdparty/tvm
  • benchmark/matmul_metal/benchmark_matmul_metal.py
  • docs/compiler_internals/metal_tilelang_development.md
  • docs/index.md
  • src/metal/codegen/codegen_metal.cc
  • src/metal/codegen/codegen_metal.h
  • src/metal/op/copy.cc
  • src/metal/op/fill.cc
  • src/metal/op/gemm.cc
  • src/metal/op/utils.h
  • src/metal/target_utils.cc
  • src/metal/target_utils.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/op/gemm.cc
  • src/op/gemm.h
  • src/transform/layout_inference.cc
  • src/transform/lower_device_kernel_launch.cc
  • src/transform/lower_thread_allreduce.cc
  • src/transform/plan_update_buffer_allocation_location.cc
  • src/transform/storage_rewrite.cc
  • testing/python/metal/test_metal_gemm_v2.py
  • testing/python/metal/test_metal_gemm_v2_linux.py
  • testing/python/metal/test_metal_simdgroup_store.py
  • tilelang/cuda/intrinsics/layout/mma_layout.py
  • tilelang/engine/lower.py
  • tilelang/language/annotations.py
  • tilelang/language/builtin.py
  • tilelang/language/gemm_op.py
  • tilelang/metal/intrinsics/metal_macro_generator.py
  • tilelang/metal/op/gemm/__init__.py
  • tilelang/metal/op/gemm/gemm_metal.py
  • tilelang/metal/target.py
  • tilelang/metal/transform/__init__.py
  • tilelang/metal/transform/metal_fragment_to_simdgroup.py
  • tilelang/utils/language.py
💤 Files with no reviewable changes (1)
  • tilelang/language/gemm_op.py

f"{mode:>10s} | {block_text:>16s} | {threads:>4d} | {swizzle_text:>8s} | "
f"{tl:>10.1f} TFLOPS | {torch_ratio:>7.0f}% | {mlx_text}"
)
except Exception as e:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Satisfy Ruff for the intentional sweep catch.

This catch keeps the benchmark sweep running after a bad config, but Ruff BLE001 flags blind Exception; either narrow it or add a local waiver with intent.

Proposed fix
-        except Exception as e:
+        except Exception as e:  # noqa: BLE001 - keep benchmark sweeps running after per-config failures
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 247-247: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/matmul_metal/benchmark_matmul_metal.py` at line 247, the broad
`except Exception as e:` triggers Ruff's BLE001 rule (blind exception catch).
The catch is intentional: it keeps the benchmark sweep running after a bad
configuration. Either narrow it to the specific exceptions a bad config can
raise, or add a local `# noqa: BLE001` waiver with a short comment stating that
the broad catch is needed so the sweep continues past per-config failures.

Source: Linters/SAST tools
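
To make the intent of the waiver concrete, here is a minimal sketch of a sweep loop that records per-config failures instead of aborting. The names `run_sweep`, `run_one`, and `flaky` are hypothetical and are not taken from `benchmark_matmul_metal.py`; only the `# noqa: BLE001` comment mirrors the proposed fix.

```python
import logging

logger = logging.getLogger("benchmark_sweep")


def run_sweep(configs, run_one):
    """Call run_one(cfg) for every config; collect failures instead of raising."""
    results, failures = [], []
    for cfg in configs:
        try:
            results.append((cfg, run_one(cfg)))
        except Exception as e:  # noqa: BLE001 - keep the sweep running after per-config failures
            logger.warning("config %r failed: %s", cfg, e)
            failures.append((cfg, repr(e)))
    return results, failures


def flaky(cfg):
    """Stand-in for one benchmark run; config 2 plays the role of a bad config."""
    if cfg == 2:
        raise ValueError("unsupported block shape")
    return cfg * 10


results, failures = run_sweep([1, 2, 3], flaky)
```

The waiver stays local to the one line that needs it, so any other blind catch added later in the file would still be flagged.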

Comment on lines +692 to +709
this->PrintIndent();
stream
    << "constexpr auto __pct_desc = mpp::tensor_ops::matmul2d_descriptor("
    << "16, 32, 16, false, false, true, "
    << "mpp::tensor_ops::matmul2d_descriptor::mode::multiply_accumulate);"
       "\n";
this->PrintIndent();
stream << "mpp::tensor_ops::matmul2d<__pct_desc, "
          "metal::execution_simdgroup> __pct_op;\n";
for (int t = 0; t < num_c_tiles; t++) {
  this->PrintIndent();
  stream << "auto __pct_c" << t
         << " = __pct_op.get_destination_cooperative_tensor<"
         << "decltype(__pct_op.get_left_input_cooperative_tensor<half, "
            "half, float>()), "
         << "decltype(__pct_op.get_right_input_cooperative_tensor<half, "
            "half, float>()), float>();\n";
}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and verify the mentioned lines exist
wc -l src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 100


🏁 Script executed:

# Check the specific lines mentioned in the review
sed -n '692,709p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 908


🏁 Script executed:

# Check the other mentioned line ranges
sed -n '1326,1329p' src/metal/codegen/codegen_metal.cc
sed -n '1361,1365p' src/metal/codegen/codegen_metal.cc
sed -n '1472,1481p' src/metal/codegen/codegen_metal.cc
sed -n '1554,1585p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 2865


🏁 Script executed:

# Search for all occurrences of __pct_ symbol generation patterns
rg '__pct_' src/metal/codegen/codegen_metal.cc -A 2 -B 2

Repository: tile-ai/tilelang

Length of output: 3115


🏁 Script executed:

# Look for the descriptor generation pattern
rg 'matmul2d_descriptor' src/metal/codegen/codegen_metal.cc -A 3 -B 3

Repository: tile-ai/tilelang

Length of output: 1544


🏁 Script executed:

# Search for symbol name generation with context to understand variable tracking
rg '__pct_desc|__pct_op|__pct_c' src/metal/codegen/codegen_metal.cc -n | head -50

Repository: tile-ai/tilelang

Length of output: 1289


🏁 Script executed:

# Look at the context around persistent C allocation to understand when it happens
sed -n '680,720p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 1814


🏁 Script executed:

# Search for how persistent C buffers are allocated and in what context
rg 'persistent.*cooperation|cooperation.*persistent' src/metal/codegen/codegen_metal.cc -B 5 -A 5

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Look for the scope/function where these allocations happen to understand multi-buffer risk
sed -n '650,730p' src/metal/codegen/codegen_metal.cc | head -100

Repository: tile-ai/tilelang

Length of output: 3482


🏁 Script executed:

# Check if there's any validation or gating that prevents non-16x32x16 from reaching line 1585
sed -n '1545,1590p' src/metal/codegen/codegen_metal.cc

Repository: tile-ai/tilelang

Length of output: 2510


Two cooperative C tensors in the same kernel scope will produce duplicate Metal symbol definitions; non-16x32x16 GEMM shapes reuse mismatched persistent tensors.

The persistent C allocation (lines 694–709) emits the fixed names __pct_desc, __pct_op, and __pct_cN unconditionally whenever can_inline_c is true. If a kernel allocates multiple C buffers marked for inlining, each allocation emits the same names into the same kernel scope, so the generated Metal source fails to compile with redefinition errors.

Additionally, the direct GEMM path (lines 1554–1585) creates local __desc and __op with the correct dimensions when can_reuse_pct_op is false, that is, when (M, N, K) ≠ (16, 32, 16), but it still references the persistent __pct_c tensors that were created with the 16×32×16 descriptor at allocation time. The temporary cooperative tensors and the destination tensors are then built from different descriptors, which is a shape/dtype mismatch.

Assign a Var-keyed symbol prefix per buffer, and validate that direct paths use descriptors compatible with the persistent tensor shapes or fall back to non-elided storage.

Also applies to: 1326–1329, 1361–1365, 1472–1481, 1554–1585

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/metal/codegen/codegen_metal.cc` around lines 692 - 709, The persistent C
tensor allocation generates fixed symbol names (__pct_desc, __pct_op, __pct_cN)
unconditionally, causing duplicate definitions when multiple C buffers are
marked for inlining. Additionally, the direct GEMM path at lines 1554-1585 uses
local descriptors with arbitrary dimensions but still references the persistent
__pct_c tensors that were created with 16×32×16 shape, creating a mismatch when
actual dimensions differ. Fix this by: (1) generating unique symbol names per
buffer allocation using a Var-keyed prefix instead of hardcoded __pct names in
all allocation sites (lines 692-709, 1326-1329, 1361-1365, 1472-1481), and (2)
in the direct GEMM path, validate that the descriptor dimensions match the
persistent tensor shapes (16, 32, 16); if dimensions don't match, skip the
persistent tensor optimization and use non-elided storage instead.
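
The two-part fix described above can be sketched in Python as a small model of the name generation and the shape gate. This is only an illustration: `symbol_prefix`, `persistent_symbols`, and `can_use_persistent_c` are hypothetical helpers, not functions from `codegen_metal.cc`, and the buffer names are made up.

```python
import re

# Shape (M, N, K) baked into the persistent descriptor in the excerpt above.
PERSISTENT_SHAPE = (16, 32, 16)


def symbol_prefix(buffer_name):
    """Hypothetical helper: derive a per-buffer prefix from the buffer's name.

    A real codegen would more likely key on the Var object itself; a sanitized
    name is used here only to keep the sketch self-contained.
    """
    return "__pct_" + re.sub(r"\W", "_", buffer_name)


def persistent_symbols(buffer_name, num_c_tiles):
    """Names that one inlined C buffer would declare."""
    p = symbol_prefix(buffer_name)
    return [f"{p}_desc", f"{p}_op"] + [f"{p}_c{t}" for t in range(num_c_tiles)]


def can_use_persistent_c(m, n, k):
    """Only reuse the persistent tensors when the GEMM shape matches them."""
    return (m, n, k) == PERSISTENT_SHAPE


first = persistent_symbols("C_local", 2)
second = persistent_symbols("D_local", 2)
clash = set(first) & set(second)  # empty: the two buffers no longer share names
```

With fixed `__pct_*` names the two lists would be identical, which is the duplicate-definition case; the shape gate is the point where a mismatched GEMM would fall back to non-elided storage.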

Comment thread src/metal/op/copy.cc
Comment on lines +123 to +127
constexpr int kTileSize = 16;
constexpr int kTileElems = kTileSize * kTileSize;
TVM_FFI_ICHECK(total_elements % kTileElems == 0)
    << "cooperative_tensor buffer size must be multiple of " << kTileElems
    << ", got " << total_elements;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check cooperative tensor buffer allocation sizes

# Search for cooperative tensor allocations
rg -nP --type=py 'alloc.*scope="metal\.cooperative_tensor"' -A 2 -B 2

# Search for cooperative tensor tile size constants
rg -n 'kTileM.*=.*16|kTileN.*=.*32' src/metal/op/

Repository: tile-ai/tilelang

Length of output: 1623


🏁 Script executed:

# Read fill.cc lines 30-45 to see the full divisibility requirement
sed -n '30,45p' src/metal/op/fill.cc

# Read gemm.cc lines 30-40 to check documentation
sed -n '30,40p' src/metal/op/gemm.cc

# Read copy.cc around the flagged lines for full context
sed -n '115,135p' src/metal/op/copy.cc

Repository: tile-ai/tilelang

Length of output: 1856


Align tile size divisibility checks across copy and fill operations.

The divisibility check in copy.cc requires kTileElems = 256 (16×16), but fill.cc (lines 36-41) requires divisibility by 512 (16×32). A buffer with 256 elements would pass the copy check but fail the fill lowering. The cooperative tensor GEMM micro tile is 16×32 (documented in gemm.cc), so both copy and fill operations should align on that tile size.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/metal/op/copy.cc` around lines 123 - 127, The divisibility check in
copy.cc is inconsistent with fill.cc and the actual cooperative tensor GEMM
micro tile dimension of 16×32. Change the kTileSize constant and kTileElems
calculation in the copy operation to use the correct tile dimensions (16×32 =
512 elements instead of 16×16 = 256 elements) to match the tile size checks in
fill.cc and ensure buffers that pass the copy divisibility check will also pass
the fill lowering requirements.
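
A quick arithmetic check shows which buffer sizes fall into the gap between the two checks. The constants follow the numbers quoted in this comment (256 for copy, 512 for fill); the helper `passes` and the list of sizes are illustrative only.

```python
COPY_TILE_ELEMS = 16 * 16  # 256, the check quoted from copy.cc
FILL_TILE_ELEMS = 16 * 32  # 512, the check described for fill.cc


def passes(total_elements, tile_elems):
    """Mirror of the divisibility condition used by both checks."""
    return total_elements % tile_elems == 0


sizes = [256, 512, 768, 1024]
# Sizes accepted by the copy check but rejected by the fill check.
mismatched = [
    n for n in sizes
    if passes(n, COPY_TILE_ELEMS) and not passes(n, FILL_TILE_ELEMS)
]
```

Every odd multiple of 256 lands in `mismatched`, so sharing one tile constant between the two lowerings removes the whole class of inconsistent buffers rather than a single size.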

assert "const device half* __src" in src_code
assert "[[simdgroup_index_in_threadgroup]]" in src_code
assert "__metal_get_thread_index_in_simdgroup" in src_code
assert "max_total_threads_per_threadgroup(128)" in src_code


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the requested thread count instead of hard-coding 128.

assert_metal_gemm_v2_global_cooperative_tensor_codegen accepts threads, but Line 201 always expects max_total_threads_per_threadgroup(128), so non-default callers would fail for the wrong reason.

Proposed fix
-    assert "max_total_threads_per_threadgroup(128)" in src_code
+    assert f"max_total_threads_per_threadgroup({threads})" in src_code
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/metal/test_metal_gemm_v2_linux.py` at line 201, The assertion
in assert_metal_gemm_v2_global_cooperative_tensor_codegen currently hard-codes
the value 128 in the assertion check for max_total_threads_per_threadgroup, but
the function accepts a threads parameter that may differ from the default.
Replace the hard-coded 128 value with the threads parameter so that the
assertion correctly validates the requested thread count instead of always
expecting 128, allowing non-default callers to pass the assertion correctly.
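
The difference between the two assertions can be seen on a stand-in string. `fake_src` below is not real generated Metal source and `has_thread_attr` is a hypothetical helper; the sketch only shows how the hard-coded check misfires for a non-default thread count.

```python
def has_thread_attr(src_code, threads=128):
    """Check that the source requests exactly the given thread count."""
    return f"max_total_threads_per_threadgroup({threads})" in src_code


# Stand-in for source generated with threads=256 (illustrative only).
fake_src = "/* ... */ max_total_threads_per_threadgroup(256) /* ... */"

hard_coded_ok = "max_total_threads_per_threadgroup(128)" in fake_src
parameterized_ok = has_thread_attr(fake_src, threads=256)
```

Here the hard-coded check reports a failure even though the source carries the requested thread count, which is the "fails for the wrong reason" case described above.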

Comment on lines +205 to 276
@staticmethod
def _get_padded_stride(buffer):
continuous = int(buffer.shape[-1])
element_bits = int(tvm.DataType(buffer.dtype).bits)
padded = continuous
if (element_bits * continuous) % 256 == 0:
padded += 128 // element_bits
return padded

def lower(
self,
layout_map: dict,
target: Target,
thread_bounds: Range,
thread_var: tir.Var,
mbar_phase_expr: tir.PrimExpr | None = None,
):
thread_nums = thread_bounds.extent
_, m_warp, n_warp = self._make_mps_emitter(target, int(thread_nums))
warp_row_tiles = int(self.M // m_warp)
warp_col_tiles = int(self.N // n_warp)

from tilelang.metal.intrinsics.metal_macro_generator import MPSIntrinEmitter

@T.prim_func
def _gemm_ss_shared() -> None:
A_local = T.alloc_local((warp_rows * 64), a_dtype, scope="metal.simdgroup")
B_local = T.alloc_local((warp_cols * 64), b_dtype, scope="metal.simdgroup")
C_simd = T.alloc_local((num_simd_c * 64), accum_dtype, scope="metal.simdgroup")
if clear_accum:
for _i in T.serial(num_simd_c):
T.make_filled_simdgroup_matrix(C_simd.data, _i, T.cast(0, accum_dtype))
else:
mps_emitter.simd_load(C_simd, C_buf)
for ki in T.serial(0, (block_K // micro_size_k)):
mps_emitter.ldmatrix_a(A_local, A_region, ki)
mps_emitter.ldmatrix_b(B_local, B_region, ki)
mps_emitter.mma(A_local, B_local, C_simd)

mps_emitter.simd_store(C_simd, C_buf)

return _Simplify(_gemm_ss_shared, inline_let=True)
else:
a_stride = self._get_padded_stride(self.A) if self.is_gemm_ss() else None
b_stride = self._get_padded_stride(self.B) if self.is_gemm_ss() else None

c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64
inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1
output_dtype = self.accum_dtype
accum_dtype = T.float32 if self.is_gemm_gg() and str(output_dtype) in ("float16", "bfloat16") else output_dtype
mps_emitter = MPSIntrinEmitter(
a_dtype=self.a_dtype,
b_dtype=self.b_dtype,
accum_dtype=accum_dtype,
a_transposed=self.trans_A,
b_transposed=self.trans_B,
block_row_warps=m_warp,
block_col_warps=n_warp,
warp_row_tiles=warp_row_tiles,
warp_col_tiles=warp_col_tiles,
chunk=self.chunk,
thread_var=thread_var,
a_stride_override=a_stride,
b_stride_override=b_stride,
inner_k_steps=inner_k_steps,
)

a_dtype = self.a_dtype
b_dtype = self.b_dtype
warp_rows = mps_emitter.warp_rows
warp_cols = mps_emitter.warp_cols
num_simd_c = warp_rows * warp_cols
block_K = mps_emitter.chunk
micro_size_x = mps_emitter.micro_size_x
micro_size_y = mps_emitter.micro_size_y
micro_size_k = mps_emitter.micro_size_k
inner_k_steps = mps_emitter.inner_k_steps
a_tile_elems = micro_size_x * micro_size_k
b_tile_elems = micro_size_k * micro_size_y
c_tile_elems = micro_size_x * micro_size_y

A_region = self.ARegion
B_region = self.BRegion
C_region = self.CRegion
C_buf = C_region.buffer
clear_accum = self.clear_accum
c_in_cooperative_tensor = is_metal_cooperative_tensor(C_buf) or is_fragment(C_buf)
assert block_K >= micro_size_k, f"block_K ({block_K}) must be >= micro_size_k ({micro_size_k})"

if not (self.is_gemm_ss() or self.is_gemm_gg()):
raise ValueError(f"Unsupported gemm combination, A: {self.A.scope()}, B: {self.B.scope()}")
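
As a worked example of the `_get_padded_stride` arithmetic in the excerpt above, the following standalone sketch takes plain integers instead of a TVM buffer. `padded_stride` is a hypothetical stand-in written for illustration; the excerpt does not state why the padding is applied, so no rationale is assumed here.

```python
def padded_stride(continuous, element_bits):
    """Same arithmetic as the excerpt, with the buffer replaced by two ints.

    continuous   : extent of the innermost dimension, in elements
    element_bits : bit width of one element (e.g. 16 for float16)
    """
    padded = continuous
    # Pad only when a row occupies a whole number of 256-bit units.
    if (element_bits * continuous) % 256 == 0:
        padded += 128 // element_bits  # 128 bits' worth of extra elements
    return padded


fp16_row = padded_stride(64, 16)  # 64 * 16 = 1024 bits, padded by 8 elements
fp16_odd = padded_stride(24, 16)  # 24 * 16 = 384 bits, left unpadded
fp32_row = padded_stride(32, 32)  # 32 * 32 = 1024 bits, padded by 4 elements
```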


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

c_bytes_per_thread calculation uses simdgroup tile size (64) instead of cooperative tensor tile size.

At line 232, c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64 uses the fixed value 64, which corresponds to the simdgroup 8x8 micro-tile. However, GemmMetal uses cooperative tensor mode with 16x32=512 element micro-tiles.

This affects the inner_k_steps heuristic on line 233. If the intent is to measure register pressure per thread, the calculation should account for the actual tile size being used.

🔧 Suggested fix
-        c_bytes_per_thread = warp_row_tiles * warp_col_tiles * 64
+        # Cooperative tensor micro-tile is 16x32 = 512 elements
+        ct_micro_elems = 16 * 32
+        c_bytes_per_thread = warp_row_tiles * warp_col_tiles * ct_micro_elems
         inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1

Or alternatively, compute this after creating the emitter to use consistent values:

+        micro_size_x_ct = 16
+        micro_size_y_ct = 32
+        c_tile_elems_approx = micro_size_x_ct * micro_size_y_ct
+        c_bytes_per_thread = warp_row_tiles * warp_col_tiles * c_tile_elems_approx
         inner_k_steps = 2 if c_bytes_per_thread <= 128 else 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/metal/op/gemm/gemm_metal.py` around lines 205 - 276, the
c_bytes_per_thread calculation in the lower method multiplies by a hard-coded
64, which is the element count of the 8×8 simdgroup micro-tile and does not
match the cooperative tensor micro-tile actually in use. Move the
c_bytes_per_thread calculation to after the MPSIntrinEmitter is created (after
line 239, where mps_emitter is instantiated) and replace the hard-coded 64 with
the emitter's own micro-tile dimensions: micro_size_x * micro_size_y (read from
mps_emitter on lines 249-250), multiplied by the element size in bytes if a
byte count is intended. This keeps the inner_k_steps heuristic tied to the tile
size that is actually used.
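
To see how sensitive the heuristic is to the tile constant, here is a small sketch that reproduces the threshold test with the tile size as a parameter. The function name, the `threshold` parameter, and the input values are hypothetical and chosen only for illustration; they are not taken from `gemm_metal.py` beyond the `<= 128` comparison shown in the excerpt.

```python
def pick_inner_k_steps(warp_row_tiles, warp_col_tiles, tile_elems, threshold=128):
    """Reproduce the excerpt's test: 2 steps for a small footprint, else 1."""
    footprint = warp_row_tiles * warp_col_tiles * tile_elems
    return 2 if footprint <= threshold else 1


# Same (hypothetical) warp tiling, two different tile constants.
with_simdgroup_tile = pick_inner_k_steps(1, 2, tile_elems=8 * 8)      # 128 -> 2
with_coop_tile = pick_inner_k_steps(1, 2, tile_elems=16 * 32)         # 1024 -> 1
smallest_coop_case = pick_inner_k_steps(1, 1, tile_elems=16 * 32)     # 512 -> 1
```

One consequence worth checking before applying the suggested fix: with a 512-element tile the footprint is at least 512 for any positive tiling, so the `<= 128` branch can no longer be taken unless the threshold is rescaled along with the constant.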

Comment thread src/transform/storage_rewrite.cc Outdated
Comment thread tilelang/utils/language.py Outdated