[Feature][Blackwell] Add SM120 T.float4_e2m1fn FP4 GEMM support. by TerminusAkivili · Pull Request #2171 · tile-ai/tilelang

TerminusAkivili · 2026-05-08T13:58:26Z

Summary

This PR adds SM120 fragment-MMA GEMM support for semantic T.float4_e2m1fn
operands.

TileLang programs continue to declare FP4 operands as T.float4_e2m1fn.
For the SM120 performance path, lowering maps those semantic operands onto the
hidden T.float4_e2m1_unpacked byte carrier internally. Users do not need to
write FP4 GEMM operands as uint8 tensors in TileLang programs.

Supported SM120 GEMM combinations:

| A dtype | B dtype | Accumulator | SM120 path |
| --- | --- | --- | --- |
| `T.float4_e2m1fn` | `T.float4_e2m1fn` | `T.float32` | FP4 x FP4 fragment MMA |
| `T.float8_e4m3fn` | `T.float4_e2m1fn` | `T.float32` | A8W4 fragment MMA |
| `T.float4_e2m1fn` | `T.float8_e4m3fn` | `T.float32` | W4A8 fragment MMA |
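
The supported combinations can be read as a lookup from an (A, B) dtype pair to a lowering path. A minimal sketch follows, using plain strings in place of TileLang dtype objects. `SUPPORTED_SM120_PAIRS` and `select_sm120_path` are hypothetical names for illustration, not functions from this PR, and the sketch covers only the FP4-related rows listed here.

```python
# Hypothetical lookup mirroring the supported-combination table.
# Strings stand in for TileLang dtype objects.
SUPPORTED_SM120_PAIRS = {
    ("float4_e2m1fn", "float4_e2m1fn"): "FP4 x FP4 fragment MMA",
    ("float8_e4m3fn", "float4_e2m1fn"): "A8W4 fragment MMA",
    ("float4_e2m1fn", "float8_e4m3fn"): "W4A8 fragment MMA",
}


def select_sm120_path(a_dtype, b_dtype, accum_dtype="float32"):
    """Return the path name for a supported pair, or raise ValueError."""
    if accum_dtype != "float32":
        raise ValueError(f"unsupported accumulator dtype: {accum_dtype}")
    key = (a_dtype, b_dtype)
    if key not in SUPPORTED_SM120_PAIRS:
        raise ValueError(f"unsupported SM120 FP4 GEMM dtype pair: {key}")
    return SUPPORTED_SM120_PAIRS[key]
```

Rejecting unknown pairs with an explicit error, rather than falling through to a default path, matches the PR's stated preference for failing early on unsupported configurations.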

Design Goals

- Preserve semantic TileLang signatures: FP4 tensors stay T.float4_e2m1fn at
  the language level.
- Use T.float4_e2m1_unpacked as the hidden physical carrier for the SM120 FP4
  GEMM path.
- Align the SM120 FP4 performance path with the byte-carrier / ordinary
  ldmatrix model used by #2182 ([Feature] Support Blackwell FP4(float4_e2m1fn)
  GEMM for SM100 & SM120).
- Add SM120 FP4/A8W4/W4A8 support without changing existing int4/uint4
  ldmatrix behavior.
- Reject unsupported FP4 MMA K tile shapes early instead of generating kernels
  that silently skip a K tail.

Design

The key separation is between the public dtype and the physical carrier:

- TileLang program signatures use T.float4_e2m1fn.
- SM120 GEMM lowering uses hidden custom[float4_e2m1_unpacked]8 shared/local
  carrier buffers where the hardware path needs byte slots.
- Runtime tensors for this SM120 path may use byte-compatible storage with
  logical (M, K) / (N, K) shapes.
- The FFI binder accepts byte-compatible runtime tensors for semantic SM120 FP4
  operands while preserving the public T.float4_e2m1fn handle dtype.
- The SM120 MMA path shifts FP4 payload bits only for FP4 operands before
  calling the CuTe SM120 F8/F6/F4 atom.

Main Changes

CUDA Templates

- Add SM120 cute::SM120_16x8x32_TN dispatch for FP4xFP4, FP8xFP4, and
  FP4xFP8 into FP32.
- Apply the FP4 operand register shift only to operands that are actually FP4.
- Bridge TileLang FP4 template types to CuTe FP4 types while keeping existing
  packed helper types.

CUDA Lowering

- Lower semantic SM120 FP4 GEMM operands through hidden
  float4_e2m1_unpacked carrier buffers where required by the performance path.
- Emit ordinary tl::ptx_ldmatrix_x* for the hidden unpacked-carrier path.
- Keep packed FP4 fallback vector load/store safe for odd or not-proven-even
  logical offsets by lowering those cases through per-lane nibble helpers.
- Preserve shared-memory alias information for CuTeDSL codegen.
- Preserve FP4 storage casts when layout lowering changes the physical carrier
  dtype.

Python Lowering

- Use T.float4_e2m1_unpacked as the hidden local/shared carrier for semantic
  SM120 FP4 operands.
- Build SM120 FP4 shared layouts through an unpacked-carrier view while keeping
  public buffer dtypes semantic.
- Validate mixed A/B dtypes explicitly for A8W4 and W4A8.
- Reject T.gemm block K values that are not divisible by the selected
  instruction K tile.

Examples

Add mainline-style SM120 examples for FP4xFP4 and A8W4:
- examples/gemm_fp4/example_gemm_fp4_sm120.py
- examples/gemm_fp4/example_gemm_a8w4_sm120.py
The examples keep semantic T.float4_e2m1fn kernel signatures and use
byte-compatible host tensors only as an interoperability detail.
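
When comparing kernel output against a float32 reference, it helps to see how a 4-bit E2M1 code maps to a real value (1 sign bit, 2 exponent bits, 1 mantissa bit). The sketch below is illustrative only: `decode_e2m1` is a hypothetical helper, not the decoder used by the examples, and it makes no claim about where the 4-bit code sits inside a host byte.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) to a Python float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0x8 else 1.0  # bit 3 is the sign bit
    return sign * E2M1_MAGNITUDES[code & 0x7]


# All 16 codes, useful as a lookup table for a host-side reference.
E2M1_LUT = [decode_e2m1(code) for code in range(16)]
```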

Tests

Add focused SM120 lowering coverage in
testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py.
The tests check that semantic FP4 GEMM uses hidden unpacked shared/local
carriers, ordinary tl::ptx_ldmatrix_x*, and no tl::ptx_ldmatrix_b4x16 on
the main performance path.

Why The Review Fixes Matter

FP4 storage has two concerns that should not be conflated: the user-facing dtype
and the physical carrier used by the hardware path. The hidden
float4_e2m1_unpacked carrier lets SM120 GEMM keep the semantic
T.float4_e2m1fn API while matching the ordinary-ldmatrix byte-carrier
performance model used by #2182.

For packed fallback cases, FP4 byte storage is byte-addressed while logical FP4
elements are nibble-addressed. A vector reinterpret load/store is safe only when
the logical base offset is known to be even. If the offset is odd, or if codegen
cannot prove it is even, vectorized byte reinterpretation can read or write the
wrong nibble without producing a compilation error. This PR routes those cases
through per-lane nibble helpers.
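
The hazard can be reproduced on the host with plain bytes. The sketch below assumes two FP4 elements per byte with the even-indexed element in the low nibble; that ordering and all function names are assumptions for illustration, not the PR's helpers.

```python
def read_fp4_nibble(storage, index):
    """Per-lane access: pick the nibble that holds logical element `index`."""
    byte = storage[index // 2]
    return (byte >> 4) & 0xF if index % 2 else byte & 0xF


def load_per_lane(storage, base, lanes):
    """Safe for any base offset: each lane resolves its own byte and nibble."""
    return [read_fp4_nibble(storage, base + i) for i in range(lanes)]


def load_by_byte_reinterpret(storage, base, lanes):
    """Models a vector reinterpret: copies whole bytes starting at base // 2.

    This is only correct when `base` is even, because an odd base starts
    in the middle of a byte.
    """
    start = base // 2
    raw = storage[start:start + lanes // 2]
    return [nibble for b in raw for nibble in (b & 0xF, b >> 4)]


# Six logical FP4 codes 0..5 packed into three bytes.
storage = bytes([0x10, 0x32, 0x54])
even_ok = load_by_byte_reinterpret(storage, 0, 4) == load_per_lane(storage, 0, 4)
odd_ok = load_by_byte_reinterpret(storage, 1, 4) == load_per_lane(storage, 1, 4)
```

With an even base both loads agree. With base 1 the byte reinterpretation returns elements 0..3 instead of 1..4, and nothing fails at compile time, which is the silent error the per-lane path avoids.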

SM120 FP4/A8W4/W4A8 MMA consumes K in fixed m16n8k32 instruction chunks.
Allowing a block_K such as 48 would execute only the representable K=32
portion and miss the K tail. This PR turns that silent numerical error into an
explicit unsupported-shape error.
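
The arithmetic behind the rejected shape is small enough to sketch. `validate_block_k` is a hypothetical name; the instruction K of 32 comes from the m16n8k32 shape named above.

```python
def validate_block_k(block_k, instr_k=32):
    """Return the number of MMA steps per block, or raise on a K tail."""
    if block_k <= 0 or block_k % instr_k != 0:
        raise ValueError(
            f"block_K={block_k} is not a positive multiple of the "
            f"instruction K tile ({instr_k})"
        )
    return block_k // instr_k


def uncovered_k_tail(block_k, instr_k=32):
    """K elements that whole instruction chunks would leave unprocessed."""
    return block_k - (block_k // instr_k) * instr_k
```

For block_K = 48, whole chunks cover only 32 elements and `uncovered_k_tail(48)` is 16, so the check raises instead of letting those 16 elements drop out of the accumulation.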

Validation

Local SM120 validation used an RTX PRO 6000 / compute capability 12.0
environment.

Build and focused examples:

cmake --build build -j$(nproc)
PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python examples/gemm_fp4/example_gemm_fp4_sm120.py
PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python examples/gemm_fp4/example_gemm_a8w4_sm120.py

Generated CUDA and TIR were inspected for the expected SM120 FP4 markers:

- public TIR handles: float4_e2m1fn
- internal carrier: custom[float4_e2m1_unpacked]8
- ordinary ldmatrix: tl::ptx_ldmatrix_x*
- not present on the main performance path: tl::ptx_ldmatrix_b4x16
- SM120 m16n8k32 FP4/A8W4/W4A8 dtype dispatch

Focused tests:

PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python -m pytest \
  testing/python/language/test_tilelang_language_float4_e2m1_unpacked_dtype.py \
  testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py -q

Observed focused-test results:

testing/python/language/test_tilelang_language_float4_e2m1_unpacked_dtype.py: 5 passed
testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py: 2 passed

Performance comparison against #2182 used the same SM120 GEMM shapes for
FP4xFP4, A8W4, and W4A8, with 6 shapes, 2 block_K settings, and 9 repeats per
point. No block-scale cases were included.

Latency delta is (this PR latency / #2182 latency) - 1, so positive values mean this PR is slower:

mean geomean:
FP4xFP4  +0.41%
A8W4     +0.55%
W4A8     +0.75%
all      +0.57%

median geomean:
FP4xFP4  +1.37%
A8W4     -1.25%
W4A8     +0.23%
all      +0.11%

Equivalent TOPS delta is (#2182 latency / this PR latency) - 1, so negative values mean this PR has lower throughput:

mean geomean:
FP4xFP4  -0.41%
A8W4     -0.55%
W4A8     -0.74%
all      -0.57%

median geomean:
FP4xFP4  -1.35%
A8W4     +1.27%
W4A8     -0.23%
all      -0.11%

The overall result is effectively performance-neutral versus #2182: mean TOPS
is about 0.57% lower, while median TOPS is about 0.11% lower.
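
The two tables are linked by a reciprocal: since TOPS is inversely proportional to latency for a fixed shape, a latency ratio r corresponds to a TOPS ratio 1/r. A short sketch of that conversion follows (the function name is hypothetical). Because the reported figures are rounded, recomputing from them can differ in the last digit.

```python
def tops_delta_pct(latency_delta_pct):
    """Convert a latency delta in percent to the equivalent TOPS delta in percent."""
    latency_ratio = 1.0 + latency_delta_pct / 100.0
    return (1.0 / latency_ratio - 1.0) * 100.0


# Reported latency deltas (percent), keyed by (statistic, case).
latency_deltas = {
    ("mean", "all"): 0.57,
    ("median", "FP4xFP4"): 1.37,
    ("median", "A8W4"): -1.25,
}
tops_deltas = {k: round(tops_delta_pct(v), 2) for k, v in latency_deltas.items()}
```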

Notes And Non-Goals

- Mixed A8W4/W4A8 dispatch is selected from explicit FP8/FP4 dtype pairs.
- float4_e2m1_unpacked is a hidden physical carrier for the SM120 path, not
  the public GEMM dtype users are expected to write.
- uint8 remains a runtime storage/interoperability detail for byte-compatible
  host tensors, not the public TileLang FP4 GEMM dtype.
- Existing int4/uint4 ldmatrix offset behavior stays on the existing path.

coderabbitai · 2026-05-08T13:58:43Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR implements SM120 (compute capability 12.0) FP4 (float4_e2m1fn) GEMM support across TileLang: examples and host unpacking, CUDA/TI codegen FP4 storage/indexing/vector/scalar handling, FP4-aware cp.async injection, b4x16 ldmatrix helpers, CuTe SM120 MMA dispatch for FP4/mixed operands, layout/macro generation changes, and GemmMMA integration.

Changes

SM120 FP4 GEMM Support

| Layer / File(s) | Summary |
| --- | --- |
| Examples & Host Helpers `examples/gemm_fp4/...` | Adds FP4 LUT constant, `unpack_fp4_storage_to_float`, `require_sm120()`, TileLang kernel generators (`matmul_a8w4` / `matmul_fp4`), `main()` harnesses, deterministic/random inputs, zero-input checks, float32 reference comparisons, and error assertions. |
| TL Templates: LDSM & CUDA FP4 Types `src/tl_templates/cuda/ldsm.h`, `src/tl_templates/cuda/cuda_fp4.h` | Adds SM120-only `ptx_ldmatrix_b4x16_x1/x2/x4` helpers and expands cuda_fp4 compile-time guards and `make_fp4_e2_64_t`. |
| MMA Dispatch & Instruction Support `src/tl_templates/cuda/gemm_mma.h`, `src/tl_templates/cuda/instruction/mma.h` | Maps `fp4_e2_t` to CuTe `float_e2m1_t`, registers SM120 16x8x32 TN dispatchers for FP4×FP4 and mixed FP8/FP4, and updates `tl::mma_sync` to left-shift FP4 operands before dispatcher invocation. |
| CUDA Codegen: Buffer / Vector / Scalar Access `src/backend/cuda/codegen/codegen_cuda.cc`, `src/backend/cuda/codegen/codegen_cuda.h` | Centralizes FP4 storage classification, adds `IsFp4*` helpers and `GetFp4PaddedSharedIndex`, applies padded-shared index remapping and packed-byte divisor logic, and implements FP4-aware scalar/vector load-store codegen and cp.async/ldmatrix emission paths. |
| PTX Async Injector & FP4-padded cp.async `src/transform/lower_ptx_async_copy.cc`, `src/transform/ptx_async_copy_injector.h` | Introduces `fp4_padded_shared_copy` flag and FP4-padded cp.async specialization that splits transfers into 16-FP4-element segments with padded index remapping; forwards flag through InjectPTXAsyncCopy/PTXAsyncCopyInjector. |
| Copy Lowering & LDSM Geometry `src/backend/cuda/op/copy.cc` | Threads FP4 padded mode into Copy lowering, gates FP4 ldmatrix lowering to SM120 and non-transposed paths, computes `elems_per_reg`/`elems_per_inst` for 4-bit types, and updates vectorization, access_ptr extents, local loads, and loop unroll trip counts. |
| Copy Eligibility Analysis `src/backend/cuda/op/copy_analysis.cc` | Adds FP4-specific gating: `CheckLDSMCopy` requires SM120 and exact src/dst dtype match for FP4; `CheckSTSMCopy` rejects STSM copies when either side is FP4. |
| Macro Generation & Layout Utilities `tilelang/cuda/intrinsics/macro/mma_macro_generator.py`, `tilelang/cuda/intrinsics/layout/*.py`, `tilelang/cuda/intrinsics/layout/utils.py` | Special-cases `float4_e2m1fn` to `k_dim=32`, routes 4-/8-bit types through shared_16x32→mma_32x16 transforms, computes FP4-dependent access extents (`4*num`), and adds FP4-specific layout mapping helpers plus `get_ldmatrix_offset` support. |
| GemmMMA Integration `tilelang/cuda/op/gemm/gemm_mma.py` | Adds FP8/FP4 dtype predicates, `_validate_mma_dtypes()` to enforce allowed mixed operand pairs (FP8+FP4 or identical), and allocates local fragments per operand dtype during lowering. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

tile-ai/tilelang#2138: Both PRs modify CUDA copy lowering (src/backend/cuda/op/copy.cc) and the cp.async/ptx async copy lowering path.
tile-ai/tilelang#2126: Related FP4 datatype and MMA/codegen pipeline additions.
tile-ai/tilelang#1524: Similar GetBufferRef / FP4 pointer/index arithmetic fixes.

Suggested reviewers

LJC00118
LeiWang1999
SiriusNEO

Poem

🐰 I nibble nibbles, pack them small,
SM120 wakes — kernels call.
Padded rows and cp.async chime,
TileLang hops through tiled time.
Rabbits cheer: GEMM runs fine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.10% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main feature: adding SM120 support for FP4 (T.float4_e2m1fn) GEMM operations, which aligns with the substantial changes across CUDA templates, lowering, Python bindings, and examples. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


github-actions · 2026-05-08T13:58:44Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/tl_templates/cuda/cuda_fp4.h (1)
166-187: ⚡ Quick win

Verify register allocation for fp4_e2_t values[64] in device code.

The 64-element local array is constant-indexed throughout (values[0]–values[63]), so nvcc at -O2+ should scalar-replace it into registers. However, unlike the explicitly-parameterized make_fp4_e2_32_t which guarantees register-only arguments, register spilling to local memory is possible at lower optimisation levels or with larger surrounding register pressure. Consider adding a __forceinline__ annotation to maximise inlining and scalar replacement at call sites.
Proposed annotation
-template <typename... Args>
-TL_DEVICE fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
+template <typename... Args>
+TL_DEVICE __forceinline__ fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tl_templates/cuda/cuda_fp4.h` around lines 166 - 187, The local array
fp4_e2_t values[64] in make_fp4_e2_64_t may be spilled under some compile
conditions; annotate the function to force inlining (e.g., add a
__forceinline__/always-inline device inline attribute to make_fp4_e2_64_t) so
nvcc can scalar-replace values[0]..values[63] into registers and inline the
make_fp4_e2_32_t calls; update the function declaration for make_fp4_e2_64_t
accordingly (keeping fp4_e2_t values[64] and the existing make_fp4_e2_32_t
usages unchanged).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around line 1973-2003: The FP4 padded shared-memory vector path
(IsFp4PaddedSharedStorage + code using GetFp4PaddedSharedIndex and the
byte_offset lambda when constructing the reinterpret cast for t.lanes()) can
incorrectly span the padded 16-element row boundary; add a guard or split logic:
either assert the logical base alignment (e.g., Ensure base % 16 == 0 for the
requested load/store) or detect when the access crosses a 16-element row by
computing the start and end logical indices (base + offset and base + offset +
t.lanes()-1) and comparing their 16-element row indices (truncdiv(..., 16)); if
it crosses, split the operation into two row-aligned fragments (like the
existing t.lanes()==32 two-fragment approach) and merge them, otherwise keep the
current single contiguous byte reinterpretation; apply the same fix to the other
similar blocks identified (around the other ranges mentioned).
- Around line 4428-4444: The allocator treats only scope == "local" as the path
that emits local backing arrays but FP4 fragments use the semantic storage name
"local.fragment", so allocations for these still hit the unsupported-scope
branch; update the scope checks used around is_int4_scalar_local, the FP4
alignas(16) branch, and the place that prints/omits the storage scope to treat
"local.fragment" as equivalent to "local" (either normalize scope to "local"
earlier or change conditions from scope == "local" to (scope == "local" || scope
== "local.fragment")), ensuring PrintStorageScope/PrintType and the
backing-array emission path handle FP4 fragments the same as regular local
allocations (references: is_int4_scalar_local, op->dtype.is_float4_e2m1fn(),
PrintStorageScope, PrintType, and the "local.fragment" semantic storage).

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py`:
- Around line 121-124: The FP4 fast-path in mma_macro_generator.py sets
self.k_dim = 32 without respecting self.chunk, causing micro_size_k to exceed
chunk when chunk < 32; update the FP4 branch in the initializer (the block
setting self.k_dim) to clamp k_dim by self.chunk (e.g., self.k_dim = min(32,
self.chunk)) and add the same clamp/guard in the subclass override (the code
around lines 873–877) so both places respect chunk; optionally emit a clear
ValueError or assertion if chunk < required minimum to fail early with a helpful
message referencing the dtype and chunk size.
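
The row-crossing test described in the first inline comment can be sketched on the host as follows. It assumes non-negative logical indices (so floor division matches truncating division) and the 16-element padded row mentioned in the comment; `crosses_row` and `split_at_rows` are hypothetical names, not code from the PR.

```python
ROW_ELEMS = 16  # logical FP4 elements per padded shared-memory row


def crosses_row(base, offset, lanes, row=ROW_ELEMS):
    """True if the access [start, start + lanes) spans more than one row."""
    start = base + offset
    end = start + lanes - 1
    return start // row != end // row


def split_at_rows(start, lanes, row=ROW_ELEMS):
    """Split an access into (start, length) fragments that stay within a row."""
    fragments = []
    while lanes > 0:
        room = row - start % row  # elements left before the row boundary
        length = min(room, lanes)
        fragments.append((start, length))
        start += length
        lanes -= length
    return fragments
```

An 8-lane access starting at logical index 12 yields two fragments, (12, 4) and (16, 4), each of which can be remapped through the padded index independently.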

---

Nitpick comments:
In `@src/tl_templates/cuda/cuda_fp4.h`:
- Around line 166-187: The local array fp4_e2_t values[64] in make_fp4_e2_64_t
may be spilled under some compile conditions; annotate the function to force
inlining (e.g., add a __forceinline__/always-inline device inline attribute to
make_fp4_e2_64_t) so nvcc can scalar-replace values[0]..values[63] into
registers and inline the make_fp4_e2_32_t calls; update the function declaration
for make_fp4_e2_64_t accordingly (keeping fp4_e2_t values[64] and the existing
make_fp4_e2_32_t usages unchanged).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a09f3145-ce2d-4b0d-bb75-d916a099b2be

📥 Commits

Reviewing files that changed from the base of the PR and between a797e51 and 140f774.

📒 Files selected for processing (16)

examples/gemm_fp4/example_gemm_a8w4_sm120.py
examples/gemm_fp4/example_gemm_fp4_sm120.py
src/backend/cuda/codegen/codegen_cuda.cc
src/backend/cuda/codegen/codegen_cuda.h
src/backend/cuda/op/copy.cc
src/backend/cuda/op/copy_analysis.cc
src/tl_templates/cuda/cuda_fp4.h
src/tl_templates/cuda/gemm_mma.h
src/tl_templates/cuda/instruction/mma.h
src/tl_templates/cuda/ldsm.h
src/transform/lower_ptx_async_copy.cc
src/transform/ptx_async_copy_injector.h
tilelang/cuda/intrinsics/layout/mma_layout.py
tilelang/cuda/intrinsics/layout/utils.py
tilelang/cuda/intrinsics/macro/mma_macro_generator.py
tilelang/cuda/op/gemm/gemm_mma.py

TerminusAkivili · 2026-05-11T16:40:04Z

Hi @LeiWang1999, no rush at all. Feel free to check it whenever it's convenient for you. I'd love your feedback. Thank you!

Hale423 · 2026-05-15T10:34:54Z

Oh, it seems this PR overlaps with part of my PR #2182. Just wanted to clarify that there was no intent to duplicate the work, as it is carried over from the earlier FP4 branch (I didn't notice this PR when I created my new one). I'm totally okay with coordinating scope if maintainers have any feedback. Thanks for your work.

TerminusAkivili · 2026-05-15T20:07:08Z

Thanks @Hale423 for the clarification! I took a closer look at the SM120 overlap, and I think the two PRs are taking slightly different directions.
My understanding is that #2182 uses a uint8 carrier for the SM120 FP4 path and maps that into FP4/F8F6F4 MMA lowering, which gives it a more byte-aligned fast path. #2171 keeps the TileLang-facing API semantic: kernels use T.float4_e2m1fn / T.float8_e4m3fn, while packed storage, shared layout, ldmatrix, and fragment handling stay inside lowering/codegen.
In local default-shape testing, #2182 is faster for FP4xFP4 by around 8-9% on average. That seems expected because both A and B pay the FP4 shared-layout/copy overhead in #2171’s semantic b4x16_p64 path, while #2182’s byte-carrier path is closer to the packed fast path. For A8W4 the gap is much smaller, close to parity on larger shapes, since only one operand is FP4.
Happy to coordinate scope based on maintainer feedback. One possible follow-up would be to keep #2171’s semantic public API while adding an internal byte-carrier-style fast path for FP4xFP4.

TerminusAkivili · 2026-05-15T22:19:45Z

The downside is that it introduces FP4xFP4-specific logic across layout, copy analysis, pipeline planning, codegen, and CUDA helpers, which increases maintenance cost and expands the scope of this PR.
A cleaner long-term direction may be a more general packed low-bit dtype storage lowering mechanism, where semantic dtype, packed global storage, shared-memory carrier/layout, and MMA load path are described in a unified way. But that would touch a much larger surface area.
So I think which direction to take, and how to structure it, should depend on the maintainers’ feedback and guidance.

Hale423 · 2026-05-16T01:02:36Z

Confirmed, feel free to implement any idea on SM120, I'm willing to coordinate scope, thanks for sharing!

TerminusAkivili · 2026-05-16T16:50:02Z

@Hale423 I also tried a separate RFC version that keeps the public API as T.float4_e2m1fn, but internally uses a byte-carrier lowering path similar to #2182. Its benchmark results are roughly on par with #2182.
The branch/commit is here for reference:
TerminusAkivili@b8e2818
Since that version touches a larger surface area, I’m not planning to include it in #2171 before maintainer feedback. If we also count that direction, I think the core scopes are no longer conflicting.

Hale423 · 2026-05-18T07:46:00Z

> @Hale423 I also tried a separate RFC version that keeps the public API as T.float4_e2m1fn, but internally uses a byte-carrier lowering path similar to #2182. Its benchmark results are roughly on par with #2182. The branch/commit is here for reference: TerminusAkivili@6c118bc Since that version touches a larger surface area, I’m not planning to include it in #2171 before maintainer feedback. If we also count that direction, I think the core scopes are no longer conflicting.

Got it, thanks for your clarification

LeiWang1999 · 2026-05-27T07:14:56Z

Looks interesting, but for f8f6f4, I think we need to introduce a hidden type, float4_e2m1_unpacked – that's exactly what we're working on. Thanks.

TerminusAkivili · 2026-05-27T15:51:47Z

@LeiWang1999 Thanks for the update and for working on this. I’ll keep an eye on the progress.

TerminusAkivili · 2026-05-28T15:02:01Z

I’d appreciate it if you could review this PR when you have time. Thanks! @LeiWang1999

coderabbitai Bot reviewed May 8, 2026

Comment thread src/backend/cuda/codegen/codegen_cuda.cc Outdated

Comment thread src/backend/cuda/codegen/codegen_cuda.cc Outdated

Comment thread tilelang/cuda/intrinsics/macro/mma_macro_generator.py Outdated

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 3 times, most recently from 3e5823d to 7f254a9 Compare May 8, 2026 16:39

TerminusAkivili changed the title ~~[feature][Blackwell] Add SM120 FP4 and A8W4 GEMM support~~ [feature][Blackwell] Add SM120 float4_e2m1fn FP4 GEMM support. May 8, 2026

TerminusAkivili changed the title ~~[feature][Blackwell] Add SM120 float4_e2m1fn FP4 GEMM support.~~ [Feature][Blackwell] Add SM120 T.float4_e2m1fn FP4 GEMM support. May 11, 2026

TerminusAkivili marked this pull request as draft May 11, 2026 15:26

TerminusAkivili marked this pull request as ready for review May 11, 2026 16:06

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from cb5bf3d to 795cb39 Compare May 12, 2026 07:46

TerminusAkivili closed this May 16, 2026

TerminusAkivili deleted the sm120-fp4-a8w4-clean-pr branch May 16, 2026 16:09

TerminusAkivili restored the sm120-fp4-a8w4-clean-pr branch May 16, 2026 16:43

TerminusAkivili reopened this May 16, 2026

LeiWang1999 self-requested a review May 19, 2026 09:24

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 3 times, most recently from aa1a1c5 to be0a064 Compare May 28, 2026 10:12

TerminusAkivili marked this pull request as draft May 28, 2026 12:37

TerminusAkivili marked this pull request as ready for review May 28, 2026 13:21

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from 3ca55aa to 7fa0f92 Compare May 28, 2026 14:22

TerminusAkivili marked this pull request as draft May 28, 2026 18:43

TerminusAkivili marked this pull request as ready for review May 28, 2026 23:41

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 4 times, most recently from 586a8e3 to 5d60b15 Compare May 30, 2026 16:56

TerminusAkivili marked this pull request as draft May 30, 2026 17:54

TerminusAkivili marked this pull request as ready for review May 30, 2026 18:33

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from 5d60b15 to 5e4ed5d Compare May 30, 2026 18:39

Add SM120 FP4 and A8W4 GEMM support

241139a

TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from f0be868 to 241139a Compare June 17, 2026 10:17
