[Feature][Blackwell] Add SM120 T.float4_e2m1fn FP4 GEMM support. #2171

Open
TerminusAkivili wants to merge 1 commit into tile-ai:main from TerminusAkivili:sm120-fp4-a8w4-clean-pr

@TerminusAkivili TerminusAkivili commented May 8, 2026

Contributor

Summary

This PR adds SM120 fragment-MMA GEMM support for semantic T.float4_e2m1fn
operands.

TileLang programs continue to declare FP4 operands as T.float4_e2m1fn.
For the SM120 performance path, lowering maps those semantic operands onto the
hidden T.float4_e2m1_unpacked byte carrier internally. Users do not need to
write FP4 GEMM operands as uint8 tensors in TileLang programs.

Supported SM120 GEMM combinations:

| A dtype | B dtype | Accumulator | SM120 path |
| --- | --- | --- | --- |
| T.float4_e2m1fn | T.float4_e2m1fn | T.float32 | FP4 x FP4 fragment MMA |
| T.float8_e4m3fn | T.float4_e2m1fn | T.float32 | A8W4 fragment MMA |
| T.float4_e2m1fn | T.float8_e4m3fn | T.float32 | W4A8 fragment MMA |
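
The supported combinations can be read as a small dispatch rule. The following is a minimal sketch in plain Python, not TileLang's actual validation code: the helper name `select_sm120_path` and the use of plain strings for dtypes are assumptions made for illustration, and it covers only the FP4-related pairs listed here.

```python
# Hypothetical illustration of the supported-combination list; not TileLang's real code.
FP4 = "float4_e2m1fn"
FP8 = "float8_e4m3fn"

# (A dtype, B dtype) -> SM120 path; every entry accumulates into float32.
SM120_FP4_PATHS = {
    (FP4, FP4): "FP4 x FP4 fragment MMA",
    (FP8, FP4): "A8W4 fragment MMA",
    (FP4, FP8): "W4A8 fragment MMA",
}


def select_sm120_path(a_dtype: str, b_dtype: str, accum_dtype: str = "float32") -> str:
    """Return the SM120 path name for a listed pair, or raise ValueError."""
    if accum_dtype != "float32":
        raise ValueError(f"unsupported accumulator dtype: {accum_dtype}")
    key = (a_dtype, b_dtype)
    if key not in SM120_FP4_PATHS:
        raise ValueError(f"unsupported SM120 FP4 GEMM pair: A={a_dtype}, B={b_dtype}")
    return SM120_FP4_PATHS[key]


if __name__ == "__main__":
    print(select_sm120_path(FP8, FP4))
```

Keeping the pairs in one explicit mapping makes an unsupported pair fail with a clear error instead of falling through to a default path.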

Design Goals

  • Preserve semantic TileLang signatures: FP4 tensors stay T.float4_e2m1fn at
    the language level.
  • Use T.float4_e2m1_unpacked as the hidden physical carrier for the SM120 FP4
    GEMM path.
  • Align the SM120 FP4 performance path with the byte-carrier / ordinary
    ldmatrix model used by #2182 ([Feature] Support Blackwell FP4(float4_e2m1fn) GEMM for SM100 & SM120).
  • Add SM120 FP4/A8W4/W4A8 support without changing existing int4/uint4
    ldmatrix behavior.
  • Reject unsupported FP4 MMA K tile shapes early instead of generating kernels
    that silently skip a K tail.

Design

The key separation is between the public dtype and the physical carrier:

  • TileLang program signatures use T.float4_e2m1fn.
  • SM120 GEMM lowering uses hidden custom[float4_e2m1_unpacked]8 shared/local
    carrier buffers where the hardware path needs byte slots.
  • Runtime tensors for this SM120 path may use byte-compatible storage with
    logical (M, K) / (N, K) shapes.
  • The FFI binder accepts byte-compatible runtime tensors for semantic SM120 FP4
    operands while preserving the public T.float4_e2m1fn handle dtype.
  • The SM120 MMA path shifts FP4 payload bits only for FP4 operands before
    calling the CuTe SM120 F8/F6/F4 atom.

Main Changes

CUDA Templates

  • Add SM120 cute::SM120_16x8x32_TN dispatch for FP4xFP4, FP8xFP4, and
    FP4xFP8 into FP32.
  • Apply the FP4 operand register shift only to operands that are actually FP4.
  • Bridge TileLang FP4 template types to CuTe FP4 types while keeping existing
    packed helper types.

CUDA Lowering

  • Lower semantic SM120 FP4 GEMM operands through hidden
    float4_e2m1_unpacked carrier buffers where required by the performance path.
  • Emit ordinary tl::ptx_ldmatrix_x* for the hidden unpacked-carrier path.
  • Keep packed FP4 fallback vector load/store safe for odd or not-proven-even
    logical offsets by lowering those cases through per-lane nibble helpers.
  • Preserve shared-memory alias information for CuTeDSL codegen.
  • Preserve FP4 storage casts when layout lowering changes the physical carrier
    dtype.

Python Lowering

  • Use T.float4_e2m1_unpacked as the hidden local/shared carrier for semantic
    SM120 FP4 operands.
  • Build SM120 FP4 shared layouts through an unpacked-carrier view while keeping
    public buffer dtypes semantic.
  • Validate mixed A/B dtypes explicitly for A8W4 and W4A8.
  • Reject T.gemm block K values that are not divisible by the selected
    instruction K tile.

Examples

  • Add mainline-style SM120 examples for FP4xFP4 and A8W4:
    • examples/gemm_fp4/example_gemm_fp4_sm120.py
    • examples/gemm_fp4/example_gemm_a8w4_sm120.py
  • The examples keep semantic T.float4_e2m1fn kernel signatures and use
    byte-compatible host tensors only as an interoperability detail.
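
For host-side reference checks, byte-compatible FP4 storage has to be decoded back to floats. The following is a minimal sketch using only the standard library. The E2M1 value table (1 sign bit, 2 exponent bits, 1 mantissa bit) is standard; the helper names and the nibble order (even logical element in the low nibble) are assumptions and may not match the examples in this PR.

```python
from typing import List

# Magnitudes for the 3 non-sign bits of an E2M1 code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (bit 3 is the sign) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fp4_bytes(packed: bytes, low_nibble_first: bool = True) -> List[float]:
    """Decode packed FP4 storage (two codes per byte) into a flat list of floats.

    The nibble order is an assumption; pass low_nibble_first=False for the opposite.
    """
    values: List[float] = []
    for byte in packed:
        low, high = byte & 0xF, byte >> 4
        first, second = (low, high) if low_nibble_first else (high, low)
        values.append(decode_e2m1(first))
        values.append(decode_e2m1(second))
    return values


if __name__ == "__main__":
    print(unpack_fp4_bytes(bytes([0x21, 0xF7])))
```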

Tests

  • Add focused SM120 lowering coverage in
    testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py.
  • The tests check that semantic FP4 GEMM uses hidden unpacked shared/local
    carriers, ordinary tl::ptx_ldmatrix_x*, and no tl::ptx_ldmatrix_b4x16 on
    the main performance path.

Why The Review Fixes Matter

FP4 storage has two concerns that should not be conflated: the user-facing dtype
and the physical carrier used by the hardware path. The hidden
float4_e2m1_unpacked carrier lets SM120 GEMM keep the semantic
T.float4_e2m1fn API while matching the ordinary-ldmatrix byte-carrier
performance model used by #2182.

For packed fallback cases, FP4 byte storage is byte-addressed while logical FP4
elements are nibble-addressed. A vector reinterpret load/store is safe only when
the logical base offset is known to be even. If the offset is odd, or if codegen
cannot prove it is even, vectorized byte reinterpretation can read or write the
wrong nibble without producing a compilation error. This PR routes those cases
through per-lane nibble helpers.
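
The odd-offset hazard is easy to reproduce in a few lines. The sketch below models packed FP4 storage as Python bytes and compares a per-lane nibble load with a byte-reinterpreting vector load. The function names are hypothetical and the nibble order (even logical element in the low nibble) is an assumption; the real helpers are in the CUDA codegen.

```python
from typing import List


def load_fp4_nibble(storage: bytes, logical_index: int) -> int:
    """Per-lane load: pick the byte, then the nibble, for one logical FP4 element."""
    byte = storage[logical_index // 2]
    return (byte >> 4) if logical_index % 2 else (byte & 0xF)


def load_vector_per_lane(storage: bytes, base: int, lanes: int) -> List[int]:
    """Always correct: every lane resolves its own byte and nibble."""
    return [load_fp4_nibble(storage, base + lane) for lane in range(lanes)]


def load_vector_reinterpret(storage: bytes, base: int, lanes: int) -> List[int]:
    """Byte reinterpretation: reads lanes // 2 bytes starting at base // 2.

    Only correct when base is even, because an odd base starts mid-byte.
    """
    start = base // 2
    out: List[int] = []
    for byte in storage[start:start + lanes // 2]:
        out.extend((byte & 0xF, byte >> 4))
    return out


if __name__ == "__main__":
    storage = bytes([0x10, 0x32, 0x54])  # logical elements 0, 1, 2, 3, 4, 5
    print(load_vector_per_lane(storage, 1, 4), load_vector_reinterpret(storage, 1, 4))
```

With an even base both loads agree; with an odd base the reinterpreting load is shifted by one element and nothing reports an error.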

SM120 FP4/A8W4/W4A8 MMA consumes K in fixed m16n8k32 instruction chunks.
Allowing a block_K such as 48 would execute only the representable K=32
portion and miss the K tail. This PR turns that silent numerical error into an
explicit unsupported-shape error.
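
The arithmetic behind the K-tail problem, and the shape check that replaces it, can be sketched as follows. The function names are hypothetical and the error wording is illustrative, not the message emitted by this PR.

```python
INSTR_K = 32  # K extent of one m16n8k32 instruction


def covered_k_without_check(block_k: int, instr_k: int = INSTR_K) -> int:
    """K elements actually consumed if the loop count is block_k // instr_k."""
    return (block_k // instr_k) * instr_k


def check_block_k(block_k: int, instr_k: int = INSTR_K) -> int:
    """Return the number of MMA steps per block, or reject a block_k with a tail."""
    if block_k <= 0 or block_k % instr_k != 0:
        raise ValueError(
            f"block_K={block_k} is not a positive multiple of the instruction K tile {instr_k}"
        )
    return block_k // instr_k


if __name__ == "__main__":
    block_k = 48
    skipped = block_k - covered_k_without_check(block_k)
    print(f"block_K={block_k}: {skipped} K elements would be skipped")
    print("block_K=64 steps:", check_block_k(64))
```

For block_K = 48 the unchecked loop covers 32 elements and drops the remaining 16, which is the silent error described above.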

Validation

Local SM120 validation used an RTX PRO 6000 / compute capability 12.0
environment.

Build and focused examples:

cmake --build build -j$(nproc)
PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python examples/gemm_fp4/example_gemm_fp4_sm120.py
PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python examples/gemm_fp4/example_gemm_a8w4_sm120.py

Generated CUDA and TIR were inspected for the expected SM120 FP4 markers:

public TIR handles: float4_e2m1fn
internal carrier: custom[float4_e2m1_unpacked]8
ordinary ldmatrix: tl::ptx_ldmatrix_x*
not present on the main performance path: tl::ptx_ldmatrix_b4x16
SM120 m16n8k32 FP4/A8W4/W4A8 dtype dispatch
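
A marker inspection like this can be scripted as a plain string check. The sketch below takes the generated CUDA source as a string; how that string is obtained is left out, and the helper name `check_sm120_fp4_markers` is hypothetical rather than part of the PR's test suite.

```python
from typing import List


def check_sm120_fp4_markers(cuda_source: str) -> List[str]:
    """Return a list of problems found in generated CUDA source (empty means OK)."""
    problems: List[str] = []
    # Ordinary ldmatrix helpers are expected on the main performance path.
    if "tl::ptx_ldmatrix_x" not in cuda_source:
        problems.append("missing ordinary tl::ptx_ldmatrix_x* call")
    # The b4x16 variant should not appear on that path.
    if "tl::ptx_ldmatrix_b4x16" in cuda_source:
        problems.append("unexpected tl::ptx_ldmatrix_b4x16 call")
    return problems


if __name__ == "__main__":
    sample = "tl::ptx_ldmatrix_x4(ptr, regs);"
    print(check_sm120_fp4_markers(sample))
```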

Focused tests:

PYTHONPATH=$PWD${PYTHONPATH:+:$PYTHONPATH} python -m pytest \
  testing/python/language/test_tilelang_language_float4_e2m1_unpacked_dtype.py \
  testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py -q

Observed focused-test results:

testing/python/language/test_tilelang_language_float4_e2m1_unpacked_dtype.py: 5 passed
testing/python/language/test_tilelang_language_float4_e2m1_unpacked_gemm.py: 2 passed

Performance comparison against #2182 used the same SM120 GEMM shapes for
FP4xFP4, A8W4, and W4A8, with 6 shapes, 2 block_K settings, and 9 repeats per
point. No block-scale cases were included.

Latency delta is (this PR latency / #2182 latency) - 1:

mean geomean:
FP4xFP4  +0.41%
A8W4     +0.55%
W4A8     +0.75%
all      +0.57%

median geomean:
FP4xFP4  +1.37%
A8W4     -1.25%
W4A8     +0.23%
all      +0.11%

Equivalent TOPS delta is (#2182 latency / this PR latency) - 1:

mean geomean:
FP4xFP4  -0.41%
A8W4     -0.55%
W4A8     -0.74%
all      -0.57%

median geomean:
FP4xFP4  -1.35%
A8W4     +1.27%
W4A8     -0.23%
all      -0.11%

The overall result is effectively performance-neutral versus #2182: mean TOPS
is about 0.57% lower, while median TOPS is about 0.11% lower.
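
The TOPS figures follow from the latency figures by inverting the latency ratio. A short sketch of that conversion, with a hypothetical helper name and the rounded percentages reported above as inputs:

```python
def tops_delta_from_latency_delta(latency_delta_pct: float) -> float:
    """Convert a latency delta in percent into the equivalent TOPS delta in percent.

    latency ratio r = 1 + delta / 100; throughput scales as 1 / r for a fixed workload.
    """
    ratio = 1.0 + latency_delta_pct / 100.0
    return (1.0 / ratio - 1.0) * 100.0


if __name__ == "__main__":
    reported_latency_deltas = {"FP4xFP4": 0.41, "A8W4": 0.55, "W4A8": 0.75, "all": 0.57}
    for name, delta in reported_latency_deltas.items():
        print(f"{name}: {tops_delta_from_latency_delta(delta):+.2f}%")
```

Because the inputs are already rounded to two decimals, the recomputed values agree with the reported TOPS deltas only to that precision.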

Notes And Non-Goals

  • Mixed A8W4/W4A8 dispatch is selected from explicit FP8/FP4 dtype pairs.
  • float4_e2m1_unpacked is a hidden physical carrier for the SM120 path, not
    the public GEMM dtype users are expected to write.
  • uint8 remains a runtime storage/interoperability detail for byte-compatible
    host tensors, not the public TileLang FP4 GEMM dtype.
  • Existing int4/uint4 ldmatrix offset behavior stays on the existing path.

@coderabbitai

coderabbitai Bot commented May 8, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR implements SM120 (compute capability 12.0) FP4 (float4_e2m1fn) GEMM support across TileLang: examples and host unpacking, CUDA/TI codegen FP4 storage/indexing/vector/scalar handling, FP4-aware cp.async injection, b4x16 ldmatrix helpers, CuTe SM120 MMA dispatch for FP4/mixed operands, layout/macro generation changes, and GemmMMA integration.

Changes

SM120 FP4 GEMM Support

| Layer / File(s) | Summary |
| --- | --- |
| Examples & Host Helpers (examples/gemm_fp4/...) | Adds FP4 LUT constant, unpack_fp4_storage_to_float, require_sm120(), TileLang kernel generators (matmul_a8w4 / matmul_fp4), main() harnesses, deterministic/random inputs, zero-input checks, float32 reference comparisons, and error assertions. |
| TL Templates: LDSM & CUDA FP4 Types (src/tl_templates/cuda/ldsm.h, src/tl_templates/cuda/cuda_fp4.h) | Adds SM120-only ptx_ldmatrix_b4x16_x1/x2/x4 helpers and expands cuda_fp4 compile-time guards and make_fp4_e2_64_t. |
| MMA Dispatch & Instruction Support (src/tl_templates/cuda/gemm_mma.h, src/tl_templates/cuda/instruction/mma.h) | Maps fp4_e2_t to CuTe float_e2m1_t, registers SM120 16x8x32 TN dispatchers for FP4×FP4 and mixed FP8/FP4, and updates tl::mma_sync to left-shift FP4 operands before dispatcher invocation. |
| CUDA Codegen: Buffer / Vector / Scalar Access (src/backend/cuda/codegen/codegen_cuda.cc, src/backend/cuda/codegen/codegen_cuda.h) | Centralizes FP4 storage classification, adds IsFp4* helpers and GetFp4PaddedSharedIndex, applies padded-shared index remapping and packed-byte divisor logic, and implements FP4-aware scalar/vector load-store codegen and cp.async/ldmatrix emission paths. |
| PTX Async Injector & FP4-padded cp.async (src/transform/lower_ptx_async_copy.cc, src/transform/ptx_async_copy_injector.h) | Introduces fp4_padded_shared_copy flag and FP4-padded cp.async specialization that splits transfers into 16-FP4-element segments with padded index remapping; forwards flag through InjectPTXAsyncCopy/PTXAsyncCopyInjector. |
| Copy Lowering & LDSM Geometry (src/backend/cuda/op/copy.cc) | Threads FP4 padded mode into Copy lowering, gates FP4 ldmatrix lowering to SM120 and non-transposed paths, computes elems_per_reg/elems_per_inst for 4-bit types, and updates vectorization, access_ptr extents, local loads, and loop unroll trip counts. |
| Copy Eligibility Analysis (src/backend/cuda/op/copy_analysis.cc) | Adds FP4-specific gating: CheckLDSMCopy requires SM120 and exact src/dst dtype match for FP4; CheckSTSMCopy rejects STSM copies when either side is FP4. |
| Macro Generation & Layout Utilities (tilelang/cuda/intrinsics/macro/mma_macro_generator.py, tilelang/cuda/intrinsics/layout/*.py, tilelang/cuda/intrinsics/layout/utils.py) | Special-cases float4_e2m1fn to k_dim=32, routes 4-/8-bit types through shared_16x32→mma_32x16 transforms, computes FP4-dependent access extents (4*num), and adds FP4-specific layout mapping helpers plus get_ldmatrix_offset support. |
| GemmMMA Integration (tilelang/cuda/op/gemm/gemm_mma.py) | Adds FP8/FP4 dtype predicates, _validate_mma_dtypes() to enforce allowed mixed operand pairs (FP8+FP4 or identical), and allocates local fragments per operand dtype during lowering. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • LJC00118
  • LeiWang1999
  • SiriusNEO

Poem

🐰 I nibble nibbles, pack them small,
SM120 wakes — kernels call.
Padded rows and cp.async chime,
TileLang hops through tiled time.
Rabbits cheer: GEMM runs fine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.10% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main feature: adding SM120 support for FP4 (T.float4_e2m1fn) GEMM operations, which aligns with the substantial changes across CUDA templates, lowering, Python bindings, and examples. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@github-actions

github-actions Bot commented May 8, 2026


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/tl_templates/cuda/cuda_fp4.h (1)

166-187: ⚡ Quick win

Verify register allocation for fp4_e2_t values[64] in device code.

The 64-element local array is constant-indexed throughout (values[0] through values[63]), so nvcc at -O2+ should scalar-replace it into registers. However, unlike the explicitly-parameterized make_fp4_e2_32_t, which guarantees register-only arguments, register spilling to local memory is possible at lower optimisation levels or with larger surrounding register pressure. Consider adding a __forceinline__ annotation to maximise inlining and scalar replacement at call sites.

Proposed annotation
-template <typename... Args>
-TL_DEVICE fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
+template <typename... Args>
+TL_DEVICE __forceinline__ fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tl_templates/cuda/cuda_fp4.h` around lines 166 - 187, The local array
fp4_e2_t values[64] in make_fp4_e2_64_t may be spilled under some compile
conditions; annotate the function to force inlining (e.g., add a
__forceinline__/always-inline device inline attribute to make_fp4_e2_64_t) so
nvcc can scalar-replace values[0]..values[63] into registers and inline the
make_fp4_e2_32_t calls; update the function declaration for make_fp4_e2_64_t
accordingly (keeping fp4_e2_t values[64] and the existing make_fp4_e2_32_t
usages unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around line 1973-2003: The FP4 padded shared-memory vector path
(IsFp4PaddedSharedStorage + code using GetFp4PaddedSharedIndex and the
byte_offset lambda when constructing the reinterpret cast for t.lanes()) can
incorrectly span the padded 16-element row boundary; add a guard or split logic:
either assert the logical base alignment (e.g., Ensure base % 16 == 0 for the
requested load/store) or detect when the access crosses a 16-element row by
computing the start and end logical indices (base + offset and base + offset +
t.lanes()-1) and comparing their 16-element row indices (truncdiv(..., 16)); if
it crosses, split the operation into two row-aligned fragments (like the
existing t.lanes()==32 two-fragment approach) and merge them, otherwise keep the
current single contiguous byte reinterpretation; apply the same fix to the other
similar blocks identified (around the other ranges mentioned).
- Around line 4428-4444: The allocator treats only scope == "local" as the path
that emits local backing arrays but FP4 fragments use the semantic storage name
"local.fragment", so allocations for these still hit the unsupported-scope
branch; update the scope checks used around is_int4_scalar_local, the FP4
alignas(16) branch, and the place that prints/omits the storage scope to treat
"local.fragment" as equivalent to "local" (either normalize scope to "local"
earlier or change conditions from scope == "local" to (scope == "local" || scope
== "local.fragment")), ensuring PrintStorageScope/PrintType and the
backing-array emission path handle FP4 fragments the same as regular local
allocations (references: is_int4_scalar_local, op->dtype.is_float4_e2m1fn(),
PrintStorageScope, PrintType, and the "local.fragment" semantic storage).

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py`:
- Around line 121-124: The FP4 fast-path in mma_macro_generator.py sets
self.k_dim = 32 without respecting self.chunk, causing micro_size_k to exceed
chunk when chunk < 32; update the FP4 branch in the initializer (the block
setting self.k_dim) to clamp k_dim by self.chunk (e.g., self.k_dim = min(32,
self.chunk)) and add the same clamp/guard in the subclass override (the code
around lines 873–877) so both places respect chunk; optionally emit a clear
ValueError or assertion if chunk < required minimum to fail early with a helpful
message referencing the dtype and chunk size.

---

Nitpick comments:
In `@src/tl_templates/cuda/cuda_fp4.h`:
- Around line 166-187: The local array fp4_e2_t values[64] in make_fp4_e2_64_t
may be spilled under some compile conditions; annotate the function to force
inlining (e.g., add a __forceinline__/always-inline device inline attribute to
make_fp4_e2_64_t) so nvcc can scalar-replace values[0]..values[63] into
registers and inline the make_fp4_e2_32_t calls; update the function declaration
for make_fp4_e2_64_t accordingly (keeping fp4_e2_t values[64] and the existing
make_fp4_e2_32_t usages unchanged).
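
The row-boundary concern raised for the FP4 padded shared-memory vector path comes down to index arithmetic. The sketch below splits a logical access into fragments that each stay inside one 16-element row. It generalizes the two-fragment idea mentioned in the comment, uses a hypothetical function name, and is not the codegen's implementation.

```python
from typing import List, Tuple

ROW_ELEMS = 16  # logical FP4 elements per padded row, as described in the review comment


def split_at_row_boundary(start: int, lanes: int, row: int = ROW_ELEMS) -> List[Tuple[int, int]]:
    """Split [start, start + lanes) into (start, length) fragments within single rows."""
    fragments: List[Tuple[int, int]] = []
    remaining = lanes
    while remaining > 0:
        room_in_row = row - (start % row)  # elements left before the next row begins
        length = min(room_in_row, remaining)
        fragments.append((start, length))
        start += length
        remaining -= length
    return fragments


if __name__ == "__main__":
    print(split_at_row_boundary(0, 16))   # aligned: one fragment
    print(split_at_row_boundary(12, 8))   # crosses a row: two fragments
```

An access needs splitting exactly when its first and last logical indices fall in different rows, that is, when `start // 16 != (start + lanes - 1) // 16`.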

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a09f3145-ce2d-4b0d-bb75-d916a099b2be

📥 Commits

Reviewing files that changed from the base of the PR and between a797e51 and 140f774.

📒 Files selected for processing (16)
  • examples/gemm_fp4/example_gemm_a8w4_sm120.py
  • examples/gemm_fp4/example_gemm_fp4_sm120.py
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/backend/cuda/codegen/codegen_cuda.h
  • src/backend/cuda/op/copy.cc
  • src/backend/cuda/op/copy_analysis.cc
  • src/tl_templates/cuda/cuda_fp4.h
  • src/tl_templates/cuda/gemm_mma.h
  • src/tl_templates/cuda/instruction/mma.h
  • src/tl_templates/cuda/ldsm.h
  • src/transform/lower_ptx_async_copy.cc
  • src/transform/ptx_async_copy_injector.h
  • tilelang/cuda/intrinsics/layout/mma_layout.py
  • tilelang/cuda/intrinsics/layout/utils.py
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/cuda/op/gemm/gemm_mma.py

Comment thread src/backend/cuda/codegen/codegen_cuda.cc Outdated
Comment thread src/backend/cuda/codegen/codegen_cuda.cc Outdated
Comment thread tilelang/cuda/intrinsics/macro/mma_macro_generator.py Outdated
@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 3 times, most recently from 3e5823d to 7f254a9 Compare May 8, 2026 16:39
@TerminusAkivili TerminusAkivili changed the title [feature][Blackwell] Add SM120 FP4 and A8W4 GEMM support [feature][Blackwell] Add SM120 float4_e2m1fn FP4 GEMM support. May 8, 2026
@TerminusAkivili TerminusAkivili changed the title [feature][Blackwell] Add SM120 float4_e2m1fn FP4 GEMM support. [Feature][Blackwell] Add SM120 T.float4_e2m1fn FP4 GEMM support. May 11, 2026
@TerminusAkivili TerminusAkivili marked this pull request as draft May 11, 2026 15:26
@TerminusAkivili TerminusAkivili marked this pull request as ready for review May 11, 2026 16:06
@TerminusAkivili

TerminusAkivili commented May 11, 2026

Contributor Author

Hi @LeiWang1999, no rush at all. Feel free to check it whenever it's convenient for you. I'd love your feedback. Thank you!

@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from cb5bf3d to 795cb39 Compare May 12, 2026 07:46
@Hale423

Hale423 commented May 15, 2026

Contributor

Oh, it seems this PR overlaps with part of my PR #2182. Just wanted to clarify that there was no intent to duplicate the work, as it is carried over from the earlier FP4 branch (I didn't notice this PR at the moment I created my new one). I'm totally okay with coordinating scope if maintainers give any feedback. Thanks for your work.

@TerminusAkivili

TerminusAkivili commented May 15, 2026

Contributor Author

Thanks @Hale423 for the clarification! I took a closer look at the SM120 overlap, and I think the two PRs are taking slightly different directions.
My understanding is that #2182 uses a uint8 carrier for the SM120 FP4 path and maps that into FP4/F8F6F4 MMA lowering, which gives it a more byte-aligned fast path. #2171 keeps the TileLang-facing API semantic: kernels use T.float4_e2m1fn / T.float8_e4m3fn, while packed storage, shared layout, ldmatrix, and fragment handling stay inside lowering/codegen.
In local default-shape testing, #2182 is faster for FP4xFP4 by around 8-9% on average. That seems expected because both A and B pay the FP4 shared-layout/copy overhead in #2171’s semantic b4x16_p64 path, while #2182’s byte-carrier path is closer to the packed fast path. For A8W4 the gap is much smaller, close to parity on larger shapes, since only one operand is FP4.
Happy to coordinate scope based on maintainer feedback. One possible follow-up would be to keep #2171’s semantic public API while adding an internal byte-carrier-style fast path for FP4xFP4.

@TerminusAkivili

Contributor Author

The downside is that it introduces FP4xFP4-specific logic across layout, copy analysis, pipeline planning, codegen, and CUDA helpers, which increases maintenance cost and expands the scope of this PR.
A cleaner long-term direction may be a more general packed low-bit dtype storage lowering mechanism, where semantic dtype, packed global storage, shared-memory carrier/layout, and MMA load path are described in a unified way. But that would touch a much larger surface area.
So I think which direction to take, and how to structure it, should depend on the maintainers’ feedback and guidance.

@Hale423

Hale423 commented May 16, 2026

Contributor

Confirmed, feel free to implement any idea on SM120, I'm willing to coordinate scope, thanks for sharing!

@TerminusAkivili TerminusAkivili deleted the sm120-fp4-a8w4-clean-pr branch May 16, 2026 16:09
@TerminusAkivili TerminusAkivili restored the sm120-fp4-a8w4-clean-pr branch May 16, 2026 16:43
@TerminusAkivili

TerminusAkivili commented May 16, 2026

Contributor Author

@Hale423 I also tried a separate RFC version that keeps the public API as T.float4_e2m1fn, but internally uses a byte-carrier lowering path similar to #2182. Its benchmark results are roughly on par with #2182.
The branch/commit is here for reference:
TerminusAkivili@b8e2818
Since that version touches a larger surface area, I’m not planning to include it in #2171 before maintainer feedback. If we also count that direction, I think the core scopes are no longer conflicting.

@Hale423

Hale423 commented May 18, 2026

Contributor

> @Hale423 I also tried a separate RFC version that keeps the public API as T.float4_e2m1fn, but internally uses a byte-carrier lowering path similar to #2182. Its benchmark results are roughly on par with #2182. The branch/commit is here for reference: TerminusAkivili@6c118bc Since that version touches a larger surface area, I’m not planning to include it in #2171 before maintainer feedback. If we also count that direction, I think the core scopes are no longer conflicting.

Got it, thanks for your clarification

@LeiWang1999 LeiWang1999 self-requested a review May 19, 2026 09:24
@LeiWang1999

Member

Looks interesting, but for f8f6f4, I think we need to introduce a hidden type, float4_e2m1_unpacked – that's exactly what we're working on. Thanks.

@TerminusAkivili

Contributor Author

@LeiWang1999 Thanks for the update and for working on this. I’ll keep an eye on the progress.

@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 3 times, most recently from aa1a1c5 to be0a064 Compare May 28, 2026 10:12
@TerminusAkivili TerminusAkivili marked this pull request as draft May 28, 2026 12:37
@TerminusAkivili TerminusAkivili marked this pull request as ready for review May 28, 2026 13:21
@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from 3ca55aa to 7fa0f92 Compare May 28, 2026 14:22
@TerminusAkivili

Contributor Author

I’d appreciate it if you could review this PR when you have time. Thanks! @LeiWang1999

@TerminusAkivili TerminusAkivili marked this pull request as draft May 28, 2026 18:43
@TerminusAkivili TerminusAkivili marked this pull request as ready for review May 28, 2026 23:41
@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch 4 times, most recently from 586a8e3 to 5d60b15 Compare May 30, 2026 16:56
@TerminusAkivili TerminusAkivili marked this pull request as draft May 30, 2026 17:54
@TerminusAkivili TerminusAkivili marked this pull request as ready for review May 30, 2026 18:33
@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from 5d60b15 to 5e4ed5d Compare May 30, 2026 18:39
@TerminusAkivili TerminusAkivili force-pushed the sm120-fp4-a8w4-clean-pr branch from f0be868 to 241139a Compare June 17, 2026 10:17

3 participants