Description
TileLang JIT compilation hangs indefinitely (12+ hours) on NVIDIA RTX PRO 6000 Blackwell (sm_120) when running DeepSeek-V4-Flash official inference code. The process consumes 100% GPU and 99% CPU but produces no output, no cache files, and no error messages.
This is a different issue from previously reported sm_120 bugs (which showed clear error messages like "Unsupported target for gemm" or shared memory overflow). In our case, there is no error at all — the compiler simply appears to enter an infinite loop or deadlock during kernel code generation.
Environment
- GPU: 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB each)
- GPU Architecture: sm_120
- CUDA Version: 13.2 (driver 570.x)
- PyTorch: 2.13.0.dev20260531+cu132
- Python: 3.12
- TileLang: Tested both 0.1.8 and 0.1.10 (prebuilt wheels from PyPI)
- OS: Ubuntu 24.04
- Container: Podman rootless with nvidia/cuda:13.2.0-devel-ubuntu24.04
Reproduction
Model
- Model: DeepSeek-V4-Flash (official inference code from deepseek-ai/deepseek-v4-flash)
- Architecture: MoE with 256 routed experts, FP4 expert weights, FP8 shared weights
- Model Size: ~149 GB total (77 GB per MP=2 shard)
- Custom Kernels: FP4 GEMM (fp4_gemm), FP8 GEMM (fp8_gemm), sparse attention (sparse_attn), Hyper-Connection Sinkhorn (hc_split_sinkhorn)
Steps to Reproduce
- Load the model with official inference code:
import json
import torch
from safetensors.torch import load_model
from model import Transformer, ModelArgs
from generate import generate
args = ModelArgs(**json.load(open("config.json")))
args.max_batch_size = 1
args.max_seq_len = 128 # Even with tiny context!
model = Transformer(args).cuda()
load_model(model, "model0-mp2.safetensors", strict=False)
- Run a single-token forward pass:
prompt_tokens = tokenizer.encode("Hi")
completion = generate(model, [prompt_tokens], 1, tokenizer.eos_token_id, 0.0)
- Observe: Process hangs indefinitely after "Model loaded" message
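To narrow down where the hang sits, the reproduction can be run with Python's standard `faulthandler` watchdog armed before the first forward pass. The sketch below is only a diagnostic aid under our assumptions: `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, and the 600-second interval is arbitrary. It prints the Python-level stack of every thread at a fixed interval, which should at least show which Python call the process never returns from.

```python
import faulthandler
import signal
import sys


def arm_hang_watchdog(interval_s=600.0, stream=None):
    """Dump all Python thread stacks to `stream` every `interval_s` seconds."""
    stream = sys.stderr if stream is None else stream
    # The periodic dump is driven by faulthandler's own watchdog thread, so it
    # does not depend on the stuck thread returning to the interpreter.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)
    # Optional on-demand dump via `kill -USR1 <pid>` (POSIX only).
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    return stream


def disarm_hang_watchdog():
    """Stop the periodic dumps once the forward pass has returned."""
    faulthandler.cancel_dump_traceback_later()
```

In the steps above, `arm_hang_watchdog()` would be called right before `generate(...)` and `disarm_hang_watchdog()` right after it. The dump only covers Python frames; native compiler frames would still need gdb or a similar tool.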
Observed Behavior
- ✅ Model loads successfully (~83 GB on GPU 0, ~80 GB on GPU 1)
- ✅ Health check (FastAPI) responds instantly
- ❌ First forward pass triggers JIT compilation
- ❌ Compilation never completes (tested 12+ hours)
- ❌ No TileLang cache files written (/root/.tilelang/cache/ remains empty)
- ❌ No .cu, .ptx, or .so artifacts produced
- ❌ No error messages, no exceptions, no timeouts
- ❌ GPUs at 100% utilization, CPU at 99% (92 threads active)
- ❌ Process must be killed with SIGKILL
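The "no artifacts" observations can be re-checked with a short script like the one below. It is a minimal sketch: `find_artifacts` is a hypothetical helper, and the suffix list simply mirrors the file types named above. It walks a directory tree and lists any generated source, PTX, or shared-object files with their sizes.

```python
import os

# File types we looked for during the hang.
ARTIFACT_SUFFIXES = (".cu", ".ptx", ".so")


def find_artifacts(cache_dir, suffixes=ARTIFACT_SUFFIXES):
    """Return sorted (path, size_in_bytes) pairs for matching files under cache_dir."""
    found = []
    # os.walk yields nothing for a missing directory, so the result is then [].
    for root, _dirs, files in os.walk(os.path.expanduser(cache_dir)):
        for name in files:
            if name.endswith(suffixes):
                path = os.path.join(root, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found)
```

Calling `find_artifacts("/root/.tilelang/cache/")` inside the container while the process hangs is how the empty-cache result can be confirmed. An empty list would be consistent with the hang occurring before any code is emitted, though it does not prove it.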
What We've Tried
| Attempt | Result |
|---|---|
| TileLang 0.1.8 | ❌ Hangs >12 hours |
| TileLang 0.1.10 | ❌ Hangs >30 minutes |
| TILELANG_TARGET="cuda -arch=sm_89" | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=torch | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=nvrtc | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=cython | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=dlpack | ❌ Still hangs |
| Context size = 128 tokens | ❌ Same hang |
| Context size = 1024 tokens | ❌ Same hang |
| Single token generation | ❌ Same hang |
| NCCL timeout = 24 hours | ✅ Prevents timeout, but compilation never finishes |
Expected Behavior
TileLang should do one of the following:
- Compile the kernels within a reasonable time (< 1 hour for 61 layers), OR
- Produce a clear error message if sm_120 is not supported for the specific kernel patterns used, OR
- Time out with an informative error once a configurable compilation timeout expires
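Until something like the third option exists, a deadline can be enforced from outside by running each kernel's first call in a child process and killing it when time runs out. The following is a minimal sketch under our assumptions, not a proposal for TileLang's API: `run_with_deadline` is a hypothetical helper, and the script name in the usage note is a placeholder. A separate process is used because a hang inside native compiler code cannot be interrupted reliably from within the same Python process.

```python
import os
import signal
import subprocess
import sys


def run_with_deadline(argv, timeout_s):
    """Run argv in its own session; return (returncode, output).

    returncode is None if the deadline expired and the process group was killed.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        # Kill the whole group so helper processes spawned by the child
        # (for example a compiler) do not survive it.
        os.killpg(proc.pid, signal.SIGKILL)
        out, _ = proc.communicate()
        return None, out
```

For example, `run_with_deadline([sys.executable, "repro_fp4_gemm.py"], 3600)` would run a single-kernel script (hypothetical file name) for at most one hour. Running one such script per kernel would also show which of the kernels fails to finish.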
Additional Context
Kernel Patterns Used (from DeepSeek-V4-Flash)
The model uses these TileLang @tilelang.jit-decorated kernels:
- act_quant_kernel: Block-wise FP8 quantization (block_size=128)
- fp4_quant_kernel: Block-wise FP4 quantization (block_size=32)
- fp8_gemm_kernel: FP8 act × FP8 weight GEMM (C[M,N] = A[M,K] @ B[N,K]^T)
- fp4_gemm_kernel: FP8 act × FP4 weight GEMM (C[M,N] = A_fp8[M,K] @ B_fp4[N,K]^T)
- sparse_attn_kernel: Sparse multi-head attention with index gathering
- hc_split_sinkhorn_kernel: Hyper-Connection Sinkhorn iterations

All kernels use pass_configs = {TL_DISABLE_WARP_SPECIALIZED: True, TL_DISABLE_TMA_LOWER: True}.
Thread Dump
When the hang occurs:
- Rank 0 process: 92 threads, all showing empty stack traces (/proc/PID/task/*/stack)
- Rank 1 process: 24 threads
- Rank 1 process: 24 threads
- Both processes stuck in what appears to be TVM scheduler/compiler code
- No syscalls observed (strace shows no file I/O, no network activity)
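Since /proc/PID/task/*/stack only exposes kernel-side stacks, the per-thread scheduler state may help distinguish a busy loop from a deadlock. The sketch below is a rough diagnostic, with `thread_states` as a hypothetical helper name. It reads the state letter from each thread's `stat` file and the kernel wait channel from `wchan`. As a rule of thumb, threads in state R with no wait channel are burning CPU in user space, while threads in state S parked on a futex are waiting on a lock or condition.

```python
import os


def thread_states(pid):
    """Return [(tid, name, state, wchan)] for every thread of `pid` (Linux only)."""
    rows = []
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        return rows  # no such process, or not a Linux /proc filesystem
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                stat = fh.read()
        except OSError:
            continue  # thread exited while we were iterating
        # Format: "tid (comm) state ..."; comm may itself contain spaces or ')'.
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        try:
            with open(f"{task_dir}/{tid}/wchan") as fh:
                wchan = fh.read().strip()
        except OSError:
            wchan = "?"  # not readable without sufficient privileges
        rows.append((int(tid), name, state, wchan))
    return rows
```

Running `thread_states(PID)` against both ranks during the hang would give a per-thread table we can attach to this issue.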
Related Issues
Our issue is different: We get no error at all, just an infinite hang.
Possible Causes
- TVM auto-scheduler infinite loop — The scheduler may be searching for an optimal schedule that doesn't exist for sm_120, causing an infinite search loop
- Missing sm_120 codegen path — Specific kernel patterns (FP4×FP8 mixed GEMM, sparse attention) may not have sm_120 implementations, causing the compiler to loop trying to find a fallback
- Prebuilt wheel incompatibility — PyPI wheels may not include the latest sm_120 codegen fixes from the repo
Request
- Investigate the infinite hang during JIT compilation on sm_120 for the kernel patterns listed above
- Add compilation timeout with informative error messages if codegen cannot complete
- Consider building from source — Is the PyPI wheel missing sm_120 fixes that are in the git repo?
We're happy to provide:
- Full container image definition
- Minimal reproduction script
- Core dumps or additional diagnostics
- Testing of any proposed fixes
Labels
bug, Blackwell, sm_120, JIT, compilation, hang
Note: This issue affects production deployment of DeepSeek-V4-Flash on the latest NVIDIA Blackwell workstation GPUs. A fix would enable local inference of 1M-context models on consumer/workstation hardware.