[BUG] JIT compilation hangs indefinitely on sm_120 (RTX PRO 6000 Blackwell) — no error, no timeout, zero output #2328

Description

@edwardtian

TileLang JIT compilation hangs indefinitely (12+ hours) on NVIDIA RTX PRO 6000 Blackwell (sm_120) when running DeepSeek-V4-Flash official inference code. The process consumes 100% GPU and 99% CPU but produces no output, no cache files, and no error messages.

This is a different issue from previously reported sm_120 bugs (which showed clear error messages like "Unsupported target for gemm" or shared memory overflow). In our case, there is no error at all — the compiler simply appears to enter an infinite loop or deadlock during kernel code generation.


Environment

  • GPU: 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB each)
  • GPU Architecture: sm_120
  • CUDA Version: 13.2 (driver 570.x)
  • PyTorch: 2.13.0.dev20260531+cu132
  • Python: 3.12
  • TileLang: Tested both 0.1.8 and 0.1.10 (prebuilt wheels from PyPI)
  • OS: Ubuntu 24.04
  • Container: Podman rootless with nvidia/cuda:13.2.0-devel-ubuntu24.04
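For anyone trying to reproduce this, the version details above can be gathered with a small script like the one below. It is only a sketch: the helper names `package_version` and `environment_report` are made up for illustration, and it relies on nothing beyond the Python standard library.

```python
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("tilelang", "torch")) -> dict:
    """Collect the Python version and the versions of the given packages."""
    report = {"python": platform.python_version()}
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If PyTorch is installed, `torch.cuda.get_device_capability()` can be added to the report to confirm the compute capability the runtime actually sees.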

Reproduction

Model

  • Model: DeepSeek-V4-Flash (official inference code from deepseek-ai/deepseek-v4-flash)
  • Architecture: MoE with 256 routed experts, FP4 expert weights, FP8 shared weights
  • Model Size: ~149 GB total (77 GB per MP=2 shard)
  • Custom Kernels: FP4 GEMM (fp4_gemm), FP8 GEMM (fp8_gemm), sparse attention (sparse_attn), Hyper-Connection Sinkhorn (hc_split_sinkhorn)

Steps to Reproduce

  1. Load the model with the official inference code:
import json

import torch
from safetensors.torch import load_model  # assumed source of load_model
from model import Transformer, ModelArgs
from generate import generate

with open("config.json") as f:
    args = ModelArgs(**json.load(f))
args.max_batch_size = 1
args.max_seq_len = 128  # Even with tiny context!

model = Transformer(args).cuda()
load_model(model, "model0-mp2.safetensors", strict=False)
  2. Run a single-token forward pass:
# `tokenizer` is the model's tokenizer, loaded beforehand.
prompt_tokens = tokenizer.encode("Hi")
completion = generate(model, [prompt_tokens], 1, tokenizer.eos_token_id, 0.0)
  3. Observe: the process hangs indefinitely after the "Model loaded" message.

Observed Behavior

  • ✅ Model loads successfully (~83 GB on GPU 0, ~80 GB on GPU 1)
  • ✅ Health check (FastAPI) responds instantly
  • ❌ First forward pass triggers JIT compilation
  • ❌ Compilation never completes (tested 12+ hours)
  • ❌ No TileLang cache files written (/root/.tilelang/cache/ remains empty)
  • ❌ No .cu, .ptx, or .so artifacts produced
  • ❌ No error messages, no exceptions, no timeouts
  • ❌ GPUs at 100% utilization, CPU at 99% (92 threads active)
  • ❌ Process must be killed with SIGKILL
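The "no cache files, no artifacts" observation can be checked mechanically while the process is hung. The sketch below counts `.cu`, `.ptx`, and `.so` files under a directory. The function name `count_artifacts` is hypothetical, and the cache path is the one quoted in this report, so it may differ on other setups.

```python
from collections import Counter
from pathlib import Path

ARTIFACT_SUFFIXES = {".cu", ".ptx", ".so"}


def count_artifacts(cache_dir: str) -> Counter:
    """Count files under cache_dir by suffix, restricted to ARTIFACT_SUFFIXES."""
    root = Path(cache_dir)
    counts = Counter()
    if not root.is_dir():
        # A missing directory is reported the same way as an empty one.
        return counts
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in ARTIFACT_SUFFIXES:
            counts[path.suffix] += 1
    return counts


if __name__ == "__main__":
    # Cache path as given in this report; adjust if the cache lives elsewhere.
    print(dict(count_artifacts("/root/.tilelang/cache")))
```

An empty result after a long wait is consistent with the hang happening before any source or binary is emitted.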

What We've Tried

| Attempt | Result |
| --- | --- |
| TileLang 0.1.8 | ❌ Hangs >12 hours |
| TileLang 0.1.10 | ❌ Hangs >30 minutes |
| TILELANG_TARGET="cuda -arch=sm_89" | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=torch | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=nvrtc | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=cython | ❌ Still hangs |
| TILELANG_EXECUTION_BACKEND=dlpack | ❌ Still hangs |
| Context size = 128 tokens | ❌ Same hang |
| Context size = 1024 tokens | ❌ Same hang |
| Single token generation | ❌ Same hang |
| NCCL timeout = 24 hours | ✅ Prevents timeout, but compilation never finishes |

Expected Behavior

TileLang should either:

  1. Successfully compile the kernels within a reasonable time (< 1 hour for 61 layers), OR
  2. Produce a clear error message if sm_120 is not supported for the specific kernel patterns used, OR
  3. Timeout with an informative error after a configurable compilation timeout
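Until something like option 3 exists in the compiler itself, a timeout can be approximated from the outside by running the reproduction in a child process. This is a minimal sketch using only the standard library; `run_with_timeout` and `CompileTimeout` are illustrative names and not part of TileLang.

```python
import subprocess
import sys


class CompileTimeout(RuntimeError):
    """Raised when the child process exceeds its time budget."""


def run_with_timeout(argv, timeout_s: float) -> subprocess.CompletedProcess:
    """Run argv in a child process; kill it and raise if it exceeds timeout_s."""
    try:
        # subprocess.run kills the child and waits for it when the timeout expires.
        return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise CompileTimeout(
            f"{argv!r} did not finish within {timeout_s} s"
        ) from exc


if __name__ == "__main__":
    result = run_with_timeout([sys.executable, "-c", "print('compiled')"], timeout_s=60)
    print(result.stdout.strip())
```

Only the direct child is killed on timeout, so any helper processes it spawned may need separate cleanup.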

Additional Context

Kernel Patterns Used (from DeepSeek-V4-Flash)

The model uses these TileLang @tilelang.jit decorated kernels:

  1. act_quant_kernel — Block-wise FP8 quantization (block_size=128)
  2. fp4_quant_kernel — Block-wise FP4 quantization (block_size=32)
  3. fp8_gemm_kernel — FP8 act × FP8 weight GEMM (C[M,N] = A[M,K] @ B[N,K]^T)
  4. fp4_gemm_kernel — FP8 act × FP4 weight GEMM (C[M,N] = A_fp8[M,K] @ B_fp4[N,K]^T)
  5. sparse_attn_kernel — Sparse multi-head attention with index gathering
  6. hc_split_sinkhorn_kernel — Hyper-Connection Sinkhorn iterations

All kernels use pass_configs = {TL_DISABLE_WARP_SPECIALIZED: True, TL_DISABLE_TMA_LOWER: True}.
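For readers unfamiliar with the first kernel pattern, block-wise quantization can be sketched in plain Python as below. This is not taken from the DeepSeek-V4-Flash source: it assumes the common scheme in which each block is scaled so that its absolute maximum maps to 448, the largest finite value of the FP8 e4m3 format, and it does not model the rounding to FP8 itself. The function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def blockwise_scales(values, block_size=128):
    """Return one scale per block so that each block's abs-max maps to FP8_E4M3_MAX."""
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Use a scale of 1.0 for all-zero blocks to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_blocks(values, scales, block_size=128):
    """Divide each element by its block's scale (rounding to FP8 is not modelled)."""
    return [v / scales[i // block_size] for i, v in enumerate(values)]
```

A reference like this is useful mainly for checking kernel output once compilation succeeds; it says nothing about why compilation hangs.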

Thread Dump

When the hang occurs:

  • Rank 0 process: 92 threads, all showing empty stack traces (/proc/PID/task/*/stack)
  • Rank 1 process: 24 threads
  • Both processes stuck in what appears to be TVM scheduler/compiler code
  • No syscalls observed (strace shows no file I/O, no network activity)
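Because the kernel-side stacks are empty, a Python-level traceback may say more about where the process is stuck. One option is the standard library's `faulthandler` module, which uses a watchdog thread to dump the tracebacks of all Python threads after a delay. The wrapper names below are illustrative only, and the sketch assumes it is called at the top of the inference script before the first forward pass.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s: float, stream=None) -> None:
    """Dump the tracebacks of all Python threads every timeout_s seconds."""
    # The target must be a real file with a file descriptor, e.g. sys.stderr.
    target = sys.stderr if stream is None else stream
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the guarded section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

The dump shows only Python frames, so if the hang is inside native compiler code it will point at the Python call that entered that code rather than at the native function itself.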

Related Issues

Our issue is different: We get no error at all, just an infinite hang.


Possible Causes

  1. TVM auto-scheduler infinite loop — The scheduler may be searching for an optimal schedule that doesn't exist for sm_120, causing an infinite search loop
  2. Missing sm_120 codegen path — Specific kernel patterns (FP4×FP8 mixed GEMM, sparse attention) may not have sm_120 implementations, causing the compiler to loop trying to find a fallback
  3. Prebuilt wheel incompatibility — PyPI wheels may not include the latest sm_120 codegen fixes from the repo

Request

  1. Investigate the infinite hang during JIT compilation on sm_120 for the kernel patterns listed above
  2. Add compilation timeout with informative error messages if codegen cannot complete
  3. Confirm whether the PyPI wheels are missing sm_120 fixes that are already in the git repo, and whether building from source is the recommended workaround

We're happy to provide:

  • Full container image definition
  • Minimal reproduction script
  • Core dumps or additional diagnostics
  • Testing of any proposed fixes

Labels

bug, Blackwell, sm_120, JIT, compilation, hang


Note: This issue affects production deployment of DeepSeek-V4-Flash on the latest NVIDIA Blackwell workstation GPUs. A fix would enable local inference of 1M-context models on consumer/workstation hardware.
