[BUG] JIT compilation hangs indefinitely on sm_120 (RTX PRO 6000 Blackwell) — no error, no timeout, zero output #2328

## Description

TileLang JIT compilation hangs indefinitely (12+ hours) on NVIDIA RTX PRO 6000 Blackwell (sm_120) when running DeepSeek-V4-Flash official inference code. The process consumes 100% GPU and 99% CPU but produces no output, no cache files, and no error messages.

This is a **different issue** from previously reported sm_120 bugs (which showed clear error messages like "Unsupported target for gemm" or shared memory overflow). In our case, there is **no error at all** — the compiler simply appears to enter an infinite loop or deadlock during kernel code generation.

---

## Environment

- **GPU:** 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB each)
- **GPU Architecture:** sm_120
- **CUDA Version:** 13.2 (driver 570.x)
- **PyTorch:** 2.13.0.dev20260531+cu132
- **Python:** 3.12
- **TileLang:** Tested both 0.1.8 and 0.1.10 (prebuilt wheels from PyPI)
- **OS:** Ubuntu 24.04
- **Container:** Podman rootless with `nvidia/cuda:13.2.0-devel-ubuntu24.04`
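The package versions above can be collected programmatically so that reports from different machines are comparable. The following is a minimal sketch using only the standard library; the helper name `collect_versions` is hypothetical, and it assumes the PyPI distribution names are `tilelang` and `torch`:

```python
import platform
from importlib import metadata


def collect_versions(packages=("tilelang", "torch")):
    """Return a {name: version} mapping suitable for pasting into a bug report."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's metadata; does not import the package.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Because this reads distribution metadata instead of importing the packages, it should not trigger any JIT compilation or GPU initialization.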

---

## Reproduction

### Model
- **Model:** DeepSeek-V4-Flash (official inference code from `deepseek-ai/deepseek-v4-flash`)
- **Architecture:** MoE with 256 routed experts, FP4 expert weights, FP8 shared weights
- **Model Size:** ~149 GB total (77 GB per MP=2 shard)
- **Custom Kernels:** FP4 GEMM (`fp4_gemm`), FP8 GEMM (`fp8_gemm`), sparse attention (`sparse_attn`), Hyper-Connection Sinkhorn (`hc_split_sinkhorn`)

### Steps to Reproduce

1. Load the model with official inference code:
```python
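import json

import torch
# Assumption: load_model comes from safetensors (its import was not shown in the original snippet)
from safetensors.torch import load_model

from model import Transformer, ModelArgs
from generate import generate

# Build the model arguments from the checkpoint's config.json
with open("config.json") as f:
    args = ModelArgs(**json.load(f))
args.max_batch_size = 1
args.max_seq_len = 128  # the hang reproduces even with a tiny context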

model = Transformer(args).cuda()
load_model(model, "model0-mp2.safetensors", strict=False)
```

2. Run a single-token forward pass:
```python
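# `tokenizer` is assumed to be already loaded for this checkpoint (its setup is not shown here)
prompt_tokens = tokenizer.encode("Hi")
# Generate a single new token at temperature 0.0; this first forward pass is where the hang occurs
completion = generate(model, [prompt_tokens], 1, tokenizer.eos_token_id, 0.0)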
```

3. **Observe:** The process hangs indefinitely after the "Model loaded" message

### Observed Behavior
- ✅ Model loads successfully (~83 GB on GPU 0, ~80 GB on GPU 1)
- ✅ Health check (FastAPI) responds instantly
- ❌ First forward pass triggers JIT compilation
- ❌ **Compilation never completes** (tested 12+ hours)
- ❌ No TileLang cache files written (`/root/.tilelang/cache/` remains empty)
- ❌ No `.cu`, `.ptx`, or `.so` artifacts produced
- ❌ No error messages, no exceptions, no timeouts
- ❌ GPUs at 100% utilization, CPU at 99% (92 threads active)
- ❌ Process must be killed with SIGKILL
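The "no cache files, no artifacts" observations can be re-checked with a small script instead of by eye. This is a sketch only: `find_artifacts` is a hypothetical helper, the suffix list mirrors the artifact types named above, and the cache path is the one from this report (it may differ on other setups):

```python
from pathlib import Path

# Artifact types mentioned in the observations above.
ARTIFACT_SUFFIXES = {".cu", ".ptx", ".so"}


def find_artifacts(cache_dir):
    """Recursively list compiler artifacts under cache_dir (empty list if it does not exist)."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix in ARTIFACT_SUFFIXES
    )


if __name__ == "__main__":
    found = find_artifacts("/root/.tilelang/cache/")
    print(f"{len(found)} artifact(s) found")
    for path in found:
        print(path)
```

Running this periodically while the process is hung shows whether any kernel got as far as emitting source or binaries.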

### What We've Tried

| Attempt | Result |
|---------|--------|
| TileLang 0.1.8 | ❌ Hangs >12 hours |
| TileLang 0.1.10 | ❌ Hangs >30 minutes |
| `TILELANG_TARGET="cuda -arch=sm_89"` | ❌ Still hangs |
| `TILELANG_EXECUTION_BACKEND=torch` | ❌ Still hangs |
| `TILELANG_EXECUTION_BACKEND=nvrtc` | ❌ Still hangs |
| `TILELANG_EXECUTION_BACKEND=cython` | ❌ Still hangs |
| `TILELANG_EXECUTION_BACKEND=dlpack` | ❌ Still hangs |
| Context size = 128 tokens | ❌ Same hang |
| Context size = 1024 tokens | ❌ Same hang |
| Single token generation | ❌ Same hang |
| NCCL timeout = 24 hours | ✅ Prevents timeout, but compilation never finishes |

---

## Expected Behavior

TileLang should either:
1. Successfully compile the kernels within a reasonable time (< 1 hour for 61 layers), OR
2. Produce a clear error message if sm_120 is not supported for the specific kernel patterns used, OR
3. Time out with an informative error after a configurable compilation timeout
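Until a built-in timeout exists, option 3 can be approximated from the outside by running each compilation in a child interpreter and killing it after a deadline. The sketch below uses only the standard library; `run_with_timeout` is a hypothetical helper, and the snippet passed to it would be whatever code triggers one kernel's compilation (not shown, since it depends on the model code):

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s):
    """Run a Python snippet in a child interpreter and kill it if it exceeds timeout_s.

    Returns (status, output) where status is "ok", "error", or "timeout".
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Stand-in snippets; replace with code that compiles one kernel at a time.
    print(run_with_timeout("print('compiled')", timeout_s=30))
    print(run_with_timeout("import time; time.sleep(60)", timeout_s=1))
```

Running one snippet per kernel would also help narrow down which of the kernels hangs. Note that only the direct child is killed; any helper processes it spawned may need separate cleanup.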

---

## Additional Context

### Kernel Patterns Used (from DeepSeek-V4-Flash)

The model uses these TileLang `@tilelang.jit` decorated kernels:

1. **`act_quant_kernel`** — Block-wise FP8 quantization (block_size=128)
2. **`fp4_quant_kernel`** — Block-wise FP4 quantization (block_size=32)
3. **`fp8_gemm_kernel`** — FP8 act × FP8 weight GEMM (C[M,N] = A[M,K] @ B[N,K]^T)
4. **`fp4_gemm_kernel`** — FP8 act × FP4 weight GEMM (C[M,N] = A_fp8[M,K] @ B_fp4[N,K]^T)
5. **`sparse_attn_kernel`** — Sparse multi-head attention with index gathering
6. **`hc_split_sinkhorn_kernel`** — Hyper-Connection Sinkhorn iterations

All kernels use `pass_configs = {TL_DISABLE_WARP_SPECIALIZED: True, TL_DISABLE_TMA_LOWER: True}`.

### Thread Dump

When the hang occurs:
- Rank 0 process: 92 threads, all showing empty kernel stack traces (`/proc/PID/task/*/stack`), which suggests the threads are running in user space and are not blocked in the kernel
- Rank 1 process: 24 threads
- Both processes appear to be stuck in TVM scheduler/compiler code
- No syscalls observed: `strace` shows no file I/O and no network activity
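Since the kernel-side stacks are empty, a Python-level dump may be more informative. The standard-library `faulthandler` module can print every thread's Python stack after a delay or on a signal. This is a minimal sketch; `arm_hang_diagnostics` is a hypothetical helper name, and it would need to be called at the top of the inference script before the first forward pass:

```python
import faulthandler
import signal
import sys


def arm_hang_diagnostics(after_seconds=600, log_file=None):
    """Dump all Python thread stacks if the process is still running after a delay."""
    if log_file is None:
        log_file = sys.stderr
    # Periodic dump: fires every `after_seconds` until cancelled.
    faulthandler.dump_traceback_later(after_seconds, repeat=True, file=log_file)
    # On-demand dump (Unix only): `kill -USR1 <pid>` prints the stacks immediately.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)


if __name__ == "__main__":
    arm_hang_diagnostics(after_seconds=600)
    # ... load the model and run the first forward pass here ...
```

The dump only contains Python frames, so if the hang is inside native compiler code it shows the Python call that entered it, not the native location. A native backtrace would still require a debugger such as `gdb`.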

### Related Issues

- **#1233** "blackwell compile issue" — Similar hardware but showed clear compile error (fixed in v0.1.7)
- **#985** "SM120 error with bundled TVM" — Showed "Unsupported target" error (fixed)
- **#2201** "sm_120 shared memory overflow" — Showed shared memory error (fixed in 0.1.9)
- **PR #2324** — Active NVFP4/SM120 support PR (not yet merged)

**Our issue is different:** We get **no error at all**, just an infinite hang.

---

## Possible Causes

1. **TVM auto-scheduler infinite loop** — The scheduler may be searching for an optimal schedule that doesn't exist for sm_120, causing an infinite search loop
2. **Missing sm_120 codegen path** — Specific kernel patterns (FP4×FP8 mixed GEMM, sparse attention) may not have sm_120 implementations, causing the compiler to loop trying to find a fallback
3. **Prebuilt wheel incompatibility** — PyPI wheels may not include the latest sm_120 codegen fixes from the repo

---

## Request

1. **Investigate the infinite hang** during JIT compilation on sm_120 for the kernel patterns listed above
2. **Add compilation timeout** with informative error messages if codegen cannot complete
3. **Clarify whether building from source is required**: is the PyPI wheel missing sm_120 fixes that are already in the git repository?

We're happy to provide:
- Full container image definition
- Minimal reproduction script
- Core dumps or additional diagnostics
- Testing of any proposed fixes

---

## Labels

`bug`, `Blackwell`, `sm_120`, `JIT`, `compilation`, `hang`

---

*Note: This issue affects production deployment of DeepSeek-V4-Flash on the latest NVIDIA Blackwell workstation GPUs. A fix would enable local inference of 1M-context models on consumer/workstation hardware.*
