flashmla-mbarrier-debug

name: flashmla-mbarrier-debug
description: Diagnose and fix FlashMLA CuTe DSL hangs or incorrect outputs in flashmla/flashmla_dsl.py by tracing mbarrier/TMA producer-consumer dependencies against flashmla/splitkv_mla.cu. Use when execution stalls around qkt_gemm_one_tile_sQ mbarrier_wait, when "before wait" appears without "after wait", or when tensor o / calc_cos validation fails.

Focus on flashmla/flashmla_dsl.py.
Use flashmla/splitkv_mla.cu only as an immutable reference; do not edit it.
Keep these Python-vs-CUDA differences unchanged unless explicitly requested:
- Preserve the launch path guarded by if warpgroup_idx == 1.
- Preserve the launch gate if warp_idx % 4 == 0.

Run with the environment's Python binary directly:

CUTE_DSL_KEEP_PTX=0 CUTE_DSL_KEEP_IR=0 \
/home/wuguanyu02/miniconda3/envs/fllm2/bin/python -u flashmla/flashmla_dsl.py

Treat a qkt_gemm_one_tile_sQ "before wait" message without a matching "after wait" as a deadlock signal.
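The matching rule above can be checked mechanically on captured output. The following is a minimal sketch, assuming each probe line contains the function label plus the text "before wait" or "after wait"; the exact line format and the helper name unmatched_waits are hypothetical, not taken from the script.

```python
from collections import Counter


def unmatched_waits(log_lines, label="qkt_gemm_one_tile_sQ"):
    """Count 'before wait' probes for `label` that have no 'after wait'."""
    counts = Counter()
    for line in log_lines:
        if label not in line:
            continue  # ignore unrelated output
        if "before wait" in line:
            counts["before"] += 1
        elif "after wait" in line:
            counts["after"] += 1
    return counts["before"] - counts["after"]


# Hypothetical captured output: the wait on tile 1 never returns.
sample = [
    "qkt_gemm_one_tile_sQ before wait tile=0",
    "qkt_gemm_one_tile_sQ after wait tile=0",
    "qkt_gemm_one_tile_sQ before wait tile=1",
]
print(unmatched_waits(sample))  # 1
```

A non-zero result on a run that had to be interrupted points at the wait that stalled.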

For each barrier used by consumer wait, identify:

Use these anchor points:

Consumer path: warpgroup_cooperative_qkt_gemm -> qkt_gemm_one_tile_sQ/rQ
Producer path: launch_kv_tiles_copy_tma
Initial K0 wait path: the first for i in range_constexpr(9) loop in the warpgroup 0 branch
Reference behavior: splitkv_mla.cu launch_kv_tiles_copy_tma and QKT pipeline
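When tracing the consumer and producer paths listed above, it can help to keep a small mental model of one barrier. The sketch below is a host-side toy written for illustration only: ToyMbarrier is a hypothetical class, not the CuTe DSL or PTX API, and it ignores TMA transaction byte counts. It only models the phase parity that a wait observes.

```python
class ToyMbarrier:
    """Host-side toy model of a phase-tracking barrier (illustration only)."""

    def __init__(self, expected_arrivals):
        self.expected = expected_arrivals
        self.pending = expected_arrivals
        self.phase = 0  # parity of the phase currently in progress

    def arrive(self):
        """Producer side: the phase flips once all expected arrivals are in."""
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1
            self.pending = self.expected

    def try_wait(self, parity):
        """Consumer side: True once the phase with this parity has completed."""
        return self.phase != parity


bar = ToyMbarrier(expected_arrivals=1)
print(bar.try_wait(0))  # False: no producer has arrived yet
bar.arrive()
print(bar.try_wait(0))  # True: phase 0 completed
print(bar.try_wait(1))  # False: phase 1 is still in progress
```

In this model a consumer stalls in two ways: the producer never arrives on that barrier, or the consumer waits on a parity that does not match the number of completed phases.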

Add only short probes:

Do not spam all threads. Restrict to one lane per warpgroup, e.g. tid==0 or tid==128.
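As a quick check of which threads such a gate selects, here is a host-side sketch. It relies on a warpgroup being 4 warps of 32 threads; the 256-thread block size and the helper name is_probe_thread are assumptions for illustration.

```python
THREADS_PER_WARPGROUP = 128  # 4 warps x 32 lanes


def is_probe_thread(tid):
    """True only for the first thread of each warpgroup."""
    return tid % THREADS_PER_WARPGROUP == 0


# Assumed block of 256 threads (two warpgroups), for illustration only.
print([tid for tid in range(256) if is_probe_thread(tid)])  # [0, 128]
```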

Primary checks:

Known pitfall seen in this repo:

In launch_kv_tiles_copy_tma, looping over range_constexpr(start, end+1) with calls like (4, 9) and (0, 4) visits tiles 4 to 9 and 0 to 4. Tile 4 is then covered twice, and index 9 lies outside the 9-tile pipeline (valid indices 0 to 8).
Either effect can break barrier state and stall mbarrier_wait.
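The index arithmetic behind this pitfall can be reproduced on the host. The sketch below uses plain Python range as a stand-in for range_constexpr, assuming the two share start/stop semantics; the helper name inclusive_tiles is hypothetical.

```python
NUM_TILES = 9  # valid tile indices are 0..8


def inclusive_tiles(start, end):
    """Tile indices visited by a loop written as range(start, end + 1)."""
    return list(range(start, end + 1))


first = inclusive_tiles(4, 9)   # [4, 5, 6, 7, 8, 9]
second = inclusive_tiles(0, 4)  # [0, 1, 2, 3, 4]

overlap = sorted(set(first) & set(second))
out_of_range = [t for t in first + second if t >= NUM_TILES]
print(overlap)       # [4]
print(out_of_range)  # [9]

# With half-open bounds the same two calls cover each tile exactly once.
half_open = list(range(0, 4)) + list(range(4, 9))
print(half_open == list(range(NUM_TILES)))  # True
```

Whether the right fix is half-open bounds or different call arguments should be decided against the reference loop in splitkv_mla.cu; the sketch only shows why the inclusive form misbehaves with these arguments.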

Patch one cause at a time and re-run immediately.

Priority order:

Keep immutable constraints:

Required:

Kernel run exits normally.
tensor o is printed (if the current main script prints it; otherwise add a temporary print).
calc_cos for both batches is below 1e-4.
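The definition of calc_cos is not shown here. As a rough illustration of this kind of threshold check, the sketch below assumes a cosine-style difference metric that is 0.0 for identical tensors; the formula, the function name cos_diff, and the sample values are assumptions, not the repo's implementation.

```python
def cos_diff(x, y):
    """Assumed metric: 1 - 2*sum(x*y) / sum(x*x + y*y); 0.0 when x == y."""
    num = 2.0 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a + b * b for a, b in zip(x, y))
    return 1.0 - num / max(den, 1e-12)  # guard against all-zero inputs


THRESHOLD = 1e-4

ref = [1.0, 2.0, 3.0]    # stand-in for one batch of the reference output
out = [1.0, 2.0, 3.001]  # stand-in for the kernel output, slightly off
bad = [3.0, 2.0, 1.0]    # stand-in for a clearly wrong output

print(cos_diff(ref, out) < THRESHOLD)  # True
print(cos_diff(ref, bad) < THRESHOLD)  # False
```

Run the check separately for each batch; a pass on one batch does not imply a pass on the other.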

If the run still fails or hangs:

export CUTE_DSL_KEEP_PTX=1
export CUTE_DSL_KEEP_IR=1
/home/wuguanyu02/miniconda3/envs/fllm2/bin/python -u flashmla/flashmla_dsl.py

Then inspect generated IR/PTX around: