name: debug-distributed-hang
description: Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.
Debugging Distributed Hangs in SGLang
Overview
Hangs in distributed inference happen when ranks diverge in state, causing collective operations (AllGather, AllReduce, Broadcast, Barrier) to deadlock. Common causes:
- Size mismatch: ranks pass different tensor sizes to a collective
- Branch divergence: one rank enters a collective, another skips it
- Cascading state drift: a small non-determinism (e.g., floating-point) propagates into different batch structures
- Resource exhaustion: one rank OOMs or crashes, others wait forever
Prerequisites
- py-spy: pip install py-spy or system package. Requires root or CAP_SYS_PTRACE to attach to running processes.
- cuda-gdb: Ships with the CUDA toolkit. Ensure it's on your PATH.
Step 1: Confirm and Locate the Hang
1a. Watchdog / py-spy
SGLang's watchdog automatically dumps py-spy traces on timeout. Look for:
Scheduler watchdog timeout (self.watchdog_timeout=300, self.soft=False)
The py-spy dump shows the stack trace of each thread. The hanging thread is typically blocked in a CUDA synchronize or NCCL collective:
Thread (active): "MainThread"
cuStreamSynchronize (libcuda.so)
...
forward_extend (model_runner.py)
SGLang has two watchdog modes (see python/sglang/srt/utils/watchdog.py):
- Hard watchdog (soft=False, default): dumps py-spy traces, then sends SIGQUIT to kill the parent process.
- Soft watchdog (soft=True): only logs the timeout without killing the process, giving you more time to manually attach debuggers or collect coredumps.
If the watchdog doesn't trigger, manually dump:
py-spy dump --pid <scheduler_pid>
1b. NCCL Debug Logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL
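Look for the last collective logged before the hang and compare it across ranks. A missing participant shows up as one rank logging a collective that another rank never enters; a size mismatch shows up as ranks logging the same collective with different counts.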
1c. CUDA Coredump
When a process hangs, you can trigger a GPU coredump on demand to see which kernel is stuck. Set these env vars before launching:
export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1
export CUDA_COREDUMP_PIPE="/tmp/cuda_pipe_%h_%p"
export CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h_%p"
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory'
While the process is hanging, find the pipe via /proc/<pid>/fd/ and write to it to trigger the dump:
ls /proc/<pid>/fd/ -la 2>/dev/null | grep cuda_pipe
dd if=/dev/zero bs=1M count=1 > /tmp/cuda_pipe_<hostname>_<pid>
Alternatively, if you don't need to keep the process alive, kill -SIGABRT <pid> also triggers a CUDA coredump (but terminates the process).
Then open with cuda-gdb --batch -ex "target cudacore <coredump_file>". On load, it immediately shows which kernel is stuck. For example:
Opening GPU coredump: <coredump_file>
[Current focus set to CUDA kernel 0, grid 622721, cluster (4,0,0), block (16,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
#0 0x00007f8029b2b040 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(512,1,1)>>> ()
This told us the hang was in an NCCL AllGather — not a compute kernel. Combined with the py-spy stack pointing to LogitsProcessor.forward → tensor_model_parallel_all_gather, we knew it was an AllGather size mismatch between TP ranks.
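When triaging several coredumps, the frame line can be parsed to pull out the kernel name. The following is a minimal sketch, assuming the cuda-gdb output contains a "#0" frame line shaped like the example above. The helper name stuck_kernel and the "nccl" prefix check are illustrative assumptions, not part of cuda-gdb or SGLang.

```python
import re

# Matches a frame line such as:
#   #0 0x00007f... in kernelName(args)<<<(grid),(block)>>> ()
# An optional leading "void " is skipped; template arguments are dropped.
_FRAME_RE = re.compile(r"^#0\s+0x[0-9a-fA-F]+\s+in\s+(?:void\s+)?([A-Za-z_][\w:]*)")


def stuck_kernel(gdb_output: str):
    """Return (kernel_name, looks_like_nccl) for the first '#0' frame, or None."""
    for line in gdb_output.splitlines():
        match = _FRAME_RE.match(line.strip())
        if match:
            name = match.group(1)
            # Assumption: NCCL device kernels have names starting with "nccl".
            return name, name.startswith("nccl")
    return None
```

A True second element suggests the hang is inside a collective rather than a compute kernel, which is the distinction this step is after.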
1d. Identify the Collective
From the stack traces and logs, identify:
- Which collective hangs (AllGather, AllReduce, Broadcast)
- Which code path invokes it (e.g., LogitsProcessor, tensor_model_parallel_all_gather)
- Whether it's a size mismatch or a missing participant
Step 2: Per-Rank Logging
The key technique: each rank writes its own log file so you can diff them.
Setup Pattern
import os

_debug_files = {}

def get_debug_file(rank):
    key = f"rank{rank}"
    if key not in _debug_files:
        _debug_files[key] = open(f"/tmp/debug_rank{rank}.log", "w")
    return _debug_files[key]
Gate logging behind an env var to avoid overhead in production. SGLANG_DEBUG_HANG is not a built-in SGLang env var — you need to add this check yourself in the code you're instrumenting:
if os.environ.get("SGLANG_DEBUG_HANG"):
    f = get_debug_file(rank)
    f.write(f"EVENT_NAME key1={val1} key2={val2}\n")
    f.flush()
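The file cache and the env-var gate can be folded into one call so that each instrumentation point stays a single line. This is a sketch under assumptions: debug_log and its log_dir parameter are hypothetical names introduced here, and SGLANG_DEBUG_HANG remains a variable you define yourself, not one SGLang reads.

```python
import os

_debug_files = {}


def debug_log(rank: int, event: str, log_dir: str = "/tmp", **fields) -> bool:
    """Append one 'EVENT key=value ...' line to this rank's log file.

    Does nothing (and returns False) unless SGLANG_DEBUG_HANG is set.
    """
    if not os.environ.get("SGLANG_DEBUG_HANG"):
        return False
    if rank not in _debug_files:
        path = os.path.join(log_dir, f"debug_rank{rank}.log")
        _debug_files[rank] = open(path, "w")
    f = _debug_files[rank]
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    f.write(f"{event} {body}\n")
    # Flush every line: a hung process never reaches a clean close().
    f.flush()
    return True
```

A call site then looks like debug_log(rank, "SCHED_BATCH", step=step, num_reqs=n). Flushing on every write costs some throughput, but it ensures the last event before the hang is on disk.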
What to Log
Log structured events at key state-mutation points:
f.write(f"SCHED_BATCH step={step} num_reqs={n} extend_lens={lens}\n")
f.write(f"VERIFY predict_hash={hash} accept_len={alen}\n")
f.write(f"CACHE_INSERT rid={rid} num_tokens={n}\n")
Use consistent event names (uppercase prefix) for easy grep/diff.
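Consistent event lines are also easy to parse when grep and diff are not enough. A minimal sketch, assuming each line has the form "NAME key=value key=value"; parse_event is a hypothetical helper, not an SGLang function.

```python
def parse_event(line: str):
    """Split 'NAME k1=v1 k2=v2' into (name, {k1: v1, ...}).

    Values are kept as strings. Tokens without '=' are ignored.
    """
    name, _, rest = line.strip().partition(" ")
    fields = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return name, fields
```

This parser splits on whitespace, so values must not contain spaces. A Python list formatted with an f-string prints as "[3, 5]", which would be cut at the first space; joining the items with commas before logging avoids that.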
Hash Large Tensors
For tensor values, compute a hash instead of dumping raw data:
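import hashlib
import torch
# Reinterpret the raw bytes as uint8 so dtypes NumPy lacks (e.g., bfloat16) can still be hashed
raw = tensor.detach().cpu().contiguous().reshape(-1).view(torch.uint8).numpy().tobytes()
h = hashlib.md5(raw).hexdigest()[:8]
f.write(f"LOGITS logits_hash={h}\n")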
For token ID lists, str(list).encode() works:
h = hashlib.md5(str(tensor.tolist()).encode()).hexdigest()[:8]
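For values that are already on the CPU, such as token ID lists, a small helper keeps the hashing consistent across call sites. This is a sketch, with short_hash as a hypothetical name; it joins the integers explicitly so the digest does not depend on how the container type prints.

```python
import hashlib


def short_hash(values, length: int = 8) -> str:
    """Hash a sequence of ints (e.g., token IDs) to a short hex digest."""
    data = ",".join(str(int(v)) for v in values).encode()
    return hashlib.md5(data).hexdigest()[:length]
```

Eight hex characters are enough to spot a mismatch between ranks; MD5 is used here only as a fingerprint, not for security.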
Avoid Implicit Synchronization
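tensor.cpu(), tensor.tolist(), and tensor.item() all copy data from the GPU to the host and block until the GPU has finished the pending work on that tensor (tensor.numpy() only works on a CPU tensor, so it implies a prior .cpu()). This implicit synchronization can: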
- Change timing and mask or move the hang
- Deadlock if the log point is between two collectives that must run back-to-back
Prefer logging values that are already on CPU (e.g., Python ints, list lengths, request IDs). When you must hash a GPU tensor, do it at a point where the GPU is already idle (e.g., between scheduler steps, not inside a model forward pass).
Step 3: Diff to Find the Diverge Point
Basic Diff
# Extract specific event type
grep "^VERIFY" /tmp/debug_rank0.log > /tmp/v_r0.txt
grep "^VERIFY" /tmp/debug_rank1.log > /tmp/v_r1.txt
diff /tmp/v_r0.txt /tmp/v_r1.txt | head -20
Count Events
grep -c "^VERIFY" /tmp/debug_rank*.log
If counts differ, one rank executed more iterations — that's already a diverge signal.
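With more than two ranks or several event types, the same count comparison can be done in one pass. A minimal sketch, assuming log lines start with the event name; count_events and count_mismatches are hypothetical helpers.

```python
from collections import Counter


def count_events(lines):
    """Count lines per event name (the first whitespace-delimited token)."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 1)
        if parts:
            counts[parts[0]] += 1
    return counts


def count_mismatches(per_rank_lines):
    """Return {event: {rank: count}} for events whose counts differ across ranks."""
    per_rank = {rank: count_events(lines) for rank, lines in per_rank_lines.items()}
    events = set()
    for counts in per_rank.values():
        events.update(counts)
    return {
        event: {rank: counts[event] for rank, counts in per_rank.items()}
        for event in sorted(events)
        if len({counts[event] for counts in per_rank.values()}) > 1
    }
```

Each value in per_rank_lines can be an open file object, since iterating a file yields its lines.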
Find First Diverge
The first diff line tells you the exact step where ranks diverge. All lines before it are identical — the root cause is at or before this step.
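The same lookup can be scripted so it returns the position of the first differing event directly. This is a sketch; first_divergence is a hypothetical helper that compares two ranks' logs after filtering by an event prefix.

```python
from itertools import zip_longest


def first_divergence(lines_a, lines_b, prefix=""):
    """Return (index, line_a, line_b) for the first differing event, else None.

    Only lines starting with prefix are compared, and index counts within
    that filtered sequence. A missing line on one side is reported as None.
    """
    a = [line.rstrip("\n") for line in lines_a if line.startswith(prefix)]
    b = [line.rstrip("\n") for line in lines_b if line.startswith(prefix)]
    for index, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return index, x, y
    return None
```

A result whose second or third element is None means one rank stopped logging that event earlier than the other, which matches the count mismatch case.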
Step 4: Binary-Search the Root Cause
Once you find the diverging event, trace backwards:
4a. Identify Inputs
For the diverging operation, list all its inputs. Add hash logging for each:
f.write(
    f"OP_INPUTS input_a_hash={h_a} input_b_hash={h_b} "
    f"input_c_hash={h_c} input_d_hash={h_d}\n"
)
4b. Diff Inputs Across Ranks
Compare the hashes. Some inputs will match, some won't. The non-matching input is where divergence entered.
4c. Recurse
For the non-matching input, trace where it was produced and repeat: hash its inputs, diff across ranks, find the divergent one. Continue until you reach the root cause.
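One round of this search reduces to asking which logged fields disagree across ranks at the diverging step. A minimal sketch, assuming the fields of the matching OP_INPUTS line have already been parsed into a dict per rank; mismatched_fields is a hypothetical helper.

```python
def mismatched_fields(per_rank_fields):
    """Given {rank: {field: value}}, return sorted field names that differ.

    A field missing on some rank counts as a difference.
    """
    keys = set()
    for fields in per_rank_fields.values():
        keys.update(fields)
    return sorted(
        key
        for key in keys
        if len({fields.get(key) for fields in per_rank_fields.values()}) > 1
    )
```

Each returned field names an input to trace one level further back; an empty list means the divergence enters inside the operation itself rather than through its logged inputs.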
Step 5: Common Root Causes and Fixes
Floating-Point Non-Determinism
Symptom: All "logical" inputs are identical (same logits after all-gather), but derived floating-point values (softmax, probabilities) differ across GPUs.
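Example: In EAGLE speculative decoding, F.softmax → top_k_renorm_prob → top_p_renorm_prob produces slightly different target_probs on each GPU. The sampling kernel then picks different tokens. These flow into output_ids → radix cache → different prefix match depths → different extend_seq_lens → AllGather size mismatch → hang.
Fix: Compute the non-deterministic result on one rank and broadcast it to the others, e.g. broadcast(result, src=0), so every rank continues with identical values.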
Random Number Divergence
Symptom: Operations using torch.rand produce different values on each rank.
Fix: Generate on rank 0 and broadcast, or use a shared seed.
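The shared-seed option can be illustrated without a GPU. The sketch below uses Python's random module as a stand-in for torch.rand; in PyTorch the same idea would seed a torch.Generator identically on every rank. The helper step_rng and the way the seed is combined with the step are assumptions for illustration.

```python
import random


def step_rng(base_seed: int, step: int) -> random.Random:
    """Build an RNG whose stream depends only on (base_seed, step).

    Nothing rank-specific enters the seed, so every rank that calls this
    with the same arguments draws the same values.
    """
    return random.Random(base_seed * 1_000_003 + step)


# Two "ranks" drawing for the same step get identical samples.
rank0_draws = [step_rng(1234, step=7).random() for _ in range(1)]
rank1_draws = [step_rng(1234, step=7).random() for _ in range(1)]
```

Deriving the seed from the step avoids relying on every rank having consumed the same number of random values so far, which is itself a form of state that can drift.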
Conditional Code Paths
Symptom: A condition (e.g., memory check, queue length) evaluates differently on different ranks, causing one rank to enter a collective while another skips it.
Fix: Synchronize the condition value before branching, or restructure to ensure all ranks take the same path.
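Synchronizing the condition amounts to reducing each rank's local decision to one shared value before the branch. The sketch below keeps the reduction abstract: all_reduce_min is a hypothetical callable standing in for a real collective (with torch.distributed, an all_reduce using ReduceOp.MIN on a one-element tensor would play this role), and make_fake_all_reduce_min exists only to run the example in a single process.

```python
def agree_on_condition(local_ok: bool, all_reduce_min) -> bool:
    """Return the same decision on every rank: True only if all ranks say True.

    all_reduce_min takes this rank's int and returns the minimum over all
    ranks. Every rank must call it, unconditionally.
    """
    return bool(all_reduce_min(int(local_ok)))


def make_fake_all_reduce_min(all_local_values):
    """Single-process stand-in for the collective, for illustration only."""
    return lambda _local: min(int(value) for value in all_local_values)
```

The reduction is itself a collective, so it has to sit outside any rank-dependent branch; otherwise the fix reintroduces the hang it is meant to remove.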
Pipeline Parallel (PP) Send/Recv Mismatch
Symptom: In PP setups, one stage issues a send that the next stage never recvs (or vice versa), causing both to block indefinitely. Unlike TP hangs (collective mismatches), PP hangs typically involve point-to-point operations.
Fix: Ensure all stages agree on the number of microbatches and the sequence of send/recv calls for each microbatch.
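If each stage logs its point-to-point calls, the agreement can be checked offline. This is a sketch under assumptions: each stage's log has been reduced to an ordered list of (op, peer, microbatch) tuples with op being "send" or "recv", and unmatched_p2p is a hypothetical helper.

```python
from itertools import zip_longest


def unmatched_p2p(ops_by_stage):
    """Find the first send/recv mismatch on each stage-to-stage link.

    ops_by_stage: {stage: [(op, peer, microbatch), ...]} in call order.
    Returns a list of (src, dst, position, sent_mb, received_mb); a missing
    call on either side is reported as None.
    """
    links = set()
    for stage, ops in ops_by_stage.items():
        for op, peer, _ in ops:
            links.add((stage, peer) if op == "send" else (peer, stage))
    problems = []
    for src, dst in sorted(links):
        sent = [mb for op, peer, mb in ops_by_stage.get(src, [])
                if op == "send" and peer == dst]
        received = [mb for op, peer, mb in ops_by_stage.get(dst, [])
                    if op == "recv" and peer == src]
        for position, (s, r) in enumerate(zip_longest(sent, received)):
            if s != r:
                problems.append((src, dst, position, s, r))
                break
    return problems
```

An entry with received_mb equal to None points at a send that the next stage never receives, which is the blocking pattern described above.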
Step 6: Verify the Fix
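Run the failing test multiple times to confirm the fix is stable. Intermittent hangs require many runs: a test that hung ~30% of the time needs at least 10 clean passes, because 10 consecutive passes would happen by chance only about 3% of the time (0.7^10 ≈ 0.028) if the bug were still present.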
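The number of clean passes needed for other hang rates follows from the same reasoning: if the bug were still present with hang rate p, n consecutive clean runs would occur with probability (1 - p)^n. A small sketch, assuming runs are independent; clean_runs_needed is a hypothetical helper.

```python
import math


def clean_runs_needed(hang_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that (1 - hang_rate) ** n <= 1 - confidence."""
    if not 0 < hang_rate < 1:
        raise ValueError("hang_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - hang_rate))
```

For a 30% hang rate this gives 9 runs at 95% confidence and 10 at 97%; a rarer hang at 5% needs 59 clean runs for 95% confidence, which is why low-frequency hangs are expensive to rule out.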
Quick Reference
| Technique | When to Use |
|---|---|
| py-spy dump | First step: see where each rank is stuck |
| NCCL_DEBUG=INFO | Identify which collective and sizes |
| CUDA coredump + cuda-gdb | See which GPU kernel is blocked |
| Per-rank log files | Compare rank states over time |
| Hash of tensors | Efficiently compare large tensors across ranks |
| diff on extracted events | Find the exact step of divergence |
| broadcast(result, src=0) | Fix floating-point or sampling non-determinism |