debug-distributed-hang - SKILL.md Agent Skill

name: debug-distributed-hang description: Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

Debugging Distributed Hangs in SGLang

Overview

Hangs in distributed inference happen when ranks diverge in state, causing collective operations (AllGather, AllReduce, Broadcast, Barrier) to deadlock. Common causes:

Size mismatch: ranks pass different tensor sizes to a collective
Branch divergence: one rank enters a collective, another skips it
Cascading state drift: a small non-determinism (e.g., floating-point) propagates into different batch structures
Resource exhaustion: one rank OOMs or crashes, others wait forever

Prerequisites

py-spy: pip install py-spy or system package. Requires root or CAP_SYS_PTRACE to attach to running processes.
cuda-gdb: Ships with the CUDA toolkit. Ensure it's on your PATH.

Step 1: Confirm and Locate the Hang

1a. Watchdog / py-spy

SGLang's watchdog automatically dumps py-spy traces on timeout. Look for:

Scheduler watchdog timeout (self.watchdog_timeout=300, self.soft=False)

The py-spy dump shows the stack trace of each thread. The hanging thread is typically blocked in a CUDA synchronize or NCCL collective:

Thread (active): "MainThread"
    cuStreamSynchronize (libcuda.so)
    ...
    forward_extend (model_runner.py)

SGLang has two watchdog modes (see python/sglang/srt/utils/watchdog.py):

Hard watchdog (soft=False, default): dumps py-spy traces then sends SIGQUIT to kill the parent process.
Soft watchdog (soft=True): only logs the timeout without killing the process, giving you more time to manually attach debuggers or collect coredumps.

If the watchdog doesn't trigger, manually dump:

py-spy dump --pid <scheduler_pid>

1b. NCCL Debug Logging

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

Look for the last collective logged before the hang. Mismatched sizes show up as one rank waiting and another never entering.

1c. CUDA Coredump

When a process hangs, you can trigger a GPU coredump on demand to see which kernel is stuck. Set these env vars before launching:

export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1
export CUDA_COREDUMP_PIPE="/tmp/cuda_pipe_%h_%p"
export CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h_%p"
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory'

While the process is hanging, find the pipe via /proc/<pid>/fd/ and write to it to trigger the dump:

ls /proc/<pid>/fd/ -la 2>/dev/null | grep cuda_pipe
dd if=/dev/zero bs=1M count=1 > /tmp/cuda_pipe_<hostname>_<pid>

Alternatively, if you don't need to keep the process alive, kill -SIGABRT <pid> also triggers a CUDA coredump (but terminates the process).

Then open with cuda-gdb --batch -ex "target cudacore <coredump_file>". On load, it immediately shows which kernel is stuck. For example:

Opening GPU coredump: <coredump_file>
[Current focus set to CUDA kernel 0, grid 622721, cluster (4,0,0), block (16,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
#0  0x00007f8029b2b040 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(512,1,1)>>> ()

This told us the hang was in an NCCL AllGather — not a compute kernel. Combined with the py-spy stack pointing to LogitsProcessor.forward → tensor_model_parallel_all_gather, we knew it was an AllGather size mismatch between TP ranks.

1d. Identify the Collective

From the stack traces and logs, identify:

Which collective hangs (AllGather, AllReduce, Broadcast)
Which code path invokes it (e.g., LogitsProcessor, tensor_model_parallel_all_gather)
Whether it's a size mismatch or a missing participant

Step 2: Per-Rank Logging

The key technique: each rank writes its own log file so you can diff them.

Setup Pattern

import os

_debug_files = {}

def get_debug_file(rank):
    key = f"rank{rank}"
    if key not in _debug_files:
        _debug_files[key] = open(f"/tmp/debug_rank{rank}.log", "w")
    return _debug_files[key]

Gate logging behind an env var to avoid overhead in production. SGLANG_DEBUG_HANG is not a built-in SGLang env var — you need to add this check yourself in the code you're instrumenting:

if os.environ.get("SGLANG_DEBUG_HANG"):
    f = get_debug_file(rank)
    f.write(f"EVENT_NAME key1={val1} key2={val2}\n")
    f.flush()

What to Log

Log structured events at key state-mutation points:

f.write(f"SCHED_BATCH step={step} num_reqs={n} extend_lens={lens}\n")
f.write(f"VERIFY predict_hash={hash} accept_len={alen}\n")
f.write(f"CACHE_INSERT rid={rid} num_tokens={n}\n")

Use consistent event names (uppercase prefix) for easy grep/diff.

Hash Large Tensors

For tensor values, compute a hash instead of dumping raw data:

import hashlib
h = hashlib.md5(tensor.cpu().numpy().tobytes()).hexdigest()[:8]
f.write(f"LOGITS logits_hash={h}\n")

For token ID lists, str(list).encode() works:

h = hashlib.md5(str(tensor.tolist()).encode()).hexdigest()[:8]

Avoid Implicit Synchronization

tensor.cpu(), tensor.tolist(), and tensor.numpy() all trigger CUDA synchronization. This can:

Change timing and mask or move the hang
Deadlock if the log point is between two collectives that must run back-to-back

Prefer logging values that are already on CPU (e.g., Python ints, list lengths, request IDs). When you must hash a GPU tensor, do it at a point where the GPU is already idle (e.g., between scheduler steps, not inside a model forward pass).

Step 3: Diff to Find the Diverge Point

Basic Diff

# Extract specific event type
grep "^VERIFY" /tmp/debug_rank0.log > /tmp/v_r0.txt
grep "^VERIFY" /tmp/debug_rank1.log > /tmp/v_r1.txt
diff /tmp/v_r0.txt /tmp/v_r1.txt | head -20

Count Events

grep -c "^VERIFY" /tmp/debug_rank*.log

If counts differ, one rank executed more iterations — that's already a diverge signal.

Find First Diverge

The first diff line tells you the exact step where ranks diverge. All lines before it are identical — the root cause is at or before this step.

Step 4: Binary-Search the Root Cause

Once you find the diverging event, trace backwards:

4a. Identify Inputs

For the diverging operation, list all its inputs. Add hash logging for each:

f.write(
    f"OP_INPUTS input_a_hash={h_a} input_b_hash={h_b} "
    f"input_c_hash={h_c} input_d_hash={h_d}\n"
)

4b. Diff Inputs Across Ranks

Compare the hashes. Some inputs will match, some won't. The non-matching input is where divergence entered.

4c. Recurse

For the non-matching input, trace where it was produced and repeat: hash its inputs, diff across ranks, find the divergent one. Continue until you reach the root cause.

Step 5: Common Root Causes and Fixes

Floating-Point Non-Determinism

Symptom: All "logical" inputs are identical (same logits after all-gather), but derived floating-point values (softmax, probabilities) differ across GPUs.

Example: EAGLE speculative decoding — F.softmax → top_k_renorm_prob → top_p_renorm_prob produces slightly different target_probs on each GPU. The sampling kernel then picks different tokens. These flow into output_ids → radix cache → different prefix match depths → different extend_seq_lens → AllGather size mismatch → hang.

Random Number Divergence

Symptom: Operations using torch.rand produce different values on each rank.

Fix: Generate on rank 0 and broadcast, or use a shared seed.

Conditional Code Paths

Symptom: A condition (e.g., memory check, queue length) evaluates differently on different ranks, causing one rank to enter a collective while another skips it.

Fix: Synchronize the condition value before branching, or restructure to ensure all ranks take the same path.

Pipeline Parallel (PP) Send/Recv Mismatch

Symptom: In PP setups, one stage issues a send that the next stage never recvs (or vice versa), causing both to block indefinitely. Unlike TP hangs (collective mismatches), PP hangs typically involve point-to-point operations.

Fix: Ensure all stages agree on the number of microbatches and the sequence of send/recv calls for each microbatch.

Step 6: Verify the Fix

Run the failing test multiple times to confirm the fix is stable. Intermittent hangs require many runs. A test that hung ~30% of the time needs at least 10 clean passes to be confident.

Quick Reference

Technique	When to Use
py-spy dump	First step — see where each rank is stuck
`NCCL_DEBUG=INFO`	Identify which collective and sizes
CUDA coredump + `cuda-gdb`	See which GPU kernel is blocked
Per-rank log files	Compare rank states over time
Hash of tensors	Efficiently compare large tensors across ranks
`diff` on extracted events	Find the exact step of divergence
`broadcast(result, src=0)`	Fix floating-point or sampling non-determinism