name: debug-agent-locate-kernel
description: Identify which GPU kernel is faulting/hanging in ATOM via rocm-debug-agent (for faults/asserts) or rocgdb (for silent livelocks). debug-agent dumps wave registers + faulting PC + (with --save-code-objects) disassembled code object on memory faults / ASSERT_TRAP. rocgdb attaches to a live process and lists in-flight info dispatches + HSA info queues — works when the kernel isn't faulting but just stuck (e.g. atomic-counter deadlock). Use when: server crashes with "Memory access fault by GPU node-N", server hangs with GPU at 100% but no token output, kernel asserting s_trap, or HIP_LAUNCH_BLOCKING=1 makes a hang vanish. Do NOT use for: numerical bugs (use dump-bisect-debug), compile errors, OOM.
version: 1.2.0
scope: ATOM on AMD ROCm (debug-agent at /opt/rocm/lib/librocm-debug-agent.so.2; rocgdb at /opt/rocm/bin/rocgdb)
last_updated: 2026-05-20
Tool selection: debug-agent vs rocgdb
| Symptom | Use first |
|---|---|
| Memory access fault by GPU node-N / MEMORY_VIOLATION / ASSERT_TRAP in log | debug-agent: only it dumps wave regs + faulting PC + code-object disassembly |
| Silent livelock (GPU 100%, no fault, no log output) | rocgdb — debug-agent never fires (no trap event); rocgdb's info dispatches lists in-flight kernels directly |
| HIP_LAUNCH_BLOCKING=1 makes hang disappear (async race) | rocgdb first to name the stuck kernel(s); debug-agent only if you need wave-level detail later |
| Need disassembly / per-lane register values | debug-agent (rocgdb doesn't go to that depth on AMD) |
The two tools cannot be combined. debug-agent loads HSA_TOOLS_LIB=librocm-debug-agent.so.2 which occupies the HSA debugger hook — rocgdb attached to the same process reports "No queues / No dispatches are currently active" because the agent has the slot. To use rocgdb, launch the server with plain start_atom_server.sh (no run_debug_agent.sh wrapper).
When to use
Symptoms that point to this skill:
- `Memory access fault by GPU node-N (Agent handle: 0x...) on address 0x...` in `atom_server.log`
- Server alive (`curl /v1/models` returns) but `rocm-smi --showuse` shows 100% GPU and no `Engine Core: output send` for >30s: silent kernel hang / livelock
- Workers stuck at `torch.cuda.synchronize()` per `py-spy dump --pid <rank-pid>`: the prior kernel never completes
- `HIP_LAUNCH_BLOCKING=1` makes the bug disappear → you have an async race; the agent will tell you which kernel
- Reproduces only at certain batch shapes (e.g. MTP-3 + long prefill)
Do NOT use this skill for: precision bugs (use [[dump-bisect-debug]]), build/compile errors (use /build-fix), OOM.
Required tools (verify before starting)
ls /opt/rocm/lib/librocm-debug-agent.so.2 # debug-agent (fault path)
ls /opt/rocm/bin/rocgdb # rocgdb (livelock path)
ls /opt/rocm/llvm/bin/llvm-objdump # for disassembling code objects
which py-spy && py-spy --version # Python stacks of stuck workers
If any are missing, install rocm-debug-agent, rocgdb, and llvm, and run pip install py-spy. Stop here if they cannot be installed.
Critical pre-flight
1. `ulimit -c 0`: disables gpucore dumps. A ROCm fault dumps gpucore files of 30-50 GB each per rank; on 8-GPU TP this fills the disk in seconds. The launcher script sets this for you.
2. `--enforce-eager` / `--level 0`: optional fallbacks, not required. Try the default launch first; the debug agent runs fine under hipgraph in most cases. Only reach for these flags when symptoms point at graph mode:
   - `--enforce-eager` disables CUDAGraph capture. Try this when the agent reports faults that don't reproduce in eager mode, or when capture/replay itself crashes under the agent's no-caching-allocator behavior.
   - `--level 0` disables Inductor. Try this on AMD when you hit the `cluster_dims` autotune bug or other Inductor-side crashes during warmup.
   - They are independent; apply only the one(s) the symptom points at. The launcher script does NOT inject either; pass them via `EXTRA_ARGS` when you want them.
3. Clean GPU state: kill any prior `spawn_main` / `openai_server` processes. Stale KFD process entries (`rocm-smi --showpids` showing UNKNOWN PIDs holding VRAM) cause the next launch to OOM at the NCCL barrier. `scripts/start_atom_server.sh` does the standard cleanup.
4. Model-specific env: pass on the command line or export before calling. Examples:
   - V4-Pro requires `ATOM_USE_TRITON_MOE=1`
   - Kimi-K2.5-MXFP4 requires `--trust-remote-code` + `HSA_NO_SCRATCH_RECLAIM=1`
Launcher scripts (in repo)
| Script | Purpose |
|---|---|
| scripts/start_atom_server.sh [MODEL] [TP] [PORT] [EXTRA_ARGS...] | Standard launcher: clears GPU, clears compile cache, backgrounds server, redirects to atom_server.log. |
| scripts/stop_atom_server.sh | SIGTERM atom.entrypoints, force-kill spawn workers, wait for VRAM release. |
| scripts/run_debug_agent.sh [MODEL] [TP] [PORT] [EXTRA_ARGS...] | Wraps start_atom_server.sh with HSA_TOOLS_LIB=librocm-debug-agent.so.2 + --save-code-objects. Server output goes to atom_server.log; code objects land in /app/logs_claude/debug_run/. |
| scripts/run_debug_agent.sh --simple [MODEL] [TP] [EXTRA_ARGS...] | Same wrapper but invokes start_simple_inference.sh (offline, no port). Default log: /app/logs_claude/simple_inference_debug_agent.log (override via LOG_FILE=). Use for offline batch repros (e.g. V4 MTP-3 prefill hang). |
| scripts/wait_server_ready.sh [PORT] [MAX_MIN] [POLL] [LOG_FILE] | Poll /v1/models until ready or startup error detected. Allow MAX_MIN ≥ 5 under the agent (3-5× slower than normal). |
Workflow
Step 1: Reproduce under the agent
bash scripts/stop_atom_server.sh # ensure clean
ATOM_USE_TRITON_MOE=1 \
bash scripts/run_debug_agent.sh \
/data/DeepSeek-V4-Pro 8 8000 \
--method mtp --num-speculative-tokens 3 &
# If launch fails / faults look graph-mode-specific, retry with
# `--enforce-eager` (and `--level 0` on AMD for the Inductor cluster_dims bug)
# appended to EXTRA_ARGS.
bash scripts/wait_server_ready.sh 8000 5 30 # 2-4 min under agent
cd /app/logs_claude && python <repro_script>.py # smallest hang trigger
Server load is 3-5× slower under the debug agent. Expect ready at 2-4 min, repro at 30-90s after first big batch.
Step 2: Find the fault wave dump
grep -E "stopped, reason|Memory access fault|MEMORY_VIOLATION|Disassembly" \
/app/logs_claude/atom_server.log | head -20
Each fault produces a block like:
wave_27876: pc=0x7f20f5e534c4 (kernel_code_entry=0x7f20f5e52900 <FQN OF KERNEL>) (stopped, reason: <REASON>)
scalar registers: ...
vector registers: ... ← v0..v? show per-lane values; v6 often holds index values being processed
trap registers: ...
general registers: pc=...
Disassembly for function <FQN>:
code object: memory://<pid>#offset=<hex>&size=<bytes>
loaded at: [<base>-<top>]
=> <pc>: <faulting instruction>
The <FQN> is the demangled kernel name. That's the suspect kernel. Common cases:
| Kernel name fragment | What it actually is |
|---|---|
| at::native::index_copy_kernel_impl<OpaqueType<N>> | Tensor.index_copy_(dim, idx, src) for a dtype of N bytes (4 = int32/float32, 8 = int64/float64) |
| at::native::scatter_kernel | Tensor.scatter_(dim, idx, src) |
| at::native::index_kernel_impl | Advanced indexing READ tensor[idx] |
| _swa_write_kernel / _update_compressor_states_kernel | ATOM Triton kernel; find the name in state_writes.py |
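The wave header line from Step 2 can also be pulled out of the log programmatically. The following is a minimal sketch, assuming the header keeps the exact layout shown above; `parse_wave_faults`, `WaveFault`, and the abbreviated `SAMPLE_LINE` are illustrative names and data, not part of the agent or of ATOM.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Matches the wave header line of the dump. The kernel name is matched greedily
# because demangled C++ names contain nested <...> and (...) themselves.
WAVE_RE = re.compile(
    r"^wave_(?P<wave>\d+): pc=(?P<pc>0x[0-9a-fA-F]+) "
    r"\(kernel_code_entry=(?P<entry>0x[0-9a-fA-F]+) <(?P<kernel>.*)>\) "
    r"\(stopped, reason: (?P<reason>[^)]+)\)"
)


class WaveFault(NamedTuple):
    wave: int
    pc: int
    kernel: str
    reason: str
    pc_offset: int  # pc - kernel_code_entry, i.e. the <+NNNN> in the disassembly


def parse_wave_faults(lines: Iterable[str]) -> Iterator[WaveFault]:
    """Yield one WaveFault per wave header line; all other lines are skipped."""
    for line in lines:
        m = WAVE_RE.match(line.strip())
        if m is None:
            continue
        pc = int(m["pc"], 16)
        entry = int(m["entry"], 16)
        yield WaveFault(int(m["wave"]), pc, m["kernel"], m["reason"], pc - entry)


# Abbreviated header (kernel name shortened for illustration only).
SAMPLE_LINE = (
    "wave_27876: pc=0x7f20f5e534c4 (kernel_code_entry=0x7f20f5e52900 "
    "<void at::native::index_copy_kernel_impl<at::native::OpaqueType<4> >(long)>) "
    "(stopped, reason: ASSERT_TRAP)"
)

if __name__ == "__main__":
    for fault in parse_wave_faults([SAMPLE_LINE]):
        print(fault.reason, fault.kernel, f"+{fault.pc_offset}")
```

Feeding it the lines of atom_server.log gives one record per faulting wave, which makes it easy to see whether every wave stopped in the same kernel and at the same offset.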
Step 3: Read the trap reason
| reason | what it means |
|---|---|
| ASSERT_TRAP | Kernel hit s_trap 2: almost always a CUDA_KERNEL_ASSERT(...) that failed device-side. For PyTorch index_copy_/scatter_ this is the bounds check 0 <= idx < self.size(dim). Recompiling PyTorch with TORCH_USE_HIP_DSA=1 yields the assert text, but such a build is usually unavailable; infer the check from the kernel name instead. |
| MEMORY_VIOLATION | Real OOB load/store. The pc instruction is the access; back-trace the address from s_*/v_* registers. |
| INVALID_OPCODE | Corrupted code object, usually an allocator stomp on the kernel binary (very rare). |
Step 4: Disassemble the code object
The trap dump points to code object: memory://<pid>#offset=<hex>&size=<bytes>. The agent saved it under /app/logs_claude/debug_run/. Find it:
ls /app/logs_claude/debug_run/ | grep "<pid>" | grep "size_<bytes>"
# Returns e.g.: 7_memory___2188702_offset_0x546c3060_size_4026672
Disassemble:
/opt/rocm/llvm/bin/llvm-objdump --disassemble-all \
/app/logs_claude/debug_run/<file> > /app/logs_claude/fault.s
grep -nE "<faulting-pc-low-bits>|s_trap|s_endpgm" /app/logs_claude/fault.s | head -20
The PC's surrounding instructions tell you what the kernel was doing. For s_trap 2 followed by s_endpgm you've confirmed an assert (PyTorch CUDA_KERNEL_ASSERT lowering). For random other instructions it's a true memory violation — read the address from registers (e.g. v[0:1] typically holds the destination address being stored).
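Matching the `code object:` URI to its saved file can be scripted. This is a rough sketch that assumes the naming convention inferred from the single example above (a numeric prefix followed by `memory___<pid>_offset_<hex>_size_<bytes>`); the function names are hypothetical.

```python
import re
from typing import Iterable, List

URI_RE = re.compile(
    r"memory://(?P<pid>\d+)#offset=(?P<offset>0x[0-9a-fA-F]+)&size=(?P<size>\d+)"
)


def saved_name_suffix(uri: str) -> str:
    """Turn a 'memory://<pid>#offset=<hex>&size=<bytes>' URI into the file-name suffix."""
    m = URI_RE.search(uri)
    if m is None:
        raise ValueError(f"not a memory:// code object URI: {uri!r}")
    return f"memory___{m['pid']}_offset_{m['offset']}_size_{m['size']}"


def match_saved_code_objects(uri: str, file_names: Iterable[str]) -> List[str]:
    """Return the file names that end with the suffix derived from the URI."""
    suffix = saved_name_suffix(uri)
    return sorted(name for name in file_names if name.endswith(suffix))


if __name__ == "__main__":
    uri = "code object: memory://2188702#offset=0x546c3060&size=4026672"
    print(saved_name_suffix(uri))
```

In practice the candidate names would come from `os.listdir("/app/logs_claude/debug_run")`; matching on the suffix sidesteps the leading index, whose meaning this document does not specify.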
Step 5: Verify it's actually that kernel (PC can lie)
Wave debugger PC reports can be off when the wave is mid-flight or when the trap fires from a sibling wave. Especially common with Triton — a swa_write trap might be a downstream kernel's fault attributed back. Cross-check:
- Does the trap reproduce only when this code path runs? Disable the call (comment out in Python), retest.
- Does `HIP_LAUNCH_BLOCKING=1` make it disappear? Then it's an async race, not a static OOB; the PC kernel is the victim, not necessarily the root cause. Bisect for the racer (next step).
- Does inserting `torch.cuda.synchronize()` right before this kernel call eliminate the trap? Then the root cause is upstream of this point on the same stream.
Step 6: Bisect the racer (when PC is racer-victim)
- Comment out one suspect call at a time. The one whose absence fixes it is the racer (or one of the racing parties).
- If neither call fails on its own but the two together do: the race is between them sharing storage / launch slot. Add `torch.cuda.synchronize()` between them as a workaround, but THIS IS NOT A SHIPPABLE FIX (see Step 7).
- `py-spy dump --pid <rank-pid>` on stuck ranks: shows the Python frame waiting on the GPU. If it's at your inserted `synchronize()`, the racer is upstream of that line.
Step 7: Real fix vs workaround
Per [[atom-patterns]] / DeepSeek V4 guidance, do not ship cuda.synchronize() workarounds without root-causing the race — they mask one workload and surface a worse hang on a larger one. Common real fixes:
| Symptom | Real fix |
|---|---|
| Race involves freshly-allocated transient tensors (e.g. from torch.where, arange, .reshape, .to(int64)) | Pre-allocate them in _alloc_v4_metadata_buffers (ATOM) or as module-level scratch. Eliminates allocator churn entirely. |
| Multiple index_copy_ / scatter_ in sequence | Replace with a single Triton kernel that writes all destinations once. |
| Per-fwd kernel reads stale forward_vars from prior fwd | Switch H2D path off prep_stream to default stream (matches ATOM prepare_mtp_decode v2 pattern). |
| Cross-rank inconsistency causes one rank to OOB | Ensure all ranks see identical batch shapes before launching kernel; check cu_seqlens_q / state_slot_mapping parity. |
rocgdb workflow (for silent livelocks — when debug-agent gives no wave dump)
debug-agent only fires on MEMORY_VIOLATION / ASSERT_TRAP etc. — for a silent livelock (GPU stuck at 100% with no kernel making progress, no fault, no log), it sits idle and gives you nothing. rocgdb fills that gap: attached to a live worker, it can enumerate in-flight HSA dispatches and queue head/tail pointers, naming the stuck kernel directly.
Pre-flight (rocgdb only)
which rocgdb # /opt/rocm/bin/rocgdb
rocgdb --version | head -3 # confirm GNU gdb 16.x rocm-rel
The "Symbol PySlice_Type has different size" warnings on attach are benign — Python symbol size mismatch between rocgdb's bundled Python and the venv. Wave-debug commands still work.
Step R1: Launch WITHOUT debug-agent
Use plain start_atom_server.sh — NOT run_debug_agent.sh. debug-agent's HSA_TOOLS_LIB occupies the HSA debugger hook and rocgdb will report "No agents / No dispatches / No queues are currently active" because the agent has the slot. Run only ONE of the two tools at a time on the same process.
bash scripts/stop_atom_server.sh
<MODEL_ENV> bash scripts/start_atom_server.sh <MODEL> <TP> <PORT> <EXTRA_ARGS...> &
bash scripts/wait_server_ready.sh <PORT> 10 5 /app/logs_claude/atom_server.log
<run workload that triggers the hang>
Step R2: Pick the right worker PID (NOT the dispatcher)
ATOM at TP=N has 1 openai_server + 1 spawn dispatcher + N spawn workers. Only the workers hold GPU queues — attaching to the dispatcher returns "No dispatches" (it has no GPU work).
# Process tree: dispatcher has PPID = openai_server; workers have PPID = dispatcher
ps -ef | grep spawn_main | grep -v grep
# Workers' PPID equals the dispatcher PID and they sit ~99% CPU during forward;
# the dispatcher itself shows lower CPU. Pick any worker.
WORKER_PID=<one of the worker PIDs>
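The dispatcher/worker split above follows from the PPID relationships alone, so it can be automated. A minimal sketch, assuming `ps -eo pid,ppid,args` style input and that every relevant process has `spawn_main` in its command line; the function name and the sample command lines are made up for illustration.

```python
from typing import Iterable, List, Tuple


def split_dispatcher_and_workers(ps_lines: Iterable[str]) -> Tuple[List[int], List[int]]:
    """Classify spawn_main processes from 'pid ppid args' lines.

    A worker is a spawn_main process whose parent is also a spawn_main process;
    a dispatcher is a spawn_main process whose parent is something else.
    """
    parent_of = {}
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # header row or malformed line
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        if "spawn_main" in args:
            parent_of[pid] = ppid
    workers = sorted(pid for pid, ppid in parent_of.items() if ppid in parent_of)
    dispatchers = sorted(pid for pid, ppid in parent_of.items() if ppid not in parent_of)
    return dispatchers, workers


if __name__ == "__main__":
    # Hypothetical process table for TP=2.
    sample = [
        "  PID  PPID COMMAND",
        "  100     1 python -m atom.entrypoints.openai_server",
        "  200   100 python -c from multiprocessing.spawn import spawn_main; spawn_main(...)",
        "  301   200 python -c from multiprocessing.spawn import spawn_main; spawn_main(...)",
        "  302   200 python -c from multiprocessing.spawn import spawn_main; spawn_main(...)",
    ]
    print(split_dispatcher_and_workers(sample))
```

Any PID in the second list is a valid `WORKER_PID`; the first list is what not to attach to.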
Step R3: Dump GPU state non-interactively
cat > /tmp/rocgdb_cmds.txt <<'EOF'
set pagination off
set confirm off
set logging file /app/logs_claude/rocgdb_dump.txt
set logging overwrite on
set logging enabled on
echo === info agents ===\n
info agents
echo \n=== info dispatches ===\n
info dispatches
echo \n=== info queues ===\n
info queues
echo \n=== main thread bt ===\n
bt 30
detach
quit
EOF
timeout 90 rocgdb -p $WORKER_PID -x /tmp/rocgdb_cmds.txt -batch
detach (not just quit) is required or the worker stays SIGSTOP'd after rocgdb exits — kills your repro and leaves zombies.
Step R3.5: Anchor on the stuck kernel name + PC — BEFORE any theory
This is the single most important step, and the easiest to skip. info dispatches already prints the demangled kernel name of the in-flight dispatch, and the AMDGPU wave backtrace prints the exact stuck PC inside that kernel:
# AMDGPU waves show up as rocgdb "threads"; each prints its kernel + PC:
timeout 90 rocgdb -p $WORKER_PID -batch \
-ex "set pagination off" -ex "set confirm off" -ex "info threads" \
| grep -iE "AMDGPU Wave|in void|ncclDev|aiter::" | head
# -> e.g. #0 0x...b3c in void aiter::allgather_vec<bf16,8>(...) <-- kernel
# all waves at the same PC + threadIdx 0..ngpus-1 = spinning in a barrier
Then map that PC to a source location inside the kernel (which loop/barrier), and READ that source, before forming any hypothesis:
# pick one wave thread id from `info threads`, then:
rocgdb -p $WORKER_PID -batch -ex "thread <id>" -ex "info line *\$pc" -ex "bt"
# no line info? disassemble the kernel and locate the PC offset (kernel+NNNN):
/opt/rocm/llvm/bin/llvm-objdump -d <module>.so | less # find the spin loop (s_cbranch back to s_load/atomic)
Let the kernel name + PC drive the investigation — not a narrative. The name tells you the exact source file; the PC tells you the exact line. A collective kernel (ncclDevKernel_*runRing, aiter::allgather_vec, reduce_scatter_*) stuck with all sync-lane waves (threadIdx < ngpus) at one PC = a cross-rank barrier spin (start_sync/end_sync while(flag < ...)): some rank never wrote the expected flag. Open that kernel's start_sync/end_sync and reason about why a peer's flag is missing (grid size differs across ranks, flag counter desynced from unequal call counts, e.g. TBO uneven ubatch splits) — do NOT guess at unrelated fixes (a missing end_sync cannot be the cause when the wave is spinning in start_sync, which runs first).
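The "all waves at the same PC" check can be made mechanical with a PC histogram over the `info threads` output. This is only a sketch: it assumes each wave line contains the text `AMDGPU Wave` and that the first hex literal on the line is the PC, which may not hold for every rocgdb version or when line info is available. The 0.9 threshold and the function names are arbitrary choices for illustration.

```python
import re
from collections import Counter
from typing import Iterable

PC_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")


def wave_pc_histogram(info_threads_lines: Iterable[str]) -> Counter:
    """Count how many AMDGPU waves sit at each PC; CPU threads are ignored."""
    counts: Counter = Counter()
    for line in info_threads_lines:
        if "AMDGPU Wave" not in line:
            continue
        m = PC_RE.search(line)
        if m is not None:
            counts[int(m.group(0), 16)] += 1
    return counts


def looks_like_barrier_spin(counts: Counter, min_share: float = 0.9) -> bool:
    """True when at least min_share of the waves are parked at one PC."""
    total = sum(counts.values())
    if total == 0:
        return False
    _, top = counts.most_common(1)[0]
    return top / total >= min_share
```

A single dominant PC is only a hint that the waves are spinning; the source at that PC still has to be read to confirm it is a barrier loop.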
Step R4: Read the dump
/app/logs_claude/rocgdb_dump.txt contains four sections:
| Section | What to read |
|---|---|
| info agents | One row per GPU (8 for TP=8). Confirms rocgdb sees the HSA runtime. If empty → debug-agent is still loaded, restart without it. |
| info dispatches | The smoking gun. Each in-flight kernel: dispatch ID, grid, workgroup, fence, demangled kernel name. Two-or-more dispatches active = concurrent streams. |
| info queues | HSA queue table with Read and Write pointers per queue. Write > Read = packets pending; queue is stalled if the head dispatch never completes. Type DMA queues handle memcpy, HSA queues handle kernel launches. |
| bt | Python main thread's C stack. Look for hipMemcpyAsync → memcpy_and_sync → _local_scalar_dense_cuda to confirm .item() is blocked waiting for GPU. |
Step R5: Cross-reference kernel name → source
The demangled name in info dispatches is the AITER (or PyTorch) kernel symbol. For AITER ASM-precompiled kernels, grep the kernel name across aiter/aiter/ops/ to find the Python wrapper, then check whatever singleton workspace / semaphore the wrapper allocates — that is the most common shared resource a cross-stream race fights over.
Step R6: Tell-tale of the shared-workspace deadlock class
Two or more _clean-suffixed dispatches on different queues, each with Fence: B|Aa|Ra (full memory fence) = the classic shared-workspace race. Split-K GEMMs use a reduction phase that atomic-increments a counter in a per-process workspace; if that workspace is a singleton (e.g. @functools.lru_cache(maxsize=1) over device only) and two splitk kernels run concurrently on different streams, their counters interleave → neither hits its expected count → both deadlock.
Fix shape (general): make the workspace cache stream-keyed (e.g. lru_cache over (device, stream_id)) so each stream gets its own counter. Workaround shape: serialize the streams (current_stream.wait_stream(other_stream)) — masks the bug for one workload but resurfaces on a larger one; not shippable per [[atom-patterns]].
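The difference between the singleton cache and the stream-keyed cache can be shown without a GPU. The sketch below uses a plain Python object as a stand-in for the device workspace and a simple interleaving loop as a stand-in for two concurrent split-K reductions; `Workspace`, `get_workspace_singleton`, `get_workspace`, and `interleaved_final_counts` are hypothetical names, not AITER APIs.

```python
import functools


class Workspace:
    """Stand-in for a device buffer holding a split-K reduction counter."""

    def __init__(self, device, stream_id):
        self.device = device
        self.stream_id = stream_id
        self.counter = 0


@functools.lru_cache(maxsize=1)
def get_workspace_singleton(device):
    # Bug shape: keyed on device only, so every stream shares one counter.
    return Workspace(device, stream_id=None)


@functools.lru_cache(maxsize=None)
def get_workspace(device, stream_id):
    # Fix shape: keyed on (device, stream_id), one counter per stream.
    return Workspace(device, stream_id)


def interleaved_final_counts(ws_a, ws_b, blocks_per_kernel):
    """Interleave block arrivals of two kernels and return the counter value
    that each kernel's last block observes."""
    seen_a = seen_b = 0
    for _ in range(blocks_per_kernel):
        ws_a.counter += 1
        seen_a = ws_a.counter
        ws_b.counter += 1
        seen_b = ws_b.counter
    return seen_a, seen_b


if __name__ == "__main__":
    shared = get_workspace_singleton(0)
    print(interleaved_final_counts(shared, shared, 4))  # neither kernel observes 4
    print(interleaved_final_counts(get_workspace(0, 1), get_workspace(0, 2), 4))
```

With the shared workspace neither kernel's last block sees the expected count, which is the deadlock condition described above. In real PyTorch code the stream key could plausibly be `torch.cuda.current_stream().cuda_stream`, but that wiring is an assumption and depends on how the wrapper obtains its workspace.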
rocgdb anti-patterns
- Do Step R3.5 before any hypothesis. Anchor on the kernel name (`info dispatches`) + wave PC, map PC → source line, read it. Don't theorize or chase gpucore/debug-agent/other files before the PC is located.
- Don't fix before locating the PC. Spinning in `start_sync` ⇒ `end_sync` (and everything after it) is off the deadlock path; changing it is a wasted edit→rebuild→retest.
- Different stuck collectives across runs/ranks = one race, not "just waiting for a dead rank". Open the named kernel anyway.
- Don't attach to the dispatcher (PPID = openai_server). It has no GPU queues; you'll get "No dispatches" and waste 90s on the timeout.
- Don't combine debug-agent + rocgdb on the same process. The debug-agent's HSA tool hook shadows rocgdb's queue/dispatch visibility — you'll see agents but no dispatches.
- Don't run rocgdb interactively when the worker is in HSA wait: it can take 30+ seconds to attach, and an accidental `^C` SIGSTOPs the worker permanently. Use `-x scriptfile -batch` with `detach` before `quit`.
- Don't trust `info threads`' Python frame names: rocgdb's Python integration doesn't speak the venv ABI. Use `py-spy dump --pid $WORKER_PID` in parallel for the Python-side stack.
Recovery checklist (after agent run)
1. `bash scripts/stop_atom_server.sh`: the agent leaves zombie KFD entries; if you skip this, the next launch OOMs at the NCCL barrier.
2. `pkill -9 -f spawn_main`: `stop_atom_server.sh` sometimes misses workers stuck in a fault state.
3. Wait 10s, then run `rocm-smi --showmemuse`: all GPUs must show 0% before relaunching. If not, run `rocm-smi --showpids` to find lingering UNKNOWN PIDs (killed, but KFD hasn't cleaned up yet; wait or escalate).
4. `rm /app/logs_claude/debug_run/*memory_*`: code objects are 4 MB each and accumulate fast across runs. The leading `*` is needed because saved files carry a numeric prefix (e.g. `7_memory___...`).
5. Drop `--save-code-objects` from production launches: disk pressure (~500 MB per run).
Anti-patterns
- Don't assume `--enforce-eager --level 0` is mandatory. Default launch is fine for most agent runs; reach for these flags only when symptoms point at hipgraph or Inductor (see pre-flight item 2). Adding them blindly hides graph-mode-only bugs.
- Don't grep `atom_server.log` for "error" or "Traceback": the agent's wave dump has neither; grep `"stopped, reason"` instead.
- Don't trust the PC literally (see Step 5). Triton kernels especially are notorious for cross-wave PC misattribution. Bisect-confirm.
- Don't leave `--save-code-objects` on for routine runs: each run dumps ~500 MB. Only enable it for the bisect run.
- Don't add `torch.cuda.synchronize()` "fixes" and ship: they mask the race for one workload and surface a worse hang (livelock) on a larger one. Find the allocator/stream root cause.
Sample wave dump (what to expect in atom_server.log)
Trimmed example. Key fields are the demangled function name in kernel_code_entry=...,
the stopped, reason: ... tag, the code object: memory://<pid>#offset=<hex>&size=<bytes>
line that points at the saved file, and the => <pc>: <instruction> arrow showing
the faulting PC. Vector registers (only v0/v1/v6 shown — full dump prints v0..v15
and beyond) often reveal address / index values that pin down the operand.
[atom 15:31:09] Scheduled prefill batch: 19 reqs, 9573 tokens, req_ids: (1, 2, ..., 19)
... (some [aiter] type-hints chatter, then the agent's wave dump arrives) ...
--------------------------------------------------------
wave_27876: pc=0x7f20f5e534c4 (kernel_code_entry=0x7f20f5e52900 <void at::native::index_elementwise_kernel<128, 4, at::native::index_copy_kernel_impl<at::native::OpaqueType<4> >(at::TensorIterator&, long, long, long)::{lambda(int)#1}>(long, at::native::index_copy_kernel_impl<at::native::OpaqueType<4> >(at::TensorIterator&, long, long, long)::{lambda(int)#1})>) (stopped, reason: ASSERT_TRAP)
scalar registers:
s0: ffffffff s1: ffffffff s2: 00000000 s3: f8000000
...
s32: 0ec00000 s33: 00000002 ...
system registers:
mode: 000003f0 trapsts: 80000000 status: 80010041
trap registers:
ttmp4: 00006ce4 ttmp5: 00000000 ...
vector registers:
v0: [0] 95f02814 [1] 95f02818 [2] 95f0281c [3] 95f02820 ... [58] 95f028fc [59] 00000a40 ...
v1: [0] 00007f20 [1] 00007f20 ... ← v0:v1 = per-lane dst address
v6: [0] 000080a6 [1] 000080a7 ... [58] 000080e0 [59] 000027ec ← per-lane src VALUES being stored
general registers:
m0: 000103c0
pc: 00007f20f5e534c4 exec: f800000000000000
vcc: ffffffffffffffff
Disassembly for function void at::native::index_elementwise_kernel<128, 4, at::native::index_copy_kernel_impl<at::native::OpaqueType<4> >(...)>:
code object: memory://2188702#offset=0x546c3060&size=4026672
loaded at: [0x7f20f5e00000-0x7f20f615ff09]
=> 0x7f20f5e534c4 <+3012>: s_endpgm
0x7f20f5e534c8 <+3016>: v_cndmask_b32_e32 v0, s0, v0, vcc
How to read this:
- Kernel = `at::native::index_copy_kernel_impl<OpaqueType<4>>` → PyTorch `Tensor.index_copy_(dim, idx, src)` for a 4-byte dtype (int32 / float32).
- Reason = `ASSERT_TRAP` → some lane's `index_value` failed `0 <= idx < self.size(dim)`. Look at the `v6` per-lane values to see what was being processed (if v6 holds the stored value here; the relevant register varies by kernel).
- PC lands on `s_endpgm` because the assert lowering is `s_trap 2; s_endpgm`: the actual condition test was earlier (look ~10 instructions back in the disassembly for `s_cbranch_*` + `s_trap`).
- Code object at `memory://2188702#offset=0x546c3060&size=4026672` → saved as `7_memory___2188702_offset_0x546c3060_size_4026672` in `/app/logs_claude/debug_run/`. Use `llvm-objdump --disassemble-all` on it.
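When reading per-lane register values it helps to know which lanes were still active at the stop. The `exec` value in the general registers is a bitmask with one bit per lane; the sketch below decodes it, assuming a 64-lane wave as in this dump. The helper names are illustrative.

```python
from typing import Dict, List


def active_lanes(exec_hex: str, wave_size: int = 64) -> List[int]:
    """Return the lane indices whose bit is set in the EXEC mask."""
    mask = int(exec_hex, 16)
    return [lane for lane in range(wave_size) if (mask >> lane) & 1]


def pick_active(values_by_lane: Dict[int, int], exec_hex: str) -> Dict[int, int]:
    """Keep only the register values that belong to active lanes."""
    lanes = set(active_lanes(exec_hex))
    return {lane: value for lane, value in values_by_lane.items() if lane in lanes}


if __name__ == "__main__":
    # exec value from the sample dump above
    print(active_lanes("f800000000000000"))
```

For the sample dump only the top five lanes are active, which lines up with lane 59 being where the `v0` and `v6` values break from the pattern of the lower lanes.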
Cross-references
- [[dump-bisect-debug]] — for numerical bugs (wrong output, not crashes)
- [[capture-trace]] — for performance investigation (which kernels eat time)
- [[atom-patterns]] — V4 attention buffer / stream conventions