name: rocm-crash-debug description: > Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".
ROCm/HIP Crash Debug Skill for AMD GPUs
Use this when an amdpilot trial fails with a GPU error, OOM kill, or produces wrong results. AMD ROCm crashes often leave minimal diagnostic information. This skill captures structured crash evidence at kernel boundaries so failures become reproducible offline samples.
Why This Exists
Problem: ROCm/HIP errors on MI300X/MI325X/MI355X crash the process before normal debug output is flushed. A trial that exits with code 137 (OOM kill) or produces 0.00 metric tells you nothing about what went wrong. The agent can't learn from a failure it can't observe.
Solution: Instrument kernel call boundaries with pre-execution logging. When a crash occurs, the last logged boundary tells you exactly which op failed and what inputs caused it. Combined with tensor dumps, a crash becomes an offline-analyzable sample.
This approach is adapted from SGLang's @debug_kernel_api decorator (PR #20910), which was
inspired by FlashInfer's API logging. The concepts are backend-agnostic; this skill specializes
them for ROCm/HIP and integrates with amdpilot's trial/experiment infrastructure.
Step 1: Enable Kernel API Logging in Container
Set environment variables when launching the Docker container. These work with SGLang's
existing kernel_api_logging.py infrastructure on ROCm builds.
Level 1 — Function Names Only (minimal overhead)
SGLANG_KERNEL_API_LOGLEVEL=1
SGLANG_KERNEL_API_LOGDEST=stderr
Use this as the default for all amdpilot trials. Overhead is negligible and it captures the last kernel boundary before any crash.
Level 3 — Input Metadata (shapes, dtypes, devices)
SGLANG_KERNEL_API_LOGLEVEL=3
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this when a trial has failed and you need to understand what data caused the crash. Shows tensor shapes, dtypes, device placement, and contiguity.
Level 5 — Tensor Statistics (min/max/mean/NaN/Inf counts)
SGLANG_KERNEL_API_LOGLEVEL=5
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this for numerical correctness issues (NaN/Inf in outputs, wrong results). Adds statistical summaries of every input tensor at each kernel boundary.
Level 10 — Full Crash Dumps (inputs.pt + metadata.json)
SGLANG_KERNEL_API_LOGLEVEL=10
SGLANG_KERNEL_API_DUMP_DIR=/workspace/.amdpilot/crash_dumps
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this for hard-to-reproduce crashes. Saves complete input tensors and metadata before each kernel call. When the process crashes, the last dump directory contains a reproducible snapshot.
Note on HIP Graphs: When HIP_GRAPH_CAPTURE is active, level-10 tensor dumps are
automatically skipped (same as CUDA Graph behavior), but boundary logging at levels 1-5
continues. This prevents dump operations from corrupting graph capture.
Step 2: Inject via amdpilot Container Config
In task.yaml, add the logging environment variables:
container:
env:
SGLANG_KERNEL_API_LOGLEVEL: "3"
SGLANG_KERNEL_API_LOGDEST: "/workspace/.amdpilot/kernel_api.log"
For the orchestrator's Docker container startup, these flow through ContainerConfig.env
to docker run -e flags. No code changes needed — just config.
For amdpilot executor integration, the recommended approach:
- Default: Level 1 on all trials (always capture last boundary)
- After first failure: Supervisor escalates to level 3+ for retry trials
- Known crash patterns: Level 10 for targeted dump collection
Step 3: ROCm-Specific Diagnostic Commands
When a trial fails, collect these AMD-specific diagnostics from inside or outside the container:
GPU State (run from host after container exits)
# GPU memory and utilization at time of failure
rocm-smi --showuse --showmeminfo vram
# Check for GPU hangs or resets
dmesg | grep -i "amdgpu\|drm\|gpu" | tail -20
# Check for Xnack / page fault issues (MI300X/MI355X specific)
dmesg | grep -i "retry fault\|xnack" | tail -10
Container Exit Analysis
# Get exit code and OOM details
docker inspect --format='{{.State.ExitCode}} {{.State.OOMKilled}}' <container>
# Get last N lines of container stderr
docker logs --tail 50 <container> 2>&1
ROCm Error Codes (common on MI300X/MI325X/MI355X)
| Error | Meaning | What to Check |
|---|---|---|
hipErrorIllegalAddress |
Out-of-bounds GPU memory access | Tensor shapes, index bounds, contiguity |
hipErrorAssert |
Device-side assert triggered | Input validation in kernel, index range |
hipErrorOutOfMemory |
GPU VRAM exhaustion | Batch size, model size, KV cache config |
hipErrorLaunchFailure |
Kernel launch failed | Shared memory size, block dimensions |
hipErrorNoBinaryForGpu |
No binary for target GPU arch | Check gfx target matches (gfx942 vs gfx950) |
| Signal 137 | OOM killed by host kernel | Container memory limit, host swap pressure |
| Signal 139 | Segmentation fault | Usually host-side pointer corruption |
rocgdb Quick Attach (for interactive debug)
# Inside container — attach to running process
rocgdb -p <pid>
# Or launch with rocgdb
rocgdb --args python your_script.py
(gdb) catch throw
(gdb) run
Step 4: Multi-GPU / Multi-Process Debugging
MI355X nodes typically run 8 GPUs. Use per-process log files:
SGLANG_KERNEL_API_LOGDEST="/workspace/.amdpilot/kernel_api_rank%i.log"
SGLANG_KERNEL_API_DUMP_DIR="/workspace/.amdpilot/crash_dumps/rank%i"
The %i placeholder is replaced by the process rank. This prevents log interleaving
across TP/DP workers.
For RCCL (ROCm collective communication) hangs:
# Enable RCCL debug logging
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,COLL
# Set timeout for collective operations
NCCL_TIMEOUT=300
Step 5: Filter Dumps to Reduce Disk Usage
Level-10 dumps can consume significant disk space. Filter to specific ops:
# Only dump attention-related ops
SGLANG_KERNEL_API_DUMP_INCLUDE="*attention*,*flash*,*sdpa*"
# Exclude high-frequency trivial ops
SGLANG_KERNEL_API_DUMP_EXCLUDE="*elementwise*,*copy*,*fill*"
Step 6: Common ROCm Crash Patterns on MI300X/MI355X
Pattern 1: gfx950 Binary Mismatch
Symptom: hipErrorNoBinaryForGpu or silent wrong results
Cause: Container built for gfx942 (MI300X) but running on gfx950 (MI355X)
Fix: Verify Docker image target matches hardware
# Check GPU architecture
rocminfo | grep "Name:" | head -1
# Should show: gfx950 for MI355X, gfx942 for MI300X/MI325X
Pattern 2: VRAM Fragmentation OOM
Symptom: Signal 137 or hipErrorOutOfMemory despite sufficient total VRAM
Cause: HBM3e memory fragmentation after multiple allocation/free cycles
Fix: Set PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Pattern 3: Flash Attention FP8 Crash
Symptom: Crash in attention kernel with FP8 quantized models
Cause: FP8 flash attention path not supported on all ROCm versions
Fix: Check env-probe skill output for FP8 flash attn availability; fall back to
non-FP8 attention if flagged
Pattern 4: Triton Kernel Compilation Failure
Symptom: Hang during first forward pass, or ModuleNotFoundError in Triton
Cause: Triton cache corruption or version mismatch
Fix: Clear Triton cache: rm -rf ~/.triton/cache
Pattern 5: RCCL Collective Timeout
Symptom: Process hangs on all_reduce or all_gather, eventually killed
Cause: GPU-to-GPU communication failure, often RDMA fabric issue
Fix: Check rocm-smi --showtopo for link health; set NCCL_TIMEOUT
Step 7: Integration with amdpilot Dashboard
Crash artifacts flow into the dashboard through existing fields:
| Artifact | DB Field | Dashboard View |
|---|---|---|
| Last kernel boundary | trials.failure_reason |
Trial detail panel |
| Input shapes/dtypes | trials.failure_reason (structured) |
Trial detail panel |
| Full crash dump path | events.detail_json |
Trajectory viewer |
| GPU state at crash | events.detail_json |
System info panel |
| Kernel API log file | Agent output (trial_N.txt) |
Agent log viewer |
Post-Trial Crash Collection Script
After a trial fails, run this inside the container to collect structured diagnostics:
#!/bin/bash
# collect_crash_diagnostics.sh
OUT="/workspace/.amdpilot/crash_diagnostics.json"
python3 -c "
import json, os, glob
diag = {
'kernel_api_log': '',
'last_kernel_boundary': '',
'crash_dumps': [],
'gpu_arch': '',
'exit_code': ${EXIT_CODE:-0},
}
# Read kernel API log
log_path = '/workspace/.amdpilot/kernel_api.log'
if os.path.exists(log_path):
with open(log_path) as f:
lines = f.readlines()
diag['kernel_api_log'] = ''.join(lines[-50:])
# Extract last kernel boundary
for line in reversed(lines):
if 'Kernel API Call:' in line:
diag['last_kernel_boundary'] = line.strip()
break
# List crash dumps
dump_dir = '/workspace/.amdpilot/crash_dumps'
if os.path.isdir(dump_dir):
diag['crash_dumps'] = glob.glob(f'{dump_dir}/**/metadata.json', recursive=True)[-5:]
# GPU arch
import subprocess
try:
out = subprocess.check_output(['rocminfo'], text=True, timeout=5)
for line in out.splitlines():
if 'Name:' in line and 'gfx' in line:
diag['gpu_arch'] = line.strip().split()[-1]
break
except Exception:
pass
print(json.dumps(diag, indent=2))
" > "$OUT"
The orchestrator can read this file post-trial and inject into failure_reason and
events.detail_json for dashboard display.
Step 8: Escalation Flow for amdpilot Supervisor
Recommended supervisor behavior when a trial fails:
First failure (any exit code != 0):
- Read kernel API log (level 1 always active)
- Extract last kernel boundary → store in
failure_reason - If signal 137: check container OOM, reduce batch size on retry
- If signal 139: escalate to level 10, retry with crash dumps enabled
Second failure (same error pattern):
- Escalate to level 3 or 5
- Retry with expanded logging
- Collect full container diagnostics post-failure
Third failure (still stuck):
- Enable level 10 with targeted dump filters
- Collect crash_diagnostics.json
- Log structured event for data flywheel collection
This escalation flow produces progressively richer crash evidence without paying the full level-10 cost on every trial.
Integration with Other Skills
- env-probe: Run env-probe first to detect known ROCm gotchas (inductor defaults, hipBLASLt bugs) before they cause crashes
- rocprofv3-profiler: Use profiler skill for performance issues; use this skill for correctness/crash issues
- amd-rocm-porting: When porting CUDA code, enable level 3+ to catch HIP translation errors at kernel boundaries
Validation
This skill has been validated against:
- SGLang sglang-20691 run: 3 trials with OOM kill (signal 137) — level 1 logging would have identified the last kernel boundary before memory exhaustion
- vllm#33123 run: 9+ trials at 0.00 metric — level 3 logging would show whether the test harness is reaching the kernel path at all
- General MI355X container crashes during development: level 10 dumps enabled post-mortem analysis of flash attention FP8 path failures
Adding New Crash Patterns
When you encounter a new ROCm crash pattern:
- Add the error signature to the "Common ROCm Crash Patterns" section
- Add a check to
collect_crash_diagnostics.sh - Document what level of logging is needed to diagnose it
- File the crash dump as a data flywheel sample if the fix is non-trivial