name: rocm-profiler-analysis
description: >
  Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.
ROCm Profiler Analysis
Use this skill when you need to turn a profiling run into structured optimization evidence instead of a raw trace file.
This skill is the AMD/ROCm/MI355X adaptation of SGLang's torch-profiler analysis workflow. It is designed for our current amdpilot stack:
- MI355X / gfx950 nodes
- ROCm 7.2
- SGLang / vLLM issue-driven runs
- dashboard artifacts, not just terminal output
Why This Exists
Raw traces are not good enough for agents or dashboards. They tell you that time was spent somewhere, but they do not directly answer:
- Which kernel families dominate prefill or decode on ROCm?
- Which kernels still have overlap headroom?
- Which hotspots map back to Python or operator-level code paths?
- Which results are actually relevant to gfx950 / MI355X, and which are only generic?
- Which profiling outputs should be written into our canonical experiment/trial schema?
This skill standardizes that path.
Main Workflow
Preserve the same four subcommands as the upstream SGLang profiler skill:
- triage
- breakdown
- overlap
- perfetto-fix
For normal agent use, default to triage.
triage
Use this when you want one compact answer with three main outputs:
- kernel table
- overlap-opportunity table
- fuse-opportunity table
breakdown
Use this when you need one-trace category share analysis without overlap reasoning.
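A minimal sketch of one-trace category share analysis follows. It assumes the trace is in Chrome trace format with GPU kernels recorded as complete events (`ph` equal to `X`, `cat` equal to `kernel`, `dur` in microseconds), which is how torch-profiler exports usually look. The `category_shares` function and its `categorize` callback are hypothetical names for illustration, not part of the upstream skill.

```python
from collections import defaultdict


def category_shares(events, categorize):
    """Return {category: fraction of total GPU kernel time} for one trace."""
    totals = defaultdict(float)
    for ev in events:
        # Keep only complete events that describe GPU kernels.
        if ev.get("ph") != "X" or str(ev.get("cat", "")).lower() != "kernel":
            continue
        totals[categorize(ev.get("name", ""))] += float(ev.get("dur", 0))

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Largest share first, so the table reads top-down.
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {category: duration / grand_total for category, duration in ranked}
```

Passing the classifier in as a callback keeps the share computation independent of whichever ROCm category rules are in use.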
overlap
Use this when you have both:
- a graph-off mapping trace
- a graph-on formal trace
and need to tie overlap headroom back to code paths.
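One way to quantify overlap headroom is sketched below. It assumes headroom means the time a communication kernel runs with no compute kernel running concurrently; that definition is an assumption for illustration, and the upstream heuristics may differ. Intervals are `(start, end)` pairs taken from kernel timestamps, and both helper names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exposed_time(target, cover):
    """Total time inside `target` intervals that no `cover` interval overlaps."""
    cover = merge_intervals(cover)
    exposed = 0.0
    for start, end in merge_intervals(target):
        cursor = start
        for c_start, c_end in cover:
            if c_end <= cursor:
                continue  # cover interval lies entirely before the cursor
            if c_start >= end:
                break  # cover interval lies entirely after the target
            if c_start > cursor:
                exposed += c_start - cursor  # gap before this cover interval
            cursor = max(cursor, c_end)
            if cursor >= end:
                break
        if cursor < end:
            exposed += end - cursor  # uncovered tail of the target
    return exposed
```

Calling `exposed_time(comm_intervals, compute_intervals)` gives the communication time that is not hidden behind compute; a large value suggests room to overlap more.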
perfetto-fix
Use this only when Perfetto renders overlapped lanes incorrectly and you need a repaired trace for human inspection.
Recommended Inputs
This skill supports two input shapes:
Existing trace directory / trace file
- trace.json
- trace.json.gz
- profiler output directory
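The three file-based inputs can be handled by one loader. This is a sketch under two assumptions: trace files end in `.json` or `.json.gz`, and their content is Chrome trace JSON, either an object with a `traceEvents` list or a bare list of events. The function names are hypothetical.

```python
import gzip
import json
from pathlib import Path


def find_trace_files(path):
    """Return the trace files for a single file path or a profiler output directory."""
    path = Path(path)
    if path.is_file():
        return [path]
    return sorted(
        p for p in path.rglob("*")
        if p.is_file() and p.name.endswith((".json", ".json.gz"))
    )


def load_trace_events(trace_path):
    """Load the event list from a plain or gzip-compressed trace file."""
    trace_path = Path(trace_path)
    opener = gzip.open if trace_path.name.endswith(".gz") else open
    with opener(trace_path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data
```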
Live server / live experiment
- trigger profiling against a running SGLang or vLLM server
- then immediately analyze the result and attach artifacts back to the run
For amdpilot integration, prefer the second path for optimization-stage profiling and the first path for post-hoc investigation.
AMD / ROCm Adaptation Rules
1. Use ROCm-native kernel categories
Do not reuse CUDA/H100/B200 assumptions. On our nodes, category tables should explicitly account for ROCm-specific paths:
- RCCL / communication
- Triton kernels
- CK / composable kernel paths
- AITER paths
- hipBLASLt / rocBLAS GEMM
- MIOpen / attention runtime kernels
- quantization
- normalization
- memory / copy / scheduler overhead
See references/rocm-kernel-categories.md.
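As a rough illustration of name-based classification, the sketch below maps kernel names to the categories listed above. Every pattern is an illustrative guess rather than the contents of references/rocm-kernel-categories.md, and rule order decides ties (for example, a Triton normalization kernel lands in "normalization" here because that rule comes first).

```python
import re

# Illustrative, ordered (category, pattern) rules; the first match wins.
# These patterns are assumptions; replace them with the reference list.
CATEGORY_RULES = [
    ("communication", re.compile(r"rccl|nccl|all_?reduce|all_?gather", re.I)),
    ("aiter", re.compile(r"aiter", re.I)),
    ("ck", re.compile(r"ck::|ck_tile", re.I)),
    ("gemm", re.compile(r"^Cijk_|hipblaslt|rocblas|gemm", re.I)),
    ("attention", re.compile(r"attention|attn|fmha|miopen", re.I)),
    ("quantization", re.compile(r"quant|fp8|mxfp4", re.I)),
    ("normalization", re.compile(r"rms_?norm|layer_?norm", re.I)),
    ("memory", re.compile(r"memcpy|memset|copy", re.I)),
    ("triton", re.compile(r"triton", re.I)),
]


def categorize_kernel(name):
    """Return the first matching category for a kernel name, else 'other'."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(name):
            return category
    return "other"
```

Keeping an explicit "other" bucket makes unclassified time visible instead of silently folding it into a neighboring category.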
2. Keep hardware relevance explicit
Every profiling result must declare whether it is truly relevant to our MI355X node:
- observed_arch: actual arch from the run
- arch_match: exact | compatible | unknown
- hardware_relevance_reason: short human-readable explanation
Do not hide gfx942 vs gfx950 differences.
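The three fields can be derived in one place so that the label is never written by hand. The sketch below is an assumption-laden example: `classify_arch` is a hypothetical helper, the set of "compatible" architectures must be supplied by the caller (it defaults to empty, so gfx942 is never silently treated as equivalent), and anything that is neither exact nor declared compatible is reported as unknown with the mismatch spelled out in the reason.

```python
def classify_arch(observed_arch, target_arch="gfx950", compatible_archs=frozenset()):
    """Build the hardware-relevance fields for one profiling run."""
    # Arch strings may carry feature suffixes such as "gfx950:sramecc+:xnack-".
    observed = (observed_arch or "").split(":")[0].strip().lower()
    target = target_arch.lower()

    if not observed:
        match, reason = "unknown", "No architecture was recorded for this run."
    elif observed == target:
        match, reason = "exact", f"Run executed on {observed}, the target architecture."
    elif observed in {a.lower() for a in compatible_archs}:
        match = "compatible"
        reason = f"Run executed on {observed}, declared compatible with {target}."
    else:
        match = "unknown"
        reason = f"Run executed on {observed}; relevance to {target} is not established."

    return {
        "observed_arch": observed or None,
        "arch_match": match,
        "hardware_relevance_reason": reason,
    }
```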
3. Treat profiling as structured artifacts
Do not stop at stdout tables. Write stable artifacts that can be attached to an experiment or trial. Minimum recommended outputs:
- profile_summary.md
- profile_metadata.json
- kernel_table.json
- overlap_opportunities.json
- fuse_opportunities.json
- perfetto_fixed_trace.json (if used)
See references/artifact-contract.md.
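A minimal sketch of writing the minimum artifact set is shown below. The file names come from the list above; the function name, its arguments, and the JSON layout (indented, sorted keys) are assumptions, and the authoritative schema is whatever references/artifact-contract.md specifies. The optional perfetto_fixed_trace.json is left out because it only exists when perfetto-fix was run.

```python
import json
from pathlib import Path


def write_profile_artifacts(out_dir, summary_md, metadata, kernel_table,
                            overlap_opportunities, fuse_opportunities):
    """Write the Markdown summary and JSON tables; return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    summary_path = out_dir / "profile_summary.md"
    summary_path.write_text(summary_md, encoding="utf-8")
    written = [summary_path]

    json_artifacts = {
        "profile_metadata.json": metadata,
        "kernel_table.json": kernel_table,
        "overlap_opportunities.json": overlap_opportunities,
        "fuse_opportunities.json": fuse_opportunities,
    }
    for filename, payload in json_artifacts.items():
        path = out_dir / filename
        # Sorted keys keep the files stable across runs, which helps diffing.
        path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n",
                        encoding="utf-8")
        written.append(path)
    return written
```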
Canonical Metadata Contract
profile_metadata.json should contain enough information to tie profiling results back to the dashboard and DB.
Minimum fields:
- experiment_id
- trial_id
- observed_arch
- arch_match
- hardware_relevance_reason
- rocm_version
- base_image
- resource_class
- gpu_device_ids
- gpu_clocks_mhz
- preflight_passed
- server_flags
- benchmark_config_hash
- model_name
- profile_stage
- source_trace_path
This is the difference between "a useful local notebook" and "a reusable profiling artifact".
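A small check like the following can enforce the minimum fields before an artifact is attached. The field names are the ones listed above; the `validate_metadata` helper and the wording of its messages are hypothetical, and no value types are checked because the source does not define them.

```python
REQUIRED_METADATA_FIELDS = (
    "experiment_id", "trial_id", "observed_arch", "arch_match",
    "hardware_relevance_reason", "rocm_version", "base_image",
    "resource_class", "gpu_device_ids", "gpu_clocks_mhz",
    "preflight_passed", "server_flags", "benchmark_config_hash",
    "model_name", "profile_stage", "source_trace_path",
)
ALLOWED_ARCH_MATCH = ("exact", "compatible", "unknown")


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the minimum contract holds."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_METADATA_FIELDS
        if field not in metadata
    ]
    arch_match = metadata.get("arch_match")
    if "arch_match" in metadata and arch_match not in ALLOWED_ARCH_MATCH:
        problems.append(f"invalid arch_match: {arch_match!r}")
    return problems
```

Returning a list of problems instead of raising lets the caller record every gap in one pass.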
Dashboard / DB Integration
The intended downstream path is:
- run profile
- emit structured artifacts
- attach artifacts to experiment / trial
- surface the summary and tables in the dashboard
This skill should feed:
- experiment detail page profiling section
- trial-level artifact list
- trajectory context for optimization retries
- future data-flywheel / SFT signals
See references/dashboard-integration.md.
MI355X / gfx950 Specific Guidance
On our node, prefer profiling plans that stay grounded in actual machine facts:
- 8x MI355X
- gfx950
- ROCm 7.2
- explicit GPU ID allocation from the experiment
- exact Docker image tag, not just "ROCm 7.2"
When you compare profiles across runs, never compare them without also checking:
- base_image
- resource_class
- gpu_device_ids
- server_flags
- benchmark_config_hash
Otherwise the comparison is not trustworthy.
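The comparability check can be automated against two metadata records. This sketch assumes both records are plain dictionaries shaped like profile_metadata.json; the helper name is hypothetical.

```python
COMPARABILITY_FIELDS = (
    "base_image",
    "resource_class",
    "gpu_device_ids",
    "server_flags",
    "benchmark_config_hash",
)


def comparability_mismatches(meta_a, meta_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    return {
        field: (meta_a.get(field), meta_b.get(field))
        for field in COMPARABILITY_FIELDS
        if meta_a.get(field) != meta_b.get(field)
    }
```

An empty result means the two runs agree on all five fields; any entry should be reported next to the comparison rather than dropped.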
Suggested Rollout
Phase A: Analysis adaptation
Make the kernel classification and overlap heuristics ROCm-aware.
Phase B: Artifactization
Write the profiling outputs into stable JSON + Markdown artifacts.
Phase C: Live integration
Trigger profiling from real optimization stages and surface the artifacts in the dashboard.
Relationship to Other AMD Skills
rocprofv3-profiler
Use that skill when you need low-level AMD hardware counters or kernel-level bottleneck data. Use this skill when you need SGLang/vLLM trace triage tied back to Python/operator semantics.
env-probe
Run env-probe before profiling if you suspect hidden runtime defaults are skewing results.
rocm-crash-debug
Use rocm-crash-debug when the run is failing. Use this skill when the run is healthy enough to generate profiling evidence.
Reviewer Checklist
Before calling this skill "done", verify:
- It is explicitly MI355X / gfx950 / ROCm-aware
- It produces structured artifacts, not just console output
- It carries experiment/trial linkage fields
- It distinguishes arch_match programmatically
- It can be attached to the dashboard and DB without ad-hoc parsing