rocm-perf-lab

Deterministic multi-kernel GPU performance analysis and guarded closed-loop optimization framework for ROCm 7.x using rocprofv3, rocpd, roofline modeling, ATT analysis, and critical-path–weighted kernel optimization.

By kalowery. Updated 2/20/2026

name: rocm-perf-lab
description: Deterministic multi-kernel GPU performance analysis and guarded closed-loop optimization framework for ROCm 7.x using rocprofv3, rocpd, roofline modeling, ATT analysis, and critical-path–weighted kernel optimization.

ROCm Perf Lab Skill

Use this skill when performing structured GPU performance engineering on ROCm-based AMD GPUs.

Assumptions:

ROCm 7.x
rocprofv3 available
rocpd SQLite dispatch database available
HIP kernels (standalone .cu sources for optimization)
Applications may contain multiple kernels per execution

Core Capabilities

1. Deterministic GPU Runtime Modeling

Runtime derived from rocpd kernel dispatch timestamps
Computed as SUM(dispatch_end - dispatch_start) over all kernel dispatches in the run
Not based on host wall-clock time
Works for multi-launch and multi-kernel workloads

The .rocpd_profile database is the authoritative timing source.
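
The summation above can be sketched against a SQLite database as follows. This is a minimal illustration, not the tool's implementation: the table and column names (`dispatches`, `kernel_name`, `start_ns`, `end_ns`) are hypothetical stand-ins rather than the real rocpd schema, and the demo uses an in-memory database instead of an actual .rocpd_profile file.

```python
import sqlite3


def total_gpu_runtime_ns(conn: sqlite3.Connection) -> int:
    """Sum per-dispatch durations (end - start) across every dispatch.

    NOTE: `dispatches`, `start_ns`, and `end_ns` are hypothetical names,
    not the actual rocpd schema.
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(end_ns - start_ns), 0) FROM dispatches"
    ).fetchone()
    return int(row[0])


def per_kernel_runtime_ns(conn: sqlite3.Connection) -> dict:
    """Aggregate the same duration sum per kernel name (multi-launch aware)."""
    rows = conn.execute(
        "SELECT kernel_name, SUM(end_ns - start_ns) "
        "FROM dispatches GROUP BY kernel_name"
    ).fetchall()
    return {name: int(total) for name, total in rows}


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in with three dispatches of two kernels."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dispatches ("
        "kernel_name TEXT, start_ns INTEGER, end_ns INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dispatches VALUES (?, ?, ?)",
        [("gemm", 0, 500), ("reduce", 600, 700), ("gemm", 800, 1200)],
    )
    return conn
```

With the demo data, `total_gpu_runtime_ns(make_demo_db())` returns 1000 and the per-kernel breakdown is 900 ns for `gemm` and 100 ns for `reduce`. The idle gaps between dispatches do not contribute, which is why the figure differs from host wall-clock time.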

2. Multi-Kernel Critical Path Analysis

Reconstructs dispatch DAG from rocpd trace data
Supports cross-stream execution
Computes:
- critical_path_ns
- Per-kernel slack
- Critical-path contribution weighting

Optimization prioritization is based on measured critical-path impact, not isolated kernel time.
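
As a rough sketch of the analysis above, the following computes a critical path length and per-kernel slack over a dependency DAG using a standard forward/backward pass. How the skill actually reconstructs edges from rocpd data is not specified here, so the `durations` and `edges` inputs are assumed to be given, and the weighting helper is one simple interpretation of "critical-path contribution", not necessarily the one the tool uses.

```python
from collections import defaultdict, deque


def critical_path(durations, edges):
    """Return (critical_path_ns, slack) for a DAG of kernel dispatches.

    durations: {node: duration_ns}; edges: iterable of (pred, succ).
    Both inputs are assumed to have been derived elsewhere.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    indegree = {n: 0 for n in durations}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)
        indegree[v] += 1

    # Kahn's algorithm for a topological order.
    order, queue = [], deque(n for n in durations if indegree[n] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(durations):
        raise ValueError("dependency graph contains a cycle")

    # Forward pass: earliest finish time of each node.
    earliest = {}
    for n in order:
        start = max((earliest[p] for p in preds[n]), default=0)
        earliest[n] = start + durations[n]
    total = max(earliest.values(), default=0)

    # Backward pass: latest finish that does not delay the end of the graph.
    latest = {}
    for n in reversed(order):
        latest[n] = min(
            (latest[s] - durations[s] for s in succs[n]), default=total
        )
    slack = {n: latest[n] - earliest[n] for n in order}
    return total, slack


def critical_weights(durations, total, slack):
    """Weight zero-slack kernels by their share of the critical path."""
    return {
        n: (durations[n] / total if total and slack[n] == 0 else 0.0)
        for n in durations
    }
```

For a diamond graph A(100) → {B(300), C(100)} → D(50), this yields a critical path of 450 ns with slack 200 ns on C and zero elsewhere. C therefore receives weight 0 even though it runs as long as A, which illustrates why ranking by isolated kernel time can mislead.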

3. Architecture-Aware Roofline Modeling

Extracts and computes:

FP32 FLOPs (CDNA3-aware VALU width scaling)
MFMA contributions (if present)
DRAM bytes using RDREQ / WRREQ counters
Arithmetic intensity
Achieved GFLOP/s
Achieved GB/s
First-order bound classification (memory vs compute)

Validated on gfx942-class GPUs (MI300X / MI325X).
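
The derived quantities listed above follow from three measurements (FLOPs, DRAM bytes, runtime) plus two device peaks. A minimal sketch, assuming the FLOP and byte totals have already been extracted from counters; the function name and the peak values used in the usage note are illustrative placeholders, not vendor specifications or the tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RooflinePoint:
    arithmetic_intensity: float  # FLOP per DRAM byte
    achieved_gflops: float
    achieved_gbs: float
    bound: str  # "memory" or "compute"


def roofline_point(flops, dram_bytes, runtime_ns, peak_gflops, peak_gbs):
    """First-order roofline placement for one kernel.

    peak_gflops and peak_gbs are caller-supplied device ceilings
    (hypothetical parameters for this sketch).
    """
    if runtime_ns <= 0 or dram_bytes <= 0:
        raise ValueError("runtime_ns and dram_bytes must be positive")
    intensity = flops / dram_bytes
    # FLOP per nanosecond is numerically equal to GFLOP/s,
    # and bytes per nanosecond to GB/s.
    achieved_gflops = flops / runtime_ns
    achieved_gbs = dram_bytes / runtime_ns
    # Ridge point: the intensity at which the two ceilings intersect.
    ridge = peak_gflops / peak_gbs
    bound = "memory" if intensity < ridge else "compute"
    return RooflinePoint(intensity, achieved_gflops, achieved_gbs, bound)
```

For example, a kernel performing 2e9 FLOPs over 4e9 DRAM bytes in 2 ms has an arithmetic intensity of 0.5 FLOP/byte, 1000 GFLOP/s, and 2000 GB/s. Against placeholder peaks of 100000 GFLOP/s and 5000 GB/s (ridge at 20 FLOP/byte), it lands on the memory-bound side.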

4. ATT Deep Analysis

Optional ATT pass provides:

Wave occupancy
Stall breakdown
Instruction mix signals
Latency indicators

ATT enriches feature extraction but does not define runtime.

5. Rule-Based Bottleneck Classification

Combines roofline position and ATT-derived features to produce deterministic labels such as:

Memory-bandwidth bound
Latency bound
Under-occupied
Divergence limited
Compute throughput limited

Classification is deterministic for identical inputs.
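
One way such a classifier stays deterministic is to evaluate a fixed, ordered list of rules with no randomness or learned state. The sketch below is illustrative only: the feature names, the rule order, and every threshold are assumptions made for this example, not the skill's actual rules.

```python
def classify_bottleneck(features: dict) -> str:
    """Return the first matching label from a fixed rule order.

    Expected keys (all hypothetical for this sketch):
      occupancy             - achieved wave occupancy, 0..1
      divergence_ratio      - fraction of divergent branch execution, 0..1
      roofline_bound        - "memory" or "compute"
      bandwidth_utilization - achieved / peak DRAM bandwidth, 0..1
    """
    # Thresholds below are placeholders chosen for illustration.
    if features["occupancy"] < 0.25:
        return "Under-occupied"
    if features["divergence_ratio"] > 0.30:
        return "Divergence limited"
    if features["roofline_bound"] == "memory":
        # Memory-side kernels that do not saturate DRAM are treated as
        # waiting on latency rather than on bandwidth.
        if features["bandwidth_utilization"] >= 0.60:
            return "Memory-bandwidth bound"
        return "Latency bound"
    return "Compute throughput limited"
```

Because the rules are checked in a fixed order and depend only on the input dictionary, two calls with identical features always return the same label.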

6. Guarded Closed-Loop LLM Optimization

Command:

rocm-perf optimize <binary>

Scope (v1):

Standalone HIP kernel source files (.cu)
No ABI changes
Kernel signature must remain identical
Transformation limited to safe loop unrolling (factor 2–8)

Loop behavior:

Profile baseline (.rocpd_profile authoritative)
Rank kernels by critical-path contribution
Generate loop-unrolling proposal
Enforce signature invariance and basic structural checks
Compile via hipcc
Re-profile
Accept only if measured runtime improves
Automatically revert on regression

Compilation acts as the primary structural validator.
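
The accept-or-revert logic of the loop above can be sketched independently of the profiler and compiler. In this minimal version, `build_and_measure` is a hypothetical callable standing in for the check, compile, and re-profile steps; it returns the measured runtime in nanoseconds, or `None` when a candidate fails a check or does not compile. None of these names come from the tool itself.

```python
from typing import Callable, Iterable, Optional, Tuple


def guarded_optimize(
    baseline_ns: int,
    unroll_factors: Iterable[int],
    build_and_measure: Callable[[int], Optional[int]],
) -> Tuple[Optional[int], int]:
    """Try each unroll factor; keep a candidate only if it measures faster.

    Returns (accepted_factor or None, best_runtime_ns).
    """
    best_factor: Optional[int] = None
    best_ns = baseline_ns
    for factor in unroll_factors:
        measured = build_and_measure(factor)
        if measured is None:
            # Rejected before measurement (failed check or compile error).
            continue
        if measured < best_ns:
            # Strict improvement over the best known runtime: accept.
            best_factor, best_ns = factor, measured
        # Otherwise the previous best stays in place, i.e. the
        # candidate is reverted.
    return best_factor, best_ns


def fake_build_and_measure(factor: int) -> Optional[int]:
    """Stand-in for compile + re-profile, with made-up timings."""
    results = {2: 980, 3: 1010, 4: 900, 8: None}
    return results.get(factor)
```

With a 1000 ns baseline, `guarded_optimize(1000, [2, 3, 4, 8], fake_build_and_measure)` returns `(4, 900)`: factor 3 regresses and is discarded, and factor 8 never produces a measurement. If no candidate improves on the baseline, the result is `(None, 1000)` and the original source stands.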

What This Skill Does Not Do

No AST-based structural verification
No cross-file refactoring
No automatic ABI changes
No formal numerical equivalence proofs
No speculative optimization without measurement

All improvements are empirical and hardware-validated.

CLI Overview

Primary commands:

rocm-perf profile <binary>
rocm-perf optimize <binary>

See references/cli.md for additional details.

Install via CLI

npx skills add https://github.com/kalowery/rocm-perf-lab --skill rocm-perf-lab
