name: amdsmi-improvement-evaluation description: A structured approach to evaluating AMD SMI's current state and identifying improvement opportunities.
AMD SMI Improvement Evaluation
Build a practical understanding of AMD SMI and identify high-value improvement areas for ROCm users who need GPU monitoring, diagnostics, and management. This includes validating whether AMD SMI reports correct GPU information across different AMD GPU models and driver environments.
Current Source Notes
ROCm/amdsmistill exists, but its README says the repository is deprecated and that source work has moved toROCm/rocm-systems/projects/amdsmi.- The latest public AMD SMI documentation is published under ROCm Software 7.2.2 / AMD SMI 26.2.2.
- The documentation frames AMD SMI as a unified user-space interface for GPU management and monitoring in HPC environments.
- The documented interfaces include C++, Python, Go, CLI usage, Docker usage, and Reliability, Availability, Serviceability (RAS) concepts.
Official Example Inventory
From ROCm/rocm-systems on the develop branch, the current example directory
contains:
CMakeLists.txtamd_smi_afid_example.pyamd_smi_cper.ccamd_smi_cper_example.pyamd_smi_drm_example.ccamd_smi_nic.ccamd_smi_nodrm_example.ccamd_smi_violation_example.pyamdsmi_esmi_intg_example.cc
These examples give us an initial test set for assessing build friction, runtime requirements, documentation clarity, and API ergonomics.
Investigation Tracks
1. CLI Experience
Questions:
- How easy is it to discover the right
amd-smicommand for common tasks? - Are outputs readable for humans and stable enough for scripts?
- Are errors actionable when permissions, drivers, or supported hardware are missing?
- Is JSON or machine-readable output available where operators need it?
Evidence to collect:
amd-smi --help- Common query commands for devices, utilization, power, temperature, clocks, firmware, ECC/RAS, topology, and processes
- Behavior with missing GPU access or unsupported host environment
2. Python API Experience
Questions:
- How much boilerplate is needed to initialize, enumerate devices, query metrics, and handle errors?
- Do Python return shapes feel consistent and typed enough for downstream tools?
- Can a user build a simple exporter or health check without reading C++ API details?
Evidence to collect:
- Minimal device inventory script
- Minimal utilization/power/thermal polling script
- Error-handling behavior and exception patterns
- Differences between Python examples and docs
3. C++ API Experience
Questions:
- Are headers, linking, and CMake integration straightforward?
- Are API names and structs discoverable?
- Are lifetime, initialization, and cleanup rules obvious?
- Are example binaries focused enough to teach one concept at a time?
Evidence to collect:
- Build steps for official C++ examples
- CMake dependency behavior with installed ROCm
- Compile/link errors on a clean environment
- Required privileges and driver/library paths
4. Observability Coverage
Questions:
- Which operational signals are easy to retrieve?
- Which signals are available only through the CLI, only through APIs, or only through privileged contexts?
- Are important health signals missing, unclear, or difficult to correlate?
Signal categories:
- Device identity and firmware
- Utilization and process attribution
- Memory usage and bandwidth
- Power, energy, clocks, and performance levels
- Temperature and throttling
- ECC, RAS, CPER, XGMI, and violation data
- Topology, NUMA locality, and NIC integration
5. Output Correctness
Questions:
- Does AMD SMI report the correct GPU model, ASIC name, device ID, revision, VRAM size, memory type, firmware versions, PCIe information, and serial/UUID identifiers?
- Do reported clock, power, temperature, utilization, memory, and throttling values match trusted system sources within expected tolerances?
- Are incorrect fields isolated to specific GPU families, ROCm versions, kernel drivers, firmware versions, containers, or VM environments?
- Does the CLI disagree with the Python, C++, or Go APIs for the same GPU?
- When AMD SMI cannot determine a value, does it return an explicit unsupported or unavailable state instead of a misleading value?
Baseline comparison sources:
amd-smi- AMD SMI Python API
- AMD SMI C++ API examples
rocminforocm-smi, where available- Linux sysfs under
/sys/class/drm/card*/device lspci -nn,lspci -vv, and PCI IDs- Kernel logs from
dmesg - Vendor specifications for the tested GPU model
Fields to validate:
- GPU marketing name / product name
- ASIC name
- Device ID, vendor ID, subsystem ID, revision ID
- BDF / PCI bus address
- VRAM size and memory vendor/type
- Firmware and VBIOS versions
- Driver and ROCm versions
- Power, energy, clocks, and performance level
- Temperature sensors and throttling state
- Utilization and active process reporting
- ECC/RAS support and counters
Evidence to collect:
- Exact command or API call
- Raw AMD SMI output
- Raw baseline output
- Expected value
- Observed AMD SMI value
- Whether the value is wrong, missing, stale, rounded, mislabeled, or ambiguously named
- GPU model, ROCm version, kernel version, driver version, OS, and environment
6. Documentation and Example Quality
Questions:
- Does the documentation guide a first-time user from install to a working script?
- Are examples named and scoped clearly?
- Are expected outputs shown?
- Are environment assumptions explicit?
- Are version differences visible enough?
Evidence to collect:
- Missing prerequisites
- Broken links or stale repository references
- Ambiguous install/build instructions
- Example commands that cannot be copied directly
Improvement Backlog Template
For each issue or opportunity, capture:
- Title
- Type of User affected
- Scenario
- Current behavior
- Expected or better behavior
- Evidence
- Suggested improvement
- Impact: high / medium / low
- Effort: high / medium / low
Sources
- AMD SMI source repository: https://github.com/ROCm/amdsmi
- AMD SMI active super-repo path: https://github.com/ROCm/rocm-systems/tree/develop/projects/amdsmi
- AMD SMI documentation: https://rocm.docs.amd.com/projects/amdsmi/en/latest/
- AMD SMI examples: https://github.com/ROCm/rocm-systems/tree/develop/projects/amdsmi/example