name: auto-benchmark-rocm description: > Run AI-driven benchmark searches on AMD ROCm with tiered server-flag sweeps for vLLM/SGLang, canonical dataset preparation, SLA or fixed-QPS benchmarking, CSV export, and resume. Adapted from SGLang auto-benchmark for MI355X (gfx950) / MI300X (gfx942) on ROCm 7.x. Use when the user wants an automated benchmark workflow on AMD GPUs rather than a one-off bench_serving command. Integrates with amdpilot executor task_spec_json for resource-aware launch.
Auto Benchmark — AMD ROCm / MI355X
This skill is for repeatable, AI-driven performance tuning of vLLM and SGLang on AMD Instinct GPUs.
Adapted from the upstream SGLang sglang-auto-benchmark skill. All CUDA-specific references have
been replaced with ROCm equivalents; attention backends, environment variables, resource classes,
and Docker image selection reflect the AMD MI355X (gfx950) and MI300X (gfx942) ecosystem.
Preconditions
- GPU architecture confirmed: run
rocminfo | grep gfxto verify gfx950 (MI355X) or gfx942 (MI300X). - ROCm version checked:
cat /opt/rocm/.info/version— this skill targets ROCm 7.x (7.0 or 7.2). - Docker image identified: know which base image you are in (
rocm/sgl-devorrocm/vllm-dev). Use theenv-probeskill first if unsure. - Server can launch: vLLM or SGLang can already start and serve the target model in this container.
- Model path exists: local path under
/data/preferred over NFS/mnt/dcgpuval/for I/O speed. - Goal is clear:
- benchmark a fixed QPS list, or
- search the maximum QPS that satisfies
max_ttft_ms/max_tpot_ms.
If any precondition is not met, fix it before running a large search.
Most Important Rule
If the user wants the best command for a real production or real workload scenario, the benchmark
must use their real request distribution — real prompt lengths, output lengths, multi-turn patterns,
and sampling settings. sharegpt, random, and generated-shared-prefix are useful for sanity
checks, but they are not a substitute for real traffic.
AMD-Specific Attention Backends
On ROCm, the available attention backends differ from NVIDIA:
| Backend | Description | Notes |
|---|---|---|
aiter |
ROCm-native unified attention kernel library | Primary AMD backend, best for MI355X/MI300X |
triton |
Cross-platform Triton kernels | Works on ROCm, good fallback |
torch_native |
PyTorch native SDPA | Baseline, no AMD-specific optimization |
Do NOT use NVIDIA-specific backends: fa3, fa4, flashinfer, flashmla, trtllm_*, cutlass_*.
These will fail or silently produce wrong results on ROCm.
aiter-Specific Constraints
- MLA (Multi-Head Latent Attention): aiter ASM kernels require
heads_per_gpu % 16 == 0. For models with 64 heads, TP must be ≤ 4 (giving 16 heads/GPU). - FP8 prefill attention on gfx950: enable with
SGLANG_AITER_FP8_PREFILL_ATTN=1. - MLA persist design for FP8 KV cache: enable with
SGLANG_AITER_MLA_PERSIST=1.
Environment Variables
Replace CUDA environment variables with ROCm equivalents:
| CUDA (do not use) | ROCm equivalent | Purpose |
|---|---|---|
CUDA_VISIBLE_DEVICES |
HIP_VISIBLE_DEVICES |
GPU device selection |
CUDA_LAUNCH_BLOCKING |
HIP_LAUNCH_BLOCKING |
Synchronous kernel launch for debugging |
| — | SGLANG_USE_AITER=1 |
Explicitly enable aiter backend |
| — | SGLANG_AITER_MLA_PERSIST=1 |
Enable MLA persist design |
| — | SGLANG_AITER_FP8_PREFILL_ATTN=1 |
FP8 prefill on gfx950 |
| — | HSA_FORCE_FINE_GRAIN_PCIE=1 |
Fine-grain PCIe for host-device transfers |
GPU Topology and Resource Classes
Our node has 8× AMD Instinct MI355X (gfx950), ROCm 7.2.0.
| Resource class | GPU count | Typical use case |
|---|---|---|
single-gpu |
1 | Small models (≤8B), quick sanity checks |
multi-gpu |
4 | Medium models (32B–70B), TP=4 |
full-node |
8 | Large models (70B+), TP=8 or TP=4×DP=2 |
When server.parallel is used and dp_size is not set explicitly:
dp_size = visible_gpus / (tp_size * pp_size)
Visible GPU count is inferred from HIP_VISIBLE_DEVICES, or from server.parallel.gpu_count.
Supported Dataset Kinds
Same as upstream:
sharegpt— auto-download supported, converted to canonical JSONL.custom— oldbench_servingformat or canonical autobench JSONL.random— synthetic/random benchmark path.generated-shared-prefix— shared-prefix synthetic generator.
Canonical Dataset Format
Identical to upstream. JSONL, one request per line:
{"prompt": "Write a summary.", "output_len": 256}
{"prompt": [{"role": "user", "content": "Summarize."}], "output_len": 256}
Search Tiers
- Tier 1: Fast sanity sweep. Baseline + small one-at-a-time scan.
- Tier 2: Good default. Small cartesian on high-priority keys + expansion for rest.
- Tier 3: Full cartesian product. Slowest, but thorough when space is bounded.
YAML key order matters. Put the most important search keys first.
What Is Tunable (AMD/ROCm)
server.base_flags and server.search_space are passed to the server launcher. Any valid
vLLM/SGLang server CLI flag can be set or searched.
Kernel / Backend (AMD-specific)
attention_backend— search[aiter, triton]prefill_attention_backend— if split prefill/decode is supporteddecode_attention_backend— if split prefill/decode is supportedsampling_backend
Batching / Scheduling
max_running_requestschunked_prefill_size— common values:[4096, 8192, 16384, 131072]prefill_max_requestsmax_prefill_tokensschedule_policy—[lpm, fcfs]schedule_conservativenessnum_continuous_decode_steps
Memory / Cache
mem_fraction_static— critical for ROCm; MI355X HBM3e is larger than H100, so ranges differ. Typical search:[0.80, 0.85, 0.88, 0.90]max_total_tokenspage_sizedisable_radix_cachekv_cache_dtype—[auto, fp8_e4m3]for FP8 KV cache on MI355X
Parallel / Distributed
tp_size— must respect aiter head constraints (heads_per_gpu % 16 == 0)pp_sizedp_sizeload_balance_methodenable_dp_attentionenable_aiter_allreduce_fusion— AMD-specific distributed optimization
Runtime / HIP Graph
- Keep HIP graph enabled by default for performance benchmarking (same concept as CUDA graph on ROCm).
cuda_graph_max_bs— flag name is unchanged in vLLM/SGLang even on ROCmdisable_cuda_graph_padding- Do not put
disable_cuda_graphinto the default search space.
Optional Speculative / EAGLE Stage
Speculative decoding support on ROCm may be limited. Verify availability before enabling:
speculative_num_stepsspeculative_eagle_topkspeculative_num_draft_tokens
Order: always tune the non-speculative base server first, then optionally add speculative search.
Base Tuning Before EAGLE
Never start by tuning EAGLE first. Use this order:
- Tune the non-speculative base server first.
- Find the best normal config for the target dataset and SLA.
- Only if the user explicitly asks and draft model assets exist, run speculative search.
Running The Workflow
Prepare dataset
python3 -m sglang.auto_benchmark convert \
--kind sharegpt \
--tokenizer /data/meta-llama/Meta-Llama-3.1-70B-Instruct \
--num-prompts 1200 \
--output /tmp/sharegpt.autobench.jsonl
Run from config
python3 -m sglang.auto_benchmark run --config /path/to/config.yaml
Outputs
- Prepared canonical dataset JSONL
- Per-run
results.jsonl - Summary
results.csv - Per-candidate server logs
Integration with amdpilot
When running benchmarks through the amdpilot executor, the benchmark task spec maps to
task_spec_json in the queue DB:
{
"gpu_count": 8,
"resource_class": "full-node",
"base_image": "rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317",
"gpu_free_ids": [0, 1, 2, 3, 4, 5, 6, 7],
"disk_free_gb": 2100,
"timeout_minutes": 120
}
The dashboard GET /api/{job_name}/system_info endpoint reads from this column to display
experiment runtime info (GPU arch, ROCm version, base image, resource class).
Benchmark results should be written as structured artifacts so the dashboard can render them and feed them into the data flywheel for downstream LoRA/SFT training signal.
Config Template
Use the reference configs in references/:
| Config | Model | GPUs | Notes |
|---|---|---|---|
config-example-rocm.yaml |
Generic | 4 | Starting template |
llama3.1-70b-mi355x.yaml |
Llama 3.1 70B Instruct | 8 | Full-node TP=8 |
qwen3-32b-mi355x.yaml |
Qwen3 32B | 4 | TP=4, aiter search |
deepseek-r1-mi355x.yaml |
DeepSeek R1 671B (FP8/MXFP4) | 8 | FP8 KV cache, AllReduce fusion |
What To Report Back
After a run, summarize:
- Hardware: GPU arch (gfx950/gfx942), ROCm version, Docker image tag
- Search config: which tier, dataset kind (synthetic vs real)
- Best config found: attention backend, TP/DP, key flags
- Best QPS that satisfied SLA (or fixed QPS results)
- Whether speculative tuning was skipped or run
- Paths to artifacts: dataset JSONL,
results.jsonl,results.csv, server logs - Anomalies: OOM events, HIP graph capture failures, aiter constraint violations
Differences From Upstream (CUDA) Skill
| Aspect | CUDA / NVIDIA | ROCm / AMD |
|---|---|---|
| GPU visibility | CUDA_VISIBLE_DEVICES |
HIP_VISIBLE_DEVICES |
| Attention backends | fa3, flashinfer |
aiter, triton |
| Graph runtime | CUDA graph | HIP graph (flag names unchanged — cuda_graph_max_bs, disable_cuda_graph etc. still use "cuda" prefix in vLLM/SGLang CLI even on ROCm) |
| GPU query | nvidia-smi |
rocm-smi --showproductname |
| Arch detection | N/A | rocminfo | grep gfx |
| AllReduce fusion | N/A | --enable-aiter-allreduce-fusion |
| FP8 prefill | Built-in | SGLANG_AITER_FP8_PREFILL_ATTN=1 |
| MLA persist | Built-in | SGLANG_AITER_MLA_PERSIST=1 |
| Memory range | 0.85–0.92 typical | 0.80–0.90 typical (HBM3e larger) |
| Docker images | nvcr.io/nvidia/* |
rocm/sgl-dev, rocm/vllm-dev |