name: fp8-gemm-tuning-sglang-aiter
description: Use when optimizing end-to-end SGLang performance through GEMM tuning for FP8 models on AMD HIP/ROCm, by replacing the default Triton GEMM backend with a tuned Composable Kernel (CK) path through aiter. This skill is the verified playbook for that entire process, using FP8 block-wise GEMM (gemm_a8w8_blockscale) as the primary worked example, covering GEMM shape/dispatch logging in SGLang, CK tuning, and AITER_CONFIG_GEMM_A8W8_BLOCKSCALE CSV integration. Other FP8 GEMM variants such as bpreshuffle should also apply by switching where the GEMM shapes are dumped and which CK tool is used for tuning.
FP8 block-wise GEMM tuning (SGLang + aiter)
Overview
This workflow tunes FP8 block-scaled GEMM used on HIP/AMD when SGLang runs with aiter (SGLANG_USE_AITER=1). Stock SGLang ≥ 0.5.6 often routes block FP8 through Triton (aiter.ops.triton.gemm_a8w8_blockscale) when aiter is enabled. To use CK (aiter.gemm_a8w8_blockscale) with a tuned kernel table, you pin aiter, capture (M, N, K) from a representative server run by implementing GEMM shape dump and dispatch logging in §5–§6 (trace the path in §4 first), run aiter’s CK tuner, point AITER_CONFIG_GEMM_A8W8_BLOCKSCALE at the produced CSV, and switch imports in fp8_utils.py so the CK symbol is used. Then rerun the same serving/benchmark pipeline and compare logs to baseline.
Assumptions: ROCm/HIP GPU, Python env where SGLang and aiter are importable from the workload script, and write access to the SGLang sources you run when adding §5 hooks.
Similar steps could potentially apply to ck bpreshuffle gemm, etc.
1. Resolve SGLang and aiter paths; verify versions
- From the workload script (for example a launch wrapper), read PYTHONPATH, VIRTUAL_ENV, an explicit python -m sglang, or which python3 to find which SGLang and aiter trees are used.
- SGLang: require ≥ 0.5.6 (block FP8 + aiter integration expectations in this workflow). Check with pip show sglang or python -c "import sglang; print(getattr(sglang, '__version__', 'unknown'))".
- aiter: require a checkout at or after commit 303a583c89fe392a39cad7e45d616cc43bde3278. If the current HEAD is not a descendant of this commit, you must update the repo; otherwise issues such as systematic crashes can occur. If this commit does not exist in the local repo, run git pull to update the commit info. THIS IS IMPORTANT.
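The two version checks above can be scripted. The following is a minimal sketch, not part of SGLang or aiter: the helper names are hypothetical, the version parser only compares leading numeric components (so a pre-release such as 0.5.6rc1 counts as 0.5.6), and the ancestry check relies on git merge-base --is-ancestor returning exit code 0 when the commit is an ancestor of HEAD.

```python
import re
import subprocess

# Commit required by this playbook (from the text above).
REQUIRED_AITER_COMMIT = "303a583c89fe392a39cad7e45d616cc43bde3278"


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '0.5.6.post1' -> (0, 5, 6)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def sglang_version_ok(version: str, minimum=(0, 5, 6)) -> bool:
    """True if the reported SGLang version is at least `minimum`."""
    return version_tuple(version) >= minimum


def aiter_commit_ok(repo: str, commit: str = REQUIRED_AITER_COMMIT) -> bool:
    """True if `commit` is HEAD or an ancestor of HEAD in the git repo at `repo`.

    A non-zero exit code covers both "not an ancestor" and "commit unknown
    locally"; in the second case fetch first and check again.
    """
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0
```

Typical use would be sglang_version_ok(sglang.__version__) in the serving interpreter and aiter_commit_ok("/path/to/aiter") against the checkout that interpreter imports.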
2. Pin aiter to the required commit (if the required commit is not an ancestor of current HEAD, or it is not found in local)
Run inside the aiter repository root:
git pull
git checkout 303a583c89fe392a39cad7e45d616cc43bde3278
git submodule sync && git submodule update --init --recursive
# Clean JIT build artifacts before rebuild
rm -rf aiter/jit/*.so
rm -rf aiter/jit/build/*
python setup.py develop
Confirm: git rev-parse HEAD prints 303a583c89fe392a39cad7e45d616cc43bde3278 (or a descendant if you intentionally stay on newer HEAD after verifying compatibility).
3. Baseline end-to-end performance
- Run the same pipeline you will use after tuning: start launch_server (or equivalent), then bench_serving / your benchmark; capture server log and benchmark output (latency, throughput, tokens/s, etc.).
- Store logs under a timestamped directory for before/after comparison.
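A small helper can keep baseline and tuned runs apart on disk. This is a sketch only; the function name, directory layout, and label values are illustrative assumptions, not something SGLang provides.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(base: str, label: str) -> Path:
    """Create and return <base>/<YYYYmmdd-HHMMSS>-<label> for one run's logs."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = Path(base) / f"{stamp}-{label}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

For example, make_run_dir("runs", "baseline") before tuning and make_run_dir("runs", "ck-tuned") afterwards, with server.log and the benchmark output written into the returned directory.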
4. Understand which GEMM path SGLang uses (read code; do not assume)
- On HIP with SGLANG_USE_AITER=1, inspect python/sglang/srt/layers/quantization/fp8_utils.py in the SGLang library, inside the _use_aiter block: the imports decide whether gemm_a8w8_blockscale comes from aiter (CK) or aiter.ops.triton.gemm_a8w8_blockscale (Triton).
- Typical SGLang ≥ 0.5.6: the Triton blockscale import is active and the CK import is commented out. Different SGLang revisions may differ; always read the file you actually run.
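Reading the file is the authoritative check, but a quick scan can flag which import is live. The sketch below assumes the two single-line import forms named above; it skips commented lines and would miss parenthesized multi-line imports, so treat a surprising result as a prompt to open the file. The function name is hypothetical.

```python
import re

# Single-line import forms assumed from the text above.
CK_IMPORT = re.compile(r"^\s*from\s+aiter\s+import\s+.*\bgemm_a8w8_blockscale\b")
TRITON_IMPORT = re.compile(
    r"^\s*from\s+aiter\.ops\.triton\.gemm_a8w8_blockscale\s+import\s+"
    r".*\bgemm_a8w8_blockscale\b"
)


def active_blockscale_imports(source: str) -> list:
    """Return 'ck' / 'triton' for each uncommented blockscale import, in file order."""
    found = []
    for line in source.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out import: not active
        if CK_IMPORT.match(line):
            found.append("ck")
        elif TRITON_IMPORT.match(line):
            found.append("triton")
    return found
```

Pass it the text of the fp8_utils.py that the serving interpreter actually imports. If both backends appear, the later import wins at runtime, which is worth resolving by hand.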
5. GEMM shape finding and dumping support (SGLang edits)
CK tuning needs a faithful list of (M, N, K) (and related metadata such as dtype) from runs of your workload. In the SGLang tree that actually runs launch_server, add support for two env-driven toggles:
- SGLANG_DUMP_AITER_FP8_GEMM_SHAPES: when enabled, the server log must contain parseable GEMM shape information for downstream steps, commonly lines tagged [GEMM_shape_dump] with a csv_row: M,N,K,... suffix (or an equivalent format you document for §7–§8).
- SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH: when enabled, emit a concise log that identifies which w8a8_block_fp8_linear implementation the process is using, so you can cross-check against §4 (CK vs Triton vs other).
Recommended implementation approach: use §4 to locate the real block-FP8 / aiter execution path in the checkout, add the smallest set of hooks that fire for the tensors that define M, N, K, and keep logging volume sane (for example rank 0 only, or rate-limited if hot).
Reference implementation (from a working HIP + aiter layout; adapt paths, imports, and callsites to the SGLang version you run—the snippets illustrate behavior, not the only valid layout):
# fp8_utils.py: add near other FP8 helpers (needs: prod, get_bool_env_var, logger)
def log_aiter_fp8_gemm_shape_dump(
    x: torch.Tensor,
    weight: torch.Tensor,
    block_size: Optional[List[int]] = None,
    layer_prefix: str = "",
) -> None:
    if not get_bool_env_var("SGLANG_DUMP_AITER_FP8_GEMM_SHAPES"):
        return
    try:
        from sglang.srt.distributed import get_tensor_model_parallel_rank
        if get_tensor_model_parallel_rank() != 0:
            return
    except Exception:
        pass
    if x.dim() < 2:
        M, K = int(x.numel()), 1
    else:
        M = int(prod(x.shape[:-1]))
        K = int(x.shape[-1])
    N = int(weight.shape[0])
    K_w = int(weight.shape[1])
    block_msg = f" block_size={block_size}" if block_size is not None else ""
    prefix_msg = f" layer.prefix={layer_prefix!r}" if layer_prefix else ""
    csv_row = f"{M},{N},{K_w},{weight.dtype}"
    logger.warning(
        "[GEMM_shape_dump] aiter FP8 block GEMM path%s%s | x.shape=%s weight.shape=%s | "
        "M=%d N=%d K=%d (aiter gemm_a8w8_blockscale / bpreshuffle tune CSV) | csv_row: %s",
        prefix_msg, block_msg, tuple(x.shape), tuple(weight.shape), M, N, K_w, csv_row,
    )
# fp8.py: import the helper from fp8_utils, then before aiter_w8a8_block_fp8_linear(...):
if self.w8a8_block_fp8_linear is aiter_w8a8_block_fp8_linear:
    log_aiter_fp8_gemm_shape_dump(
        x=x[0],  # or x when not a tuple
        weight=layer.weight,
        block_size=self.quant_config.weight_block_size,
        layer_prefix=getattr(layer, "prefix", ""),
    )
# fp8.py: Fp8LinearMethod.__init__, after self.w8a8_block_fp8_linear = dispatch_w8a8_block_fp8_linear()
if self.block_quant and get_bool_env_var("SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH"):
    _fn = self.w8a8_block_fp8_linear
    print_warning_once(
        "[Fp8LinearMethod] block_quant w8a8_block_fp8_linear="
        f"{getattr(_fn, '__qualname__', repr(_fn))} ({getattr(_fn, '__module__', '?')})"
    )
If your tree differs, reproduce the same contract (env-gated shape lines + env-gated dispatch identity) rather than copying file names blindly.
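One way to keep logging volume sane on a hot path is to emit each distinct (M, N, K) only once per process. The gate below is a hypothetical sketch, not part of the reference implementation; it would be called inside the shape-dump helper before logger.warning, and it trades completeness of per-call logs for a log that stays small even over long runs.

```python
import threading

# Process-local record of shapes that have already been logged.
_seen_shapes = set()
_seen_lock = threading.Lock()


def first_time_seen(M: int, N: int, K: int) -> bool:
    """Return True only the first time this (M, N, K) is observed in this process."""
    key = (M, N, K)
    with _seen_lock:
        if key in _seen_shapes:
            return False
        _seen_shapes.add(key)
        return True
```

Because M typically varies with batch size and sequence length, distinct shapes still reach the log; only repeats are suppressed.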
6. Enable the new env toggles; collect server log
Now that GEMM logging is in place, export the following flags in the environment used to start SGLang:
export SGLANG_DUMP_AITER_FP8_GEMM_SHAPES=1
export SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH=1
Run your workload again (same scenario you will tune). Validate that server.log (or your log path) reflects SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH (whatever identity string your §5 hook emits) and contains [GEMM_shape_dump] with csv_row: for the hot path. If either is missing, return to §§4–5 before continuing.
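The validation described above can be automated with a short check. This sketch assumes the tags from the §5 reference snippets ([GEMM_shape_dump], csv_row:, and [Fp8LinearMethod] for dispatch); if your hooks emit different strings, pass those instead. The function name is hypothetical.

```python
def check_server_log(text: str, dispatch_tag: str = "[Fp8LinearMethod]") -> dict:
    """Count shape-dump and dispatch lines; 'ok' requires at least one of each."""
    lines = text.splitlines()
    shape_lines = [
        line for line in lines
        if "[GEMM_shape_dump]" in line and "csv_row:" in line
    ]
    dispatch_lines = [line for line in lines if dispatch_tag in line]
    return {
        "shape_lines": len(shape_lines),
        "dispatch_lines": len(dispatch_lines),
        "ok": bool(shape_lines) and bool(dispatch_lines),
    }
```

Feed it the contents of server.log (for example Path("server.log").read_text(errors="replace")) and return to §§4–5 when "ok" is False.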
7. Parse GEMM shapes from the server log for tuning
Goal: Turn the noisy §6 server log into a small, deduplicated artifact that lists every GEMM problem you need to tune—at minimum (M, N, K), plus whatever else you chose to log in §5 if the tuner or kernels need it (activation/weight layout hints, dtype, block-scale geometry, bias presence, layer name, and so on).
Responsibilities
- Parse lines emitted under SGLANG_DUMP_AITER_FP8_GEMM_SHAPES (whatever format §5 uses; often a stable tag plus a csv_row:-style payload, but the contract is yours as long as §8 can read it).
- Normalize fields into a consistent representation for the next step (strip tqdm/progress junk on the same physical line if needed).
- Deduplicate so each distinct tuning key appears once; pick a key that matches how you will build the untuned input for gemm_a8w8_blockscale_tune.py in §8 (for many flows that is unique (M, N, K); include more columns in the key if you logged extra dimensions that affect kernel choice).
- Write the result to a file you own (for example a cleaned log or CSV path) and treat it as the handoff into §8; adjust §8's CSV builder if you used a minimal format instead of full log lines.
If the artifact is empty or obviously incomplete, fix §5–§6 before running the tuner.
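The parse, normalize, and deduplicate responsibilities can be sketched as below, assuming the [GEMM_shape_dump] ... csv_row: M,N,K,... format from the §5 reference snippet and a tuning key of (M, N, K). The regex searches inside each line rather than anchoring at its start, so progress-bar text sharing the line is ignored. The function name is hypothetical.

```python
import re

# Matches the first three integer fields of the csv_row payload on a tagged line.
CSV_ROW = re.compile(
    r"\[GEMM_shape_dump\].*?csv_row:\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)"
)


def parse_unique_shapes(log_text: str) -> list:
    """Return sorted, deduplicated (M, N, K) tuples found in the server log text."""
    seen = set()
    for line in log_text.splitlines():
        for match in CSV_ROW.finditer(line):
            seen.add(tuple(int(group) for group in match.groups()))
    return sorted(seen)
```

An empty return value is the signal to revisit §5–§6 rather than to proceed to tuning.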
8. CK tuning with aiter’s blockscale GEMM tuner
Work in the aiter checkout from §1–§2. The Composable Kernel entry point for this workflow typically lives under:
$AITER_ROOT/csrc/ck_gemm_a8w8_blockscale/
There you should find gemm_a8w8_blockscale_tune.py (names may vary slightly by aiter revision). Do not rely on workspace-specific wrapper scripts; treat the aiter tree as the source of truth.
What to do
Read the tuner and its CLI
Open gemm_a8w8_blockscale_tune.py and any helpers it imports (for example GemmCommonTuner / mp_tuner under aiter/utility/). Run python3 gemm_a8w8_blockscale_tune.py --help in that directory after setting PYTHONPATH so import aiter resolves (typically export PYTHONPATH="$AITER_ROOT:${PYTHONPATH}"). Note the input / output file flags, --libtype (e.g. ck, cktile, both), --mp (worker count for mp_tuner; set this to use all GPUs you intend to parallelize across, not one, unless you are debugging), and whether split-K search (e.g. a -k flag) is optional and expensive.
Build the untuned shape CSV from §7
The tuner's -i input is usually an untuned CSV of unique (M, N, K) rows (often with an M,N,K header; confirm with --help and any aiter docs). Convert the §7 artifact into that CSV: extract the triples (and drop columns you do not need) in whatever way matches the format §5 chose and the tuner expects.
Run tuning on a GPU host
cd into csrc/ck_gemm_a8w8_blockscale and ensure ROCm/PyTorch sees every GPU you want the tuner to use (for example leave HIP_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES unset, or set them to the full set you intend). Invoke the tuner with your untuned CSV as -i and your desired tuned CSV path as -o. Parallelize across all of those devices: set --mp to the full visible device count (or the explicit parallelism the tuner's --help documents), not a single GPU by default. The §3–§6 workload may have run at TP < host GPU count; tuning should still exploit every available accelerator to shorten wall time on large shape lists. A typical invocation shape (flags are illustrative; confirm against --help):
export AITER_ROOT=/path/to/aiter
export PYTHONPATH="${AITER_ROOT}:${PYTHONPATH:-}"
cd "${AITER_ROOT}/csrc/ck_gemm_a8w8_blockscale"
# ROCm: PyTorch still exposes devices via torch.cuda.device_count(); use all visible GPUs for --mp.
NGPU="$(python3 -c 'import torch; print(torch.cuda.device_count() or 1)')"
python3 gemm_a8w8_blockscale_tune.py -i /path/to/untuned_mnk.csv -o /path/to/a8w8_blockscale_tuned_gemm.csv --libtype both --mp "${NGPU}"
9. Switch SGLang to CK gemm_a8w8_blockscale (not Triton)
In fp8_utils.py, inside the if _use_aiter: block:
- Use CK: import gemm_a8w8_blockscale from aiter together with gemm_a8w8_bpreshuffle, get_hip_quant.
- Stop using Triton blockscale: comment out from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale.
Target pattern:
from aiter import gemm_a8w8_blockscale, gemm_a8w8_bpreshuffle, get_hip_quant
# from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
(Revert or guard behind a local branch if you need to compare Triton vs CK quickly.)
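For quick Triton vs CK comparisons, one option is to guard the import behind an environment variable instead of editing comments each time. The sketch below is a local convenience only: FP8_BLOCKSCALE_BACKEND is a hypothetical variable name that neither SGLang nor aiter reads, and the two module paths are the ones given in the target pattern above.

```python
import importlib
import os

# Module that provides gemm_a8w8_blockscale for each backend (paths from the text above).
BACKENDS = {
    "ck": "aiter",
    "triton": "aiter.ops.triton.gemm_a8w8_blockscale",
}


def blockscale_module_name(env=None) -> str:
    """Map the hypothetical FP8_BLOCKSCALE_BACKEND variable to a module path (default: ck)."""
    env = os.environ if env is None else env
    choice = env.get("FP8_BLOCKSCALE_BACKEND", "ck").strip().lower()
    if choice not in BACKENDS:
        raise ValueError(f"unknown backend {choice!r}; expected one of {sorted(BACKENDS)}")
    return BACKENDS[choice]


def load_gemm_a8w8_blockscale(env=None):
    """Import and return gemm_a8w8_blockscale from the selected backend module."""
    module = importlib.import_module(blockscale_module_name(env))
    return getattr(module, "gemm_a8w8_blockscale")
```

Inside the _use_aiter block this would replace the blockscale import with gemm_a8w8_blockscale = load_gemm_a8w8_blockscale(). Keep such a guard on a local branch, and record the selected backend alongside each benchmark run.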
10. Point aiter at the tuned CSV; rerun and compare
export AITER_CONFIG_GEMM_A8W8_BLOCKSCALE=/path/to/ck_gemm_json_out/a8w8_blockscale_tuned_gemm.csv
Restart SGLang with the same model, TP, concurrency, and benchmark flags as the baseline. Compare e2e latency / throughput and server logs to the baseline from §3. You should see improved performance when the hot shapes are covered by the CSV and the CK path is active.
Example env block (from a working script): GEMM_tuning_test4/run_sglang_test_fff.sh (shape/dispatch exports assume the §5 patches are in the SGLang tree that script runs; tune AITER_CONFIG_... separately).
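Before attributing results to tuning, it helps to confirm that the hot shapes are actually present in the tuned CSV. The sketch below assumes the tuned CSV has columns named M, N, and K (check the header of the file your aiter revision produced) and does an exact-match lookup only; how aiter treats shapes that are absent is not modeled here. The function name is hypothetical.

```python
import csv
import io


def missing_shapes(needed, tuned_csv_file) -> list:
    """Return the (M, N, K) tuples in `needed` that have no row in the tuned CSV.

    `tuned_csv_file` is an open text file; column names M, N, K are assumed.
    """
    tuned = set()
    for row in csv.DictReader(tuned_csv_file):
        tuned.add((int(row["M"]), int(row["N"]), int(row["K"])))
    return sorted(set(needed) - tuned)
```

A typical call opens the file named by AITER_CONFIG_GEMM_A8W8_BLOCKSCALE and passes the shape list from §7; a non-empty result means some workload shapes were not tuned.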
Checklist (verified)
| Step | Action |
|---|---|
| 1 | Locate SGLang + aiter; verify SGLang ≥ 0.5.6; verify aiter ≥ commit 303a583c... |
| 2 | If needed: git checkout + submodules + clean JIT + python setup.py develop in aiter |
| 3 | Save baseline benchmark + server.log |
| 4 | Read fp8_utils.py / fp8.py; trace block FP8 → aiter path (§4) |
| 5 | Implement §5: env-gated SGLANG_DUMP_AITER_FP8_GEMM_SHAPES (shape lines for §7) and SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH (which w8a8_block_fp8_linear); run from patched tree |
| 6 | Export SGLANG_DUMP_AITER_FP8_GEMM_SHAPES=1 and SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH=1; confirm dispatch logging and [GEMM_shape_dump] / csv_row in server log (§6) |
| 7 | Parse §6 log → deduped shape artifact per §7; hand off to §8’s untuned CSV builder |
| 8 | In aiter csrc/ck_gemm_a8w8_blockscale: read gemm_a8w8_blockscale_tune.py + --help; build untuned M,N,K CSV from §7; run tuner → tuned CSV for §10 |
| 9 | Edit fp8_utils.py imports: CK gemm_a8w8_blockscale on, Triton blockscale off |
| 10 | export AITER_CONFIG_GEMM_A8W8_BLOCKSCALE=...; rerun; compare to baseline |
Pitfalls
- Wrong Python / wrong tree: §5 edits must live in the same SGLang tree / interpreter that runs launch_server.
- Env toggles and tree alignment: SGLANG_DUMP_AITER_FP8_GEMM_SHAPES and SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH take effect only in the checkout where you added the §5 hooks; keep PYTHONPATH / editable installs aligned with launch_server.
- Shape coverage: a workload shape whose (M, N, K) is missing from the CSV may fall back to slower or default kernels. Use §5 dispatch logging to confirm which w8a8_block_fp8_linear implementation is active.
- Tuner CLI drift: always use python3 gemm_a8w8_blockscale_tune.py --help on the checked-out aiter revision; flag names and defaults can change.
- Skip the untuned CK run: do not run the CK backend FP8 benchmark without a tuned CSV, as it may hang. The final goal is to compare tuned CK execution against the baseline (which is probably Triton by default).
- Version / checkout before "clever" code fixes: most confusing runtime errors in this flow (including aiter JIT failures such as NameError: name 'aiter_tensor_t' is not defined) trace back to §1–§2 not being satisfied: wrong aiter HEAD, stale submodules, or stale JIT artifacts (aiter/jit/*.so, aiter/jit/build/*) from another revision. Do not patch around that in application code (for example adding aiter_tensor_t shims in aiter/jit/core.py or SGLang). First realign the tree: run the full §2 sequence (git checkout the required commit, git submodule sync / git submodule update --init --recursive, remove JIT artifacts as in §2, python setup.py develop), confirm git rev-parse HEAD, then re-run. Only treat the failure as a genuine bug to debug in source if it still reproduces on that pinned, clean-JIT checkout.
- Fair speedup comparison: measure speedup against the original end-to-end workflow, that is, the default block-FP8 path your stack actually uses before this skill's CK switch (often Triton gemm_a8w8_blockscale when SGLang + aiter route that way; confirm with SGLANG_LOG_FP8_BLOCK_GEMM_DISPATCH / §4), captured in the §3 baseline with the same model, TP, concurrency, and benchmark flags. The after run is the §9–§10 configuration: CK gemm_a8w8_blockscale, AITER_CONFIG_GEMM_A8W8_BLOCKSCALE set to the §8 tuned CSV, and an otherwise identical pipeline. Do not attribute gains to this skill when the two runs differ in workload shape, scheduler flags, or visible GPUs unless that is explicitly part of the experiment.