sglang-diffusion-benchmark-profile - SKILL.md Agent Skill

name: sglang-diffusion-benchmark-profile
description: Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

SGLang Diffusion Benchmark and Profile

Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.

This skill is diagnosis-first. It owns:

checked-in denoise benchmark presets
perf dump collection and before/after comparison
torch.profiler trace capture and quick hotspot ranking
mapping hot kernels back to known fast paths and fusion families
packaging confirmed kernel work with enough evidence for the appropriate kernel, Nsight, or framework-specific optimization workflow

This skill does not own low-level kernel authoring or standalone Nsight workflows.
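
As an illustration of the "quick hotspot ranking" step, the sketch below sums event durations per name from a Chrome-format trace such as the one torch.profiler can export. It is a minimal sketch, not this skill's own ranking code: the function name `rank_hotspots` is hypothetical, and the category string `"kernel"` is an assumption that may differ across PyTorch versions.

```python
from collections import defaultdict


def rank_hotspots(trace: dict, category: str = "kernel", top_k: int = 5):
    """Sum durations of complete ("X") events per name and return the top_k."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Complete events carry their duration (microseconds) under "dur".
        if event.get("ph") != "X" or event.get("cat") != category:
            continue
        totals[event["name"]] += float(event.get("dur", 0.0))
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Synthetic trace for illustration; a real one would be loaded from the
# exported JSON file. The event names here are made up.
sample_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "attn_fwd", "dur": 300},
        {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 200},
        {"ph": "X", "cat": "kernel", "name": "attn_fwd", "dur": 150},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 999},
        {"ph": "M", "name": "process_name"},
    ]
}
hotspots = rank_hotspots(sample_trace)
```

Ranking by summed duration rather than by single longest call surfaces ops that are individually cheap but run on every denoise step.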

Preflight

Before running any benchmark, profiler, or kernel-validation command:

use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__
verify the repo is writable
export HF_TOKEN before using gated Hugging Face models such as black-forest-labs/FLUX.*
export FLASHINFER_DISABLE_VERSION_CHECK=1
choose idle GPU(s) before starting perf work
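
The first three checks can be sketched in plain Python. This is an illustrative sketch only, not the contents of scripts/diffusion_skill_env.py: the helper names are hypothetical, and using a `.git` directory as the repo-root marker is an assumption that only holds for a source checkout.

```python
import os
import tempfile
from pathlib import Path


def find_repo_root(module_file: str, marker: str = ".git"):
    """Walk up from a module file until a directory containing `marker` is found."""
    for parent in Path(module_file).resolve().parents:
        if (parent / marker).exists():
            return parent
    return None


def is_writable(directory: Path) -> bool:
    """Probe write access by creating (and auto-deleting) a temporary file."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def missing_env(names, env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demo against a throwaway directory tree that mimics a checkout layout.
# In real use the input would be sglang.__file__.
with tempfile.TemporaryDirectory() as tmp:
    fake_repo = Path(tmp) / "repo"
    pkg = fake_repo / "python" / "sglang"
    pkg.mkdir(parents=True)
    (fake_repo / ".git").mkdir()
    init_file = pkg / "__init__.py"
    init_file.write_text("")
    root = find_repo_root(str(init_file))
    root_ok = root == fake_repo.resolve() and is_writable(root)

missing = missing_env(
    ["HF_TOKEN", "FLASHINFER_DISABLE_VERSION_CHECK"],
    env={"FLASHINFER_DISABLE_VERSION_CHECK": "1"},
)
```

HF_TOKEN is only required for gated models, so a missing value there is a warning rather than a failure for ungated checkpoints.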

Native Backend Gate

All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.

Treat any of the following as a hard stop condition:

Falling back to diffusers backend
Using diffusers backend
Loaded diffusers pipeline

If any benchmark, perf-dump, or torch.profiler command prints one of those signals:

stop the workflow immediately
do not keep the generated numbers or traces as SGLang benchmark evidence
do not continue to hotspot classification or kernel work
first fix model resolution, pipeline selection, overlay/materialization, or other backend-selection issues so the model runs on the native SGLang diffusion path
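
The gate can be enforced mechanically by scanning captured command output for the three signals listed above. The following is a minimal sketch; `NonNativeBackendError` and `assert_native_backend` are hypothetical names, and plain case-sensitive substring matching is an assumption about how the messages appear in the logs.

```python
FALLBACK_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


class NonNativeBackendError(RuntimeError):
    """Raised when a run's output shows the diffusers backend was used."""


def assert_native_backend(log_text: str) -> None:
    """Raise on the first log line that contains a hard-stop signal."""
    for line in log_text.splitlines():
        for signal in FALLBACK_SIGNALS:
            if signal in line:
                raise NonNativeBackendError(f"hard stop: {line.strip()}")


# Synthetic log snippets for illustration only.
clean_log = "loading pipeline\nrunning denoise loop\n"
bad_log = "loading pipeline\nWARNING: Falling back to diffusers backend\n"

assert_native_backend(clean_log)  # passes silently
try:
    assert_native_backend(bad_log)
    tripped = False
except NonNativeBackendError:
    tripped = True
```

Raising instead of returning a flag makes it harder for a wrapper script to carry on and record numbers from a non-native run.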

Main Reference

benchmark-and-profile.md — canonical denoise benchmark, perf dump, and torch.profiler workflow; uses checked-in nightly-aligned presets plus current-source extras such as FLUX.2 Klein, Cosmos3, Ideogram4, ERNIE/GLM/SANA image models, FastWan2.2, LTX-2.3 one-stage/two-stage/HQ, HunyuanVideo, MOVA, Helios, JoyAI/FireRed image edit, and Hunyuan3D shape
existing-fast-paths.md — map bottlenecks to existing fused kernels, packed QKV paths, fused QK norm + RoPE, distributed overlap patterns, and open optimization PRs before proposing new code
scripts/diffusion_skill_env.py — preflight helper: repo root discovery via sglang.__file__, write-access probe, benchmark/profile output directories, idle GPU selection
scripts/bench_diffusion_denoise.py — end-to-end denoise benchmark preset runner via sglang generate; supports --no-torch-compile, validates nightly preset drift with --validate-nightly-alignment, and saves perf dumps by label for compare_perf.py

Opportunity Discovery Rule

Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.

Always rule out these existing families first:

HunyuanVideo VAE GroupNorm+SiLU
LTX upsampler GroupNorm+SiLU
Z-Image residual-form modulation
fused diffusion QK norm + RoPE
LTX2 split RoPE
varlen USP attention pack/scatter
NVFP4 / Nunchaku packed QKV
Nunchaku fused GELU MLP
Ulysses / USP attention overlap
turbo-layer async all-to-all overlap
torch.compile compute / communication reorder
dual-stream diffusion execution

If the user explicitly requires torch.compile to stay off, do not run the default benchmark preset invocation unchanged. Either pass --no-torch-compile to scripts/bench_diffusion_denoise.py or run the equivalent manual sglang generate command without --enable-torch-compile.

For FLUX-family manual profiling runs with a quantized transformer override:

use sglang generate directly
pass the override as --transformer-path <dir>
prefer --prompt-path <file> when also fixing --output-file-name
if the base model is already cached locally and the machine has unreliable HF access, pass the locally cached path as --model-path and set HF_HUB_OFFLINE=1
remember that --profile changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim
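
The list above can be turned into a small command builder so that the profiled and non-profiled runs differ only by --profile. This is a sketch under assumptions: `build_flux_profile_command` is a hypothetical helper, every path is a placeholder, --profile is assumed to be a value-less switch, and only flags named in this document are used.

```python
import os


def build_flux_profile_command(model_path, transformer_path, prompt_path,
                               output_file_name=None, profile=False,
                               offline=False):
    """Return (argv, env) for a manual `sglang generate` run."""
    argv = [
        "sglang", "generate",
        "--model-path", model_path,
        "--transformer-path", transformer_path,  # quantized transformer override
        "--prompt-path", prompt_path,
    ]
    if output_file_name:
        argv += ["--output-file-name", output_file_name]
    if profile:
        # Assumed to be a boolean switch; profiled latency is not a benchmark number.
        argv.append("--profile")
    env = dict(os.environ)
    if offline:
        env["HF_HUB_OFFLINE"] = "1"  # rely on the local cache only
    return argv, env


# Placeholder paths; replace with real locations before running anything.
common = dict(
    model_path="/path/to/cached/base-model",
    transformer_path="/path/to/quantized-transformer",
    prompt_path="/path/to/prompts.txt",
    offline=True,
)
baseline_argv, baseline_env = build_flux_profile_command(
    output_file_name="baseline", **common)
profile_argv, profile_env = build_flux_profile_command(
    output_file_name="profiled", profile=True, **common)
```

Either pair can be handed to `subprocess.run(argv, env=env, check=True)`; the before/after claim should come from the baseline run only.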