name: sglang-diffusion-benchmark-profile
description: Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.
SGLang Diffusion Benchmark and Profile
Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.
This skill is diagnosis-first. It owns:
- checked-in denoise benchmark presets
- perf dump collection and before/after comparison
- torch.profiler trace capture and quick hotspot ranking
- mapping hot kernels back to known fast paths and fusion families
- packaging confirmed kernel work with enough evidence for the appropriate kernel, Nsight, or framework-specific optimization workflow
This skill does not own low-level kernel authoring or standalone Nsight workflows.
Preflight
Before running any benchmark, profiler, or kernel-validation command:
- use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__
- verify the repo is writable
- export HF_TOKEN before using gated Hugging Face models such as black-forest-labs/FLUX.*
- export FLASHINFER_DISABLE_VERSION_CHECK=1
- choose idle GPU(s) before starting perf work
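The preflight checks above can be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of scripts/diffusion_skill_env.py: the function names and the use of a `.git` directory as the repo-root marker are assumptions made for the example.

```python
from __future__ import annotations

import os
import tempfile
from collections.abc import Mapping
from pathlib import Path


def find_repo_root(module_file: str, marker: str = ".git") -> Path | None:
    """Walk upward from a module's __file__ until a directory holding `marker` is found.

    The `.git` marker is an assumption for this sketch; the real helper may
    locate the root differently.
    """
    for parent in Path(module_file).resolve().parents:
        if (parent / marker).exists():
            return parent
    return None


def is_writable(directory: Path) -> bool:
    """Probe write access by creating (and automatically deleting) a temp file."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def missing_env(required: tuple[str, ...], env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

A caller would typically pass `sglang.__file__` to `find_repo_root`, then refuse to start perf work if `is_writable` returns `False` or `missing_env(("HF_TOKEN", "FLASHINFER_DISABLE_VERSION_CHECK"))` is non-empty.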
Native Backend Gate
All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.
Treat any of the following as a hard stop condition:
- Falling back to diffusers backend
- Using diffusers backend
- Loaded diffusers pipeline
If any benchmark, perf-dump, or torch.profiler command prints one of those signals:
- stop the workflow immediately
- do not keep the generated numbers or traces as SGLang benchmark evidence
- do not continue to hotspot classification or kernel work
- first fix model resolution, pipeline selection, overlay/materialization, or other backend-selection issues so the model runs on the native SGLang diffusion path
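One way to enforce this gate is to scan the captured command output for the three stop signals before keeping any numbers. The sketch below assumes the log has already been captured as text and that the signals appear verbatim (case-sensitive substring match); the function names are hypothetical.

```python
from __future__ import annotations

# Log signals listed above as hard stop conditions.
STOP_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_backend_fallback(log_text: str) -> list[str]:
    """Return every log line that contains one of the stop signals."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(signal in line for signal in STOP_SIGNALS)
    ]


def assert_native_backend(log_text: str) -> None:
    """Raise if the log shows that the run left the native SGLang diffusion path."""
    hits = find_backend_fallback(log_text)
    if hits:
        raise RuntimeError("non-native diffusion backend detected: " + "; ".join(hits))
```

Calling `assert_native_backend` right after each benchmark or profiler run makes it harder to carry fallback numbers into hotspot classification by accident.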
Main Reference
- benchmark-and-profile.md: canonical denoise benchmark, perf dump, and torch.profiler workflow; uses checked-in nightly-aligned presets plus current-source extras such as FLUX.2 Klein, Cosmos3, Ideogram4, ERNIE/GLM/SANA image models, FastWan2.2, LTX-2.3 one-stage/two-stage/HQ, HunyuanVideo, MOVA, Helios, JoyAI/FireRed image edit, and Hunyuan3D shape
- existing-fast-paths.md: map bottlenecks to existing fused kernels, packed QKV paths, fused QK norm + RoPE, distributed overlap patterns, and open optimization PRs before proposing new code
- scripts/diffusion_skill_env.py: preflight helper for repo root discovery via sglang.__file__, the write-access probe, benchmark/profile output directories, and idle GPU selection
- scripts/bench_diffusion_denoise.py: end-to-end denoise benchmark preset runner via sglang generate; supports --no-torch-compile, validates nightly preset drift with --validate-nightly-alignment, and saves perf dumps by label for compare_perf.py
Opportunity Discovery Rule
Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.
Always rule out these existing families first:
- HunyuanVideo VAE GroupNorm+SiLU
- LTX upsampler GroupNorm+SiLU
- Z-Image residual-form modulation
- fused diffusion QK norm + RoPE
- LTX2 split RoPE
- varlen USP attention pack/scatter
- NVFP4 / Nunchaku packed QKV
- Nunchaku fused GELU MLP
- Ulysses / USP attention overlap
- turbo-layer async all-to-all overlap
- torch.compile compute / communication reorder
- dual-stream diffusion execution
If the user explicitly requires torch.compile to stay off, do not use the
default benchmark preset invocation unchanged. Either pass the checked-in
benchmark helper its no-compile switch or run the equivalent manual command
without --enable-torch-compile.
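For the manual route, a small filter can guarantee that a copied command no longer enables compilation. This is a minimal sketch that assumes `--enable-torch-compile` is passed either as a bare switch or in `--flag=value` form; the function name is hypothetical.

```python
from __future__ import annotations

COMPILE_FLAG = "--enable-torch-compile"


def without_torch_compile(argv: list[str]) -> list[str]:
    """Return a copy of argv with the torch.compile switch removed.

    Assumes the flag is a bare switch or uses the --flag=value form, so no
    separate value token follows it.
    """
    return [
        arg
        for arg in argv
        if arg != COMPILE_FLAG and not arg.startswith(COMPILE_FLAG + "=")
    ]
```

The input list is left unmodified, so the original preset command can still be logged next to the no-compile variant.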
For FLUX-family manual profiling runs with a quantized transformer override:
- use sglang generate directly
- pass the override as --transformer-path <dir>
- prefer --prompt-path <file> when also fixing --output-file-name
- if the base model is already cached locally and the machine has unreliable HF access, use the local cached --model-path plus HF_HUB_OFFLINE=1
- remember that --profile changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim
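The rules above can be combined into a small command builder. This sketch only assembles the argument list and environment; it uses the flags named in this section, assumes `--profile` is a bare switch, and does not claim to cover every argument a real run needs. The function name and all paths are placeholders.

```python
from __future__ import annotations

import os


def build_flux_profile_command(
    model_path: str,
    transformer_dir: str,
    prompt_file: str,
    output_file_name: str,
    offline: bool = False,
    profile: bool = False,
) -> tuple[list[str], dict[str, str]]:
    """Assemble argv and env for a manual FLUX-family run with a transformer override."""
    argv = [
        "sglang", "generate",
        "--model-path", model_path,
        "--transformer-path", transformer_dir,
        "--prompt-path", prompt_file,
        "--output-file-name", output_file_name,
    ]
    if profile:
        # Profiling inflates latency; keep these runs out of before/after claims.
        argv.append("--profile")
    env = dict(os.environ)
    if offline:
        # Use the locally cached base model when HF access is unreliable.
        env["HF_HUB_OFFLINE"] = "1"
    return argv, env
```

The result can be handed to `subprocess.run(argv, env=env, check=True)`. Building the command twice, once with `profile=True` for the trace and once without for the perf dump, keeps the two runs otherwise identical.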