sglang-diffusion-benchmark-profile


Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

By sgl-project. Updated 6/9/2026

name: sglang-diffusion-benchmark-profile
description: Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

SGLang Diffusion Benchmark and Profile

Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.

This skill is diagnosis-first. It owns:

  • checked-in denoise benchmark presets
  • perf dump collection and before/after comparison
  • torch.profiler trace capture and quick hotspot ranking
  • mapping hot kernels back to known fast paths and fusion families
  • handing off confirmed kernel work, with enough supporting evidence, to the appropriate kernel, Nsight, or framework-specific optimization workflow
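The before/after comparison in the list above can be sketched as follows. This is a minimal illustration, assuming each run is already reduced to a list of denoise latencies in seconds; it does not read the real perf dump format and is not the repository's compare_perf.py. The function name `compare_latency` and the sample numbers are hypothetical.

```python
from statistics import median


def compare_latency(before_s, after_s):
    """Compare two runs by median latency (seconds).

    Hypothetical helper: the real perf dumps may carry more fields.
    """
    if not before_s or not after_s:
        raise ValueError("both runs need at least one sample")
    base = median(before_s)
    new = median(after_s)
    return {
        "before_median_s": base,
        "after_median_s": new,
        "speedup": base / new,                      # > 1.0 means faster
        "change_pct": (new - base) / base * 100.0,  # negative means faster
    }


# Illustrative numbers only, not measured results.
report = compare_latency([2.0, 2.2, 2.1], [1.4, 1.5, 1.6])
print(report)
```

The median is used here because it is less sensitive to a single slow warm-up step than the mean; that choice is an assumption of this sketch, not something the skill prescribes.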

This skill does not own low-level kernel authoring or standalone Nsight workflows.

Preflight

Before running any benchmark, profiler, or kernel-validation command:

  • use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__
  • verify the repo is writable
  • export HF_TOKEN before using gated Hugging Face models such as black-forest-labs/FLUX.*
  • export FLASHINFER_DISABLE_VERSION_CHECK=1
  • choose idle GPU(s) before starting perf work
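As a rough illustration of the checks above, here is a minimal sketch. It is not the contents of scripts/diffusion_skill_env.py; the function names, the ".git" marker heuristic for locating the repo root, and the wording of the messages are all assumptions. Idle GPU selection is left out.

```python
import os
import tempfile
from pathlib import Path


def find_repo_root(package_file):
    """Walk up from a package file until a directory containing .git is found.

    Assumption: the checkout is a git working tree. Returns None otherwise.
    """
    start = Path(package_file).resolve()
    for parent in start.parents:
        if (parent / ".git").exists():
            return parent
    return None


def is_writable(directory):
    """Probe write access by creating (and auto-removing) a temporary file."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def preflight(package_file, env=None):
    """Return (repo_root, problems); an empty problem list means all checks passed."""
    env = os.environ if env is None else env
    root = find_repo_root(package_file)
    problems = []
    if root is None:
        problems.append("no repo root found above " + str(package_file))
    elif not is_writable(root):
        problems.append("repo is not writable: " + str(root))
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed for gated models)")
    if env.get("FLASHINFER_DISABLE_VERSION_CHECK") != "1":
        problems.append("FLASHINFER_DISABLE_VERSION_CHECK is not 1")
    return root, problems
```

In a real environment this would be called as `preflight(sglang.__file__)` after `import sglang`, and any non-empty `problems` list would be resolved before starting perf work.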

Native Backend Gate

All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.

Treat any of the following log messages as a hard-stop condition:

  • Falling back to diffusers backend
  • Using diffusers backend
  • Loaded diffusers pipeline

If any benchmark, perf-dump, or torch.profiler command prints one of those signals:

  • stop the workflow immediately
  • do not keep the generated numbers or traces as SGLang benchmark evidence
  • do not continue to hotspot classification or kernel work
  • first fix model resolution, pipeline selection, overlay/materialization, or other backend-selection issues so the model runs on the native SGLang diffusion path
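The gate above can be enforced mechanically by scanning captured command output for the three signals. The following is a minimal sketch, assuming the output has already been collected into a string; the names `assert_native_backend` and `NonNativeBackendError` are hypothetical, and the case-insensitive match is a choice of this sketch.

```python
# The three hard-stop signals listed in this section.
HARD_STOP_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


class NonNativeBackendError(RuntimeError):
    """Raised when a run did not stay on the native SGLang diffusion path."""


def assert_native_backend(log_text):
    """Raise if any hard-stop signal appears in the captured output."""
    lowered = log_text.lower()
    hits = [s for s in HARD_STOP_SIGNALS if s.lower() in lowered]
    if hits:
        raise NonNativeBackendError(
            "discard this run; diffusers backend detected: " + "; ".join(hits)
        )


# A clean log passes silently.
assert_native_backend("denoise step 1/50 finished")
```

Calling this on every benchmark, perf-dump, and profiler log before keeping any numbers makes the stop condition hard to overlook.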

Main Reference

  • benchmark-and-profile.md — canonical denoise benchmark, perf dump, and torch.profiler workflow; uses checked-in nightly-aligned presets plus current-source extras such as FLUX.2 Klein, Cosmos3, Ideogram4, ERNIE/GLM/SANA image models, FastWan2.2, LTX-2.3 one-stage/two-stage/HQ, HunyuanVideo, MOVA, Helios, JoyAI/FireRed image edit, and Hunyuan3D shape
  • existing-fast-paths.md — map bottlenecks to existing fused kernels, packed QKV paths, fused QK norm + RoPE, distributed overlap patterns, and open optimization PRs before proposing new code
  • scripts/diffusion_skill_env.py — preflight helper: repo root discovery via sglang.__file__, write-access probe, benchmark/profile output directories, idle GPU selection
  • scripts/bench_diffusion_denoise.py — end-to-end denoise benchmark preset runner via sglang generate; supports --no-torch-compile, validates nightly preset drift with --validate-nightly-alignment, and saves perf dumps by label for compare_perf.py

Opportunity Discovery Rule

Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.

Always rule out these existing families first:

  • HunyuanVideo VAE GroupNorm+SiLU
  • LTX upsampler GroupNorm+SiLU
  • Z-Image residual-form modulation
  • fused diffusion QK norm + RoPE
  • LTX2 split RoPE
  • varlen USP attention pack/scatter
  • NVFP4 / Nunchaku packed QKV
  • Nunchaku fused GELU MLP
  • Ulysses / USP attention overlap
  • turbo-layer async all-to-all overlap
  • torch.compile compute / communication reorder
  • dual-stream diffusion execution

If the user explicitly requires torch.compile to stay off, do not run the default benchmark preset invocation unchanged. Either pass --no-torch-compile to scripts/bench_diffusion_denoise.py, or run the equivalent manual command without --enable-torch-compile.
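For the manual route, dropping the compile flag from an existing argument list can be sketched as below. The helper name `without_torch_compile` is hypothetical, and the example argument list is a placeholder rather than a complete sglang generate invocation.

```python
def without_torch_compile(argv):
    """Return a copy of argv with --enable-torch-compile removed."""
    return [arg for arg in argv if arg != "--enable-torch-compile"]


# Placeholder command; "<model>" stands in for a real model path.
manual = ["sglang", "generate", "--model-path", "<model>", "--enable-torch-compile"]
print(without_torch_compile(manual))
```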

For FLUX-family manual profiling runs with a quantized transformer override:

  • use sglang generate directly
  • pass the override as --transformer-path <dir>
  • prefer --prompt-path <file> when also fixing --output-file-name
  • if the base model is already cached locally and the machine has unreliable HF access, use the local cached --model-path plus HF_HUB_OFFLINE=1
  • remember that --profile changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim
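The points above can be combined into a small command builder. This is a sketch under stated assumptions: it uses only the flags named in this section, the function name `build_flux_command` and its parameters are hypothetical, and a real run may need further arguments that are not listed here.

```python
import os


def build_flux_command(model_path, transformer_dir, prompt_file,
                       output_name, offline=False, profile=False):
    """Assemble argv and environment for a manual sglang generate run.

    Hypothetical helper; only flags mentioned in this section are used.
    """
    argv = [
        "sglang", "generate",
        "--model-path", model_path,
        "--transformer-path", transformer_dir,  # quantized transformer override
        "--prompt-path", prompt_file,
        "--output-file-name", output_name,
    ]
    if profile:
        # Profiling changes latency substantially; leave it off for
        # the run that backs the before/after benchmark claim.
        argv.append("--profile")
    env = dict(os.environ)
    if offline:
        # Use the locally cached model when HF access is unreliable.
        env["HF_HUB_OFFLINE"] = "1"
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)`. Building the profiled and non-profiled commands from the same function keeps the two runs identical apart from the `--profile` flag.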

Install via CLI

npx skills add https://github.com/sgl-project/sglang --skill sglang-diffusion-benchmark-profile

Repository Details

  • Stars: 29,123
  • Forks: 6,576
  • Branch: main
  • Path: SKILL.md