diffusion-kernel

Index for SGLang Diffusion kernel development skills.

By WorldFlowAI. Updated 3/16/2026.

name: diffusion-kernel
description: Index for SGLang Diffusion kernel development skills.

Diffusion Kernel Skills

Rule: Follow User Kernel Language Preference

If the user explicitly states a preference for Triton or CUDA, implement and optimize kernels in that language, even if the other option would also work. Do not switch languages for convenience.

Directory Layout

python/sglang/multimodal_gen/.claude/skills/diffusion-kernel/
├── SKILL.md
├── add-triton-kernel.md
├── add-cuda-kernel.md
├── diffusion-benchmark-and-profile.md
├── nsight-profiler.md
├── use-efficient-diffusion-kernels.md
├── references/
│   ├── kernel-templates.md          # Copy-paste CUDA kernel templates (sglang JIT style)
│   ├── troubleshooting.md           # Build/perf/integration issues & fixes
│   ├── h100-optimization-guide.md   # H100 (sm_90) deep dive
│   ├── a100-optimization-guide.md   # A100 (sm_80) deep dive
│   └── t4-optimization-guide.md     # T4 (sm_75, FP16 only) deep dive
└── scripts/
    ├── bench_diffusion_rmsnorm.py   # RMSNorm micro-benchmark vs PyTorch
    └── bench_diffusion_denoise.py   # End-to-end denoise benchmark (sglang generate)

Index

Before running any benchmark, profiler, or kernel-validation command, use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__, verify that the repo is writable, export FLASHINFER_DISABLE_VERSION_CHECK=1, and choose idle GPU(s).
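The preflight steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/diffusion_skill_env.py: the function names (`find_repo_root`, `is_writable`, `parse_idle_gpus`, `query_gpus`, `preflight`), the `.git` marker heuristic, the idleness thresholds, and the use of `CUDA_VISIBLE_DEVICES` to pin the chosen GPUs are all assumptions made for the example.

```python
import os
import subprocess
from pathlib import Path


def find_repo_root(module_file: str) -> Path:
    """Walk up from a module file (e.g. sglang.__file__) to the first
    directory that contains a .git entry (hypothetical heuristic)."""
    start = Path(module_file).resolve()
    for parent in start.parents:
        if (parent / ".git").exists():
            return parent
    raise RuntimeError(f"no repository root found above {start}")


def is_writable(path: Path) -> bool:
    """Return True if the current user may write to the given path."""
    return os.access(path, os.W_OK)


def parse_idle_gpus(csv_text: str, max_mem_mib: int = 1024, max_util_pct: int = 5) -> list:
    """Pick GPU indices whose used memory and utilization are below the
    (assumed) thresholds. Expects rows of 'index, memory.used, utilization.gpu'."""
    idle = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            index, mem_used, util = (int(field) for field in fields)
        except ValueError:
            continue  # skip malformed rows or non-numeric values such as "[N/A]"
        if mem_used <= max_mem_mib and util <= max_util_pct:
            idle.append(index)
    return idle


def query_gpus() -> str:
    """Ask nvidia-smi for per-GPU memory use and utilization as bare CSV."""
    return subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )


def preflight(module_file: str, num_gpus: int = 1) -> Path:
    """Run the preflight checks and return the repo root."""
    root = find_repo_root(module_file)
    if not is_writable(root):
        raise PermissionError(f"repo root is not writable: {root}")
    os.environ["FLASHINFER_DISABLE_VERSION_CHECK"] = "1"
    idle = parse_idle_gpus(query_gpus())
    if len(idle) < num_gpus:
        raise RuntimeError(f"need {num_gpus} idle GPU(s), found {len(idle)}")
    # Assumption: the selected GPUs are pinned via CUDA_VISIBLE_DEVICES.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in idle[:num_gpus])
    return root
```

A caller would import sglang and pass `sglang.__file__` to `preflight`. Keeping the nvidia-smi parsing in its own function means the selection logic can be checked on a machine without a GPU.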

  • scripts/diffusion_skill_env.py

    Shared preflight helper for all diffusion skill commands. Use it to print the repo root, create benchmark/profile output directories, and choose idle GPUs before running sglang generate, torch profiler, nsys, or ncu.

  • add-triton-kernel.md

    Step-by-step guide for adding a new Triton kernel to SGLang Diffusion's jit_kernel/diffusion/triton/ module, including authoring, autotune, torch.compile compatibility, integration, and tests. Use for fused elementwise ops, norm variants, RoPE variants, or when NPU/CPU fallback is needed.

  • add-cuda-kernel.md

    Step-by-step guide for adding a JIT CUDA kernel. The CUDA source goes in jit_kernel/csrc/diffusion/<op>.cuh and the Python wrapper in jit_kernel/diffusion/<op>.py. The guide uses SGLang's JIT compilation system (load_jit, cache_once) and internal abstractions (TensorMatcher, device::AlignedVector, host::LaunchKernel, device::warp::reduce_sum). Use for bandwidth-bound reductions (RMSNorm, LayerNorm) or ops that need fine-grained control over vectorization and shared memory. Adapted from the HuggingFace kernels cuda-kernels skill.

  • use-efficient-diffusion-kernels.md

    Practical guidance for using SGLang Diffusion fused kernels and fast CUDA paths, including constraints, fallbacks, and where the fused ops are wired into the runtime.

  • diffusion-benchmark-and-profile.md

    Denoise-stage benchmark and profiling guide for SGLang Diffusion models. Three profiling levels: Level 1 (torch.profiler — kernel time ranking), Level 2 (nsys — category breakdown), Level 3 (ncu — per-kernel bandwidth/occupancy/roofline analysis). ncu is critical for kernel optimization — always use it when writing or tuning custom kernels to verify hardware saturation.

  • nsight-profiler.md

    Advanced profiling skill for NVIDIA Nsight Systems / Nsight Compute: collecting traces, reading reports, and interpreting kernel-level performance metrics.
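To make "hardware saturation" concrete for a bandwidth-bound kernel such as RMSNorm, a measured kernel time can be converted into achieved bandwidth and compared with the device peak. The sketch below is illustrative arithmetic only. The helper names, the traffic model (one read of the input, one write of the output, one read of the weight vector), the example shape and timing, and the peak-bandwidth figure are assumptions; ncu reports measured memory throughput directly and should be treated as the source of truth.

```python
def rmsnorm_bytes_moved(num_tokens: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    """Minimum DRAM traffic for an RMSNorm pass under the assumed model:
    read x and write y (num_tokens x hidden_size each), plus one read of the weight."""
    activations = 2 * num_tokens * hidden_size * dtype_bytes
    weight = hidden_size * dtype_bytes
    return activations + weight


def achieved_bandwidth_gbs(bytes_moved: int, kernel_time_us: float) -> float:
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel time in microseconds."""
    return bytes_moved / (kernel_time_us * 1e-6) / 1e9


def bandwidth_utilization(bytes_moved: int, kernel_time_us: float, peak_gbs: float) -> float:
    """Fraction of the given peak bandwidth that the kernel achieved."""
    return achieved_bandwidth_gbs(bytes_moved, kernel_time_us) / peak_gbs


# Hypothetical example: 4096 tokens, hidden size 3072, 2-byte elements, 25 us kernel.
# The peak value is a placeholder; substitute the figure for the actual GPU.
traffic = rmsnorm_bytes_moved(4096, 3072)
fraction = bandwidth_utilization(traffic, kernel_time_us=25.0, peak_gbs=3350.0)
print(f"{traffic} bytes moved, {fraction:.0%} of assumed peak bandwidth")
```

A kernel that lands well below the peak under this model is a candidate for wider vectorized loads or fewer passes over the data. One that is already close to the peak has little left to gain from memory-side tuning.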

References (GPU optimization guides, templates, troubleshooting)

Loaded by add-cuda-kernel.md. Adapted from the HuggingFace kernels cuda-kernels skill.

Scripts (runnable benchmarks)

Install via CLI
npx skills add https://github.com/WorldFlowAI/sglang --skill diffusion-kernel