name: vllm-omni-diffusion-perf-optim
description: Guide for achieving optimal inference performance with vLLM-Omni diffusion models. Covers all lossless and lossy optimization methods (parallelism, torch.compile, CPU offload, quantization, cache acceleration), per-model support tables, and ready-to-use recipes. Use when asked to speed up diffusion inference, reduce latency, lower VRAM usage, or tune a diffusion pipeline.
vLLM-Omni Diffusion: Optimal Performance Guide
Use this guide when a user asks how to speed up diffusion inference, reduce latency, lower VRAM, or tune a diffusion pipeline in vLLM-Omni.
This skill is designed to stay up to date. Instead of hardcoding model support tables, it tells you where to look in the codebase to discover current capabilities. See Discovering Current Capabilities and Extending This Skill at the end.
Step 0: Understand the Baseline
Before optimizing, establish a baseline:
- Identify the model and its pipeline class (check model_index.json → _class_name)
- Run a baseline with --enforce-eager (disables torch.compile) and no parallelism
- Record: server inference time, e2e latency, VRAM usage, output quality
Online serving (preferred — measures real deployment latency):
# Start server
vllm serve <MODEL> --omni --port 8098 --enforce-eager
# Send request and measure e2e time
time curl -sS -X POST http://localhost:8098/v1/videos \
-F "prompt=..." -F "width=768" -F "height=480" \
-F "num_frames=41" -F "num_inference_steps=20" -F "seed=42"
# Poll until completed, record inference_time_s from status response
curl -sS http://localhost:8098/v1/videos/<VIDEO_ID> | jq '.inference_time_s'
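The polling step above can be scripted. The following is a minimal sketch, not part of vLLM-Omni: `poll_until_done` and `fetch_status` are hypothetical names, and the `status` field with the values `completed` and `failed` is an assumption about the response shape (only `inference_time_s` appears in this guide). Pass in any callable that returns the parsed JSON of the status endpoint.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the job reports a terminal state.

    Assumption: the status payload carries a "status" field whose terminal
    values are "completed" and "failed". Adjust to the real response.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"generation failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

Once it returns, read `inference_time_s` from the result and record it next to the wall-clock time of the whole request.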
Offline inference (useful for quick iteration):
python examples/offline_inference/text_to_video/text_to_video.py \
--model <MODEL> --enforce-eager --prompt "..." --output baseline.mp4
Important: Always report online serving numbers for deployment decisions. Offline benchmarks may differ due to process startup, torch.compile warmup, and measurement methodology.
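Because the first request can include warmup, it helps to time several requests and report the first one separately. A small sketch under that assumption follows; `measure_latency` and `send_request` are hypothetical names, and `send_request` stands for whatever function submits one request and blocks until the result is ready.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(send_request: Callable[[], object], num_requests: int = 5) -> Dict[str, float]:
    """Time repeated requests and separate the first (cold) one from the rest (warm)."""
    if num_requests < 2:
        raise ValueError("need at least 2 requests to separate cold from warm")
    timings = []
    for _ in range(num_requests):
        start = time.perf_counter()
        send_request()
        timings.append(time.perf_counter() - start)
    warm = timings[1:]  # everything after the first request
    return {
        "first_request_s": timings[0],
        "warm_median_s": statistics.median(warm),
        "warm_max_s": max(warm),
    }
```

Compare configurations on `warm_median_s`, and treat `first_request_s` as a separate startup cost.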
Step 1: Apply Lossless Optimizations
These do not affect output quality. Apply in order of impact.
1.1 torch.compile (Regional Compilation)
What: Compiles repeated DiT transformer blocks via torch.compile(dynamic=True). Fuses ops, reduces kernel launch overhead.
How: Enabled by default. Use --enforce-eager to disable.
Speedup: Model- and GPU-dependent. May provide 1.1–1.5× on the denoising loop, but on some GPU architectures (e.g., H800) and models, warm-request latency may match eager.
Requirements: Model transformer must define _repeated_blocks attribute. First request pays compilation overhead (~5–15s extra).
Online serving note: The first request after server start incurs compilation warmup. Subsequent requests run at compiled speed. For latency-sensitive deployments, consider --enforce-eager to avoid first-request penalty, especially if compile does not measurably improve warm latency for your model/GPU.
Config: OmniDiffusionConfig.enforce_eager (default False = compile enabled).
Source: vllm_omni/diffusion/compile.py, vllm_omni/diffusion/worker/diffusion_model_runner.py
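Whether compilation pays off depends on how many requests share the one-time warmup. The sketch below is plain arithmetic, not a vLLM-Omni API; `compile_break_even_requests` is a hypothetical helper, and the numbers in the usage note are illustrative values inside the ranges quoted above, not measurements.

```python
import math
from typing import Optional


def compile_break_even_requests(
    eager_s: float, speedup: float, compile_overhead_s: float
) -> Optional[int]:
    """Number of requests after which compile warmup is amortized.

    Returns None when compile gives no warm-latency gain (speedup <= 1),
    in which case --enforce-eager is the simpler choice.
    """
    if speedup <= 1.0:
        return None
    saved_per_request_s = eager_s - eager_s / speedup
    return math.ceil(compile_overhead_s / saved_per_request_s)
```

For example, with a 20 s eager request, a 1.25× speedup, and 10 s of warmup, each compiled request saves 4 s, so the warmup is recovered after 3 requests.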
1.2 Multi-GPU Parallelism
All configured via DiffusionParallelConfig. Check docs/user_guide/diffusion/parallelism_acceleration.md for the per-model support table before enabling.
Sequence Parallelism (Ulysses-SP)
What: Splits sequence tokens across GPUs using all-to-all communication (DeepSpeed Ulysses).
How: --ulysses-degree N (offline) or --usp N (online serving)
Speedup: Near-linear scaling. Best for long-sequence models (video, high-res image).
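# Import path for Omni is assumed; adjust it to your vLLM-Omni version.
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

# Split the sequence across 2 GPUs with Ulysses-SP.
parallel_config = DiffusionParallelConfig(ulysses_degree=2)
omni = Omni(model="...", parallel_config=parallel_config)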
Ring Attention
What: Ring-based P2P communication for attention across GPUs.
How: --ring-degree N (offline) or --ring N (online serving)
Note: Can combine with Ulysses: ulysses_degree × ring_degree = total SP GPUs.
CFG Parallel
What: Runs positive/negative CFG branches on separate GPUs. Only rank 0 computes scheduler step.
How: --cfg-parallel-size 2
Speedup: ~2× on models using classifier-free guidance.
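Constraint: The CFG-parallel size must be exactly 2 (one rank per branch), so at least 2 GPUs are needed. It can be combined with other parallelism, as in the 4-GPU example below. Only for models that use CFG.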
# 4-GPU: CFG parallel (2) × Ulysses (2)
python text_to_image.py --model Qwen/Qwen-Image \
--cfg-parallel-size 2 --ulysses-degree 2
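The degrees multiply, which is easy to get wrong when sizing a launch. The helper below is a hypothetical sanity check, not part of vLLM-Omni. The Ulysses × ring and CFG × Ulysses products follow the text above; multiplying in the tensor-parallel size as well is an assumption.

```python
def required_gpus(
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    cfg_parallel_size: int = 1,
    tensor_parallel_size: int = 1,
) -> int:
    """GPU count implied by a set of parallel degrees (product of all degrees)."""
    degrees = {
        "ulysses_degree": ulysses_degree,
        "ring_degree": ring_degree,
        "cfg_parallel_size": cfg_parallel_size,
        "tensor_parallel_size": tensor_parallel_size,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    if cfg_parallel_size not in (1, 2):
        raise ValueError("cfg_parallel_size must be 1 (off) or 2")
    # Assumption: TP composes multiplicatively with SP and CFG parallel.
    return ulysses_degree * ring_degree * cfg_parallel_size * tensor_parallel_size
```

With `cfg_parallel_size=2` and `ulysses_degree=2` this gives 4, matching the command above.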
Tensor Parallelism (TP)
What: Shards DiT linear layers across GPUs using ColumnParallelLinear, RowParallelLinear, QKVParallelLinear.
How: --tensor-parallel-size N
Note: Only DiT blocks are sharded — text encoder is replicated on all ranks (extra VRAM per GPU). See Issue #771.
VAE Patch Parallelism
What: Shards VAE decode spatially across ranks using tiling.
How: --vae-patch-parallel-size N
Constraint: Auto-enables --vae-use-tiling.
HSDP (Hybrid Sharded Data Parallel)
What: Shards model weights across GPUs using PyTorch FSDP2. Reduces per-GPU VRAM.
How: DiffusionParallelConfig(use_hsdp=True), or --use-hsdp for online serving. Requires multi-GPU.
Expert Parallel
What: Shards MoE experts across devices with all-to-all token routing.
How: --enable-expert-parallel
Constraint: Only for MoE models (e.g., HunyuanImage3.0).
1.3 CPU Offload
Two mutually exclusive strategies. Both single-GPU only.
Model-level (Sequential) Offload
What: Swaps DiT ↔ encoders on GPU. Only one group is on GPU at a time.
How: --enable-cpu-offload or Omni(enable_cpu_offload=True)
Tradeoff: Adds H2D transfer latency between encoder and denoising phases.
Layerwise (Blockwise) Offload
What: Keeps only 1 transformer block on GPU at a time. Async prefetch via separate CUDA stream.
How: --enable-layerwise-offload or Omni(enable_layerwise_offload=True)
Best for: Large video models (Wan A14B) where per-block compute >> H2D transfer → nearly zero-cost offload.
Requirement: Model DiT must define _layerwise_offload_blocks_attr.
VRAM savings: Dramatic (e.g., 40+ GB → ~11 GB for Wan A14B).
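The "per-block compute >> H2D transfer" condition can be checked with a rough estimate before enabling the flag. This is a simplified model under stated assumptions: the next block's prefetch fully overlaps the current block's compute, every block is the same size, and the link sustains the given bandwidth. `layerwise_offload_overhead` and all of its inputs are hypothetical, to be filled in from your own profiling.

```python
from typing import Dict


def layerwise_offload_overhead(
    num_blocks: int,
    block_bytes: float,
    compute_ms_per_block: float,
    h2d_gb_per_s: float,
) -> Dict[str, float]:
    """Estimate the exposed (non-overlapped) transfer time per denoising step."""
    transfer_ms = block_bytes / (h2d_gb_per_s * 1e9) * 1e3
    # Only the part of the copy that outlasts the compute stalls the GPU.
    stall_ms = max(0.0, transfer_ms - compute_ms_per_block)
    return {
        "transfer_ms": transfer_ms,
        "stall_ms_per_step": stall_ms * num_blocks,
    }
```

If `stall_ms_per_step` is zero, the copy hides behind compute and offload is close to free. If it is large, expect offload to cost latency in exchange for the VRAM savings.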
1.4 VAE Memory Optimizations
- --vae-use-slicing: Process VAE in slices (saves VRAM).
- --vae-use-tiling: Process VAE in tiles (saves VRAM, enables patch parallel).
Both are boolean flags. Use when OOM during VAE decode.
1.5 Quantization
FP8 (W8A8)
What: Online quantization of DiT linear layers to FP8.
How: --quantization fp8
Requirements: Ada/Hopper GPU (SM89+). Native hardware FP8.
VRAM: ~50% reduction on DiT weights. Speedup: 1.3–1.5×.
python text_to_image.py --model Qwen/Qwen-Image --quantization fp8
Layer skipping: --ignored-layers 'add_kv_proj,to_add_out' to exclude specific layers from quantization.
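The ~50% figure follows from bytes per weight: 2 for BF16 versus 1 for FP8. A back-of-the-envelope sketch, with a hypothetical helper name and a parameter count you must supply yourself:

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory taken by weights alone, in GiB (ignores activations and scales)."""
    return num_params * bytes_per_param / 2**30


# Illustrative only: a hypothetical 14e9-parameter DiT.
bf16_gib = weight_memory_gib(14e9, 2)  # about 26.1 GiB
fp8_gib = weight_memory_gib(14e9, 1)   # about 13.0 GiB
```

Layers excluded through --ignored-layers stay at their original precision, so the realized saving is somewhat below half.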
GGUF (Pre-quantized)
What: Loads pre-quantized GGUF weights for transformer.
How: --quantization gguf --gguf-model <path-or-hf-id>
Source: docs/user_guide/diffusion/quantization/gguf.md
Step 2: Apply Lossy Optimizations (Optional)
These trade quality for speed. Always compare output quality against baseline.
2.1 TeaCache
What: Caches transformer computations when consecutive timesteps are similar. Skips redundant forward passes.
Speedup: ~1.5–2.25× depending on rel_l1_thresh (see Quality tuning).
How:
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
CLI: --cache-backend tea_cache
Online: vllm serve <MODEL> --omni --cache-backend tea_cache --cache-config '{"rel_l1_thresh": 0.2}'
Quality tuning:
- 0.1–0.2: minimal quality loss (~1.5× speedup)
- 0.4: slight quality loss (~1.8× speedup)
- 0.6–0.8: noticeable quality loss (~2.0–2.25× speedup)
Supported models: Qwen-Image family, BAGEL. See docs/user_guide/diffusion/teacache.md.
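To build intuition for what rel_l1_thresh controls, here is a toy sketch of a TeaCache-style skip rule. It is not the vLLM-Omni implementation. It assumes the rule accumulates the relative L1 change between consecutive inputs and recomputes once the total crosses the threshold, with the first and last steps always computed. It uses plain lists instead of tensors and omits any rescaling of the distance. `rel_l1` and `plan_steps` are hypothetical names.

```python
from typing import List, Sequence


def rel_l1(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Relative L1 distance between two feature vectors."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev)
    return num / den if den else float("inf")


def plan_steps(inputs: Sequence[Sequence[float]], rel_l1_thresh: float) -> List[bool]:
    """Return True for steps that run the transformer, False for cached steps."""
    decisions: List[bool] = []
    accumulated = 0.0
    prev = None
    for i, x in enumerate(inputs):
        is_edge = i == 0 or i == len(inputs) - 1  # always compute first/last
        if not is_edge:
            accumulated += rel_l1(prev, x)
        if is_edge or accumulated >= rel_l1_thresh:
            decisions.append(True)
            accumulated = 0.0  # reset after a real forward pass
        else:
            decisions.append(False)
        prev = x
    return decisions
```

For inputs that drift slowly, a higher threshold lets more consecutive steps reuse the cache, which is where both the speedup and the quality loss come from.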
2.2 Cache-DiT (DBCache + TaylorSeer + SCM)
What: Hybrid caching with three sub-methods:
- DBCache: Caches intermediate block outputs when residuals are small
- TaylorSeer: Taylor expansion to forecast future hidden states
- SCM: Step Computation Masking — selectively skip entire denoising steps
Speedup: 1.5–2.5× depending on configuration.
How:
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        # DBCache
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.24,
        "max_continuous_cached_steps": 3,
        # TaylorSeer (optional)
        "enable_taylorseer": False,
        "taylorseer_order": 1,
        # SCM (optional)
        "scm_steps_mask_policy": None,  # "slow"/"medium"/"fast"/"ultra"
        "scm_steps_policy": "dynamic",
    },
)
CLI: --cache-backend cache_dit
Excluded models: NextStep11Pipeline, StableDiffusionPipeline (see _NO_CACHE_ACCELERATION in registry.py).
Source: docs/user_guide/diffusion/cache_dit_acceleration.md
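As a rough illustration of the TaylorSeer idea with taylorseer_order set to 1, the sketch below extrapolates a hidden state linearly from the last two fully computed steps. It is a conceptual toy on plain lists, not Cache-DiT's code, and `taylor_forecast` is a hypothetical name.

```python
from typing import List, Sequence


def taylor_forecast(
    prev_feature: Sequence[float],
    last_feature: Sequence[float],
    steps_between: int,
    steps_ahead: int,
) -> List[float]:
    """First-order Taylor forecast of a feature `steps_ahead` steps past the last one.

    The derivative is approximated by the finite difference between the two
    most recent computed features, which were `steps_between` steps apart.
    """
    if steps_between <= 0:
        raise ValueError("steps_between must be positive")
    return [
        last + (last - prev) / steps_between * steps_ahead
        for prev, last in zip(prev_feature, last_feature)
    ]
```

The forecast is only as good as the local linearity of the features, which is one reason a cap such as max_continuous_cached_steps is useful.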
2.3 Fewer Inference Steps
Reducing --num-inference-steps gives a roughly linear speedup of the denoising loop but affects quality. Typical ranges:
- Image models: 20–50 steps
- Video models: 20–40 steps
- Distilled models: 4–8 steps
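The end-to-end gain from fewer steps is usually smaller than the step ratio, because text encoding and VAE decode do not shrink. A simple cost model, assuming a fixed per-request overhead plus a constant per-step cost, makes this concrete. `estimate_latency_s` is a hypothetical helper and the numbers are illustrative, not measured.

```python
def estimate_latency_s(num_steps: int, per_step_s: float, fixed_s: float) -> float:
    """Request latency as fixed overhead plus a constant cost per denoising step."""
    return fixed_s + num_steps * per_step_s


# Illustrative numbers: 0.5 s per step, 4 s of encode/decode overhead.
full = estimate_latency_s(40, 0.5, 4.0)    # 24.0 s
halved = estimate_latency_s(20, 0.5, 4.0)  # 14.0 s
speedup = full / halved                    # about 1.71x, not 2x
```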
Discovering Current Capabilities
The tables below may become stale as new models and methods are added. Always verify against the live codebase using these source-of-truth files:
Parallelism support per model
Read the canonical table in docs/user_guide/diffusion/parallelism_acceleration.md.
It lists every model with ✅/❌ for each parallelism method (Ulysses-SP, Ring, CFG, TP, VAE-Patch, Expert, HSDP).
To check programmatically whether a specific model supports a method:
| Check | How |
|---|---|
| Ulysses / Ring SP | Transformer class defines _sp_plan. Search: grep -r '_sp_plan' vllm_omni/diffusion/models/ |
| CFG Parallel | Pipeline or transformer inherits CFGParallelMixin. Search: grep -r 'CFGParallelMixin' vllm_omni/diffusion/models/ |
| TP | Transformer uses ColumnParallelLinear / RowParallelLinear / QKVParallelLinear. Search: grep -r 'ParallelLinear' vllm_omni/diffusion/models/<model>/ |
| Layerwise offload | Pipeline defines _layerwise_offload_blocks_attr. Search: grep -r '_layerwise_offload_blocks_attr' vllm_omni/diffusion/models/ |
| torch.compile | Transformer defines _repeated_blocks. Search: grep -r '_repeated_blocks' vllm_omni/diffusion/models/ |
| HSDP | Check DiffusionParallelConfig usage in docs and tests. |
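The grep checks in the table can be run in one pass. The script below is a convenience sketch rather than a vLLM-Omni tool: the marker strings come from the table, while `MARKERS` and `scan_capabilities` are hypothetical names. Like grep, a hit only shows that a file mentions the marker, so confirm against the docs before relying on it.

```python
from pathlib import Path
from typing import Dict, List

# Marker strings taken from the table above; keys are labels for the report.
MARKERS = {
    "sequence_parallel": "_sp_plan",
    "cfg_parallel": "CFGParallelMixin",
    "tensor_parallel": "ParallelLinear",
    "layerwise_offload": "_layerwise_offload_blocks_attr",
    "torch_compile": "_repeated_blocks",
}


def scan_capabilities(models_dir: str) -> Dict[str, List[str]]:
    """Map each capability to the .py files (relative paths) mentioning its marker."""
    root = Path(models_dir)
    found: Dict[str, List[str]] = {name: [] for name in MARKERS}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, marker in MARKERS.items():
            if marker in text:
                found[name].append(str(path.relative_to(root)))
    return found
```

Run it as `scan_capabilities("vllm_omni/diffusion/models")` from the repository root and look for your model's directory in each list.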
Cache acceleration support
- Excluded models: listed in _NO_CACHE_ACCELERATION in vllm_omni/diffusion/registry.py. Any pipeline class in that set does not support tea_cache or cache_dit.
- TeaCache supported models: check docs/user_guide/diffusion/teacache.md for the current list.
- Cache-DiT: all DiT-based models not in _NO_CACHE_ACCELERATION. See docs/user_guide/diffusion/cache_dit_acceleration.md.
Quantization support
- Available methods: listed in vllm_omni/diffusion/quantization/. Each .py file is a method (e.g., fp8.py, gguf.py).
- Config: OmniDiffusionConfig.quantization field in vllm_omni/diffusion/data.py.
- Docs: docs/user_guide/diffusion/quantization/
Available CLI flags (online serving)
Run vllm serve --help and look for --omni-related flags. Key flags:
--usp, --ring, --cfg-parallel-size, --tensor-parallel-size, --vae-patch-parallel-size,
--cache-backend, --quantization, --enforce-eager, --enable-cpu-offload,
--enable-layerwise-offload, --vae-use-slicing, --vae-use-tiling, --use-hsdp,
--enable-expert-parallel, --flow-shift, --boundary-ratio.
Quick Recipes
Recipes show both online serving (preferred for deployment) and offline variants.
Recipe A: Single GPU, lossless (Image model — online)
# Server
vllm serve Qwen/Qwen-Image --omni --port 8098 --quantization fp8
# Client
curl -X POST http://localhost:8098/v1/images/generations \
-F "prompt=A futuristic city at sunset" -F "seed=42"
Recipe B: Multi-GPU, lossless (Image model, 4 GPUs — online)
vllm serve Qwen/Qwen-Image --omni --port 8098 \
--cfg-parallel-size 2 --usp 2 --quantization fp8
Recipe C: Low VRAM, single GPU (Video model — online)
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8098 \
--enable-layerwise-offload --vae-use-slicing --vae-use-tiling
Recipe D: Multi-GPU, lossless (Video model, 8 GPUs — online)
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8098 \
--usp 4 --ring 2 --vae-patch-parallel-size 8 --quantization fp8
Recipe E: Lossy speedup with Cache-DiT (Image model — online)
vllm serve Qwen/Qwen-Image --omni --port 8098 \
--enforce-eager --cache-backend cache_dit
Recipe F: LTX-2 video baseline (online)
vllm serve Lightricks/LTX-2 --omni --port 8098 \
--enforce-eager --flow-shift 1.0 --boundary-ratio 1.0
Recipe G: LTX-2 video with Cache-DiT (~1.4× speedup, online)
vllm serve Lightricks/LTX-2 --omni --port 8098 \
--enforce-eager --flow-shift 1.0 --boundary-ratio 1.0 \
--cache-backend cache_dit
Offline equivalents
For quick local testing, replace vllm serve ... --omni with the offline scripts:
# Image
python examples/offline_inference/text_to_image/text_to_image.py \
--model Qwen/Qwen-Image --prompt "..." --quantization fp8
# Video
python examples/offline_inference/text_to_video/text_to_video.py \
--model Lightricks/LTX-2 --prompt "..." --enforce-eager
Decision Flowchart
Is output quality paramount?
├── YES → Use only Step 1 (lossless)
│   ├── Single GPU? → torch.compile (default) + FP8 quantization
│   ├── Multi-GPU? → Add SP/TP/CFG parallel (check support table)
│   └── OOM? → Enable CPU offload or VAE slicing/tiling
└── NO → Also apply Step 2 (lossy)
    ├── TeaCache supported? → Use tea_cache with rel_l1_thresh=0.2
    └── DiT model? → Use cache_dit with defaults
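The flowchart can also be written as a small function, which is handy when scripting recommendations. This is one reading of the chart, not an official API: `recommend` is a hypothetical name, it treats the multi-GPU branch as an addition to the single-GPU baseline, and it suggests CPU offload only on one GPU because both offload strategies are described above as single-GPU only.

```python
from typing import List


def recommend(
    quality_paramount: bool,
    num_gpus: int,
    oom: bool,
    teacache_supported: bool,
    is_dit: bool,
) -> List[str]:
    """Return an ordered list of optimizations following the flowchart."""
    plan = ["torch.compile (default) + --quantization fp8"]
    if num_gpus > 1:
        plan.append("SP/TP/CFG parallel (check support table)")
    if oom:
        if num_gpus == 1:
            plan.append("CPU offload or --vae-use-slicing / --vae-use-tiling")
        else:
            plan.append("--vae-use-slicing / --vae-use-tiling")
    if not quality_paramount:
        # Lossy step: prefer TeaCache where supported, else Cache-DiT for DiT models.
        if teacache_supported:
            plan.append("--cache-backend tea_cache (rel_l1_thresh=0.2)")
        elif is_dit:
            plan.append("--cache-backend cache_dit (defaults)")
    return plan
```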
Tips
- Benchmark in online serving mode for deployment decisions. Offline numbers may differ due to process startup and measurement methodology.
- Use --enforce-eager unless torch.compile measurably improves warm-request latency for your model/GPU. This avoids first-request compilation overhead.
- CFG parallel + Ulysses is usually better than pure Ulysses at the same GPU count for CFG models.
- Layerwise offload is nearly free for video models where per-block compute dwarfs H2D transfer time.
- Combine lossless + lossy: e.g., FP8 + Cache-DiT for maximum throughput.
- Check _NO_CACHE_ACCELERATION in registry.py before enabling cache backends. UNet-based and some specialized models don't support them.
- Send multiple requests when benchmarking online serving to measure warm (steady-state) latency rather than first-request startup.
Key Source Files
| File | What |
|---|---|
| vllm_omni/diffusion/data.py | OmniDiffusionConfig, DiffusionParallelConfig, DiffusionCacheConfig |
| vllm_omni/diffusion/compile.py | Regional torch.compile logic |
| vllm_omni/diffusion/registry.py | _NO_CACHE_ACCELERATION, model registry |
| vllm_omni/diffusion/distributed/cfg_parallel.py | CFGParallelMixin |
| vllm_omni/diffusion/cache/ | TeaCache and CacheDiT backends |
| vllm_omni/diffusion/offloader/ | CPU offload backends |
| vllm_omni/diffusion/quantization/ | Quantization backends (fp8, gguf, ...) |
| docs/user_guide/diffusion/ | All user-facing docs |
| docs/user_guide/diffusion/parallelism_acceleration.md | Canonical parallelism support table |
Extending This Skill
When a new optimization method is added to vLLM-Omni, update this skill as follows:
Adding a new lossless method
- Add a subsection under Step 1 with: What / How / Speedup / Requirements / Source.
- Update the Discovering Current Capabilities section with a "How to check" row (e.g., what attribute or class to grep for to confirm model support).
- Add a recipe under Quick Recipes if the method is broadly useful.
- Update the Decision Flowchart if the method creates a new decision branch.
- Add the key source file to the Key Source Files table.
Adding a new lossy method
- Add a subsection under Step 2 with: What / How / Speedup / Quality impact / Config.
- Note which models are excluded (if any) and where exclusion is tracked in code (e.g., a set in registry.py).
- Update the Decision Flowchart.
Adding a new quantization method
- Add under Step 1 → 1.5 Quantization with: What / How / Requirements / VRAM savings.
- Note the source file in vllm_omni/diffusion/quantization/.
Adding a new parallelism method
- Add under Step 1 → 1.2 Multi-GPU Parallelism with: What / How / Constraint / CLI flag.
- Add a grep instruction to the Discovering Current Capabilities table (e.g., "Transformer defines _new_attr").
- Update docs/user_guide/diffusion/parallelism_acceleration.md with the new column in the support table.
General guidelines
- Prefer "check the code" instructions over static tables. Tables go stale; grep commands don't.
- Always include the CLI flag for both offline (--flag-name) and online serving (--flag-name via vllm serve). Online serving flags sometimes differ (e.g., --ulysses-degree offline vs --usp online).
- Include a Source pointer so developers can find the implementation.
- After updating the skill, test it by asking the agent to optimize a model and verify it discovers the new method correctly.