vllm-omni-perf - SKILL.md Agent Skill

name: vllm-omni-perf
description: Optimize vLLM-Omni performance through benchmarking, TeaCache, Cache-DiT, quantization, CPU offloading, and parallelism tuning. Use when improving inference speed, reducing latency, lowering memory usage, running benchmarks, or enabling diffusion acceleration.

vLLM-Omni Performance Tuning

Overview

vLLM-Omni provides multiple optimization levers for both autoregressive and diffusion pipelines. Key techniques include KV cache optimization (inherited from vLLM), TeaCache/Cache-DiT for diffusion acceleration, quantization, CPU offloading, and parallelism configuration.

Optimization Quick Reference

Technique	Applies To	Speedup	Quality Impact
TeaCache	Diffusion models	1.5-2.0x	Minimal
Cache-DiT	Diffusion models	1.3-1.8x	Minimal
Quantization	All models	1.2-1.5x	Slight
Tensor Parallelism	All models	Near-linear	None
Sequence Parallelism	DiT models	Near-linear	None
CPU Offloading	All models	Enables larger models	Adds latency
GPU Memory Tuning	All models	More throughput	None
Multi-Thread Weight Loading	Diffusion models	Faster startup	None

TeaCache (Diffusion Acceleration)

TeaCache provides adaptive caching for diffusion transformer denoising steps, skipping redundant computations:

vllm serve <model> --omni \
  --enable-teacache \
  --teacache-threshold 0.1

Parameter	Description	Default
`--enable-teacache`	Enable TeaCache	Disabled
`--teacache-threshold`	Relative-change threshold below which a denoising step reuses cached output (higher = more caching and more speedup, at greater quality risk)	Model-specific

Recommended thresholds by model:

Image models: 0.05-0.15
Video models: 0.08-0.20
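To make the role of the threshold concrete, here is a minimal sketch of a TeaCache-style skip decision. It assumes the per-step relative changes are already known; the published TeaCache method estimates them from timestep-modulated inputs and rescales them, which is omitted here. The function name `plan_steps` and the input values are illustrative only and are not part of vLLM-Omni.

```python
def plan_steps(rel_changes, threshold):
    """Decide, per denoising step, whether to recompute or reuse the cache.

    rel_changes[i] is an assumed, precomputed relative change of the model
    input between step i-1 and step i (illustrative values, not measured).
    """
    decisions = []
    accumulated = 0.0
    last = len(rel_changes) - 1
    for i, change in enumerate(rel_changes):
        if i == 0 or i == last:
            # The first and last steps are always computed.
            decisions.append("compute")
            accumulated = 0.0
            continue
        accumulated += change
        if accumulated < threshold:
            # Input has drifted little since the last full step: reuse.
            decisions.append("reuse")
        else:
            # Drift exceeded the threshold: recompute and reset the counter.
            decisions.append("compute")
            accumulated = 0.0
    return decisions


changes = [0.0, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.0]
for threshold in (0.05, 0.1):
    plan = plan_steps(changes, threshold)
    print(threshold, plan.count("reuse"), "of", len(plan), "steps reused")
```

In this toy run the larger threshold reuses more steps, which is why a higher value trades quality for speed.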

Cache-DiT

Alternative diffusion acceleration backend:

vllm serve <model> --omni --enable-cache-dit

Cache-DiT can be combined with TeaCache, but test each one independently first to measure its individual impact.

Supported models: FLUX.2-dev, Helios-Distilled, Wan2.2, and others using ForwardPattern.Pattern_2. Helios achieves ~20% speedup with cache-dit.

TeaCache and CPU offload hooks are compatible: use them together with --enable-teacache --enable-cpu-offload (or --cpu-offload-gb). The HookRegistry sorts hooks alphabetically and ensures that the forward-overriding hook (TeaCache) runs last in the pre-process chain. Only one forward-overriding hook is allowed at a time.
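The ordering rule described above can be sketched as follows. This is an illustration of the stated behavior, not the actual HookRegistry code: the `Hook` dataclass, the `order_hooks` function, and the hook names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hook:
    # Hypothetical stand-in for a registered hook.
    name: str
    overrides_forward: bool = False


def order_hooks(hooks):
    """Sort hooks alphabetically, placing the forward-overriding hook last."""
    overriding = [h for h in hooks if h.overrides_forward]
    if len(overriding) > 1:
        names = ", ".join(h.name for h in overriding)
        raise ValueError(f"only one forward-overriding hook allowed, got: {names}")
    regular = sorted(
        (h for h in hooks if not h.overrides_forward), key=lambda h: h.name
    )
    return regular + overriding


chain = order_hooks([
    Hook("teacache", overrides_forward=True),
    Hook("zz_profiler"),  # hypothetical hook that sorts after "teacache"
    Hook("cpu_offload"),
])
print([h.name for h in chain])
```

The forward-overriding hook ends up last even when another hook's name sorts after it alphabetically.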

Quantization

For full quantization guidance (method selection, AWQ/GPTQ workflows, FP8 KV cache, quality verification), see the dedicated vllm-omni-quantization skill.

Multi-Thread Weight Loading

Diffusion models (Qwen-Image, Wan2.2, FLUX, HunyuanImage3.0, etc.) load safetensors shards in parallel using a thread pool instead of sequentially. This is enabled by default and significantly reduces cold-start time:

Qwen-Image: ~3 min -> substantially faster
Wan2.2-I2V 14B: ~5 min -> substantially faster

No configuration needed -- this is automatic for all diffusion models using safetensors format.
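For intuition, parallel shard loading with a thread pool looks roughly like the sketch below. It is not vLLM-Omni's loader: `load_shard` is a stand-in that reads JSON files so the example runs without extra dependencies, whereas a real loader would read safetensors files, and the function names and worker count are assumptions.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_shard(path):
    """Stand-in loader: read one JSON file mapping tensor names to values."""
    return json.loads(Path(path).read_text())


def load_shards_parallel(paths, max_workers=4, loader=load_shard):
    """Load all shards concurrently and merge them into one state dict."""
    state = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so merging is deterministic.
        for shard in pool.map(loader, paths):
            state.update(shard)
    return state
```

Threads help here because shard loading is dominated by file I/O, during which Python releases the global interpreter lock.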

CPU Offloading

Offload model layers to CPU RAM to fit larger models:

Model-Level Offloading

vllm serve <model> --omni --cpu-offload-gb 10

This offloads approximately 10 GB of model weights to CPU RAM. Offloaded layers add latency because their weights must be moved back to the GPU when they are used.

Layer-Wise Offloading

For diffusion models, layer-wise offloading moves individual transformer layers to CPU between forward passes:

vllm serve <model> --omni --enable-layerwise-cpu-offload

When multiple DiT transformers exist in a pipeline (e.g., Wan2.2-T2V's transformer + transformer-2), the sequential offloader applies mutual exclusion: only one DiT is loaded on GPU at a time, and all others are offloaded to CPU along with encoders. This prevents OOM on memory-constrained GPUs (64 GB).
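The mutual-exclusion rule can be illustrated with a small bookkeeping sketch. It only tracks device placement; a real offloader would move module weights between devices. The class name `SequentialOffloader` and the component names below are hypothetical.

```python
class SequentialOffloader:
    """Track placement so that at most one component is on the GPU."""

    def __init__(self, module_names):
        # Everything starts offloaded to CPU.
        self.placement = {name: "cpu" for name in module_names}

    def activate(self, name):
        """Move `name` to the GPU and offload every other component."""
        if name not in self.placement:
            raise KeyError(name)
        for other in self.placement:
            self.placement[other] = "cuda" if other == name else "cpu"

    def on_gpu(self):
        return [n for n, device in self.placement.items() if device == "cuda"]


offloader = SequentialOffloader(["text_encoder", "transformer", "transformer_2"])
offloader.activate("transformer")
print(offloader.on_gpu())
offloader.activate("transformer_2")
print(offloader.on_gpu())
```

Peak GPU usage is then bounded by the largest single component rather than by the sum of all of them.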

GPU Memory Configuration

Maximize throughput by tuning GPU memory allocation:

# Default: 90% of GPU memory
vllm serve <model> --omni --gpu-memory-utilization 0.9

# Conservative: 80% (leaves room for other processes)
vllm serve <model> --omni --gpu-memory-utilization 0.8

# Aggressive: 95%
vllm serve <model> --omni --gpu-memory-utilization 0.95
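As a quick sanity check on these settings, the arithmetic below converts a utilization fraction into a memory budget and the headroom left for other processes. The 80 GB device size is an illustrative assumption, and `memory_budget` is a hypothetical helper.

```python
def memory_budget(total_gb, utilization):
    """Return (budget_gb, headroom_gb) for a given utilization fraction."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    budget = total_gb * utilization
    return budget, total_gb - budget


for fraction in (0.8, 0.9, 0.95):
    budget, headroom = memory_budget(80, fraction)  # assumed 80 GB GPU
    print(f"{fraction:.2f}: budget {budget:.1f} GB, headroom {headroom:.1f} GB")
```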

Benchmarking

Quick Benchmark

python -m vllm_omni.benchmarks.benchmark_serving \
  --model Tongyi-MAI/Z-Image-Turbo \
  --num-prompts 100 \
  --port 8091

Measuring Latency

Time a single request:

time curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "a red circle"}],
    "extra_body": {"height": 512, "width": 512, "num_inference_steps": 20}
  }' > /dev/null
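A single timing is noisy, so it can help to repeat the request and summarize the results. The sketch below uses only the Python standard library and assumes the same endpoint and payload shape as the curl command above. The helper names, warmup count, and run count are illustrative choices.

```python
import json
import statistics
import time
import urllib.request


def time_request(url, payload, timeout=300):
    """POST a JSON payload and return the wall-clock latency in seconds."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # include the time to receive the full body
    return time.perf_counter() - start


def summarize(latencies):
    """Summarize a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


def measure(url, payload, runs=5, warmup=1):
    """Discard warmup requests, then time `runs` requests and summarize."""
    for _ in range(warmup):
        time_request(url, payload)
    return summarize([time_request(url, payload) for _ in range(runs)])
```

With a server running, `measure("http://localhost:8091/v1/chat/completions", payload)` returns the summary for five timed requests, where `payload` is the same JSON object passed to curl.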

Monitoring During Benchmark

# GPU utilization
watch -n 1 nvidia-smi

# Server metrics
curl http://localhost:8091/metrics
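To compare metrics before and after a benchmark, the endpoint's output can be parsed into a dictionary. This sketch assumes the endpoint returns the Prometheus text exposition format, as upstream vLLM's /metrics does. The metric names in the sample are made up for illustration and are not real vLLM-Omni metrics.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Keep the label set attached to the metric name.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            # The first field after the series is the value; an optional
            # timestamp may follow and is ignored here.
            samples[series] = float(rest[0])
    return samples


sample = """\
# HELP demo_requests_total Illustrative counter (hypothetical name)
# TYPE demo_requests_total counter
demo_requests_total{model="m"} 12
demo_gpu_cache_usage 0.25
"""
print(parse_metrics(sample))
```

Taking one snapshot before and one after a run, then subtracting counter values, gives the per-run deltas.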

Optimization Workflow

1. Baseline: Run benchmark with default settings
2. Memory: Tune --gpu-memory-utilization to maximize without OOM
3. Parallelism: Add tensor parallelism if multi-GPU available
4. Caching: Enable TeaCache or Cache-DiT for diffusion models
5. Quantization: Apply if memory-constrained
6. Offloading: Use CPU offloading as last resort for large models
7. Re-benchmark: Compare against baseline
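The final comparison step can be as simple as dividing the baseline latency by each candidate's latency. The sketch below is a hypothetical helper, and the latencies are placeholder inputs rather than measured results.

```python
def compare(baseline_s, candidates):
    """Return (name, speedup) pairs sorted from fastest to slowest.

    baseline_s: mean latency of the default configuration, in seconds.
    candidates: mapping of configuration name to mean latency in seconds.
    """
    results = [(name, baseline_s / latency) for name, latency in candidates.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only.
for name, speedup in compare(10.0, {"tp2": 8.0, "teacache": 5.0}):
    status = "faster" if speedup > 1.0 else "regression"
    print(f"{name}: {speedup:.2f}x ({status})")
```

A speedup below 1.0 marks a regression against the baseline, so that change should be reverted before trying the next step.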

Troubleshooting

No speedup with TeaCache: The threshold may be too conservative, so too few steps are served from the cache. Raise it gradually (e.g., from 0.05 toward 0.15) and check quality after each change.

OOM after optimization: Apply quantization to shrink the weight footprint, and lower --gpu-memory-utilization to leave more headroom.

Latency regression with TP: For small models, the communication overhead of tensor parallelism may exceed the compute savings. Use TP only for models that saturate a single GPU.
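A toy cost model shows why this happens. It assumes compute divides evenly across GPUs and that each step pays a fixed communication cost once more than one GPU is involved. Real overheads depend on the interconnect and model shape, and all numbers below are illustrative.

```python
def tp_step_time(compute_s, tp_size, comm_s):
    """Estimated per-step time under tensor parallelism (toy model)."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    if tp_size == 1:
        return compute_s  # single GPU: no communication
    return compute_s / tp_size + comm_s


# Small model: communication outweighs the compute saved.
print(tp_step_time(0.004, 1, 0.003), tp_step_time(0.004, 2, 0.003))
# Large model: the compute savings dominate.
print(tp_step_time(0.100, 1, 0.003), tp_step_time(0.100, 2, 0.003))
```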

References

For TeaCache configuration details, see references/teacache.md
For quantization methods and compatibility, see references/quantization.md