vllm-omni-quantization

star 76

Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as `awq`, `gptq`, `fp8`, `int8`, `gguf`, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.

hsliuustc0106 By hsliuustc0106 schedule Updated 5/24/2026

name: vllm-omni-quantization description: Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as awq, gptq, fp8, int8, gguf, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.

vLLM-Omni Quantization

Overview

Use this skill for vllm-omni quantization work. The current codebase has a unified quantization framework centered on vllm_omni.quantization.build_quant_config(), but the runtime integration still splits into distinct patterns:

  • AR and general quantization inherited from upstream vllm
  • diffusion quantization for DiT models in vllm-omni, currently fp8, int8, and gguf
  • multi-stage omni quantization using scoped pre-quantized checkpoints such as Qwen3-Omni thinker ModelOpt checkpoints

Core principle: keep generic quantization infrastructure in upstream vllm. Keep vllm-omni focused on unified config routing, component scoping, diffusion-specific model wiring, adapter logic, and verification.

Quick Decision

Task Use
Quantize Qwen-Omni, Qwen-TTS, or another AR-backed model references/methods.md and references/modality-compat.md
Use build_quant_config() or per-component quantization references/methods.md
Quantize diffusion transformer weights with fp8, int8, or gguf references/diffusion.md
Add quantization support to a new diffusion model references/adding-models.md
Add a new quantization method such as nvfp4 or a new ModelOpt path references/diffusion.md and references/adding-models.md
Unsure whether a change belongs in vllm or vllm-omni references/diffusion.md

When to Use

  • Choosing a quantization method for memory or throughput
  • Checking whether a modality or model family actually supports quantization
  • Using the unified build_quant_config() entrypoint or per-component config dicts
  • Enabling diffusion fp8, int8, or gguf
  • Adding a new diffusion quantization method or a pre-quantized multi-stage model path
  • Debugging quantized loading, tensor-name mapping, shape mismatch, quality drift, or performance regressions

AR vs Diffusion Boundary

  • AR and general quantization usually mean upstream vllm methods such as awq, gptq, fp8, and KV-cache FP8.
  • Diffusion quantization means vllm-omni DiT-specific integration on top of the unified framework and should not duplicate upstream vllm kernels or config semantics.
  • Multi-stage omni quantization often means pre-quantized checkpoints whose scope must be constrained to the intended component, such as the thinker language_model.

Rule: if a new method is missing generic kernels, loader behavior, or config classes, fix upstream vllm first. vllm-omni should add thin wrappers, component routing, and model-specific wiring, not a private quantization stack.

Example: Enable FP8 for a Diffusion Model

# 1. Start server with fp8 quantization
vllm serve black-forest-labs/FLUX.1-dev --omni \
  --quantization fp8 --tensor-parallel-size 2

# 2. Verify quantized model loaded correctly
curl -s http://localhost:8091/v1/models | python3 -c "
import sys, json
models = json.load(sys.stdin)['data']
print(f'Loaded: {models[0][\"id\"]}') if models else print('ERROR: No models loaded')
"

# 3. Generate test image with fixed seed for quality comparison
curl -s http://localhost:8091/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a cat","seed":42}' -o test_fp8.json

# 4. Compare against BF16 baseline (run separately without --quantization)
# If PSNR < 25 dB or visual artifacts appear, check references/diffusion.md

Common Mistakes

Symptom Likely Cause Fix
quantization flag has no visible effect wrong model stage or unsupported modality check references/modality-compat.md
unified config behaves unexpectedly per-component dict, default routing, or method override is misunderstood check references/methods.md and references/diffusion.md
AR model quality drops too much aggressive 4-bit setup or wrong method check calibration and method tradeoffs in references/methods.md
diffusion method works on one image only no baseline comparison, no LPIPS gate, or no ignored_layers tuning use the verification flow in references/diffusion.md and references/adding-models.md
GGUF mapping fails missing architecture-specific adapter add explicit adapter logic, do not rely on generic fallback
new quantization method design keeps growing unified framework boundary is unclear re-check ownership before touching model code
multi-stage omni checkpoint loads but wrong stages get quantized component scope is not constrained correctly check component routing and model config normalization
diffusion FA uses wrong KV-cache dtype when --kv-cache-dtype is passed vLLM's --kv-cache-dtype leaked into diffusion config use --diffusion-kv-cache-dtype instead (#3596)
offline text_to_image fails quantization config conflicts with diffusion pipeline init fixed in #1515, update vllm-omni
OmniDiffusion init crash pipeline_class variable not initialized during quantized load fixed in #1562, update vllm-omni
HunyuanImage3.0 load_weights error weight loading fails with quantized HunyuanImage3.0 fixed in #1598, update vllm-omni

References

Install via CLI
npx skills add https://github.com/hsliuustc0106/vllm-omni-skills --skill vllm-omni-quantization
Repository Details
star Stars 76
call_split Forks 24
navigation Branch main
article Path SKILL.md
More from Creator
hsliuustc0106
hsliuustc0106 Explore all skills →