name: vllm-omni-quantization
description: Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as awq, gptq, fp8, int8, gguf, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.
vLLM-Omni Quantization
Overview
Use this skill for vllm-omni quantization work. The current codebase has a unified quantization framework centered on vllm_omni.quantization.build_quant_config(), but the runtime integration still splits into distinct patterns:
- AR and general quantization inherited from upstream vllm
- diffusion quantization for DiT models in vllm-omni, currently fp8, int8, and gguf
- multi-stage omni quantization using scoped pre-quantized checkpoints such as Qwen3-Omni thinker ModelOpt checkpoints
Core principle: keep generic quantization infrastructure in upstream vllm. Keep vllm-omni focused on unified config routing, component scoping, diffusion-specific model wiring, adapter logic, and verification.
Quick Decision
| Task | Use |
|---|---|
| Quantize Qwen-Omni, Qwen-TTS, or another AR-backed model | references/methods.md and references/modality-compat.md |
| Use build_quant_config() or per-component quantization | references/methods.md |
| Quantize diffusion transformer weights with fp8, int8, or gguf | references/diffusion.md |
| Add quantization support to a new diffusion model | references/adding-models.md |
| Add a new quantization method such as nvfp4 or a new ModelOpt path | references/diffusion.md and references/adding-models.md |
| Unsure whether a change belongs in vllm or vllm-omni | references/diffusion.md |
When to Use
- Choosing a quantization method for memory or throughput
- Checking whether a modality or model family actually supports quantization
- Using the unified build_quant_config() entrypoint or per-component config dicts
- Enabling diffusion fp8, int8, or gguf
- Adding a new diffusion quantization method or a pre-quantized multi-stage model path
- Debugging quantized loading, tensor-name mapping, shape mismatch, quality drift, or performance regressions
AR vs Diffusion Boundary
- AR and general quantization usually mean upstream vllm methods such as awq, gptq, fp8, and KV-cache FP8.
- Diffusion quantization means vllm-omni DiT-specific integration on top of the unified framework and should not duplicate upstream vllm kernels or config semantics.
- Multi-stage omni quantization often means pre-quantized checkpoints whose scope must be constrained to the intended component, such as the thinker language_model.
Rule: if a new method is missing generic kernels, loader behavior, or config classes, fix upstream vllm first. vllm-omni should add thin wrappers, component routing, and model-specific wiring, not a private quantization stack.
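The scoping idea above can be illustrated with a small, self-contained sketch. It is not the vllm-omni implementation: ScopedQuantRule, resolve_method, and the module names are hypothetical and exist only to show how a pre-quantized checkpoint's method might be constrained to one component prefix while other stages stay unquantized.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ScopedQuantRule:
    """Hypothetical rule: apply `method` only to modules under `prefix`."""
    prefix: str
    method: str
    # Substrings of module names that stay unquantized inside the scope.
    ignored_layers: Tuple[str, ...] = ()


def _in_scope(module_name: str, prefix: str) -> bool:
    # Match on whole dotted path segments so "a.b" does not match "a.bc".
    return module_name == prefix or module_name.startswith(prefix + ".")


def resolve_method(module_name: str, rules: Sequence[ScopedQuantRule]) -> Optional[str]:
    """Return the method for a module, or None to leave it unquantized."""
    matches = [r for r in rules if _in_scope(module_name, r.prefix)]
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    rule = max(matches, key=lambda r: len(r.prefix))
    if any(part in module_name for part in rule.ignored_layers):
        return None
    return rule.method


# Illustrative module names; real names depend on the model definition.
rules = [ScopedQuantRule("thinker.language_model", "modelopt", ("lm_head",))]
for name in (
    "thinker.language_model.layers.0.mlp.down_proj",
    "thinker.language_model.lm_head",
    "talker.layers.0.mlp.down_proj",
):
    print(name, "->", resolve_method(name, rules))
```

Matching on whole path segments is the design point here: a plain string-prefix check would also capture sibling modules whose names merely begin with the same characters, which is one way the wrong stage ends up quantized.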
Example: Enable FP8 for a Diffusion Model
# 1. Start server with fp8 quantization (port must match the curl calls below)
vllm serve black-forest-labs/FLUX.1-dev --omni \
--quantization fp8 --tensor-parallel-size 2 --port 8091
# 2. Verify quantized model loaded correctly
curl -s http://localhost:8091/v1/models | python3 -c "
import sys, json
models = json.load(sys.stdin)['data']
print(f'Loaded: {models[0][\"id\"]}') if models else print('ERROR: No models loaded')
"
# 3. Generate test image with fixed seed for quality comparison
curl -s http://localhost:8091/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{"prompt":"a cat","seed":42}' -o test_fp8.json
# 4. Compare against BF16 baseline (run separately without --quantization)
# If PSNR < 25 dB or visual artifacts appear, check references/diffusion.md
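The PSNR check in step 4 can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming both images have already been decoded into flat sequences of 8-bit pixel values of equal length; decoding the images from the saved responses is omitted because the response layout is not described here. The psnr and passes_quality_gate helpers are hypothetical names.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], candidate: Sequence[int], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixel values")
    if len(reference) == 0:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_quality_gate(reference: Sequence[int], candidate: Sequence[int],
                        threshold_db: float = 25.0) -> bool:
    """True when the quantized output stays at or above the PSNR threshold."""
    return psnr(reference, candidate) >= threshold_db


# Toy data standing in for BF16 and fp8 outputs generated with the same seed.
bf16_pixels = [10, 20, 30, 40]
fp8_pixels = [11, 21, 29, 41]
print(f"PSNR: {psnr(bf16_pixels, fp8_pixels):.2f} dB")
print("gate passed:", passes_quality_gate(bf16_pixels, fp8_pixels))
```

PSNR from a single prompt is a coarse signal; comparing several fixed-seed prompts gives a more reliable picture of quality drift.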
Common Mistakes
| Symptom | Likely Cause | Fix |
|---|---|---|
| quantization flag has no visible effect | wrong model stage or unsupported modality | check references/modality-compat.md |
| unified config behaves unexpectedly | per-component dict, default routing, or method override is misunderstood | check references/methods.md and references/diffusion.md |
| AR model quality drops too much | aggressive 4-bit setup or wrong method | check calibration and method tradeoffs in references/methods.md |
| diffusion method works on one image only | no baseline comparison, no LPIPS gate, or no ignored_layers tuning | use the verification flow in references/diffusion.md and references/adding-models.md |
| GGUF mapping fails | missing architecture-specific adapter | add explicit adapter logic, do not rely on generic fallback |
| new quantization method design keeps growing | unified framework boundary is unclear | re-check ownership before touching model code |
| multi-stage omni checkpoint loads but wrong stages get quantized | component scope is not constrained correctly | check component routing and model config normalization |
| diffusion FA uses wrong KV-cache dtype when --kv-cache-dtype is passed | vLLM's --kv-cache-dtype leaked into diffusion config | use --diffusion-kv-cache-dtype instead (#3596) |
| offline text_to_image fails | quantization config conflicts with diffusion pipeline init | fixed in #1515, update vllm-omni |
| OmniDiffusion init crash | pipeline_class variable not initialized during quantized load | fixed in #1562, update vllm-omni |
| HunyuanImage3.0 load_weights error | weight loading fails with quantized HunyuanImage3.0 | fixed in #1598, update vllm-omni |
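For the GGUF mapping row above, the "explicit adapter, no generic fallback" advice can be sketched as follows. This is an illustration under stated assumptions, not the vllm-omni adapter interface: map_gguf_names, UnmappedTensorError, and the tensor names are hypothetical. The point is that unmapped tensors fail loudly at load time instead of being guessed.

```python
from typing import Dict, Iterable, List


class UnmappedTensorError(KeyError):
    """Raised when a GGUF tensor has no explicit target name."""


def map_gguf_names(gguf_names: Iterable[str], name_map: Dict[str, str]) -> Dict[str, str]:
    """Translate GGUF tensor names to model parameter names, failing on gaps."""
    mapped: Dict[str, str] = {}
    missing: List[str] = []
    for name in gguf_names:
        target = name_map.get(name)
        if target is None:
            missing.append(name)  # collect all gaps so one run reports everything
        else:
            mapped[name] = target
    if missing:
        raise UnmappedTensorError(f"no explicit mapping for: {sorted(missing)}")
    return mapped


# Illustrative names only; a real adapter defines these per architecture.
NAME_MAP = {
    "blk.0.attn_q.weight": "transformer_blocks.0.attn.to_q.weight",
    "blk.0.attn_k.weight": "transformer_blocks.0.attn.to_k.weight",
}

print(map_gguf_names(["blk.0.attn_q.weight"], NAME_MAP))
try:
    map_gguf_names(["blk.0.attn_q.weight", "blk.0.unknown.weight"], NAME_MAP)
except UnmappedTensorError as exc:
    print("load aborted:", exc)
```

Collecting every missing name before raising keeps the debugging loop short when a new architecture introduces several unmapped tensors at once.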
References
- AR and general methods: references/methods.md
- Model and modality support matrix: references/modality-compat.md
- Diffusion fp8, int8, gguf, and unified-framework workflow: references/diffusion.md
- Adding quantization support to a new model: references/adding-models.md