name: quantization description: Work on vLLM-Omni quantization for diffusion, autoregressive, omni, or multi-stage models. Use when choosing or adding methods such as fp8, int8, gguf, mxfp8, mxfp4, mxfp4_dualscale, ModelOpt, AutoRound, INC, msModelSlim, awq, or gptq; debugging quantized loading; or validating memory, speed, and output quality.
vLLM-Omni Quantization
Use this skill for quantization work in vllm-omni. Start from the local
code and docs, then pick the smallest validated path for the model, method,
hardware, and quality target.
First Checks
Before changing code or recommending a command, identify:
- Model and task: AR text, omni/TTS, text-to-image, text-to-video, or multi-stage diffusion.
- Hardware backend: CUDA generation, Ascend NPU, Intel XPU, or another platform.
- Quantization mode: online from BF16/FP16 weights, pre-quantized checkpoint, runtime KV/FA quantization, or per-component routing.
- Scope: transformer/DiT, thinker language model, text encoder, VAE, audio/vision encoder, talker, or a specific stage.
- Validation target: memory reduction, latency/throughput, output quality, or all three.
Do not assume a method is supported just because the CLI accepts the string.
Check docs/user_guide/quantization/ and the closest implementation first.
Quick Decision
| Task | Start With |
|---|---|
| Choose a method or command | references/methods.md and references/modality-compat.md |
Use build_quant_config() or per-component routing |
references/methods.md |
| Work on diffusion quantization | references/diffusion.md |
| Add quantization to a new model | references/adding-models.md |
| Convert or load ModelOpt FP8 checkpoints | references/modelopt-fp8.md |
| Debug ModelOpt, GGUF, AutoRound, MXFP, or serialized Int8 loading | references/diffusion.md and method docs under docs/user_guide/quantization/ |
Current Runtime Shape
The unified entrypoint is:
from vllm_omni.quantization import build_quant_config
It supports method strings, flat method dictionaries, per-component
dictionaries, existing QuantizationConfig objects, and None. The factory
delegates generic methods to upstream vllm and keeps vLLM-Omni overrides for
diffusion or omni-specific routing:
ggufint8mxfp8mxfp4mxfp4_dualscaleinc,auto-round,auto_round- ModelOpt auto-detected configs:
modelopt,modelopt_fp4,modelopt_mixed
Use ComponentQuantizationConfig when only one stage or component should be
quantized. Pre-quantized ModelOpt-style checkpoints should not spill into
vision/audio encoders that have no corresponding scale tensors.
Ownership Boundary
- Upstream
vllmowns generic quantization configs, kernels, loader semantics, hardware capability rules, and generic AR quantization methods. vllm-omniowns unified config routing, diffusion-specific wrappers, component scoping, GGUF or ModelOpt checkpoint adapters, model-specific prefix mapping, docs, examples, and validation.
If a new method needs missing generic kernels or loader behavior, fix upstream
vllm first. In vllm-omni, add thin integration and model wiring.
Working Loop
- Read the local method doc under
docs/user_guide/quantization/. - Inspect the active factory and config class in
vllm_omni/quantization/. - Inspect the target model pipeline and transformer for stable prefixes and
whether every relevant vLLM linear layer receives
quant_config. - For pre-quantized checkpoints, inspect
transformer/config.jsonand the checkpoint tensor names before running full generation. - Validate with the same prompt, seed, size, scheduler, steps, dtype, eager or non-eager mode, and parallelism as the BF16 baseline.
- Report quality, latency, and memory separately. Do not call a method supported until quality is checked.
Common Mistakes
| Symptom | Likely Cause | Fix |
|---|---|---|
--quantization has no visible effect |
Wrong scope or unsupported model path | Check component routing and method docs |
| Some layers stay BF16 unexpectedly | quant_config was not threaded into all vLLM linear layers |
Audit transformer constructors and prefixes |
| Quality collapses but loading succeeds | Too many sensitive layers were quantized | Add model-specific ignored_layers and compare to BF16 |
| ModelOpt checkpoint loads but output is corrupted | Prefixes, packed-module mapping, scale routing, or BF16 fallback is wrong | Use references/modelopt-fp8.md |
| GGUF shape or tensor mismatch | Missing architecture-specific adapter | Add explicit adapter mapping; avoid generic fallback |
| Online and offline MXFP4 disagree | Wrong mxfp4 vs mxfp4_dualscale mode |
Check docs/user_guide/quantization/mxfp4.md |
| Quantized path is slower than BF16 | Kernel path, compile mode, dtype, or shape mismatch | Compare eager-to-eager and non-eager-to-non-eager |
References
- General methods and config routing: references/methods.md
- Modality and model scope: references/modality-compat.md
- Diffusion implementation patterns: references/diffusion.md
- Adding model support: references/adding-models.md
- ModelOpt FP8 conversion and loading: references/modelopt-fp8.md