vllm-quantization

name: vllm-quantization
allowed-tools: Bash, Read, Write, Edit, Grep, Glob
description: |-
  vLLM datacenter-GPU quantization: picking, configuring, troubleshooting NVFP4, FP8, MXFP4, MXFP8, AWQ, GPTQ, INT8, compressed-tensors, modelopt, quark on H100/H200/B200/B300/GB200/GB300.
  29 `--quantization` flag values, KV-cache dtypes (fp8_e4m3, nvfp4, per-token-head, turboquant), MoE backend selection (CUTLASS, TRTLLM, FlashInfer, DeepGEMM, Marlin, Qutlass), producing checkpoints with llm-compressor and NVIDIA ModelOpt (NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG, W4A16, SmoothQuant+GPTQ), online quantization (`fp8_per_tensor`, `fp8_per_block`), training EAGLE-3/dflash drafters on BF16 targets before PTQ, version gates per vLLM release (v0.14 → v0.21).
when_to_use: |-
  Trigger on `--quantization`, `--kv-cache-dtype`, NVFP4, MXFP4, MXFP8, FP8, W4A16, W8A8, W4A4, AWQ, GPTQ, SmoothQuant, modelopt, compressed-tensors, quark, torchao, bitsandbytes, gguf, TurboQuant, CUTLASS, Marlin, FlashInfer, TRTLLM, DeepGEMM, Qutlass, Machete, `hf_quant_config.json`, `kv_cache_scheme`, `NVFP4_DEFAULT_CFG`, `FP8_DEFAULT_CFG`, llm-compressor, ModelOpt.
  Symptoms: "garbage after FP8", "NVFP4 NaN", "FP8 KV multi-turn corruption", "MoE kernel not dispatched on SM120", "illegal memory access awq_marlin", "online FP8 drops bias", "modelopt checkpoint won't load".
  Decisions: NVFP4 vs FP8 on H200 vs B200, quantizing EAGLE-3/dflash drafters, generating a checkpoint vLLM can load.
  Also implicit: "quantize {model}", "pick quant for {model}", "audit quantization", "deploy-memo quant", "which quant fits {GPU}", "spec-study quantization".

vLLM quantization — operator skill

Last verified: 2026-04-24 — see references/sources.md for per-ref audit table.

This skill is for production vLLM operators on H100 / H200 / B200 / B300 / GB200 / GB300 fleets who need to decide which quantization format fits a given target model, produce a checkpoint vLLM will actually load, wire the right KV-cache dtype, diagnose accuracy or throughput regressions after an upgrade, and compose quantization with speculative decoding, LoRA, or MoE.

Pointer-map format: this SKILL.md picks the format and CLI; the files in references/ hold the per-format deep dives, exact source pointers, and troubleshooting cards. Follow the link rather than paraphrasing from memory: the quantization layer moves faster than any other subsystem in vLLM (six formats landed in v0.19 alone).

When quantization wins, when it doesn't

Quantization trades weight precision for memory + compute:

KV-capacity bound (long context, high concurrency): relative to a BF16 cache, an FP8 KV cache gives roughly 2× KV capacity and an NVFP4 KV cache roughly 4× (NVFP4 KV is still listed as roadmap under KV-cache dtypes). Weight format matters much less than getting --kv-cache-dtype right. Measure kv_cache_usage_perc.
Memory-bandwidth bound (small batch, decode-heavy, 70B+ on < 8 GPUs): weight quantization (NVFP4 / FP8 / W4A16) reduces HBM traffic per token, giving 1.5–3× decode throughput when the target model and kernel are well matched.
Compute bound (prefill, large batch, small model): quantization may not help. Blackwell, with its FP4 Tensor Cores, is the first architecture where W4A4 actually beats FP8 in compute-bound regimes. On Hopper, W4A16 saves memory only; the MMA still runs in FP16.
Multi-node EP / disaggregated serving: NVFP4 cuts all-to-all traffic by roughly 4× vs BF16. DeepSeek-R1 / V3.2 on GB200/GB300 gets most of its throughput from NVFP4 over the fabric, not from per-GPU compute (see vLLM WideEP blog).
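
The first two regimes can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the model shape (80 layers, 8 KV heads, head dim 128, 70B parameters) is an assumed Llama-3.1-70B-like configuration, the 40 GiB KV budget is hypothetical, the helper names are not part of vLLM, and scale-factor overhead for quantized formats is ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """K and V each store layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint; ignores scales and unquantized layers."""
    return params * bits_per_weight / 8


# Assumed shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, PARAMS = 80, 8, 128, 70e9

bf16_kv = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)  # 327,680 bytes per token
fp8_kv = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)   # 163,840 bytes per token

# For a fixed KV budget, token capacity scales inversely with bytes per token.
kv_budget = 40 * 1024**3  # hypothetical 40 GiB left for KV after weights
print(f"BF16 KV: {kv_budget // bf16_kv:,} tokens")
print(f"FP8  KV: {kv_budget // fp8_kv:,} tokens")

# Decode reads roughly every weight once per step, so weight bytes bound HBM traffic.
for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_bytes(PARAMS, bits) / 1e9:.0f} GB of weights")
```

Real capacity also depends on block size, scale storage, and what the engine reserves, so treat these numbers as an upper bound and confirm with kv_cache_usage_perc.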

Quantized models are not equivalent to the BF16 original. Always evaluate on actual traffic. Stock NVFP4 checkpoints recover ~99 % of BF16 accuracy at 70B+ and ~95–98 % at 7B–14B (Red Hat / NVIDIA numbers). Code, math, and agentic workloads are hit harder.

Format selection — pick once per hardware

GPU	Weight format recommendation	KV cache	Why
H100 / H200 (SM90)	`fp8` (compressed-tensors) or `modelopt`	`fp8_e4m3`	FP8 native Tensor Cores, CUTLASS/Marlin/DeepGEMM all mature
H100 / H200, accuracy-critical	`awq_marlin` / `gptq_marlin` (W4A16)	`fp8_e4m3`	Weight-only INT4 with per-group scales — best accuracy at 4-bit
H100 / H200, long-context MoE	`fp8` + DeepGEMM block	`fp8_e4m3`	Block FP8 MoE uses DeepGEMM path, lower activation-scale cost
B200 / B300 (SM100 / SM103)	`modelopt_fp4` or compressed-tensors NVFP4	`fp8_e4m3` (NVFP4 KV roadmap #32220)	Blackwell has native FP4 Tensor Cores — NVFP4 wins on both memory AND compute
B200 / B300, GPT-OSS	`mxfp4` / `gpt_oss_mxfp4` + `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1`	`fp8_e4m3`	Only vendor-supplied format GPT-OSS ships
B200 / B300, lower accuracy risk	`modelopt_mxfp8` or online `fp8_per_block`	`fp8_e4m3`	MXFP8 MoE has the newest kernel set (v0.19), better on shapes NVFP4 struggles with
GB10 / DGX Spark (SM121)	`fp8` only (NVFP4/MXFP4 kernels brittle on SM121)	`fp8_e4m3`	See #39761 / #37030 / #34817 — desktop-Blackwell quant kernels are not production ready
MI300X / MI355X (ROCm, gfx942/gfx950)	`quark` (AMD) — W4A8 MXFP4/FP8	`fp8_e4m3` (FNUZ-adjusted)	MI300 needs FNUZ scale adjustment; AMD Quark is the validated path
CPU	`cpu_awq` (W4A16) or `torchao`	—	Intel path for laptop / dev

Cross-hardware rule of thumb: produce one NVFP4 checkpoint per model. It loads on Blackwell natively and on Hopper via emulation (PR #35733, v0.19). A separate fp8 checkpoint is still worth keeping for older Hopper nodes where the NVFP4 emulation path is slower.
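
For fleet automation, the default rows of the table above can be encoded as a lookup. This is a minimal sketch: `pick_defaults` and `DEFAULTS` are hypothetical names (vLLM does not ship such a helper), only the general-purpose row for each GPU family is included, and the accuracy-critical, MoE, and GPT-OSS rows still need a human decision.

```python
# Default (weight flag, KV-cache dtype) per compute capability, copied from the table above.
# Hypothetical helper: not a vLLM API.
DEFAULTS = {
    90: ("fp8", "fp8_e4m3"),            # H100 / H200 row
    100: ("modelopt_fp4", "fp8_e4m3"),  # B200 / B300 row (SM100)
    103: ("modelopt_fp4", "fp8_e4m3"),  # B200 / B300 row (SM103)
    121: ("fp8", "fp8_e4m3"),           # GB10 / DGX Spark row
}


def pick_defaults(sm: int) -> tuple[str, str]:
    """Return the table's default (--quantization, --kv-cache-dtype) for an SM number."""
    try:
        return DEFAULTS[sm]
    except KeyError:
        raise ValueError(f"SM{sm} is not covered by the selection table") from None


weights, kv = pick_defaults(100)
print(f"--quantization {weights} --kv-cache-dtype {kv}")
```

On a live node the SM number can be derived from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple; `major * 10 + minor` gives the key used here.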

The `--quantization` flag values (all 29)

Single dispatch point: vllm/model_executor/layers/quantization/__init__.py:107-184. Full catalog with file paths, min-capability, kernel map, and notes: references/formats.md.

Production formats (keep in head):

Flag	Min SM	Use for
`fp8`	89	Compressed-tensors FP8 W8A8 — the Hopper default
`modelopt`	89	ModelOpt-exported FP8 (TRT-LLM ecosystem)
`modelopt_fp4`	75 (emulated), 100 (native)	ModelOpt NVFP4 — the Blackwell default
`modelopt_mxfp8`	89	ModelOpt MXFP8 (MoE + dense)
`modelopt_mixed`	89	Mixed-precision per-layer checkpoints
`compressed-tensors`	varies per scheme	neuralmagic / Red Hat / llm-compressor output
`awq_marlin`	75	AWQ W4A16 — accuracy-critical INT4
`gptq_marlin`	75	GPTQ W4A16 — classic INT4
`mxfp4` / `gpt_oss_mxfp4`	80 (MoE only on 100)	GPT-OSS ships this
`mxfp8`	80	Online MXFP8 (v0.19+)
`quark`	varies	AMD ROCm path
`fp8_per_tensor` / `fp8_per_block` / `int8_per_channel_weight_only` / `online`	75	Online quantization from BF16 checkpoint — no pre-quant step

Deprecated / legacy / narrow: awq (unfused Triton — use awq_marlin), gptq (unfused — use gptq_marlin), fbgemm_fp8, fp_quant, experts_int8 (use int8_per_channel_weight_only), moe_wna16, bitsandbytes, gguf, inc / auto-round (Intel), torchao, cpu_awq.
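
A pre-flight check can catch a legacy flag or an under-powered GPU before the server starts. The sketch below only restates the Min SM column and the replacements named in the legacy list; `check_flag` is a hypothetical helper, and flags whose minimum the table gives as "varies" are reported as unknown rather than guessed.

```python
# Minimum SM per flag, copied from the "Production formats" table above.
MIN_SM = {
    "fp8": 89, "modelopt": 89, "modelopt_mxfp8": 89, "modelopt_mixed": 89,
    "awq_marlin": 75, "gptq_marlin": 75,
    "mxfp4": 80, "gpt_oss_mxfp4": 80,  # the table notes MoE only on SM100
    "mxfp8": 80,
    "fp8_per_tensor": 75, "fp8_per_block": 75,
    "int8_per_channel_weight_only": 75, "online": 75,
}

# Replacements named in the legacy list above.
REPLACED_BY = {
    "awq": "awq_marlin",
    "gptq": "gptq_marlin",
    "experts_int8": "int8_per_channel_weight_only",
}


def check_flag(flag: str, sm: int) -> tuple[str, str]:
    """Return (resolved flag, status) where status follows the table's Min SM column."""
    resolved = REPLACED_BY.get(flag, flag)
    if resolved == "modelopt_fp4":  # 75 (emulated), 100 (native) per the table
        if sm >= 100:
            return resolved, "native"
        return resolved, ("emulated" if sm >= 75 else "unsupported")
    if resolved not in MIN_SM:  # compressed-tensors, quark: "varies"
        return resolved, "unknown"
    return resolved, ("supported" if sm >= MIN_SM[resolved] else "unsupported")


print(check_flag("awq", 90))
print(check_flag("modelopt_fp4", 90))
```

The check is deliberately coarse: it says nothing about kernel maturity, so SM120 / SM121 parts still need the caveats from the selection table.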

KV-cache dtypes (all 11)

Single dispatch: vllm/config/cache.py:18-34.

auto — match model weight dtype.
fp8, fp8_e4m3, fp8_e5m2 — the production path. E4M3 is default; E5M2 only for ROCm-specific setups.
fp8_inc (Intel), fp8_ds_mla (DeepSeek MLA variant).
int8_per_token_head, fp8_per_token_head — dynamic per-(token,head) scales computed in-kernel. No checkpoint scales needed. Added in PR #34281, v0.17.
turboquant_k8v4, turboquant_4bit_nc, turboquant_k3v4_nc, turboquant_3bit_nc — Hadamard-rotated 2-4 bit KV (v0.19, PR #38479).
nvfp4 — roadmap, gated on #32220.

--calculate-kv-scales was deprecated in v0.19 (PR #37201). Use pre-calibrated scales (LLM Compressor produces them) or let per-token-head scales be computed dynamically.
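
A launcher script can validate the dtype before building the command line. The following is a sketch under stated assumptions: the accepted values are copied from the list above rather than read from vllm/config/cache.py, `serve_command` is a hypothetical wrapper, and `nvfp4` is rejected because the list marks it as roadmap.

```python
import shlex

# Values copied from the list above; the source of truth is vllm/config/cache.py.
KV_CACHE_DTYPES = {
    "auto", "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc", "fp8_ds_mla",
    "int8_per_token_head", "fp8_per_token_head",
    "turboquant_k8v4", "turboquant_4bit_nc", "turboquant_k3v4_nc", "turboquant_3bit_nc",
}
ROADMAP_DTYPES = {"nvfp4"}  # listed above as gated on #32220


def serve_command(model: str, kv_cache_dtype: str = "auto") -> str:
    """Build a `vllm serve` command line after checking the KV-cache dtype."""
    if kv_cache_dtype in ROADMAP_DTYPES:
        raise ValueError(f"{kv_cache_dtype} is roadmap-only in this document")
    if kv_cache_dtype not in KV_CACHE_DTYPES:
        raise ValueError(f"unknown KV-cache dtype: {kv_cache_dtype}")
    return shlex.join(["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype])


print(serve_command("meta-llama/Llama-3.1-70B", "fp8_e4m3"))
```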

Producing a checkpoint

Apart from online quantization (covered in its own section), vLLM doesn't quantize: a separate tool produces the checkpoint, then vLLM loads the result. Two production paths exist:

llm-compressor (vLLM-project) — outputs compressed-tensors format. Preferred for the open ecosystem. Covered in references/llm-compressor.md.
NVIDIA ModelOpt — outputs ModelOpt HF format, also consumable by TRT-LLM and SGLang. Preferred for NVFP4 on Blackwell. Covered in references/modelopt.md.

Quick picker:

FP8 Hopper, no calibration wanted → llm-compressor FP8_DYNAMIC (data-free, ~15 min for a 70B on H100).
W4A16 INT4 (AWQ or GPTQ) with best accuracy → llm-compressor, AWQModifier or GPTQModifier with 256–512 ultrachat samples.
NVFP4 on Blackwell → ModelOpt NVFP4_DEFAULT_CFG or llm-compressor NVFP4A16 / NVFP4 scheme (v0.10+).
MXFP4 MoE for GPT-OSS-style models → use the vendor checkpoint as-is, or ModelOpt MXFP4.
KV cache FP8 scales for MLA → llm-compressor kv_cache_scheme block with strategy: tensor (per-tensor is stable; per-head is experimental — see #38652 before enabling on MLA multi-turn).

Both tools output a HF directory vLLM serves with --quantization compressed-tensors (llm-compressor) or --quantization modelopt / --quantization modelopt_fp4 (ModelOpt).
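
When a checkpoint arrives without provenance, its config files usually reveal which flag applies. The sketch below is a heuristic, not the loader's logic: it assumes ModelOpt exports carry an `hf_quant_config.json` with a `quantization.quant_algo` field whose values include `FP8` and `NVFP4`, and that llm-compressor writes `quant_method: compressed-tensors` into `config.json`. Verify both layouts against a real export before relying on it; `guess_quantization_flag` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path

# Assumed mapping from ModelOpt's quant_algo field to the vLLM flag.
MODELOPT_ALGO_TO_FLAG = {"FP8": "modelopt", "NVFP4": "modelopt_fp4"}


def guess_quantization_flag(ckpt_dir):
    """Return a likely --quantization value for an HF checkpoint dir, or None."""
    ckpt = Path(ckpt_dir)

    # ModelOpt exports carry a separate hf_quant_config.json (assumed layout).
    modelopt_cfg = ckpt / "hf_quant_config.json"
    if modelopt_cfg.exists():
        quant = json.loads(modelopt_cfg.read_text()).get("quantization", {})
        return MODELOPT_ALGO_TO_FLAG.get(quant.get("quant_algo"))

    # llm-compressor writes a quantization_config block into config.json.
    config = ckpt / "config.json"
    if config.exists():
        qcfg = json.loads(config.read_text()).get("quantization_config", {})
        if qcfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"
    return None


# Tiny self-check against a fake checkpoint directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text(
        json.dumps({"quantization_config": {"quant_method": "compressed-tensors"}})
    )
    print(guess_quantization_flag(tmp))
```

A `None` result means the heuristic found nothing it recognizes, not that the checkpoint is unquantized.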

Speculative decoding drafters

llm-compressor does not train drafters. ModelOpt does: modelopt/torch/speculative/{eagle,dflash,medusa,plugins}/, examples in examples/speculative_decoding/. Recipes: modelopt_recipes/general/speculative_decoding/{eagle3,dflash}.yaml.

Critical constraint: the ModelOpt recipes assume a BF16 target and are not validated against an already-NVFP4 target. The base model is wrapped in torch.no_grad(), so a quantized target is workable in theory but unvalidated. The order is:

1. Train drafter on BF16 target   (ModelOpt, ~4-12h on 8×H100)
2. Export drafter HF dir          (scripts/export_hf_checkpoint.py)
3. PTQ target to NVFP4 or FP8     (ModelOpt or llm-compressor)
4. (Optional) PTQ drafter too     (small, minimal accuracy cost)
5. Serve both in vLLM             (--quantization modelopt_fp4 --speculative-config '{...}')

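MTP cannot be trained post-hoc: MTP heads are part of pretraining (DeepSeek V3, Qwen3-Next, GLM-4.5 MoE, etc.). Medusa heads, by contrast, are trained against a frozen target and fall under the ModelOpt medusa module listed above. Full details + exact commands: references/modelopt.md § speculative-decoding.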

For spec-dec runtime tuning (acceptance rate metrics, method selection per target family, chunked-prefill composability) use the separate vllm-speculative-decoding skill — don't duplicate here.

Online quantization

Introduced by the v0.14 redesign (PR #37776). Quantizes a BF16 checkpoint at load time, with no pre-quantization step. Trade-off: peak memory during loading is that of the BF16 checkpoint, so the savings only appear once loading has finished.

# Per-tensor FP8 (static scales, simplest)
vllm serve meta-llama/Llama-3.1-70B --quantization fp8_per_tensor

# Per-block FP8 (dynamic per-token activation scales)
vllm serve meta-llama/Llama-3.1-70B --quantization fp8_per_block

# Weight-only INT8
vllm serve meta-llama/Llama-3.1-70B --quantization int8_per_channel_weight_only

# YAML-configured
vllm serve meta-llama/Llama-3.1-70B \
  --quantization online \
  --quantization-config-file online.yaml

Known gotchas: see #39663 (drops bias weights), #34129 (doesn't split MoE across EP), and #19020 / #32029 / #32412 (multiple active RFCs). For any model with bias terms or MoE layers, prefer a pre-quantized checkpoint.
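
Whether a target has bias tensors can be checked from the checkpoint without loading any weights. This sketch assumes a sharded Hugging Face checkpoint with a `model.safetensors.index.json` whose `weight_map` lists every tensor name; single-file checkpoints have no index and need a different check. The function names are hypothetical, and MoE detection is left out.

```python
import json
import tempfile
from pathlib import Path


def bias_tensors(ckpt_dir):
    """List tensor names ending in '.bias' from a sharded safetensors index."""
    index = Path(ckpt_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index.read_text())["weight_map"]
    return sorted(name for name in weight_map if name.endswith(".bias"))


def online_fp8_looks_safe(ckpt_dir):
    """True when no bias tensors are present; MoE layers are not checked here."""
    return not bias_tensors(ckpt_dir)


# Self-check with a fake two-tensor index.
with tempfile.TemporaryDirectory() as tmp:
    fake = {"weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
    }}
    Path(tmp, "model.safetensors.index.json").write_text(json.dumps(fake))
    print(bias_tensors(tmp))
    print(online_fp8_looks_safe(tmp))
```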

The operator-pain-point shortlist

Internalize these before debugging accuracy / throughput regressions:

--kv-cache-dtype fp8 on MLA models → garbage on multi-turn (#38652). Unresolved. Avoid on DeepSeek, GLM-4.5/4.6/4.7, Kimi K2 until FP8-KV MLA follow-ups land.
Gemma 4 FP8-block → logit saturation / repetitive garbage (#39407, #39049). Use non-block FP8 or FP16.
NVFP4 on Qwen3-Next / hybrid-attention models silently corrupted output when quantization_config.ignore missed linear_attn layers (#40252, fixed + closed 2026-04-20). The underlying pattern still applies to any new hybrid-attention model: always audit the ignore list when quantizing non-standard architectures.
Online FP8 drops bias weights (#39663). For any target with bias terms, use a pre-quantized checkpoint.
Dynamic FP8 + LoRA-merged model on B200 → non-deterministic degenerate output (#39662). Pin static FP8.
SM120 (RTX 5090, 6000 Pro) is not a datacenter NVFP4 MoE target (#35065, #31085): the full kernel set is SM100 / SM103 only. Desktop Blackwell is production-ready only for fp8.
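
The ignore-list audit from the Qwen3-Next item can be automated as a pre-export check. The sketch below assumes ignore entries are either exact module names or regexes prefixed with `re:` (the convention used in compressed-tensors configs; confirm against the version you run). The `linear_attn` marker comes from the issue above, and the function names and example module names are hypothetical.

```python
import re


def matches_ignore(name: str, ignore: list[str]) -> bool:
    """True if a module name is covered by an exact entry or a 're:' regex entry."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            return True
    return False


def unprotected_modules(module_names, ignore, marker="linear_attn"):
    """Modules containing `marker` that the ignore list would still quantize."""
    return [n for n in module_names if marker in n and not matches_ignore(n, ignore)]


modules = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]
print(unprotected_modules(modules, ["lm_head"]))
print(unprotected_modules(modules, ["lm_head", "re:.*linear_attn.*"]))
```

In practice `module_names` would come from the model's `named_modules()`, and `marker` should be swapped for whatever the new architecture calls its non-standard attention blocks.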

The hardware-/version-gated traps (B300 TRTLLM hang, ModelOpt-vs-compressed-tensors export drift, MXFP4-linear-falls-back-to-BF16, A100 TurboQuant crash, Qwen3.5 v0.18 KV regression, MXFP8+DeepGEMM pre-v0.19 crash) live in the full triage playbook with symptoms → PR → workaround: references/troubleshooting.md.

Version-gate highlights

Full matrix in references/version-gates.md. Load-bearing ones:

v0.21 — current stable v0.21.0 (2026-05-15); v0.20.0 stable shipped 2026-04-27, followed by v0.20.1 / v0.20.2. Run v0.21.0+ for production. Quantization-layer churn continues — re-verify any v0.19-specific claim on upgrade.
v0.19 — online MXFP8, CompressedTensorsW8A8Mxfp8, ROCm AWQ Marlin, TurboQuant KV, DeepGemm E8M0 fix for Qwen3.5 FP8 on Blackwell, --calculate-kv-scales deprecation, Gemma 4 quantized MoE, B300 / GB300 fixes.
v0.18 — FP8 KV in Triton MLA decode, FlashInfer Sparse MLA FP8, ModelOpt MXFP8 MoE, AMD Quark W4A8 MXFP4/FP8, MLA crash with AWQ/GPTQ fix.
v0.17 — per-head KV scales, SM100 MXFP8 kernels, compressed-tensors as ground-truth, ModelOpt mixed precision, Llama-4 attention quant.
v0.16 — NVFP4/FP8 on Turing via emulation, TP>4 for FP4 GEMM, ModelOpt MXFP8 dense.
v0.15 — MXFP4 W4A16 for compressed-tensors MoE, FP4 kernel optimization (+65 % on SM100F via 256-bit loads).
v0.14 — Online quantization redesign, MXFP4 W4A16 for dense.
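
These gates are easy to check mechanically before a rollout. The sketch below encodes a few of the highlights above as minimum `(major, minor)` versions; the feature keys are hypothetical labels, the helper is not part of vLLM, and the full matrix in references/version-gates.md remains the authority.

```python
# Minimum vLLM (major, minor) per feature, taken from the highlights above.
MIN_VLLM = {
    "turboquant_kv": (0, 19),
    "online_mxfp8": (0, 19),
    "modelopt_mxfp8_moe": (0, 18),
    "per_token_head_kv": (0, 17),
    "modelopt_mxfp8_dense": (0, 16),
    "online_quantization_redesign": (0, 14),
}


def parse_minor(version: str) -> tuple[int, int]:
    """Turn 'v0.19.1' or '0.19.1' into (0, 19); patch level is ignored."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def unavailable(version: str, features: list[str]) -> list[str]:
    """Features whose minimum version is newer than the running one."""
    have = parse_minor(version)
    return [f for f in features if MIN_VLLM[f] > have]


print(unavailable("v0.18.0", ["turboquant_kv", "per_token_head_kv"]))
```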

External references

Load the source rather than paraphrasing:

vLLM docs: FP8 W8A8, Quantized KV Cache, AMD Quark.
vLLM recipes: index.
llm-compressor docs: index, NVFP4 W4A4, Qwen3.5 NVFP4 MoE.
compressed-tensors spec: quant_scheme.py, overview.
NVIDIA: Introducing NVFP4, NVFP4 KV cache, MoE perf leaps on Blackwell.
Red Hat: Accelerating LLMs with NVFP4, LLM Compressor 0.9, vLLM FP8 foundational.
vLLM blog: GPT-OSS on Blackwell, DeepSeek-R1 WideEP on GB200, DeepSeek-V3.2 on GB300.
AMD: FP8 with Quark for vLLM, MXFP4 Llama3.3 with Quark.

When in doubt, read the vLLM source — vllm/model_executor/layers/quantization/ is the ground truth, and the quantization layer churns fast enough that cached knowledge rots inside a release cycle.