vllm-quantization


vLLM datacenter-GPU quantization — picking, configuring, troubleshooting NVFP4, FP8, MXFP4, MXFP8, AWQ, GPTQ, INT8, compressed-tensors, modelopt, quark on H100/H200/B200/B300/GB200/GB300. 29 `--quantization` flag values, KV-cache dtypes (fp8_e4m3, nvfp4, per-token-head, turboquant), MoE backend selection (CUTLASS, TRTLLM, FlashInfer, DeepGEMM, Marlin, Qutlass), producing checkpoints with llm-compressor and NVIDIA ModelOpt (NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG, W4A16, SmoothQuant+GPTQ), online quantization (`fp8_per_tensor`, `fp8_per_block`), training EAGLE-3/dflash drafters on BF16 targets before PTQ, version gates per vLLM release (v0.14 → v0.21).

By air-gapped. Updated 5/29/2026

name: vllm-quantization
allowed-tools: Bash, Read, Write, Edit, Grep, Glob
description: |-
  vLLM datacenter-GPU quantization: picking, configuring, troubleshooting NVFP4, FP8, MXFP4, MXFP8, AWQ, GPTQ, INT8, compressed-tensors, modelopt, quark on H100/H200/B200/B300/GB200/GB300. 29 --quantization flag values, KV-cache dtypes (fp8_e4m3, nvfp4, per-token-head, turboquant), MoE backend selection (CUTLASS, TRTLLM, FlashInfer, DeepGEMM, Marlin, Qutlass), producing checkpoints with llm-compressor and NVIDIA ModelOpt (NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG, W4A16, SmoothQuant+GPTQ), online quantization (fp8_per_tensor, fp8_per_block), training EAGLE-3/dflash drafters on BF16 targets before PTQ, version gates per vLLM release (v0.14 → v0.21).
when_to_use: |-
  Trigger on --quantization, --kv-cache-dtype, NVFP4, MXFP4, MXFP8, FP8, W4A16, W8A8, W4A4, AWQ, GPTQ, SmoothQuant, modelopt, compressed-tensors, quark, torchao, bitsandbytes, gguf, TurboQuant, CUTLASS, Marlin, FlashInfer, TRTLLM, DeepGEMM, Qutlass, Machete, hf_quant_config.json, kv_cache_scheme, NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG, llm-compressor, ModelOpt.
  Symptoms: "garbage after FP8", "NVFP4 NaN", "FP8 KV multi-turn corruption", "MoE kernel not dispatched on SM120", "illegal memory access awq_marlin", "online FP8 drops bias", "modelopt checkpoint won't load".
  Decisions: NVFP4 vs FP8 on H200 vs B200, quantizing EAGLE-3/dflash drafters, generating a checkpoint vLLM can load.
  Also implicit: "quantize {model}", "pick quant for {model}", "audit quantization", "deploy-memo quant", "which quant fits {GPU}", "spec-study quantization".

vLLM quantization — operator skill

Last verified: 2026-04-24 — see references/sources.md for per-ref audit table.

For production vLLM operators on H100 / H200 / B200 / B300 / GB200 / GB300 fleets deciding which quantization format fits a given target model, producing a checkpoint vLLM will actually load, wiring the right KV-cache dtype, diagnosing accuracy or throughput regressions after an upgrade, and composing quantization with speculative decoding / LoRA / MoE.

Pointer-map format: this SKILL.md picks the format and CLI; the files in references/ hold the per-format deep dives, exact source pointers, and troubleshooting cards. Follow the link, don't paraphrase from memory — the quantization layer moves faster than any other subsystem in vLLM (six formats landed in v0.19 alone).

When quantization wins, when it doesn't

Quantization trades weight precision for memory + compute:

  • KV-capacity bound (long context, high concurrency): an FP8 KV cache roughly doubles KV capacity relative to BF16, and an NVFP4 KV cache (still on the roadmap) roughly quadruples it; the weight format matters much less than getting --kv-cache-dtype right. Measure kv_cache_usage_perc.
  • Memory-bandwidth bound (small batch, decode-heavy, 70B+ on < 8 GPUs): weight quantization (NVFP4 / FP8 / W4A16) reduces HBM traffic per token, giving 1.5–3× decode throughput on a well-matched target+kernel.
  • Compute bound (prefill, large batch, small model): quantization may not help. Blackwell, with its FP4 Tensor Cores, is the first architecture where W4A4 actually beats FP8 in compute-bound regimes. On Hopper, W4A16 saves memory only; the MMA still runs in FP16.
  • Multi-node EP / disaggregated serving: NVFP4 reduces all-to-all traffic by 4× vs BF16. DeepSeek-R1 / V3.2 on GB200/GB300 gets most of its throughput from NVFP4 over the fabric, not from per-GPU compute (see vLLM WideEP blog).
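
The KV-capacity multiplier in the first bullet is simple arithmetic, sketched below. The model shape (80 layers, 8 KV heads, head_dim 128) and the 40 GiB budget are illustrative assumptions, not measurements, and the NVFP4 figure ignores per-block scale overhead, so treat it as an upper bound on capacity.

```python
# Bytes per stored KV element. The nvfp4 value ignores scale-factor
# overhead (an assumption), so real capacity would be somewhat lower.
BYTES_PER_ELEM = {"bf16": 2.0, "fp8_e4m3": 1.0, "nvfp4": 0.5}


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype: str) -> float:
    """K and V each hold layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * BYTES_PER_ELEM[dtype]


def tokens_that_fit(budget_gib: float, **shape) -> int:
    """Whole tokens of KV cache that fit in a budget given in GiB."""
    return int(budget_gib * 1024**3 // kv_bytes_per_token(**shape))


# Hypothetical GQA model shape and KV budget, for illustration only.
shape = dict(layers=80, kv_heads=8, head_dim=128)
for dtype in BYTES_PER_ELEM:
    print(dtype, tokens_that_fit(40, dtype=dtype, **shape))
```

The ratio between rows is what matters: halving the bytes per element doubles the number of cached tokens for the same budget, independent of the model shape.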

Quantized models are not equivalent to the BF16 original. Always evaluate on actual traffic. Stock NVFP4 checkpoints recover ~99 % of baseline accuracy at 70B+ and ~95–98 % at 7B–14B (Red Hat / NVIDIA numbers). Code, math, and agentic workloads are hit harder.

Format selection — pick once per hardware

| GPU | Weight format recommendation | KV cache | Why |
|---|---|---|---|
| H100 / H200 (SM90) | fp8 (compressed-tensors) or modelopt | fp8_e4m3 | FP8 native Tensor Cores; CUTLASS/Marlin/DeepGEMM all mature |
| H100 / H200, accuracy-critical | awq_marlin / gptq_marlin (W4A16) | fp8_e4m3 | Weight-only INT4 with per-group scales; best accuracy at 4-bit |
| H100 / H200, long-context MoE | fp8 + DeepGEMM block | fp8_e4m3 | Block FP8 MoE uses DeepGEMM path, lower activation-scale cost |
| B200 / B300 (SM100 / SM103) | modelopt_fp4 or compressed-tensors NVFP4 | fp8_e4m3 (NVFP4 KV roadmap #32220) | Blackwell has native FP4 Tensor Cores; NVFP4 wins on both memory AND compute |
| B200 / B300, GPT-OSS | mxfp4 / gpt_oss_mxfp4 + VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 | fp8_e4m3 | Only vendor-supplied format GPT-OSS ships |
| B200 / B300, lower accuracy risk | modelopt_mxfp8 or online fp8_per_block | fp8_e4m3 | MXFP8 MoE has the newest kernel set (v0.19), better on shapes NVFP4 struggles with |
| GB10 / DGX Spark (SM121) | fp8 only (NVFP4/MXFP4 kernels brittle on SM121) | fp8_e4m3 | See #39761 / #37030 / #34817; desktop-Blackwell quant kernels are not production ready |
| MI300X / MI355X (ROCm, gfx942/gfx950) | quark (AMD), W4A8 MXFP4/FP8 | fp8_e4m3 (FNUZ-adjusted) | MI300 needs FNUZ scale adjustment; AMD Quark is the validated path |
| CPU | cpu_awq (W4A16) or torchao | | Intel path for laptop / dev |

Cross-hardware rule of thumb: produce one NVFP4 checkpoint per model. It loads on Blackwell natively and on Hopper via emulation (PR #35733, v0.19). A separate fp8 checkpoint is still worth keeping for older Hopper nodes where the NVFP4 emulation path is slower.
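
As a rough sketch, the first-choice rows of the table can be encoded as a lookup keyed on SM capability. `QuantPick`, `DEFAULT_BY_SM`, and `pick_default` are hypothetical names introduced here for illustration; the workload-specific rows (accuracy-critical, GPT-OSS, long-context MoE) are deliberately left out, so this is a starting default rather than a full decision procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantPick:
    weight_flag: str     # value to pass to --quantization
    kv_cache_dtype: str  # value to pass to --kv-cache-dtype


# First-choice row per SM, copied from the table above.
DEFAULT_BY_SM = {
    90: QuantPick("fp8", "fp8_e4m3"),            # H100 / H200
    100: QuantPick("modelopt_fp4", "fp8_e4m3"),  # B200
    103: QuantPick("modelopt_fp4", "fp8_e4m3"),  # B300
    121: QuantPick("fp8", "fp8_e4m3"),           # GB10 / DGX Spark
}


def pick_default(sm: int) -> QuantPick:
    """Return the table's default pick for an SM number such as 90 or 100.

    The SM number is assumed to be major * 10 + minor of the CUDA compute
    capability (for example, as reported by torch.cuda.get_device_capability()).
    """
    try:
        return DEFAULT_BY_SM[sm]
    except KeyError:
        raise ValueError(f"SM{sm} is not covered by the selection table") from None


print(pick_default(100))
```

Raising on unknown SM values rather than falling back to a guess keeps the helper from silently recommending a format on hardware the table does not cover.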

The --quantization flag values (all 29)

Single dispatch point: vllm/model_executor/layers/quantization/__init__.py:107-184. Full catalog with file paths, min-capability, kernel map, and notes: references/formats.md.

Production formats (keep in head):

| Flag | Min SM | Use for |
|---|---|---|
| fp8 | 89 | Compressed-tensors FP8 W8A8, the Hopper default |
| modelopt | 89 | ModelOpt-exported FP8 (TRT-LLM ecosystem) |
| modelopt_fp4 | 75 (emulated), 100 (native) | ModelOpt NVFP4, the Blackwell default |
| modelopt_mxfp8 | 89 | ModelOpt MXFP8 (MoE + dense) |
| modelopt_mixed | 89 | Mixed-precision per-layer checkpoints |
| compressed-tensors | varies per scheme | neuralmagic / Red Hat / llm-compressor output |
| awq_marlin | 75 | AWQ W4A16, accuracy-critical INT4 |
| gptq_marlin | 75 | GPTQ W4A16, classic INT4 |
| mxfp4 / gpt_oss_mxfp4 | 80 (MoE only on 100) | GPT-OSS ships this |
| mxfp8 | 80 | Online MXFP8 (v0.19+) |
| quark | varies | AMD ROCm path |
| fp8_per_tensor / fp8_per_block / int8_per_channel_weight_only / online | 75 | Online quantization from BF16 checkpoint, no pre-quant step |

Deprecated / legacy / narrow: awq (unfused Triton — use awq_marlin), gptq (unfused — use gptq_marlin), fbgemm_fp8, fp_quant, experts_int8 (use int8_per_channel_weight_only), moe_wna16, bitsandbytes, gguf, inc / auto-round (Intel), torchao, cpu_awq.
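
A pre-flight check along these lines can catch a mismatched flag before a long model load. This is a minimal sketch: `MIN_SM`, `REPLACEMENTS`, and `check_flag` are hypothetical helpers that only mirror the short table and the deprecation notes above, not vLLM's own dispatch logic, and flags whose minimum is listed as "varies" are intentionally absent.

```python
# Minimum SM per flag, copied from the short table above.
MIN_SM = {
    "fp8": 89,
    "modelopt": 89,
    "modelopt_fp4": 75,  # emulated below SM100, native from SM100
    "modelopt_mxfp8": 89,
    "modelopt_mixed": 89,
    "awq_marlin": 75,
    "gptq_marlin": 75,
    "mxfp4": 80,
    "gpt_oss_mxfp4": 80,
    "mxfp8": 80,
    "fp8_per_tensor": 75,
    "fp8_per_block": 75,
    "int8_per_channel_weight_only": 75,
    "online": 75,
}

# Legacy flags and the replacements the text recommends.
REPLACEMENTS = {
    "awq": "awq_marlin",
    "gptq": "gptq_marlin",
    "experts_int8": "int8_per_channel_weight_only",
}


def check_flag(flag: str, sm: int) -> str:
    """Return the flag to use on this SM, upgrading legacy names first."""
    flag = REPLACEMENTS.get(flag, flag)
    if flag not in MIN_SM:
        raise KeyError(f"{flag!r} is not in the short table; see references/formats.md")
    if sm < MIN_SM[flag]:
        raise ValueError(f"{flag!r} needs SM{MIN_SM[flag]}+, got SM{sm}")
    return flag


print(check_flag("awq", 90))
```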

KV-cache dtypes (all 11)

Single dispatch: vllm/config/cache.py:18-34.

  • auto — match model weight dtype.
  • fp8, fp8_e4m3, fp8_e5m2 — the production path. E4M3 is default; E5M2 only for ROCm-specific setups.
  • fp8_inc (Intel), fp8_ds_mla (DeepSeek MLA variant).
  • int8_per_token_head, fp8_per_token_head — dynamic per-(token,head) scales computed in-kernel. No checkpoint scales needed. Added in PR #34281, v0.17.
  • turboquant_k8v4, turboquant_4bit_nc, turboquant_k3v4_nc, turboquant_3bit_nc — Hadamard-rotated 2-4 bit KV (v0.19, PR #38479).
  • nvfp4 — roadmap, gated on #32220.

--calculate-kv-scales was deprecated in v0.19 (PR #37201). Use pre-calibrated scales (LLM Compressor produces them) or let per-token-head scales be computed dynamically.
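
To make the per-(token, head) idea concrete, here is a pure-Python sketch of dynamic INT8 scaling: each (token, head) vector gets its own scale derived from its absolute maximum, so no calibration pass or checkpoint scale is required. It illustrates the general technique only and makes no claim about how vLLM's kernels lay out or compute these scales.

```python
def quantize_per_token_head(kv, qmax=127):
    """Quantize kv[token][head] (a list of floats) with one scale per vector."""
    codes, scales = [], []
    for token in kv:
        tok_codes, tok_scales = [], []
        for head in token:
            amax = max(abs(x) for x in head)
            scale = amax / qmax if amax > 0 else 1.0  # avoid dividing by zero
            tok_scales.append(scale)
            tok_codes.append([max(-qmax, min(qmax, round(x / scale))) for x in head])
        codes.append(tok_codes)
        scales.append(tok_scales)
    return codes, scales


def dequantize(codes, scales):
    """Invert quantize_per_token_head up to rounding error."""
    return [
        [[c * s for c in head] for head, s in zip(tok_codes, tok_scales)]
        for tok_codes, tok_scales in zip(codes, scales)
    ]


# One token, two heads with very different magnitudes.
kv = [[[0.3, -1.0], [4.0, 2.0]]]
codes, scales = quantize_per_token_head(kv)
restored = dequantize(codes, scales)
print(scales, restored)
```

Because each head gets its own scale, the small-magnitude head keeps its resolution instead of being crushed by the larger one, which is the motivation for per-head over per-tensor scales.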

Producing a checkpoint

Outside the online path, vLLM doesn't quantize: a separate tool produces the checkpoint, then vLLM loads the result. Two production paths exist:

  1. llm-compressor (vLLM-project) — outputs compressed-tensors format. Preferred for the open ecosystem. Covered in references/llm-compressor.md.
  2. NVIDIA ModelOpt — outputs ModelOpt HF format, also consumable by TRT-LLM and SGLang. Preferred for NVFP4 on Blackwell. Covered in references/modelopt.md.

Quick picker:

  • FP8 Hopper, no calibration wanted → llm-compressor FP8_DYNAMIC (data-free, ~15 min for a 70B on H100).
  • W4A16 INT4 (AWQ or GPTQ) with best accuracy → llm-compressor, AWQModifier or GPTQModifier with 256–512 ultrachat samples.
  • NVFP4 on Blackwell → ModelOpt NVFP4_DEFAULT_CFG or llm-compressor NVFP4A16 / NVFP4 scheme (v0.10+).
  • MXFP4 MoE for GPT-OSS-style models → use the vendor checkpoint as-is, or ModelOpt MXFP4.
  • KV cache FP8 scales for MLA → llm-compressor kv_cache_scheme block with strategy: tensor (per-tensor is stable; per-head is experimental — see #38652 before enabling on MLA multi-turn).

Both tools output a HF directory vLLM serves with --quantization compressed-tensors (llm-compressor) or --quantization modelopt / --quantization modelopt_fp4 (ModelOpt).
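
When it is unclear which tool produced a checkpoint directory, a small inspection helper can suggest the matching flag. This is a sketch under stated assumptions: it assumes ModelOpt exports carry an hf_quant_config.json with a `quantization.quant_algo` field, and that llm-compressor exports record `quant_method: "compressed-tensors"` under `quantization_config` in config.json. Verify both against your actual checkpoint; `guess_quantization_flag` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path

# Assumed mapping from ModelOpt's quant_algo value to the vLLM flag.
MODELOPT_ALGO_TO_FLAG = {"FP8": "modelopt", "NVFP4": "modelopt_fp4"}


def guess_quantization_flag(ckpt_dir):
    """Return a likely --quantization value, or None if nothing is recognized."""
    root = Path(ckpt_dir)
    modelopt_cfg = root / "hf_quant_config.json"
    if modelopt_cfg.exists():
        quant = json.loads(modelopt_cfg.read_text()).get("quantization", {})
        return MODELOPT_ALGO_TO_FLAG.get(quant.get("quant_algo"))
    hf_cfg = root / "config.json"
    if hf_cfg.exists():
        quant = json.loads(hf_cfg.read_text()).get("quantization_config", {})
        if quant.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"
    return None


# Demo against a fake ModelOpt-style directory.
with tempfile.TemporaryDirectory() as tmp:
    fake = {"quantization": {"quant_algo": "NVFP4"}}
    (Path(tmp) / "hf_quant_config.json").write_text(json.dumps(fake))
    demo_flag = guess_quantization_flag(tmp)
print(demo_flag)
```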

Speculative decoding drafters

llm-compressor does not train drafters. ModelOpt does: modelopt/torch/speculative/{eagle,dflash,medusa,plugins}/, examples in examples/speculative_decoding/. Recipes: modelopt_recipes/general/speculative_decoding/{eagle3,dflash}.yaml.

Critical constraint: ModelOpt recipes assume a BF16 target and are not validated against an already-NVFP4 target (the base is wrapped in torch.no_grad(), so a quantized target is theoretically workable but unvalidated). The order is:

1. Train drafter on BF16 target   (ModelOpt, ~4-12h on 8×H100)
2. Export drafter HF dir          (scripts/export_hf_checkpoint.py)
3. PTQ target to NVFP4 or FP8     (ModelOpt or llm-compressor)
4. (Optional) PTQ drafter too     (small, minimal accuracy cost)
5. Serve both in vLLM             (--quantization modelopt_fp4 --speculative-config '{...}')

Medusa / MTP cannot be trained post-hoc — MTP heads are part of the pretraining (DeepSeek V3, Qwen3-Next, GLM-4.5 MoE, etc.). Full details + exact commands: references/modelopt.md § speculative-decoding.

For spec-dec runtime tuning (acceptance rate metrics, method selection per target family, chunked-prefill composability) use the separate vllm-speculative-decoding skill — don't duplicate here.

Online quantization

Introduced by the v0.14 redesign (PR #37776). Quantizes a BF16 checkpoint at load time, no pre-quantization step. Trade-off: peak load memory is BF16 size.

# Per-tensor FP8 (static scales, simplest)
vllm serve meta-llama/Llama-3.1-70B --quantization fp8_per_tensor

# Per-block FP8 (dynamic per-token activation scales)
vllm serve meta-llama/Llama-3.1-70B --quantization fp8_per_block

# Weight-only INT8
vllm serve meta-llama/Llama-3.1-70B --quantization int8_per_channel_weight_only

# YAML-configured
vllm serve meta-llama/Llama-3.1-70B \
  --quantization online \
  --quantization-config-file online.yaml

Known gotchas: see #39663 (drops bias weights), #34129 (doesn't split MoE across EP), and #19020 / #32029 / #32412 (multiple active RFCs). For any MoE model or any model with bias terms, prefer a pre-quantized checkpoint.
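
The peak-load-memory trade-off can be estimated before launch. The sketch below assumes weights are sharded evenly across GPUs and uses a hypothetical 90 % usable-HBM headroom; the parameter count and HBM sizes in the demo are illustrative, and activations and KV cache are not counted.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0}


def weight_gib(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GiB for a dense model."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3


def fits(params_billion: float, dtype: str, gpus: int,
         hbm_gib_per_gpu: float, headroom: float = 0.9) -> bool:
    """True if the weights fit in the usable share of total HBM."""
    return weight_gib(params_billion, dtype) <= gpus * hbm_gib_per_gpu * headroom


# A 70B model on a single hypothetical 141 GiB device: the steady-state FP8
# weights fit, but the BF16 peak during online quantization does not.
print("fp8 steady state fits:", fits(70, "fp8", gpus=1, hbm_gib_per_gpu=141))
print("bf16 load peak fits:  ", fits(70, "bf16", gpus=1, hbm_gib_per_gpu=141))
```

When the second check fails while the first passes, a pre-quantized checkpoint is the only way to serve on that device count.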

The operator-pain-point shortlist

Internalize these before debugging accuracy / throughput regressions:

  1. --kv-cache-dtype fp8 on MLA models → garbage on multi-turn (#38652). Unresolved. Avoid on DeepSeek, GLM-4.5/4.6/4.7, Kimi K2 until FP8-KV MLA follow-ups land.
  2. Gemma 4 FP8-block → logit saturation / repetitive garbage (#39407, #39049). Use non-block FP8 or FP16.
  3. NVFP4 on Qwen3-Next / hybrid-attention models silently corrupted output when quantization_config.ignore missed linear_attn layers (#40252, fixed + closed 2026-04-20). The underlying pattern still applies to any new hybrid-attention model: always audit the ignore list when quantizing non-standard architectures.
  4. Online FP8 drops bias weights (#39663). Any target with bias terms → use a pre-quantized checkpoint.
  5. Dynamic FP8 + LoRA-merged model on B200 → non-deterministic degenerate output (#39662). Pin static FP8.
  6. SM120 (RTX 5090, 6000 Pro) is not a datacenter NVFP4 MoE target (#35065, #31085): the full kernel set is SM100 / SM103 only. Desktop Blackwell is production-ready only for fp8.
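
The ignore-list audit from item 3 can be automated. The sketch below assumes ignore entries are either exact module names or regexes prefixed with `re:` (matched from the start of the name), and that layers needing protection can be identified by a substring such as `linear_attn`. Both are assumptions to confirm against your checkpoint's quantization_config; `unignored_modules` is a hypothetical helper.

```python
import re


def unignored_modules(module_names, ignore, must_skip=("linear_attn",)):
    """List modules that should stay unquantized but are missing from ignore."""

    def is_ignored(name):
        for entry in ignore:
            if entry.startswith("re:"):
                if re.match(entry[3:], name):
                    return True
            elif entry == name:
                return True
        return False

    return [
        name
        for name in module_names
        if any(tag in name for tag in must_skip) and not is_ignored(name)
    ]


# Hypothetical module names for a hybrid-attention model.
modules = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]
print(unignored_modules(modules, ignore=["lm_head"]))
```

An empty result means every sensitive layer is covered; anything returned is a candidate for the silent-corruption pattern described above.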

The hardware-/version-gated traps (B300 TRTLLM hang, ModelOpt-vs-compressed-tensors export drift, MXFP4-linear-falls-back-to-BF16, A100 TurboQuant crash, Qwen3.5 v0.18 KV regression, MXFP8+DeepGEMM pre-v0.19 crash) live in the full triage playbook with symptoms → PR → workaround: references/troubleshooting.md.

Version-gate highlights

Full matrix in references/version-gates.md. Load-bearing ones:

  • v0.21 — current stable v0.21.0 (2026-05-15); v0.20.0 stable shipped 2026-04-27, followed by v0.20.1 / v0.20.2. Run v0.21.0+ for production. Quantization-layer churn continues — re-verify any v0.19-specific claim on upgrade.
  • v0.19 — online MXFP8, CompressedTensorsW8A8Mxfp8, ROCm AWQ Marlin, TurboQuant KV, DeepGemm E8M0 fix for Qwen3.5 FP8 on Blackwell, --calculate-kv-scales deprecation, Gemma 4 quantized MoE, B300 / GB300 fixes.
  • v0.18 — FP8 KV in Triton MLA decode, FlashInfer Sparse MLA FP8, ModelOpt MXFP8 MoE, AMD Quark W4A8 MXFP4/FP8, MLA crash with AWQ/GPTQ fix.
  • v0.17 — per-head KV scales, SM100 MXFP8 kernels, compressed-tensors as ground-truth, ModelOpt mixed precision, Llama-4 attention quant.
  • v0.16 — NVFP4/FP8 on Turing via emulation, TP>4 for FP4 GEMM, ModelOpt MXFP8 dense.
  • v0.15 — MXFP4 W4A16 for compressed-tensors MoE, FP4 kernel optimization (+65 % on SM100F via 256-bit loads).
  • v0.14 — Online quantization redesign, MXFP4 W4A16 for dense.
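
A few of these gates can be turned into a startup guard. The feature keys below are hypothetical labels for entries in the list above, and the parser only looks at the major and minor components of the version string; treat it as a sketch, not a complete version matrix.

```python
# Minimum vLLM (major, minor) per feature, taken from the list above.
MIN_VERSION = {
    "online_quantization": (0, 14),
    "per_head_kv_scales": (0, 17),
    "online_mxfp8": (0, 19),
    "turboquant_kv": (0, 19),
}


def parse_major_minor(version: str) -> tuple:
    """Turn 'v0.21.0' or '0.19.1rc1' into (0, 21) or (0, 19)."""
    parts = version.strip().lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def available(feature: str, version: str) -> bool:
    """True if the given vLLM version is at or past the feature's gate."""
    return parse_major_minor(version) >= MIN_VERSION[feature]


print(available("turboquant_kv", "v0.18.2"), available("turboquant_kv", "v0.21.0"))
```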

What to read next

  • references/formats.md — per-format deep dive: kernels, config JSON shapes, min-capability, known caveats.
  • references/llm-compressor.md — recipe cookbook: FP8_DYNAMIC / W4A16 / AWQ / NVFP4A16 / KV-cache FP8 / model-free PTQ commands with exact calibration budgets and output layouts.
  • references/modelopt.md — ModelOpt PTQ (hf_ptq.py) + speculative-decoding training (EAGLE-3, dflash, MTP constraints) + vLLM loader compatibility.
  • references/kernels.md — kernel × format × SM dispatch map (Marlin / CUTLASS / DeepGEMM / FlashInfer / TRTLLM / Qutlass / Machete / Triton / Exllamav2).
  • references/kv-cache.md — KV-cache quantization: dtypes, per-token-head scales, attention-backend compatibility, calibration.
  • references/troubleshooting.md — symptom → known-issue → fix playbook.
  • references/version-gates.md — release-by-release quantization changes, v0.14 → v0.21.

External references

Load the source, don't paraphrase. When in doubt, read the vLLM source: vllm/model_executor/layers/quantization/ is the ground truth, and the quantization layer churns fast enough that cached knowledge rots inside a release cycle.

Install via CLI
npx skills add https://github.com/air-gapped/skills --skill vllm-quantization