ptq


This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

By NVIDIA. Updated 6/8/2026


ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.

Step 1 — Environment

Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:

  • Whether the ModelOpt source is available
  • Whether execution is local or remote (plus the cluster config if remote)
  • The execution environment: SLURM, Docker + GPU, or bare GPU
  • Whether the launcher is available
  • Which workspace to use

Step 2 — Is the model supported?

Check the support table in examples/llm_ptq/README.md for verified HF models.

  • Listed → supported, use hf_ptq.py (step 4A/4B)
  • Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)

Step 2.5 — Check for model-specific dependencies

If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:

grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u

Known dependency patterns:

| Import found | Packages to install |
| --- | --- |
| from mamba_ssm / from causal_conv1d | mamba-ssm causal-conv1d (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:

  • Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section — ptq.sh installs them automatically
  • Manual (4A): unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
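
The import inspection in this step can also be sketched in Python. The version below is a minimal sketch, not part of ModelOpt: the helper names `top_level_imports` and `missing_imports` are hypothetical. Unlike the line-anchored grep, it also sees imports nested inside functions or try blocks, and it reports only packages that cannot be found in the current environment.

```python
import ast
import importlib.util
from pathlib import Path


def top_level_imports(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from .utils import x"
            names.add(node.module.split(".")[0])
    return names


def missing_imports(model_path: str) -> set[str]:
    """List imported top-level packages that are not installed here."""
    found = set()
    for file in Path(model_path).glob("modeling_*.py"):
        found |= top_level_imports(file.read_text(encoding="utf-8"))
    return {name for name in found if importlib.util.find_spec(name) is None}
```

The result contains import names, which can differ from pip package names (for example `mamba_ssm` is installed as `mamba-ssm`), so map them through the table of known dependency patterns before installing.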

Step 3 — Choose quantization format

First, check for a model-specific recipe:

ls modelopt_recipes/models/ 2>/dev/null
ls modelopt_recipes/huggingface/<model_type>/ptq/ 2>/dev/null  # per-architecture recipes; read <model_type> from the local config.json (for a Hub ID, load it with AutoConfig.from_pretrained)

If a model-specific recipe exists, prefer --recipe <path> — but inspect its include/exclude patterns rather than assuming (e.g. for VLMs, confirm the vision tower is actually excluded).
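
For a local checkpoint, the per-architecture lookup can be sketched as below. This is an illustrative helper, not a ModelOpt API: the function name `recipe_dir_for` and the `recipes_root` parameter are assumptions, and only the `modelopt_recipes/huggingface/<model_type>/ptq/` layout comes from the commands above.

```python
import json
from pathlib import Path
from typing import Optional


def recipe_dir_for(model_path: str, recipes_root: str = "modelopt_recipes") -> Optional[Path]:
    """Return the per-architecture PTQ recipe directory if it exists, else None."""
    config_file = Path(model_path) / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    model_type = config.get("model_type")
    if not model_type:
        return None
    candidate = Path(recipes_root) / "huggingface" / model_type / "ptq"
    return candidate if candidate.is_dir() else None
```

A `None` result only means no per-architecture directory was found; `modelopt_recipes/models/` still needs a separate look.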

If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):

  • Blackwell (B100/B200/GB200): nvfp4 variants
  • Hopper (H100/H200) or older: fp8 or int4_awq

Use --qformat <name> (e.g., --qformat nvfp4). Format definitions: modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats — --qformat is the simpler way to use them.
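
The GPU-based rule above can be written down as a small lookup. This is a rough sketch: `suggest_qformats` is a hypothetical helper, and matching on the product name is an assumption made for illustration. The returned strings are format families; the README lists the exact variants.

```python
def suggest_qformats(gpu_name: str) -> list[str]:
    """Map a GPU product name to the format families suggested in this step."""
    blackwell_tags = ("B100", "B200", "GB200")  # the Blackwell parts named above
    name = gpu_name.upper()
    if any(tag in name for tag in blackwell_tags):
        return ["nvfp4"]
    # Hopper (H100/H200) or older
    return ["fp8", "int4_awq"]


print(suggest_qformats("NVIDIA B200"))
print(suggest_qformats("NVIDIA H100 80GB HBM3"))
```

The name passed in would typically come from `nvidia-smi` or `torch.cuda.get_device_name()`. Treat the result as a starting point and confirm against the support matrix.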

Before running PTQ, sanity-check the selected qformat/recipe against the model structure. Inspect the recipe's include/exclude patterns and summarize which layer groups will be quantized and approximately how many modules/layers match (attention projections, MLP projections, experts, etc.). If the match count is 0, or far smaller than expected for the model, stop and fix the recipe or ask the user before launching calibration.
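
One way to produce that summary is to match the recipe's wildcards against the model's module names and count hits per layer group. The sketch below assumes the include/exclude patterns behave like shell-style wildcards (Python's `fnmatch`); the real recipe schema may differ, and `coverage_by_group`, the group labels, and the sample module names are all hypothetical.

```python
from fnmatch import fnmatchcase


def is_selected(name, include, exclude):
    """True if name matches any include wildcard and no exclude wildcard."""
    return any(fnmatchcase(name, p) for p in include) and not any(
        fnmatchcase(name, p) for p in exclude
    )


def coverage_by_group(module_names, include, exclude, groups):
    """Count selected modules per layer group; groups maps label -> wildcard."""
    counts = {label: 0 for label in groups}
    for name in module_names:
        if not is_selected(name, include, exclude):
            continue
        for label, pattern in groups.items():
            if fnmatchcase(name, pattern):
                counts[label] += 1
    return counts


# Hypothetical names; in practice take them from model.named_modules().
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]
groups = {"attention": "*self_attn*", "mlp": "*mlp*", "experts": "*experts*"}
counts = coverage_by_group(
    names, include=["*self_attn*", "*mlp*"], exclude=["*lm_head*"], groups=groups
)
print(counts)
if sum(counts.values()) == 0:
    print("No modules matched: fix the recipe before launching calibration.")
```

A zero for a group the model does contain (for example `experts` on an MoE model) is the signal to stop and fix the recipe.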

VLMs: generic *mlp*/*experts* recipes also match the vision tower (model.visual.*); quantizing the ViT silently breaks image benchmarks. Use the huggingface/<model_type>/ptq/ recipe or add *visual*/*vision_tower* excludes, then verify in Step 5 — see references/checkpoint-validation.md.
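
The vision-tower leak is easy to reproduce with wildcard matching. This is a small sketch under the assumption that recipe patterns behave like `fnmatch` wildcards; `quantized_vision_modules` and the sample module names are hypothetical.

```python
from fnmatch import fnmatchcase

VISION_MARKERS = ("*visual*", "*vision_tower*")


def quantized_vision_modules(module_names, include, exclude):
    """Return vision-tower modules that the patterns would still quantize."""
    hits = []
    for name in module_names:
        selected = any(fnmatchcase(name, p) for p in include) and not any(
            fnmatchcase(name, p) for p in exclude
        )
        if selected and any(fnmatchcase(name, m) for m in VISION_MARKERS):
            hits.append(name)
    return hits


# Hypothetical names; in practice take them from model.named_modules().
names = [
    "model.visual.blocks.0.mlp.fc1",
    "model.language_model.layers.0.mlp.up_proj",
]
print(quantized_vision_modules(names, include=["*mlp*"], exclude=[]))
print(quantized_vision_modules(names, include=["*mlp*"], exclude=["*visual*", "*vision_tower*"]))
```

The first call reports the ViT MLP because `*mlp*` matches it; the second returns an empty list once the excludes are added.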

If the source checkpoint is already quantized and the requested recipe/config reduces quantization coverage, confirm that intent with the user before running. For example, if an FP8 checkpoint is used as input and the recipe excludes some layers so they would fall back to BF16 instead of staying quantized, call out the affected layer groups and ask whether that FP8-to-BF16 fallback is intended.
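
A coverage reduction like this can be flagged by comparing per-group precision before and after. The sketch below is illustrative: `coverage_reductions` is a hypothetical helper, and the precision maps are assumed to have been gathered by hand or by a separate inspection script.

```python
def coverage_reductions(source_precision, planned_precision, fallback="bf16"):
    """Return layer groups quantized in the source but planned at the fallback precision."""
    reduced = {}
    for group, before in source_precision.items():
        # A group the recipe does not mention is assumed to fall back.
        after = planned_precision.get(group, fallback)
        if before != fallback and after == fallback:
            reduced[group] = (before, after)
    return reduced


source = {"attention": "fp8", "mlp": "fp8", "lm_head": "bf16"}
planned = {"attention": "fp8", "mlp": "bf16", "lm_head": "bf16"}
print(coverage_reductions(source, planned))
```

Any non-empty result names the layer groups to raise with the user before running.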

NVFP4 can be calibrated on Hopper but requires Blackwell for inference.

Step 4 — Run PTQ

Goal: checkpoint on disk (.safetensors + config.json).

For listed models (4A/4B): run full calibration directly (--calib_size 512). For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.

Which path?

In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  │
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
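
The same decision tree, written as a function for reference. The environment labels (`"slurm"`, `"local_docker"`, `"remote_docker"`, `"bare_gpu"`) are names chosen for this sketch, not identifiers used by ModelOpt or the launcher.

```python
def choose_path(listed: bool, environment: str) -> str:
    """Return '4A', '4B', or '4C' following the decision tree above."""
    if not listed:
        return "4C"  # unlisted model
    if environment in {"slurm", "local_docker"}:
        return "4B"  # launcher
    if environment in {"remote_docker", "bare_gpu"}:
        return "4A"  # manual
    raise ValueError(f"unknown environment: {environment!r}")


print(choose_path(listed=True, environment="slurm"))
print(choose_path(listed=False, environment="bare_gpu"))
```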

4A — Direct: supported model, manual execution

pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>

Run --help for all options.
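
When scripting the run, the command can be assembled programmatically. This sketch uses only the four flags shown above; `hf_ptq_command` is a hypothetical wrapper, and the model and output paths are placeholders. For unlisted models it builds the smoke-test command (`--calib_size 4`) as well as the full one.

```python
import shlex


def hf_ptq_command(model: str, qformat: str, export_path: str, calib_size: int = 512) -> list[str]:
    """Build the hf_ptq.py argument list from the flags shown above."""
    return [
        "python", "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


# Placeholder paths, for illustration only.
smoke = hf_ptq_command("/models/demo", "fp8", "/out/demo-fp8-smoke", calib_size=4)
full = hf_ptq_command("/models/demo", "fp8", "/out/demo-fp8")
print(shlex.join(smoke))
print(shlex.join(full))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the ModelOpt source root; run the full command only after the smoke test succeeds.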

For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).

4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.

cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes

The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).

4C — Unlisted model

Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.

For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.

Monitoring

After job submission, register the job and set up monitoring per the monitor skill.

Step 5 — Verify output

ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors

Report the path and size to the user.
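
A scripted version of the same check, as a minimal sketch. `summarize_checkpoint` is a hypothetical helper; it looks only at top-level files and does not validate tokenizer files or shard contents.

```python
from pathlib import Path


def summarize_checkpoint(output_path: str) -> dict:
    """Summarize the files a quantized checkpoint directory is expected to hold."""
    root = Path(output_path)
    files = [p for p in root.iterdir() if p.is_file()]
    return {
        "has_config": (root / "config.json").is_file(),
        "safetensors_shards": sorted(p.name for p in files if p.suffix == ".safetensors"),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

An empty `safetensors_shards` list or a missing `config.json` means the export did not complete.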

Post-quantization validation

This is a required gate before any deployment or evaluation submission. Do not submit an eval, start a serving job, or hand off the checkpoint as ready until the gate has passed.

Read references/checkpoint-validation.md and perform all three validation groups on the exact checkpoint path that will be deployed/evaluated:

  1. Check output size and estimated bits per weight against the baseline/source checkpoint.
  2. Check quantized-weight coverage against the requested qformat/recipe/config.
  3. Check metadata consistency against the baseline/source model.

Report the gate result before moving on. The report must include source size, output size, output/source size ratio, layer precision counts (for example NVFP4, FP8, INT4, BF16/unquantized excluded, unexpected unquantized, declaration mismatches), and metadata diffs. If the output/source ratio is >= 1.0 for a compression recipe, if any intended layer group is missing quantization, or if metadata changed unexpectedly, stop and fix the checkpoint or ask the user before proceeding.
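
The stop conditions in this gate reduce to a few comparisons. The sketch below is not a replacement for references/checkpoint-validation.md: `validation_gate` and `bits_per_weight` are hypothetical helpers, and the inputs (sizes, missing groups, metadata diffs) are assumed to have been collected by the checks listed above.

```python
def bits_per_weight(checkpoint_bytes: int, num_parameters: int) -> float:
    """Rough estimate; scales and unquantized layers are included in the byte count."""
    return checkpoint_bytes * 8 / num_parameters


def validation_gate(source_bytes, output_bytes, missing_groups, metadata_diffs,
                    compression_recipe=True):
    """Return (output/source ratio, list of failures); an empty list means the gate passed."""
    ratio = output_bytes / source_bytes
    failures = []
    if compression_recipe and ratio >= 1.0:
        failures.append(f"output/source ratio {ratio:.2f} >= 1.0")
    if missing_groups:
        failures.append("missing quantization: " + ", ".join(missing_groups))
    if metadata_diffs:
        failures.append("unexpected metadata diffs: " + ", ".join(metadata_diffs))
    return ratio, failures


ratio, failures = validation_gate(16_000, 4_000, missing_groups=[], metadata_diffs=[])
print(ratio, failures or "gate passed")
```

If `failures` is non-empty, stop and fix the checkpoint or ask the user, as described above.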

Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.

Key API Rules

  • mtq.register() classes must define _setup() and call it from __init__
  • Call mto.enable_huggingface_checkpointing() before quantization
  • Wildcard *gate* matches too broadly — use *mlp.gate* or *router*
  • VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl() — no manual VLM handling needed in most cases
  • FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True) which wastes ~2x memory. See references/unsupported-models.md for details
  • Custom quantizer names must end with _input_quantizer or _weight_quantizer
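
The `*gate*` rule is easiest to see on an MoE layer. The sketch below assumes the wildcards behave like Python's `fnmatch`; the module names are hypothetical but follow common Hugging Face naming.

```python
from fnmatch import fnmatchcase

# Hypothetical MoE module names.
moe_names = [
    "model.layers.0.mlp.gate",                  # router
    "model.layers.0.mlp.experts.0.gate_proj",   # expert projection
    "model.layers.0.mlp.experts.0.up_proj",
]


def matches(pattern):
    """Return the module names selected by one wildcard pattern."""
    return [name for name in moe_names if fnmatchcase(name, pattern)]


print(matches("*gate*"))      # router and gate_proj: too broad
print(matches("*mlp.gate*"))  # router only
```

With `*gate*`, a pattern meant for the router also selects every expert's `gate_proj`; `*mlp.gate*` does not, because `mlp.gate` is not a substring of `mlp.experts.0.gate_proj`.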

Common Pitfalls

  • Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py
  • Transformers version: New models may need a newer version of transformers than what's installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades — see references/slurm-setup-ptq.md for workarounds
  • Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative
  • NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
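
For the transformers version pitfall, a quick comparison between the checkpoint's recorded version and the installed one can be sketched as below. The helper names are hypothetical, and `transformers_version` in config.json records the version that saved the checkpoint, so treat a mismatch as a hint rather than a strict minimum.

```python
import re


def version_tuple(version: str) -> tuple:
    """Leading numeric release components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def needs_newer_transformers(config_version: str, installed_version: str) -> bool:
    """True if the installed release is older than the one recorded in config.json."""
    want, have = version_tuple(config_version), version_tuple(installed_version)
    width = max(len(want), len(have))
    # Pad with zeros so that '4.45' and '4.45.0' compare as equal.
    want += (0,) * (width - len(want))
    have += (0,) * (width - len(have))
    return have < want


print(needs_newer_transformers("4.57.0", "4.45.2"))
```

The installed version can be read with `importlib.metadata.version("transformers")`, and the recorded one from the `transformers_version` key in the model's config.json.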

References

| Reference | When to read |
| --- | --- |
| skills/common/environment-setup.md | Step 1: always |
| skills/common/workspace-management.md | Step 1: always |
| references/launcher-guide.md | Step 4B only (launcher path) |
| tools/launcher/CLAUDE.md | Step 4B only, if you need more launcher detail |
| references/unsupported-models.md | Step 4C only (unlisted model) |
| references/checkpoint-validation.md | Step 5: mandatory post-PTQ gate before deployment/evaluation |
| skills/common/remote-execution.md | Step 4A/4C only, if target is remote |
| skills/common/slurm-setup.md | Step 4A/4C only, if using SLURM manually (not launcher) |
| references/slurm-setup-ptq.md | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| examples/llm_ptq/README.md | Step 3: support matrix, CLI flags, accuracy |
| modelopt/torch/quantization/config.py | Step 3: format definitions |
| modelopt/torch/export/model_utils.py | Step 4C: TRT-LLM export type mapping |
| modelopt_recipes/ | Step 3: pre-built recipes |