name: ptq
description: This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
ModelOpt Post-Training Quantization
Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.
Step 1 — Environment
Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:
- ModelOpt source is available
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher available?
- Which workspace to use
Step 2 — Is the model supported?
Check the support table in examples/llm_ptq/README.md for verified HF models.
- Listed → supported, use hf_ptq.py (step 4A/4B)
- Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)
Step 2.5 — Check for model-specific dependencies
If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
Known dependency patterns:
| Import found | Packages to install |
|---|---|
| from mamba_ssm / from causal_conv1d | mamba-ssm causal-conv1d (Mamba/hybrid models: NemotronH, Jamba) |
If extra deps are needed:
- Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section; ptq.sh installs them automatically
- Manual (4A): unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
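The check in this step can also be scripted. The following is a minimal sketch, assuming the model directory holds config.json next to its modeling_*.py files; the helper names are hypothetical and not part of ModelOpt. It mirrors the grep above by looking only at unindented import lines.

```python
import importlib.util
import json
import re
from pathlib import Path

# Matches unindented "import x" / "from x import y" lines, like the grep above.
# Relative imports ("from .config import ...") are skipped on purpose.
IMPORT_RE = re.compile(r"^(?:from|import)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


def uses_remote_code(model_dir: str) -> bool:
    """True if config.json declares an auto_map, i.e. custom modeling code."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "auto_map" in config


def top_level_imports(model_dir: str) -> set[str]:
    """Collect root package names imported by modeling_*.py files."""
    roots: set[str] = set()
    for path in Path(model_dir).glob("modeling_*.py"):
        roots.update(IMPORT_RE.findall(path.read_text()))
    return roots


def missing_packages(model_dir: str) -> list[str]:
    """Root packages that cannot be resolved in the current environment."""
    return sorted(
        name
        for name in top_level_imports(model_dir)
        if importlib.util.find_spec(name) is None
    )
```

The result lists import names (for example mamba_ssm), which can differ from the pip package name (mamba-ssm), so map them through the table above before installing.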
Step 3 — Choose quantization format
First, check for a model-specific recipe:
ls modelopt_recipes/models/ 2>/dev/null
ls modelopt_recipes/huggingface/<model_type>/ptq/ 2>/dev/null # per-arch; <model_type> from local config.json (Hub ID: AutoConfig.from_pretrained)
If a model-specific recipe exists, prefer --recipe <path> — but inspect its include/exclude patterns rather than assuming (e.g. for VLMs, confirm the vision tower is actually excluded).
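The per-architecture lookup can be sketched in Python as follows. This assumes a local checkpoint directory with a config.json that contains model_type, and a modelopt_recipes/ tree laid out as in the ls commands above; find_model_recipes is a hypothetical helper, and no recipe file extension is assumed.

```python
import json
from pathlib import Path


def find_model_recipes(model_dir: str, recipes_root: str = "modelopt_recipes") -> list[Path]:
    """List files under <recipes_root>/huggingface/<model_type>/ptq/, if present."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    recipe_dir = Path(recipes_root) / "huggingface" / config["model_type"] / "ptq"
    if not recipe_dir.is_dir():
        return []  # no per-architecture recipe: fall back to --qformat
    return sorted(p for p in recipe_dir.iterdir() if p.is_file())
```

An empty result only means no per-architecture recipe was found; modelopt_recipes/models/ still needs a separate look.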
If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):
- Blackwell (B100/B200/GB200): nvfp4 variants
- Hopper (H100/H200) or older: fp8 or int4_awq
Use --qformat <name> (e.g., --qformat nvfp4). Format definitions: modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats — --qformat is the simpler way to use them.
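As an illustration of the GPU-based choice, here is a small sketch that maps a GPU name string to the candidate formats listed above. The substring matching and the function name are assumptions made for this example; the README remains the authority on which format suits a given model.

```python
def suggest_qformats(gpu_name: str) -> list[str]:
    """Map a GPU name to the candidate --qformat values from the list above."""
    name = gpu_name.upper()
    blackwell_tags = ("GB200", "B200", "B100")
    if any(tag in name for tag in blackwell_tags):
        return ["nvfp4"]
    # Hopper (H100/H200) or older
    return ["fp8", "int4_awq"]
```

For example, suggest_qformats("NVIDIA B200") yields the nvfp4 candidate, while an H100 name string falls through to fp8 and int4_awq.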
Before running PTQ, sanity-check the selected qformat/recipe against the model structure. Inspect the recipe's include/exclude patterns and summarize which layer groups will be quantized and approximately how many modules/layers match (attention projections, MLP projections, experts, etc.). If the match count is 0, or far smaller than expected for the model, stop and fix the recipe or ask the user before launching calibration.
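One way to run this sanity check is to count pattern matches over the model's module names. The sketch below assumes the recipe patterns behave like fnmatch-style wildcards; the module names and the grouping rules are hypothetical and should be adapted to the actual model.

```python
from collections import Counter
from fnmatch import fnmatch

# Hypothetical grouping rules, checked in order; adjust to the model's naming.
GROUPS = [("experts", "*experts*"), ("attention", "*self_attn*"), ("mlp", "*mlp*")]

# Hypothetical module names for a two-layer dense model.
NAMES = [
    f"model.layers.{i}.{suffix}"
    for i in range(2)
    for suffix in ("self_attn.q_proj", "self_attn.o_proj", "mlp.gate_proj", "mlp.down_proj")
] + ["lm_head"]


def selected(name, include, exclude=()):
    """True if the name matches an include pattern and no exclude pattern."""
    return any(fnmatch(name, p) for p in include) and not any(
        fnmatch(name, p) for p in exclude
    )


def coverage_summary(module_names, include, exclude=()):
    """Count selected modules per layer group."""
    counts = Counter()
    for name in module_names:
        if selected(name, include, exclude):
            group = next((g for g, pat in GROUPS if fnmatch(name, pat)), "other")
            counts[group] += 1
    return dict(counts)


summary = coverage_summary(NAMES, ["*self_attn*", "*mlp*"], ["*lm_head*"])
```

With a loaded PyTorch model, the names could come from `[n for n, _ in model.named_modules()]`. An empty summary corresponds to the match-count-0 case that should stop the run.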
VLMs: generic *mlp*/*experts* recipes also match the vision tower (model.visual.*); quantizing the ViT silently breaks image benchmarks. Use the huggingface/<model_type>/ptq/ recipe or add *visual*/*vision_tower* excludes, then verify in Step 5 — see references/checkpoint-validation.md.
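The vision-tower overlap is easy to demonstrate. This sketch assumes fnmatch-style wildcard matching and uses hypothetical module names; vision_leaks is not a ModelOpt function.

```python
from fnmatch import fnmatch

# Excludes named in the paragraph above.
VISION_EXCLUDES = ["*visual*", "*vision_tower*"]


def vision_leaks(module_names, include, exclude=()):
    """Vision-tower-looking modules that the include/exclude patterns would quantize."""
    return [
        n
        for n in module_names
        if any(fnmatch(n, p) for p in include)
        and not any(fnmatch(n, p) for p in exclude)
        and any(fnmatch(n, p) for p in VISION_EXCLUDES)
    ]


# Hypothetical VLM module names.
names = [
    "model.visual.blocks.0.mlp.fc1",
    "model.language_model.layers.0.mlp.up_proj",
]
leaks_generic = vision_leaks(names, ["*mlp*"])
leaks_fixed = vision_leaks(names, ["*mlp*"], VISION_EXCLUDES)
```

A non-empty result for the recipe in use is a signal to add the excludes before calibration rather than discovering the problem in image benchmarks.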
If the source checkpoint is already quantized and the requested recipe/config reduces quantization coverage, confirm that intent with the user before running. For example, if an FP8 checkpoint is used as input and the recipe excludes some layers so they would fall back to BF16 instead of staying quantized, call out the affected layer groups and ask whether that FP8-to-BF16 fallback is intended.
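A quick way to notice an already-quantized source is to look for a quantization entry in its config.json. The sketch below assumes the checkpoint records this under a quantization_config key, as Hugging Face Transformers does for its quantized checkpoints; other exporters may record it elsewhere, so an empty result is not proof of an unquantized source.

```python
import json
from pathlib import Path


def source_quantization(model_dir: str):
    """Return the quantization_config dict from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config")
```

If this returns a value, compare the layers it covers with the requested recipe and raise any reduction in coverage with the user.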
NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
Step 4 — Run PTQ
Goal: checkpoint on disk (.safetensors + config.json).
For listed models (4A/4B): run full calibration directly (--calib_size 512).
For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.
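The smoke-test-then-full sequence for unlisted models could be scripted roughly as below. The flags are the ones shown in step 4A; the "-smoke" export suffix and the function names are assumptions made for this sketch.

```python
import subprocess
import sys


def ptq_command(model, qformat, export_path, calib_size):
    """Build the hf_ptq.py command line used in step 4A."""
    return [
        sys.executable, "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def run_unlisted(model, qformat, export_path, run=subprocess.run):
    """Smoke test with 4 samples, then full calibration with 512."""
    # check=True raises CalledProcessError on a non-zero exit, so the full
    # calibration never starts if the smoke test fails.
    run(ptq_command(model, qformat, export_path + "-smoke", 4), check=True)
    run(ptq_command(model, qformat, export_path, 512), check=True)
```

The run parameter exists only so the sequence can be exercised without launching a real job.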
Which path?
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
│ Local Docker + GPU? ────────→ LAUNCHER (4B)
│ Remote Docker (no SLURM)? ──→ MANUAL (4A)
│ Bare GPU (local or remote)? → MANUAL (4A)
│
└→ NOT LISTED ──→ UNLISTED MODEL (4C)
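The decision tree above can be written as a small function. The environment labels are hypothetical names chosen for this sketch.

```python
def choose_path(listed: bool, environment: str) -> str:
    """Return the Step 4 path (4A, 4B, or 4C) for a model and environment."""
    if not listed:
        return "4C"  # unlisted model
    launcher_envs = {"slurm", "local_docker"}
    manual_envs = {"remote_docker", "bare_gpu"}
    if environment in launcher_envs:
        return "4B"
    if environment in manual_envs:
        return "4A"
    raise ValueError(f"unknown environment: {environment!r}")
```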
4A — Direct: supported model, manual execution
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <model> \
--qformat <format> \
--calib_size 512 \
--export_path <output>
Run --help for all options.
For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).
4B — Launcher: supported model on SLURM or local Docker
Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.
cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes
The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).
4C — Unlisted model
Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.
Monitoring
After job submission, register the job and set up monitoring per the monitor skill.
Step 5 — Verify output
ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors
Report the path and size to the user.
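The same check can be done programmatically. This is a minimal sketch, assuming tokenizer files share the tokenizer* name prefix; check_output is a hypothetical helper.

```python
from pathlib import Path


def check_output(output_path: str):
    """Return (problems, total_safetensors_bytes) for an exported checkpoint."""
    root = Path(output_path)
    shards = sorted(root.glob("*.safetensors"))
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any(root.glob("tokenizer*")):
        problems.append("no tokenizer files")
    if not shards:
        problems.append("no .safetensors files")
    total_bytes = sum(p.stat().st_size for p in shards)
    return problems, total_bytes
```

An empty problems list together with the total size gives the path-and-size report for the user.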
Post-quantization validation
This is a required gate before any deployment or evaluation submission. Do not submit an eval, start a serving job, or hand off the checkpoint as ready until the gate has passed.
Read references/checkpoint-validation.md and perform all three validation groups on the exact checkpoint path that will be deployed/evaluated:
- Check output size and estimated bits per weight against the baseline/source checkpoint.
- Check quantized-weight coverage against the requested qformat/recipe/config.
- Check metadata consistency against the baseline/source model.
Report the gate result before moving on. The report must include source size, output size, output/source size ratio, layer precision counts (for example NVFP4, FP8, INT4, BF16/unquantized excluded, unexpected unquantized, declaration mismatches), and metadata diffs. If the output/source ratio is >= 1.0 for a compression recipe, if any intended layer group is missing quantization, or if metadata changed unexpectedly, stop and fix the checkpoint or ask the user before proceeding.
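The size part of the gate comes down to simple arithmetic, sketched below. The parameter count is an input the caller must supply, and the bits-per-weight figure is an estimate because the output size also includes scales and any layers left unquantized; references/checkpoint-validation.md defines the full procedure.

```python
def size_gate(source_bytes: int, output_bytes: int, num_params: int) -> dict:
    """Compute the output/source ratio and an estimated bits per weight."""
    ratio = output_bytes / source_bytes
    bits_per_weight = output_bytes * 8 / num_params
    return {
        "ratio": ratio,
        "bits_per_weight": bits_per_weight,
        # A compression recipe must shrink the checkpoint.
        "passed": ratio < 1.0,
    }


# Illustrative numbers only: 8e9 parameters, 16 GB source, 5 GB output.
report = size_gate(16_000_000_000, 5_000_000_000, 8_000_000_000)
```

With these illustrative inputs the ratio is 0.3125 and the estimate is 5.0 bits per weight; a ratio of 1.0 or more fails the gate.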
Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
Key API Rules
- mtq.register() classes must define _setup() and call it from __init__
- Call mto.enable_huggingface_checkpointing() before quantization
- Wildcard *gate* matches too broadly; use *mlp.gate* or *router*
- VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl(); no manual VLM handling needed in most cases
- FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True), which wastes ~2x memory. See references/unsupported-models.md for details
- Custom quantizer names must end with _input_quantizer or _weight_quantizer
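The wildcard rule above is easiest to see on concrete names. This sketch assumes fnmatch-style matching and uses hypothetical MoE module names.

```python
from fnmatch import fnmatch

# Hypothetical MoE module names, for illustration only.
NAMES = [
    "model.layers.0.mlp.gate",                     # router
    "model.layers.0.mlp.experts.0.gate_proj",      # expert projection
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.shared_expert.gate_proj",  # shared-expert projection
    "model.layers.0.self_attn.q_proj",
]


def matches(pattern, names=NAMES):
    """Names selected by an fnmatch-style wildcard pattern."""
    return [n for n in names if fnmatch(n, pattern)]


broad = matches("*gate*")       # router plus every gate_proj
narrow = matches("*mlp.gate*")  # router only, for these names
```

Under the same matching assumption, *mlp.gate* would also select a dense model's mlp.gate_proj, so it is worth checking the pattern against the actual module names before relying on it.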
Common Pitfalls
- Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use the EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py
- Transformers version: New models may need a newer version of transformers than what's installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades; see references/slurm-setup-ptq.md for workarounds
- Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative
- NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
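For the transformers version pitfall, a comparison along these lines can flag a mismatch early. It is a simplified sketch that compares only the leading numeric components and ignores pre-release ordering; the function names are hypothetical.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Leading numeric components only: '4.57.0.dev0' becomes (4, 57, 0)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_upgrade(required: str, installed: str) -> bool:
    """True if the installed version is older than the one in config.json."""
    req, inst = parse_version(required), parse_version(installed)
    width = max(len(req), len(inst))
    pad = lambda v: v + (0,) * (width - len(v))  # treat '4.57' as '4.57.0'
    return pad(inst) < pad(req)


def installed_transformers():
    """Installed transformers version string, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

The required value is the transformers_version field from the model's config.json; if needs_upgrade returns True inside a container, the PIP_CONSTRAINT caveat above applies.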
References
| Reference | When to read |
|---|---|
| skills/common/environment-setup.md | Step 1: always |
| skills/common/workspace-management.md | Step 1: always |
| references/launcher-guide.md | Step 4B only (launcher path) |
| tools/launcher/CLAUDE.md | Step 4B only, if you need more launcher detail |
| references/unsupported-models.md | Step 4C only (unlisted model) |
| references/checkpoint-validation.md | Step 5: mandatory post-PTQ gate before deployment/evaluation |
| skills/common/remote-execution.md | Step 4A/4C only, if target is remote |
| skills/common/slurm-setup.md | Step 4A/4C only, if using SLURM manually (not launcher) |
| references/slurm-setup-ptq.md | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| examples/llm_ptq/README.md | Step 3: support matrix, CLI flags, accuracy |
| modelopt/torch/quantization/config.py | Step 3: format definitions |
| modelopt/torch/export/model_utils.py | Step 4C: TRT-LLM export type mapping |
| modelopt_recipes/ | Step 3: pre-built recipes |