ptq - SKILL.md Agent Skill

name: ptq
description: This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.

Step 1 — Environment

Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:

Whether the ModelOpt source is available
Whether execution is local or remote (plus the cluster config if remote)
The execution environment: SLURM, Docker + GPU, or bare GPU
Whether the launcher is available
Which workspace to use

Step 2 — Is the model supported?

Check the support table in examples/llm_ptq/README.md for verified HF models.

Listed → supported, use hf_ptq.py (step 4A/4B)
Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)

Step 2.5 — Check for model-specific dependencies

If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:

grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u

Known dependency patterns:

| Import found | Packages to install |
|---|---|
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:

Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section — ptq.sh installs them automatically
Manual (4A): unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
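The import check in this step can also be sketched in Python. This is a minimal sketch, not part of ModelOpt: `imported_roots` and `missing_imports` are hypothetical helper names, and the check only reports import names (for example `mamba_ssm`), which may differ from the pip package name (`mamba-ssm`).

```python
import ast
import importlib.util
from pathlib import Path


def imported_roots(source: str) -> set[str]:
    """Return root package names of absolute imports found anywhere in source.

    Unlike the grep above, this walks the whole syntax tree, so it also
    reports imports nested inside functions or try/except blocks.
    """
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from .utils import x"
            roots.add(node.module.split(".")[0])
    return roots


def missing_imports(model_dir: str) -> set[str]:
    """Hypothetical helper: import roots in modeling_*.py not found in this env."""
    roots = set()
    for path in Path(model_dir).glob("modeling_*.py"):
        roots |= imported_roots(path.read_text(encoding="utf-8"))
    return {name for name in roots if importlib.util.find_spec(name) is None}
```

Run `missing_imports("<model_path>")` inside the same container or environment that will run hf_ptq.py; a result from a different environment says nothing about the target one.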

Step 3 — Choose quantization format

First, check for a model-specific recipe:

ls modelopt_recipes/models/ 2>/dev/null
ls modelopt_recipes/huggingface/<model_type>/ptq/ 2>/dev/null  # per-arch; <model_type> from local config.json (Hub ID: AutoConfig.from_pretrained)

If a model-specific recipe exists, prefer --recipe <path> — but inspect its include/exclude patterns rather than assuming (e.g. for VLMs, confirm the vision tower is actually excluded).
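For a local checkpoint, the per-architecture recipe lookup can be sketched as follows. The directory layout is taken from the `ls` command above; the helper names and the example model type are hypothetical, and nested configs (for example a VLM with a separate text config) are not handled.

```python
import json
from pathlib import Path


def read_model_type(model_path: str) -> str:
    """Read model_type from a local checkpoint's config.json."""
    config = json.loads((Path(model_path) / "config.json").read_text(encoding="utf-8"))
    return config["model_type"]


def recipe_dir(model_type: str, recipes_root: str = "modelopt_recipes") -> Path:
    """Build the per-architecture recipe directory named in this step."""
    return Path(recipes_root) / "huggingface" / model_type / "ptq"


def list_recipes(model_type: str, recipes_root: str = "modelopt_recipes") -> list[str]:
    """Return recipe file names, or an empty list if the directory is absent."""
    directory = recipe_dir(model_type, recipes_root)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())
```

An empty list means there is no model-specific recipe at that location, so the GPU-based format choice applies instead.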

If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):

Blackwell (B100/B200/GB200): nvfp4 variants
Hopper (H100/H200) or older: fp8 or int4_awq

Use --qformat <name> (e.g., --qformat nvfp4). Format definitions: modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats — --qformat is the simpler way to use them.
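The GPU-based choice can be written down as a small lookup. This is only a sketch of the two rules listed above: the marker strings are the GPU names from this step, the function name is hypothetical, and the README remains the source of truth for which variant to pick.

```python
# GPU name fragments taken from the Blackwell examples in this step.
BLACKWELL_MARKERS = ("B100", "B200", "GB200")


def suggest_qformats(gpu_name: str) -> list[str]:
    """Map a GPU product name to the candidate --qformat values for it."""
    name = gpu_name.upper()
    if any(marker in name for marker in BLACKWELL_MARKERS):
        return ["nvfp4"]
    # Hopper or older, per the list above.
    return ["fp8", "int4_awq"]
```

On a machine with PyTorch installed, the name could come from `torch.cuda.get_device_name(0)`. Blackwell products whose names do not contain one of the markers would fall through to the second branch, so treat the result as a starting point.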

Before running PTQ, sanity-check the selected qformat/recipe against the model structure. Inspect the recipe's include/exclude patterns and summarize which layer groups will be quantized and approximately how many modules/layers match (attention projections, MLP projections, experts, etc.). If the match count is 0, or far smaller than expected for the model, stop and fix the recipe or ask the user before launching calibration.
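One way to produce that summary is to match the recipe's include patterns against the model's module names and count hits per layer group. The sketch below assumes shell-style wildcards behave like Python's `fnmatch`, which may not match ModelOpt's matching exactly; the grouping rules and function name are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical grouping rules; the first matching group wins, so expert
# projections nested under an MLP block are counted as experts only.
LAYER_GROUPS = (
    ("experts", "*experts*"),
    ("attention", "*self_attn*"),
    ("mlp", "*mlp*"),
)


def count_matches(module_names, include_patterns):
    """Count modules selected by include_patterns, bucketed by layer group."""
    counts = {group: 0 for group, _ in LAYER_GROUPS}
    for name in module_names:
        if not any(fnmatchcase(name, pattern) for pattern in include_patterns):
            continue
        for group, group_pattern in LAYER_GROUPS:
            if fnmatchcase(name, group_pattern):
                counts[group] += 1
                break
    return counts
```

With a loaded PyTorch model, `module_names` could be `[name for name, _ in model.named_modules()]`. If every count is 0, the patterns do not fit the model's naming and calibration should not be launched.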

VLMs: generic *mlp*/*experts* recipes also match the vision tower (model.visual.*); quantizing the ViT silently breaks image benchmarks. Use the huggingface/<model_type>/ptq/ recipe or add *visual*/*vision_tower* excludes, then verify in Step 5 — see references/checkpoint-validation.md.
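The overlap is easy to reproduce with plain wildcard matching. This sketch again assumes `fnmatch`-style semantics, and the module names are illustrative rather than taken from a specific model.

```python
from fnmatch import fnmatchcase

# Exclude patterns suggested in this step for vision towers.
VISION_EXCLUDES = ("*visual*", "*vision_tower*")


def select_modules(names, include, exclude=()):
    """Keep names that match an include pattern and no exclude pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


example_names = [
    "model.visual.blocks.0.mlp.fc1",               # vision tower MLP
    "model.language_model.layers.0.mlp.up_proj",   # language model MLP
]
# A generic "*mlp*" include selects both entries; adding the excludes
# leaves only the language-model module.
```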

If the source checkpoint is already quantized and the requested recipe/config reduces quantization coverage, confirm that intent with the user before running. For example, if an FP8 checkpoint is used as input and the recipe excludes some layers so they would fall back to BF16 instead of staying quantized, call out the affected layer groups and ask whether that FP8-to-BF16 fallback is intended.

NVFP4 can be calibrated on Hopper but requires Blackwell for inference.

Step 4 — Run PTQ

Goal: checkpoint on disk (.safetensors + config.json).

For listed models (4A/4B): run full calibration directly (--calib_size 512). For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.

Which path?

In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  │
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
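The same decision tree, written as a function. The environment labels are hypothetical names for the four branches in the diagram.

```python
LAUNCHER_ENVS = {"slurm", "local_docker"}      # SLURM (local or remote), local Docker + GPU
MANUAL_ENVS = {"remote_docker", "bare_gpu"}    # remote Docker without SLURM, bare GPU


def choose_path(listed: bool, environment: str) -> str:
    """Return the Step 4 path ("4A", "4B", or "4C") for a model and environment."""
    if not listed:
        return "4C"
    if environment in LAUNCHER_ENVS:
        return "4B"
    if environment in MANUAL_ENVS:
        return "4A"
    raise ValueError(f"unknown environment: {environment!r}")
```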

4A — Direct: supported model, manual execution

pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>

Run --help for all options.
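When the run is scripted, the command above can be assembled as an argument list, which also makes the smoke-test-then-full pattern for unlisted models explicit. This is a sketch that uses only the flags shown above; `hf_ptq_command` is a hypothetical helper, not part of ModelOpt.

```python
import shlex


def hf_ptq_command(model: str, qformat: str, export_path: str, calib_size: int = 512) -> list[str]:
    """Build the hf_ptq.py invocation shown above as an argv list."""
    return [
        "python", "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


# Smoke test first (calib_size 4), then full calibration (calib_size 512).
smoke_cmd = hf_ptq_command("<model>", "<format>", "<output>", calib_size=4)
full_cmd = hf_ptq_command("<model>", "<format>", "<output>")
print(shlex.join(smoke_cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` raises on a non-zero exit code, so the full run only starts if the smoke test succeeded.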

For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).

4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.

cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes

The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).

4C — Unlisted model

Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.

For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.

Monitoring

After job submission, register the job and set up monitoring per the monitor skill.

Step 5 — Verify output

ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors

Report the path and size to the user.
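The listing check can be automated with a few lines of Python. This is a minimal sketch with hypothetical helper names; it only checks for config.json and at least one .safetensors file and does not replace the validation gate below.

```python
from pathlib import Path


def list_files(output_path: str) -> dict[str, int]:
    """Map each file name in the export directory to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(output_path).iterdir() if p.is_file()}


def summarize_checkpoint(files: dict[str, int]) -> dict:
    """Summarize an export directory listing for the report to the user."""
    shards = sorted(name for name in files if name.endswith(".safetensors"))
    has_config = "config.json" in files
    return {
        "has_config": has_config,
        "shards": shards,
        "total_gib": sum(files.values()) / 2**30,
        "ok": has_config and bool(shards),
    }
```

Usage: `summarize_checkpoint(list_files("<output_path>"))`.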

Post-quantization validation

This is a required gate before any deployment or evaluation submission. Do not submit an eval, start a serving job, or hand off the checkpoint as ready until the gate has passed.

Read references/checkpoint-validation.md and perform all three validation groups on the exact checkpoint path that will be deployed/evaluated:

Check output size and estimated bits per weight against the baseline/source checkpoint.
Check quantized-weight coverage against the requested qformat/recipe/config.
Check metadata consistency against the baseline/source model.

Report the gate result before moving on. The report must include source size, output size, output/source size ratio, layer precision counts (for example NVFP4, FP8, INT4, BF16/unquantized excluded, unexpected unquantized, declaration mismatches), and metadata diffs. If the output/source ratio is >= 1.0 for a compression recipe, if any intended layer group is missing quantization, or if metadata changed unexpectedly, stop and fix the checkpoint or ask the user before proceeding.
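The size part of the gate reduces to two numbers. The sketch below assumes bits per weight is estimated as total checkpoint bits divided by parameter count, which folds scales and unquantized layers into the estimate; references/checkpoint-validation.md may define it differently. Coverage and metadata checks are not covered here.

```python
def size_gate(source_bytes: int, output_bytes: int, num_params: int) -> dict:
    """Compute the output/source ratio and a rough bits-per-weight estimate."""
    ratio = output_bytes / source_bytes
    bits_per_weight = output_bytes * 8 / num_params
    return {
        "ratio": ratio,
        "bits_per_weight": bits_per_weight,
        # For a compression recipe, a ratio >= 1.0 fails the gate.
        "passed": ratio < 1.0,
    }
```

For example, a hypothetical 8B-parameter model exported from a 16 GB source to a 5 GB output gives a ratio of 0.3125 and about 5 bits per weight.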

Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.

Key API Rules

mtq.register() classes must define _setup() and call it from __init__
Call mto.enable_huggingface_checkpointing() before quantization
Wildcard *gate* matches too broadly — use *mlp.gate* or *router*
VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl() — no manual VLM handling needed in most cases
FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True), which wastes ~2x memory. See references/unsupported-models.md for details.
Custom quantizer names must end with _input_quantizer or _weight_quantizer
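Two of these rules can be checked without loading a model. The sketch assumes wildcard patterns behave like Python's `fnmatch`; the module and quantizer names are illustrative.

```python
from fnmatch import fnmatchcase

module_names = [
    "model.layers.0.mlp.gate",                  # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",   # expert MLP projection
]


def matches(pattern: str, names: list[str]) -> list[str]:
    """Return the names selected by a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


def valid_quantizer_name(name: str) -> bool:
    """Check the required suffix for custom quantizer names."""
    return name.endswith(("_input_quantizer", "_weight_quantizer"))


# "*gate*" selects the router and the expert gate_proj;
# "*mlp.gate*" selects only the router in this example.
```

Note that `*mlp.gate*` would still match a dense `mlp.gate_proj` that sits directly under `mlp`, so check the pattern against the actual module names of the model.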

Common Pitfalls

Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use the EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py.
Transformers version: New models may need a newer version of transformers than the one installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades; see references/slurm-setup-ptq.md for workarounds.
Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative.
NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
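The transformers version pitfall can be caught before launching a job by comparing the version recorded in config.json with the installed one. This is a rough sketch with hypothetical helper names: it compares only the leading numeric components, and the recorded `transformers_version` is the version that saved the checkpoint, not necessarily the minimum that can load it.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_upgrade(required: str, installed: str) -> bool:
    """Return True if the installed version is older than the required one."""
    req, have = version_tuple(required), version_tuple(installed)
    width = max(len(req), len(have))
    # Pad with zeros so that "4.56" and "4.56.0" compare as equal.
    req += (0,) * (width - len(req))
    have += (0,) * (width - len(have))
    return have < req
```

The installed version can be read with `importlib.metadata.version("transformers")`, and the required one from the `transformers_version` field of the checkpoint's config.json.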

References

| Reference | When to read |
|---|---|
| `skills/common/environment-setup.md` | Step 1: always |
| `skills/common/workspace-management.md` | Step 1: always |
| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: mandatory post-PTQ gate before deployment/evaluation |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
| `modelopt/torch/quantization/config.py` | Step 3: format definitions |
| `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
| `modelopt_recipes/` | Step 3: pre-built recipes |