quant-recipe-search - SKILL.md Agent Skill

name: quant-recipe-search
description: >-
  Use when the user asks to find, search for, or optimize the best quantization recipe for a model,
  including direct requests like "find the best quantization recipe and generate a PTQ checkpoint."
  Guides the multi-candidate loop: choose compute-vs-memory success metrics, select ModelOpt recipe
  baselines, design AutoQuant/manual recipe deltas, interpret sensitivity, and decide next candidates.
  Do NOT use for a single known PTQ recipe run (use ptq), serving (use deployment), creating/running
  evals (use evaluation or launching-evals), monitoring jobs (use monitor), MLflow browsing (use
  accessing-mlflow), or comparing completed baseline-vs-candidate scores only (use compare-results).

Quant Recipe Search

Use this skill when quantization is an iterative recipe search, not a one-off PTQ run. The skill owns strategy: define success, choose the search space, sequence candidates, and decide the next iteration. It delegates checkpoint generation, serving, evaluation, monitoring, and metric comparison to the existing execution skills.

Treat a direct request such as "find the best quantization recipe and generate a PTQ checkpoint for this model" as enough to start. Recover local state first, then ask only for missing decisions that change the search.

Skill Boundaries

Use ptq to produce and validate checkpoints.
Use deployment to serve checkpoints and debug serving-specific flags.
Use evaluation to create NEL configs and submit evals.
Use launching-evals to run, resume, debug, and analyze NEL runs.
Use monitor for active job tracking.
Use accessing-mlflow for MLflow artifact lookup.
Use compare-results for validated baseline-vs-candidate deltas and score-field comparability.

Do not duplicate those workflows here. This skill should leave the user with a clear recipe portfolio, success metric, experiment sequence, and next decision.

Problem

The task is to find the best recipe for a user-defined target, not merely to produce a quantized checkpoint. A generated PTQ checkpoint is only a candidate. It becomes a recommended recipe only after evaluation and comparison against the matching baseline.

Required inputs before planning candidates:

Optimization goal: compute/throughput, memory/latency, or a custom metric.
Primary quantization family: for example NVFP4, W4A16 NVFP4, FP8/W8A8, INT4/AWQ, or a custom mixed set.
Benchmark set or baseline results: the user-defined acceptance surface.

If any of these are missing, ask for them. Do not silently default to FP8/W8A8 or call a checkpoint "best" before evaluation.

Default success rule: maximize the chosen performance objective while keeping each benchmark within 1 percentage point of the matching BF16/FP16 baseline. Rerun near-threshold or noisy regressions before making a decision.
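The accuracy half of the default success rule can be sketched as a small check. This is a minimal illustration, not part of compare-results: the function name, the benchmark names, and the scores are hypothetical, and scores are assumed to be percentages where higher is better.

```python
from typing import Dict, List


def benchmarks_over_threshold(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop_pp: float = 1.0,
) -> List[str]:
    """Return benchmarks whose drop versus the baseline exceeds max_drop_pp.

    Scores are assumed to be percentages (0-100) where higher is better.
    A benchmark missing from the candidate is reported as failing, because
    the rule applies to every benchmark in the acceptance surface.
    """
    failing = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or base_score - cand_score > max_drop_pp:
            failing.append(name)
    return failing


# Hypothetical scores, for illustration only.
bf16_scores = {"bench_a": 71.2, "bench_b": 84.0, "bench_c": 62.8}
candidate_scores = {"bench_a": 70.6, "bench_b": 82.5, "bench_c": 62.9}
print(benchmarks_over_threshold(bf16_scores, candidate_scores))
```

Here only `bench_b` is reported, since it drops 1.5 points. An empty list means the accuracy constraint holds; it does not by itself make the recipe "best", because the performance objective still has to be compared.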

Search Space

Keep the search space explicit. A candidate recipe is a tuple across these axes:

Numeric format: FP8/W8A8, NVFP4/W4A4, W4A16 NVFP4, INT4/AWQ, or mixed formats such as NVFP4+FP8.
Calibration/search algorithm: max calibration, MSE calibration, GPTQ, AWQ, AutoQuant scoring, and calibration dataset or sample-count variants.
Selection method: manual/heuristic rules, sensitivity-guided manual recipes, AutoQuant selection, or a hybrid of AutoQuant plus manual overrides.
Module family: attention, MLP, MoE experts, routers/gates, embeddings, lm_head, adapters, vision encoders, and model-specific modules.
Runtime fusion constraints: modules fused by the inference library must use compatible quantization. Examples: vLLM Qwen linear_attn.in_proj_qkvz and fused MoE expert projections such as gate/up (w1/w3).
Calibration budget: dataset mix, sample count, sequence length, and batch settings.

Do not collapse the search to one dimension such as numeric format only. Read references/recipe_iteration.md when choosing concrete axes or candidates.
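One way to keep the axes explicit is to record each candidate as a small immutable record. The sketch below is illustrative: the class, its field names, and the example values are assumptions, not a ModelOpt schema.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class CandidateRecipe:
    """One point in the search space; field names are illustrative."""

    name: str
    numeric_format: str                  # e.g. "FP8/W8A8", "NVFP4/W4A4"
    calibration_algorithm: str           # e.g. "max", "mse", "gptq", "awq"
    selection_method: str                # e.g. "manual", "autoquant", "hybrid"
    quantized_families: Tuple[str, ...]  # module families that are quantized
    runtime: str                         # library whose fused groups constrain the recipe
    calibration_budget: str              # dataset mix, sample count, sequence length


def changed_axes(a: CandidateRecipe, b: CandidateRecipe) -> List[str]:
    """List the axes (every field except the name) on which two recipes differ."""
    return [
        f.name
        for f in fields(CandidateRecipe)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


# Hypothetical candidates, for illustration only.
base = CandidateRecipe(
    name="c0",
    numeric_format="NVFP4/W4A4",
    calibration_algorithm="max",
    selection_method="manual",
    quantized_families=("mlp", "attention"),
    runtime="vllm",
    calibration_budget="512 samples, 4096 tokens",
)
variant = replace(base, name="c1", calibration_algorithm="mse")
print(changed_axes(base, variant))
```

A helper like `changed_axes` makes it easy to see which dimension a new candidate actually explores relative to its parent.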

Design Workflow

Recover state
- Read result tables, recipe logs, AutoQuant states, sensitivity reports, and experiment notes before proposing new work.
- Ask monitor, launching-evals, or compare-results to recover active job state and completed metrics when needed.
Define the target
- Confirm the optimization goal, primary quantization family, benchmark set, accuracy-loss threshold, calibration budget, and cost metric.
- Include quantization metadata such as scale storage in active-cost or size estimates.
Pick baselines and first candidates
- Always include a BF16/FP16 baseline, plus a near-lossless FP8/W8A8 baseline unless FP8 itself is the target.
- For ModelOpt work, start from modelopt_recipes: model-specific recipes first, then general PTQ presets or recipe fragments.
- Add an AutoQuant candidate in the requested primary family when AutoQuant is available. Expect AutoQuant to find a better trade-off than a first manual recipe, but validate that assumption with the same evals.
- Add at least one manual or sensitivity-guided candidate so AutoQuant can be compared against controlled ablations and there is a fallback if AutoQuant misses the best accuracy-cost trade-off or hits runtime constraints.
Generate candidates
- Delegate checkpoint generation and PTQ validation to ptq.
- Change one major axis at a time: format, calibration algorithm, module selection, granularity, exclusions, or calibration data.
- Use AutoQuant for broad candidate generation and sensitivity reports; use manual recipes for controlled module-family ablations and overrides.
Gate before scaling
- Validate checkpoint coverage and metadata.
- Reject or rewrite recipes that mix quantization algorithms inside a fused runtime group.
- If the checkpoint is valid but serving fails due to runtime support, do not reject the recipe immediately. Delegate small patches or flag changes to deployment, then rerun a pipe-clean check.
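The fused-group gate can be sketched as a consistency check over a module-to-setting mapping. This is a simplified illustration: it treats "compatible" as "identical", and the function, module names, and group names are hypothetical rather than taken from vLLM or ModelOpt.

```python
from typing import Dict, List


def mixed_fused_groups(
    module_formats: Dict[str, str],
    fused_groups: Dict[str, List[str]],
) -> Dict[str, Dict[str, str]]:
    """Return fused groups whose members do not all share one quantization setting.

    module_formats maps a module name to its quantization setting; a module
    absent from the mapping is treated as unquantized.
    """
    violations = {}
    for group_name, members in fused_groups.items():
        assigned = {m: module_formats.get(m, "unquantized") for m in members}
        if len(set(assigned.values())) > 1:
            violations[group_name] = assigned
    return violations


# Hypothetical recipe and fusion map, for illustration only.
formats = {
    "experts.w1": "nvfp4",
    "experts.w3": "fp8",
    "attn.q_proj": "fp8",
    "attn.k_proj": "fp8",
}
groups = {
    "moe_gate_up": ["experts.w1", "experts.w3"],
    "attn_qk": ["attn.q_proj", "attn.k_proj"],
}
print(mixed_fused_groups(formats, groups))
```

In this example only `moe_gate_up` is flagged, because its two members use different settings. A flagged group is a signal to rewrite the recipe before spending eval budget on it.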

Iteration Loop

Run cheap screen evals for every candidate that passes the gates.
Compare accuracy, verbosity/token usage, and active cost against baselines.
Rerun noisy or near-threshold results before labeling a regression.
Decide the next candidate:
- Accuracy drop: protect or ablate sensitive module families, try MSE/GPTQ, or use AutoQuant sensitivity to choose overrides.
- Poor performance/cost: quantize the next high-cost active family, adjust active-cost objective, or try a more aggressive format.
- AutoQuant underperforms manual recipes: inspect sensitivity reports, achieved bits, excluded modules, and runtime-fusion constraints; keep the manual recipe in the portfolio instead of forcing the AutoQuant result.
- Runtime incompatibility: rewrite around fused groups or isolate deployment support from checkpoint quality.
- Repeated AutoQuant recipes: inspect achieved bits and recipe hashes, then adjust constraints before launching a larger sweep.
Promote only when compare-results shows the candidate is comparable to the baseline and satisfies the user-defined goal.
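The rerun rule in the loop above can be sketched as a three-way triage around the accuracy threshold. The function name and the `noise_pp` default are assumptions for illustration; a real noise estimate should come from repeated runs of the same benchmark.

```python
def triage_drop(drop_pp: float, max_drop_pp: float = 1.0, noise_pp: float = 0.3) -> str:
    """Classify an accuracy drop (baseline minus candidate, in percentage points).

    noise_pp is a hypothetical estimate of run-to-run noise. Drops that land
    within noise_pp of the threshold are sent back for a rerun instead of
    being labeled either way.
    """
    if drop_pp <= max_drop_pp - noise_pp:
        return "pass"
    if drop_pp > max_drop_pp + noise_pp:
        return "regression"
    return "rerun"


for drop in (0.2, 0.9, 1.1, 2.0):
    print(drop, triage_drop(drop))
```

With the defaults, drops of 0.2 and 2.0 are decided immediately, while 0.9 and 1.1 both fall inside the noise band and are rerun first.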

Maintain a recipe portfolio table with recipe name, objective, active-cost estimate, calibration notes, checkpoint path, eval/log references, accuracy, verbosity, and decision.
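A portfolio table of this shape can be kept as plain rows and rendered on demand. The sketch below also shows one possible active-cost estimate that amortizes scale storage over each block. The column keys, the helper names, the parameter count, and the definition of active cost as active weight bytes are all assumptions for illustration; the 4-bit example assumes 16-element blocks with one 8-bit scale each.

```python
from typing import Dict, List

COLUMNS = [
    "recipe", "objective", "active_cost", "calibration",
    "checkpoint", "eval_ref", "accuracy", "verbosity", "decision",
]


def active_weight_gib(
    active_params: int,
    weight_bits: float,
    scale_bits: float = 0.0,
    block_size: int = 1,
) -> float:
    """Estimate active weight storage in GiB, including per-block scales."""
    bits_per_weight = weight_bits + scale_bits / block_size
    return active_params * bits_per_weight / 8 / 2**30


def render_portfolio(rows: List[Dict[str, str]]) -> str:
    """Render rows as a pipe-separated table; missing cells are left empty."""
    lines = [" | ".join(COLUMNS), " | ".join("---" for _ in COLUMNS)]
    for row in rows:
        lines.append(" | ".join(str(row.get(col, "")) for col in COLUMNS))
    return "\n".join(lines)


# Hypothetical model with 8 * 2**30 active parameters, for illustration only.
active = 8 * 2**30
rows = [
    {"recipe": "bf16-baseline", "objective": "memory",
     "active_cost": f"{active_weight_gib(active, 16):.2f} GiB", "decision": "baseline"},
    {"recipe": "w4-block16", "objective": "memory",
     "active_cost": f"{active_weight_gib(active, 4, 8, 16):.2f} GiB", "decision": "pending eval"},
]
print(render_portfolio(rows))
```

The 4-bit row comes out at 4.50 GiB instead of 4.00 GiB because each weight carries half a bit of scale storage, which is the kind of metadata cost the estimate should not omit.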

References

For recipe design, search-space details, sensitivity, and active-cost accounting, read references/recipe_iteration.md.
For a concrete prior case study, read references/qwen36_case_study.md only when Qwen3.5/Qwen3.6 details are relevant.