name: q2mm-benchmark
description: 'MANDATORY before launching any q2mm batch optimization run >30 minutes (regenerate_convergence_results.py, multi-system convergence runs, from-QFUERZA runs, etc.). Forces success-spec write, optimizer config sanity check, FIRST-system audit gate before launching the rest, and post-batch validation. Use when user says "run benchmarks", "regenerate convergence results", "run all systems", "QFUERZA recovery", or any variant of launching a multi-system q2mm optimization batch.'
license: MIT
allowed-tools: Bash, Read, Edit, Grep, Glob
q2mm Benchmark Pre-Flight & Audit
A multi-hour GPU benchmark batch that produces useless results is worse than one that fails fast. This skill exists because a previous agent shipped a flawed batch (5 systems, ~4 GPU-hours) where the optimizer exited at n_evaluations=2 on every system without actually optimizing, and then wrote a PR comparing the wrong metric. Don't repeat that.
The cardinal rule: AUDIT THE FIRST SYSTEM BEFORE LAUNCHING THE REST.
Step 1 — Write the success spec (mandatory)
Before any other work, write a paragraph in the session plan answering:
- What is the user's literal question? Quote it verbatim.
- What deliverable artifacts will exist when this is done? Be specific: "a markdown page with three tables: (a) per-system R² of q2mm-optimized FF vs published-paper R², (b) per-parameter abs deviation tables, (c) physical-chemistry walkthrough of top-5 deviations per system."
- What numerical results count as 'success'? Examples: "QFUERZA-start final OF within 20% of published-start final OF on at least 3/5 systems" or "per-param mean abs deviation < 10% on bond_eq."
- What would make me declare the batch failed and stop? Examples: "All systems exit at n_evals<=2" or "R² is negative on the optimized FF."
If you cannot answer all four questions above in five sentences, stop and ask the user to clarify. Do not launch the batch.
Step 2 — Sanity-check the optimizer config
Read these defaults in the q2mm source and confirm they're appropriate for the starting point:
Bounds (q2mm/models/forcefield.py → DEFAULT_BOUNDS)
The defaults are physical sanity bounds, not "stay near starting FF":
- bond_k: (-3600, 3600) kcal/mol/Å²
- angle_k: (-720, 720) kcal/mol/rad²
- bond_eq: (0.5, 3.0) Å
- angle_eq: (30, 180) deg
For canonical-default QFUERZA-start runs (the default; or any from-poor-start run), sanity bounds let L-BFGS-B escape the QFUERZA basin into a worse local minimum. Use fractional bounds around the starting value:
- Typical: bounds_fraction_fc=0.20, bounds_fraction_eq=0.05
- Fragile systems (heck-relay): bounds_fraction_fc=0.05, bounds_fraction_eq=0.02
For publication-baseline (--starting-point published) runs, sanity bounds are usually fine — the starting FF is already in the published basin.
Pass via --fc-fraction / --eq-fraction CLI flags on scripts/regenerate_convergence_results.py.
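To make the difference between sanity bounds and fractional bounds concrete, here is a minimal sketch of a symmetric fractional window around a starting value. The helper name fractional_bounds and the starting values are hypothetical, and the window is taken relative to the magnitude of the starting value so that a negative force constant still gets lo < hi. How q2mm actually applies the fractions may differ, so check the source before relying on this.

```python
def fractional_bounds(start, fraction):
    """Return (lo, hi) spanning +/- fraction of |start| around start.

    Hypothetical helper for illustration; not taken from the q2mm source.
    """
    if fraction < 0:
        raise ValueError("fraction must be non-negative")
    # Use the magnitude so negative starting values keep lo < hi.
    half_width = fraction * abs(start)
    return (start - half_width, start + half_width)


# Made-up starting values, for illustration only.
start_bond_k = 300.0   # kcal/mol/A^2
start_bond_eq = 1.50   # A

fc_bounds = fractional_bounds(start_bond_k, 0.20)
eq_bounds = fractional_bounds(start_bond_eq, 0.05)
neg_bounds = fractional_bounds(-100.0, 0.20)

print("bond_k window: ", fc_bounds)
print("bond_eq window:", eq_bounds)
print("negative k:    ", neg_bounds)
```

With the typical fractions, a 300 kcal/mol/Å² force constant is confined to roughly 240 to 360, far narrower than the (-3600, 3600) sanity bound. Note that under this sketch a starting value of exactly zero yields a zero-width window, which a real implementation would have to handle separately.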
Convergence tolerance (scipy_opt.py → _run_minimize)
The script default L-BFGS-B ftol=1e-8 is loose for from-poor-start runs — nfev will often be ≤ 5 with no real optimization. Override with --ftol 1e-12 (or tighter) for any run where you actually want the optimizer to work.
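The effect of ftol is easier to see with the stopping rule written out. SciPy documents the L-BFGS-B criterion as stopping when (f_k - f_{k+1}) / max(|f_k|, |f_{k+1}|, 1) <= ftol. The sketch below reproduces only that test in plain Python; the function name and the objective values are made up, and it assumes the script forwards --ftol to SciPy unchanged.

```python
def lbfgsb_ftol_stop(f_prev, f_next, ftol):
    """Return True if the documented L-BFGS-B ftol test would stop here."""
    rel_reduction = (f_prev - f_next) / max(abs(f_prev), abs(f_next), 1.0)
    return rel_reduction <= ftol


# Made-up objective values: a large OF that drops by only 0.005 in one step.
f_prev = 1.0e6
f_next = 1.0e6 - 0.005

stops_at_loose = lbfgsb_ftol_stop(f_prev, f_next, 1e-8)
stops_at_tight = lbfgsb_ftol_stop(f_prev, f_next, 1e-12)

print("stops with ftol=1e-8: ", stops_at_loose)
print("stops with ftol=1e-12:", stops_at_tight)
```

Because the test is relative, a large objective value makes a small absolute step look converged. That is one plausible way a from-poor-start run can exit after a couple of evaluations under the looser tolerance.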
Ratio gate (scipy_opt.py → ratio check)
For TS systems with a poor starting FF, the JaxLoss/ObjectiveFunction ratio can be 0.1–0.4 or even diverge to 1e74 (heck-relay from QFUERZA). The default ratio check rejects these. Two options:
- --ratio-tol -1 to bypass entirely (use for from-QFUERZA TS runs)
- Document the explosion honestly in the analysis instead of pretending it didn't happen
Step 3 — Pre-flight checklist
Walk through this LITERALLY before running. Each item must be checked.
- Goal restated as a success spec (Step 1)
- Bounds strategy chosen and matches starting-point quality (Step 2)
- ftol/gtol tight enough for real optimization
- Per-system overrides documented if any system needs special handling
- GPU verified: python -c "import jax; print(jax.devices())" shows CudaDevice
- Output directory chosen (q2mm-data/benchmarks/<system>/convergence/ for the canonical QFUERZA-start default; q2mm-data/benchmarks/<system>/from-published/ for opt-in publication-baseline runs)
- PYTHONPATH set if running from a worktree (editable install points to master)
Step 4 — Run the FIRST system in isolation
Do NOT launch all systems in a batch. Run only the first system:
# Canonical default is --starting-point qfuerza; pass --starting-point
# published only when reproducing publication-baseline benchmarks.
PYTHONPATH=/path/to/worktree python scripts/regenerate_convergence_results.py \
--system <first-system> \
--ftol 1e-12 \
--fc-fraction 0.20 \
--eq-fraction 0.05 \
--ratio-tol <value> \
--output-dir /path/to/q2mm-data/benchmarks
Step 5 — AUDIT GATE (do not skip)
After the first system completes, inspect its validation_results.json (for a canonical QFUERZA-start run, <output-dir>/<first-system>/convergence/validation_results.json):
import json
with open("<first-system>/convergence/validation_results.json") as f:
    r = json.load(f)["result"]
print("n_evaluations:", r["n_evaluations"])
print("n_iterations:", r["n_iterations"])
print("real OF: ", r["initial_obj_score"], "→", r["final_obj_score"])
print("improvement%: ", r["improvement_pct"])
print("ratio: ", r["ratio"])
print("Seminario R²: ", r["seminario"])
print("Optimized R²: ", r["optimized"])
Pass criteria (all must hold or you must explicitly justify deviation):
- n_evaluations > 5 (the optimizer actually evaluated the gradient)
- n_iterations > 3 (it took real steps)
- |improvement_pct| > 1 (the real OF actually changed)
- Optimized R² ≥ Seminario R² on each metric (the run didn't make things worse)
Fail criteria (stop immediately if any holds):
- n_evaluations ≤ 2 AND |improvement_pct| < 1 → optimizer did not optimize. Likely ftol too loose or bounds too wide. Do NOT launch the remaining systems. Diagnose and re-run.
- ratio > 100 AND improvement_pct < 0 → JaxLoss surrogate diverged AND the optimizer made the FF worse. Diagnose: tighter bounds, FC clamping, different starting point.
- Optimized R² < Seminario R² → the run degraded the FF. Diagnose before continuing.
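The pass and fail criteria above can be sketched as a single check over the result dict. The function name audit_first_system is hypothetical, and the sketch assumes that r["seminario"] and r["optimized"] are dicts mapping a metric name to its R²; the actual JSON layout may differ. Results that hit no fail criterion but miss a pass criterion (for example n_evaluations between 3 and 5) come back as "justify", meaning the deviation needs an explicit written justification.

```python
def audit_first_system(r):
    """Classify a result dict as 'fail', 'justify', or 'pass'.

    Hypothetical helper; field shapes are assumptions, not the q2mm schema.
    """
    n_evals = r["n_evaluations"]
    n_iters = r["n_iterations"]
    improvement = r["improvement_pct"]
    ratio = r["ratio"]
    # Assumption: both R^2 entries are dicts of metric name -> R^2.
    degraded = sorted(
        m for m, r2 in r["optimized"].items() if r2 < r["seminario"][m]
    )

    # Fail criteria: any one of these means stop and diagnose.
    failures = []
    if n_evals <= 2 and abs(improvement) < 1:
        failures.append("optimizer did not optimize")
    if ratio > 100 and improvement < 0:
        failures.append("surrogate diverged and the FF got worse")
    if degraded:
        failures.append("R^2 degraded on: " + ", ".join(degraded))
    if failures:
        return "fail", failures

    # Pass criteria: every one must hold, otherwise justify explicitly.
    unmet = []
    if not n_evals > 5:
        unmet.append("n_evaluations <= 5")
    if not n_iters > 3:
        unmet.append("n_iterations <= 3")
    if not abs(improvement) > 1:
        unmet.append("|improvement_pct| <= 1")
    if unmet:
        return "justify", unmet
    return "pass", []


# Made-up result, for illustration only.
example = {
    "n_evaluations": 2, "n_iterations": 1, "improvement_pct": 0.0,
    "ratio": 1.0, "seminario": {"bond": 0.90}, "optimized": {"bond": 0.90},
}
verdict, reasons = audit_first_system(example)
print(verdict, reasons)
```

Checking the fail criteria first keeps a stalled run from being reported as a merely incomplete pass.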
Step 6 — Launch the rest (only if Step 5 passes)
Launch the remaining systems, one at a time or in a small batch. Continue to spot-check intermediate results; if a system shows the same nfev<=2 failure mode that didn't appear in the first system, stop and diagnose.
Step 7 — Post-batch validation
Before writing the analysis doc:
- All systems have n_evaluations > 5 (or each exception is documented with a chemical/physical reason)
- All systems have validation_results.json with full provenance
- Spot-check at least two validation_results.json files for sanity
If a benchmark batch ends with multiple systems exiting at n_evals=2, that's a silent protocol failure — even if all the JSON files were written.
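One way to catch that silent failure is to scan every validation_results.json under the output directory and list the systems whose evaluation count is too low. This is a sketch, not part of q2mm: the function name find_stalled_systems is hypothetical, and it assumes the <output-dir>/<system>/convergence/validation_results.json layout with a top-level "result" key. The demo at the bottom runs against a throwaway directory with made-up system names.

```python
import json
import tempfile
from pathlib import Path


def find_stalled_systems(output_dir, subdir="convergence", min_evals=5):
    """Map system name -> n_evaluations for runs with n_evaluations <= min_evals."""
    stalled = {}
    pattern = f"*/{subdir}/validation_results.json"
    for path in sorted(Path(output_dir).glob(pattern)):
        with open(path) as f:
            result = json.load(f)["result"]
        n_evals = result["n_evaluations"]
        if n_evals <= min_evals:
            # path is <output_dir>/<system>/<subdir>/validation_results.json
            stalled[path.parent.parent.name] = n_evals
    return stalled


# Demo on a temporary directory; names and counts are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_evals in [("system-a", 40), ("system-b", 2)]:
        run_dir = Path(tmp) / name / "convergence"
        run_dir.mkdir(parents=True)
        payload = {"result": {"n_evaluations": n_evals}}
        (run_dir / "validation_results.json").write_text(json.dumps(payload))
    stalled = find_stalled_systems(tmp)

print(stalled)
```

Any system that appears in the returned dict needs either a re-run or a documented chemical/physical reason before the analysis doc is written. A system with no JSON file at all will not appear here, so the provenance check remains a separate step.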
Quick reference: things that should make you stop and ask
- "How do I know if the optimizer actually optimized?" → check n_evaluations and real OF delta
- "Should I use sanity bounds or fractional bounds?" → fractional for the canonical QFUERZA-start default; sanity is fine for --starting-point published runs
- "Why does nfev=2 happen on every system?" → default ftol is 1e-8, way too loose for from-poor-start; use --ftol 1e-12
- "Heck-relay's ratio is 1e74, is that OK?" → no, the JaxLoss surrogate exploded; document honestly and consider tighter bounds or FF pre-conditioning
- "Should I just bypass the ratio gate with --ratio-tol -1?" → only if you understand why it's failing; ratio gate exists for a reason
Anti-patterns to refuse
- Launching all 5 systems in a batch without auditing the first one
- Shipping results where n_evaluations <= 2 on every system as if optimization happened
- Comparing only final_obj_score when the user asked about parameter values or R²
- Bypassing the ratio gate with --ratio-tol -1 without diagnosing why it's failing
- Writing the analysis doc before re-reading the user's literal question