name: mlmm-hpc
description: PBS (Torque / PBSPro) and SLURM submission for mlmm-toolkit — generic preamble templates with placeholders, walltime budgeting, CPU vs GPU choice, job monitoring, and the dynamic-dispatch (flock + pbsdsh) recipe in dynamic-dispatch.md. TRIGGER on cluster submission / qsub / sbatch / walltime / preamble / multi-job dispatch / pbsdsh / flock / many-system batch questions. SKIP for local single-machine runs, install setup, or output parsing. Note: mlmm-toolkit is single-GPU per job and has no --resume / --workers interface.
mlmm HPC
Purpose
mlmm-toolkit is a CPU+GPU Python program; on HPC clusters you typically
submit it as a PBS or SLURM job that requests one node with one GPU.
This skill provides generic templates with placeholders — fill in
your queue / module / env names from mlmm-env-detect/SKILL.md.
When the env is unknown
If you don't know the cluster's queue / GPU / module configuration,
read mlmm-env-detect/SKILL.md first. It walks through the
discovery commands (qstat -Q, pbsnodes -a, nvidia-smi,
module avail cuda, conda env list) and tells you how to fill the
placeholders this skill uses.
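Once the placeholders are filled in, a quick check before submission can catch any that were missed. The following is a minimal sketch; the function name `check_placeholders`, the file name `job.pbs`, and the regular expression are illustrative assumptions, not part of mlmm-toolkit.

```shell
# Sketch: refuse to submit a job script that still contains <PLACEHOLDER> tokens.
# The function name and the regex are illustrative assumptions.
check_placeholders() {
  local script=$1
  # Matches tokens such as <YOUR_QUEUE>, <NCPU>, <HH:MM:SS>, <jobname>.
  # grep prints the offending lines with their line numbers.
  if grep -nE '<[A-Za-z_:]+>' "$script"; then
    echo "ERROR: unfilled placeholders in $script" >&2
    return 1
  fi
}

# Usage (job.pbs is a hypothetical file name):
#   check_placeholders job.pbs && qsub job.pbs
```

Shell redirections such as `>/dev/null` do not match the pattern, because it requires an opening `<` followed only by letters, underscores, or colons.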
PBS preamble template (Torque / PBSPro)
#!/usr/bin/env bash
#PBS -N <jobname>
#PBS -q <YOUR_QUEUE>
#PBS -l nodes=1:ppn=<NCPU>:gpus=<NGPU>,mem=<MEM>GB,walltime=<HH:MM:SS>
#PBS -o <jobname>.out
#PBS -e <jobname>.err
set -euo pipefail
cd "${PBS_O_WORKDIR}"
# Preflight: fail fast if the env or the CUDA driver is missing.
command -v conda >/dev/null || { echo "conda not on PATH"; exit 1; }
nvidia-smi -L >/dev/null || { echo "no GPU visible"; exit 1; }
# CUDA + toolchain: HPC modulefiles (env-detect outputs <CUDA_MODULE>)
# - gcc: load when the system default is too old for the CUDA toolkit or
# when pip will compile a C/CUDA extension from source.
# (OpenMPI is not needed: mlmm-toolkit is single-GPU only and has no Ray /
# `--workers` path.)
command -v module >/dev/null && module load <CUDA_MODULE> gcc
# Conda env (env-detect outputs <YOUR_ENV>)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate <YOUR_ENV>
# Optional: torch CUDA tuning
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
mlmm all -i 1.R.pdb 3.P.pdb \
-c 'SAM,GPP,MG' -l 'SAM:1,GPP:-3' \
--tsopt --thermo \
--out-dir result_all > mlmm.log 2>&1
PBSPro uses a different resource syntax (#PBS -l select=1:ncpus=<NCPU>:ngpus=<NGPU>:mem=<MEM>gb,
with walltime on its own #PBS -l walltime=<HH:MM:SS> line). Which form is accepted depends on
the scheduler flavor and version, so check man qsub on your cluster.
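To make the difference between the two resource syntaxes concrete, here is a small sketch that prints the resource directives for either flavor. The function name `pbs_resource_lines` and its argument order are hypothetical; the directive strings mirror the forms shown above.

```shell
# Sketch: print the resource-request directives for either PBS flavor.
# pbs_resource_lines is a hypothetical helper, not an mlmm-toolkit command.
pbs_resource_lines() {
  local flavor=$1 ncpu=$2 ngpu=$3 mem_gb=$4 walltime=$5
  case "$flavor" in
    torque)
      # Torque: one comma-separated -l line.
      echo "#PBS -l nodes=1:ppn=${ncpu}:gpus=${ngpu},mem=${mem_gb}GB,walltime=${walltime}"
      ;;
    pbspro)
      # PBSPro: chunk resources go in select=, walltime gets its own -l line.
      echo "#PBS -l select=1:ncpus=${ncpu}:ngpus=${ngpu}:mem=${mem_gb}gb"
      echo "#PBS -l walltime=${walltime}"
      ;;
    *)
      echo "unknown flavor: $flavor" >&2
      return 1
      ;;
  esac
}

# Usage:
#   pbs_resource_lines pbspro 8 1 32 12:00:00
```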
SLURM preamble template
#!/usr/bin/env bash
#SBATCH --job-name=<jobname>
#SBATCH --partition=<YOUR_PARTITION>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<NCPU>
#SBATCH --gres=gpu:<NGPU>
#SBATCH --mem=<MEM>G
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=%x.%j.out
set -euo pipefail
cd "${SLURM_SUBMIT_DIR}"
# Preflight: confirm conda + GPU before launching
command -v conda >/dev/null || { echo "ERROR: conda not on PATH"; exit 1; }
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "WARN: no GPU visible (nvidia-smi missing or failed); continuing"
# CUDA + toolchain (see PBS template above for when gcc is needed)
command -v module >/dev/null && module load <CUDA_MODULE> gcc
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate <YOUR_ENV>
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
mlmm all -i 1.R.pdb 3.P.pdb \
-c 'SAM,GPP,MG' -l 'SAM:1,GPP:-3' \
--tsopt --thermo \
--out-dir result_all
Walltime budgeting
Rough empirical estimates for clusters of ~200–700 atoms with UMA-s-1.1 on a single mid-range GPU. Pad them generously when requesting walltime.
| Stage | Per-segment time | Notes |
|---|---|---|
| extract | < 1 min | Pure Python, CPU |
| path-search (GSM) | 5–30 min | Scales with --max-nodes |
| path-search (DMF) | 10–60 min | Slower than GSM but more robust |
| tsopt (RS-I-RFO) | 5–60 min | Hessian rebuilds dominate |
| tsopt (Dimer) | 1–10 min | Hessian-free; cheaper |
| irc | 5–30 min | Forward + backward; default 125 cycles each |
| freq | 5–30 min | Hessian once + diagonalization |
| dft (ωB97M-V/def2-svp, GPU) | 30 min – 6 h | Heavy; ~1–10 h on TZVPD |
| dft (CPU) | 10× GPU time | Use only for small clusters |
For an all run with 2 segments + DFT: budget 6–24 h walltime.
For pure MLIP all (no DFT): 2–6 h is usually enough.
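As a worked example of the arithmetic, the sketch below sums the per-segment upper bounds from the table for a GSM + RS-I-RFO route without DFT, applies a safety margin, and formats the result for the walltime placeholder. The 20% margin and the helper name `to_hms` are assumptions chosen for illustration.

```shell
# Convert a minute budget into the HH:MM:SS form used by the walltime placeholder.
to_hms() {
  local minutes=$1
  printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

# Upper bounds per segment from the table above, in minutes:
# path-search (GSM) 30 + tsopt (RS-I-RFO) 60 + irc 30 + freq 30.
per_segment=$((30 + 60 + 30 + 30))
segments=2
margin_pct=20   # assumed safety margin, not taken from the table

total=$(( per_segment * segments * (100 + margin_pct) / 100 ))
to_hms "$total"   # prints 06:00:00
```

Two segments at 150 minutes each give 300 minutes; with the assumed margin that becomes 360 minutes, which sits at the top of the 2–6 h range quoted for a pure MLIP run.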
CPU vs GPU choice
| Workload | CPU | GPU |
|---|---|---|
| MLIP inference (any backend) | Possible but ~50–200× slower | Required for production |
| mlmm dft with ωB97M-V on > 200 atoms | Slow (10–100 h) | Recommended |
| mlmm dft with cheap functional / small molecule | Fine | Marginal speedup |
| Hessian (analytical, UMA) | OK if VRAM-limited | Faster |
Check mlmm-install-backends/dft.md for --engine gpu / cpu
specifics, including the aarch64 caveat (CPU PySCF only).
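A job script can make the engine choice explicit instead of relying on whatever the node happens to provide. The sketch below is one possible heuristic, assuming the aarch64 caveat means only the CPU engine is usable there; `choose_engine` is a hypothetical helper, and the exact way to pass the result to mlmm dft should be taken from mlmm-install-backends/dft.md.

```shell
# Sketch: pick "gpu" or "cpu" from the node architecture and visible GPU count.
# choose_engine is a hypothetical helper; it only prints the decision.
choose_engine() {
  local arch=$1 ngpu=$2
  if [ "$arch" = "aarch64" ] || [ "$ngpu" -lt 1 ]; then
    echo cpu   # aarch64 (assumed CPU-only) or no GPU visible
  else
    echo gpu
  fi
}

# Usage inside a job script:
#   engine=$(choose_engine "$(uname -m)" "$(nvidia-smi -L 2>/dev/null | grep -c '^GPU')")
#   echo "DFT engine: $engine"
```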
Monitoring and control
PBS:
qstat -u "$USER" # state: Q (queued), R (running), C (complete)
qstat -f <jobid> # full job info
qdel <jobid> # cancel a single job by id
SLURM:
squeue -u "$USER"
scontrol show job <jobid>
scancel <jobid>
Scope cancellation by job-name pattern; never xargs qdel over
unfiltered output. When cancelling a batch, filter explicitly:
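# PBS: cancel only jobs whose name (column 4 of qstat -u) matches a pattern
qstat -u "$USER" | awk '$1 ~ /^[0-9]/ && $4 ~ /<pattern>/ {print $1}' | xargs -r qdel
# SLURM: equivalent. squeue --name matches exact job names only, so filter
# the job-name column (%j) with awk when you need a pattern.
squeue -u "$USER" -h -o '%i %j' | awk '$2 ~ /<pattern>/ {print $1}' | xargs -r scancel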
qdel / scancel only terminate jobs you own, so they cannot affect
other users; the scope warning is to avoid cancelling your own
unrelated jobs (e.g. interactive sessions or another campaign).
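A dry run makes the scope of a batch cancellation visible before anything is terminated. The sketch below separates selection from cancellation; `select_job_ids` is a hypothetical helper, and it assumes the usual `qstat -u` layout in which the job id is column 1 and the job name is column 4.

```shell
# Read `qstat -u "$USER"` output on stdin and print the ids of jobs whose
# name matches the given regex. Header and separator lines are skipped
# because their first field does not start with a digit.
select_job_ids() {
  local pattern=$1
  awk -v pat="$pattern" '$1 ~ /^[0-9]/ && $4 ~ pat {print $1}'
}

# Usage: inspect first, then cancel.
#   qstat -u "$USER" | select_job_ids '^mlmm_'
#   qstat -u "$USER" | select_job_ids '^mlmm_' | xargs -r qdel
```

Matching on the name column rather than the whole line avoids accidental hits on the queue name or user name.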
Failed jobs / restart
mlmm all does not auto-resume (there is no --resume flag); re-running creates
a fresh result_all/. Several stages support manual continuation:
- tsopt, freq, irc, dft: re-run on the previous output.
- path-search: pass the partial mep.pdb as -i.
For walltime-truncated jobs, write the per-stage outputs to a persistent location and resume from the last completed stage.
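Because a re-run starts from a fresh result_all/, one cautious option is to move the truncated run aside first so that its per-stage outputs stay available for manual continuation. The sketch below does only that; `archive_outdir` is a hypothetical helper and the timestamp suffix is an arbitrary naming choice.

```shell
# Sketch: rename an existing output directory with a timestamp suffix
# so a re-run cannot clobber it. Prints the new path.
archive_outdir() {
  local dir=${1:-result_all}
  [ -d "$dir" ] || return 0          # nothing to archive
  local stamp
  stamp=$(date +%Y%m%d-%H%M%S)
  mv -- "$dir" "${dir}.${stamp}"
  echo "${dir}.${stamp}"
}

# Usage before resubmitting:
#   previous=$(archive_outdir result_all)
#   echo "previous run kept in: ${previous:-<none>}"
```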
Parallel job submission patterns
Fan-out (one job per task)
for ts in seg_*.pdb; do
jobid=$(qsub -v TS="$ts" generic_dft.sh)
echo "submitted $ts as $jobid"
done
Each qsub produces an independent PBS job; the scheduler load-balances
them.
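The same fan-out can be written for SLURM. The sketch below defaults to a dry run that only prints the sbatch commands; `fan_out_slurm` and the `DRY_RUN` variable are hypothetical names, and generic_dft.sh is the same illustrative script as in the PBS loop.

```shell
# Sketch: submit one SLURM job per input file, passing the file name as TS.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
fan_out_slurm() {
  local script=$1; shift
  local ts jobid
  for ts in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "sbatch --parsable --export=ALL,TS=${ts} ${script}"
    else
      # --parsable makes sbatch print just the job id.
      jobid=$(sbatch --parsable --export=ALL,TS="$ts" "$script")
      echo "submitted $ts as $jobid"
    fi
  done
}

# Usage:
#   fan_out_slurm generic_dft.sh seg_*.pdb             # preview
#   DRY_RUN=0 fan_out_slurm generic_dft.sh seg_*.pdb   # submit
```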
Dynamic dispatch (one job, N nodes pull tasks)
When you have many short tasks and want to amortize the queue wait,
use the flock + pbsdsh pattern documented in dynamic-dispatch.md. One
qsub grabs N nodes, each node runs a worker that pulls tasks from a
shared list with file-lock-protected counter increment.
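The core of the pattern is the lock-protected counter. The following is a minimal sketch of that one step, not the contents of dynamic-dispatch.md; `next_task_index` and the file names are assumptions, and it relies on the flock utility from util-linux being installed on the nodes.

```shell
# Sketch: atomically read-and-increment a shared counter file.
# Each caller receives a unique, zero-based task index.
next_task_index() {
  local counter=$1
  (
    flock -x 9                        # block until this worker holds the lock
    n=$(cat "$counter" 2>/dev/null)   # empty if the counter does not exist yet
    n=${n:-0}
    echo $((n + 1)) > "$counter"      # advance the counter for the next caller
    echo "$n"                         # hand the old value to this caller
  ) 9>"${counter}.lock"
}

# Usage in a worker loop (tasks.txt is a hypothetical one-task-per-line list):
#   while :; do
#     i=$(next_task_index counter.txt)
#     task=$(sed -n "$((i + 1))p" tasks.txt)
#     [ -n "$task" ] || break
#     echo "running task $i: $task"
#   done
```

The read and the write happen inside the same locked subshell, so two workers cannot observe the same counter value.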
Useful environment variables
| Variable | Purpose |
|---|---|
| PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | Reduce torch memory fragmentation |
| CUDA_VISIBLE_DEVICES=0 | Restrict to a single GPU per worker |
| OMP_NUM_THREADS=<NCPU> | Limit OpenMP threads (avoid oversubscription) |
| MKL_NUM_THREADS=<NCPU> | Intel MKL thread cap |
| LD_LIBRARY_PATH=<torch lib>:... | Override system CUDA libs (see env-cuda.md) |
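Rather than hard-coding <NCPU> twice, the thread caps can be derived from the scheduler's own environment. This is a sketch under the assumption that SLURM exports SLURM_CPUS_PER_TASK (when --cpus-per-task is requested), Torque exports PBS_NUM_PPN, and PBSPro exports NCPUS; verify with `env` inside a job on your cluster. `set_thread_caps` is a hypothetical helper.

```shell
# Sketch: cap OpenMP and MKL threads at the number of cores the job was given.
# Falls back to 1 if none of the assumed scheduler variables is set.
set_thread_caps() {
  local n=${SLURM_CPUS_PER_TASK:-${PBS_NUM_PPN:-${NCPUS:-1}}}
  export OMP_NUM_THREADS=$n
  export MKL_NUM_THREADS=$n
}

# Usage near the top of the job script, after the scheduler directives:
#   set_thread_caps
#   echo "threads: $OMP_NUM_THREADS"
```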
ssh-based remote submission
Generally avoided in shared distribution skills (depends on
per-user ssh config). If your cluster requires ssh <login> qsub,
add that as a wrapper around the PBS / SLURM command above; do not
embed it inside the skill template.
See also
- dynamic-dispatch.md: flock + pbsdsh template for many short tasks.
- mlmm-env-detect/SKILL.md: discover queue / module / env values for the placeholders above.
- mlmm-install-backends/env-cuda.md: driver / torch CUDA pairing.
- mlmm-cli/all.md: the typical workload submitted to HPC.