mlmm-hpc


PBS (Torque / PBSPro) and SLURM submission for mlmm-toolkit — generic preamble templates with placeholders, walltime budgeting, CPU vs GPU choice, job monitoring, and the dynamic-dispatch (`flock` + `pbsdsh`) recipe in `dynamic-dispatch.md`. TRIGGER on cluster submission / `qsub` / `sbatch` / walltime / preamble / multi-job dispatch / `pbsdsh` / `flock` / many-system batch questions. SKIP for local single-machine runs, install setup, or output parsing. Note: mlmm-toolkit is single-GPU per job and has no `--resume` / `--workers` interface.

By t-0hmura. Updated 6/8/2026


mlmm HPC

Purpose

mlmm-toolkit is a CPU+GPU Python program; on HPC clusters you typically submit it as a PBS or SLURM job that requests one node with one GPU. This skill provides generic templates with placeholders — fill in your queue / module / env names from mlmm-env-detect/SKILL.md.

When the env is unknown

If you don't know the cluster's queue / GPU / module configuration, read mlmm-env-detect/SKILL.md first. It walks through the discovery commands (qstat -Q, pbsnodes -a, nvidia-smi, module avail cuda, conda env list) and tells you how to fill the placeholders this skill uses.
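
As a rough sketch of what that discovery pass looks like, the snippet below runs each of the commands named above only if it is available. The `probe` helper is a hypothetical name introduced here (it is not part of mlmm-env-detect), and the 20-line output cap is an arbitrary choice.

```shell
#!/usr/bin/env bash
# probe TOOL [ARGS...]: run a discovery command if it exists, else say so.
# `probe` is an illustrative helper, not something mlmm-toolkit ships.
probe() {
    local tool=$1
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool $*"
        # Cap the output; tolerate non-zero exits (e.g. no PBS server reachable).
        "$tool" "$@" 2>&1 | head -n 20 || true
    else
        echo "-- $tool: not found"
    fi
}

probe qstat -Q            # PBS queues
probe pbsnodes -a         # PBS node properties (GPUs, memory)
probe nvidia-smi          # GPUs visible on this host
probe module avail cuda   # CUDA modulefiles
probe conda env list      # conda environments
```

Run it on a login node and again inside a short interactive job: GPUs and some modules are often visible only on compute nodes.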

PBS preamble template (Torque / PBSPro)

#!/usr/bin/env bash
#PBS -N <jobname>
#PBS -q <YOUR_QUEUE>
#PBS -l nodes=1:ppn=<NCPU>:gpus=<NGPU>,mem=<MEM>GB,walltime=<HH:MM:SS>
#PBS -o <jobname>.out
#PBS -e <jobname>.err
set -euo pipefail
cd "${PBS_O_WORKDIR}"

# Preflight: fail fast if the env or the CUDA driver is missing.
command -v conda >/dev/null || { echo "conda not on PATH"; exit 1; }
nvidia-smi -L >/dev/null     || { echo "no GPU visible"; exit 1; }

# CUDA + toolchain: HPC modulefiles (env-detect outputs <CUDA_MODULE>)
# - gcc: load when the system default is too old for the CUDA toolkit or
#   when pip will compile a C/CUDA extension from source.
# (OpenMPI is not needed: mlmm-toolkit is single-GPU only and has no Ray /
#  `--workers` path.)
command -v module >/dev/null && module load <CUDA_MODULE> gcc

# Conda env (env-detect outputs <YOUR_ENV>)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate <YOUR_ENV>

# Optional: torch CUDA tuning
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

mlmm all -i 1.R.pdb 3.P.pdb \
    -c 'SAM,GPP,MG' -l 'SAM:1,GPP:-3' \
    --tsopt --thermo \
    --out-dir result_all > mlmm.log 2>&1

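The template above uses Torque syntax. PBSPro uses a select statement instead, with walltime requested as a separate resource (#PBS -l select=1:ncpus=<NCPU>:ngpus=<NGPU>:mem=<MEM>gb plus #PBS -l walltime=<HH:MM:SS>). Which forms a given installation accepts varies; check man qsub on your cluster.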

SLURM preamble template

#!/usr/bin/env bash
#SBATCH --job-name=<jobname>
#SBATCH --partition=<YOUR_PARTITION>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<NCPU>
#SBATCH --gres=gpu:<NGPU>
#SBATCH --mem=<MEM>G
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=%x.%j.out
set -euo pipefail

cd "${SLURM_SUBMIT_DIR}"
# Preflight: confirm conda + GPU before launching
command -v conda >/dev/null || { echo "ERROR: conda not on PATH"; exit 1; }
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "WARN: nvidia-smi missing or no GPU visible; continuing" >&2
# CUDA + toolchain (see PBS template above for when gcc is needed)
command -v module >/dev/null && module load <CUDA_MODULE> gcc
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate <YOUR_ENV>
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

mlmm all -i 1.R.pdb 3.P.pdb \
    -c 'SAM,GPP,MG' -l 'SAM:1,GPP:-3' \
    --tsopt --thermo \
    --out-dir result_all

Walltime budgeting

Rough empirical estimates for molecular clusters of ~200–700 atoms with UMA-s-1.1 on a single mid-range GPU. Pad them generously when requesting walltime.

| Stage | Per-segment time | Notes |
|---|---|---|
| extract | < 1 min | Pure Python, CPU |
| path-search (GSM) | 5–30 min | Scales with --max-nodes |
| path-search (DMF) | 10–60 min | Slower than GSM but more robust |
| tsopt (RS-I-RFO) | 5–60 min | Hessian rebuilds dominate |
| tsopt (Dimer) | 1–10 min | Hessian-free; cheaper |
| irc | 5–30 min | Forward + backward; default 125 cycles each |
| freq | 5–30 min | Hessian once + diagonalization |
| dft (ωB97M-V/def2-svp, GPU) | 30 min – 6 h | Heavy; ~1–10 h on TZVPD |
| dft (CPU) | 10× GPU time | Use only for small clusters |

For an mlmm all run with 2 segments plus DFT, budget 6–24 h of walltime. For a pure-MLIP mlmm all run (no DFT), 2–6 h is usually enough.
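
The arithmetic behind such a budget can be sketched in a few lines of bash: add up the upper-bound minutes for the stages you plan to run, pad the sum, and format it as HH:MM:SS for the scheduler. The `to_walltime` name and the 20% default margin are assumptions made for this example, not values taken from mlmm-toolkit.

```shell
#!/usr/bin/env bash
# to_walltime MINUTES [MARGIN_PERCENT]
# Convert a minute estimate into HH:MM:SS, padded by MARGIN_PERCENT.
# The 20% default margin is an illustrative choice.
to_walltime() {
    local minutes=$1 margin=${2:-20}
    local total=$(( minutes * (100 + margin) / 100 ))
    printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

# Upper bounds from the table, per segment, in minutes:
# path-search (GSM) 30 + tsopt (RS-I-RFO) 60 + irc 30 + freq 30
per_segment=$(( 30 + 60 + 30 + 30 ))

# Two segments, no DFT, 20% margin.
to_walltime $(( 2 * per_segment )) 20   # prints 06:00:00
```

The result can be pasted into the walltime=<HH:MM:SS> or --time=<HH:MM:SS> placeholder of the templates above.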

CPU vs GPU choice

| Workload | CPU | GPU |
|---|---|---|
| MLIP inference (any backend) | Possible but ~50–200× slower | Required for production |
| mlmm dft with ωB97M-V on > 200 atoms | Slow (10–100 h) | Recommended |
| mlmm dft with cheap functional / small molecule | Fine | Marginal speedup |
| Hessian (analytical, UMA) | OK if VRAM-limited | Faster |

Check mlmm-install-backends/dft.md for --engine gpu / cpu specifics, including the aarch64 caveat (CPU PySCF only).
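
A job script can make the hardware part of this choice at run time. The sketch below picks a value for --engine from what the node offers; `pick_engine` is a hypothetical helper, and the rule it encodes (CPU on aarch64, GPU whenever nvidia-smi lists a device) is a simplification of the table above that ignores system size and functional cost.

```shell
#!/usr/bin/env bash
# pick_engine [ARCH]: print "gpu" or "cpu".
# ARCH defaults to the current machine; pass it explicitly to test the logic.
# Illustrative helper only; it is not part of mlmm-toolkit.
pick_engine() {
    local arch=${1:-$(uname -m)}
    # aarch64 caveat from the text: CPU PySCF only.
    if [ "$arch" = "aarch64" ]; then
        echo cpu
        return 0
    fi
    # Use the GPU only if the driver actually lists a device.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        echo gpu
    else
        echo cpu
    fi
}

# Example use inside a job script (other dft options omitted):
#   mlmm dft --engine "$(pick_engine)" <other options>
```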

Monitoring and control

PBS:

qstat -u "$USER"                 # state: Q (queued), R (running), C (completed; Torque) or F (finished; PBSPro, shown with -x)
qstat -f <jobid>                 # full job info
qdel <jobid>                     # cancel a single job by id

SLURM:

squeue -u "$USER"
scontrol show job <jobid>
scancel <jobid>
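
For scripted monitoring it helps to reduce the full job record to a single state value. The two filters below are a minimal sketch: they read the output of qstat -f or scontrol show job on stdin and print the state. The function names are made up for this example, and the parsing assumes the usual "job_state = X" and "JobState=X" layouts.

```shell
#!/usr/bin/env bash
# pbs_job_state: read `qstat -f <jobid>` output on stdin, print the state letter.
# Assumes a line of the form "    job_state = R".
pbs_job_state() {
    awk -F' = ' '$1 ~ /^[[:space:]]*job_state$/ { print $2; exit }'
}

# slurm_job_state: read `scontrol show job <jobid>` output on stdin,
# print the state name. Assumes a "JobState=RUNNING ..." field.
slurm_job_state() {
    sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Usage:
#   qstat -f <jobid> | pbs_job_state
#   scontrol show job <jobid> | slurm_job_state
```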

Scope cancellation by job-name pattern; never xargs qdel over unfiltered output. When cancelling a batch, filter explicitly:

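# PBS: terminate only jobs whose name matches a pattern
# (the job name is assumed to be the 4th column of `qstat -u`; long names may be truncated)
qstat -u "$USER" | awk -v pat='<pattern>' '$4 ~ pat {print $1}' | xargs -r qdel
# SLURM: --name= takes exact job names, so match the pattern on the %j column instead
squeue -u "$USER" -h -o '%i %j' | awk -v pat='<pattern>' '$2 ~ pat {print $1}' | xargs -r scancel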

qdel / scancel only terminate jobs you own, so they cannot affect other users; the scope warning is to avoid cancelling your own unrelated jobs (e.g. interactive sessions or another campaign).
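
One way to keep the scope visible is to separate selection from cancellation and look at the selected ids before anything is killed. The filter below is a sketch under that idea: `match_jobs` is a hypothetical helper that expects "jobid jobname" pairs on stdin, which is what squeue -h -o '%i %j' produces.

```shell
#!/usr/bin/env bash
# match_jobs PATTERN: read "jobid jobname" lines on stdin and print the ids
# whose name matches the regular expression PATTERN.
# Illustrative helper; it never cancels anything itself.
match_jobs() {
    awk -v pat="$1" '$2 ~ pat { print $1 }'
}

# Dry run: list what would be cancelled.
#   squeue -u "$USER" -h -o '%i %j' | match_jobs '^ts_'
# Cancel once the list looks right.
#   squeue -u "$USER" -h -o '%i %j' | match_jobs '^ts_' | xargs -r scancel
```

Anchoring the pattern (for example '^ts_') avoids matching unrelated jobs that merely contain the same substring.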

Failed jobs / restart

mlmm all does not auto-resume (there is no --resume flag); re-running it creates a fresh result_all/. Several stages can be continued manually:

  • tsopt, freq, irc, dft: re-run the stage, using the previous output as input.
  • path-search: pass the partial mep.pdb as -i.

For walltime-truncated jobs, write the per-stage outputs to a persistent location and resume from the last completed stage.
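
A lightweight way to do this is to record a marker file after each stage and skip stages whose marker already exists when the job is resubmitted. The wrapper below is a sketch of that convention only: `run_stage`, `STAGE_DIR`, and the .done files are invented for this example, and mlmm-toolkit itself does not create or read them.

```shell
#!/usr/bin/env bash
# run_stage NAME CMD [ARGS...]
# Run CMD unless NAME has already completed; record success in a marker file.
# STAGE_DIR should point at persistent storage (not node-local scratch).
run_stage() {
    local name=$1
    shift
    local marker="${STAGE_DIR:-.}/${name}.done"
    if [ -e "$marker" ]; then
        echo "skip ${name}"
        return 0
    fi
    # The marker is written only if the command exits successfully,
    # so a walltime kill or a crash leaves the stage marked as pending.
    "$@" && touch "$marker"
}

# Usage in a job script (stage options omitted):
#   STAGE_DIR=/path/to/persistent/dir
#   run_stage tsopt mlmm tsopt <options>
#   run_stage irc   mlmm irc   <options>
```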

Parallel job submission patterns

Fan-out (one job per task)

for ts in seg_*.pdb; do
    jobid=$(qsub -v TS="$ts" generic_dft.sh)
    echo "submitted $ts as $jobid"
done

Each qsub produces an independent PBS job; the scheduler load-balances them.
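
The same fan-out can be written for SLURM. The sketch below wraps the loop in a function; `submit_all` is a name chosen for this example, and generic_dft.sh is assumed to read the TS variable just as in the PBS loop above. It relies on sbatch --parsable (print only the job id) and --export (pass a variable to the job).

```shell
#!/usr/bin/env bash
# submit_all SCRIPT FILE...
# Submit one SLURM job per input file, passing the file name as $TS.
# Illustrative helper; SCRIPT is expected to use "$TS" as its input.
submit_all() {
    local script=$1
    shift
    local f jobid
    for f in "$@"; do
        # Skip unmatched globs such as a literal "seg_*.pdb".
        [ -e "$f" ] || continue
        jobid=$(sbatch --parsable --export=ALL,TS="$f" "$script")
        echo "submitted $f as $jobid"
    done
}

# Usage:
#   submit_all generic_dft.sh seg_*.pdb
```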

Dynamic dispatch (one job, N nodes pull tasks)

When you have many short tasks and want to amortize the queue wait, use the flock + pbsdsh pattern documented in dynamic-dispatch.md. One qsub grabs N nodes, and each node runs a worker that pulls tasks from a shared list, using a file-lock-protected counter increment to claim the next one.
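
The core of the pattern is small enough to sketch here. This is not the template from dynamic-dispatch.md, only an illustration of a lock-protected counter: `next_task` and `worker` are hypothetical names, the task list is assumed to hold one non-empty task per line, and flock is assumed to behave correctly on the shared filesystem (which depends on how it is mounted).

```shell
#!/usr/bin/env bash
# next_task LIST COUNTER: atomically claim the next line of LIST.
# Prints the claimed line, or nothing once the list is exhausted.
next_task() {
    local list=$1 counter=$2 n=0
    {
        flock -x 9                        # exclusive lock held on fd 9
        [ -s "$counter" ] && n=$(cat "$counter")
        echo $(( n + 1 )) > "$counter"    # claim 0-based index n
    } 9>>"${counter}.lock"                # lock released when fd 9 closes
    sed -n "$(( n + 1 ))p" "$list"
}

# worker LIST COUNTER CMD [ARGS...]: keep claiming tasks until none remain,
# running CMD ARGS... TASK for each one.
worker() {
    local list=$1 counter=$2 task
    shift 2
    while task=$(next_task "$list" "$counter") && [ -n "$task" ]; do
        "$@" "$task"
    done
}

# Usage, one worker per node (e.g. launched through pbsdsh):
#   worker tasks.txt tasks.counter bash run_one.sh
```

Because every worker pulls the next index only when it is free, long and short tasks balance out without any up-front partitioning.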

Useful environment variables

| Variable | Purpose |
|---|---|
| PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | Reduce torch memory fragmentation |
| CUDA_VISIBLE_DEVICES=0 | Restrict to a single GPU per worker |
| OMP_NUM_THREADS=<NCPU> | Limit OpenMP threads (avoid oversubscription) |
| MKL_NUM_THREADS=<NCPU> | Intel MKL thread cap |
| LD_LIBRARY_PATH=<torch lib>:... | Override system CUDA libs (see env-cuda.md) |
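
Rather than hard-coding <NCPU> a second time, the thread caps can be derived from what the scheduler granted. The sketch below assumes the usual scheduler variables (SLURM_CPUS_PER_TASK under SLURM when --cpus-per-task is given, PBS_NUM_PPN under Torque, NCPUS under PBSPro) and falls back to 1; `detect_ncpu` is a name chosen for this example.

```shell
#!/usr/bin/env bash
# detect_ncpu: print the CPU count granted to this job.
# Checks SLURM, then Torque, then PBSPro variables; defaults to 1.
# Which variables are set depends on the scheduler and its configuration.
detect_ncpu() {
    echo "${SLURM_CPUS_PER_TASK:-${PBS_NUM_PPN:-${NCPUS:-1}}}"
}

ncpu=$(detect_ncpu)
export OMP_NUM_THREADS="$ncpu"   # cap OpenMP threads
export MKL_NUM_THREADS="$ncpu"   # cap Intel MKL threads
```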

ssh-based remote submission

ssh-based submission is generally avoided in shared, distributable skills because it depends on per-user ssh configuration. If your cluster requires ssh <login> qsub, add that as a wrapper around the PBS / SLURM command above; do not embed it inside the skill template.

See also

  • dynamic-dispatch.md — flock + pbsdsh template for many short tasks.
  • mlmm-env-detect/SKILL.md — discover queue / module / env values for the placeholders above.
  • mlmm-install-backends/env-cuda.md — driver / torch CUDA pairing.
  • mlmm-cli/all.md — the typical workload submitted to HPC.
Install via CLI
npx skills add https://github.com/t-0hmura/mlmm_toolkit --skill mlmm-hpc