---
name: hpc-submit
description: Generates HPC job-submission scripts (SLURM, PBS/Torque, SGE/UGE, LSF) from a Python/R/Julia script with sensible resource estimates, partition selection, module loading, and container setup. Use when the user asks "how do I submit this to SLURM", "write me a sbatch script", "I need to run this on the cluster", "what resources should I request", or describes a script they want to run at scale.
short_desc: "HPC job script: SLURM/PBS/SGE/LSF generators"
keywords: [SLURM, sbatch, PBS, Torque, LSF, SGE, srun, qsub]
model: opus
effort: high
allowed-tools: Read, Write, Bash, Glob
---
# HPC Submit (Opus)
Purpose: Take a user's compute script and produce a working submission script for the relevant HPC scheduler, with resource requests calibrated to the workload, environment setup that matches typical cluster conventions, and the gotchas that bite first-time users.
Model: Opus 4.7 — straightforward generation task that benefits from broad knowledge of scheduler dialects.
When to invoke autonomously:
- The user mentions SLURM, sbatch, qsub, bsub, "the cluster", "HPC", a known partition name, or a queue manager.
- The user shares a script and says "I need to run this 1000 times" or "this needs more memory than my laptop".
- The user asks about resource estimation, GPU partitions, or array jobs.
Do NOT invoke for:
- Local-machine parallelism (use `@coder` with joblib/dask).
- Cloud-batch services (AWS Batch, GCP Batch): different conventions, ask first.
## Usage
```
/hpc-submit Run train.py on 4 GPUs, 4 hours wall, 32 CPU cores, conda env pytorch24
/hpc-submit Sbatch script for snakemake pipeline, dynamic resource per rule
/hpc-submit Array job processing 480 FASTQ files through bwa+samtools, each ~1 hr
/hpc-submit How much memory should I request for this xarray pipeline?
```
## What This Skill Does
### 1. Scheduler Identification
Ask once (then remember) which scheduler is in play. Differences matter:
| Scheduler | Submit cmd | Script header | Status | Cancel | Notes |
|---|---|---|---|---|---|
| SLURM | `sbatch` | `#SBATCH` | `squeue -u $USER` | `scancel <jobid>` | Most common in 2026 academic HPC |
| PBS / Torque / OpenPBS | `qsub` | `#PBS` | `qstat -u $USER` | `qdel <jobid>` | Older but still common (NCAR, some clusters) |
| SGE / UGE / Grid Engine | `qsub` | `#$` | `qstat -u $USER` | `qdel <jobid>` | Legacy; declining |
| LSF | `bsub <` | `#BSUB` | `bjobs` | `bkill <jobid>` | IBM-shop clusters |
If unsure, run `which sbatch qsub bsub 2>/dev/null` or `module avail 2>&1 | head -5` on the cluster to detect. Many sites maintain user docs at `/etc/motd` or `~/cluster-docs/`.
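The detection step can be wrapped in a small function. This is a minimal sketch, not a definitive test: `detect_scheduler` is a hypothetical helper name, and using `SGE_ROOT` to tell Grid Engine's `qsub` from PBS's `qsub` is a heuristic that assumes the site exports that variable.

```bash
# Heuristic scheduler detection; prints one of: slurm, lsf, sge, pbs, unknown.
detect_scheduler() {
    if command -v sbatch >/dev/null 2>&1; then
        echo slurm
    elif command -v bsub >/dev/null 2>&1; then
        echo lsf
    elif command -v qsub >/dev/null 2>&1; then
        # qsub is shared by PBS and Grid Engine; SGE_ROOT is normally set on SGE/UGE hosts.
        if [ -n "${SGE_ROOT:-}" ]; then echo sge; else echo pbs; fi
    else
        echo unknown
    fi
}

detect_scheduler
```

When the result is `unknown` (for example on a laptop), the skill should fall back to asking the user.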
### 2. Resource Estimation Heuristics
The skill estimates resources rather than guessing. Defaults below; refine when the user provides actual measurements (/usr/bin/time -v, nvidia-smi, htop).
CPU cores: number of independent parallel tasks. For NumPy/PyTorch matrix ops, more cores ≠ more speed past ~8-16 unless the workload was specifically parallelised (OpenMP/MKL/joblib). Default: 1 core for single-threaded scripts, 8 for vectorised numerical, 16 for multi-process pipelines. Set OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS to match --cpus-per-task to prevent thread thrashing.
Memory: peak RSS of the process. For Python with pandas/numpy/scikit-learn:
- Tabular data: peak memory ≈ 3-5× the input file size (for parsing + working copies).
- DataFrame `pd.read_csv` of $X$ GB → request $4X$ + 4 GB headroom.
- xarray/zarr with `chunks` set: memory is bounded by chunk size × number of workers.
- PyTorch training: model parameters in float32 → 4 bytes/param; plus gradients (4 bytes/param), plus optimizer state (Adam: 8 bytes/param), plus activations (workload-dependent, often dominant). Mixed precision halves this for the forward pass, but optimizer state is still float32.
- If unsure, ask the user to run `/usr/bin/time -v python script.py` locally on a subset and report "Maximum resident set size".
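The arithmetic behind these rules of thumb can be sketched in shell. Both function names are hypothetical, and the default of 16 bytes per parameter is an assumption (float32 weights 4 + gradients 4 + Adam moments 8); activations are excluded and have to be measured.

```bash
# Pandas rule of thumb: 4x the CSV size in GB plus 4 GB headroom.
csv_mem_gb() {
    echo $(( 4 * $1 + 4 ))
}

# Training-state memory in GiB, rounded up, excluding activations.
# $1 = parameter count; $2 = bytes per parameter (default 16, an assumption:
# float32 weights 4 + gradients 4 + Adam moments 8).
train_state_gib() {
    params=$1
    bytes_per_param=${2:-16}
    gib=1073741824
    echo $(( (params * bytes_per_param + gib - 1) / gib ))
}

csv_mem_gb 6                 # 6 GB CSV -> prints 28
train_state_gib 1000000000   # 1B parameters -> prints 15
```

These figures are a starting point for the first request only; a measured "Maximum resident set size" should replace them once available.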
Wall time: estimate as 2-5× a small-scale benchmark, plus 10% safety margin. Most schedulers kill jobs at the wall limit with no warning; mid-job checkpointing is mandatory for long jobs.
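As a worked example of the wall-time rule, the sketch below multiplies a projected runtime by a safety factor, adds the 10% margin, and formats the result for `--time`. `walltime_request` is a hypothetical helper, and reading the rule as "projection × factor × 1.10" is an interpretation of the heuristic above.

```bash
# $1 = projected full-run time in seconds (extrapolated from a small benchmark)
# $2 = safety factor (2-5); a further 10% margin is added on top.
walltime_request() {
    total=$(( $1 * $2 * 110 / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

walltime_request 1200 3   # 20-minute projection, factor 3 -> prints 01:06:00
```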
GPUs: request as --gres=gpu:N (SLURM) or scheduler-specific. Match GPU model to needs (A100/H100/A6000/etc.). Per-GPU memory: list the typical VRAM in the cluster's docs. For inference, often 1 GPU suffices; for distributed training, request 2-8 on one node and use NCCL (or 16-64 across nodes with appropriate scheduler topology hints).
### 3. SLURM Template (Most Common)
```bash
#!/bin/bash
#SBATCH --job-name=myrun
#SBATCH --output=logs/%x_%j.out        # %x=jobname, %j=jobid
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=04:00:00                # HH:MM:SS or D-HH:MM
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G                      # per node; or --mem-per-cpu=4G
#SBATCH --partition=gpu                # cluster-specific
#SBATCH --gres=gpu:a100:1              # 1 A100 GPU; check cluster syntax
#SBATCH --mail-user=user@example.edu
#SBATCH --mail-type=END,FAIL

# Optional: account / QoS / nodelist if the site requires them
# #SBATCH --account=lab_pi
# #SBATCH --qos=normal

set -euo pipefail
# logs/ must already exist at submit time (SLURM opens the log files before
# this script runs); this mkdir only protects later submissions.
mkdir -p logs

echo "Job $SLURM_JOB_ID started on $(hostname) at $(date)"
echo "Working directory: $(pwd)"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"

# --- Environment setup ---
# Prefer pixi / conda / module: pick one and stick to it
module purge
module load cuda/12.4 cudnn/9.0        # cluster-specific names
source ~/miniforge3/etc/profile.d/conda.sh
conda activate pytorch24

# Pin thread counts to requested CPUs to avoid oversubscription
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK

# --- Reproducibility ---
export PYTHONHASHSEED=0
export CUBLAS_WORKSPACE_CONFIG=:4096:8 # for deterministic cuBLAS

# --- Run ---
srun --unbuffered python train.py \
    --config configs/run.yaml \
    --output-dir /scratch/$USER/$SLURM_JOB_ID

echo "Job $SLURM_JOB_ID finished at $(date)"
```
Notes the skill emphasises:
- `set -euo pipefail`: fail fast on errors, undefined vars, pipe failures.
- `srun` inside `sbatch` ensures correct task accounting and proper signal propagation on timeout.
- `--unbuffered` (or `python -u`) so stdout reaches log files in real time. `srun --unbuffered` only stops srun from buffering; Python still block-buffers when stdout is not a terminal, so `python -u` or `PYTHONUNBUFFERED=1` is the safer choice.
- Use `$TMPDIR` or `/scratch/$USER/$SLURM_JOB_ID` for fast local I/O; tmpfs is purged when the job ends.
- Stage data from the network filesystem to local scratch at job start, and copy results back at the end.
- For Python: avoid direct `python script.py` invocation in scripts where you might be sourcing from a different env later; use `$(which python) script.py` or the full path.
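The stage-in / compute / stage-out pattern from the notes above can be sketched as follows. `copy_tree` is a hypothetical helper, and the temporary directories stand in for the submit directory and the scratch path so the sketch runs anywhere. The `EXIT` trap copies results back even when the compute step fails, although it cannot run if the scheduler kills the job outright with SIGKILL.

```bash
# copy_tree SRC DEST: copy the contents of SRC into DEST, creating DEST.
copy_tree() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}

# Temporary directories stand in for real cluster paths in this demo.
project=$(mktemp -d)            # stands in for $SLURM_SUBMIT_DIR
scratch=$(mktemp -d)            # stands in for /scratch/$USER/$SLURM_JOB_ID
mkdir -p "$project/data" && echo "input" > "$project/data/in.txt"

(
    # Copy results back on exit, even if the compute step fails.
    trap 'copy_tree "$scratch/results" "$project/results"' EXIT
    copy_tree "$project/data" "$scratch/data"          # stage in
    mkdir -p "$scratch/results"
    # Placeholder compute step: upper-case the input file.
    tr a-z A-Z < "$scratch/data/in.txt" > "$scratch/results/out.txt"
)
cat "$project/results/out.txt"
```

On a real cluster `rsync -a` is often preferred over `cp -R` for large trees, if it is installed.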
### 4. SLURM Array Jobs (Embarrassingly Parallel)
For processing 1000 inputs, do not submit 1000 jobs — submit one array of 1000 tasks. Schedulers handle this efficiently.
```bash
#!/bin/bash
#SBATCH --job-name=align
#SBATCH --output=logs/%x_%A_%a.out   # %A=array job ID, %a=task ID
#SBATCH --array=0-479%50             # 480 tasks, max 50 concurrent
#SBATCH --time=01:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
set -euo pipefail

# Read the input for this task from a manifest file (one sample per line;
# task IDs are 0-based, sed line numbers are 1-based)
SAMPLE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" samples.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $SAMPLE"

bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa \
    "fastq/${SAMPLE}_R1.fq.gz" "fastq/${SAMPLE}_R2.fq.gz" \
  | samtools sort -@ "$SLURM_CPUS_PER_TASK" -o "aligned/${SAMPLE}.bam" -
```
Use %N (e.g. %50) to throttle concurrency — full burst of 1000 simultaneous jobs can hammer shared filesystems and earn complaints from sysadmins.
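The manifest lookup and the `--array` range have to agree, so it can help to derive both from the same file. A minimal sketch, assuming one sample name per line; `array_spec` and `sample_for_task` are hypothetical helpers and the sample names are made up.

```bash
# array_spec MANIFEST THROTTLE: print an --array value covering every line.
array_spec() {
    n=$(wc -l < "$1")
    echo "0-$(( n - 1 ))%$2"
}

# sample_for_task MANIFEST TASK_ID: map a 0-based task ID to a manifest line.
sample_for_task() {
    sed -n "$(( $2 + 1 ))p" "$1"
}

manifest=$(mktemp)
printf '%s\n' sampleA sampleB sampleC > "$manifest"   # made-up sample names
array_spec "$manifest" 50        # prints 0-2%50
sample_for_task "$manifest" 0    # prints sampleA
```

The range can then be passed on the command line, e.g. `sbatch --array="$(array_spec samples.txt 50)" align.sh`, where `align.sh` is a placeholder script name.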
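### 5. PBS / Torque
```bash
#!/bin/bash
#PBS -N myrun
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1   # PBS Pro/OpenPBS syntax; Torque uses -l nodes=1:ppn=8
#PBS -l walltime=04:00:00
#PBS -q gpu
#PBS -o logs/myrun.out
#PBS -e logs/myrun.err
#PBS -m abe -M user@example.edu
cd "$PBS_O_WORKDIR"
# ... rest as SLURM
```
Array (PBS Pro/OpenPBS): `#PBS -J 0-479` and `$PBS_ARRAY_INDEX`. Torque uses `#PBS -t 0-479` and `$PBS_ARRAYID` instead.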
### 6. SGE / UGE
```bash
#!/bin/bash
#$ -N myrun
#$ -cwd
#$ -pe smp 8                 # parallel environment + slots
#$ -l h_rt=04:00:00
#$ -l mem_free=32G           # some sites count this per slot; check site docs
#$ -q gpu.q
#$ -o logs/myrun.out
#$ -e logs/myrun.err
#$ -M user@example.edu -m beas
```
Array: `#$ -t 1-480` and `$SGE_TASK_ID` (1-indexed).
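Because the array variables differ in name and base across schedulers, a script meant to run on more than one of them can normalise the index first. A minimal sketch covering the three array variables named so far; `task_index` is a hypothetical helper, and it assumes the SLURM and PBS ranges start at 0 as in the examples above.

```bash
# Print a 0-based task index regardless of scheduler; fail outside an array job.
task_index() {
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo "$SLURM_ARRAY_TASK_ID"                 # as given in --array=0-479
    elif [ -n "${PBS_ARRAY_INDEX:-}" ]; then
        echo "$PBS_ARRAY_INDEX"                     # as given in -J 0-479
    elif [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
        echo $(( $SGE_TASK_ID - 1 ))                # -t 1-480 is 1-based
    else
        echo "not running inside an array job" >&2
        return 1
    fi
}

# Simulate SGE task 1; prints 0 when no SLURM/PBS array variable is set.
SGE_TASK_ID=1 task_index
```

Grid Engine sets `SGE_TASK_ID` to the literal string `undefined` for non-array jobs, which is why that value is rejected explicitly.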
### 7. LSF
```bash
#!/bin/bash
#BSUB -J myrun
#BSUB -n 8                   # cores
#BSUB -R "rusage[mem=32G]"
#BSUB -W 04:00               # HH:MM
#BSUB -q gpu
#BSUB -gpu "num=1"
#BSUB -o logs/myrun.%J.out
#BSUB -e logs/myrun.%J.err
```
### 8. Snakemake / Nextflow on HPC
Both have first-class scheduler profiles — generate a config rather than wrapping the engine in sbatch.
Snakemake: `--profile slurm` reads from `~/.config/snakemake/slurm/config.yaml`. The `cluster:` key below is the Snakemake 7 (and earlier) form; Snakemake 8+ replaced it with executor plugins such as `executor: slurm`.
```yaml
cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --partition={resources.partition}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.runtime}
    --job-name={rule}
    --output=logs/{rule}/{wildcards}.out
default-resources:
  - partition=cpu
  - mem_mb=4000
  - runtime=60
restart-times: 1
max-jobs-per-second: 10
max-status-checks-per-second: 1
jobs: 100
keep-going: true
rerun-incomplete: true
use-conda: true
```
Then: `snakemake --profile slurm`.
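Nextflow: `nextflow.config`:
```groovy
process {
    executor = 'slurm'
    queue = 'cpu'
    cpus = 8
    memory = '32 GB'
    time = '4h'
    withName: 'ALIGN' {
        queue = 'gpu'
        accelerator = 1
        memory = '64 GB'
    }
}
executor {
    queueSize = 100
    submitRateLimit = '10 sec'
}
```
Then: `nextflow run main.nf`. The settings above sit at the top level of `nextflow.config`, so they apply without a `-profile` flag; to select them with `-profile slurm`, wrap them in a `profiles { slurm { ... } }` block.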
### 9. Containers (Apptainer / Singularity)
For environment portability and reproducibility, prefer containers over module-loading on the host:
```bash
#SBATCH ... (resource lines as before)
module load apptainer   # or singularity on older sites
apptainer exec --nv \
    --bind /scratch/$USER:/scratch \
    --bind /data:/data:ro \
    /shared/containers/pytorch_24.10.sif \
    python train.py --output /scratch/$SLURM_JOB_ID
```
`--nv` exposes NVIDIA GPUs. `--bind` mounts host paths into the container. With the first bind above, `/scratch/$SLURM_JOB_ID` inside the container corresponds to `/scratch/$USER/$SLURM_JOB_ID` on the host.
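Since some sites ship `apptainer` and older ones only `singularity`, a script can pick whichever is on `PATH`. A minimal sketch; `container_runtime` is a hypothetical helper, and the echoed command is illustrative only (it is printed, not executed).

```bash
# Print the first container runtime found on PATH; fail if neither exists.
container_runtime() {
    for rt in apptainer singularity; do
        if command -v "$rt" >/dev/null 2>&1; then
            echo "$rt"
            return 0
        fi
    done
    echo "neither apptainer nor singularity found; try 'module load apptainer'" >&2
    return 1
}

if RUNTIME=$(container_runtime); then
    echo "would run: $RUNTIME exec --nv <image.sif> python train.py"
fi
```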
### 10. Common Mistakes the Skill Catches
| Mistake | Symptom | Fix |
|---|---|---|
| Requesting more memory than the node has | Job pends forever | Check `sinfo -o "%P %m %c %G"` for partition specs |
| Wall time too short | Job killed mid-run | Estimate from a small benchmark × 3 |
| Wall time too long | Long queue wait | Most schedulers prefer accurate estimates; partitions often penalise oversized requests |
| Forgetting `srun` inside `sbatch` for distributed jobs | Tasks share one rank, no parallelism | Wrap with `srun` |
| Not pinning `OMP_NUM_THREADS` | Thread oversubscription, slower than serial | Set to `$SLURM_CPUS_PER_TASK` |
| Writing to home from many tasks | Filesystem hangs | Use node-local `$TMPDIR` or scratch |
| Running 1000 separate jobs instead of an array | Hits per-user job limit | Use `--array` |
| `module load` after `conda activate` | Module wipes path | Always `module` before `conda activate` |
| Hardcoded paths in scripts | Breaks on the cluster | Use `$HOME`, `$SCRATCH`, `$SLURM_SUBMIT_DIR` |
| Forgetting to redirect stderr | Logs missing errors | `#SBATCH --error=...` or combine via `--output` |
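The first row of the table can be checked mechanically before submission. The sketch below parses `sinfo -h -o "%P %m %c %G"` output from stdin; `fits_partition` is a hypothetical helper and the sample text is made up so that the example runs without a cluster.

```bash
# fits_partition PARTITION REQUEST_MB: read `sinfo -h -o "%P %m %c %G"` lines on
# stdin; succeed if a node group in PARTITION has at least REQUEST_MB of memory.
fits_partition() {
    awk -v part="$1" -v req="$2" '
        {
            name = $1; sub(/\*$/, "", name)    # default partition is marked with *
            mem = $2;  sub(/\+$/, "", mem)     # trailing + means "at least this much"
            if (name == part && mem + 0 >= req + 0) found = 1
        }
        END { exit !found }
    '
}

# Made-up sample of sinfo output: partition, MB per node, CPUs, GRES.
sample='cpu* 192000 48 (null)
gpu 512000 64 gpu:a100:4'

echo "$sample" | fits_partition gpu 32000  && echo "gpu: 32 GB fits"
echo "$sample" | fits_partition cpu 256000 || echo "cpu: 256 GB exceeds node memory"
```

On a cluster the sample would be replaced by the live command: `sinfo -h -o "%P %m %c %G" | fits_partition gpu 32000`.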
## Output Format
## HPC Submission Script
**Scheduler detected**: [SLURM | PBS | SGE | LSF | UNKNOWN — please confirm]
**Cluster docs** (if known): [link]
### Resource estimate
| Resource | Requested | Rationale |
|---|---|---|
| Cores | [N] | [why] |
| Memory | [X GB] | [why — based on data size, model size, etc.] |
| Wall time | [HH:MM] | [why — estimate from benchmark or model] |
| GPUs | [type × N] | [if applicable] |
| Partition / queue | [name] | [why] |
### Submission script
```bash
[full script with comments]
```
### How to submit
```bash
mkdir -p logs
sbatch submit.sh
squeue -u $USER   # check status
```
### Notes / gotchas
- [scheduler-specific notes]
- [environment-setup notes]
- [common-failure-mode notes]
### Iteration tips
- Start with a 30-minute test run on a small input subset.
- Monitor real resource use: `seff <jobid>` (SLURM) or `qstat -f <jobid>` (PBS).
- Adjust the script down to actual usage + 20% headroom.
### If you don't know the partition/queue names
Run on the cluster: `sinfo -o "%P %m %c %G"` (SLURM) or `qstat -Q` (PBS) or `bqueues` (LSF).
## Hard Rules
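1. **Always run `mkdir -p logs`** before `sbatch`: most schedulers fail silently if the log directory doesn't exist, and SLURM opens the log files before the script body runs, so a `mkdir` inside the script comes too late.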
2. **Always pin `OMP_NUM_THREADS`** to `$SLURM_CPUS_PER_TASK` (or equivalent) for any numpy/scipy/torch workload.
3. **Always use `set -euo pipefail`** in the submission script body.
4. **Always log job ID and hostname** at start; date at start and end.
5. **Never recommend running 100+ jobs without an array** — it's bad citizenship and often hits per-user limits.
6. **Always recommend a 5-30 minute test run before full submission** for any pipeline the user hasn't used before.
7. **Refuse to guess at site-specific things** like partition names, account codes, GPU types — ask, or insert `# TODO: confirm with cluster docs` markers.
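Several of these rules can be checked with plain `grep` before a script is handed to the user. This is a minimal sketch rather than a complete linter: `lint_submit_script` is a hypothetical helper, it covers rules 2-4 only, and the string matches are deliberately naive.

```bash
# lint_submit_script FILE: print one line per missing item; fail if any is missing.
lint_submit_script() {
    status=0
    grep -q 'set -euo pipefail' "$1" || { echo "missing: set -euo pipefail"; status=1; }
    grep -q 'OMP_NUM_THREADS'   "$1" || { echo "missing: OMP_NUM_THREADS pin"; status=1; }
    grep -q 'hostname'          "$1" || { echo "missing: hostname in start log"; status=1; }
    return $status
}

# Demo on a throwaway script that forgets to pin OMP_NUM_THREADS.
script=$(mktemp)
printf '%s\n' '#!/bin/bash' 'set -euo pipefail' 'echo "Job on $(hostname)"' > "$script"
lint_submit_script "$script" || echo "fix the items above before submitting"
```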
## Integration with Knowledge Graph
Leans on [[Reproducible Research Workflows]] for the broader context of pipelines + environments + workflow engines. After helping a user, if the cluster has unusual quirks worth recording, write a per-project KG node `knowledge/concepts/hpc-<cluster-name>.md` capturing partition names, module conventions, scratch paths, and filesystem etiquette.
## Success Criteria
- Script syntactically correct for the scheduler.
- Resource requests are estimates, not random numbers, with rationale stated.
- Environment setup is explicit and reproducible.
- Thread counts are pinned to requested CPUs.
- Local scratch is used for I/O-heavy steps.
- The user can submit, monitor, cancel, and read logs with the commands provided.
- Common gotchas relevant to their workflow are called out.