hpc-submit


Generates HPC job-submission scripts (SLURM, PBS/Torque, SGE/UGE, LSF) from a Python/R/Julia script with sensible resource estimates, partition selection, module loading, and container setup. Use when the user asks "how do I submit this to SLURM", "write me a sbatch script", "I need to run this on the cluster", "what resources should I request", or describes a script they want to run at scale.

By hotak92 · Updated 5/23/2026

---
name: hpc-submit
description: Generates HPC job-submission scripts (SLURM, PBS/Torque, SGE/UGE, LSF) from a Python/R/Julia script with sensible resource estimates, partition selection, module loading, and container setup. Use when the user asks "how do I submit this to SLURM", "write me a sbatch script", "I need to run this on the cluster", "what resources should I request", or describes a script they want to run at scale.
short_desc: HPC job script: SLURM/PBS/SGE/LSF generators
keywords: [SLURM, sbatch, PBS, Torque, LSF, SGE, srun, qsub]
model: opus
effort: high
allowed-tools: Read, Write, Bash, Glob
---

HPC Submit (Opus)

Purpose: Take a user's compute script and produce a working submission script for the relevant HPC scheduler, with resource requests calibrated to the workload, environment setup that matches typical cluster conventions, and the gotchas that bite first-time users.

Model: Opus 4.7 — straightforward generation task that benefits from broad knowledge of scheduler dialects.

When to invoke autonomously:

  • The user mentions SLURM, sbatch, qsub, bsub, "the cluster", "HPC", a known partition name, or a queue manager.
  • The user shares a script and says "I need to run this 1000 times" or "this needs more memory than my laptop".
  • The user asks about resource estimation, GPU partitions, or array jobs.

Do NOT invoke for:

  • Local-machine parallelism (use @coder with joblib/dask).
  • Cloud-batch services (AWS Batch, GCP Batch) — different conventions, ask first.

Usage

/hpc-submit Run train.py on 4 GPUs, 4 hours wall, 32 CPU cores, conda env pytorch24
/hpc-submit Sbatch script for snakemake pipeline, dynamic resource per rule
/hpc-submit Array job processing 480 FASTQ files through bwa+samtools, each ~1 hr
/hpc-submit How much memory should I request for this xarray pipeline?

What This Skill Does

1. Scheduler Identification

Ask once (then remember) which scheduler is in play. Differences matter:

| Scheduler | Submit cmd | Script header | Status | Cancel | Notes |
|---|---|---|---|---|---|
| SLURM | `sbatch` | `#SBATCH` | `squeue -u $USER` | `scancel <jobid>` | Most common in 2026 academic HPC |
| PBS / Torque / OpenPBS | `qsub` | `#PBS` | `qstat -u $USER` | `qdel <jobid>` | Older but still common (NCAR, some clusters) |
| SGE / UGE / Grid Engine | `qsub` | `#$` | `qstat -u $USER` | `qdel <jobid>` | Legacy; declining |
| LSF | `bsub < script.sh` | `#BSUB` | `bjobs` | `bkill <jobid>` | IBM-shop clusters |

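If unsure, run which sbatch qsub bsub 2>/dev/null on the cluster to see which submit commands exist (qsub alone does not distinguish PBS from Grid Engine), and module avail 2>&1 | head -5 to check whether environment modules are in use. Many sites point to their user docs in the login banner (/etc/motd) or in a directory such as ~/cluster-docs/.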
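
The detection step above can be sketched as a small shell function. This is a minimal sketch rather than part of the skill: the name detect_scheduler is hypothetical, and using $SGE_ROOT to tell Grid Engine apart from PBS (both provide qsub) is an assumption that should be confirmed against the site's docs.

```bash
# Hypothetical helper: guess the scheduler from the submit commands on PATH.
detect_scheduler() {
    # Check sbatch first: some SLURM sites also install a qsub compatibility wrapper.
    if command -v sbatch >/dev/null 2>&1; then
        echo SLURM
    elif command -v bsub >/dev/null 2>&1; then
        echo LSF
    elif command -v qsub >/dev/null 2>&1; then
        # Assumption: Grid Engine installs export SGE_ROOT; PBS installs do not.
        if [ -n "${SGE_ROOT:-}" ]; then echo SGE; else echo PBS; fi
    else
        echo UNKNOWN
    fi
}
```

Calling detect_scheduler on a login node prints one of SLURM, LSF, SGE, PBS, or UNKNOWN; an UNKNOWN result is the cue to ask the user rather than guess.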

2. Resource Estimation Heuristics

The skill estimates resources rather than guessing. Defaults below; refine when the user provides actual measurements (/usr/bin/time -v, nvidia-smi, htop).

CPU cores: number of independent parallel tasks. For NumPy/PyTorch matrix ops, more cores ≠ more speed past ~8-16 unless the workload was specifically parallelised (OpenMP/MKL/joblib). Default: 1 core for single-threaded scripts, 8 for vectorised numerical, 16 for multi-process pipelines. Set OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS to match --cpus-per-task to prevent thread thrashing.

Memory: peak RSS of the process. For Python with pandas/numpy/scikit-learn:

  • Tabular data: peak memory ≈ 3-5× the input file size (for parsing + working copies).
  • DataFrame pd.read_csv of $X$ GB → request $4X$ + 4 GB headroom.
  • xarray/zarr with chunks set: memory is bounded by chunk size × number of workers.
  • PyTorch training: model parameters in float32 → 4 bytes/param; plus gradients (4 bytes/param); plus optimizer state (Adam: 8 bytes/param); plus activations (workload-dependent, often dominant). Mixed precision roughly halves activation memory in the forward pass, but optimizer state is still float32.
  • If unsure, ask the user to run /usr/bin/time -v python script.py locally on a subset and report "Maximum resident set size".
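
The rules of thumb in this list reduce to integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 16 bytes/param default adds 4 bytes for gradients to the weight and Adam figures (an assumption about standard float32 training), and the 20% headroom matches the iteration tip later in this document.

```bash
# Hypothetical helpers; inputs are whole numbers, results are rounded up.

# pd.read_csv rule of thumb from the list above: 4x the file size plus 4 GB.
csv_mem_request_gb() {
    file_gb=$1
    echo $(( 4 * file_gb + 4 ))
}

# Weights + gradients + Adam state in float32, excluding activations.
# Default 16 bytes/param = 4 (weights) + 4 (gradients, assumed) + 8 (Adam).
torch_train_state_gb() {
    n_params=$1
    bytes_per_param=${2:-16}
    echo $(( (n_params * bytes_per_param + 999999999) / 1000000000 ))
}

# Read `/usr/bin/time -v` output on stdin; print peak RSS + 20% headroom in GiB.
rss_request_gb() {
    kb=$(awk -F': ' '/Maximum resident set size/ { print $2 }')
    [ -n "$kb" ] || { echo "no 'Maximum resident set size' line found" >&2; return 1; }
    echo $(( (kb * 12 / 10 + 1048575) / 1048576 ))
}
```

For a 3 GB CSV, csv_mem_request_gb 3 prints 16, so --mem=16G is a reasonable first request. Activations are not counted by torch_train_state_gb, so treat its result as a lower bound for training jobs.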

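Wall time: estimate as 2-5× a small-scale benchmark extrapolated to the full workload, plus a 10% safety margin. Most schedulers kill jobs at the wall limit with little or no warning (SLURM sends SIGTERM, then SIGKILL after a short grace period), so mid-job checkpointing is mandatory for long jobs.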
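
The wall-time rule can be turned into a --time value with a few lines of shell. This is a sketch under stated assumptions: walltime_request is a hypothetical name, its first argument is the benchmark runtime in seconds already scaled to the full input, and the default multiplier of 3 is just one point in the 2-5× range.

```bash
# Hypothetical helper: walltime_request <benchmark_seconds> [multiplier]
# Applies the multiplier plus a 10% margin and prints HH:MM:SS.
walltime_request() {
    bench_seconds=$1
    multiplier=${2:-3}
    total=$(( bench_seconds * multiplier * 110 / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

A 10-minute benchmark with the default multiplier, walltime_request 600, prints 00:33:00, which can be passed directly to #SBATCH --time.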

GPUs: request as --gres=gpu:N (SLURM) or scheduler-specific. Match GPU model to needs (A100/H100/A6000/etc.). Per-GPU memory: list the typical VRAM in the cluster's docs. For inference, often 1 GPU suffices; for distributed training, request 2-8 on one node and use NCCL (or 16-64 across nodes with appropriate scheduler topology hints).

3. SLURM Template (Most Common)

#!/bin/bash
#SBATCH --job-name=myrun
#SBATCH --output=logs/%x_%j.out           # %x=jobname, %j=jobid
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=04:00:00                   # HH:MM:SS or D-HH:MM
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G                         # per node; or --mem-per-cpu=4G
#SBATCH --partition=gpu                   # cluster-specific
#SBATCH --gres=gpu:a100:1                 # 1 A100 GPU; check cluster syntax
#SBATCH --mail-user=user@example.edu
#SBATCH --mail-type=END,FAIL
# Optional: account / QoS / nodelist if the site requires them
# #SBATCH --account=lab_pi
# #SBATCH --qos=normal

set -euo pipefail
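# SLURM opens the --output/--error files before this script runs, so logs/
# must already exist at submit time; this line only helps later jobs.
mkdir -p logs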

echo "Job $SLURM_JOB_ID started on $(hostname) at $(date)"
echo "Working directory: $(pwd)"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"

# --- Environment setup ---
# Prefer pixi / conda / module — pick one and stick to it
module purge
module load cuda/12.4 cudnn/9.0          # cluster-specific names
source ~/miniforge3/etc/profile.d/conda.sh
conda activate pytorch24

# Pin thread counts to requested CPUs to avoid oversubscription
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK

# --- Reproducibility ---
export PYTHONHASHSEED=0
export CUBLAS_WORKSPACE_CONFIG=:4096:8     # for deterministic cuBLAS results

# --- Run ---
srun --unbuffered python train.py \
    --config configs/run.yaml \
    --output-dir /scratch/$USER/$SLURM_JOB_ID

echo "Job $SLURM_JOB_ID finished at $(date)"

Notes the skill emphasises:

  • set -euo pipefail — fail fast on errors, undefined vars, pipe failures.
  • srun inside sbatch ensures correct task accounting and proper signal propagation on timeout.
  • --unbuffered (or python -u) so stdout reaches log files in real time.
  • Use $TMPDIR or /scratch/$USER/$SLURM_JOB_ID for fast local I/O; node-local temporary directories are typically purged when the job ends, so copy results out before the script exits.
  • Stage data from network filesystem to local scratch at job start, copy results back at end.
  • For Python: a bare python script.py runs whichever interpreter is first on PATH, which can change if a different env is sourced later. Call the interpreter by its full path, or log $(which python) at job start so the log records which environment ran.
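
The stage-in / run / stage-out pattern from the notes above can be sketched as one function. Everything here is illustrative: stage_and_run is a hypothetical name, the in/ and out/ subdirectory convention is an assumption, and $TMPDIR is assumed to point at node-local storage (the sketch falls back to /tmp).

```bash
# Hypothetical helper: stage_and_run <input_dir> <results_dir> <command...>
# The command runs inside the scratch directory, reads ./in and writes ./out.
stage_and_run() {
    input_dir=$1
    results_dir=$2
    shift 2

    # Assumption: $TMPDIR is node-local; fall back to /tmp otherwise.
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1
    mkdir -p "$work_dir/in" "$work_dir/out" "$results_dir"

    # Stage in: copy inputs from the network filesystem to scratch.
    cp -R "$input_dir"/. "$work_dir/in"/

    # Run the workload; keep its exit status without skipping the copy-back.
    status=0
    ( cd "$work_dir" && "$@" ) || status=$?

    # Stage out, then clean up scratch rather than rely on the purge.
    cp -R "$work_dir/out"/. "$results_dir"/
    rm -rf "$work_dir"
    return "$status"
}
```

Inside a job script this might be called as stage_and_run /data/project/raw results python process.py, where the paths and process.py are placeholders. Copying results back even when the command fails keeps partial output available for debugging.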

4. SLURM Array Jobs (Embarrassingly Parallel)

For processing 1000 inputs, do not submit 1000 jobs — submit one array of 1000 tasks. Schedulers handle this efficiently.

#!/bin/bash
#SBATCH --job-name=align
#SBATCH --array=0-479%50                 # 480 tasks, max 50 concurrent
#SBATCH --output=logs/%x_%A_%a.out       # %A=array job ID, %a=task index
#SBATCH --time=01:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

set -euo pipefail

# Read the input for this task from a manifest file (one sample name per line;
# task IDs start at 0, sed line numbers at 1)
SAMPLE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" samples.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $SAMPLE"

bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa "fastq/${SAMPLE}_R1.fq.gz" "fastq/${SAMPLE}_R2.fq.gz" \
    | samtools sort -@ "$SLURM_CPUS_PER_TASK" -o "aligned/${SAMPLE}.bam" -

Use %N (e.g. %50) to throttle concurrency: a full burst of 1000 simultaneous tasks can hammer shared filesystems and earn complaints from sysadmins.
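
The manifest and the array range have to agree, so it helps to derive both from the input directory. A minimal sketch, assuming paired files named <sample>_R1.fq.gz as in the example above; build_manifest and array_spec are hypothetical names.

```bash
# Hypothetical helper: build_manifest <fastq_dir> <manifest_file>
# Writes one sample name per line, sorted, from the *_R1.fq.gz files.
build_manifest() {
    for f in "$1"/*_R1.fq.gz; do
        [ -e "$f" ] || continue          # glob matched nothing
        basename "$f" _R1.fq.gz
    done | sort > "$2"
}

# Hypothetical helper: array_spec <manifest_file> <max_concurrent>
# Prints a 0-based --array value such as 0-479%50.
array_spec() {
    n=$(wc -l < "$1" | tr -d ' ')
    [ "$n" -gt 0 ] || { echo "manifest is empty: $1" >&2; return 1; }
    echo "0-$(( n - 1 ))%$2"
}
```

Submitting with sbatch --array="$(array_spec samples.txt 50)" align.sh keeps the range in step with the manifest, since options given on the sbatch command line take precedence over the #SBATCH lines in the script.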

5. PBS / Torque

#!/bin/bash
#PBS -N myrun
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=04:00:00
#PBS -q gpu
#PBS -o logs/myrun.out
#PBS -e logs/myrun.err
#PBS -m abe -M user@example.edu

cd "$PBS_O_WORKDIR"
# ... rest as SLURM

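Array: #PBS -J 0-479 and $PBS_ARRAY_INDEX (PBS Pro / OpenPBS); Torque uses #PBS -t 0-479 and $PBS_ARRAYID. The select= resource line above is likewise PBS Pro / OpenPBS syntax; Torque expresses the same CPU and memory request as -l nodes=1:ppn=8,mem=32gb.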

6. SGE / UGE

#!/bin/bash
#$ -N myrun
#$ -cwd
#$ -pe smp 8                              # parallel environment + slots
#$ -l h_rt=04:00:00
#$ -l mem_free=32G                        # site-specific; often counted per slot, and some sites use h_vmem
#$ -q gpu.q
#$ -o logs/myrun.out
#$ -e logs/myrun.err
#$ -M user@example.edu -m beas

Array: #$ -t 1-480 and $SGE_TASK_ID (1-indexed).

7. LSF

#!/bin/bash
#BSUB -J myrun
#BSUB -n 8                                # cores
#BSUB -R "rusage[mem=32G]"                # unit suffix needs a recent LSF; otherwise a bare number in the site's default unit
#BSUB -W 04:00                            # HH:MM
#BSUB -q gpu
#BSUB -gpu "num=1"
#BSUB -o logs/myrun.%J.out
#BSUB -e logs/myrun.%J.err
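
Each scheduler exposes the array index under a different variable and with a different base, which makes shared job bodies error-prone. The sketch below normalises them to a 0-based index. The function names are hypothetical; it assumes SLURM and PBS arrays are submitted 0-based and SGE arrays 1-based as in the sections above, that LSF arrays are 1-based and report through $LSB_JOBINDEX, and it does not handle Torque's $PBS_ARRAYID.

```bash
# Hypothetical helper: print a 0-based task index under any of the four schedulers.
task_index() {
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo "$SLURM_ARRAY_TASK_ID"
    elif [ -n "${PBS_ARRAY_INDEX:-}" ]; then
        echo "$PBS_ARRAY_INDEX"
    elif [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
        echo $(( SGE_TASK_ID - 1 ))      # SGE reports "undefined" outside arrays
    elif [ "${LSB_JOBINDEX:-0}" -gt 0 ]; then
        echo $(( LSB_JOBINDEX - 1 ))
    else
        echo 0                           # not an array job
    fi
}

# Pick this task's line from a manifest with one input per line.
manifest_entry() {
    sed -n "$(( $(task_index) + 1 ))p" "$1"
}
```

With this in place the job body can use SAMPLE=$(manifest_entry samples.txt) regardless of which scheduler launched it.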

8. Snakemake / Nextflow on HPC

Both have first-class scheduler profiles — generate a config rather than wrapping the engine in sbatch.

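Snakemake: --profile slurm reads from ~/.config/snakemake/slurm/config.yaml. The cluster: key shown here is the profile format for Snakemake 7 and earlier; Snakemake 8 and later replaced it with executor plugins (for example executor: slurm), so check the installed version first.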

cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --partition={resources.partition}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.runtime}
    --job-name={rule}
    --output=logs/{rule}/{wildcards}.out
default-resources:
  - partition=cpu
  - mem_mb=4000
  - runtime=60
restart-times: 1
max-jobs-per-second: 10
max-status-checks-per-second: 1
jobs: 100
keep-going: true
rerun-incomplete: true
use-conda: true

Then: snakemake --profile slurm.

Nextflow, in nextflow.config:

process {
    executor = 'slurm'
    queue = 'cpu'
    cpus = 8
    memory = '32 GB'
    time = '4h'

    withName: 'ALIGN' {
        queue = 'gpu'
        accelerator = 1
        memory = '64 GB'
    }
}

executor {
    queueSize = 100
    submitRateLimit = '10 sec'
}

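Then: nextflow run main.nf. The settings above sit at the top level of the config, so they apply without a profile; to select them with -profile slurm, wrap them in a profiles { slurm { ... } } block.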

9. Containers (Apptainer / Singularity)

For environment portability and reproducibility, prefer containers over module-loading on the host:

#SBATCH ... (resource lines as before)

module load apptainer                    # or singularity on older sites

apptainer exec --nv \
    --bind /scratch/$USER:/scratch \
    --bind /data:/data:ro \
    /shared/containers/pytorch_24.10.sif \
    python train.py --output /scratch/$SLURM_JOB_ID

--nv exposes NVIDIA GPUs. --bind mounts host paths into the container.
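
Because the runtime name and GPU availability vary between sites and partitions, a job script can probe for both instead of hardcoding them. This is a sketch only: container_cmd and gpu_flag are hypothetical names, and relying on CUDA_VISIBLE_DEVICES being set for GPU allocations is an assumption about the site's scheduler configuration.

```bash
# Hypothetical helper: print the container runtime available on this node.
container_cmd() {
    if command -v apptainer >/dev/null 2>&1; then
        echo apptainer
    elif command -v singularity >/dev/null 2>&1; then
        echo singularity
    else
        echo "no apptainer or singularity on PATH" >&2
        return 1
    fi
}

# Print --nv only when the scheduler exposed GPUs to the job.
# Assumption: the site sets CUDA_VISIBLE_DEVICES for GPU allocations.
gpu_flag() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then echo "--nv"; fi
}
```

The exec line then becomes "$(container_cmd)" exec $(gpu_flag) followed by the bind options, image path, and command shown above; gpu_flag is left unquoted on purpose so that an empty result adds no argument.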

10. Common Mistakes the Skill Catches

| Mistake | Symptom | Fix |
|---|---|---|
| Requesting more memory than the node has | Job pends forever | Check `sinfo -o "%P %m %c %G"` for partition specs |
| Wall time too short | Job killed mid-run | Estimate from a small benchmark × 3 |
| Wall time too long | Long queue wait | Request the measured runtime plus a margin; backfill scheduling starts shorter requests sooner |
| Forgetting `srun` inside `sbatch` for distributed jobs | Only one process starts, so there is no parallelism | Wrap with `srun` |
| Not pinning `OMP_NUM_THREADS` | Thread oversubscription, slower than serial | Set to `$SLURM_CPUS_PER_TASK` |
| Writing to home from many tasks | Filesystem hangs | Use node-local `$TMPDIR` or scratch |
| Running 1000 separate jobs instead of an array | Hits per-user job limit | Use `--array` |
| `module load` after `conda activate` | Module overrides the PATH that conda set | Always `module load` before `conda activate` |
| Hardcoded paths in scripts | Breaks on the cluster | Use `$HOME`, `$SCRATCH`, `$SLURM_SUBMIT_DIR` |
| Forgetting to redirect stderr | Logs missing errors | `#SBATCH --error=...` or combine via `--output` |
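
Several rows of this table can be checked mechanically before submission. The sketch below is a text-only lint, not the skill's actual checker: lint_submit_script is a hypothetical name, it covers just three of the mistakes, and it matches plain strings, so commented-out lines also count.

```bash
# Hypothetical pre-submit check: lint_submit_script <script>
# Prints one line per problem found; returns non-zero if there were any.
lint_submit_script() {
    script=$1
    problems=0

    grep -q 'set -euo pipefail' "$script" || {
        echo "missing: set -euo pipefail"; problems=$(( problems + 1 )); }

    grep -q 'OMP_NUM_THREADS' "$script" || {
        echo "missing: OMP_NUM_THREADS pin"; problems=$(( problems + 1 )); }

    # Flag any `module load` that appears after the first `conda activate`.
    conda_line=$(grep -n 'conda activate' "$script" | head -n 1 | cut -d: -f1)
    module_line=$(grep -n 'module load' "$script" | tail -n 1 | cut -d: -f1)
    if [ -n "$conda_line" ] && [ -n "$module_line" ] \
        && [ "$module_line" -gt "$conda_line" ]; then
        echo "order: module load after conda activate"
        problems=$(( problems + 1 ))
    fi

    [ "$problems" -eq 0 ]
}
```

Chaining it as lint_submit_script submit.sh && sbatch submit.sh blocks submission until the reported problems are fixed.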

Output Format

## HPC Submission Script

**Scheduler detected**: [SLURM | PBS | SGE | LSF | UNKNOWN — please confirm]
**Cluster docs** (if known): [link]

### Resource estimate

| Resource | Requested | Rationale |
|---|---|---|
| Cores | [N] | [why] |
| Memory | [X GB] | [why — based on data size, model size, etc.] |
| Wall time | [HH:MM] | [why — estimate from benchmark or model] |
| GPUs | [type × N] | [if applicable] |
| Partition / queue | [name] | [why] |

### Submission script

```bash
[full script with comments]
```

### How to submit

```bash
mkdir -p logs
sbatch submit.sh
squeue -u $USER                    # check status
```

### Notes / gotchas

- [scheduler-specific notes]
- [environment-setup notes]
- [common-failure-mode notes]

### Iteration tips

- Start with a 30-minute test run on a small input subset.
- Monitor real resource use: `seff <jobid>` (SLURM) or `qstat -f <jobid>` (PBS).
- Adjust the script down to actual usage + 20% headroom.

### If you don't know the partition/queue names

Run on the cluster: `sinfo -o "%P %m %c %G"` (SLURM) or `qstat -Q` (PBS) or `bqueues` (LSF).


## Hard Rules

1. **Always include `mkdir -p logs`** before `sbatch` — most schedulers fail silently if the log directory doesn't exist.
2. **Always pin `OMP_NUM_THREADS`** to `$SLURM_CPUS_PER_TASK` (or equivalent) for any numpy/scipy/torch workload.
3. **Always use `set -euo pipefail`** in the submission script body.
4. **Always log job ID and hostname** at start; date at start and end.
5. **Never recommend running 100+ jobs without an array** — it's bad citizenship and often hits per-user limits.
6. **Always recommend a 5-30 minute test run before full submission** for any pipeline the user hasn't used before.
7. **Refuse to guess at site-specific things** like partition names, account codes, GPU types — ask, or insert `# TODO: confirm with cluster docs` markers.
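
Rule 1 can be enforced rather than remembered. The following is a minimal sketch, assuming the script uses the `--output=` / `--error=` long form (not `-o` / `-e`) and that the current directory is the submission directory; `ensure_log_dirs` is a hypothetical name.

```bash
# Hypothetical helper: ensure_log_dirs <sbatch_script>
# Creates the directories named in "#SBATCH --output=" / "--error=" lines,
# relative to the current directory.
ensure_log_dirs() {
    grep -E '^#SBATCH[[:space:]]+--(output|error)=' "$1" \
        | sed -E 's/^#SBATCH[[:space:]]+--(output|error)=([^[:space:]]+).*/\2/' \
        | while IFS= read -r log_path; do
              mkdir -p "$(dirname "$log_path")"
          done
}
```

Used as `ensure_log_dirs submit.sh && sbatch submit.sh`, this creates `logs/` for a script containing `#SBATCH --output=logs/%x_%j.out` before the scheduler tries to open the file.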

## Integration with Knowledge Graph

Leans on [[Reproducible Research Workflows]] for the broader context of pipelines + environments + workflow engines. After helping a user, if the cluster has unusual quirks worth recording, write a per-project KG node `knowledge/concepts/hpc-<cluster-name>.md` capturing partition names, module conventions, scratch paths, and filesystem etiquette.

## Success Criteria

- Script syntactically correct for the scheduler.
- Resource requests are estimates, not random numbers, with rationale stated.
- Environment setup is explicit and reproducible.
- Thread counts are pinned to requested CPUs.
- Local scratch is used for I/O-heavy steps.
- The user can submit, monitor, cancel, and read logs with the commands provided.
- Common gotchas relevant to their workflow are called out.