slurm

Help write, debug, and manage SLURM jobs. Use when the user asks about sbatch, salloc, squeue, job scripts, or cluster resource allocation.

By michaelrizvi. Updated 3/14/2026

name: slurm
description: Help write, debug, and manage SLURM jobs. Use when the user asks about sbatch, salloc, squeue, job scripts, or cluster resource allocation.

SLURM Assistant

Help the user write job scripts, debug failed jobs, and manage cluster resources.

Job Script Guidelines

  • Always include: --job-name, --output, --error, --time, --mem, --gres (for GPUs), --cpus-per-task
  • Place scripts in a dedicated folder (e.g. scripts/)
  • Use set -euo pipefail in the bash portion
  • Log key info at the start: hostname, GPU info (nvidia-smi), date, git commit hash
  • Activate the correct virtual environment before running Python
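
The guidelines above might translate into a script like the following sketch. The job name, resource values, log paths, venv location, and train.py are all placeholders rather than values from a real project. The block only writes the template into scripts/ and syntax-checks it; nothing is submitted.

```shell
#!/bin/bash
# Sketch: write a job-script template into scripts/ and syntax-check it.
# Nothing is submitted; every name and resource value below is a placeholder.
mkdir -p scripts logs   # SLURM does not create the log directory for you

cat > scripts/example_train.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example-train
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=02:00:00
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
set -euo pipefail

# Log key info up front so failed runs are easier to trace.
echo "Host:   $(hostname)"
echo "Date:   $(date)"
echo "Commit: $(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
nvidia-smi

# Activate the environment before running Python (hypothetical venv path).
module load python/3.10
source "$HOME/venvs/example/bin/activate"

python train.py
EOF

bash -n scripts/example_train.sh && echo "template OK: scripts/example_train.sh"
```

In the log file names, %x expands to the job name and %j to the job ID. Relative --output and --error paths resolve against the directory sbatch is run from, so submitting from a directory under $SCRATCH keeps logs out of $HOME.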

Resource Allocation Rules

  • Small experiments (<1M params): 1 GPU, 4-8 CPUs, 16-32GB RAM
  • Medium experiments (1M-1B params): 1-2 GPUs, 8-16 CPUs, 32-64GB RAM
  • Large models (7B+): multiple GPUs, 64-128GB+ RAM
  • 32B+ inference: 4+ GPUs, match tensor parallelism to GPU count
  • Rule of thumb: ~4-8 CPUs per GPU, ~2x model size in FP16 for VRAM
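
The rule of thumb can be turned into quick arithmetic. This sketch reads it one particular way, which is an assumption: FP16 weights take 2 bytes per parameter, and the VRAM budget is about twice that footprint to leave room for activations and caches. The helper names are hypothetical.

```shell
#!/bin/bash
# Sketch of the rule-of-thumb arithmetic. Assumption: FP16 weights take
# 2 bytes per parameter, and the VRAM budget is about twice that footprint.

# FP16 weight footprint in GB for a model with N billion parameters.
fp16_weights_gb() { echo $(( $1 * 2 )); }

# VRAM budget in GB: ~2x the FP16 weight footprint.
vram_budget_gb() { echo $(( $(fp16_weights_gb "$1") * 2 )); }

# CPU range for a given GPU count, at 4-8 CPUs per GPU.
cpu_range() { echo "$(( $1 * 4 ))-$(( $1 * 8 ))"; }

echo "7B model: $(fp16_weights_gb 7)GB weights, ~$(vram_budget_gb 7)GB VRAM budget"
echo "2 GPUs:   $(cpu_range 2) CPUs"
```

Treat the result as a starting point for --gres and --cpus-per-task, then adjust after checking real usage.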

Known GPU Types & Selection

GPU types (use with --gres=gpu:<type>:N)

  • a100: A100 40GB HBM2
  • a100l: A100 80GB HBM2e
  • a6000: RTX A6000 48GB GDDR6
  • h100: H100 80GB HBM3
  • l40s: L40S 48GB GDDR6 (about 45GB usable)
  • rtx8000: Quadro RTX 8000 48GB GDDR6
  • v100: V100 32GB HBM2

GPU selection by attribute

You can also request GPUs by memory, architecture, or feature:

  • By memory: --gres=gpu:48gb:1 (any 48GB GPU: RTX8000, A6000, L40S)
  • By arch: --gres=gpu:ampere:1 (A100, A6000; the L40S is an Ada Lovelace GPU, so check whether the cluster's ampere tag covers it)
  • By interconnect: --gres=gpu:nvlink:1
  • By system: --gres=gpu:dgx:1
  • Memory tags: 12gb, 32gb, 40gb, 48gb, 80gb
  • Arch tags: volta, turing, ampere
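
All of the selectors above share the same --gres=gpu:<selector>:N shape, which a small helper can make explicit. gres_flag is a hypothetical convenience function, not a SLURM command; it only formats the flag and rejects a non-numeric count.

```shell
#!/bin/bash
# Sketch: build a --gres flag from a selector (type, memory, arch, or feature tag)
# and a GPU count. gres_flag is a hypothetical helper, not a SLURM command.
gres_flag() {
    local selector="$1" count="${2:-1}"
    case "$count" in
        ''|0|*[!0-9]*) echo "count must be a positive integer" >&2; return 1 ;;
    esac
    echo "--gres=gpu:${selector}:${count}"
}

gres_flag a100l 2    # by type
gres_flag 48gb 1     # by memory
gres_flag ampere 1   # by architecture
```

The output can be pasted into an #SBATCH line or passed to salloc.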

Node Inventory

| Nodes | Count | GPUs | CPUs | RAM |
|---|---|---|---|---|
| cn-l[001-091] | 91 | 4x L40S (48GB) | 48 | 1024GB |
| cn-c[001-040] | 40 | 8x RTX8000 (48GB) | 64 | 384GB |
| cn-g[001-029] | 29 | 4x A100 (80GB) | 64 | 1024GB |
| cn-a[001-011] | 11 | 8x RTX8000 (48GB) | 40 | 384GB |
| cn-b[001-005] | 5 | 8x V100 (32GB) | 40 | 384GB |
| cn-k[001-004] | 4 | 4x A100 (40GB) | 48 | 512GB |
| cn-n[001-002] | 2 | 8x H100 (80GB) | 192 | 2048GB |
| cn-d[001-004] (DGX) | 4 | 8x A100 (40/80GB) | 128 | 1024-2048GB |
| cn-j001 | 1 | 8x A6000 (48GB) | 64 | 1024GB |

Each node has either 4 or 8 GPUs. Don't request more GPUs of a type than a single node of that type provides.
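
A request can be checked against the inventory before it goes into a script. The sketch below hard-codes the per-node maximum for each GPU type from the table above; the function names are hypothetical, and the mapping would need updating if the inventory changes.

```shell
#!/bin/bash
# Sketch: largest single-node GPU count per type, taken from the inventory table.
# For a100 and a100l the maximum of 8 is only available on the DGX nodes.
max_gpus_per_node() {
    case "$1" in
        l40s) echo 4 ;;
        a100|a100l|a6000|h100|rtx8000|v100) echo 8 ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

# Succeeds only if `count` GPUs of `type` fit on one node.
check_gpu_request() {
    local type="$1" count="$2" max
    max="$(max_gpus_per_node "$type")" || return 1
    [ "$count" -le "$max" ]
}

check_gpu_request l40s 4 && echo "l40s x4: fits on one node"
check_gpu_request l40s 8 || echo "l40s x8: exceeds the 4 GPUs per L40S node"
```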

Partitions & Preemption

| Partition | Time Limit | Per-User Limits |
|---|---|---|
| long (default) | 7 days | No per-user GPU cap |
| main | 5 days | 2 GPUs, 8 CPUs, 48GB |
| short | 3 hours | 4 GPUs, 1TB mem |
| unkillable | 2 days | 1 GPU, 6 CPUs, 32GB |

Preemption hierarchy: unkillable > main > long. Preempted jobs are killed and automatically requeued. main jobs do NOT preempt other main jobs. The -grace partition variants send SIGTERM and allow a grace period before the kill. Checkpoint frequently on the long partition, which sits at the bottom of the hierarchy.
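
Where a SIGTERM grace period is available, the job script can use it to trigger a final checkpoint. The following is a minimal sketch, assuming the batch shell actually receives the signal (this depends on how the cluster delivers it) and that the training process saves a checkpoint when it gets SIGTERM itself. The marker file and the sleep stand-in are hypothetical.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical marker file recording that a preemption signal arrived.
CKPT_MARKER="${CKPT_MARKER:-./preempted.flag}"
child_pid=""

on_sigterm() {
    echo "SIGTERM received at $(date); asking the training process to checkpoint"
    touch "$CKPT_MARKER"
    # Forward the signal so the child can save its state and exit.
    if [ -n "${child_pid:-}" ]; then
        kill -TERM "$child_pid" 2>/dev/null || true
    fi
}
trap on_sigterm TERM

# Stand-in for the real workload, e.g. a training command run in the background.
sleep 1 &
child_pid=$!

# The first wait returns early if a trapped signal arrives; the second
# lets the child finish writing its checkpoint before the script exits.
wait "$child_pid" || true
wait "$child_pid" 2>/dev/null || true
child_pid=""
```

Because preempted jobs are requeued, the workload should also look for an existing checkpoint at startup and resume from it.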

Storage

| Path | Quota | Key Policy |
|---|---|---|
| $HOME | 100GB / 1M files | Daily backup, low I/O; don't write logs here |
| $SCRATCH | 5TB / unlimited | Files unused >90 days deleted |
| $SLURM_TMPDIR | No quota | Fastest I/O, cleared after job |
| /network/projects/<group>/ | 1TB / 1M files | Shared project storage |
| $ARCHIVE | 5TB | No backup, not on GPU nodes |

Always copy data to $SLURM_TMPDIR at job start for performance. Write logs/outputs to $SCRATCH, not $HOME. Check usage with disk-quota.
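
The staging pattern could look like the sketch below. DATASET and the runs/ layout under $SCRATCH are hypothetical; the staging step is skipped outside a job, where $SLURM_TMPDIR is not set.

```shell
#!/bin/bash
# Sketch: stage a dataset onto node-local storage and keep outputs on $SCRATCH.
# DATASET and the runs/ layout are hypothetical.

# Copy the contents of src into dst, creating dst if needed.
stage_data() {
    local src="$1" dst="$2"
    mkdir -p "$dst"
    cp -r "$src/." "$dst/"
}

# Only stage when running inside a job with a dataset to copy.
if [ -n "${SLURM_TMPDIR:-}" ] && [ -d "${DATASET:-}" ]; then
    stage_data "$DATASET" "$SLURM_TMPDIR/data"
    RUN_DIR="$SCRATCH/runs/$SLURM_JOB_ID"   # logs and checkpoints go here, not $HOME
    mkdir -p "$RUN_DIR"
    echo "staged $DATASET -> $SLURM_TMPDIR/data; outputs in $RUN_DIR"
fi
```

For datasets made of many small files, copying a single archive and unpacking it inside $SLURM_TMPDIR is usually gentler on the shared filesystem.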

Module System

  • module load python/3.10 — required before creating venvs on cluster
  • module load miniconda/3 — for conda environments
  • module avail / module spider <term> — search available modules
  • Pre-built PyTorch/TF modules exist for Mila GPUs
  • On login/CPU nodes without GPUs: set CONDA_OVERRIDE_CUDA=11.8 before running conda commands, so conda can still resolve CUDA builds
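
A setup snippet along these lines covers the venv and conda cases. It is a sketch: VENV_DIR is a hypothetical location, the module call is skipped where the module command does not exist, and a failing nvidia-smi is used as a rough stand-in for "no GPU on this node".

```shell
#!/bin/bash
# Sketch: environment setup. VENV_DIR is a hypothetical location.
VENV_DIR="${VENV_DIR:-$HOME/venvs/example}"

# Create and activate a venv; load the Python module first where modules exist.
setup_venv() {
    if command -v module >/dev/null 2>&1; then
        module load python/3.10
    fi
    python3 -m venv "$VENV_DIR" && source "$VENV_DIR/bin/activate"
}

# On nodes with no usable GPU, tell conda which CUDA version to solve for.
set_conda_cuda_override() {
    if ! nvidia-smi >/dev/null 2>&1; then
        export CONDA_OVERRIDE_CUDA=11.8
    fi
}

set_conda_cuda_override
echo "CONDA_OVERRIDE_CUDA=${CONDA_OVERRIDE_CUDA:-unset}"
```

setup_venv is defined but not called here, since creating an environment is a one-time step run by hand.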

Debugging Failed Jobs

  • Check .err files first — experiment logs go to stderr
  • sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList for completed jobs
  • Common issues: OOM (check MaxRSS), time limit, bad path, missing module/env
  • For OOM: check batch size, model size, gradient accumulation, and whether --mem was sufficient
  • torch.autograd.set_detect_anomaly(True) causes extreme filesystem IOPS — never leave on in batch jobs, admins will flag it
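
The triage steps above can be partly automated by mapping the job state reported by sacct to a likely cause. state_hint is a hypothetical helper; the state names are standard SLURM job states. The sacct lookup only runs when sacct exists and a job ID is passed as the first argument.

```shell
#!/bin/bash
# Sketch: map a sacct job state to the likely causes listed above.
# state_hint is a hypothetical helper, not part of SLURM.
state_hint() {
    case "$1" in
        COMPLETED)     echo "finished normally" ;;
        OUT_OF_MEMORY) echo "OOM: compare MaxRSS with --mem, then check batch size and model size" ;;
        TIMEOUT)       echo "hit the time limit: raise --time or checkpoint and resume" ;;
        PREEMPTED)     echo "preempted: resume from the last checkpoint after requeue" ;;
        CANCELLED*)    echo "cancelled by a user or admin" ;;
        FAILED)        echo "non-zero exit: read the .err file for a bad path or missing module/env" ;;
        *)             echo "unrecognized state: inspect the sacct output directly" ;;
    esac
}

# Look up the state of a job (first argument) when sacct is available.
if command -v sacct >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    # -X reports the allocation only, so one line comes back per job.
    state="$(sacct -j "$1" -X --noheader --parsable2 --format=State | head -n 1)"
    echo "$1: $state -> $(state_hint "$state")"
fi
```

OUT_OF_MEMORY refers to host RAM. A CUDA out-of-memory error typically surfaces as FAILED, with the traceback in the .err file.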

Monitoring

  • disk-quota — check storage usage
  • squeue -u $USER — your active jobs
  • echo $SLURM_JOB_GPUS — which GPU(s) your job got
  • Netdata per-node: <node>.server.mila.quebec:19999 (requires Mila wifi or SSH tunnel)
  • Grafana dashboard: dashboard.server.mila.quebec
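
When off the Mila network, the Netdata page for a node can be reached through an SSH port forward. The helper below only prints the command. The login alias mila is an assumption: it stands for whatever host entry in ~/.ssh/config points at the cluster login node.

```shell
#!/bin/bash
# Sketch: print an SSH port-forward command for a node's Netdata page.
# "mila" is a hypothetical ~/.ssh/config alias for the cluster login node.
netdata_tunnel_cmd() {
    local node="$1" login="${2:-mila}" local_port="${3:-19999}"
    echo "ssh -N -L ${local_port}:${node}.server.mila.quebec:19999 ${login}"
}

netdata_tunnel_cmd cn-g001
```

With the tunnel running, the dashboard is served at http://localhost:19999.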

Limits

  • Max 1000 jobs per user in the system at any time
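
Before launching a large sweep, the current job count can be compared with the limit. This sketch assumes each job-array element counts as one job toward the limit, which the source does not state; can_submit is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: check headroom under the per-user job limit before a large sweep.
MAX_JOBS=1000

# Succeeds if submitting `new` more jobs keeps the total within the limit.
can_submit() {
    local current="$1" new="${2:-1}"
    [ $(( current + new )) -le "$MAX_JOBS" ]
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header; -r lists each job-array element on its own line.
    current="$(squeue -u "$USER" -h -r | wc -l)"
    if can_submit "$current" 50; then
        echo "room for 50 more jobs (currently $(( current )))"
    else
        echo "50 more jobs would exceed the $MAX_JOBS limit (currently $(( current )))"
    fi
fi
```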

Safety

  • Never submit jobs (sbatch) without explicit user confirmation
  • Verify paths and configs before submission
  • Test on small instances first when possible
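
One way to enforce the confirmation rule is a wrapper that refuses to call sbatch without an explicit yes. confirm_submit is a hypothetical helper; it checks that the script exists and parses as bash, then prompts on stderr.

```shell
#!/bin/bash
# Sketch: only call sbatch after an explicit "y" from the user.
# confirm_submit is a hypothetical wrapper, not part of SLURM.
confirm_submit() {
    local script="$1" answer
    if [ ! -f "$script" ]; then
        echo "missing script: $script" >&2
        return 2
    fi
    # Catch shell syntax errors before anything reaches the scheduler.
    bash -n "$script" || return 2

    printf 'Submit %s with sbatch? [y/N] ' "$script" >&2
    read -r answer || answer=""
    case "$answer" in
        y|Y|yes)
            sbatch "$script"
            ;;
        *)
            echo "not submitted"
            return 1
            ;;
    esac
}
```

Usage is confirm_submit scripts/example_train.sh, where the path is a placeholder. For a scheduler-side check, sbatch --test-only validates the script and estimates a start time without submitting a job.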

Scope

$ARGUMENTS

Install via CLI
npx skills add https://github.com/michaelrizvi/claude-config --skill slurm