name: sherlock
description: |
  Stanford Sherlock HPC cluster assistant. Helps with SLURM job submission, GPU allocation, storage, modules, and cluster usage. Auto-invokes on keywords: sbatch, salloc, srun, SLURM, Sherlock, $SCRATCH, $OAK, GPU job, HPC, sh_dev, squeue, scancel, sacct, module load, partition, node, compute.
allowed-tools:
  - WebFetch
  - Bash
  - Read
user-invocable: true
Sherlock HPC Cluster Assistant
You are helping a user work with Stanford's Sherlock HPC cluster. Follow these rules:
- Answer from the inline quick reference below first; it covers the most common topics.
- Fetch full docs via WebFetch when more detail is needed: use the URL index in SHERLOCK_URL_INDEX.md (same directory as this file) to find the right page.
- Generate ready-to-use sbatch scripts when the user asks for help submitting jobs.
- Use Sherlock conventions:
  - Use $SCRATCH for job I/O (fast, large, but purged after 90 days of no access)
  - Use $OAK for long-term research data storage (if available)
  - Prefer Python venvs over Anaconda (Sherlock docs recommend this)
  - Always specify --partition, --time, and --mem in job scripts
When testing #SBATCH scripts, try the dev partition first whenever possible.
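One way to follow the dev-first advice without editing the script is to override its directives on the command line, since sbatch options given on the command line take precedence over #SBATCH lines in the file. The sketch below only prints the command it would run; the helper name dev_test_cmd and the 10-minute default are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print an sbatch command for a short smoke test on the
# dev partition. Nothing is submitted; the command is only echoed.
dev_test_cmd() {
    local script="$1"
    local minutes="${2:-10}"   # assumed default of 10 minutes; dev allows up to 2 hours
    # A bare number passed to --time is interpreted by SLURM as minutes.
    printf 'sbatch --partition=dev --time=%d %s\n' "$minutes" "$script"
}

dev_test_cmd my_job.sh 15
```

Once the short run succeeds, submit the script unchanged with plain sbatch so its own --partition and --time apply.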
When fetching docs, read the SHERLOCK_URL_INDEX.md file in this skill's directory to find the correct URL, then use WebFetch to retrieve it.
Quick Reference
SSH Connection
ssh <sunetid>@login.sherlock.stanford.edu
SSH multiplexing (add to ~/.ssh/config for faster reconnects):
Host sherlock sherlock.stanford.edu sherlock??
HostName login.sherlock.stanford.edu
User <sunetid>
ControlMaster auto
ControlPersist 600
ControlPath ~/.ssh/sockets/%r@%h-%p
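The ControlPath above points into ~/.ssh/sockets, and ssh does not create that directory by itself, so multiplexing fails until it exists. A minimal sketch of the one-time setup follows; the function name ensure_socket_dir is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: create the directory that holds SSH control sockets
# and restrict it to the current user. The argument should match the
# directory part of ControlPath in ~/.ssh/config.
ensure_socket_dir() {
    local dir="$1"
    mkdir -p "$dir" && chmod 700 "$dir"
}
```

Typical use is ensure_socket_dir ~/.ssh/sockets, followed by ssh sherlock. Afterwards, ssh -O check sherlock reports whether a master connection is running and ssh -O exit sherlock closes it.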
Partitions
| Partition | Max time | Max nodes/job | Default mem/CPU | Notes |
|---|---|---|---|---|
| normal | 2 days | 24 | ~8 GB | Default partition |
| bigmem | 1 day | 1 | - | High-memory nodes (up to 4 TB) |
| gpu | 2 days | 4 | ~8 GB | GPU nodes (request with -G) |
| dev | 2 hours | 2 | ~8 GB | Quick testing, higher priority |
| owners | 2 days | - | - | PI-owned nodes (preemptable for non-owners) |
| long | 7 days | 1 | ~8 GB | Long-running single-node jobs |
Check available partitions: sh_part shows partition availability and limits.
Note: Some users may have priority access to group-owned nodes in the owners partition.
SLURM Command Cheat Sheet
| Command | Purpose |
|---|---|
| sbatch script.sh | Submit a batch job |
| salloc -p <partition> -t <time> | Request an interactive allocation |
| srun <command> | Run a command on allocated nodes |
| squeue -u $USER | Check your job queue |
| scancel <jobid> | Cancel a job |
| scancel -u $USER | Cancel all your jobs |
| sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State | Job accounting info |
| scontrol show job <jobid> | Detailed job info |
| sh_dev | Quick interactive dev session (shortcut for salloc -p dev) |
| sh_dev -g 1 | Dev session with 1 GPU |
| sh_part | Show partition availability/limits |
| sinfo -p <partition> | Show node status in partition |
Common #SBATCH Directives Template
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=normal
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<sunetid>@stanford.edu
# Load modules
module load python/3.12
# Run
python my_script.py
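The template above sets all three directives that the conventions ask for (--partition, --time, --mem). A rough way to check an existing script for them before submitting is sketched below. The helper name check_sbatch_directives is hypothetical, and it only recognises the long option forms, so short forms such as -p or -t and alternatives such as --mem-per-cpu are reported as missing.

```shell
#!/bin/bash
# Hypothetical lint helper: print each required #SBATCH directive that the
# given script lacks. Returns 0 when all three are present, 1 otherwise.
check_sbatch_directives() {
    local script="$1" missing=0 opt
    for opt in partition time mem; do
        # Long form only, e.g. "#SBATCH --mem=16G" or "#SBATCH --mem 16G".
        if ! grep -Eq "^#SBATCH[[:space:]]+--${opt}([=[:space:]]|\$)" "$script"; then
            echo "missing: --${opt}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_sbatch_directives my_job.sh && sbatch my_job.sh submits only when nothing is reported missing.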
GPU Requests
#SBATCH --partition=gpu
# Pick ONE --gpus line per script:
#SBATCH --gpus=1 # 1 GPU (any type)
#SBATCH --gpus=v100:2 # 2 V100 GPUs
#SBATCH --gpus=a100:1 # 1 A100 GPU
# Optional: ONE -C line (combine several features with & on the same line):
#SBATCH -C "GPU_MEM:80GB" # GPU with >= 80GB memory
#SBATCH -C "GPU_CC:8.0" # GPU with compute capability 8.0+
#SBATCH -C "GPU_SKU:A100_SXM4" # Specific GPU SKU
#SBATCH -C "GPU_GEN:AMP" # GPU generation (AMP=Ampere, HOP=Hopper)
Available GPU types: V100 (16/32GB), A100 (40/80GB), A40 (48GB), L40S (48GB), H100 (80GB)
Constraint features (combinable with &):
- GPU_SKU: exact GPU model
- GPU_MEM: minimum GPU memory
- GPU_CC: minimum CUDA compute capability
- GPU_GEN: GPU generation
- GPU_BRD: GPU board type
- GPU_CNT: GPUs per node
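Because several features are combined with & into a single constraint string, it can help to build that string in one place. The following is a small sketch, with join_constraints as a hypothetical helper name; the two feature values are taken from the examples above.

```shell
#!/bin/bash
# Hypothetical helper: join any number of node features with "&", which
# SLURM treats as a logical AND inside a constraint expression.
join_constraints() {
    local IFS='&'
    echo "$*"   # "$*" joins the arguments with the first character of IFS
}

constraint=$(join_constraints "GPU_GEN:AMP" "GPU_MEM:80GB")
echo "#SBATCH -C \"${constraint}\""
```

The printed line can be pasted into a job script, or the string can be passed directly as sbatch -C "$constraint" script.sh. Quoting matters here because an unquoted & would be interpreted by the shell.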
Storage
| Path | Quota | Backup | Purge | Speed | Use for |
|---|---|---|---|---|---|
| $HOME | 15 GB | Yes (snapshots) | Never | Slow (NFS) | Config, scripts, small code |
| $SCRATCH | 100 TB | No | 90 days no access | Fast (Lustre) | Job I/O, temp data, large datasets |
| $OAK | Paid | Optional | Never | Medium | Long-term research data |
| $L_SCRATCH | Node-local | No | Job end | Fastest (SSD) | Tmp files during job only |
| $GROUP_HOME | 1 TB | Yes | Never | Slow (NFS) | Shared group configs |
| $GROUP_SCRATCH | 100 TB | No | 90 days no access | Fast (Lustre) | Shared group temp data |
Check quotas: sh_quota or lfs quota -u $USER $SCRATCH
Module Commands
| Command | Purpose |
|---|---|
| ml avail | List available modules |
| ml spider <name> | Search for a module |
| ml load <module> | Load a module |
| ml unload <module> | Unload a module |
| ml list | List loaded modules |
| ml purge | Unload all modules |
| ml show <module> | Show module details |
| ml save <name> | Save current module set |
| ml restore <name> | Restore saved module set |
Common modules: python/3.12, cuda/12, gcc/12, openmpi, R, matlab, julia
Job State Codes
| Code | State | Meaning |
|---|---|---|
| PD | Pending | Waiting for resources |
| R | Running | Currently executing |
| CG | Completing | Finishing up |
| CD | Completed | Finished successfully |
| F | Failed | Non-zero exit code |
| CA | Cancelled | Cancelled by user/admin |
| TO | Timeout | Hit time limit |
| OOM | Out of Memory | Exceeded memory limit |
| NF | Node Fail | Node failure |
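When scanning many jobs it can be convenient to expand these short codes automatically. The sketch below mirrors the table above as a case statement; describe_state is a hypothetical helper name, and codes outside the table are reported as unknown rather than guessed.

```shell
#!/bin/bash
# Hypothetical helper: expand a short SLURM state code into the description
# from the table above. Returns 1 for codes the table does not list.
describe_state() {
    case "$1" in
        PD)  echo "Pending: waiting for resources" ;;
        R)   echo "Running: currently executing" ;;
        CG)  echo "Completing: finishing up" ;;
        CD)  echo "Completed: finished successfully" ;;
        F)   echo "Failed: non-zero exit code" ;;
        CA)  echo "Cancelled: cancelled by user/admin" ;;
        TO)  echo "Timeout: hit time limit" ;;
        OOM) echo "Out of Memory: exceeded memory limit" ;;
        NF)  echo "Node Fail: node failure" ;;
        *)   echo "Unknown state code: $1"; return 1 ;;
    esac
}

describe_state TO
```

On the cluster, the codes can be obtained with squeue -u $USER -h -o "%i %t", which prints one job ID and compact state per line without a header; each state field can then be passed to describe_state.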
Useful Tips
- Check job efficiency after completion: seff <jobid>
- Estimate start time: squeue -j <jobid> --start
- Job arrays: #SBATCH --array=0-9 then use $SLURM_ARRAY_TASK_ID
- Email notifications: #SBATCH --mail-type=BEGIN,END,FAIL
- Dependency chains: sbatch --dependency=afterok:<jobid> next_job.sh
- Avoid $HOME for I/O: It's NFS-mounted and slow; use $SCRATCH
- Python venvs: Preferred over conda on Sherlock

      module load python/3.12
      python -m venv $HOME/venvs/myenv
      source $HOME/venvs/myenv/bin/activate
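For the job-array tip above, a common pattern is to let each array task pick its own input from a list file. The following is a minimal sketch under stated assumptions: the helper name input_for_task and the list file name inputs.txt are hypothetical, and the list is assumed to hold one input path per line.

```shell
#!/bin/bash
# Hypothetical helper: print the line of a list file that belongs to one
# array task. The task ID defaults to $SLURM_ARRAY_TASK_ID, then to 0.
input_for_task() {
    local list="$1"
    local task_id="${2:-${SLURM_ARRAY_TASK_ID:-0}}"
    # With --array=0-9 the task IDs start at 0, while sed counts lines from 1.
    sed -n "$((task_id + 1))p" "$list"
}

# Inside a job script submitted with "#SBATCH --array=0-9" one might write:
#   input=$(input_for_task inputs.txt)
#   python my_script.py "$input"
```

Passing the task ID as an optional second argument makes it possible to try the lookup on a login node, for example input_for_task inputs.txt 3, before submitting the array.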