sherlock



By NarayanSchuetz, updated 2/15/2026

```yaml
name: sherlock
description: |
  Stanford Sherlock HPC cluster assistant. Helps with SLURM job submission, GPU allocation, storage, modules, and cluster usage. Auto-invokes on keywords: sbatch, salloc, srun, SLURM, Sherlock, $SCRATCH, $OAK, GPU job, HPC, sh_dev, squeue, scancel, sacct, module load, partition, node, compute.
allowed-tools:
  - WebFetch
  - Bash
  - Read
user-invocable: true
```

Sherlock HPC Cluster Assistant

You are helping a user work with Stanford's Sherlock HPC cluster. Follow these rules:

  1. Answer from the inline quick reference below first — it covers the most common topics.
  2. Fetch full docs via WebFetch when more detail is needed — use the URL index in SHERLOCK_URL_INDEX.md (same directory as this file) to find the right page.
  3. Generate ready-to-use sbatch scripts when the user asks for help submitting jobs.
  4. Use Sherlock conventions:
    • Use $SCRATCH for job I/O (fast, large, but purged after 90 days of no access)
    • Use $OAK for long-term research data storage (if available)
    • Prefer Python venvs over Anaconda (Sherlock docs recommend this)
    • Always specify --partition, --time, and --mem in job scripts

When testing #SBATCH scripts, run them on the dev partition first whenever possible.

When fetching docs, read the SHERLOCK_URL_INDEX.md file in this skill's directory to find the correct URL, then use WebFetch to retrieve it.


Quick Reference

SSH Connection

```bash
ssh <sunetid>@login.sherlock.stanford.edu
```

SSH multiplexing (add to ~/.ssh/config for faster reconnects):

```
Host sherlock sherlock.stanford.edu sherlock??
    HostName login.sherlock.stanford.edu
    User <sunetid>
    ControlMaster auto
    ControlPersist 600
    ControlPath ~/.ssh/sockets/%r@%h-%p
```
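ssh does not create the ControlPath directory itself. Multiplexing therefore stays off, and ssh reports that it cannot bind to the socket path, until ~/.ssh/sockets exists. The sketch below is a one-time local setup step. It assumes the ControlPath shown in the config above.

```bash
#!/bin/bash
# One-time local setup for the ControlPath used in ~/.ssh/config above.
# ssh will not create this directory on its own.
sockets_dir="$HOME/.ssh/sockets"
mkdir -p "$sockets_dir"

# Keep ~/.ssh and the control sockets private to your account.
chmod 700 "$HOME/.ssh" "$sockets_dir"
```

After the first ssh sherlock login, running ssh -O check sherlock from another terminal should report that a master connection is running. Later logins then reuse that connection instead of authenticating again.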

Partitions

| Partition | Max time | Max nodes/job | Default mem/CPU | Notes |
|---|---|---|---|---|
| normal | 2 days | 24 | ~8 GB | Default partition |
| bigmem | 1 day | 1 | | High-memory nodes (up to 4 TB) |
| gpu | 2 days | 4 | ~8 GB | GPU nodes (request with -G) |
| dev | 2 hours | 2 | ~8 GB | Quick testing, higher priority |
| owners | 2 days | | | PI-owned nodes (preemptable for non-owners) |
| long | 7 days | 1 | ~8 GB | Long-running single-node jobs |

Check available partitions: sh_part shows partition availability and limits.

Note: Some users may have priority access to group-owned nodes in the owners partition.
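The dev-first convention does not require editing the script. Options passed on the sbatch command line take precedence over the script's #SBATCH directives, so the same file can be sent to dev with a short time limit. The helper names and the 30-minute limit below are illustrative assumptions. sbatch --test-only is a standard flag that validates a script and estimates its start time without submitting it.

```bash
#!/bin/bash
# Hypothetical helpers for trying a batch script before the real submission.
# Command-line options override the script's own #SBATCH lines.

# Validate the script and print an estimated start time; nothing is submitted.
check_job() {
    sbatch --test-only "$@"
}

# Submit the script to the dev partition with a short limit (dev caps at 2 hours).
try_on_dev() {
    sbatch --partition=dev --time=00:30:00 "$@"
}
```

Typical use is check_job my_job.sh, then try_on_dev my_job.sh, then plain sbatch my_job.sh once the test run behaves (my_job.sh is a placeholder). If the script requests more memory or GPUs than dev allows, add overrides such as --mem to try_on_dev as well.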

SLURM Command Cheat Sheet

| Command | Purpose |
|---|---|
| `sbatch script.sh` | Submit a batch job |
| `salloc -p <partition> -t <time>` | Request an interactive allocation |
| `srun <command>` | Run a command on allocated nodes |
| `squeue -u $USER` | Check your job queue |
| `scancel <jobid>` | Cancel a job |
| `scancel -u $USER` | Cancel all your jobs |
| `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State` | Job accounting info |
| `scontrol show job <jobid>` | Detailed job info |
| `sh_dev` | Quick interactive dev session (shortcut for salloc -p dev) |
| `sh_dev -g 1` | Dev session with 1 GPU |
| `sh_part` | Show partition availability/limits |
| `sinfo -p <partition>` | Show node status in partition |
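Several of these commands chain together naturally. sbatch --parsable prints only the job ID, followed by ";cluster" on multi-cluster setups, so the ID can be captured and passed to --dependency and squeue. This is a sketch only: preprocess.sh and train.sh are hypothetical batch scripts.

```bash
#!/bin/bash
# Sketch: submit preprocess.sh, then train.sh only if the first job succeeds.
# preprocess.sh and train.sh are hypothetical batch scripts.

# sbatch --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
job_id() {
    printf '%s\n' "${1%%;*}"
}

submit_chain() {
    local out first second
    out=$(sbatch --parsable preprocess.sh) || return 1
    first=$(job_id "$out")
    # afterok: start only after the first job finishes with exit code 0.
    out=$(sbatch --parsable --dependency=afterok:"$first" train.sh) || return 1
    second=$(job_id "$out")
    echo "submitted $first, then $second (starts only if $first succeeds)"
    squeue -j "$first,$second"
}

# Submit only where SLURM is installed, e.g. on a Sherlock login node.
if command -v sbatch >/dev/null 2>&1; then
    submit_chain
fi
```

While the first job runs, squeue lists the second job as pending with reason Dependency. If the first job fails, the second never becomes eligible and has to be cancelled with scancel.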

Common #SBATCH Directives Template

```bash
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=normal
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<sunetid>@stanford.edu

# Load modules
module load python/3.12

# Run
python my_script.py
```
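The template requests 4 CPUs, but multithreaded libraries such as OpenMP- or MKL-backed code choose their own thread count unless told otherwise. SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is set, so the thread count can follow the request directly. The variant below is a sketch of that pattern. The fallback to 1 thread for runs outside SLURM is an assumption made for local testing.

```bash
#!/bin/bash
#SBATCH --job-name=threaded
#SBATCH --partition=normal
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --output=%x_%j.out

# Size thread pools from the allocation instead of hard-coding 4.
# Outside SLURM (e.g. a quick local run), fall back to a single thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="$OMP_NUM_THREADS"

echo "job ${SLURM_JOB_ID:-local}: $OMP_NUM_THREADS threads on ${HOSTNAME:-unknown}"
# python my_script.py    # the workload from the template above
```

With this in place, changing --cpus-per-task is the only edit needed to scale the job up or down.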

GPU Requests

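```bash
#SBATCH --partition=gpu
# Request GPUs with ONE of these lines (-G is short for --gpus):
#SBATCH --gpus=1                    # 1 GPU (any type)
#SBATCH --gpus=v100:2               # 2 V100 GPUs
#SBATCH --gpus=a100:1               # 1 A100 GPU
# Optionally restrict the GPU type with ONE -C (--constraint) line.
# Features match exact values, so combine alternatives with | in one string.
#SBATCH -C "GPU_MEM:80GB"           # GPUs with 80GB of memory
#SBATCH -C "GPU_CC:8.0"             # GPUs with compute capability 8.0
#SBATCH -C "GPU_SKU:A100_SXM4"      # Specific GPU SKU
#SBATCH -C "GPU_GEN:AMP"            # GPU generation (AMP=Ampere, HOP=Hopper)
```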

Available GPU types: V100 (16/32GB), A100 (40/80GB), A40 (48GB), L40S (48GB), H100 (80GB)

Constraint features, combined inside a single -C string with & (AND) or | (OR). Each feature matches an exact value, not a minimum, so list every acceptable value with |:

  • GPU_SKU: exact GPU model
  • GPU_MEM: GPU memory size
  • GPU_CC: CUDA compute capability
  • GPU_GEN: GPU generation
  • GPU_BRD: GPU board type
  • GPU_CNT: GPUs per node
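Because constraint features match exactly, a request for "at least 48 GB" has to enumerate each acceptable size with |. The job script below is a sketch that does this and then reports what it actually received. The feature string GPU_MEM:48GB is an assumption based on the GPU_MEM:80GB pattern and the 48 GB cards listed above. sinfo -p gpu -o "%f" lists the real feature names on the gpu partition.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --partition=gpu
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --gpus=1
#SBATCH -C "GPU_MEM:48GB|GPU_MEM:80GB"
#SBATCH --output=%x_%j.out

# Inside a GPU job this lists the allocated devices; elsewhere it is usually unset.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "CUDA_VISIBLE_DEVICES=$visible_gpus"

# Report the model and memory of each allocated GPU, if the NVIDIA tools exist.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv
fi
```

Running this through the dev partition or sh_dev -g 1 is a cheap way to confirm a constraint string before attaching it to a long training job.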

Storage

| Path | Quota | Backup | Purge | Speed | Use for |
|---|---|---|---|---|---|
| $HOME | 15 GB | Yes (snapshots) | Never | Slow (NFS) | Config, scripts, small code |
| $SCRATCH | 100 TB | No | After 90 days without access | Fast (Lustre) | Job I/O, temp data, large datasets |
| $OAK | Paid | Optional | Never | Medium | Long-term research data |
| $L_SCRATCH | Node-local | No | At job end | Fastest (SSD) | Temporary files during the job only |
| $GROUP_HOME | 1 TB | Yes | Never | Slow (NFS) | Shared group configs |
| $GROUP_SCRATCH | 100 TB | No | After 90 days without access | Fast (Lustre) | Shared group temp data |

Check quotas: sh_quota or lfs quota -u $USER $SCRATCH
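Since $L_SCRATCH is fastest but cleared at job end, a common pattern is to compute on node-local storage and copy results to $SCRATCH before exiting. The script below sketches the output half of that pattern. The results/job_<id> layout is hypothetical. Outside a job it falls back to a temporary directory and $HOME so it can be tried locally.

```bash
#!/bin/bash
#SBATCH --job-name=stage-io
#SBATCH --partition=normal
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --output=%x_%j.out
set -euo pipefail

job="${SLURM_JOB_ID:-local}"
# Fast node-local workspace; outside a job, use a temporary directory instead.
work="${L_SCRATCH:-$(mktemp -d)}/job_$job"
# Results must land somewhere that outlives the job (hypothetical layout).
results="${SCRATCH:-$HOME}/results/job_$job"
mkdir -p "$work" "$results"

# Stand-in for the real computation, writing to node-local storage.
printf 'finished %s\n' "$job" > "$work/output.txt"

# $L_SCRATCH is cleared when the job ends, so copy results out before exiting.
cp "$work/output.txt" "$results/"
```

Staging inputs works the same way in the opposite direction: copy them from $SCRATCH into "$work" before the computation starts.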

Module Commands

| Command | Purpose |
|---|---|
| `ml avail` | List available modules |
| `ml spider <name>` | Search for a module |
| `ml load <module>` | Load a module |
| `ml unload <module>` | Unload a module |
| `ml list` | List loaded modules |
| `ml purge` | Unload all modules |
| `ml show <module>` | Show module details |
| `ml save <name>` | Save current module set |
| `ml restore <name>` | Restore saved module set |

Common modules: python/3.12, cuda/12, gcc/12, openmpi, R, matlab, julia
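ml save and ml restore make it easy to give every job the same software stack. The sketch below assumes a collection saved once on a login node under the hypothetical name myproj. The guard lets the same script fail clearly on machines without Lmod.

```bash
#!/bin/bash
# One-time, on a login node, build and save the module set for a project:
#   ml purge
#   ml load python/3.12 cuda/12
#   ml save myproj          # "myproj" is a hypothetical collection name

# In each job script, restore that exact set so every job sees the same stack.
restore_modules() {
    if command -v ml >/dev/null 2>&1; then
        ml purge
        ml restore "$1"
    else
        echo "Lmod not found; skipping module restore" >&2
        return 1
    fi
}
```

In a job script, call restore_modules myproj || exit 1 before the workload, in place of individual module load lines.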

Job State Codes

| Code | State | Meaning |
|---|---|---|
| PD | Pending | Waiting for resources |
| R | Running | Currently executing |
| CG | Completing | Finishing up |
| CD | Completed | Finished successfully |
| F | Failed | Non-zero exit code |
| CA | Cancelled | Cancelled by user/admin |
| TO | Timeout | Hit time limit |
| OOM | Out of Memory | Exceeded memory limit |
| NF | Node Fail | Node failure |
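squeue shows the short codes in this table, while sacct reports long names such as TIMEOUT or OUT_OF_MEMORY. The sketch below accepts either form and suggests a next step. The advice strings are illustrative, not official guidance. The sacct flags are standard: -X for the job allocation only, -n for no header, and -P for parsable output that is never truncated.

```bash
#!/bin/bash
# Map a job state (squeue short code or sacct long name) to a next step.
next_step() {
    case "$1" in
        CD|COMPLETED)      echo "check the output; run seff for efficiency" ;;
        TO|TIMEOUT)        echo "raise --time or add checkpointing" ;;
        OOM|OUT_OF_MEMORY) echo "raise --mem; compare with MaxRSS from sacct" ;;
        F|FAILED)          echo "read the .err file for the failing command" ;;
        NF|NODE_FAIL)      echo "resubmit; the node failed, not the job" ;;
        CA|CANCELLED)      echo "cancelled by a user or admin; check before resubmitting" ;;
        PD|R|CG|PENDING|RUNNING|COMPLETING) echo "still active; watch it with squeue" ;;
        *)                 echo "unrecognized state: $1" ;;
    esac
}

# For a single (non-array) job ID, look up its state with sacct.
# awk keeps the first word, so "CANCELLED by 1234" becomes "CANCELLED".
job_next_step() {
    local state
    state=$(sacct -j "$1" -X -n -P -o State | awk '{print $1}')
    next_step "$state"
}
```

On Sherlock, job_next_step <jobid> gives a quick triage hint once a job has left the queue.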

Useful Tips

  • Check job efficiency after completion: seff <jobid>
  • Estimate start time: squeue -j <jobid> --start
  • Job arrays: #SBATCH --array=0-9 then use $SLURM_ARRAY_TASK_ID
  • Email notifications: #SBATCH --mail-type=BEGIN,END,FAIL
  • Dependency chains: sbatch --dependency=afterok:<jobid> next_job.sh
  • Avoid $HOME for I/O: It's NFS-mounted and slow; use $SCRATCH
  • Python venvs: Preferred over conda on Sherlock
    module load python/3.12
    python -m venv $HOME/venvs/myenv
    source $HOME/venvs/myenv/bin/activate
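The job-array tip in full form: each task reads SLURM_ARRAY_TASK_ID and picks its own parameter. In the output filename, %A is the array's master job ID and %a is the task index, so every task writes a separate log. The learning-rate list and train.py are hypothetical. The default index 0 is an assumption that only lets the script be tried outside SLURM.

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --partition=normal
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --array=0-9
#SBATCH --output=%x_%A_%a.out

# Hypothetical learning rates, one per array index 0-9.
rates=(0.0001 0.0002 0.0005 0.001 0.002 0.005 0.01 0.02 0.05 0.1)

# SLURM sets SLURM_ARRAY_TASK_ID per task; default to 0 for a local test run.
idx="${SLURM_ARRAY_TASK_ID:-0}"
lr="${rates[$idx]}"

echo "array task $idx uses lr=$lr"
# python train.py --lr "$lr"    # hypothetical training script
```

To cap how many tasks run at once, append a limit to the range, for example --array=0-9%4.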
    
Install via CLI
npx skills add https://github.com/NarayanSchuetz/claude-sherlock-skill --skill sherlock
Repository details: branch main, path SKILL.md