sherlock - SKILL.md Agent Skill

name: sherlock
description: |
  Stanford Sherlock HPC cluster assistant. Helps with SLURM job submission, GPU allocation, storage, modules, and cluster usage. Auto-invokes on keywords: sbatch, salloc, srun, SLURM, Sherlock, $SCRATCH, $OAK, GPU job, HPC, sh_dev, squeue, scancel, sacct, module load, partition, node, compute.
allowed-tools:
  - WebFetch
  - Bash
  - Read
user-invocable: true

Sherlock HPC Cluster Assistant

You are helping a user work with Stanford's Sherlock HPC cluster. Follow these rules:

1. Answer from the inline quick reference below first; it covers the most common topics.
2. Fetch full docs via WebFetch when more detail is needed. Use the URL index in SHERLOCK_URL_INDEX.md (same directory as this file) to find the right page.
3. Generate ready-to-use sbatch scripts when the user asks for help submitting jobs.
4. Use Sherlock conventions:
- Use $SCRATCH for job I/O (fast, large, but purged after 90 days of no access)
- Use $OAK for long-term research data storage (if available)
- Prefer Python venvs over Anaconda (Sherlock docs recommend this)
- Always specify --partition, --time, and --mem in job scripts

When testing #SBATCH scripts, run them on the dev partition first whenever possible.
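One way to do this without editing the script is to override the partition and time limit at submission, since options given on the sbatch command line take precedence over the #SBATCH lines in the file. A minimal sketch, assuming a hypothetical helper name `test_on_dev` and an arbitrary 10-minute limit (well under the 2-hour dev cap in the partition table):

```shell
# Hypothetical helper: submit any job script to the dev partition with a
# short time limit, leaving the #SBATCH lines in the file untouched.
test_on_dev() {
    sbatch --partition=dev --time=00:10:00 "$@"
}

# Usage on a Sherlock login node (my_job.sh is a placeholder name):
#   test_on_dev my_job.sh
```

Once the short run succeeds, submit the same file with plain `sbatch` so its own directives apply.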

When fetching docs, read the SHERLOCK_URL_INDEX.md file in this skill's directory to find the correct URL, then use WebFetch to retrieve it.

Quick Reference

SSH Connection

ssh <sunetid>@login.sherlock.stanford.edu

SSH multiplexing (add to ~/.ssh/config for faster reconnects):

Host sherlock sherlock.stanford.edu sherlock??
    HostName login.sherlock.stanford.edu
    User <sunetid>
    ControlMaster auto
    ControlPersist 600
    ControlPath ~/.ssh/sockets/%r@%h-%p
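
The ControlPath above points into ~/.ssh/sockets, and ssh does not create that directory on its own. A one-time setup step on the local machine, sketched here with the same path as the config:

```shell
# Create the directory that ControlPath refers to and keep it private.
mkdir -p "$HOME/.ssh/sockets"
chmod 700 "$HOME/.ssh/sockets"
```

After the first login, `ssh -O check sherlock` reports whether a master connection is running for that host alias.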

Partitions

Partition	Max time	Max nodes/job	Default mem/CPU	Notes
`normal`	2 days	24	~8 GB	Default partition
`bigmem`	1 day	1	—	High-memory nodes (up to 4 TB)
`gpu`	2 days	4	~8 GB	GPU nodes (request with `-G`)
`dev`	2 hours	2	~8 GB	Quick testing, higher priority
`owners`	2 days	—	—	PI-owned nodes (preemptable for non-owners)
`long`	7 days	1	~8 GB	Long-running single-node jobs

Check available partitions: sh_part shows partition availability and limits.

Note: Some users may have priority access to group-owned nodes in the owners partition.

SLURM Command Cheat Sheet

Command	Purpose
`sbatch script.sh`	Submit a batch job
`salloc -p <partition> -t <time>`	Request an interactive allocation
`srun <command>`	Run a command on allocated nodes
`squeue -u $USER`	Check your job queue
`scancel <jobid>`	Cancel a job
`scancel -u $USER`	Cancel all your jobs
`sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State`	Job accounting info
`scontrol show job <jobid>`	Detailed job info
`sh_dev`	Quick interactive dev session (shortcut for `salloc -p dev`)
`sh_dev -g 1`	Dev session with 1 GPU
`sh_part`	Show partition availability/limits
`sinfo -p <partition>`	Show node status in partition
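
Several of these commands are typically combined when one job must wait for another. The sketch below is one possible approach rather than a Sherlock-specific recipe: it uses the standard `--parsable` option so sbatch prints only the job ID, then passes that ID to `--dependency=afterok`. The function name `submit_chain` and the script names are hypothetical.

```shell
# Hypothetical helper: submit two job scripts so that the second starts only
# if the first finishes successfully. Prints both job IDs.
submit_chain() {
    local first second
    first=$(sbatch --parsable "$1") || return 1
    first=${first%%;*}          # --parsable may print "jobid;cluster"
    second=$(sbatch --parsable --dependency=afterok:"$first" "$2") || return 1
    echo "${first} ${second%%;*}"
}

# Usage on a Sherlock login node (script names are placeholders):
#   submit_chain preprocess.sh train.sh
#   squeue -u "$USER"          # the second job shows as pending on a dependency
```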

Common #SBATCH Directives Template

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=normal
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<sunetid>@stanford.edu

# Load modules
module load python/3.12

# Run
python my_script.py

GPU Requests

# Alternatives: keep one --gpus line and at most one -C line per script.
#SBATCH --partition=gpu
#SBATCH --gpus=1                    # 1 GPU (any type)
#SBATCH --gpus=v100:2               # 2 V100 GPUs
#SBATCH --gpus=a100:1               # 1 A100 GPU
#SBATCH -C "GPU_MEM:80GB"           # GPU with 80GB memory
#SBATCH -C "GPU_CC:8.0"             # GPU with compute capability 8.0
#SBATCH -C "GPU_SKU:A100_SXM4"      # Specific GPU SKU
#SBATCH -C "GPU_GEN:AMP"            # GPU generation (AMP=Ampere, HOP=Hopper)

Available GPU types: V100 (16/32GB), A100 (40/80GB), A40 (48GB), L40S (48GB), H100 (80GB)

Constraint features (combinable with &):

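GPU_SKU: GPU model (e.g. GPU_SKU:A100_SXM4)
GPU_MEM: GPU memory size (e.g. GPU_MEM:80GB)
GPU_CC: CUDA compute capability (e.g. GPU_CC:8.0)
GPU_GEN: GPU generation (e.g. GPU_GEN:AMP)
GPU_BRD: GPU board type
GPU_CNT: GPUs per node

SLURM matches each feature as an exact string, not as a minimum, so GPU_MEM:80GB selects only nodes labeled with that value.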
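
As a sketch of combining features with &, the script below asks for one Ampere-generation GPU that also carries the 80GB memory feature. The time and memory values are placeholders, and the nvidia-smi call is only there to report what was allocated.

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --mem=16G
#SBATCH --gpus=1
# Both features must be present on the node for the job to be scheduled there.
#SBATCH -C "GPU_GEN:AMP&GPU_MEM:80GB"

# Report the allocated GPU; fall back to a message when nvidia-smi is absent
# (for example, when dry-running this script away from a GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_info=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader)
else
    gpu_info="nvidia-smi not found"
fi
echo "$gpu_info"
```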

Storage

Path	Quota	Backup	Purge	Speed	Use for
`$HOME`	15 GB	Yes (snapshots)	Never	Slow (NFS)	Config, scripts, small code
`$SCRATCH`	100 TB	No	90 days no access	Fast (Lustre)	Job I/O, temp data, large datasets
`$OAK`	Paid	Optional	Never	Medium	Long-term research data
`$L_SCRATCH`	Node-local	No	Job end	Fastest (SSD)	Tmp files during job only
`$GROUP_HOME`	1 TB	Yes	Never	Slow (NFS)	Shared group configs
`$GROUP_SCRATCH`	100 TB	No	90 days no access	Fast (Lustre)	Shared group temp data

Check quotas: sh_quota or lfs quota -u $USER $SCRATCH
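
The table suggests a common pattern for I/O-heavy jobs: stage inputs from $SCRATCH onto node-local $L_SCRATCH, work there, and copy results back before the job ends. A minimal sketch, assuming a hypothetical `myproject` layout under $SCRATCH; the `ls` line stands in for the real computation, and the temporary-directory fallbacks exist only so the script can be dry-run off the cluster.

```shell
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Fall back to temporary directories when not running on Sherlock.
scratch="${SCRATCH:-$(mktemp -d)}"
local_scratch="${L_SCRATCH:-$(mktemp -d)}"

# Hypothetical layout: inputs and results live under $SCRATCH/myproject.
# (mkdir is here only so the sketch runs end to end on an empty directory.)
project="$scratch/myproject"
mkdir -p "$project/input" "$project/results"

# 1. Stage inputs onto the node-local disk.
cp -r "$project/input" "$local_scratch/"

# 2. Work on the local copy (placeholder for the real computation).
work="$local_scratch/work"
mkdir -p "$work"
ls "$local_scratch/input" > "$work/file_list.txt"

# 3. Copy results back; node-local storage is purged when the job ends.
cp -r "$work/." "$project/results/"
```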

Module Commands

Command	Purpose
`ml avail`	List available modules
`ml spider <name>`	Search for a module
`ml load <module>`	Load a module
`ml unload <module>`	Unload a module
`ml list`	List loaded modules
`ml purge`	Unload all modules
`ml show <module>`	Show module details
`ml save <name>`	Save current module set
`ml restore <name>`	Restore saved module set

Common modules: python/3.12, cuda/12, gcc/12, openmpi, R, matlab, julia

Job State Codes

Code	State	Meaning
`PD`	Pending	Waiting for resources
`R`	Running	Currently executing
`CG`	Completing	Finishing up
`CD`	Completed	Finished successfully
`F`	Failed	Non-zero exit code
`CA`	Cancelled	Cancelled by user/admin
`TO`	Timeout	Hit time limit
`OOM`	Out of Memory	Exceeded memory limit
`NF`	Node Fail	Node failure
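
When triaging many jobs, the codes above can be turned into a suggested next step. The helper below is a hypothetical convenience, not a Sherlock or SLURM tool; the suggestions are general SLURM practice and use only commands listed in this reference.

```shell
# Hypothetical helper: map a compact state code to a suggested next step.
next_step_for_state() {
    case "$1" in
        PD)   echo "wait, or check the estimate with: squeue -j <jobid> --start" ;;
        R|CG) echo "job is active; monitor with: squeue -u \$USER" ;;
        CD)   echo "done; review usage with: seff <jobid>" ;;
        TO)   echo "raise --time or split the work into shorter jobs" ;;
        OOM)  echo "raise --mem; compare against MaxRSS from sacct" ;;
        F)    echo "inspect the .err file for the failing command" ;;
        CA)   echo "cancelled; resubmit if that was unintended" ;;
        NF)   echo "node failure; consider resubmitting" ;;
        *)    echo "unknown state code: $1" ;;
    esac
}

next_step_for_state OOM
```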

Useful Tips

Check job efficiency after completion: seff <jobid>
Estimate start time: squeue -j <jobid> --start
Job arrays: #SBATCH --array=0-9 then use $SLURM_ARRAY_TASK_ID
Email notifications: #SBATCH --mail-type=BEGIN,END,FAIL
Dependency chains: sbatch --dependency=afterok:<jobid> next_job.sh
Avoid $HOME for I/O: It's NFS-mounted and slow; use $SCRATCH
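
To illustrate the job array tip, here is a minimal sketch of an array script in which each task picks its own input by index. The `input_<N>.txt` naming scheme is a hypothetical example, and the default of 0 is there only so the script can be dry-run outside SLURM.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --partition=normal
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --array=0-9
#SBATCH --output=%x_%A_%a.out

# SLURM sets SLURM_ARRAY_TASK_ID to 0..9, one value per array task.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: one input file per task.
input_file="input_${task_id}.txt"
echo "Task ${task_id} would process ${input_file}"
```

In the output pattern, `%A` expands to the array's master job ID and `%a` to the task index, so each task writes to its own log file.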

Python venvs: Preferred over conda on Sherlock

module load python/3.12
python -m venv $HOME/venvs/myenv
source $HOME/venvs/myenv/bin/activate