hpc-cluster-tooling - SKILL.md Agent Skill

name: hpc-cluster-tooling description: "Practitioner knowledge base for the practical workflow tooling of HPC clusters — the command-line skills around writing, building, debugging, profiling, and running scientific code at scale. Use when working on a cluster or HPC project's toolchain: Unix shell for HPC (pipes, grep/sed/awk, ssh/rsync, tmux, environment modules); build automation with Make (targets, rules, automatic variables, pattern rules, parallel make); the CMake build system (out-of-source builds, find_package, target-based commands, build types); git version control with HPC discipline (gitignore, LFS, reproducibility tagging); debugging with GDB (breakpoints, watchpoints, backtrace, core dumps); memory and parallel debugging (Valgrind, AddressSanitizer/ThreadSanitizer, MPI debugging, DDT/TotalView); profiling and benchmarking (gprof, perf, PAPI hardware counters, TAU parallel profiling/tracing); and SLURM batch job management (sbatch/squeue/scancel, #SBATCH directives, job arrays, dependencies, login vs compute nodes). Covers the cluster workflow and command-line tooling — not numerical algorithms or parallel-programming model APIs." allowed-tools: - Read - Grep argument-hint: [tool (slurm/cmake/gdb/tau), topic, or chapter (e.g. ch08)]

HPC Cluster Tooling

Scope: Unix shell · Make · CMake · git · GDB · memory/parallel debugging · profiling (gprof/PAPI/TAU) · SLURM batch jobs | Chapters: 8 | Generated: 2026-06-09

How to Use This Skill

Without arguments — load the core workflow below.
With a tool — ask about SLURM, CMake, GDB, TAU, Valgrind; I find and read the relevant chapter.
With a topic — ask about job arrays, watchpoints, hardware counters, out-of-source build; I find the chapter.
With a chapter — ask for ch08; I load that file.

When you ask about something not in the Core section, I read the relevant chapter (and cheatsheet.md / patterns.md / glossary.md).

Core Workflow

The cluster is a Unix terminal (Ch 1)

Everything happens at the shell: compose tools with pipes (grep | awk | sort), transfer data with rsync (incremental/resumable), run long work in tmux (survives disconnect), select the toolchain with module load. Record modules with results.

Build: Make → CMake (Ch 2, 3)

Make rebuilds incrementally by timestamp (target: prereqs + TAB recipe; $@/$</$^; make -j). CMake is the meta-build system: declarative CMakeLists.txt generates Make/Ninja; always build out-of-source (cmake -B build), use find_package + target-based commands, set CMAKE_BUILD_TYPE. Don't hand-write Makefiles for portable projects.

Version control (Ch 4)

Version source + build recipe in git; keep artifacts/large data out (.gitignore/LFS). Commit often on branches; tag and record the commit hash + environment with every result — reproducibility = commit + modules.

Debug: correctness (Ch 5, 6)

GDB (build -g -O0): on a crash, gdb prog core → backtrace first; watch var for corruption; conditional breakpoints to skip to the failing iteration. Memory errors are silent in C/C++ — run sanitizers (-fsanitize=address/thread/undefined) routinely, Valgrind for exhaustive coverage. Parallel bugs (deadlock = mismatched blocking exchange; race = run TSan) — reproduce at the smallest rank count; GDB-per-rank (few) or DDT/TotalView (many).

Profile: performance (Ch 7)

Measure before optimizing (Amdahl). gprof/perf for where (hot functions), PAPI counters for why (cache/branch/FLOP), TAU for which rank/phase (parallel imbalance). Benchmark reproducibly: warm up, pin nodes, repeat, report spread.

Run: SLURM (Ch 8)

Never compute on the login node — submit a batch script (#SBATCH resource directives + module load + srun). sbatch/squeue/scancel. Realistic wall time (-t: shorter backfills sooner, overruns killed). Job arrays (--array) for parameter sweeps; --dependency=afterok to chain workflows. Record job id (%j) + commit for reproducibility.

Chapter Index

#	Title	Key Topics
ch01	Unix for HPC	pipes, grep/sed/awk, ssh/rsync, tmux, modules
ch02	Make	targets/rules, automatic variables, pattern rules, `-j`
ch03	CMake	out-of-source, find_package, target-based, build types
ch04	Git	branches, .gitignore/LFS, reproducibility tagging
ch05	Debugging with GDB	breakpoints, watchpoints, backtrace, core dumps
ch06	Memory & Parallel Debugging	Valgrind, sanitizers, MPI debugging, DDT/TotalView
ch07	Profiling & Benchmarking	gprof, perf, PAPI, TAU/ParaProf/Jumpshot
ch08	SLURM & Batch Jobs	sbatch/squeue, #SBATCH, job arrays, dependencies

Topic Index

awk / sed / grep / pipes → ch01
batch script / #SBATCH → ch08
CMake / find_package / out-of-source → ch03
core dump / post-mortem → ch05
DDT / TotalView / parallel debugging → ch06
GDB / breakpoints / watchpoints / backtrace → ch05
git / .gitignore / LFS / reproducibility → ch04
gprof / perf → ch07
job arrays / dependencies → ch08
Make / targets / automatic variables → ch02
memory errors / sanitizers / Valgrind → ch06
modules / environment → ch01
PAPI / hardware counters → ch07
rsync / ssh / tmux → ch01
SLURM / sbatch / squeue / srun → ch08
TAU / parallel profiling / tracing → ch07

Supporting Files

glossary.md — every key term with its defining chapter
patterns.md — concrete techniques (shell pipelines, out-of-source build, reproducibility tagging, post-mortem GDB, sanitizers-first, profile-then-optimize, batch submission, parameter sweeps)
cheatsheet.md — command reference: Make/CMake/git/GDB/SLURM tables, debugging-tool picker, profiling-tool picker, #SBATCH directives

Scope & Limits

Covers the practical cluster workflow and command-line tooling of HPC — shell, build systems, version control, debugging, profiling, and the batch scheduler. For the numerical algorithms and theory, see hpc-numerics; for parallel-programming models (MPI/OpenMP/CUDA/Kokkos APIs), see gpu-multithreading / cpp-hpc / mpi-5.0 / openmp-6.0 / cuda-programming; for Fortran build tooling specifically, see fobis. Specific scheduler/tool options vary by site and version — verify against your cluster's documentation and man pages.