name: gpu-multithreading
description: "Practitioner knowledge base for parallel, multithreaded, and GPU programming: the design methodology, performance laws, and cross-technology optimization playbook. Use when designing or optimizing parallel software: choosing a parallel decomposition (PCAM, geometric/pipeline/master-worker patterns); reasoning about speedup and scalability (Amdahl, Gustafson, roofline, arithmetic intensity); writing shared-memory code (C++ threads, mutexes, atomics, memory_order, condition variables, lock-free/CAS, false sharing, deadlock); distributed-memory message passing (MPI, domain decomposition, halo exchange, collectives); GPU programming (CUDA/OpenCL thread hierarchy, warps, coalescing, shared-memory tiling, occupancy, host-device transfer); directive-based parallelism (OpenMP fork-join, data-sharing clauses, reductions); OpenMP GPU offload in depth (target/teams/distribute, the map clause and target-data regions, declare target, unified shared memory, async multi-device offload, the Eightfold Path to performance); high-level GPU template libraries (Thrust transform/reduce/scan/sort); load balancing (static/DLT, dynamic/work-stealing/master-worker); or diagnosing parallel performance and correctness pitfalls (data races, uncoalesced access, benchmark-timing errors, floating-point non-reproducibility). Self-contained: the CUDA, MPI, OpenMP, and OpenCL chapters carry concrete APIs, host-program skeletons, code, and parameter tables, usable for hands-on coding, not just design."
allowed-tools:
  - Read
  - Grep
argument-hint: [topic, technology (cuda/mpi/openmp), pattern, or chapter (e.g. ch03)]
GPU & Multithreaded Parallel Programming
Scope: design methodology · performance laws · shared/distributed/GPU programming models · OpenMP GPU offload · optimization | Chapters: 13 | Generated: 2026-06-09
How to Use This Skill
- Without arguments: load the core decision rules below.
- With a topic: ask about decomposition, Amdahl, coalescing, false sharing, load balancing, or memory_order; I find and read the relevant chapter.
- With a technology: ask about CUDA, MPI, or OpenMP; I load that chapter.
- With a chapter: ask for ch03; I load that file.
When you ask about something not in the Core section, I read the relevant chapter (and cheatsheet.md / patterns.md / glossary.md). The CUDA, MPI, OpenMP, and OpenCL chapters are self-contained — they carry concrete APIs, host-program skeletons, code, and parameter tables for hands-on coding.
Core Decision Framework
Pick the programming model from the memory model (Ch 1)
Shared address space, one node → threads / OpenMP. Separate address spaces, many nodes → MPI. Massive data-parallel arithmetic → GPU (CUDA / OpenCL / Thrust). Real HPC node → hybrid (MPI + OpenMP/GPU). CPUs hide latency with caches; GPUs hide it with occupancy — algorithms must match.
Design with PCAM + a decomposition pattern (Ch 2)
Partition → Communicate → Agglomerate → Map. The task dependency graph is the central artifact. Pattern by work shape: embarrassingly parallel (independent), divide-and-conquer (recursive), geometric/domain + halo (grids/stencils, the HPC workhorse), pipeline (streaming; slowest stage sets rate), master–worker (irregular). Minimize the surface-to-volume ratio of each subdomain in geometric decomposition: halo communication scales with the surface, computation with the volume.
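The surface-to-volume trade-off in geometric decomposition can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a square n×n grid with a one-cell-wide halo, looks at an interior subdomain, and ignores corner cells. The function names are hypothetical and do not come from any chapter.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers: halo cells exchanged per owned cell for an interior
// subdomain of an n x n grid with a one-cell-wide halo.

// 1D strips: each rank owns n/p full rows and exchanges 2 rows of n cells.
double strip_surface_to_volume(int n, int p) {
    assert(n > 0 && p > 0 && n % p == 0);
    const double volume = static_cast<double>(n) * (n / p);
    const double surface = 2.0 * n;
    return surface / volume;
}

// 2D blocks: p must be a perfect square q*q; each rank owns an (n/q)^2 tile
// and exchanges 4 edges of n/q cells each.
double block_surface_to_volume(int n, int p) {
    const int q = static_cast<int>(std::lround(std::sqrt(static_cast<double>(p))));
    assert(n > 0 && q > 0 && q * q == p && n % q == 0);
    const double side = static_cast<double>(n) / q;
    return (4.0 * side) / (side * side);
}
```

For n = 1024 and p = 16, the strip layout exchanges 0.03125 halo cells per owned cell and the block layout 0.015625, so under these assumptions the squarer subdomain halves the communication per unit of compute.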
Know the ceilings (Ch 3)
Amdahl (strong scaling): S ≤ 1/((1−α)+α/N), where α is the parallelizable fraction and N the processor count; speedup is capped at 1/(1−α), so attack the serial fraction first. Gustafson (weak scaling): S = (1−α)+αN, roughly linear when the problem size grows with N. Roofline: arithmetic intensity (FLOPs/byte) locates a kernel as memory-bound (reuse/tile) or compute-bound (vectorize/FMA).
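The three ceilings are easy to evaluate numerically. The sketch below assumes α is the parallelizable fraction and N the processor count, as in the formulas above, and uses the usual roofline bound min(peak, intensity × bandwidth). The function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// alpha: parallelizable fraction in [0, 1]; n: processor count.
double amdahl_speedup(double alpha, int n) {
    assert(alpha >= 0.0 && alpha <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - alpha) + alpha / n);
}

// Upper bound on Amdahl speedup as n grows without limit.
double amdahl_limit(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return 1.0 / (1.0 - alpha);
}

// Scaled (weak-scaling) speedup.
double gustafson_speedup(double alpha, int n) {
    assert(alpha >= 0.0 && alpha <= 1.0 && n >= 1);
    return (1.0 - alpha) + alpha * n;
}

// Attainable GFLOP/s for a kernel with the given arithmetic intensity
// (FLOPs per byte) on a machine with the given peak and memory bandwidth.
double roofline_gflops(double intensity, double peak_gflops, double bandwidth_gbs) {
    assert(intensity >= 0.0 && peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return std::min(peak_gflops, intensity * bandwidth_gbs);
}
```

With α = 0.95 and N = 64, Amdahl gives about 15.4× against a hard cap of 20×, while Gustafson gives 60.85×. On a hypothetical machine with 1000 GFLOP/s peak and 200 GB/s bandwidth, a kernel at 0.5 FLOPs/byte is memory-bound at 100 GFLOP/s.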
Shared-memory correctness (Ch 4–5)
Shared mutable state → mutex or atomic (read-only and thread-local data need neither); a data race is UB. Always use RAII locks (scoped_lock for multiple mutexes, or a global lock order → no deadlock). Call cv.wait with a predicate so spurious wakeups are handled. Default atomics to seq_cst; weaken to acquire/release only with a happens-before proof. Watch for false sharing (independent hot variables on the same cache line). Start with coarse locking; earn lock-free (CAS loops, ABA-aware) with profiler evidence.
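Three of these rules fit in one short sketch: locking two mutexes with std::scoped_lock, waiting on a condition variable with a predicate, and a CAS loop on a default (seq_cst) atomic. The types and function names (Account, BlockingQueue, atomic_max, run_transfers) are hypothetical illustrations, not code from the chapters.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Account {
    std::mutex m;
    long balance = 0;
};

// std::scoped_lock acquires both mutexes with a deadlock-avoidance algorithm,
// so transfer(a, b) and transfer(b, a) may run concurrently.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::scoped_lock lock(from.m, to.m);
    from.balance -= amount;
    to.balance += amount;
}

// Minimal blocking queue: the predicate form of wait() handles spurious
// wakeups and a push that happens before the consumer starts waiting.
class BlockingQueue {
public:
    void push(int v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(v);
        }
        cv_.notify_one();
    }
    int pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// CAS loop: raise target to at least value. On failure,
// compare_exchange_weak reloads cur with the current contents.
void atomic_max(std::atomic<int>& target, int value) {
    int cur = target.load();
    while (cur < value && !target.compare_exchange_weak(cur, value)) {
    }
}

// Runs opposing transfers on two threads; returns the combined balance.
long run_transfers(int iterations) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < iterations; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < iterations; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Calling run_transfers with any iteration count should return 2000, since money only moves between the two accounts. The atomic_max loop uses the default seq_cst ordering on purpose; weakening it would need the happens-before argument mentioned above.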
GPU performance order (Ch 7–8, 10)
1. Coalesce global access (SoA, contiguous per warp): the #1 lever.
2. Tile through shared memory (raise arithmetic intensity).
3. Keep occupancy high enough to hide latency.
4. Minimize warp divergence.
5. Minimize and overlap host↔device copies.

CUDA↔OpenCL maps thread/block/grid ↔ work-item/work-group/NDRange. Prefer high-level primitives (transform/reduce/scan/sort) over hand kernels.
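Why SoA helps coalescing can be seen from address arithmetic alone, without a GPU. The host-side C++ sketch below is not a kernel. It assumes 32 consecutive threads each read one 4-byte float, that memory is fetched in 128-byte segments, and that the first access is segment-aligned. The segment size, struct layouts, and function names are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: thread i reading p[i].x touches addresses that are
// sizeof(ParticleAoS) bytes apart.
struct ParticleAoS {
    float x, y, z, w;
};

// Structure of arrays: thread i reading x[i] touches adjacent floats.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Byte distance between the x fields read by two consecutive threads.
constexpr std::size_t aos_stride_bytes() { return sizeof(ParticleAoS); }
constexpr std::size_t soa_stride_bytes() { return sizeof(float); }

// Counts the distinct segment-sized memory blocks touched when `threads`
// consecutive threads each read one aligned float at the given byte stride.
std::size_t segments_touched(std::size_t threads, std::size_t stride,
                             std::size_t segment) {
    assert(segment > 0 && stride % sizeof(float) == 0 &&
           segment % sizeof(float) == 0);
    std::size_t count = 0;
    std::size_t prev = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t seg = (t * stride) / segment;
        if (t == 0 || seg != prev) {  // addresses increase, so a change is a new segment
            ++count;
            prev = seg;
        }
    }
    return count;
}
```

Under these assumptions, segments_touched(32, soa_stride_bytes(), 128) is 1 and segments_touched(32, aos_stride_bytes(), 128) is 4 on a platform where the struct is 16 bytes: the AoS layout moves four times the memory to deliver the same 32 values.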
Distributed memory (Ch 6)
SPMD over a communicator; communication is the bottleneck — minimize, batch, overlap (Irecv/Isend + interior compute). Prefer collectives over point-to-point loops; Sendrecv/nonblocking to avoid deadlock. Domain decomposition + halo exchange is the dominant pattern.
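The domain decomposition plus halo exchange pattern can be sketched without MPI by simulating the ranks in one process. In the sketch below, plain copies stand in for the send/receive step, the grid is 1D with a one-cell halo and zero values outside the global boundary, and all names are hypothetical. It illustrates the data flow and makes no real MPI calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One subdomain: owned cells are u[1..owned()], u[0] and u[owned()+1] are halos.
struct Subdomain {
    std::vector<double> u;
    explicit Subdomain(std::size_t owned) : u(owned + 2, 0.0) {}
    std::size_t owned() const { return u.size() - 2; }
};

// Stand-in for message passing: copy each neighbor's edge cell into the
// adjacent halo cell. Outer halos keep the boundary value 0.
void exchange_halos(std::vector<Subdomain>& parts) {
    for (std::size_t r = 0; r + 1 < parts.size(); ++r) {
        Subdomain& left = parts[r];
        Subdomain& right = parts[r + 1];
        left.u[left.owned() + 1] = right.u[1];
        right.u[0] = left.u[left.owned()];
    }
}

// 3-point average over owned cells, reading halo cells at the edges.
void smooth(Subdomain& s) {
    std::vector<double> next(s.u);
    for (std::size_t i = 1; i <= s.owned(); ++i)
        next[i] = (s.u[i - 1] + s.u[i] + s.u[i + 1]) / 3.0;
    s.u = next;
}

// Scatter, exchange halos, update locally, gather.
std::vector<double> decomposed_step(const std::vector<double>& global,
                                    std::size_t ranks) {
    assert(ranks > 0 && !global.empty() && global.size() % ranks == 0);
    const std::size_t owned = global.size() / ranks;
    std::vector<Subdomain> parts(ranks, Subdomain(owned));
    for (std::size_t r = 0; r < ranks; ++r)
        for (std::size_t i = 0; i < owned; ++i)
            parts[r].u[i + 1] = global[r * owned + i];
    exchange_halos(parts);
    for (Subdomain& p : parts) smooth(p);
    std::vector<double> out(global.size());
    for (std::size_t r = 0; r < ranks; ++r)
        for (std::size_t i = 0; i < owned; ++i)
            out[r * owned + i] = parts[r].u[i + 1];
    return out;
}

// Serial reference with the same zero boundary.
std::vector<double> serial_step(const std::vector<double>& g) {
    std::vector<double> out(g.size());
    for (std::size_t i = 0; i < g.size(); ++i) {
        const double left = i > 0 ? g[i - 1] : 0.0;
        const double right = i + 1 < g.size() ? g[i + 1] : 0.0;
        out[i] = (left + g[i] + right) / 3.0;
    }
    return out;
}
```

The decomposed step matches the serial reference exactly because each cell performs the same additions in the same order. In a real MPI code the copy loop would become the exchange step, which is where overlapping communication with interior compute pays off.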
Load balancing (Ch 11) & pitfalls (Ch 12)
Static (proportional/DLT) for predictable work; dynamic (master–worker, work stealing) for irregular. Balance heterogeneous hardware by measured throughput. Always synchronize before timing async GPU/MPI work — a missing sync produces the classic impossible ">100× speedup." Parallel floating-point reductions reorder additions → not bitwise reproducible.
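Two of these points are easy to check on the host. The sketch below shows a static proportional split by measured throughput, and a left-to-right sum versus a tree-shaped (pairwise) sum of the same floats, which is the kind of reordering a parallel reduction introduces. Function names are hypothetical, and the split gives the rounding remainder to the last device as a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Static proportional split: each device gets a share of `items` proportional
// to its measured throughput; the last device absorbs the rounding remainder.
std::vector<std::size_t> proportional_split(std::size_t items,
                                            const std::vector<double>& throughput) {
    assert(!throughput.empty());
    const double total = std::accumulate(throughput.begin(), throughput.end(), 0.0);
    assert(total > 0.0);
    std::vector<std::size_t> share(throughput.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < throughput.size(); ++i) {
        const std::size_t want =
            static_cast<std::size_t>(items * (throughput[i] / total));
        share[i] = std::min(want, items - assigned);
        assigned += share[i];
    }
    share.back() = items - assigned;
    return share;
}

// Serial order: ((v0 + v1) + v2) + ...
float sum_left_to_right(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Tree order, as a parallel reduction would combine partial sums.
float sum_pairwise(const float* v, std::size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return v[0];
    const std::size_t half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f} with standard IEEE single-precision arithmetic (no fast-math), the left-to-right sum is 1.0f and the pairwise sum is 0.0f, because adding 1 to a float of magnitude 1e8 is lost to rounding. Neither order is wrong; they are just not bitwise identical.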
Chapter Index
| # | Title | Key Topics |
|---|---|---|
| ch01 | Hardware & Parallelism Taxonomy | Flynn, SIMD/MIMD/SPMD, shared/distributed, NUMA, GPU hierarchy |
| ch02 | Decomposition & PCAM | PCAM, dependency graph, geometric/pipeline/master-worker patterns |
| ch03 | Performance Laws & Scalability | Amdahl, Gustafson, roofline, arithmetic intensity |
| ch04 | C++ Threads & Concurrency | thread/jthread, mutex, CV, atomics, memory_order, deadlock |
| ch05 | Parallel Data Structures | locking spectrum, lock-free/CAS, ABA, scan, reduction |
| ch06 | Distributed Memory (MPI) | SPMD, point-to-point, collectives, halo exchange, overlap |
| ch07 | GPU Programming (CUDA) | grid/block/thread, warps, coalescing, shared-memory tiling, occupancy |
| ch08 | Portable Accelerators (OpenCL) | NDRange, work-items, CUDA↔OpenCL map, runtime compile, SYCL |
| ch09 | Shared Memory (OpenMP) | fork-join, data-sharing clauses, reduction, schedule, target offload |
| ch10 | High-Level GPU (Thrust) | device_vector, transform/reduce/scan/sort, fancy iterators, fusion |
| ch11 | Load Balancing | static/DLT, dynamic, master-worker, work stealing, tuple space |
| ch12 | Optimization & Pitfalls | triage order, data races, false sharing, timing discipline, FP reproducibility |
| ch13 | OpenMP GPU Offload | target/teams/distribute, map & target-data regions, declare target, USM, async multi-device, Eightfold Path |
Topic Index
- Amdahl / Gustafson / scalability → ch03
- arithmetic intensity / roofline → ch03, ch12
- atomics / memory_order / CAS → ch04, ch05
- coalescing / SoA layout → ch07, ch12
- collectives (Bcast/Allreduce) → ch06
- condition variables → ch04
- CUDA (grid/block/warp/SM) → ch07
- data race / false sharing / deadlock → ch04, ch12
- decomposition patterns → ch02
- domain decomposition / halo exchange → ch02, ch06
- Flynn taxonomy / SIMD / MIMD / SPMD → ch01
- load balancing (static/dynamic/DLT/work-stealing) → ch11
- lock-free / wait-free → ch05
- master–worker / task farm → ch02, ch11
- MPI / message passing → ch06
- mutex / RAII locking → ch04
- OpenCL / NDRange / work-items → ch08
- OpenMP / fork-join / clauses → ch09
- OpenMP GPU offload (target/teams/distribute) → ch13, ch09
- map clause / target-data regions / data movement → ch13
- declare target / unified shared memory (USM) → ch13
- async offload / multi-device (nowait/depend/device) → ch13
- variant directives / performance portability → ch13
- Eightfold Path to performance → ch13
- occupancy / warp divergence → ch07, ch13
- PCAM methodology → ch02
- pipeline pattern → ch02
- reduction / scan (prefix sum) → ch05, ch09, ch10
- roofline / memory-vs-compute bound → ch03, ch12
- shared-memory tiling → ch07
- Thrust / high-level GPU primitives → ch10
- benchmark timing / synchronization → ch12
- floating-point reproducibility → ch05, ch12
Supporting Files
- glossary.md — every key term with its defining chapter
- patterns.md — concrete techniques (PCAM, halo exchange, tiling, CAS loop, overlap, synchronized benchmarking)
- cheatsheet.md — decision rules: model picker, decomposition picker, GPU optimization order, CUDA↔OpenCL map, pitfall→fix
Scope & Limits
This skill covers parallel programming end to end: the design methodology and performance laws, the cross-technology optimization playbook, and self-contained, hands-on chapters for C++ threads, MPI, CUDA, OpenCL, OpenMP, and Thrust, each with concrete APIs, host-program skeletons, code, and parameter tables. It targets the durable model and the practical coding patterns rather than enumerating every API entry point or the newest spec revision's micro-features; for version-specific details, the upstream specifications remain the final word. For language-standard concurrency/memory-model questions, see iso-cpp-2023 (C++) or iso-c-9899-2024 (C).