gpu-ci

name: gpu-ci
description: "GPU CI patterns for CUDA compilation caching, manylinux wheels, and multi-arch builds. Use when distribution=pypi-wheel or hardware_targets includes cuda."
version: 1.0.0

/gpu-ci

Guidance-only skill. No .ai/bin/agent-gpu-ci wrapper exists. Applies when .ai/project.yml declares distribution: pypi-wheel or hardware_targets: [cuda].

Covers: sccache compilation caching, auditwheel manylinux validation, multi-CUDA wheel matrices, and GPU test gating patterns.

1. sccache for CUDA compilation caching

sccache wraps nvcc and g++. Set these before cmake:

export SCCACHE_BUCKET=my-build-cache   # S3 bucket
export SCCACHE_REGION=us-west-2
export CUDAHOSTCXX=/usr/bin/g++        # host compiler nvcc uses for host-side code
sccache --start-server

In CMakeLists.txt:

set(CMAKE_CUDA_COMPILER_LAUNCHER sccache)
set(CMAKE_CXX_COMPILER_LAUNCHER  sccache)

GitHub Actions: use mozilla-actions/sccache-action@v0.0.3. To maximise cache hits, pin compiler versions in the Docker image, avoid timestamp-dependent flags, and keep the CUDA Toolkit version identical across all matrix jobs that share a cache.
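To see whether these cache-hit measures are working, a CI step could compute the hit rate from sccache's statistics. The sketch below assumes that `sccache --show-stats --stats-format json` emits an object with `stats.cache_hits.counts` and `stats.cache_misses.counts` keyed by language. That layout is an assumption to verify against the installed sccache version, and `hit_rate` is a hypothetical helper name.

```python
import json


def hit_rate(stats_json: str) -> float:
    """Return hits / (hits + misses) from sccache JSON statistics.

    Assumed layout (verify against your sccache version):
    {"stats": {"cache_hits": {"counts": {...}},
               "cache_misses": {"counts": {...}}}}
    """
    stats = json.loads(stats_json)["stats"]
    hits = sum(stats["cache_hits"]["counts"].values())
    misses = sum(stats["cache_misses"]["counts"].values())
    total = hits + misses
    # An empty cache run has no requests; report 0.0 instead of dividing by zero.
    return hits / total if total else 0.0


# Synthetic sample in the assumed layout (not real sccache output).
sample = json.dumps({"stats": {
    "cache_hits": {"counts": {"C/C++": 30, "CUDA": 60}},
    "cache_misses": {"counts": {"C/C++": 5, "CUDA": 5}},
}})
print(f"hit rate: {hit_rate(sample):.0%}")
```

In a real job the string would come from the sccache command's standard output rather than from a hand-built sample; a low rate on a warm cache usually points at one of the causes listed above.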

2. auditwheel — manylinux wheel validation

The CUDA driver and runtime libraries MUST NOT be bundled into the wheel (the user's environment provides them), so exclude each one explicitly:

auditwheel repair dist/*.whl \
  --exclude libcuda.so.1      \
  --exclude libcudart.so.11.0 \
  --exclude libcudart.so.12   \
  --exclude libcublas.so.11   \
  --exclude libcublas.so.12   \
  --exclude libcublasLt.so.11 \
  --exclude libcublasLt.so.12 \
  --exclude libcudnn.so.8     \
  --exclude libnccl.so.2      \
  --plat manylinux2014_x86_64 \
  -w dist/repaired/

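Always exclude: libcuda, libcudart, libnvrtc, libcublas*, libcudnn, libnccl. The command above does not list libnvrtc; add an --exclude for its soname if the extension links against it. May bundle: your own CUDA kernels compiled as .so.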
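A post-repair check can confirm that none of the excluded libraries ended up in the wheel anyway. The following is a minimal sketch, not part of auditwheel: `bundled_cuda_libs` is a hypothetical helper, and the optional `-<8 hex chars>` part of the pattern assumes auditwheel's usual renaming of bundled libraries.

```python
import io
import re
import zipfile

# Library names from the "always exclude" list, with an optional
# auditwheel-style hash tag (assumed to be 8 hex characters).
_FORBIDDEN = re.compile(
    r"^(libcuda|libcudart|libnvrtc|libcublas(Lt)?|libcudnn\w*|libnccl)"
    r"(-[0-9a-f]{8})?\.so(\.\d+)*$"
)


def bundled_cuda_libs(wheel):
    """Return archive members whose file name is a forbidden CUDA library.

    `wheel` may be a path or a binary file object; a wheel is a zip archive.
    """
    with zipfile.ZipFile(wheel) as archive:
        names = archive.namelist()
    return [n for n in names if _FORBIDDEN.match(n.rsplit("/", 1)[-1])]


# Demonstration on an in-memory archive with made-up member names.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("mypackage.libs/libcudart-1a2b3c4d.so.11.0", b"")
    archive.writestr("mypackage.libs/libcuda_kernels.so", b"")  # custom kernel, allowed
print(bundled_cuda_libs(demo))
```

A CI job could run the same function over every file in dist/repaired/ and fail the build when the returned list is not empty.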

3. Multi-CUDA wheel build matrix

Use PEP 440 local version identifiers: mypackage-0.1.0+cu118-cp310-…whl

Typical GitHub Actions matrix:

strategy:
  matrix:
    cuda: [cu118, cu121, cu124]
    python: ['3.9', '3.10', '3.11', '3.12']

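CUDA version map: cu118 → 11.8.0, cu121 → 12.1.0, cu124 → 12.4.0. Use Jimver/cuda-toolkit@v0.2.11 to install the toolkit. Inject the +cuXYZ suffix into the package version before uploading so that wheels for different CUDA variants do not overwrite each other. Note that PyPI rejects uploads whose version carries a local identifier, so +cuXYZ wheels have to be published to a separate index or as release assets.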
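The tag-to-version map and the suffix injection can be kept in one small helper that each matrix job calls. This is a sketch under assumptions: `toolkit_version` and `local_version` are hypothetical names, and how the resulting version string reaches the build backend depends on the project's build setup.

```python
import re

# Matrix tag -> exact CUDA Toolkit version, as listed in the text above.
CUDA_VERSIONS = {"cu118": "11.8.0", "cu121": "12.1.0", "cu124": "12.4.0"}


def toolkit_version(tag):
    """Return the fully pinned toolkit version (X.Y.Z) for a matrix tag."""
    version = CUDA_VERSIONS[tag]
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"{tag}: pin an exact patch version, got {version!r}")
    return version


def local_version(base, tag):
    """Append a PEP 440 local version label, e.g. 0.1.0 -> 0.1.0+cu118."""
    if "+" in base:
        raise ValueError(f"{base!r} already has a local version label")
    return f"{base}+{tag}"


for tag in CUDA_VERSIONS:
    print(tag, toolkit_version(tag), local_version("0.1.0", tag))
```

The suffixed version should be what the build itself sees, so that the wheel filename and its metadata agree; renaming the finished file alone would leave the two inconsistent.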

4. GPU test gating patterns

pytest markers (`conftest.py`)

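import pytest

def pytest_runtest_setup(item):
    """Skip GPU-marked tests when the required hardware is absent."""
    # _get_gpu_type() is a helper defined elsewhere in conftest.py; it returns
    # the name reported by `nvidia-smi --query-gpu=name`, or None without a GPU.
    gpu = _get_gpu_type()
    if item.get_closest_marker('gpu') and gpu is None:
        pytest.skip("GPU not available")
    if item.get_closest_marker('h100') and 'H100' not in (gpu or ''):
        pytest.skip("H100 not available")
    if item.get_closest_marker('a100') and 'A100' not in (gpu or ''):
        pytest.skip("A100 not available")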

Usage:

@pytest.mark.gpu
def test_basic_cuda(): ...

@pytest.mark.h100
def test_h100_fp8(): ...

Self-hosted runner labels

jobs:
  test-h100:
    runs-on: [self-hosted, gpu, h100]
  test-a100:
    runs-on: [self-hosted, gpu, a100]

5. Common pitfalls

Problem	Solution
auditwheel bundles `libcudart`	Pass `--exclude 'libcudart.so.*'` (quote the pattern; auditwheel versions without wildcard support need each soname listed)
Inconsistent CUDA patch versions cause sccache cache misses	Pin exact versions: `11.8.0` not `11.8`
Multiple CUDA variants overwrite each other	Inject `+cu118` suffix before upload
GPU tests fail on CPU-only runners	Use pytest markers + skip logic (section 4)

6. Constraints respected

.ai/constraints/hybrid/python-cpp-build.md — wheel build patterns
.ai/constraints/hybrid/system-deps.md — CUDA Toolkit discovery
.ai/constraints/cpp/cuda-modern.md — CUDA compilation flags