gpu-custom-setup - SKILL.md Agent Skill

name: gpu-custom-setup
description: One-time onboarding flow for users who set ROCKIE_GPU_MODE=custom (i.e., they have their own GPU setup, such as their own AWS account, an on-prem cluster, an SSH tunnel to a workstation, university HPC, or custom orchestration, instead of using Rockie's deidentified GPU router). Trigger this when (1) the user mentions GPU/training/provisioning, (2) `echo $ROCKIE_GPU_MODE` returns `custom`, AND (3) `.claude/gpu-custom.md` doesn't exist or is empty. Walks the user through their auth/provision/connect/monitor/terminate flow and saves it to .claude/gpu-custom.md so future agent sessions reuse the saved flow without re-asking. Run this AT MOST ONCE per project.

/gpu-custom-setup — onboard a user with their own GPU setup

Rockie's default GPU layer is the deidentified GPU router (gpu.py). But some users have their own setup they prefer to drive themselves — their own AWS account, on-prem cluster, university HPC quota, custom SSH tunnel, etc. They opt out by setting ROCKIE_GPU_MODE=custom, and the agent reaches for THIS skill on first GPU need to learn how THEY provision compute.

Pre-flight checks

Run these BEFORE invoking the skill. If either check exits early, do not proceed.

# 1. Are we in custom mode? (an unset variable counts as "router")
test "${ROCKIE_GPU_MODE:-router}" = "custom" || exit 0

# 2. Has setup already been done?
test -s .claude/gpu-custom.md && {
  echo "[gpu-custom-setup] .claude/gpu-custom.md already populated — skipping"
  exit 0
}

If neither check exits (custom mode is set and no populated .claude/gpu-custom.md exists yet), continue.
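
When the checks need to run inside a longer script, an `exit` would end the caller's shell. A minimal sketch of the same two checks as a function that returns a status instead is shown here; the function name `needs_gpu_custom_setup` and its optional path argument are hypothetical, not part of Rockie.

```bash
# Hypothetical helper: returns 0 when onboarding should run, 1 otherwise.
# Optional argument: path to the saved flow (default: .claude/gpu-custom.md).
needs_gpu_custom_setup() {
  local flow_file="${1:-.claude/gpu-custom.md}"

  # 1. Only relevant in custom mode; an unset variable counts as "router".
  [ "${ROCKIE_GPU_MODE:-router}" = "custom" ] || return 1

  # 2. A non-empty file means onboarding already happened.
  if [ -s "$flow_file" ]; then
    return 1
  fi

  return 0
}
```

A caller could then write `needs_gpu_custom_setup && echo "start the Q&A"`. Like `test -s`, this treats a file containing only whitespace as populated.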

What to do

This is a Q&A skill — the agent prompts the user, the user pastes commands or describes their flow, the agent structures the answers into .claude/gpu-custom.md and saves.

Sections to elicit, in this order:

1. Authentication

Ask: "How do you authenticate to your GPU provider? Paste any commands or env vars you set up. (e.g. aws configure, gcloud auth login, ssh-add ~/.ssh/key, custom token in env, etc.)"

Probe: if they say "AWS", ask whether they use an IAM role, access keys, or SSO. If they say "on-prem", ask whether they authenticate with an SSH key or a password.

2. Provision (start a GPU)

Ask: "How do you spin up a GPU? Paste the command, script, or describe the steps. (e.g. aws ec2 run-instances ..., sbatch train.slurm, ssh worker && ./start_pod.sh, terraform apply, etc.)"

Probe for:

Default instance type / GPU model
How long it usually takes from "go" to "SSH-ready"
Any pre-flight checks they run (capacity, quota, billing)
How they pass training scripts to the pod (rsync, S3, git clone, etc.)

3. Connect (SSH to running GPU)

Ask: "Once the GPU is provisioned, how do you connect? Paste the SSH template (e.g. ssh user@host -i key.pem)."

If they use a custom tunnel / bastion / VPN, capture that.

4. Monitor (status + cost)

Ask: "How do you check what's currently running and how much it's costing? (e.g. aws ec2 describe-instances, squeue -u $USER, billing dashboard URL, ssh worker top)."

Capture both: per-pod status command AND cost/billing surface (URL or CLI command).

5. Terminate (stop / delete)

Ask: "How do you tear down a GPU when you're done? Be specific — this is the most cost-sensitive command. (e.g. aws ec2 terminate-instances --instance-ids ..., scancel JOBID, ./teardown.sh)."

Capture this carefully. The agent will run this without further human review when the user asks to stop a job, so it must be 100% correct.

6. Optional sections

If the user mentions any of these, capture in their own subsections:

Persistent storage: how training data and checkpoints survive beyond a single pod
Multi-node: if they coordinate multiple GPUs, what the orchestration looks like
Preemption: whether their setup gets preempted, and how they handle it
Budget tracking: whether they can track the $/hr that is accruing, and whether Rockie's budget.py should integrate with it

Save format

Write to .claude/gpu-custom.md with this structure:

# Custom GPU setup — <user's-project-name>

Generated by /gpu-custom-setup on <ISO date>.

## 1. Authentication
<verbatim from user — paste commands in fenced bash blocks>

## 2. Provision
<verbatim>

## 3. Connect
<SSH template, exactly>

## 4. Monitor
### Status
<command>
### Cost
<URL or command>

## 5. Terminate
<command — DOUBLE-CHECKED for correctness; agent will run this without re-confirmation>

## 6. Optional
### Persistent storage / Multi-node / Preemption / etc.
<as captured>

---

**For the runtime agent (/gpu-custom skill):** when the user asks to
run an experiment, follow §2 (Provision) → §3 (Connect) → run their
training command. When they ask "what's running" / "what's it
costing", consult §4. When they say "stop" / "tear down" / "we're
done," run §5 verbatim — do NOT improvise the terminate command.
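
Running §5 verbatim implies reading that section back exactly as it was saved. One possible way to do that, assuming the file follows the numbered `## N.` headings of the template above, is sketched below; `print_flow_section` is a hypothetical helper name, not an existing Rockie command.

```bash
# Hypothetical helper: print the body of one "## N. ..." section of the saved file.
# Usage: print_flow_section 5 [path]
print_flow_section() {
  local number="$1"
  local flow_file="${2:-.claude/gpu-custom.md}"

  awk -v n="$number" '
    /^## /  { in_section = ($2 == (n ".")); next }  # a level-2 heading opens or closes the section
    /^---$/ { in_section = 0 }                      # the rule before the agent note ends section 6
    in_section { print }
  ' "$flow_file"
}
```

For example, `print_flow_section 5` prints the saved terminate block, including any code fences the user's commands were wrapped in. The sketch assumes no line inside a saved command starts with `## ` or consists only of `---`.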

Closing the loop

After saving, tell the user:

Saved your setup to .claude/gpu-custom.md. Future agent sessions will read that file when you mention GPUs and follow the flow you just described. If your setup changes, edit the file directly or delete it and re-run /gpu-custom-setup.

If you noticed gaps during the Q&A (e.g. they didn't mention how preemption is handled, or their cost-tracking story is unclear), end with one or two specific follow-up questions they can answer at their leisure — but don't block on those answers.

Don't do

Don't run any of the user's pasted commands during onboarding — this is information capture, not execution.
Don't add Rockie's GPU-router opinions into the saved file. Custom mode means custom; respect their setup.
Don't try to "wire up" their setup to budget.py automatically. Mention budget integration as an optional next step they can pursue if they want.
Don't overwrite an existing non-empty .claude/gpu-custom.md. If the file exists with content, abort and tell the user to delete or edit manually.
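
The no-overwrite rule can be enforced at write time rather than left to memory. The following is a minimal sketch under the assumption that the structured answers arrive on stdin; `save_gpu_custom_flow` is a hypothetical helper name.

```bash
# Hypothetical helper: write the structured answers (read from stdin) to the
# saved-flow file, refusing to replace a file that already has content.
save_gpu_custom_flow() {
  local flow_file="${1:-.claude/gpu-custom.md}"

  if [ -s "$flow_file" ]; then
    echo "[gpu-custom-setup] $flow_file already has content; delete or edit it manually" >&2
    return 1
  fi

  # Create .claude/ if needed, then copy stdin into the file.
  mkdir -p "$(dirname "$flow_file")"
  cat > "$flow_file"
}
```

A draft could be saved with `save_gpu_custom_flow < draft.md`, where `draft.md` stands for wherever the filled-in template was assembled. An existing but empty file is overwritten, which matches the "doesn't exist or is empty" trigger condition.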