colab-remote - SKILL.md Agent Skill

name: colab-remote
description: >
  Orchestrate Google Colab Pro/Pro+ GPU instances as remote training backends via SSH. Compounds /agent-browser (to launch Colab sessions and install colab-ssh) with SSH (to operate the runtime remotely). Use when: (1) launching a Colab notebook for GPU training, (2) running training jobs on Colab from the local terminal, (3) transferring datasets/checkpoints to/from Colab, (4) monitoring GPU utilization on a Colab instance, (5) integrating Colab GPU compute with /autoany EGRI optimization loops, (6) reconnecting to a Colab session after timeout. Triggers on: "colab", "colab-remote", "colab ssh", "colab training", "remote GPU", "colab pro", "train on colab", "google colab".

Colab Remote — SSH-Operated GPU Training

Operate Google Colab Pro/Pro+ instances as headless GPU backends from the local terminal.

Architecture

Local Mac (Claude Code)
  ├── agent-browser → Chrome → colab.research.google.com
  │   └── Opens notebook, runs colab-ssh setup cell
  ├── SSH tunnel → Colab runtime (via ngrok or cloudflared)
  │   └── Run training, monitor GPU, transfer files
  └── /autoany EGRI loop (local)
      └── Proposes mutations → SSH executes on Colab → evaluates results

Phase 1: Launch Colab Session (Browser Automation)

Use /agent-browser to open Colab and set up SSH access.

Step 1: Open Colab and create notebook

agent-browser open "https://colab.research.google.com/#create=true"
agent-browser wait --load networkidle
agent-browser snapshot -i

If login is required, prompt the user to authenticate manually, then re-snapshot.

Step 2: Select GPU runtime

Navigate to Runtime > Change runtime type, select a GPU (T4/V100/A100, depending on plan), and save.

Step 3: Install colab-ssh and get connection details

Type the SSH setup code into a cell and run it. Two methods are supported:

Method A: ngrok (recommended)

!pip install colab-ssh --upgrade
from colab_ssh import launch_ssh
launch_ssh("YOUR_NGROK_TOKEN")

The user must provide an ngrok authtoken from https://ngrok.com.

Method B: cloudflared (no account needed)

!pip install colab-ssh --upgrade
from colab_ssh import launch_ssh_cloudflared
launch_ssh_cloudflared(password="your-password-here")

Step 4: Extract and save connection details

After the cell runs, snapshot the cell output to extract the hostname and port. Save them for reuse, replacing the example values below with the ones from the output:

mkdir -p ~/.colab-remote
cat > ~/.colab-remote/session.env << 'EOF'
COLAB_HOST=0.tcp.ngrok.io
COLAB_PORT=12345
COLAB_USER=root
COLAB_METHOD=ngrok
EOF

Load in subsequent commands: source ~/.colab-remote/session.env
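
Sourcing a stale or incomplete session file tends to surface later as a confusing SSH error. The sketch below loads the file and fails early instead. The function name `colab_load_session` is hypothetical, and treating the port as required only for the ngrok method is an assumption:

```shell
# Load the saved session and fail early on missing or malformed values.
# colab_load_session is a hypothetical helper, not part of colab-ssh.
colab_load_session() {
  local file="${1:-$HOME/.colab-remote/session.env}"
  if [ ! -r "$file" ]; then
    echo "session file not readable: $file" >&2
    return 1
  fi
  . "$file"
  if [ -z "${COLAB_HOST:-}" ] || [ -z "${COLAB_METHOD:-}" ]; then
    echo "COLAB_HOST and COLAB_METHOD must be set in $file" >&2
    return 1
  fi
  # Assumption: only the ngrok method needs a numeric port.
  if [ "$COLAB_METHOD" = "ngrok" ]; then
    case "${COLAB_PORT:-}" in
      ''|*[!0-9]*)
        echo "COLAB_PORT must be numeric for ngrok" >&2
        return 1
        ;;
    esac
  fi
}
```

Typical use would be `colab_load_session || exit 1` at the top of a script that talks to the runtime.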

Phase 2: SSH Operations

Connect

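# ngrok
ssh -o StrictHostKeyChecking=no -p $COLAB_PORT root@$COLAB_HOST
# cloudflared (requires cloudflared installed locally; no port is needed)
ssh -o StrictHostKeyChecking=no -o ProxyCommand="cloudflared access ssh --hostname %h" root@$COLAB_HOST

The remaining commands in this phase are written in the ngrok form. With cloudflared, replace -p $COLAB_PORT (or -P $COLAB_PORT for scp) with the ProxyCommand option shown above.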
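
Because the two methods need different SSH options, a small wrapper can pick them from COLAB_METHOD. This is a minimal sketch, not part of colab-ssh: the function name `colab_ssh` and the `COLAB_SSH_BIN` override (useful for dry runs with `echo`) are assumptions introduced here.

```shell
# Run a command on the Colab runtime using the options for the saved method.
# colab_ssh and COLAB_SSH_BIN are hypothetical names used for illustration.
colab_ssh() {
  local bin="${COLAB_SSH_BIN:-ssh}" user="${COLAB_USER:-root}"
  case "${COLAB_METHOD:-ngrok}" in
    ngrok)
      "$bin" -o StrictHostKeyChecking=no -p "$COLAB_PORT" \
        "$user@$COLAB_HOST" "$@"
      ;;
    cloudflared)
      "$bin" -o StrictHostKeyChecking=no \
        -o "ProxyCommand=cloudflared access ssh --hostname %h" \
        "$user@$COLAB_HOST" "$@"
      ;;
    *)
      echo "unknown COLAB_METHOD: $COLAB_METHOD" >&2
      return 2
      ;;
  esac
}
```

With the session variables loaded, `colab_ssh nvidia-smi` would run the GPU check regardless of the tunnel method, and `COLAB_SSH_BIN=echo colab_ssh nvidia-smi` prints the arguments without connecting.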

Verify GPU

ssh -p $COLAB_PORT root@$COLAB_HOST "nvidia-smi"

Transfer files

# Upload
scp -P $COLAB_PORT -r ./data root@$COLAB_HOST:/content/data
# Download
scp -P $COLAB_PORT -r root@$COLAB_HOST:/content/checkpoints ./checkpoints

Run training

# Foreground
ssh -p $COLAB_PORT root@$COLAB_HOST "cd /content && python train.py --epochs 10"
# Background (survives SSH disconnect; stdin is redirected so ssh returns immediately)
ssh -p $COLAB_PORT root@$COLAB_HOST "cd /content && nohup python train.py > train.log 2>&1 < /dev/null &"
# Monitor
ssh -p $COLAB_PORT root@$COLAB_HOST "tail -f /content/train.log"
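
A background job is easier to check on or stop when its PID is recorded at launch. The sketch below only builds the remote command string. The helper name `colab_bg_command` and the `<log>.pid` naming are assumptions for illustration:

```shell
# Print a remote command that starts a script under nohup in /content and
# writes the PID of the job next to its log file (as <log>.pid).
# colab_bg_command is a hypothetical helper; adjust paths to your project.
colab_bg_command() {
  local script="$1" log="${2:-train.log}"
  # The brace group keeps `echo $!` in /content and makes $! refer to the job.
  printf 'cd /content && { nohup python %s > %s 2>&1 < /dev/null & echo $! > %s.pid; }' \
    "$script" "$log" "$log"
}
```

It could be used as `ssh -p $COLAB_PORT root@$COLAB_HOST "$(colab_bg_command train.py)"`, and the job checked later with `ssh -p $COLAB_PORT root@$COLAB_HOST 'kill -0 "$(cat /content/train.log.pid)" && echo running'`.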

Monitor GPU

ssh -p $COLAB_PORT root@$COLAB_HOST "nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu --format=csv"
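
The CSV output carries units in each field (for example `45 %` or `5000 MiB`), which is awkward to log or compare. The sketch below condenses it to one line per GPU. The function name `gpu_summary` is hypothetical, and the column order is assumed to match the query above:

```shell
# Read `nvidia-smi --format=csv` output on stdin and print one summary line
# per GPU. Assumes the five columns queried above, in that order.
gpu_summary() {
  awk -F', *' 'NR > 1 {
    # Adding 0 keeps the leading number and drops the unit suffix.
    printf "gpu=%d%% mem=%d/%dMiB temp=%dC\n", $1 + 0, $3 + 0, $4 + 0, $5 + 0
  }'
}
```

The monitoring command can be piped into it locally by appending `| gpu_summary` after the closing quote of the ssh command.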

Install dependencies

ssh -p $COLAB_PORT root@$COLAB_HOST "pip install torch transformers peft bitsandbytes accelerate datasets"

Phase 3: EGRI Integration (/autoany)

Wire Colab as the execution backend for an EGRI optimization loop. See references/egri-colab.md for the full problem-spec template and harness patterns.

Execution loop (summary)

for each trial:
  1. Upload mutated artifact → scp to Colab
  2. Execute on Colab GPU → ssh python train.py
  3. Evaluate results → ssh python evaluate.py
  4. Download metrics → scp results.json
  5. Score locally (immutable evaluator)
  6. Promote or discard based on policy
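
Steps 1 to 4 of the loop above can be sketched as a single function for the ngrok method. This is an illustration, not the harness from references/egri-colab.md: the function names, the `DRY_RUN` switch, and the `/content/results.json` path are assumptions, while `train.py`, `evaluate.py`, and `results.json` come from the summary above.

```shell
# Execute a command, or just print it when DRY_RUN is set (hypothetical switch).
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# One trial: upload the mutated artifact, train, evaluate, fetch metrics.
# Stops at the first failing step so a broken trial is never scored.
run_trial() {
  local artifact="$1" target="root@$COLAB_HOST"
  run scp -P "$COLAB_PORT" "$artifact" "$target:/content/" || return 1
  run ssh -p "$COLAB_PORT" "$target" "cd /content && python train.py" || return 1
  run ssh -p "$COLAB_PORT" "$target" "cd /content && python evaluate.py" || return 1
  run scp -P "$COLAB_PORT" "$target:/content/results.json" ./results.json || return 1
  # Steps 5 and 6 (local scoring and promotion) stay with the local evaluator.
}
```

Running `DRY_RUN=1 run_trial candidate.py` prints the four commands without touching the runtime, which is a cheap way to check the wiring before spending GPU time.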

Phase 4: Session Lifecycle

Tier	Max runtime	Idle timeout	GPU
Free	12h	90min	T4, limited
Pro	24h	90min	T4, V100, priority
Pro+	24h	90min	T4, V100, A100

These limits are approximate. Colab does not guarantee them, and GPU availability varies over time.

Keep-alive

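ssh -p $COLAB_PORT root@$COLAB_HOST "nohup sh -c 'while true; do sleep 300; echo keepalive; done' > /dev/null 2>&1 < /dev/null &"

This only keeps a process running on the runtime. Colab may still disconnect a notebook it considers idle, so keep the browser tab open as well.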

Reconnect after timeout

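1. Check whether the session is alive: ssh -p $COLAB_PORT root@$COLAB_HOST "echo ok" 2>/dev/null && echo "UP" || echo "DOWN"
2. If it is down, re-launch via Phase 1 (browser automation) and update ~/.colab-remote/session.env with the new connection details.
3. Resume training from the last checkpoint.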
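
A single failed check can be a transient tunnel hiccup rather than a dead session, so it can help to retry a few times before re-launching. A minimal sketch, where `retry` is a hypothetical helper name:

```shell
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry 5 30 ssh -o ConnectTimeout=10 -p $COLAB_PORT root@$COLAB_HOST "echo ok"` makes up to five attempts, 30 seconds apart, before reporting the session as down.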

Google Drive persistence

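Mount Drive to persist checkpoints across sessions. drive.mount relies on the notebook's authorization flow, so it may fail when called from a plain SSH shell. In that case, run the same two Python statements in a notebook cell during Phase 1 instead: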

ssh -p $COLAB_PORT root@$COLAB_HOST "python -c 'from google.colab import drive; drive.mount(\"/content/drive\")'"
# Checkpoints survive in /content/drive/MyDrive/
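
Once Drive is mounted, the newest checkpoint can be copied into it after each save. The sketch below is meant to run on the Colab runtime. The function name `backup_latest` and both directory arguments are assumptions, and it relies on file modification times to pick the newest entry:

```shell
# Copy the most recently modified entry in $1 into $2 and print its name.
# Assumes checkpoint names contain no newlines (ls output is parsed).
backup_latest() {
  local src="$1" dst="$2" newest
  newest=$(ls -t "$src" 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no checkpoints in $src" >&2
    return 1
  fi
  mkdir -p "$dst" && cp -r "$src/$newest" "$dst/" && echo "$newest"
}
```

On the runtime this might be called as `backup_latest /content/checkpoints /content/drive/MyDrive/checkpoints`, with both paths adjusted to the project layout.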

Quick Reference

Task	Command
Check GPU	`ssh -p $COLAB_PORT root@$COLAB_HOST "nvidia-smi"`
Upload	`scp -P $COLAB_PORT ./file root@$COLAB_HOST:/content/`
Download	`scp -P $COLAB_PORT root@$COLAB_HOST:/content/file ./`
Run script	`ssh -p $COLAB_PORT root@$COLAB_HOST "python /content/script.py"`
Background job	`ssh -p $COLAB_PORT root@$COLAB_HOST "cd /content && nohup python train.py > log 2>&1 < /dev/null &"`
Tail log	`ssh -p $COLAB_PORT root@$COLAB_HOST "tail -20 /content/log"`
Disk space	`ssh -p $COLAB_PORT root@$COLAB_HOST "df -h /content"`
Kill job	`ssh -p $COLAB_PORT root@$COLAB_HOST "pkill -f train.py"`
Session alive?	`ssh -p $COLAB_PORT root@$COLAB_HOST "echo ok" 2>/dev/null`

Prerequisites

ngrok account (free): https://ngrok.com — or cloudflared: brew install cloudflared
Colab Pro/Pro+ for GPU priority and longer runtimes
agent-browser installed and working
Google account signed into Chrome