gx10-offload - SKILL.md Agent Skill

name: gx10-offload
description: Offload inference, code generation, and batch processing to local GX10 DGX Spark (GB10 Blackwell) running Ollama

GX10 Offload

Offload work to the local NVIDIA DGX Spark cluster node, which runs Ollama with Devstral models on a GB10 Blackwell GPU (128GB unified memory).

When to Use

Long code generation tasks that benefit from a dedicated local model
Batch processing of multiple prompts
Draft generation for later review (a draft-then-verify pattern, loosely analogous to speculative decoding)
Tasks where latency to cloud APIs is a bottleneck
Privacy-sensitive inference that must stay on-premises

Connection

Property	Value
Host (WiFi)	`10.0.0.234` / `gx10-94e2.local` (mDNS)
Host (Tailscale)	`100.67.53.87` (gx10-acee, different unit)
User	`a`
Password	`aaaaaa`
Ollama API	`http://localhost:11434` on the device
SSH tunnel	`ssh -L 11434:localhost:11434 a@10.0.0.234`

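Note: These are two different physical Sparks. The WiFi-connected unit is gx10-94e2 (discovered via mDNS); the Tailscale-reachable unit is gx10-acee. The SSH tunnel entry in the table targets the WiFi unit, while the commands under Quick Usage target the Tailscale unit.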

Available Models

Model	Size	Use Case
`devstral`	14GB	Fast coding tasks, lightweight generation
`devstral-2:123b`	74GB	Heavy reasoning, complex code generation
`devstral2-4k`	74GB	Same use case as `devstral-2:123b`, with a 4k context window

Quick Usage

Single prompt via SSH

sshpass -p 'aaaaaa' ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no \
  a@100.67.53.87 \
  "curl -s http://localhost:11434/api/generate -d '{\"model\":\"devstral\",\"prompt\":\"YOUR_PROMPT\",\"stream\":false}'"

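Hand-escaping the JSON inside a remote shell command gets fragile once a prompt contains quotes or newlines. A minimal sketch of one way to build the request body first, assuming bash; `json_escape` and `build_payload` are hypothetical helper names, not part of the skill's script, and only the most common escapes are handled:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of offload.sh): build an /api/generate body.

# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, carriage return and tab only.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}       # backslashes first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Print the request body for a non-streaming generate call.
build_payload() {
  local model=$1 prompt=$2
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$model")" "$(json_escape "$prompt")"
}

# Example (not run here): send the body over ssh on stdin, read by curl via -d @-
#   build_payload devstral 'Say "hi"' | sshpass -p 'aaaaaa' ssh \
#     -o PreferredAuthentications=password -o PubkeyAuthentication=no \
#     a@100.67.53.87 "curl -s http://localhost:11434/api/generate -d @-"
```

Passing the body on stdin keeps the remote command free of nested quoting, at the cost of one extra local step.
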
Via SSH tunnel (persistent)

# Open tunnel in background
sshpass -p 'aaaaaa' ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no \
  -fNL 11434:localhost:11434 a@100.67.53.87

# Then use locally as if Ollama were running here
curl http://localhost:11434/api/generate \
  -d '{"model":"devstral","prompt":"Hello","stream":false}'
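
Before sending requests through the tunnel, it can help to confirm that something is answering on the forwarded port. A rough sketch, assuming `curl` is installed locally and using Ollama's `/api/tags` model-listing endpoint as the probe; `ollama_ready` and `ensure_tunnel` are hypothetical names, and the retry count is arbitrary:

```shell
#!/usr/bin/env bash
OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}
SSH_TARGET=${SSH_TARGET:-a@100.67.53.87}   # Tailscale unit from the Connection table

# Succeeds if an Ollama API answers at $OLLAMA_URL within the timeout (seconds).
ollama_ready() {
  curl -fsS --max-time "${1:-3}" "$OLLAMA_URL/api/tags" > /dev/null 2>&1
}

# Open the tunnel only when the local port is not already serving Ollama,
# then poll for a few seconds until the API responds.
ensure_tunnel() {
  ollama_ready && return 0
  sshpass -p 'aaaaaa' ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no \
    -fNL 11434:localhost:11434 "$SSH_TARGET" || return 1
  local attempt
  for attempt in 1 2 3 4 5; do
    ollama_ready 2 && return 0
    sleep 1
  done
  return 1
}

# Example (not run here):
#   ensure_tunnel && curl "$OLLAMA_URL/api/generate" \
#     -d '{"model":"devstral","prompt":"Hello","stream":false}'
```

Checking first avoids stacking a second `ssh -fNL` on a port that is already forwarded.
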

OpenAI-compatible API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}]
  }'

Using the offload script

# Simple prompt
~/.claude/skills/gx10-offload/scripts/offload.sh "Write a Rust function for binary search"

# With specific model
~/.claude/skills/gx10-offload/scripts/offload.sh "Explain monads" devstral-2:123b

# Batch mode (one prompt per line)
~/.claude/skills/gx10-offload/scripts/offload.sh --batch prompts.txt

Offload Patterns

1. Draft-and-Review

Offload draft generation to GX10, then review/refine with Claude:

# GX10 generates draft
DRAFT=$(~/.claude/skills/gx10-offload/scripts/offload.sh "Implement a Redis cache wrapper in Python with TTL support")
# Claude reviews and improves the draft

2. Batch Code Generation

Generate multiple implementations in parallel on GX10:

for task in "sort" "search" "hash" "tree"; do
  ~/.claude/skills/gx10-offload/scripts/offload.sh "Implement $task in Rust" &
done
wait
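
With every job writing to the same terminal, the outputs of the loop above can interleave. One possible variation, sketched below, sends each result to its own file and caps how many requests run at once; `batch_generate`, the `OFFLOAD` variable, and the output layout are assumptions for illustration, not features of the script:

```shell
#!/usr/bin/env bash
# Path to the offload script; overridable so the sketch can be tried with a stub.
OFFLOAD=${OFFLOAD:-"$HOME/.claude/skills/gx10-offload/scripts/offload.sh"}

# batch_generate OUTDIR MAX_JOBS TASK...
# Runs one prompt per task, at most MAX_JOBS at a time, and writes
# OUTDIR/TASK.out (stdout) and OUTDIR/TASK.err (stderr) for each task.
batch_generate() {
  local outdir=$1 max=$2
  shift 2
  local running=0 task
  mkdir -p "$outdir"
  for task in "$@"; do
    "$OFFLOAD" "Implement $task in Rust" > "$outdir/$task.out" 2> "$outdir/$task.err" &
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
      wait          # simple batching: drain the current group before starting more
      running=0
    fi
  done
  wait              # collect whatever is left from the last group
}

# Example (not run here):
#   batch_generate ./generated 2 sort search hash tree
```

Draining a whole group before starting the next is cruder than a true job pool, but it keeps the sketch portable across bash versions.
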

3. Test Generation

Offload test writing to the local model:

~/.claude/skills/gx10-offload/scripts/offload.sh "Write pytest tests for: $(cat src/main.py)"
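
`$(cat src/main.py)` expands to an empty string if the file is missing, and a large file may not fit a small context window such as the 4k one listed under Available Models. A small guard, sketched under the assumption that a byte limit is a good-enough proxy for context size; `build_test_prompt` and the 12000-byte default are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the test-generation prompt, refusing unreadable
# or oversized files. The default limit is an arbitrary illustration.
build_test_prompt() {
  local file=$1 max_bytes=${2:-12000} size
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 1
  fi
  size=$(( $(wc -c < "$file") ))   # arithmetic expansion drops any padding from wc
  if [ "$size" -gt "$max_bytes" ]; then
    echo "$file is $size bytes, over the ${max_bytes}-byte limit" >&2
    return 2
  fi
  printf 'Write pytest tests for:\n%s\n' "$(cat "$file")"
}

# Example (not run here):
#   prompt=$(build_test_prompt src/main.py) &&
#     ~/.claude/skills/gx10-offload/scripts/offload.sh "$prompt"
```
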

Device Status Check

~/.claude/skills/gx10-offload/scripts/offload.sh --status

Ensure Ollama is Running

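# Start the server only if it is not already running. The subshell and the
# redirected stdin/stdout/stderr detach it so the ssh session can return.
sshpass -p 'aaaaaa' ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no \
  a@100.67.53.87 'pgrep ollama || (nohup ollama serve > /tmp/ollama.log 2>&1 < /dev/null &)'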

Hardware

GPU: NVIDIA GB10 Blackwell (DGX Spark)
Memory: 128GB unified (Grace-Blackwell architecture)
CPU: 20-core Grace ARM64
OS: Ubuntu 24.04 aarch64, kernel 6.14-nvidia
PyTorch: 2.10.0 with CUDA
Disk: 510GB free

GF(3) Assignment

Trit	Role	Description
+1	PLUS	Generator - produces code/text offloaded from Claude

Conservation triad: gx10-offload (+1) + tailscale (0) + skill-creator (-1) = 0
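
The triad balances because the three trits sum to zero modulo 3. As a quick arithmetic check, a sketch with a hypothetical `gf3_sum` helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum trits and reduce the result into {0, 1, 2}.
gf3_sum() {
  local sum=0 t
  for t in "$@"; do
    sum=$((sum + t))
  done
  echo $(( (sum % 3 + 3) % 3 ))   # the extra +3 keeps negative sums non-negative
}

gf3_sum 1 0 -1   # gx10-offload, tailscale, skill-creator; prints 0
```
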