name: peer-comms
description: Use when about to send work to a peer machine (rtxpro6000server / gpuserver1) or check what one is doing. Loads the OAuth-ceremony, tmux protocol, send-keys+C-m, file-not-heredoc rule, phone-home convention, and per-peer quirks. Trigger when typing or thinking "ssh alton@192.168.1.{100,157}" for substantive work; not needed for quick read-only one-shots.
peer-comms — talking to peer-machine Claudes
The protocol that makes cross-machine work auditable instead of ad-hoc. Each peer (rtxpro6000server, gpuserver1) runs its own local Claude in a persistent tmux session. You drive from Rocinante. Substrate is shared git, not shared filesystem.
Default path (2026-05-22 onward): use /peer-send or scripts/peer-send.py
For routine sends, do NOT walk through the manual ceremony below. Use:
- Slash command (interactive): /peer-send rtxserver "your message here"
- Script (composable from Bash tool / cron / agents): python scripts/peer-send.py rtxserver "your message here" or --from-file <path>
The script handles preflight + OAuth refresh + tmux recovery + light/heavy path routing + C-m submit + capture-pane ack + JSONL audit log. Exit code 0 on visible-processing ack. See design at sartor/memory/projects/peer-comms-streamlining-2026-05-22.md.
The manual ceremony below is documented for debugging when the script fails and for understanding the constitutional structure (peers are NOT subagents — Constitution §14). Don't replicate it by hand unless you're debugging.
When this applies
Invoke when you are about to:
- Send a directive (>5 lines) to a peer Claude
- Spin up or restart a peer tmux session
- Resolve a phone-home (sartor/memory/inbox/<host>/PHONE-HOME-*.md)
- Recover from an OAuth expiry, peer reboot, or tmux loss
- Audit what a peer Claude has been doing
Skip for one-shot read-only queries (single nvidia-smi, df -h, etc.) — direct SSH is fine for those.
Pre-flight (ALWAYS before sending substantive work)
# 1. Reachability
ssh -o ConnectTimeout=5 alton@<peer-ip> 'hostname; uptime; date -u'
# 2. Tmux session alive
ssh alton@<peer-ip> 'tmux ls 2>&1; tmux list-windows -t claude-team-1 2>&1'
# 3. OAuth token expiry (if pushing real work)
ssh alton@<peer-ip> 'date -d @$(jq -r .claudeAiOauth.expiresAt/1000 ~/.claude/.credentials.json)'
# 4. GPU state if relevant
ssh alton@<peer-ip> 'nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used --format=csv,noheader'
# 5. Disk pressure
ssh alton@<peer-ip> 'df -h / | tail -1'
If OAuth expires within 4 hours, refresh first via scp ~/.claude/.credentials.json alton@<peer-ip>:~/.claude/.credentials.json && ssh alton@<peer-ip> 'chmod 600 ~/.claude/.credentials.json'. The tmux Claude session will pick up the refreshed token on its next request.
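The 4-hour rule above is simple arithmetic on the expiresAt value, which pre-flight step 3 treats as milliseconds since the epoch. A minimal sketch of that check follows. The function name oauth_needs_refresh and its argument order are illustrative, not part of scripts/peer-send.py.

```shell
# Sketch: succeed (exit 0) when the token expires inside the refresh window.
#   $1  raw .claudeAiOauth.expiresAt value, in milliseconds since the epoch
#   $2  current time in seconds since the epoch
#   $3  window in hours (optional; defaults to the 4 hours stated above)
oauth_needs_refresh() {
  expires_ms=$1
  now_s=$2
  window_h=${3:-4}
  remaining_s=$(( expires_ms / 1000 - now_s ))
  [ "$remaining_s" -lt $(( window_h * 3600 )) ]
}

# Real use (comment only, because <peer-ip> is a placeholder):
#   ms=$(ssh alton@<peer-ip> 'jq -r .claudeAiOauth.expiresAt ~/.claude/.credentials.json')
#   oauth_needs_refresh "$ms" "$(date +%s)" && echo "refresh credentials first"
```

Passing the current time as an argument rather than calling date inside the function keeps the check deterministic and easy to dry-run.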
Sending a directive — the ceremony
Never inline-heredoc a substantive prompt over SSH. Apostrophes, backticks, dollar signs, and $(...) will break the bash quoting. Always go through a file.
# 1. Write the directive locally with the Write tool
# Path: C:\Users\alto8\AppData\Local\Temp\<short-name>.txt
# 2. SCP to peer
scp /c/Users/alto8/AppData/Local/Temp/<short-name>.txt alton@<peer-ip>:/tmp/<short-name>.txt
# 3. Paste into the work pane
ssh alton@<peer-ip> 'tmux send-keys -t claude-team-1:0 "$(cat /tmp/<short-name>.txt)"'
# 4. Submit (CRITICAL — Enter is literal text in send-keys; C-m is the carriage return)
ssh alton@<peer-ip> 'tmux send-keys -t claude-team-1:0 C-m'
# 5. Verify it landed
sleep 4
ssh alton@<peer-ip> 'tmux capture-pane -t claude-team-1:0 -p | tail -15'
You should see the orchestrator transition to "Forming…" / "Photosynthesizing…" / "Cogitating…" or otherwise be processing.
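The verification in step 5 can be made mechanical by matching the captured pane text against the status words quoted above. This is a rough sketch: the helper name pane_is_processing is hypothetical, and the word list covers only the three examples given here, so a peer that is processing under a different status word would be reported as idle.

```shell
# Sketch: read captured pane text on stdin and succeed if it contains one of
# the status words quoted in this section. The word list is an assumption and
# will need extending as other status words are observed.
pane_is_processing() {
  grep -Eq 'Forming|Photosynthesizing|Cogitating'
}

# Real use (comment only, because <peer-ip> is a placeholder):
#   ssh alton@<peer-ip> 'tmux capture-pane -t claude-team-1:0 -p | tail -15' \
#     | pane_is_processing || echo "no ack visible; re-check the pane by hand"
```

A directive whose own text contains one of these words would produce a false positive, so treat a match as a hint rather than proof.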
What every directive must contain
Per Constitution §14 + the discipline that has worked across this evening's Cato/persona-engineering chain:
- Context line — what's blocked, what's orthogonal, what to ignore
- Goal — one sentence
- Phases with explicit verify gates between them
- Decision rule for forks (e.g., "if Cato GREENLIGHTS, fire; if REVISE, phone home")
- Phone-home triggers — explicit list of conditions where peer should stop and write to inbox/<host>/PHONE-HOME-<topic>.md
- Budget — wall-clock + token cap
- What to commit and where
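Before copying a directive to a peer, the checklist above can be checked against the local file. The sketch below assumes each element is introduced by its label from the list (for example a line starting "Budget:"). That labelling convention is hypothetical, as is the function name directive_missing_elements.

```shell
# Sketch: print the label of every required element that does not appear in
# the directive file; print nothing when all seven are present.
# Matching is a case-insensitive fixed-string search, so it only proves the
# label appears somewhere, not that the section is well written.
directive_missing_elements() {
  file=$1
  for label in "Context" "Goal" "Phases" "Decision rule" \
               "Phone-home triggers" "Budget" "What to commit"; do
    grep -qiF -- "$label" "$file" || printf '%s\n' "$label"
  done
}
```

An empty result is a necessary condition for sending, not a sufficient one: the verify gates and decision rules still need a human or Cato read.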
Reading peer state
# Tmux pane (the live work)
ssh alton@<peer-ip> 'tmux capture-pane -t claude-team-1:0 -p | tail -50'
# Recent peer-side commits (post-2026-05-02: peers push to rtxserver bare;
# read either via the bare or the peer's working tree)
git fetch origin # origin = rtxserver bare; will pull commits from any peer
git log origin/main --oneline -10
# Optional: read directly from a peer's working tree if you suspect the bare hasn't received yet
git fetch <peer-remote-name> # rtxserver / gpuserver1 working-tree remotes still configured
git log <peer>/main --oneline -10
# Phone-home inbox
ls sartor/memory/inbox/<host>/
# Or via SSH if not yet pulled
ssh alton@<peer-ip> 'ls ~/Sartor-claude-network/sartor/memory/inbox/<host>/ 2>&1'
Failure-mode table
| Symptom | Diagnosis | Fix |
|---|---|---|
| error connecting to /tmp/tmux-*/default | tmux server died (peer rebooted, oom-killer, manual kill) | rtxserver (since 2026-05-02): systemd auto-respawns the session — just run ssh alton@192.168.1.157 'systemctl --user start sartor-claude-peer.service' to force a respawn, OR wait for the next reboot which auto-starts. gpuserver1 (no systemd unit yet): manual — ssh alton@<peer> 'tmux new-session -d -s claude-team-1 -x 200 -y 50 "cd ~/Sartor-claude-network && claude --dangerously-skip-permissions"', wait 6s, capture-pane to confirm Claude is up |
| Please run /login · API Error: 401 | OAuth token expired | Run powershell.exe -ExecutionPolicy Bypass -File C:\Users\alto8\Sartor-claude-network\scripts\win-tasks\sartor-creds-sync.ps1 from Rocinante to push fresh creds to all peers, then restart Claude on the peer (systemctl --user restart sartor-claude-peer.service on rtxserver, or kill+new-session on gpuserver1). The 4h scheduled task usually keeps tokens fresh, so this manual step is rare. |
| tmux capture-pane returns empty | Pane too narrow, OR session just started, OR captured before output flushed | Re-create session with -x 200 -y 50, sleep 3-5s after submit before capturing |
| send-keys "Enter" appears literally in pane | Used Enter instead of C-m | Send C-m separately (always two send-keys calls — one for text, one for C-m) |
| Bash heredoc EOF error sending directive | Apostrophes/backticks in content broke quoting | Use Write + SCP + $(cat /tmp/...) instead of inline heredoc |
| Hook errors python: command not found (rtxserver) | Peer's hooks reference python which Ubuntu calls python3 | Non-blocking on rtxserver; peer Claude reports them as warnings. Filter from output, don't escalate |
| Peer pane shows queued messages but Claude isn't responding | Claude is mid-tool-call; messages get queued | Wait or esc to interrupt if truly stuck (rare; default is wait) |
| Compute work fired without expected Cato review | Discipline failure — protocol bypassed | Stop work, file in inbox, re-run Cato pattern from canonical state |
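The text-matchable rows of the table above can be turned into a first-pass triage helper. This is a sketch only: the function name diagnose_pane_error and the short labels it prints are hypothetical, and rows that need judgment (empty capture, queued messages, skipped Cato review) are deliberately left as "unknown".

```shell
# Sketch: map an error string seen over SSH or in a captured pane to a short
# label for the matching table row. Labels are illustrative, not a standard.
diagnose_pane_error() {
  case $1 in
    *"error connecting to /tmp/tmux-"*)       echo "tmux-server-dead" ;;
    *"Please run /login"*|*"API Error: 401"*) echo "oauth-expired" ;;
    *"python: command not found"*)            echo "hook-warning-ignore" ;;
    *)                                        echo "unknown" ;;
  esac
}
```

A result of "unknown" means falling back to reading the pane by hand, as the table describes.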
Per-peer quirks
rtxpro6000server (192.168.1.157)
- Hardware: dual RTX PRO 6000 Blackwell (96GB each, slots 3+7), Threadripper PRO 7975WX, ASUS WRX90E-SAGE SE, Noctua NH-U14S TR5-SP6 (air, zero TDP headroom on 7975WX), be quiet! 1600W PSU
- Wall outlet: 120V/15A — 1400W continuous ceiling. Production cap 450W/card (set 2026-05-02 after thermal stress sequence; was 475W previously). 500W will tag the breaker and approach 88°C abort threshold on GPU0.
- Power-cap persistence: systemd unit nvidia-power-cap.service re-applies nvidia-smi -pl 450 on boot, ordered before docker.service. Do not manually run nvidia-smi -pl 600 and walk away — the systemd unit only fires on boot, not periodically. If you change pl manually for a test, revert before any rental container starts.
- Peer Claude auto-respawn: user-level systemd unit sartor-claude-peer.service spawns tmux new-session ... claude --dangerously-skip-permissions at boot. Lingering enabled for alton. To restart manually: systemctl --user restart sartor-claude-peer.service. No need to run tmux new-session by hand on rtxserver.
- BMC fan curves saved (persistent in firmware, applied 2026-05-02 via Chrome MCP): Zones 2-6 set to 30°C/50% → 50°C/75% → 60°C/90% → 70°C/100%. Confirmed Phase B safe at 475W × 2 for 5 min (Tctl peak 65°C). Less aggressive curves were the historical Tctl bottleneck.
- GPU asymmetry: GPU0 (slot 3) runs ~11°C hotter than GPU1 (slot 7) under same load. Slot 3 is the hot slot.
- CPU thermal coupling: Noctua intake warms ~48°C from GPU exhaust ambient alone in single-card mode (only PCIE03 fan zone engaged). Dual-card mode breaks the recirculation loop — both PCIE07+PCIE03 fans run hard. Tctl is healthier under dual-card load than single-card load.
- Front-fan PWM-cord override: physical PWM cord from front-flower fans to motherboard can be unplugged + remote control set to MAX, giving hardware-100% airflow regardless of BMC PWM-scaling cap on CHA_FAN2/3 (which is stuck at 71% nameplate even at commanded 100%).
- Working dir: ~/Sartor-claude-network
- Venv: ~/ml/bin/activate (torch 2.10+cu128)
- No GitHub credentials — peer commits locally; Rocinante fetches via rtxserver remote
- vast.ai onboarding state: PAUSED 2026-05-02 pending network topology pivot. Resume from inbox/rtxpro6000server/RESUME-vastai-onboarding-2026-05-02.md. Don't fire vastai list machine until port-forward path is decided.
- No vast.ai rental — clean GPU access for research
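Because the power-cap unit described in the list above only fires on boot, a manual -pl change stays in effect until someone reverts it. One way to catch a forgotten revert is to compare each card's current limit against the 450 W production cap. The sketch below assumes the limits arrive one per line, as produced by nvidia-smi's power.limit query with nounits. The helper name power_caps_within is hypothetical.

```shell
# Sketch: read one power limit per line on stdin (e.g. "450.00") and succeed
# only if every value is at or below the given cap in watts.
power_caps_within() {
  max_w=$1
  awk -v max="$max_w" 'BEGIN { bad = 0 } $1 + 0 > max + 0 { bad = 1 } END { exit bad }'
}

# Real use (comment only):
#   ssh alton@192.168.1.157 \
#     'nvidia-smi --query-gpu=power.limit --format=csv,noheader,nounits' \
#     | power_caps_within 450 || echo "power cap above 450 W; revert before proceeding"
```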
gpuserver1 (192.168.1.100)
- Hardware: single RTX 5090, i9-14900K, 128GB DDR5, MSI MAG Coreliquid A13 240 (AIO — pump should always be 100%)
- Wall outlet: 1200W PSU, dramatically over-provisioned for current workload
- Active vast.ai rental (container C.34113802 through 2026-08-24). NEVER touch the rental container. No docker exec, no GPU reset. Read-only inspection only.
- Per business/rental-policy.md: host-CPU work during rental is allowed if load average stays under ~3 on 32-thread i9
- Working dir: ~/Sartor-claude-network
- No GitHub credentials — same git flow as rtxserver
- DMZ host — all external traffic forwards here
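The load-average condition in the gpuserver1 list above can be checked before starting host-CPU work. This sketch reads a /proc/loadavg-style line and compares the first field (the 1-minute average) with a threshold of 3. The 1-minute choice is an assumption, since business/rental-policy.md may intend a different window, and the function name load_allows_cpu_work is hypothetical.

```shell
# Sketch: succeed when the 1-minute load average is below the threshold.
#   $1  a /proc/loadavg-style line, e.g. "1.52 1.80 2.10 2/1200 12345"
#   $2  threshold (optional; defaults to the ~3 quoted above)
load_allows_cpu_work() {
  printf '%s\n' "$1" | awk -v max="${2:-3}" '{ exit !($1 + 0 < max + 0) }'
}

# Real use (comment only):
#   load_allows_cpu_work "$(ssh alton@192.168.1.100 'cat /proc/loadavg')" \
#     || echo "load too high; defer host-CPU work"
```

This is a read-only inspection of the host and does not touch the rental container.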
Git sync pattern
As of 2026-05-02, the canonical write target is the bare repo on rtxserver: alton@192.168.1.157:/home/alton/sartor-git/Sartor-claude-network.git. Every peer's origin should point there. GitHub is a 15-min-lag DR mirror written exclusively by Rocinante's "Sartor Memory Mirror" Windows Scheduled Task. Full architecture in sartor/memory/reference_memory_server.md — read it before changing this section.
# Peer commits AND pushes (peers no longer wait for Rocinante to drain)
ssh alton@<peer-ip> 'cd ~/Sartor-claude-network && git status --short && git fetch origin && git log --oneline -3'
# After peer pushes to rtxserver bare, Rocinante syncs:
git fetch origin
git merge --ff-only origin/main # or rebase if you have local work to land
# If a peer's origin still points at GitHub (pre-2026-05-02 onboarding), fall back to the
# old peer-working-tree fetch pattern until that peer's origin is repointed:
git fetch <peer-remote-name> # rtxserver / gpuserver1 working-tree remotes
git merge <peer>/main --no-edit
git push origin main # NB: origin = rtxserver bare; mirror handles GitHub
Do not push to the github remote from any peer, including Rocinante outside the mirror script. The "Sartor Memory Mirror" task pushes only main (not --mirror) to preserve claude.ai cloud-agent branches. Manual GitHub pushes from this thread risk both auth issues (no creds on peers) and overwriting cloud-agent branches.
If conflicts on persona-engineering files (CATO-PROSECUTION-*, PASSOFF-*, experiments/*), git checkout --theirs is usually right because the peer has been iterating those through Cato cycles. For machines/
Per-peer migration status (2026-05-02)
| Peer | origin URL | Pushes work? |
|---|---|---|
| Rocinante | alton@192.168.1.157:/home/alton/sartor-git/Sartor-claude-network.git | ✅ migrated |
| rtxserver-self peer Claude | alton@192.168.1.157:/home/alton/Sartor-claude-network (working tree, pre-migration) | ⏳ next session |
| gpuserver1 peer Claude | likely GitHub HTTPS (pre-migration) | ⏳ next session |
| Aneeta laptop (future) | onboard directly to rtxserver bare per [[projects/aneeta-peer-setup]] | n/a yet |
When you sit down with a peer Claude session, the migration is one-shot:
ssh alton@<peer-ip>
cd ~/Sartor-claude-network
git remote set-url origin alton@192.168.1.157:/home/alton/sartor-git/Sartor-claude-network.git
git fetch origin
git push origin main # should be a no-op if peer was up-to-date
Add the peer's SSH pubkey to rtxserver's ~/.ssh/authorized_keys first (rtxserver's own peer can use a local file:// URL or just use its SSH pair to itself).
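To confirm which state a peer is in before or after the migration, its origin URL can be classified against the three cases in the status table. A minimal sketch, with a hypothetical function name (classify_origin) and hypothetical labels; it matches on the bare repo's path suffix, so a local file:// URL pointing at the same path also counts as migrated.

```shell
# Sketch: classify an origin URL as the rtxserver bare repo, a GitHub remote,
# or anything else (for example a peer working tree).
classify_origin() {
  case $1 in
    *"/sartor-git/Sartor-claude-network.git") echo "bare" ;;
    *github.com*)                             echo "github" ;;
    *)                                        echo "other" ;;
  esac
}

# Real use (comment only), run inside ~/Sartor-claude-network on the peer:
#   classify_origin "$(git remote get-url origin)"
```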
Phone-home flow (when you receive one)
A peer writing inbox/<host>/PHONE-HOME-<topic>.md means it stopped voluntarily because something tripped a pre-registered trigger. Read the file. The peer's recommendation is in the closing section.
Decision options usually fall into:
- (a) accept-and-revise — the peer's analysis stands, fold findings into the next plan
- (b) re-run with fix — apply small fix, retry the same protocol
- (c) strict halt + outside review — spawn fresh Cato or other adversarial pass
Surface the file, present a/b/c with your read, get Alton's call, then send the resolution back to the peer via the directive ceremony above.
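Finding out whether any phone-home is waiting is a matter of listing files that match the naming convention above. The sketch below does that for one inbox directory. The function name pending_phone_homes is hypothetical, and it only lists files: reading the peer's recommendation and choosing between a/b/c stays manual.

```shell
# Sketch: print each PHONE-HOME-*.md file in the given inbox directory,
# one path per line; print nothing if there are none.
pending_phone_homes() {
  inbox_dir=$1
  for f in "$inbox_dir"/PHONE-HOME-*.md; do
    # An unmatched glob stays literal, so test that the file really exists.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Real use (comment only, because <host> is a placeholder):
#   pending_phone_homes "sartor/memory/inbox/<host>"
```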
What NOT to do
- Don't skip the pre-flight to save time. The 30-second OAuth+tmux check has prevented multiple compounding failures (tokens expiring mid-build, tmux-already-dead, etc.).
- Don't inline-heredoc directives over SSH — every previous attempt has eventually broken on quoting.
- Don't use tmux send-keys ... Enter — only C-m submits.
- Don't drive substantive work via raw SSH bypassing the local Claude unless you have a specific reason (characterization tests, recovery, peer-Claude unavailable). The local Claude builds situated memory; bypassing it is an audit-trail loss.
- Don't git push to the github remote from a peer: it has no GitHub credentials. Since 2026-05-02, peers push to the rtxserver bare (origin) and the mirror task handles GitHub; a peer whose origin has not been repointed yet commits locally and Rocinante drains.
- Don't touch gpuserver1's vast.ai rental container under any circumstances.
Canonical references
- .claude/agents/peer-coordinator.md — full agent definition (use for backgrounded long-running peer monitoring; this skill is for inline cross-machine work)
- sartor/memory/reference/HOUSEHOLD-CONSTITUTION.md §14 — why each machine has a local Claude
- sartor/memory/reference/OPERATING-AGREEMENT.md — Rocinante↔gpuserver1 operating agreement (extends to rtxserver)
- sartor/memory/business/rental-policy.md — what's allowed on gpuserver1 during active rental
- sartor/memory/machines/MACHINES.md — fleet index with peer-coordinator quirks
- sartor/memory/machines/<host>/HARDWARE.md — per-peer hardware ground truth
History
- 2026-04-27: Created from .claude/agents/peer-coordinator.md to lower friction. The agent was being skipped because spawning it added overhead vs. inline SSH; the skill loads the protocol into main thread instead. Triggered by Alton noting "I drove this myself" during the thermal-baseline test where the orchestrator wasn't involved.