name: monitor-services
description: "Health check and auto-restart all Pollinations GPU services (Flux/Z-Image on RunPod, LTX-2 on GH200, Klein on RunPod, legacy image on OVH, Sana on GH200). Use with /loop for recurring checks."
Monitor Services
Check health of all Pollinations GPU backend services and auto-restart if unhealthy.
Quick Start
/loop 30m /monitor-services
Or run once: /monitor-services
Services
1. LTX-2.3 Video (GH200 - Lambda Labs)
| Property | Value |
|---|---|
| Host | 192.222.51.105 |
| Port | 8765 |
| Provider | Lambda Labs (NVIDIA GH200) |
| SSH | ssh -i <SSH_LAMBDA_SANA_LTX2_ACESTEP from SOPS> ubuntu@192.222.51.105 |
Health check:
curl -s --connect-timeout 5 --max-time 10 http://192.222.51.105:8765/health
Expected: {"status":"healthy","model":"ltx-2-comfyui"}
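When scripting this check, comparing the response body against the expected JSON is more robust than eyeballing it. A minimal sketch, assuming the body has already been fetched with the curl command above; the helper name health_matches is hypothetical and not part of the service code:

```python
import json


def health_matches(body: str, expected: dict) -> bool:
    """Return True when body is a JSON object containing every expected key/value."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Empty body, HTML error page, proxy timeout text, etc.
        return False
    if not isinstance(payload, dict):
        return False
    # Subset match: extra fields in the response do not count as a failure.
    return all(payload.get(key) == value for key, value in expected.items())


# Expected payload taken from the line above.
LTX2_EXPECTED = {"status": "healthy", "model": "ltx-2-comfyui"}

if __name__ == "__main__":
    sample = '{"status":"healthy","model":"ltx-2-comfyui"}'
    print(health_matches(sample, LTX2_EXPECTED))
```

The same helper works for the other /health endpoints in this document by swapping the expected dict.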
E2E test (through prod):
curl -s -o /dev/null -w "HTTP %{http_code}, Size: %{size_download}, Time: %{time_total}s" \
--max-time 120 \
"https://gen.pollinations.ai/video/health_check_$(date +%s)?model=ltx-2&duration=3" \
-H "Authorization: Bearer $TEST_TOKEN"
Expected: HTTP 200, ~500-800KB, ~11-13s
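The curl -w summary line can be parsed and judged automatically. A rough sketch: the regex mirrors the -w format string above, while the min_bytes floor is an assumption chosen well below the ~500KB expectation, and max_seconds simply reuses the --max-time value. Both function names are hypothetical.

```python
import re

# Mirrors: "HTTP %{http_code}, Size: %{size_download}, Time: %{time_total}s"
SUMMARY_RE = re.compile(r"HTTP (\d{3}), Size: (\d+), Time: ([\d.]+)s")


def parse_curl_summary(line: str):
    """Return (status, size_bytes, seconds), or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def e2e_ok(line: str, min_bytes: int = 100_000, max_seconds: float = 120.0) -> bool:
    """Judge an e2e run. The thresholds are assumptions, not documented limits."""
    parsed = parse_curl_summary(line)
    if parsed is None:
        return False
    status, size, elapsed = parsed
    return status == 200 and size >= min_bytes and elapsed <= max_seconds


if __name__ == "__main__":
    print(e2e_ok("HTTP 200, Size: 682000, Time: 11.3s"))
```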
Restart:
ssh -i <SOPS:SSH_LAMBDA_SANA_LTX2_ACESTEP> ubuntu@192.222.51.105 "bash /home/ubuntu/start_ltx2.sh"
Wait ~60s after restart, then re-check health.
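As an alternative to a fixed sleep, the re-check can poll until the service reports healthy or a deadline passes. This is only a sketch: wait_until_healthy and its defaults are hypothetical, and check stands for whatever callable wraps the health curl.

```python
import time


def wait_until_healthy(check, timeout_s=60.0, interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Fake check that becomes healthy on the third attempt.
    attempts = iter([False, False, True])
    print(wait_until_healthy(lambda: next(attempts), sleep=lambda s: None))
```

Polling returns as soon as the service is up, while still capping the total wait at roughly the ~60s the restart is expected to need.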
2. ACE-Step Music (GH200 - same host as LTX-2)
| Property | Value |
|---|---|
| Host | 192.222.51.105 |
| Port | 8189 |
| SSH | ssh -i <SSH_LAMBDA_SANA_LTX2_ACESTEP from SOPS> ubuntu@192.222.51.105 |
| Systemd | acestep.service |
| Auth | ACESTEP_API_KEY env var (Bearer token) |
Health check:
curl -s --connect-timeout 5 --max-time 10 http://192.222.51.105:8189/health
Expected: {"status":"ok","models_initialized":true}
Restart:
ssh -i <SOPS:SSH_LAMBDA_SANA_LTX2_ACESTEP> ubuntu@192.222.51.105 "sudo systemctl restart acestep"
Wait ~50s for model initialization, then re-check health.
Notes:
- Token auth via Authorization: Bearer <token>; the token is stored in encrypted secrets as PLN_GPU_TOKEN
- Server-side token set via ACESTEP_API_KEY env var in systemd unit
- Runs on port 8189 (port 8188 is ComfyUI/LTX-2)
3. Legacy Image Service (OVH)
| Property | Value |
|---|---|
| Host | 57.130.31.42 |
| Port | 16384 |
| SSH | ssh -i ~/.ssh/id_rsa_ovh ubuntu@57.130.31.42 |
| Branch | master (separate from main) |
Health check:
ssh -i ~/.ssh/id_rsa_ovh -o ConnectTimeout=5 ubuntu@57.130.31.42 "systemctl is-active image-pollinations"
Expected: active
Restart:
ssh -i ~/.ssh/id_rsa_ovh ubuntu@57.130.31.42 "sudo systemctl restart image-pollinations"
4. Klein 4B (RunPod Pod)
| Property | Value |
|---|---|
| Pod ID | jmrbmje2fyuy46 (current — pod ID changes if recreated) |
| Host | <pod-id>-8000.proxy.runpod.net |
| Port | 8000 |
| Provider | RunPod (RTX A5000 secure cloud, $0.27/hr via API) |
| SSH | Full SSH with SSH_RUNPOD_KLEIN from SOPS; get current public host/port from RunPod runtime ports |
| Auth | x-backend-token header with PLN_GPU_TOKEN |
| Config | KLEIN_URL in gen.pollinations.ai/secrets/prod.vars.json (sops); fallback in gen.pollinations.ai/src/image/models/fluxKleinModel.ts |
Health check:
curl -s --connect-timeout 5 --max-time 10 https://jmrbmje2fyuy46-8000.proxy.runpod.net/health
Expected: {"status":"ok","model":"black-forest-labs/FLUX.2-klein-4B"}
Restart (in-pod):
# Open SSH from dashboard, then:
bash /workspace/restart.sh
Wait ~30s for model load, then re-check health.
Recovery from RunPod host outage:
Symptom: dashboard banner "This server has recently suffered a network outage"; control plane reports RUNNING but HTTPS proxy / SSH / ICMP all unreachable. Restart/reset reschedules onto the same broken host. Recreate on a different host:
- Create cheap replacement capacity first: prefer secure A5000 at <= $0.30/hr, fallback to community 3090 at <= $0.23/hr. Reject any assigned host with machine.note or maintenanceNote mentioning an outage. Do not use a 4090 unless the higher cost is explicitly accepted.
- Use runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404, 8000/http,22/tcp, 20GB container disk, 100GB volume, and /workspace.
- Get SSH runtime port from RunPod GraphQL and connect with SSH_RUNPOD_KLEIN:
    sops -d enter.pollinations.ai/secrets/prod.vars.json | jq -r .SSH_RUNPOD_KLEIN > /tmp/klein-key
    chmod 600 /tmp/klein-key
    ssh -i /tmp/klein-key -p <publicPort> root@<publicIp>
- Copy image.pollinations.ai/klein-runpod/handler.py and requirements.txt to /workspace.
- Install runtime packages without replacing the base CUDA torch:
    python -m venv --system-site-packages /workspace/venv
    /workspace/venv/bin/python -m pip install --upgrade pip
    /workspace/venv/bin/python -m pip install --no-cache-dir -r /workspace/requirements.txt
- Put PLN_GPU_TOKEN in /root/.gpu_token (not /workspace, which may ignore Unix mode bits) and create /workspace/restart.sh:
    export HF_HUB_CACHE=/workspace/hf-cache
    export HF_XET_HIGH_PERFORMANCE=1
    export PLN_GPU_TOKEN="$(cat /root/.gpu_token)"
    pkill -f "/workspace/handler.py" 2>/dev/null || true
    nohup /workspace/venv/bin/python -u /workspace/handler.py > /workspace/klein.log 2>&1 &
    echo $! > /workspace/klein.pid
- Run bash /workspace/restart.sh; first startup downloads ~15-24GB of model files and should end with Model loaded and ready!.
- Update KLEIN_URL in gen.pollinations.ai/secrets/prod.vars.json and the hardcoded fallback in gen.pollinations.ai/src/image/models/fluxKleinModel.ts.
- Push the production Worker secret:
    tmp=$(mktemp)
    sops -d gen.pollinations.ai/secrets/prod.vars.json > "$tmp"
    (cd gen.pollinations.ai && node scripts/push-generation-secrets.mjs "$tmp" production)
    rm -f "$tmp"
- Verify direct pod /health, direct authenticated /generate, and production gen.pollinations.ai/image/...model=klein, remove /tmp/klein-key, then terminate the old outage pod.
Note: the pod uses a generic runpod/pytorch image; handler.py and restart.sh live on the pod volume only (not baked into a Docker image despite image.pollinations.ai/klein-runpod/Dockerfile). The pod volume is destroyed on terminate.
5. Z-Image Workers (RunPod, multiple single-GPU pods)
⚠️ Pod IDs, hostnames, SSH ports, and the worker count all change over time: DISCOVER them live, never trust hardcoded values here. As of 2026-06-16, zimage runs as 3 separate single-GPU pods (one 4090 + two 3090s), each serving on port 8767. The old hsl3ksl31lvrcc 4x-4090 pod is gone. Flux runs on its own worker(s) elsewhere; it is healthy and unrelated to zimage.
Step 1 — discover what's actually deployed (the source of truth):
# (a) Who is registered + heartbeating (= actually taking prod traffic):
curl -s --max-time 10 https://gen.pollinations.ai/register | python3 -c "import sys,json,time; [print(f\"{w['type']:<7} {w['url']} hb={(time.time()*1000-w['lastHeartbeat'])/1000:.0f}s\") for w in json.load(sys.stdin)]"
# (b) ALL RunPod pods + their GPU util (catch idle/dead pods NOT in the registry):
RUNPOD_TOKEN=$(cat ~/.runpod/config.toml | grep apikey | cut -d\' -f2)
runpodctl pod list
curl -s -X POST "https://api.runpod.io/graphql?api_key=$RUNPOD_TOKEN" -H "Content-Type: application/json" \
-d '{"query":"{myself{pods{id name desiredStatus machine{gpuDisplayName} runtime{gpus{gpuUtilPercent}}}}}"}' \
| python3 -c "import sys,json; [print(p['name'], p['id'], p['desiredStatus'], (p.get('runtime') or {}).get('gpus')) for p in json.load(sys.stdin)['data']['myself']['pods']]"
Key failure mode (this caused the 2026-06-16 incident): a pod can be RUNNING + costing money but 0% util and absent from /register because its server.py died or its GPU fell off the bus. All traffic then piles onto the remaining worker(s) → 524 timeouts + 100% util on the survivor. Always cross-check (a) vs (b): every RUNNING zimage pod should appear in the registry.
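The (a) vs (b) cross-check can be automated once both JSON responses are parsed. The sketch below rests on two loud assumptions that the source does not confirm: that a worker's registered url contains its RunPod pod ID, and that zimage pods have "zimage" in their pod name. If either does not hold, replace the matching logic; unregistered_pods is a hypothetical helper.

```python
def unregistered_pods(pods, workers, name_filter="zimage"):
    """Return IDs of RUNNING pods whose ID appears in no registered worker URL.

    pods:    list of dicts from the RunPod GraphQL query (id, name, desiredStatus)
    workers: list of dicts from /register (type, url, lastHeartbeat)
    ASSUMPTION: registry URLs embed the pod ID; pod names contain name_filter.
    """
    registered_urls = [w.get("url", "") for w in workers]
    missing = []
    for pod in pods:
        if pod.get("desiredStatus") != "RUNNING":
            continue  # stopped pods are not expected in the registry
        if name_filter not in pod.get("name", "").lower():
            continue  # ignore pods that are not part of this worker pool
        if not any(pod["id"] in url for url in registered_urls):
            missing.append(pod["id"])
    return missing


if __name__ == "__main__":
    # Illustrative data only; the IDs and URLs are made up.
    pods = [
        {"id": "aaa111", "name": "zimage-4090", "desiredStatus": "RUNNING"},
        {"id": "bbb222", "name": "zimage-3090", "desiredStatus": "RUNNING"},
    ]
    workers = [{"type": "zimage", "url": "https://aaa111-8767.proxy.runpod.net"}]
    print(unregistered_pods(pods, workers))
```

Any ID this returns is a pod that costs money without taking traffic, which is the failure mode described above.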
SSH key: the working key is SSH_RUNPOD_KLEIN (the documented SSH_RUNPOD_FLUX_ZIMAGE does NOT auth against these pods). Get the rotating SSH port per pod:
sops -d enter.pollinations.ai/secrets/prod.vars.json | jq -r .SSH_RUNPOD_KLEIN > /tmp/zk; chmod 600 /tmp/zk
curl -s -X POST "https://api.runpod.io/graphql?api_key=$RUNPOD_TOKEN" -H "Content-Type: application/json" \
-d '{"query":"{pod(input:{podId:\"<POD_ID>\"}){runtime{ports{ip publicPort type}}}}"}' \
| python3 -c "import sys,json; [print(p) for p in json.load(sys.stdin)['data']['pod']['runtime']['ports'] if p['type']=='tcp']"
Diagnose a single pod (SSH in on the discovered tcp port):
ssh -i /tmp/zk -p <PORT> -o StrictHostKeyChecking=no root@<IP> \
"nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader; ps aux | grep 'z-image/server.py' | grep -v grep; ss -ltn | grep 8767"
Code lives at /root/pollinations/image.pollinations.ai/z-image/; service auto-sources env from PID 1 (port 8767, ZIMAGE_MODEL_ID, PLN_GPU_TOKEN, HF_TOKEN). Direct curl /generate returns 403 without the GPU token — a 403 still proves the server is up.
Recovery decision tree:
| Symptom | Fix |
|---|---|
| server.py not running, but nvidia-smi works | Process relaunch (cheap): ssh ... "bash /root/relaunch-zimage.sh", which kills + relaunches via /root/launch.sh and survives SSH disconnect. Verify Heartbeat sent successfully in /root/logs/zimage.log and that the pod reappears in /register. |
| nvidia-smi → Failed to initialize NVML: Unknown Error (/dev/nvidia* owned by nobody) | GPU off the bus: needs pod stop/start (NOT recoverable in-container). See below. |
| Worker in /register + heartbeating, but 524 timeouts route only to that one host (and it's slow / inference rate degrades mid-request / huge RSS from long uptime) | Stale degraded process: relaunch it, but confirm the OLD pid actually died. A heartbeat thread keeps firing even when generation hangs, so the pod looks healthy. See pkill caveat below. |
| All workers up + 100% util + still 524s | Genuine capacity deficit: add a GPU. Note 401/402 are rejected at the edge (~100ms) and do NOT load the GPU, so it's real paid demand, not abuse. |
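The table above can be encoded as a small function for scripted triage. This is a sketch only: the four boolean inputs and the order in which they are evaluated (GPU fault first, then a dead process, then a stale process, then capacity) are assumptions, and choose_fix is a hypothetical name.

```python
def choose_fix(nvml_ok: bool, server_running: bool,
               timeouts_on_one_host: bool, all_workers_saturated: bool) -> str:
    """Map observed symptoms to the recovery action from the decision tree."""
    if not nvml_ok:
        # GPU fell off the bus: not recoverable inside the container.
        return "pod stop/start"
    if not server_running:
        # nvidia-smi works but server.py is gone: cheap relaunch.
        return "process relaunch"
    if timeouts_on_one_host:
        # Heartbeat still fires, generation hangs: stale degraded process.
        return "relaunch and confirm old PID died"
    if all_workers_saturated:
        # Everything is up and busy, yet 524s persist: capacity deficit.
        return "add a GPU"
    return "no action"


if __name__ == "__main__":
    print(choose_fix(nvml_ok=True, server_running=False,
                     timeouts_on_one_host=False, all_workers_saturated=False))
```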
⚠️ relaunch-zimage.sh pkill caveat (2026-06-16): the script kills by pattern pkill -f 'z-image/server.py', but the running process command is often just python server.py (launched from within the z-image dir), so the pattern does not match and the relaunch silently no-ops. The new process then fails with address already in use and exits while the old, degraded process keeps serving. Always verify the old PID actually died (ps -o pid,etime -C python; elapsed time should reset to seconds). If it didn't: kill -9 <pid>; fuser -k 8767/tcp; bash /root/relaunch-zimage.sh.
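The "elapsed time should reset to seconds" check can be made mechanical by converting the etime column of ps, which uses the [[dd-]hh:]mm:ss format, into seconds. A minimal sketch; the 120-second cutoff for "recently relaunched" is an assumption, and both function names are hypothetical.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert a ps etime value ([[dd-]hh:]mm:ss) into seconds."""
    etime = etime.strip()
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def was_relaunched(etime: str, max_age_s: int = 120) -> bool:
    """True when the process is young enough to be the new one (cutoff assumed)."""
    return etime_to_seconds(etime) <= max_age_s


if __name__ == "__main__":
    print(was_relaunched("00:42"), was_relaunched("2-03:04:05"))
```

A process still showing days of elapsed time after the relaunch script ran is the old, degraded one.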
Pod stop/start (for NVML-broken GPU) — WIPES the container overlay disk. Code is on /root (overlay), there is no persistent /workspace volume. So back up the provision script first, then rebuild:
# 1. Back up provision script BEFORE stopping:
ssh -i /tmp/zk -p <PORT> root@<IP> "cat /root/provision-zimage.sh" > /tmp/provision-zimage.sh
# 2. Stop + start (port rotates; GPU re-attaches on fresh container):
runpodctl pod stop <POD_ID>; sleep 10; runpodctl pod start <POD_ID>
# 3. Wait for RUNNING + tcp port, confirm GPU healthy (nvidia-smi shows the card), then re-provision:
scp -i /tmp/zk -P <NEW_PORT> /tmp/provision-zimage.sh root@<IP>:/root/
ssh -i /tmp/zk -p <NEW_PORT> root@<IP> \
"export POD_ID=<POD_ID>; setsid bash /root/provision-zimage.sh > /root/provision.log 2>&1 </dev/null & disown"
# provision: re-clones branch feat/zimage-nf4-model-option, builds venv, downloads model (~9GB) + SPAN
# upscaler, writes launch.sh, starts server. ~3-6 min. Tokens come from PID1 env (auto re-injected).
# 4. Verify: server LISTENING :8767, model loaded, heartbeat sent, pod appears in /register, e2e 200 via prod.
Verify recovery end-to-end (prod success + load now shared):
TOKEN=$(grep ENTER_API_TOKEN_REMOTE enter.pollinations.ai/.testingtokens | cut -d= -f2)
for i in 1 2 3 4 5; do curl -s -o /dev/null -w "HTTP %{http_code}\n" --max-time 60 \
"https://gen.pollinations.ai/image/verify_${i}_$(date +%s%N)?model=zimage&width=512&height=512&seed=$i" \
-H "Authorization: Bearer $TOKEN"; done
# Then confirm the 524 trend dropped in Tinybird (model_health / generation_event, model_requested='zimage').
The registry is Cloudflare KV-backed (image:server:<env>:<type>:<hash>, 240s TTL); workers heartbeat to gen.pollinations.ai/register.
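Because entries expire after the 240s TTL, a worker whose last heartbeat is already old is close to dropping out of the registry. The sketch below flags such workers from the /register response, treating lastHeartbeat as epoch milliseconds as the discovery one-liner does. The warn_after_s threshold is an assumption, since the heartbeat interval is not stated here, and the function names are hypothetical.

```python
REGISTRY_TTL_S = 240  # TTL stated for the KV-backed registry


def heartbeat_age_s(worker: dict, now_ms: float) -> float:
    """Seconds since the worker's last heartbeat (lastHeartbeat is epoch ms)."""
    return (now_ms - worker["lastHeartbeat"]) / 1000.0


def at_risk_workers(workers, now_ms, warn_after_s=REGISTRY_TTL_S / 2):
    """Return URLs of workers whose heartbeat is older than warn_after_s.

    ASSUMPTION: half the TTL is a reasonable early-warning threshold.
    """
    return [w["url"] for w in workers if heartbeat_age_s(w, now_ms) > warn_after_s]


if __name__ == "__main__":
    now_ms = 1_000_000_000
    workers = [
        {"url": "worker-a", "lastHeartbeat": now_ms - 10_000},   # 10s ago
        {"url": "worker-b", "lastHeartbeat": now_ms - 200_000},  # 200s ago
    ]
    print(at_risk_workers(workers, now_ms))
```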
6. Sana Sprint 1.6B Worker (GH200 - same host as LTX-2)
One worker registers as type sana with the OVH legacy image service via heartbeat.
| Instance | GPU | Host | Port | SSH |
|---|---|---|---|---|
| Lambda GH200 | GH200 (96GB) | 192.222.51.105 | 8766 | ssh -i <SOPS:SSH_LAMBDA_SANA_LTX2_ACESTEP> ubuntu@192.222.51.105 |
Health check:
curl -s --connect-timeout 5 --max-time 10 http://192.222.51.105:8766/health
Expected: {"status":"healthy","model":"Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers"}
Sana registry check (OVH side):
ssh -i ~/.ssh/id_rsa_ovh -o ConnectTimeout=5 ubuntu@57.130.31.42 "curl -s http://localhost:16384/register"
Expected: 1 worker with 0% error rate
Restart:
ssh -i <SOPS:SSH_LAMBDA_SANA_LTX2_ACESTEP> ubuntu@192.222.51.105 "sudo systemctl restart sana"
Notes:
- GH200 generates at ~0.165s/img
- Runs alongside LTX-2 (port 8765) and ACE-Step (port 8189) on the same host
- Oracle A10/A100 instances decommissioned on 2026-04-12
- Server code: image.pollinations.ai/sana/server.py (MAX_DIM=768, MAX_PIXELS=512*512)
- Systemd service: sana.service
7. OVH Disk Space
Check:
ssh -i ~/.ssh/id_rsa_ovh -o ConnectTimeout=5 ubuntu@57.130.31.42 "df -h / | tail -1"
Alert if usage > 85%.
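The 85% rule can be applied to the df line directly. A small sketch that picks out the Use% field from the output of the command above; the function names are hypothetical and the sample line is illustrative.

```python
def disk_usage_percent(df_line: str) -> int:
    """Extract the Use% value from one line of df -h output."""
    for field in df_line.split():
        if field.endswith("%") and field[:-1].isdigit():
            return int(field[:-1])
    raise ValueError(f"no usage percentage found in: {df_line!r}")


def disk_needs_attention(df_line: str, threshold: int = 85) -> bool:
    """True when usage is strictly above the alert threshold."""
    return disk_usage_percent(df_line) > threshold


if __name__ == "__main__":
    sample = "/dev/sda1        97G   43G   54G  45% /"
    print(disk_usage_percent(sample), disk_needs_attention(sample))
```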
Fix (if full):
ssh -i ~/.ssh/id_rsa_ovh ubuntu@57.130.31.42 "sudo truncate -s 0 /var/log/syslog && sudo journalctl --vacuum-size=100M"
Procedure
When invoked, run checks in this order:
- gen.pollinations.ai registry - curl https://gen.pollinations.ai/register (KV-backed), check worker count and error rates
- Flux/Z-Image RunPod - verify every RUNNING pod found by the section 5 discovery is registered with 0% error rate (the worker count changes over time, so do not assume a fixed number)
- LTX-2 health - curl health endpoint
- LTX-2 e2e - if healthy, test through gen.pollinations.ai (use test token from .testingtokens)
- ACE-Step health - curl health endpoint on port 8189
- Klein health - curl RunPod proxy health endpoint
- Legacy image service - check systemctl status on OVH
- Sana worker - curl health on GH200 port 8766
- Sana registry - check OVH legacy registry for 1 worker with 0% errors
- Disk space - check OVH disk usage
For each:
- If healthy: report OK with latency
- If unhealthy: attempt restart, wait, re-check, report result
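The per-service flow (check, restart if unhealthy, wait, re-check, report) can be sketched as one function. All names here are hypothetical: check and restart stand for callables that wrap the curl and ssh commands documented per service, and wait_s should be set to that service's documented wait.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    service: str
    status: str       # "OK" or "FAIL"
    latency_s: float  # duration of the last health check
    notes: str = ""


def _timed(check: Callable[[], bool]):
    """Run check() and return (healthy, seconds taken)."""
    start = time.monotonic()
    healthy = check()
    return healthy, time.monotonic() - start


def run_check(service: str, check: Callable[[], bool],
              restart: Optional[Callable[[], None]] = None,
              wait_s: float = 60.0, sleep=time.sleep) -> CheckResult:
    """Check once; on failure restart, wait, and re-check a single time."""
    healthy, latency = _timed(check)
    if healthy:
        return CheckResult(service, "OK", latency)
    if restart is None:
        return CheckResult(service, "FAIL", latency, "no restart action defined")
    restart()
    sleep(wait_s)
    healthy, latency = _timed(check)
    if healthy:
        return CheckResult(service, "OK", latency, "recovered after restart")
    return CheckResult(service, "FAIL", latency, "still unhealthy after restart")


if __name__ == "__main__":
    state = {"up": False}
    result = run_check("demo", lambda: state["up"],
                       restart=lambda: state.update(up=True),
                       sleep=lambda s: None)
    print(result.status, result.notes)
```

Attempting only one restart per invocation keeps a persistently broken service from being restarted in a tight loop; the recurring /loop schedule provides the retries.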
Auth
- Test token: Read from enter.pollinations.ai/.testingtokens (ENTER_API_TOKEN_REMOTE)
- SSH keys: Stored in SOPS (enter.pollinations.ai/secrets/prod.vars.json):
  - SSH_RUNPOD_FLUX_ZIMAGE: RunPod Flux+Z-Image pod (per section 5, this key does not authenticate against the current zimage pods; use SSH_RUNPOD_KLEIN there)
  - SSH_LAMBDA_SANA_LTX2_ACESTEP: Lambda GH200 (LTX-2, ACE-Step, Sana)
  - Klein uses the RunPod relay (ssh.runpod.io) with ~/.ssh/id_ed25519; get the full command from the dashboard "Connect" tab
  - Extract: sops -d enter.pollinations.ai/secrets/prod.vars.json | jq -r '.KEY_NAME' > /tmp/key && chmod 600 /tmp/key
- OVH: ~/.ssh/id_rsa_ovh (not in SOPS)
Output
Report a brief status table:
| Service | Status | Latency | Notes |
|---------|--------|---------|-------|
| gen registry | OK | 0.1s | 4 workers, 0% errors |
| Flux RunPod (gpu0) | OK | 2.9s | hsl3ksl31lvrcc-8765 |
| Flux RunPod (gpu1) | OK | 2.9s | hsl3ksl31lvrcc-8766 |
| Z-Image RunPod (gpu2) | OK | 1.5s | hsl3ksl31lvrcc-8767 |
| Z-Image RunPod (gpu3) | OK | 1.5s | hsl3ksl31lvrcc-8768 |
| LTX-2 health | OK | 0.2s | |
| LTX-2 e2e | OK | 11.3s | 682KB |
| ACE-Step | OK | 0.1s | |
| Klein 4B | OK | 0.3s | RunPod |
| Legacy image | OK | - | active |
| Sana (GH200) | OK | 0.2s | 1.6B, ~0.165s/img |
| Sana registry | OK | - | 1 worker, 0% errors |
| OVH disk | OK | - | 45% used |
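A report in this shape can be generated from collected results. A minimal sketch, assuming each result is a (service, status, latency_seconds_or_None, notes) tuple; render_status_table is a hypothetical helper and the rows below are illustrative.

```python
def render_status_table(rows):
    """Render (service, status, latency_s, notes) tuples as a Markdown table."""
    lines = [
        "| Service | Status | Latency | Notes |",
        "|---------|--------|---------|-------|",
    ]
    for service, status, latency_s, notes in rows:
        # Checks without a measured latency (for example systemctl) show "-".
        latency = "-" if latency_s is None else f"{latency_s:.1f}s"
        lines.append(f"| {service} | {status} | {latency} | {notes} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_table([
        ("LTX-2 e2e", "OK", 11.3, "682KB"),
        ("Legacy image", "OK", None, "active"),
    ]))
```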