talon

Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.

By joelhooks, updated 6/6/2026

Talon — Infrastructure Watchdog Daemon

Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.

Quick Reference

talon validate         # Parse/validate config + services files, print summary JSON
talon --check          # Single probe cycle, print results, exit
talon --status         # Current state machine position
talon --dry-run        # Print loaded config, exit
talon --worker-only    # Supervisor only, no infra probes
talon                  # Full daemon mode (worker + probes + escalation)

Paths

| What | Where |
| --- | --- |
| Binary | ~/.local/bin/talon |
| Source | ~/Code/joelhooks/joelclaw/infra/talon/src/ |
| Config | ~/.config/talon/config.toml |
| Service monitors | ~/.joelclaw/talon/services.toml |
| Default config | ~/Code/joelhooks/joelclaw/infra/talon/config.default.toml |
| Default services template | ~/Code/joelhooks/joelclaw/infra/talon/services.default.toml |
| Voice stale cleanup | ~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh |
| State | ~/.local/state/talon/state.json |
| Probe results | ~/.local/state/talon/last-probe.json |
| Log | ~/.local/state/talon/talon.log (JSON lines, 10MB rotation) |
| Launchd plist | ~/Code/joelhooks/joelclaw/infra/launchd/com.joel.talon.plist |
| RBAC guard manifest | ~/Code/joelhooks/joelclaw/k8s/apiserver-kubelet-client-rbac.yaml |
| Worker stdout | ~/.local/log/system-bus-worker.log |
| Worker stderr | ~/.local/log/system-bus-worker.err |
| Talon launchd log | ~/.local/log/talon.err |

Build

export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon

Architecture

talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│   ├── Kill orphan on port 3111
│   ├── Spawn bun (child process)
│   ├── Signal forwarding (SIGTERM → bun)
│   ├── Health poll every 30s
│   ├── PUT sync after healthy startup
│   └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│   ├── Colima VM alive?
│   ├── Docker socket responding?
│   ├── Talos container running?
│   ├── k8s API reachable?
│   ├── Node Ready + schedulable?
│   ├── Flannel daemonset ready?
│   ├── Redis PONG?
│   ├── Inngest /health 200?
│   ├── Typesense /health ok?
│   └── Worker /api/inngest 200?
│
└── Escalation (on failure)
    ├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
    ├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
    ├── Tier 2: pi agent (approved cloud model, 10min cooldown, bounded by `agent.timeout_secs`; subprocess output uses temp files and timeout kills the whole process group so a stuck pi child cannot freeze Talon's health loop)
    ├── Tier 3: pi agent (approved secondary model fallback, same process-group timeout guard)
    └── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold, 4h repeat cooldown)
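
The crash-recovery line in the tree (exponential backoff 1s→30s) can be sketched as a delay generator. This is an illustration in Python, not Talon's Rust code; the doubling factor and the name `backoff_delays` are assumptions, since the source only gives the 1s floor and the 30s ceiling.

```python
import itertools


def backoff_delays(base_secs=1.0, cap_secs=30.0, factor=2.0):
    """Yield restart delays that grow geometrically and stop growing at cap_secs.

    The doubling factor is an assumption; the source only states 1s -> 30s.
    """
    delay = base_secs
    while True:
        yield min(delay, cap_secs)
        delay = min(delay * factor, cap_secs)


if __name__ == "__main__":
    # First seven delays a supervisor would sleep between restart attempts.
    print(list(itertools.islice(backoff_delays(), 7)))
    # -> [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A real supervisor would presumably reset the delay to the base value once the worker has stayed healthy for a while; that reset policy is not described in the source.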

State Machine

healthy → degraded (1 critical probe failure)
degraded → failed (3 consecutive failures)
failed → investigating (agent spawned)
investigating → healthy (probes pass again)
investigating → critical (agent failed to fix)
critical → sos (SOS sent via Telegram + iMessage)
any → healthy (all probes pass)
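
The transitions above can be written as a small lookup table. The sketch below is a Python illustration only; the event names (`critical_probe_failed`, `agent_spawned`, and so on) are hypothetical labels for the conditions in parentheses, not identifiers from Talon's source.

```python
# (current state, event) -> next state, mirroring the list above.
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("healthy", "critical_probe_failed"): "degraded",
    ("degraded", "three_consecutive_failures"): "failed",
    ("failed", "agent_spawned"): "investigating",
    ("investigating", "agent_failed"): "critical",
    ("critical", "sos_sent"): "sos",
}


def next_state(state, event):
    """Return the next state; unknown (state, event) pairs leave the state unchanged."""
    if event == "all_probes_pass":
        # "any -> healthy": recovery is allowed from every state.
        return "healthy"
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = "healthy"
    for event in ("critical_probe_failed", "three_consecutive_failures", "agent_spawned"):
        state = next_state(state, event)
    print(state)
```

Keeping the recovery rule outside the table makes the "any → healthy" case explicit instead of repeating it once per state.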

Probes

| Probe | Command | Critical? |
| --- | --- | --- |
| colima | colima status | Yes |
| docker | docker ps (Colima socket) | Yes |
| talos_container | docker inspect joelclaw-controlplane-1 | Yes |
| k8s_api | kubectl get nodes | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (cordon + non-control-plane NoSchedule taints; allows the normal single-node control-plane taint) | Yes |
| flannel | kubectl -n kube-system get daemonset kube-flannel -o jsonpath=... | No |
| redis | kubectl exec redis-0 -- redis-cli ping | Yes |
| kubelet_proxy_rbac | kubectl auth can-i --as=<apiserver-kubelet-client*> {get,create} nodes --subresource=proxy | Yes |
| vm:docker | ssh -F ~/.colima/_lima/colima/ssh.config lima-colima docker ps | No |
| vm:k8s_api | ssh ... python socket probe :6443 | No |
| vm:redis | ssh ... python socket probe :6379 | No |
| vm:inngest | ssh ... python socket probe :8288 | No |
| vm:typesense | ssh ... python socket probe :8108 | No |
| inngest | curl localhost:8288/health | No |
| typesense | curl localhost:8108/health | No |
| worker | curl localhost:3111/api/inngest | No |
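
The node_schedulable rule (fail on cordon or on any NoSchedule taint other than the control-plane one) might look like this when applied to a node's `spec` object. This is a sketch under assumptions: Talon's actual jsonpath expression is not shown in the source, and the function name is hypothetical. The field names (`unschedulable`, `taints`, `key`, `effect`) and the taint key are the standard Kubernetes ones.

```python
# Standard Kubernetes taint key carried by control-plane nodes.
CONTROL_PLANE_TAINT = "node-role.kubernetes.io/control-plane"


def node_schedulable(spec):
    """Return True when the node spec is neither cordoned nor blocked by a taint.

    `spec` is the parsed `.spec` of a Node object (a dict).
    """
    if spec.get("unschedulable"):
        return False  # node is cordoned
    for taint in spec.get("taints") or []:
        blocks = taint.get("effect") == "NoSchedule"
        if blocks and taint.get("key") != CONTROL_PLANE_TAINT:
            return False  # e.g. a disk-pressure taint
    return True


if __name__ == "__main__":
    single_node = {"taints": [{"key": CONTROL_PLANE_TAINT, "effect": "NoSchedule"}]}
    print(node_schedulable(single_node))
```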

Built-in critical probes must fail probes.critical_after_consecutive_failures times in a row (default 2) before escalation, so one transient probe miss does not launch heal/agent/SOS theatre. Dynamic critical probes use their own critical_after_consecutive_failures values. Non-critical probes are held to the global consecutive-failure threshold instead.
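
The debounce described above amounts to a per-probe streak counter. A minimal sketch, assuming a class name (`FailureDebouncer`) that does not come from Talon's source:

```python
class FailureDebouncer:
    """Track consecutive failures for one probe and report when to escalate."""

    def __init__(self, threshold=2):
        # Mirrors critical_after_consecutive_failures (default 2 for built-in probes).
        self.threshold = threshold
        self.streak = 0

    def record(self, passed):
        """Record one probe result; return True once the failure streak hits the threshold."""
        self.streak = 0 if passed else self.streak + 1
        return self.streak >= self.threshold


if __name__ == "__main__":
    probe = FailureDebouncer(threshold=2)
    # A single miss followed by a pass never escalates.
    print([probe.record(ok) for ok in (False, True, False, False)])
    # -> [False, False, False, True]
```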

VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.
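
The split-brain classification can be sketched as a comparison between each host probe and its VM witness. The pairing below follows the probe table; the function name and the `full-recovery` label are assumptions made for this sketch (the source names only bridge-heal and the k8s-reboot-heal.sh script).

```python
# Host probe name -> VM witness probe name, taken from the probe table.
WITNESSES = {
    "docker": "vm:docker",
    "k8s_api": "vm:k8s_api",
    "redis": "vm:redis",
    "inngest": "vm:inngest",
    "typesense": "vm:typesense",
}


def choose_heal(results):
    """Pick a first heal step from probe results (name -> True if passed).

    Returns None when no host probe failed, "bridge-heal" when a service is
    alive inside the VM but dead on localhost, otherwise "full-recovery".
    """
    failed = [name for name, ok in results.items()
              if not ok and not name.startswith("vm:")]
    if not failed:
        return None  # witness-only failures do not trigger anything here
    for name in failed:
        witness = WITNESSES.get(name)
        if witness and results.get(witness):
            return "bridge-heal"
    return "full-recovery"


if __name__ == "__main__":
    print(choose_heal({"redis": False, "vm:redis": True}))
```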

Dynamic service probes

Add probes in ~/.joelclaw/talon/services.toml without rebuilding talon:

[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5

[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5

[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5

[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
  • launchd.<name> passes when launchctl list <label> reports a non-zero PID, or when launchctl print system/<label> / launchctl print gui/$(id -u)/<label> reports state = running. This matters because com.joel.gateway is a system LaunchDaemon while Talon itself is a user LaunchAgent.
  • http.<name> passes on HTTP 200
  • script.<name> passes on exit code 0, fails on non-zero (runs via sh -c)
  • critical = true makes a failing probe trigger escalation, either immediately or after the debounce set by critical_after_consecutive_failures
  • critical_after_consecutive_failures = N debounces critical alerts for dynamic probes (default 1 = immediate)
  • http.gateway_slack uses gateway endpoint GET /health/slack, fails (503) when Slack channel is not started, and should be debounced (recommended 3 cycles)
  • Do not probe http://127.0.0.1:8081/ for voice_agent by default — root returns 503 when idle and causes false SOS noise
  • Service-heal pre-cleanup for voice_agent now clears stale uv/main.py listeners on :8081 before launchctl kickstart to avoid bind conflicts after force-cycles
  • Talon hot-reloads service probes when services.toml mtime changes (no restart required)
  • kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}') forces immediate reload
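
The script.<name> rule (run via sh -c, pass on exit code 0, bounded by timeout_secs) can be sketched as below. This is a Python illustration, not Talon's implementation. Killing the whole process group on timeout mirrors what the document describes for the Tier 2 agent; whether Talon does the same for script probes is an assumption.

```python
import os
import signal
import subprocess


def run_script_probe(command, timeout_secs=5.0):
    """Run `command` through sh -c; return True only on exit code 0 within the timeout."""
    proc = subprocess.Popen(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child leads its own process group
    )
    try:
        return proc.wait(timeout=timeout_secs) == 0
    except subprocess.TimeoutExpired:
        try:
            # Kill the shell and anything it spawned (assumed behaviour).
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already gone
        proc.wait()
        return False


if __name__ == "__main__":
    print(run_script_probe("test 1 -lt 5"))
    print(run_script_probe("sleep 5", timeout_secs=0.2))
```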

Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:

  • script.redis_aof_health — critical after 3 failures; checks aof_last_bgrewrite_status:ok to catch Redis AOF rewrite/persistence corruption.
  • script.colima_vm_uptime — critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
  • script.restate_worker_ready — critical after 3 failures; verifies the restate-worker pod reports Ready=true before workloads are trusted.
  • script.kvm_device_present — non-critical witness probe; records whether /dev/kvm is present inside Colima for nested-virt / Firecracker diagnosis.
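
Two of these checks reduce to simple predicates. The sketch below assumes the AOF probe reads the text of Redis `INFO persistence` and that the uptime probe has the VM uptime in seconds; how Talon's scripts actually obtain those values is not shown in the source, and both function names are hypothetical.

```python
def aof_rewrite_ok(info_text):
    """Return True when the INFO output contains aof_last_bgrewrite_status:ok."""
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "aof_last_bgrewrite_status":
            return value == "ok"
    return False  # field missing: treat as a failure


def vm_uptime_ok(uptime_secs, min_secs=120):
    """Return True when the VM has been up for more than min_secs (not crash-looping)."""
    return uptime_secs > min_secs


if __name__ == "__main__":
    sample = "# Persistence\r\naof_enabled:1\r\naof_last_bgrewrite_status:ok\r\n"
    print(aof_rewrite_ok(sample), vm_uptime_ok(45))
```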

Health endpoint

  • GET http://127.0.0.1:9999/health returns Talon state JSON
  • Gateway heartbeat consumes this as an additional watchdog signal
  • Configure via [health] in ~/.config/talon/config.toml

SOS channel config

  • Tier 4 sends to both Telegram and iMessage
  • Repeated SOS for the same persistent outage is throttled by sos_cooldown_secs (default 4h) to avoid paging spam after the first actionable alert.
  • Telegram fields in [escalation]:
    • sos_telegram_chat_id
    • sos_telegram_secret_name (defaults to telegram_bot_token)
  • Talon now leases Telegram tokens via secrets lease <name> --ttl ... (no --raw). If you still see curl: (3) URL rejected: Malformed input to a URL function, redeploy the latest Talon binary.
  • Agent fallback commands must stay on approved models only: openai-codex/gpt-5.5 primary and anthropic/claude-opus-4.7 secondary. Do not restore Ollama/Azure/provider drift in Talon config.
  • iMessage recipient remains sos_recipient
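
The repeat throttle governed by sos_cooldown_secs can be sketched as a timestamp comparison. The class name is an assumption, and the sketch ignores how Talon persists the last-sent time across restarts, which the source does not describe.

```python
class SosThrottle:
    """Allow one SOS, then suppress repeats until the cooldown has elapsed."""

    def __init__(self, cooldown_secs=4 * 60 * 60):
        # Default mirrors the documented 4h sos_cooldown_secs.
        self.cooldown_secs = cooldown_secs
        self.last_sent = None

    def should_send(self, now):
        """`now` is a monotonic timestamp in seconds."""
        if self.last_sent is not None and now - self.last_sent < self.cooldown_secs:
            return False
        self.last_sent = now
        return True


if __name__ == "__main__":
    throttle = SosThrottle()
    # First alert goes out, a repeat one hour later is suppressed,
    # and a repeat after four hours is sent again.
    print([throttle.should_send(t) for t in (0, 3600, 4 * 3600)])
    # -> [True, False, True]
```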

Launchd Management

Talon is active as com.joel.talon:

launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="

Reload binary/config after deploy:

launchctl kickstart -k gui/$(id -u)/com.joel.talon

Single owner for worker supervision is mandatory:

  • If com.joel.system-bus-worker is loaded, Talon auto-disables its internal worker supervisor to prevent port-3111 thrash.
  • com.joel.system-bus-worker is a system LaunchDaemon, so verify it with launchctl print system/...; launchctl list <label> only checks the current user bootstrap domain and can lie by omission.
  • Preferred end-state is Talon-only supervision, but coexistence must not cause kill/restart loops.

launchctl print system/com.joel.system-bus-worker | rg "state =|pid =|program =|last exit code ="
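
For scripting around these checks, the same four fields can be pulled out of `launchctl print` output programmatically. A sketch, assuming the `key = value` line layout that the rg pattern above matches; the sample text is illustrative and was not captured from a real machine.

```python
import re

FIELD = re.compile(r"\s*(state|pid|program|last exit code) = (.*)$")


def parse_launchctl_print(text):
    """Collect the first occurrence of each field of interest from launchctl print output."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match and match.group(1) not in fields:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def is_running(text):
    """True when the service reports state = running."""
    return parse_launchctl_print(text).get("state") == "running"


if __name__ == "__main__":
    # Illustrative sample only.
    sample = "\tstate = running\n\tprogram = /Users/joel/.local/bin/talon\n\tpid = 4242\n"
    print(parse_launchctl_print(sample), is_running(sample))
```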

Legacy services should stay disabled when fully cut over:

launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist

Troubleshooting

# Validate config + service monitor files
talon validate | python3 -m json.tool

# Check what talon sees right now
talon --check | python3 -m json.tool

# Check state machine
talon --status | python3 -m json.tool

# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null

# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool

# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool

# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err

# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'

# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start

# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh

Colima Stability Monitoring (2026-03-17)

Talon now monitors failure modes discovered during the Firecracker development incident:

| Probe | What it detects | Critical? |
| --- | --- | --- |
| script:redis_aof_health | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| script:colima_vm_uptime | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| script:restate_worker_ready | Restate worker pod not 1/1 Ready | Yes (after 3) |
| script:kvm_available | Whether /dev/kvm exists (nested virt status) | No (informational) |

Known failure chain: nestedVirtualization → cascade

nestedVirtualization ON + heavy Docker build
  → Colima VZ VM crash (silent, no crash report)
    → Docker daemon restart → Talos container killed
      → Redis mid-write → AOF corruption → crash-loop
      → Restate mid-journal → stale invocations → infinite retries
      → Lima socket forwarding broken → docker CLI dead on macOS

Talon detects each stage:

  1. colima_vm_uptime < 120s → VM just crashed
  2. redis probe fails → Redis down
  3. redis_aof_health fails → AOF corrupted (needs manual fix)
  4. restate_worker_ready fails → worker can't start (may be /dev/kvm mount or image pull)

Talon cannot auto-fix Redis AOF corruption (requires redis-check-aof --fix). It WILL escalate to the pi agent (Tier 2), which should load the k8s skill's Redis AOF Recovery procedure.

Key Design Decisions

  • Zero external deps — no tokio, no serde, no reqwest. Pure std. Keeps binary at ~444KB.
  • Compiles its own PATH — immune to launchd environment brittleness (the class of bug that caused the 6-day outage).
  • Worker is a child process — not a separate launchd service. Signal forwarding prevents orphans.
  • TOML config parsed by hand — same pattern as worker-supervisor. No dependency just for config.
  • Probes use Colima docker socket for critical host checks and add VM witness probes over Colima SSH for split-brain detection.
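
To show why hand-parsing is viable for this config, here is a sketch of a parser for the small TOML subset used in services.toml (table headers, double-quoted strings, booleans, integers). It is written in Python for illustration and is not Talon's Rust parser; it deliberately rejects anything outside that subset, and it does not handle escape sequences or inline comments.

```python
def parse_value(value):
    """Parse one scalar: true/false, a double-quoted string, or an integer."""
    if value in ("true", "false"):
        return value == "true"
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]  # no escape handling in this sketch
    return int(value)


def parse_services(text):
    """Return {table name: {key: value}} for a services.toml-style document."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = tables.setdefault(line[1:-1].strip(), {})
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep or current is None:
            raise ValueError(f"unsupported line: {raw!r}")
        current[key.strip()] = parse_value(value.strip())
    return tables


if __name__ == "__main__":
    sample = '[launchd.gateway]\nlabel = "com.joel.gateway"\ncritical = true\ntimeout_secs = 5\n'
    print(parse_services(sample))
```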

Related

  • ADR-0159: Talon proposal
  • ADR-0158: Worker supervisor (superseded by talon)
  • infra/k8s-reboot-heal.sh: Tier 1 heal script
  • infra/worker-supervisor/: Original standalone worker supervisor (superseded)
  • anthropic/claude-opus-4.7: Tier 3 approved secondary fallback model
Install via CLI
npx skills add https://github.com/joelhooks/joelclaw --skill talon