name: talon
description: Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.
Talon — Infrastructure Watchdog Daemon
Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.
Quick Reference
talon validate # Parse/validate config + services files, print summary JSON
talon --check # Single probe cycle, print results, exit
talon --status # Current state machine position
talon --dry-run # Print loaded config, exit
talon --worker-only # Supervisor only, no infra probes
talon # Full daemon mode (worker + probes + escalation)
Paths
| What | Where |
|---|---|
| Binary | ~/.local/bin/talon |
| Source | ~/Code/joelhooks/joelclaw/infra/talon/src/ |
| Config | ~/.config/talon/config.toml |
| Service monitors | ~/.joelclaw/talon/services.toml |
| Default config | ~/Code/joelhooks/joelclaw/infra/talon/config.default.toml |
| Default services template | ~/Code/joelhooks/joelclaw/infra/talon/services.default.toml |
| Voice stale cleanup | ~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh |
| State | ~/.local/state/talon/state.json |
| Probe results | ~/.local/state/talon/last-probe.json |
| Log | ~/.local/state/talon/talon.log (JSON lines, 10MB rotation) |
| Launchd plist | ~/Code/joelhooks/joelclaw/infra/launchd/com.joel.talon.plist |
| RBAC guard manifest | ~/Code/joelhooks/joelclaw/k8s/apiserver-kubelet-client-rbac.yaml |
| Worker stdout | ~/.local/log/system-bus-worker.log |
| Worker stderr | ~/.local/log/system-bus-worker.err |
| Talon launchd log | ~/.local/log/talon.err |
Build
export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon
Architecture
talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│ ├── Kill orphan on port 3111
│ ├── Spawn bun (child process)
│ ├── Signal forwarding (SIGTERM → bun)
│ ├── Health poll every 30s
│ ├── PUT sync after healthy startup
│ └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│ ├── Colima VM alive?
│ ├── Docker socket responding?
│ ├── Talos container running?
│ ├── k8s API reachable?
│ ├── Node Ready + schedulable?
│ ├── Flannel daemonset ready?
│ ├── Redis PONG?
│ ├── Inngest /health 200?
│ ├── Typesense /health ok?
│ └── Worker /api/inngest 200?
│
└── Escalation (on failure)
├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
├── Tier 2: pi agent (approved cloud model, 10min cooldown, bounded by `agent.timeout_secs`; subprocess output uses temp files and timeout kills the whole process group so a stuck pi child cannot freeze Talon's health loop)
├── Tier 3: pi agent (approved secondary model fallback, same process-group timeout guard)
└── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold, 4h repeat cooldown)
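The supervisor's crash-recovery backoff (1s→30s) can be sketched as below. This is a minimal illustration, not Talon's actual code: the function name `restart_delay` and the doubling factor are assumptions; the source only states the 1s floor and 30s ceiling.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before restart attempt `attempt` (0-based).
/// Assumes the delay doubles on each attempt, starting at 1s and capped at 30s.
fn restart_delay(attempt: u32) -> Duration {
    const BASE_SECS: u64 = 1;
    const MAX_SECS: u64 = 30;
    // checked_shl returns None once the shift reaches 64 bits, so a very
    // large attempt counter falls back to the cap instead of panicking.
    let secs = BASE_SECS
        .checked_shl(attempt)
        .unwrap_or(MAX_SECS)
        .min(MAX_SECS);
    Duration::from_secs(secs)
}
```

Under these assumptions the delays run 1s, 2s, 4s, 8s, 16s, then stay at 30s. A supervisor would typically reset the attempt counter once the worker has passed a health poll.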
State Machine
healthy → degraded (1 critical probe failure)
degraded → failed (3 consecutive failures)
failed → investigating (agent spawned)
investigating → healthy (probes pass again)
investigating → critical (agent failed to fix)
critical → sos (SOS sent via Telegram + iMessage)
any → healthy (all probes pass)
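The transitions above map naturally onto an enum and a pure transition function. The sketch below is illustrative only: the type names, the `Event` variants, and the choice to ignore unlisted (state, event) pairs are assumptions, not Talon's real types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Healthy,
    Degraded,
    Failed,
    Investigating,
    Critical,
    Sos,
}

/// Hypothetical events that drive the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    AllProbesPass,
    CriticalProbeFailed { consecutive: u32 },
    AgentSpawned,
    AgentGaveUp,
    SosSent,
}

/// Returns the next state; pairs not listed in the table leave the state unchanged.
fn next_state(state: State, event: Event) -> State {
    use State::*;
    match (state, event) {
        // "any → healthy": a fully passing probe cycle always wins.
        (_, Event::AllProbesPass) => Healthy,
        (Healthy, Event::CriticalProbeFailed { .. }) => Degraded,
        (Degraded, Event::CriticalProbeFailed { consecutive }) if consecutive >= 3 => Failed,
        (Failed, Event::AgentSpawned) => Investigating,
        (Investigating, Event::AgentGaveUp) => Critical,
        (Critical, Event::SosSent) => Sos,
        (other, _) => other,
    }
}
```

Keeping the function pure (no I/O, no clock) makes every transition testable without running probes.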
Probes
| Probe | Command | Critical? |
|---|---|---|
| colima | `colima status` | Yes |
| docker | `docker ps` (Colima socket) | Yes |
| talos_container | `docker inspect joelclaw-controlplane-1` | Yes |
| k8s_api | `kubectl get nodes` | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (cordon + non-control-plane NoSchedule taints; allows the normal single-node control-plane taint) | Yes |
| flannel | `kubectl -n kube-system get daemonset kube-flannel -o jsonpath=...` | No |
| redis | `kubectl exec redis-0 -- redis-cli ping` | Yes |
| kubelet_proxy_rbac | `kubectl auth can-i --as=<apiserver-kubelet-client*> {get,create} nodes --subresource=proxy` | Yes |
| vm:docker | `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima docker ps` | No |
| vm:k8s_api | `ssh ...` python socket probe `:6443` | No |
| vm:redis | `ssh ...` python socket probe `:6379` | No |
| vm:inngest | `ssh ...` python socket probe `:8288` | No |
| vm:typesense | `ssh ...` python socket probe `:8108` | No |
| inngest | `curl localhost:8288/health` | No |
| typesense | `curl localhost:8108/health` | No |
| worker | `curl localhost:3111/api/inngest` | No |
Built-in critical probes use probes.critical_after_consecutive_failures (default 2) before escalation, so one transient probe miss does not launch heal/agent/SOS theatre. Dynamic critical probes use their own critical_after_consecutive_failures values. Non-critical probes need the global consecutive failure threshold.
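The consecutive-failure debounce described above can be sketched as a per-probe counter that resets on any pass. The `FailureTracker` type and its method names are hypothetical; only the threshold semantics come from the text.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of consecutive failures per probe name.
struct FailureTracker {
    consecutive: HashMap<String, u32>,
}

impl FailureTracker {
    fn new() -> Self {
        Self { consecutive: HashMap::new() }
    }

    /// Records one probe result and returns true when the probe has now
    /// failed at least `threshold` times in a row. A pass resets the streak.
    fn record(&mut self, probe: &str, passed: bool, threshold: u32) -> bool {
        let count = self.consecutive.entry(probe.to_string()).or_insert(0);
        if passed {
            *count = 0;
            return false;
        }
        *count = count.saturating_add(1);
        *count >= threshold
    }
}
```

With a threshold of 2, a single transient miss returns `false` and nothing escalates; the second miss in a row returns `true`.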
VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.
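The split-brain decision reduces to comparing a host-side probe with its VM witness. The sketch below assumes one boolean per side and uses hypothetical names (`Diagnosis`, `classify`); the real logic may weigh several services at once.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Diagnosis {
    /// Reachable from localhost: nothing to heal.
    Healthy,
    /// Alive inside the VM but dead on localhost: try bridge-heal first.
    BridgeSplitBrain,
    /// Dead on both sides: fall through to full recovery.
    ServiceDown,
}

/// Classifies one service from its localhost result and its VM witness result.
fn classify(host_ok: bool, vm_ok: bool) -> Diagnosis {
    match (host_ok, vm_ok) {
        // The witness is advisory only, so a passing host probe is sufficient.
        (true, _) => Diagnosis::Healthy,
        (false, true) => Diagnosis::BridgeSplitBrain,
        (false, false) => Diagnosis::ServiceDown,
    }
}
```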
Dynamic service probes
Add probes in ~/.joelclaw/talon/services.toml without rebuilding talon:
[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5
[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5
[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5
[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
- `launchd.<name>` passes when `launchctl list <label>` reports a non-zero PID, or when `launchctl print system/<label>` / `launchctl print gui/$(id -u)/<label>` reports `state = running`. This matters because `com.joel.gateway` is a system LaunchDaemon while Talon itself is a user LaunchAgent.
- `http.<name>` passes on HTTP `200`
- `script.<name>` passes on exit code 0, fails on non-zero (runs via `sh -c`)
- `critical = true` escalates when the probe is marked critical (or after debounce if configured)
- `critical_after_consecutive_failures = N` debounces critical alerts for dynamic probes (default `1` = immediate)
- `http.gateway_slack` uses gateway endpoint `GET /health/slack`, fails (`503`) when Slack channel is not started, and should be debounced (recommended `3` cycles)
- Do not probe `http://127.0.0.1:8081/` for `voice_agent` by default: root returns `503` when idle and causes false SOS noise
- Service-heal pre-cleanup for `voice_agent` now clears stale `uv`/`main.py` listeners on `:8081` before `launchctl kickstart` to avoid bind conflicts after force-cycles
- Talon hot-reloads service probes when `services.toml` mtime changes (no restart required)
- `kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}')` forces immediate reload
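Mtime-based hot reload, as described in the list above, can be approximated by remembering the last modification time seen and comparing it on each probe cycle. `ReloadWatcher` is a hypothetical name, and treating the first observation as a change (to trigger the initial load) is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical watcher that remembers the last mtime it observed.
struct ReloadWatcher {
    last_mtime: Option<SystemTime>,
}

impl ReloadWatcher {
    fn new() -> Self {
        Self { last_mtime: None }
    }

    /// Returns Ok(true) when the file's mtime differs from the last one seen.
    /// The first successful call reports a change so the caller loads the file.
    fn changed(&mut self, path: &Path) -> io::Result<bool> {
        let mtime = fs::metadata(path)?.modified()?;
        let changed = self.last_mtime != Some(mtime);
        self.last_mtime = Some(mtime);
        Ok(changed)
    }
}
```

An error (for example, the file being briefly absent mid-save) is returned to the caller, which can keep the previously loaded probes instead of dropping them.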
Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:
- `script.redis_aof_health`: critical after 3 failures; checks `aof_last_bgrewrite_status:ok` to catch Redis AOF rewrite/persistence corruption.
- `script.colima_vm_uptime`: critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
- `script.restate_worker_ready`: critical after 3 failures; verifies the `restate-worker` pod reports `Ready=true` before workloads are trusted.
- `script.kvm_device_present`: non-critical witness probe; records whether `/dev/kvm` is present inside Colima for nested-virt / Firecracker diagnosis.
Health endpoint
- `GET http://127.0.0.1:9999/health` returns Talon state JSON
- Gateway heartbeat consumes this as an additional watchdog signal
- Configure via `[health]` in `~/.config/talon/config.toml`
SOS channel config
- Tier 4 sends to both Telegram and iMessage
- Repeated SOS for the same persistent outage is throttled by `sos_cooldown_secs` (default 4h) to avoid paging spam after the first actionable alert.
- Telegram fields in `[escalation]`:
  - `sos_telegram_chat_id`
  - `sos_telegram_secret_name` (defaults to `telegram_bot_token`)
- Talon now leases Telegram tokens via `secrets lease <name> --ttl ...` (no `--raw`). If you still see `curl: (3) URL rejected: Malformed input to a URL function`, redeploy the latest Talon binary.
- Agent fallback commands must stay on approved models only: `openai-codex/gpt-5.5` primary and `anthropic/claude-opus-4.7` secondary. Do not restore Ollama/Azure/provider drift in Talon config.
- iMessage recipient remains `sos_recipient`
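The repeat-SOS throttle controlled by `sos_cooldown_secs` can be pictured as a small gate in front of the send step. This is a sketch under assumptions: `SosThrottle` is a hypothetical type, and passing the clock in as an argument is a choice made here for testability.

```rust
use std::time::{Duration, Instant};

/// Hypothetical gate that allows at most one SOS per cooldown window.
struct SosThrottle {
    cooldown: Duration,
    last_sent: Option<Instant>,
}

impl SosThrottle {
    fn new(cooldown_secs: u64) -> Self {
        Self { cooldown: Duration::from_secs(cooldown_secs), last_sent: None }
    }

    /// Returns true (and records `now`) if no SOS has been sent yet or the
    /// cooldown has fully elapsed since the last one; otherwise returns false.
    fn should_send(&mut self, now: Instant) -> bool {
        match self.last_sent {
            Some(prev) if now.saturating_duration_since(prev) < self.cooldown => false,
            _ => {
                self.last_sent = Some(now);
                true
            }
        }
    }
}
```

With the default of 4h (14400 seconds), the first alert goes out immediately and a persistent outage pages again only once four hours have passed.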
Launchd Management
Talon is active as com.joel.talon:
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="
Reload binary/config after deploy:
launchctl kickstart -k gui/$(id -u)/com.joel.talon
Single owner for worker supervision is mandatory:
- If `com.joel.system-bus-worker` is loaded, Talon auto-disables its internal worker supervisor to prevent port-3111 thrash.
- `com.joel.system-bus-worker` is a system LaunchDaemon, so verify it with `launchctl print system/...`; `launchctl list <label>` only checks the current user bootstrap domain and can lie by omission.
- Preferred end-state is Talon-only supervision, but coexistence must not cause kill/restart loops.
launchctl print system/com.joel.system-bus-worker | rg "state =|pid =|program =|last exit code ="
Legacy services should stay disabled when fully cut over:
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist
Troubleshooting
# Validate config + service monitor files
talon validate | python3 -m json.tool
# Check what talon sees right now
talon --check | python3 -m json.tool
# Check state machine
talon --status | python3 -m json.tool
# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null
# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool
# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool
# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err
# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'
# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start
# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh
Colima Stability Monitoring (2026-03-17)
Talon now monitors failure modes discovered during the Firecracker development incident:
| Probe | What it detects | Critical? |
|---|---|---|
| `script:redis_aof_health` | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| `script:colima_vm_uptime` | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| `script:restate_worker_ready` | Restate worker pod not 1/1 Ready | Yes (after 3) |
| `script:kvm_available` | Whether /dev/kvm exists (nested virt status) | No (informational) |
Known failure chain: nestedVirtualization → cascade
nestedVirtualization ON + heavy Docker build
→ Colima VZ VM crash (silent, no crash report)
→ Docker daemon restart → Talos container killed
→ Redis mid-write → AOF corruption → crash-loop
→ Restate mid-journal → stale invocations → infinite retries
→ Lima socket forwarding broken → docker CLI dead on macOS
Talon detects each stage:
- `colima_vm_uptime` < 120s → VM just crashed
- `redis` probe fails → Redis down
- `redis_aof_health` fails → AOF corrupted (needs manual fix)
- `restate_worker_ready` fails → worker can't start (may be /dev/kvm mount or image pull)
Talon cannot auto-fix Redis AOF corruption (requires redis-check-aof --fix). It WILL escalate to the pi agent (Tier 2) which should load the k8s skill's Redis AOF Recovery procedure.
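The stage-by-stage reading of the failure chain can be expressed as a lookup from failing probe names to findings. This is purely illustrative: the function `diagnose` and the message strings are assumptions; only the probe names and their meanings come from the list above.

```rust
/// Hypothetical mapping from failed probe names to human-readable findings.
/// Probes that are not part of the nested-virtualization chain are skipped.
fn diagnose(failed_probes: &[&str]) -> Vec<&'static str> {
    let mut findings = Vec::new();
    for probe in failed_probes {
        let finding = match *probe {
            "colima_vm_uptime" => "VM restarted within the last 120s",
            "redis" => "Redis is down",
            "redis_aof_health" => "Redis AOF corrupted; needs manual redis-check-aof --fix",
            "restate_worker_ready" => "Restate worker cannot start",
            _ => continue,
        };
        findings.push(finding);
    }
    findings
}
```

A summary like this could be handed to the Tier 2 agent so it starts from the most specific symptom instead of re-deriving the chain.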
Key Design Decisions
- Zero external deps — no tokio, no serde, no reqwest. Pure std. Keeps binary at ~444KB.
- Compiles its own PATH — immune to launchd environment brittleness (the class of bug that caused the 6-day outage).
- Worker is a child process — not a separate launchd service. Signal forwarding prevents orphans.
- TOML config parsed by hand — same pattern as worker-supervisor. No dependency just for config.
- Probes use Colima docker socket for critical host checks and add VM witness probes over Colima SSH for split-brain detection.
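To illustrate the "TOML parsed by hand" decision, here is a minimal std-only reader for the flat `[section]` plus `key = value` shape used in `services.toml`. It is a sketch under stated assumptions, not Talon's parser: `parse_flat_toml` is a hypothetical name, all values are kept as strings, and arrays, inline comments, escapes, and multi-line strings are not handled.

```rust
use std::collections::BTreeMap;

/// Section name → (key → raw value with surrounding double quotes stripped).
type Sections = BTreeMap<String, BTreeMap<String, String>>;

/// Parses `[section]` headers and `key = value` lines; everything else is ignored.
fn parse_flat_toml(input: &str) -> Sections {
    let mut sections = Sections::new();
    let mut current = String::new();
    for raw in input.lines() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // A header such as [http.gateway_slack] starts a new section.
        if let Some(name) = line.strip_prefix('[').and_then(|rest| rest.strip_suffix(']')) {
            current = name.trim().to_string();
            sections.entry(current.clone()).or_default();
            continue;
        }
        // Split on the first '=' only, so values may themselves contain '='.
        if let Some((key, value)) = line.split_once('=') {
            let value = value.trim().trim_matches('"');
            sections
                .entry(current.clone())
                .or_default()
                .insert(key.trim().to_string(), value.to_string());
        }
    }
    sections
}
```

Typed accessors (for example, parsing `timeout_secs` with `str::parse::<u64>`) can sit on top of this map, which keeps the parser itself small and free of dependencies.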
Related
- ADR-0159: Talon proposal
- ADR-0158: Worker supervisor (superseded by talon)
- `infra/k8s-reboot-heal.sh`: Tier 1 heal script
- `infra/worker-supervisor/`: Original standalone worker supervisor (superseded)
- `anthropic/claude-opus-4.7`: Tier 3 approved secondary fallback model