joelclaw-system-check

Stars: 57

Run a comprehensive health check of the joelclaw system — k8s cluster, worker, Inngest, Redis, Typesense/OTEL, tests, TypeScript, repo sync, memory pipeline, pi-tools, git config, active loops, disk, stale tests. Outputs a 1-10 score with per-component breakdown. Use when: 'system health', 'health check', 'is everything working', 'system status', 'how's the system', 'check everything', or at session start to orient.

By joelhooks. Updated 6/6/2026

name: joelclaw-system-check
displayName: Joelclaw System Check
description: "Run comprehensive joelclaw health checks, including postboot/daily checks for NAS, MinIO, Convex, Postgres, Typesense, Inngest, Panda/Flagg authority split, k8s, worker, Redis, OTEL, tests, repo sync, memory pipeline, and disk. Use when: 'system health', 'postboot check', 'daily system check', 'health check', 'is everything working', 'system status', 'how's the system', 'check everything', or at session start to orient."
version: 1.1.0
author: Joel Hooks
tags: [joelclaw, health, diagnostics, checks, operations]

joelclaw System Health Check

Run scripts/postboot.sh for the read-only postboot/daily topology sweep. It checks the Flagg/Panda authority split, NAS mounts, NFS tuning, custom NAS MinIO, Central Postgres, local Convex, Typesense, Inngest, and shadow Central health without printing secrets. The daily automation also reviews recent Codex infra transcripts so completed work does not stay trapped in chat.

~/Code/joelhooks/joelclaw-runtime/skills/joelclaw-system-check/scripts/postboot.sh

Run scripts/health.sh for the older full system health report with 1-10 score. This is heavier because it includes tests and TypeScript.

~/Code/joelhooks/joelclaw-runtime/skills/joelclaw-system-check/scripts/health.sh

Postboot/Daily Check

Use this after a reboot, after storage/network work, and for the daily automation.

Default command:

POSTBOOT_NAS_BENCHMARK_MIB=1 \
  ~/Code/joelhooks/joelclaw-runtime/skills/joelclaw-system-check/scripts/postboot.sh

Default posture:

  • Read-only except for the tiny NAS verifier write probe under the already-designated s3 proof paths.
  • Does not run the Convex write/read/delete smoke unless POSTBOOT_HEAVY=1.
  • Does not run the legacy full health.sh unless POSTBOOT_RUN_FULL_HEALTH=1.
  • Never prints raw .env, MinIO keys, Convex admin keys, or Postgres URLs with passwords.
  • Treats custom NAS MinIO on 100.67.156.41:39000 as canonical.
  • Treats ASUSTOR MinIO CE on 29990 as a warning/reference surface only.
  • On Flagg, expects JOELCLAW_CENTRAL_URL and direct Typesense helpers to still point at Panda until explicit Central cutover.
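The no-secrets rule in the posture above can be enforced at the output boundary. The following is a minimal sketch of one way to mask the password portion of a connection URL before it is printed. `redact_url` is a hypothetical helper name used for illustration; it is not taken from postboot.sh.

```shell
# Hypothetical helper (not from postboot.sh): mask the password in
# scheme://user:password@host URLs before anything is printed.
# Assumes '@' and '/' inside the password are percent-encoded, as a
# valid URL requires.
redact_url() {
  # Keep "://user", replace ":password@" with ":***@". URLs without
  # credentials do not match the pattern and pass through unchanged.
  printf '%s\n' "$1" | sed -E 's#(://[^:/@]+):[^@/]+@#\1:***@#'
}
```

Usage: pass any value that might carry credentials through `redact_url` before echoing it, for example `redact_url "$SOME_DATABASE_URL"` (the variable name is illustrative).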

Critical checks:

| Check | What Green Means |
| --- | --- |
| authority split env | Flagg still points Central capture/search at Panda while Flagg is shadow |
| Inngest direct health | the authoritative Inngest endpoint responds on /health; on Flagg this falls back to http://panda:8288 if INNGEST_URL is unset |
| Run capture health | Panda /api/runs/health returns ok=true with local Machine auth |
| Typesense configured health | configured Typesense endpoint responds on /health |
| NAS launchd label | system/com.joelclaw.central.nas-mounts is loaded with no failing last exit |
| NAS route 10GbE/MTU | route to 192.168.1.163 uses en0 and MTU 8192 |
| NAS mounts status | /Volumes/nas-nvme and /Volumes/three-body are mounted from LAN IP exports |
| NAS verifier write probe | service checkout verify-nas.sh passes |
| NFS tuned options | live mounts show rsize=524288,wsize=524288,readahead=128 |
| custom MinIO ready/live | custom NAS MinIO responds on 39000 |
| Central Postgres | socket, TCP, and readiness checks pass |
| local Convex backend | /version and dashboard respond |
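Several of the critical checks reduce to "resolve an endpoint, then probe /health". The sketch below shows that shape for the Inngest check, using the http://panda:8288 fallback described above. The function names and the 5-second timeout are assumptions for illustration; postboot.sh may implement this differently.

```shell
# Resolve the Inngest base URL. The fallback value comes from the Inngest
# check above. Note that ${VAR:-default} also applies when the variable is
# set but empty, not only when it is unset.
inngest_base_url() {
  printf '%s\n' "${INNGEST_URL:-http://panda:8288}"
}

# Probe <base>/health. -f makes curl exit non-zero on HTTP errors and
# --max-time bounds the wait (the 5-second value is an assumption).
probe_health() {
  curl -fsS --max-time 5 -o /dev/null "${1%/}/health"
}

# Example (performs a network request):
#   probe_health "$(inngest_base_url)" && echo "inngest: green"
```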

Warning/reference checks:

  • custom MinIO console TCP on 39001
  • ASUSTOR MinIO CE readiness on 29990
  • aggregate joelclaw status and joelclaw inngest status CLI wrappers, because those can reveal local env drift even when the authoritative endpoint is healthy
  • Convex LAN and tailnet forwards
  • local Convex LAN forwarder LaunchAgent
  • Flagg shadow Central health
  • gated heavy checks

If postboot.sh fails, fix critical failures first. Do not chase warning/reference failures before the required path is green.
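The "critical first" rule can be applied mechanically when summarizing results. This sketch assumes a hypothetical one-line-per-check format, `<severity>|<name>|<status>`; the real postboot.sh output format may differ, so treat the parsing as illustrative only.

```shell
# Hypothetical input on stdin, one check per line:
#   <severity>|<name>|<status>
# severity is "critical" or "warning"; status is "ok" or "fail".
triage() {
  awk -F'|' '
    $3 == "fail" && $1 == "critical" { crit[++c] = $2 }
    $3 == "fail" && $1 == "warning"  { warn[++w] = $2 }
    END {
      for (i = 1; i <= c; i++) print "FIX FIRST: " crit[i]
      # Surface warnings only once the required path is green.
      if (c == 0) for (i = 1; i <= w; i++) print "then review: " warn[i]
      # Exit non-zero only when a critical check failed.
      exit (c > 0)
    }'
}
```

Hiding warnings while a critical check is red is a design choice in this sketch; it keeps attention on the required path.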

Recent Codex Transcript Review

The daily automation should review recent Codex infra sessions after the health script runs.

Preferred indexed search:

joelclaw sessions search \
  "NAS three-body MTU MinIO Convex Postgres Typesense Inngest Panda Flagg Central postboot" \
  --source typesense \
  --machine all \
  --runtime codex \
  --limit 10 \
  --extract

If indexed Codex results are stale, irrelevant, or missing today's work, use a bounded local raw fallback:

tail -n 80 ~/.codex/session_index.jsonl
find ~/.codex -type f -name "rollout-*.jsonl" -mtime -2

Review only bounded snippets, final answers, commands, and receipts. Do not dump full transcripts or secrets.
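One way to keep the raw fallback bounded is to cap both the number of lines and the width of each line. The helper below is a sketch; `bounded_tail` and its default limits are assumptions, not part of the skill's scripts. Truncation limits volume only and does not redact secrets.

```shell
# Print at most the last N lines of FILE, each cut to WIDTH characters,
# so a transcript is sampled rather than dumped.
# Usage: bounded_tail FILE [N] [WIDTH]  (defaults 40 and 200 are assumptions)
bounded_tail() {
  tail -n "${2:-40}" "$1" | cut -c "1-${3:-200}"
}

# Example: sample rollout files touched in the last two days.
#   find ~/.codex -type f -name "rollout-*.jsonl" -mtime -2 |
#     while IFS= read -r f; do echo "== $f"; bounded_tail "$f" 20 160; done
```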

Promote durable facts into the right surface:

  • system topology and Panda/Flagg authority -> skills/system-architecture/SKILL.md
  • NAS access, LAN/MTU/NFS behavior -> skills/three-body/SKILL.md
  • custom NAS MinIO behavior -> skills/minio/SKILL.md
  • local Convex exposure and daily-use shape -> skills/local-convex/SKILL.md
  • dated receipts and decisions -> .brain/resources/*.svx

Treat in-progress subagent threads and sessions without a final answer as leads, not architecture truth.

Active Codex automation:

  • ID: daily-joelclaw-postboot-system-check
  • Schedule: daily at 08:15 local time
  • Workspace: /Users/joel/Code/joelhooks/joelclaw-runtime
  • Command intent: run POSTBOOT_NAS_BENCHMARK_MIB=1 skills/joelclaw-system-check/scripts/postboot.sh, summarize failures/warnings, review recent Codex infra transcripts, report uncaptured durable facts, and avoid secret output.

Full Health Check (16 components)

| Check | What | Green (10) | Yellow (5-7) | Red (1-3) |
| --- | --- | --- | --- | --- |
| k8s cluster | pods in joelclaw namespace | 4/4 Running, 0 restarts | partial pods | no pods |
| pds | AT Proto PDS on :9627 | version + collections | pod running, host publish degraded | pod not running |
| worker | system-bus on :3111 | 16+ functions | responding, low count | down |
| inngest | server :8288 reachable | responding | | down |
| redis/gateway | Redis + gateway session queues | connected, low pending queue | connected, backlog rising | unavailable |
| typesense/otel | Typesense health + OTEL query path | healthy + queryable | healthy, query degraded | unavailable |
| tests | isolated per-file bun test in system-bus | 0 fail | | failures |
| tsc | tsc --noEmit | clean | | type errors |
| repo sync | monorepo HEAD vs origin/main | in sync | ahead/behind | repo unavailable |
| memory pipeline | joelclaw inngest memory-health | healthy checks | degraded checks | failing checks |
| pi-tools | extension deps installed | all 3 | | deps missing |
| git config | user.name + email set | set | | missing |
| active loops | joelclaw loop list | queryable | query degraded | unavailable |
| gogcli | Google Workspace auth | account authed, token valid | token stored, no password | not configured |
| disk | free space + loop tmp | <80% used | | >80% |
| stale tests | __tests__/ + acceptance tests | clean | | present |
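The per-component bands roll up into the overall 1-10 score. How health.sh aggregates them is not stated here, so the sketch below assumes a plain average and picks one representative value inside the yellow and red bands (6 and 2). Both choices are assumptions.

```shell
# Map a status to points. Green is 10 per the table header; the values
# chosen inside the yellow (5-7) and red (1-3) bands are assumptions.
status_points() {
  case "$1" in
    green)  echo 10 ;;
    yellow) echo 6 ;;
    red)    echo 2 ;;
    *)      echo 0; return 1 ;;
  esac
}

# Average the points of all statuses passed as arguments, rounding down.
# The body runs in a subshell so the counters do not leak into the caller.
overall_score() (
  total=0
  count=0
  for s in "$@"; do
    total=$((total + $(status_points "$s")))
    count=$((count + 1))
  done
  [ "$count" -gt 0 ] && echo $((total / count))
)
```

For example, `overall_score green green yellow red` averages 10, 10, 6, and 2 and prints 7.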

When to Run

  • Session start — orient on system state before doing work
  • After loops complete — verify nothing broke
  • After infra changes — k8s, worker, Redis config
  • When something feels off — quick triage

Fixing Common Issues

Repo drift: cd ~/Code/joelhooks/joelclaw && git fetch origin && git status -sb
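To turn the fetch output into the in sync versus ahead/behind answer that the repo sync check reports, the commit counts on each side can be read with `git rev-list`. This is a sketch; the function names are illustrative and health.sh may compute drift differently.

```shell
# Classify drift from ahead/behind commit counts.
drift_status() {
  if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
    echo "in sync"
  else
    echo "ahead $1, behind $2"
  fi
}

# Count commits unique to each side of HEAD...origin/main in repo DIR.
# Run after `git fetch origin`; git prints "<ahead><TAB><behind>".
repo_drift() {
  counts=$(git -C "$1" rev-list --left-right --count HEAD...origin/main) || return 1
  # Intentionally unquoted so the two counts split into two arguments.
  drift_status $counts
}

# Example: repo_drift ~/Code/joelhooks/joelclaw
```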

pi-tools broken: cd ~/.pi/agent/git/github.com/joelhooks/pi-tools && bun add @sinclair/typebox @mariozechner/pi-coding-agent @mariozechner/pi-tui @mariozechner/pi-ai

PDS unreachable: curl -fsS http://localhost:9627/xrpc/_health then kubectl get deploy,svc,pods,pvc -n joelclaw | rg 'bluesky-pds|NAME' (or if pod down: kubectl rollout restart deployment/bluesky-pds -n joelclaw)

Worker down: joelclaw inngest restart-worker --register

Stale tests: rm -rf ~/Code/joelhooks/joelclaw/packages/system-bus/__tests__/ && find ~/Code/joelhooks/joelclaw/packages/system-bus/src -name "*.acceptance.test.ts" -delete

System-bus test false reds: the health script runs each src/**/*.test.ts file in its own Bun process because several legacy tests monkey-patch globals or use mock.module. If the aggregate health check is green but raw bun test is red, suspect inter-file mock leakage before treating runtime code as broken.
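The per-file isolation described above can be reproduced by hand when checking whether a red `bun test` is caused by mock leakage. The sketch below runs each test file in its own process; `run_isolated` and its output format are illustrative and are not copied from health.sh.

```shell
# Run every *.test.ts under ROOT in its own process so mocks and patched
# globals cannot leak between files.
# Usage: run_isolated ROOT RUNNER [ARGS...]
# Assumes test file paths contain no newlines.
run_isolated() (
  root=$1; shift
  failed=0
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    # </dev/null keeps the runner from consuming the file list on stdin.
    "$@" "$f" </dev/null || { echo "FAIL $f"; failed=$((failed + 1)); }
  done <<EOF
$(find "$root" -type f -name '*.test.ts' | sort)
EOF
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
)

# Example: run_isolated ~/Code/joelhooks/joelclaw/packages/system-bus/src bun test
```

If every file passes in isolation but the aggregate run fails, that points at inter-file leakage rather than runtime code.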

Loop tmp bloat: rm -rf /tmp/agent-loop/loop-*/ (only when no loops are running)

Inngest Hung-Run Quick Triage

When a run appears stuck after its first step, start by inspecting the run trace:

joelclaw run <run-id>

If trace shows Finalization failure with "Unable to reach SDK URL":

  1. Verify registration/health: joelclaw inngest status

  2. Verify function is present where expected: joelclaw functions | rg -i "manifest-archive|<function-name>"

  3. Check for stale app registrations in Inngest UI/API and remove stale SDK URLs.

  4. Assume possible handler blocking (not just network): review recent step code for filesystem/Redis/subprocess blocking before step response.
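The first branch of this triage can be scripted by matching the error string in the trace. The sketch assumes `joelclaw run` prints the trace as plain text containing the quoted message; the exact output format is not documented here, and `sdk_unreachable` is a hypothetical helper name.

```shell
# Read a run trace on stdin and succeed if it contains the
# "Unable to reach SDK URL" finalization error.
sdk_unreachable() {
  grep -qF 'Unable to reach SDK URL'
}

# Example (assumes plain-text trace output):
#   if joelclaw run "$RUN_ID" 2>&1 | sdk_unreachable; then
#     joelclaw inngest status
#   fi
```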

Install via CLI
npx skills add https://github.com/joelhooks/joelclaw --skill joelclaw-system-check
Repository Details

  • Stars: 57
  • Forks: 3
  • Branch: main
  • Path: SKILL.md