wb-troubleshooting - SKILL.md Agent Skill

name: wb-troubleshooting description: "General Wiren Board controller diagnostics — failed systemd services, low disk space, kernel/firmware mismatch, Docker, iptables, diagnostic archive (wb-diag-collect), boot issues, web UI inaccessible. Use when user says controller is broken, not working, service down, asks for logs for support, or needs a diagnostic archive. NOT for serial/Modbus (use wb-serial), NOT for network-only issues (use wb-network)." allowed-tools: Bash Read Write WebFetch WebSearch

troubleshooting

CRITICAL RULES

NEVER call wb-cli without --json from an agent. Human-mode output is unparseable; always use: wb-cli --json <command> This applies to every call including help: wb-cli --json <group> --help.

General diagnostics for issues on a Wiren Board controller. Load this when the user says: "doesn't work", "fix it", "broken", "error", "won't start", "service crashed", "issue with...", "collect diagnostics", "diagnostic archive", "logs and state" — and it's NOT about serial/Modbus (for serial there's wb-serial skill).

Don't confuse with backup (wb-controller-backup). The diagnostic archive is for analysis and support, not for restore. Collected by the wb-diag-collect utility and includes: configs from /etc, service logs (wb*, mosquitto, NetworkManager, etc.), output of diagnostic commands (df, ps, ip, dpkg, etc.).

HOST variable: in all examples below <HOST> means wirenboard-<SN>.local, where <SN> is the serial number (e.g. wirenboard-AABBCCDD.local). Substitute the real address.

First steps — always

Before fixing — figure out the cause. Don't fix symptoms.

0. Quick health check

ssh root@<HOST> wb-cli --json audit

This runs automated checks (failed units, controller identity). If wb-cli is not installed, proceed with manual steps below.

0a. Documentation — MANDATORY

Before any fix use WebFetch on the wiki page of the problem component. For example: Docker — WebFetch('https://wiki.wirenboard.com/wiki/Docker'), Modbus — WebFetch('https://wiki.wirenboard.com/wiki/Modbus'), Home Assistant — WebFetch('https://wiki.wirenboard.com/wiki/Home_Assistant'). Look for "Known issues", "Troubleshooting", "Limitations" sections. If a solution is there — apply it, don't invent your own.

1. Kernel mismatch

The most common cause of issues after upgrade. Check first:

ssh root@<HOST> 'echo "running: $(uname -r)"; dpkg -l "linux-image-wb*" 2>/dev/null | grep ^ii | awk "{print \"installed:\", \$3}"'

If versions don't match — the controller is on the old kernel. Kernel modules (br_netfilter, iptable_nat, can, i2c, etc.) won't load, Docker/iptables/network may not work. The only fix is a reboot. Don't try to work around via modprobe/iptables-legacy — useless under kernel mismatch.

2. Disk space

ssh root@<HOST> "df -h / /mnt/data"

use% > 95% or free space < 100 MB (on a typical 2 GB rootfs) is critical: apt doesn't work, logs aren't written, services crash. Look at percent used, not absolute values — / size depends on platform (wb6 — 2 GB, wb7/wb8 — 2 GB, on old builds you can encounter ~700 MB). Cleanup: apt clean; journalctl --vacuum-time=3d; rm -rf /tmp/*.

3. Failed services

ssh root@<HOST> "systemctl --failed --no-pager"

For each failed unit — two queries (together they give the full picture):

ssh root@<HOST> "systemctl status <unit> --no-pager"        # exit code, Result, ExecMainStatus — short summary
ssh root@<HOST> "journalctl -u <unit> -n 50 --no-pager"    # detailed logs with the failure cause

systemctl status for a failed unit itself returns exit code 3 — that's normal (systemctl status code, not an ssh error). When automating, don't confuse it with a real connection error.

4. Error journal

ssh root@<HOST> "journalctl -p err --since '1 hour ago' --no-pager"

Without --since, journalctl returns N latest lines regardless of age — they may be week-old errors. Pick the period by context ('10 minutes ago', 'today', '1 hour ago').

5. Load and memory

ssh root@<HOST> "uptime; free -h"

Load > 4 on WB — overloaded.

ssh root@<HOST> "top -bn1 | head -20"

Shows who's eating CPU.

Typical issues

Symptom	First step
Service won't start after upgrade	Kernel mismatch -> reboot
Docker won't start, iptables errors	First kernel mismatch. If kernel OK — iptables-legacy fix (see below)
modprobe: module not found	Kernel mismatch -> reboot
apt doesn't work, dpkg lock	`fuser /var/lib/dpkg/lock-frontend` — who holds it. Zombie from interrupted apt: `dpkg --configure -a`
Service crashes in a loop	`journalctl -u <unit> -n 100` — look for the cause, don't restart blindly
`fstrim.service` failed, `status=64/USAGE`	An entry in `/etc/fstab` points to a physically absent partition (typically `/mnt/sdcard` without an inserted SD). `fstrim --listed-in /etc/fstab` fails before reaching other mount points. Check `mount` and `ls /dev/mmcblk1*`. Cure: remove the line from fstab or drop-in with `ExecStart=/sbin/fstrim --fstab --quiet-unsupported`
No network	`ip addr`, `nmcli`, `ping 8.8.8.8`, `cat /etc/resolv.conf`
MQTT doesn't work	`systemctl is-active mosquitto`, `wb-cli --json audit`
Web UI doesn't open	`systemctl is-active nginx wb-mqtt-homeui`

Docker and iptables

If Docker won't start with errors like Chain 'MASQUERADE' does not exist, DOCKER-ISOLATION-STAGE, Failed to Setup IP tables — and kernel mismatch is ruled out:

Switch iptables to legacy:

ssh root@<HOST> "update-alternatives --set iptables /usr/sbin/iptables-legacy && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy"

Create the missing NAT rule:

ssh root@<HOST> "iptables -w10 -t nat -I POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE"

Restart Docker:

ssh root@<HOST> "systemctl restart docker && systemctl is-active docker"

If that didn't help — reboot:

ssh root@<HOST> "reboot"

More: https://wiki.wirenboard.com/wiki/Docker.

Diagnostic archive

Collect ONLY in two cases:

The user explicitly asks "send the diag archive" / "diagnostic archive"
Composing a bug report — the archive is mandatory as an attachment together with issue-specific logs

In all other cases (diagnosis, root cause, fix) — don't create the archive, work with logs directly via SSH.

Collection takes 30-60 seconds, run as a background task:

ssh root@<HOST> 'systemd-run --unit=wb-ai-job-$(cat /dev/urandom | tr -dc a-z0-9 | head -c8) --collect bash -c "wb-diag-collect /tmp/diag"'

wb-diag-collect takes the argument as a prefix and itself appends _SN_DATE.zip — the actual name isn't known in advance.

After completion — find the file and download:

ssh root@<HOST> "ls /tmp/diag*.zip | tail -1"

Then copy:

scp root@<HOST>:<path from ls output> ./

What the agent does NOT do

Fix symptoms before identifying the root cause. "Restarting the service made it work" is not a fix — surface the root cause.
Collect the diagnostic archive unless the user asks or it's a bug report. The archive is heavy; use direct SSH for routine diagnosis.
rm files to free disk space without showing the user what's being deleted. Especially under /mnt/data/, /var/log/.
Restart services blindly in a "try everything" loop. Each restart loses the chance to read the failure state.
Run reboot without the user's explicit OK — the controller may not come back cleanly (FIT in progress, broken filesystem).
Edit configs to "see if it helps". Back up first, then change one thing at a time.
Trust the kernel mismatch warning silently — surface it; firmware/kernel skew is a known cause of obscure failures.

When to ask the user

Root cause is uncertain — propose a hypothesis and ask before testing.
The fix requires a service restart that interrupts production (mqtt-serial, wb-rules, mosquitto) — confirm window.
About to clear logs or rotate them aggressively — confirm.
The diagnostic archive contains MQTT passwords / API tokens in configs — confirm whether to redact before sending to support.
The problem requires a full reboot — confirm timing; the controller is offline for ~60 seconds.

Principle

Diagnose -> read documentation -> explain the cause -> propose a solution -> wait for confirmation. Don't fix blindly.