---
name: kaitu-node-ops
description: Unified runbook for Kaitu VPN node operations. Operate existing k2s nodes (status/logs/restart/update, user auth, batch fleet ops), provision NEW private/shared nodes from scratch, and build/configure/observe the node-authority traffic metering + quota cutoff sidecar (incl. AWS Lightsail billing rules). Canonical deployment is the single-compose k2s + k2-sidecar stack. Covers safety guardrails, architecture ID, env vars, verification, troubleshooting, and device-log triage.
triggers:
  - node ops
  - server ops
  - node management
  - docker ops
  - k2 node
  - exec on node
  - list nodes
  - node health
  - node restart
  - node update
  - node logs
  - docker compose
  - k2s
  - k2-slave
  - provision node
  - provision private node
  - provisioning intent
  - claim provisioning
  - provision job
  - private node
  - 专属节点
  - dedicated node
  - K2_PRIVATE_CLAIM
  - metering ops
  - traffic metering
  - sidecar deploy
  - sidecar image
  - build sidecar
  - traffic limit
  - traffic quota
  - traffic cutoff
  - quota cutoff
  - set-usage
  - node usage
  - NodeUsage
  - usage reporter
  - K2_NODE_TRAFFIC_LIMIT_GB
  - K2_NODE_BILLING_START_DATE
  - Lightsail billing
---
Kaitu Node Operations (unified)
One skill for everything you do to a Kaitu VPN node, via the kaitu-center MCP (list_nodes, exec_on_node, ping_node, delete_node, the cloud*/node_operation tools) + SSH.
Canonical deployment = the single-compose k2s + k2-sidecar stack (docker/docker-compose.yml, dir /apps/k2s/). Everything below assumes it. The old k2-slave SNI router is legacy — see §1.
One stack, three jobs — the unifying fact: shared-pool nodes and private (dedicated) nodes run the exact same compose + images. They differ only by a few .env variables. Provisioning a private node = the standard node deploy + a Center work-item wrapper. Metering + cutoff run on every node that has a billing date set. So this is one skill, not three.
Quick index — what do you want to do?
| Goal | Where |
|---|---|
| Safety rules (read once) | §0 Guardrails |
| Identify a node's architecture (k2s vs legacy) | §1 |
| Look up / set an `.env` variable | §2 Env master reference |
| Status / logs / restart / update one node | §3 Standard ops |
| Manage k2s users (auth file) | §3.1 |
| Deploy or update a single node | §4 |
| Roll compose / images / auto-update across the fleet | §5 Batch ops |
| Verify a node after deploy | §6 Verification |
| Fix a broken node (iptables-nft, restart loop) | §7 Troubleshooting |
| AWS Lightsail specifics (reboot, billing) | §8 + references/metering.md |
| Provision a NEW private node from a Center work item | → references/provisioning.md |
| Build / push the sidecar image | → references/metering.md Part A |
| Per-provider quota knobs (reset day / bundle / already-used) | → references/metering.md Part B |
| Configure quota / billing / mid-cycle seed | → references/metering.md Part C |
| AWS Lightsail billing rules (max(in,out), calendar month, proration) | → references/metering.md |
| Observe metering / operate the cutoff | → references/metering.md Parts D–E |
| Triage a user's device logs (DIAG analysis) | → references/device-logs.md |
The three references/*.md files are pulled in with the Read tool only when the task needs them — keep the hub lean.
§0 Safety Guardrails
Full SSH root access — these rules prevent accidental damage:
- `K2_NODE_SECRET` / `K2_PRIVATE_CLAIM` / claim tokens are untouchable: never read, display, modify, transmit, or echo them. Write secrets only via heredoc. Redact when printing `.env`: `sed -E "s/(SECRET|CLAIM|TOKEN)=.*/\1=<redacted>/"`. MCP stdout redaction is a backstop, not a license.
- Never touch another node: only operate the node you were asked about. Never read/modify a different node's secret or config.
- Update = `pull` + `up -d`, never `down`. `up -d` recreates only changed containers; `down` removes them and interrupts service. Use `--remove-orphans` to clear stale `k2v5`/`k2v4-slave`/`k2-oc` containers after a rename.
- `docker restart` does NOT re-read `.env`; it reuses the old env (so does `docker compose restart`). To apply an `.env` change you must run `docker compose up -d`, which recreates the affected containers. See §4.
- Config changes go through `.env` only: don't hand-edit `docker-compose.yml` or `/etc/kaitu/` (auto-generated by the sidecar; overwritten on restart). Port mapping (40000-40019/udp) is Docker-managed; no manual iptables DNAT.
- Never delete `/apps/k2s/`: it is the entire node deployment.
- Confirm before restart: run `docker compose ps` first and understand what's affected.
- `K2_DOMAIN` stays empty: the sidecar auto-derives a unique `{ipv4-with-dashes}.sslip.io` (see §2). No manual assignment, no collision risk.
- Pinned image tags only. Never move `:latest` outside a deliberate fleet rollout; unpinned nodes auto-pull it, so moving it touches every node.
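To see what the redaction filter actually does, here is a sketch that runs the same `sed` expression over a made-up `.env` (the values are fake; the `redact_env` wrapper name is hypothetical):

```bash
# Sketch: apply the §0 redaction filter to a fake .env (values are made up).
redact_env() {
  # Same expression as §0: the variable name survives, the value after
  # SECRET=/CLAIM=/TOKEN= is replaced with <redacted>.
  sed -E 's/(SECRET|CLAIM|TOKEN)=.*/\1=<redacted>/'
}

printf '%s\n' \
  'K2_NODE_SECRET=0123abcd' \
  'K2_PRIVATE_CLAIM=claim-xyz' \
  'K2_NODE_NAME=hk.aliyun.wm01' | redact_env
# K2_NODE_SECRET=<redacted>
# K2_PRIVATE_CLAIM=<redacted>
# K2_NODE_NAME=hk.aliyun.wm01
```

The filter keys off variable *names*, so a secret whose name lacks SECRET, CLAIM or TOKEN (e.g. a hypothetical `*_KEY`) would pass through in clear.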
§1 Identify Node Architecture
Run first on every node:
```
exec_on_node(ip, "docker ps --format '{{.Names}}'")
```
- Output has `k2s` → canonical (k2s tunnel container). Everything in this skill applies directly.
- Output has `k2-slave` → legacy SNI router (no ECH, host network). All commands are identical; just substitute the tunnel container name `k2-slave` for `k2s`. Migrate it to k2s when convenient (push current compose + `up -d --remove-orphans`).
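A sketch of the same decision as a shell function, assuming it is fed the `docker ps --format '{{.Names}}'` output on stdin (`detect_arch` is a hypothetical helper, not part of the repo):

```bash
# Sketch: classify a node from `docker ps --format '{{.Names}}'` output on stdin.
# detect_arch is a hypothetical helper name.
detect_arch() {
  local names
  names=$(cat)
  if grep -qx 'k2s' <<<"$names"; then
    echo canonical        # k2s tunnel container present
  elif grep -qx 'k2-slave' <<<"$names"; then
    echo legacy           # SNI router: substitute k2-slave for k2s in all commands
  else
    echo unknown          # neither tunnel container is running; investigate first
  fi
}
```

On a node: `docker ps --format '{{.Names}}' | detect_arch`.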
Canonical stack (dir /apps/k2s/, images from public.ecr.aws/d6n9t2r2/):
```
k2-sidecar (bridge net, k2-internal)
  └── healthy ──→ k2s (bridge net; Docker port map :443 TCP+UDP + 40000-40019 UDP → container 443)
```
| Container | Role | Network | Image |
|---|---|---|---|
| `k2-sidecar` | Registration, config-gen, health report, traffic metering + cutoff | bridge | `k2-sidecar:${K2_VERSION}` |
| `k2s` | Tunnel data plane. ECH front door on 443 → in-process QUIC + TCP-WS | bridge | `k2s:${K2_VERSION}` |
- Sidecar writes `/etc/kaitu/.ready` when config-gen completes; k2s waits on the sidecar healthcheck.
- No iptables management, no NET_ADMIN, no wrapper scripts.
- `k2-oc` (OpenConnect) was retired 2026-04-30. If you see it, push compose + `up -d --remove-orphans`.
§2 Env Master Reference (/apps/k2s/.env)
One table for every .env var across ops / provisioning / metering. "Set by": who writes it.
| Variable | Purpose | Set by | Notes |
|---|---|---|---|
| `K2_NODE_SECRET` | Node auth key + Center Basic-auth (`ipv4:secret`) for usage reports | provision (`openssl rand -hex 32`) | SECRET: never read/log. TOFU-recorded by Center at first registration. |
| `K2_PRIVATE_CLAIM` | One-time claim token → Center sets `Class=private`, binds owner, activates owner's sub | provision (from claim) | SECRET: one-time, never log. Identity/activation only, NOT metering. Empty on shared nodes. |
| `K2_CENTER_URL` | Center API base (registration + `/slave/usage`) | provision | dev/test = `https://k2.52j.me`. |
| `K2_DOMAIN` | Tunnel domain | leave empty | Empty → sidecar auto-derives `{ipv4-dashes}.sslip.io` (unique by IP). Set only for a custom domain (you own its DNS). |
| `K2_VERSION` | Image tag for both images | ops | Pin a known-good tag, never `:latest` outside a fleet rollout. If k2s/sidecar need different tags, pin k2s's `image:` line directly in compose and let `K2_VERSION` drive the sidecar. |
| `K2_NODE_NAME` | Human name, `{region}.{provider}.wm{NN}` (private: `pn-<subId>`) | provision | Registration meta. |
| `K2_NODE_REGION` | Region id, e.g. `jp-tokyo.aws` | provision | Registration meta. |
| `K2_NODE_ARCH` | Protocol arch tag (default `k2v5`) | ops | Registration meta. |
| `K2_IP_TYPE` | `residential` / `non_residential` / `unknown` | provision | Sidecar reports → Center `SlaveNode.ip_type` (drives 住宅IP / residential-IP visibility). Last writer wins with ops `update_node`. |
| `K2_NODE_BILLING_START_DATE` | Monthly cycle anchor, `yyyy-MM-dd` (day-of-month extracted) | provision/ops | REQUIRED to meter. Empty → metering OFF, node runs uncapped. Day must match the provider's reset day (Lightsail = 01; BandwagonHost (搬瓦工) = KiwiVM "Next reset" day, don't assume 01); derive per provider in `references/metering.md` Part B. |
| `K2_NODE_TRAFFIC_LIMIT_GB` | Monthly quota (GiB). Node pauses k2s at used ≥ limit − 500 MiB | provision/ops | `0` = unlimited (safe fallback). |
| `K2_NODE_TRAFFIC_USED_GB` | Mid-cycle onboarding seed (GiB already used) | provision | Applied once on first boot (no state). Prefer `set-usage` later; see `references/metering.md` Part C. |
| `K2_CUTOFF_POLL_INTERVAL` | Enforcer poll period | ops | Default `5s`. |
| `K2_JUMP_PORT_MIN` / `K2_JUMP_PORT_MAX` | Hop port range (default 40000/40019) | ops | Docker port map → container 443; 20 ports, high range to dodge GFW scans. |
| `K2_LOG_LEVEL` | `debug` / `info` / `warn` / `error` | ops | |
Metering is decoupled from the private claim. Any node (shared or private) meters and self-cuts if and only if `K2_NODE_BILLING_START_DATE` is set. `K2_PRIVATE_CLAIM` only controls identity/activation. The reporter cadence is owned by Center (`next_report_interval`, 10 s floor); don't tune it from the node.
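The metering rules in the env table and the note above can be restated as two small functions. This is a sketch under stated assumptions (GiB = 1024³ bytes, integer limits, anchor day ≤ 28), not the sidecar's implementation; `cycle_start` and `cutoff_state` are hypothetical names:

```bash
# Sketch of the documented metering rules; NOT the sidecar's implementation.
# Assumptions: GiB = 1024^3 bytes, integer LIMIT_GB, anchor day <= 28
# (how days 29-31 are handled is not documented here).

# cycle_start ANCHOR TODAY -> first day (yyyy-MM-dd) of the cycle that contains TODAY
cycle_start() {
  local anchor=$1 today=$2
  local aday=$((10#${anchor:8:2}))            # only the anchor's day-of-month matters
  local y=$((10#${today:0:4})) m=$((10#${today:5:2})) d=$((10#${today:8:2}))
  if (( d < aday )); then                     # before this month's reset: cycle began last month
    m=$((m - 1))
    if (( m == 0 )); then m=12; y=$((y - 1)); fi
  fi
  printf '%04d-%02d-%02d\n' "$y" "$m" "$aday"
}

# cutoff_state BILLING_DATE LIMIT_GB USED_BYTES -> off | unlimited | run | pause
cutoff_state() {
  local billing=$1 limit_gb=$2 used=$3
  if [ -z "$billing" ]; then echo off; return; fi            # no billing date: metering OFF, uncapped
  if [ "$limit_gb" -eq 0 ]; then echo unlimited; return; fi  # 0 = unlimited
  local threshold=$(( limit_gb * 1024**3 - 500 * 1024**2 ))  # limit - 500 MiB, in bytes
  if (( used >= threshold )); then echo pause; else echo run; fi
}

cycle_start 2026-03-20 2026-07-15            # 2026-06-20
cutoff_state 2026-01-01 1024 1098987339776   # pause (exactly limit - 500 MiB)
```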
§3 Standard Operations
Run each command via `exec_on_node(ip, command)`. For legacy nodes substitute `k2-slave` for `k2s`.
| Operation | Command |
|---|---|
| All container status | `cd /apps/k2s && docker compose ps` |
| Container logs (tail) | `docker logs --tail 100 <container>` |
| Pull + restart all | `cd /apps/k2s && docker compose pull && docker compose up -d` |
| Restart one container | `docker restart <container>` (⚠ no `.env` reload; see §4) |
| View `.env` (redacted) | `sed -E "s/(SECRET\|CLAIM\|TOKEN)=.*/\1=<redacted>/" /apps/k2s/.env` |
| Sidecar health | `docker inspect --format='{{.State.Health.Status}}' k2-sidecar` |
| Disk / mem / CPU | `df -h && free -h && top -bn1 \| head -5` |
| Network conns | `ss -s` |
| IPv6 status | `ip -6 addr show scope global` |
| Hop port mapping | `docker port k2s` (expect 22 maps: 443/tcp + 443/udp + 40000-40019/udp) |
| BBR status | `sysctl net.ipv4.tcp_congestion_control` (expect `bbr`) |
| Container outbound net | `docker exec k2-sidecar wget -qO- --timeout=5 https://api.ipify.org` |
| Auto-update log | `tail -50 /apps/k2s/auto-update.log` |
| Timezone | `timedatectl \| grep 'Time zone'` (must be `Asia/Singapore`) |
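Because `docker restart` keeps the old environment, a running container can silently disagree with `.env`. A sketch for spotting that on a non-secret key, reading the container env via `docker inspect`. The `env_drift` helper is hypothetical, refuses SECRET/CLAIM/TOKEN keys, and using `k2-sidecar` + `K2_LOG_LEVEL` as the example assumes compose passes that variable into the sidecar:

```bash
# Sketch: does the running container still carry an old .env value? (hypothetical helper)
# Usage on the node (example key/container are assumptions):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' k2-sidecar \
#     | env_drift K2_LOG_LEVEL /apps/k2s/.env
env_drift() {
  local key=$1 env_file=$2 want have
  case $key in
    *SECRET*|*CLAIM*|*TOKEN*) echo "refusing secret key $key" >&2; return 2 ;;
  esac
  want=$(grep -m1 "^${key}=" "$env_file" | cut -d= -f2-)   # value in .env
  have=$(grep -m1 "^${key}=" | cut -d= -f2-)               # value inside the container (stdin)
  if [ "$want" = "$have" ]; then
    echo in-sync
  else
    echo "drift: .env=$want container=$have"                # fix with: docker compose up -d
  fi
}
```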
§3.1 k2s user auth
Two modes, first match wins: (1) /apps/k2s/users (bind-mounted), (2) Center remote /slave/device-check-auth (fallback). Empty file = pure remote auth = default. No restart needed (changes apply on next full auth; 1h tickets stay valid). Format = one udid:token per line.
| Operation | Command |
|---|---|
| View | cat /apps/k2s/users |
| Add | echo "udid:token" >> /apps/k2s/users |
| Remove | sed -i "/^UDID:/d" /apps/k2s/users |
| Clear (remote only) | truncate -s 0 /apps/k2s/users |
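A lint sketch for the users file before relying on it. It assumes udids contain no `:` and neither field contains whitespace, and it never prints tokens (`users_lint` is hypothetical):

```bash
# Sketch: lint /apps/k2s/users (hypothetical helper). Never prints tokens.
# Assumes: udid has no ':'; neither field contains whitespace; blank lines are ignored.
users_lint() {
  local file=$1 line udid n=0 entries=0 bad=0
  local re='^[^:[:space:]]+:[^[:space:]]+$'
  declare -A seen=()
  if [ ! -s "$file" ]; then echo "empty: remote auth only"; return 0; fi
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if [ -z "$line" ]; then continue; fi
    if [[ ! $line =~ $re ]]; then
      echo "line $n: malformed" >&2; bad=1; continue
    fi
    udid=${line%%:*}
    if [ -n "${seen[$udid]:-}" ]; then
      echo "line $n: duplicate udid $udid (first on line ${seen[$udid]})" >&2; bad=1
    else
      seen[$udid]=$n
    fi
    entries=$((entries + 1))
  done < "$file"
  echo "entries=$entries"
  return "$bad"
}
```

On a node: `users_lint /apps/k2s/users`.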
§4 Deploy / Update a Single Node
```
# Push the canonical compose (idempotent)
exec_on_node(ip, "sudo tee /apps/k2s/docker-compose.yml > /dev/null", { scriptPath: "docker/docker-compose.yml" })
# Edit .env (heredoc for any secret; never on the command line)
exec_on_node(ip, "sudo tee /apps/k2s/.env > /dev/null <<'ENVEOF'\n<contents>\nENVEOF")
# Apply: pull + up -d (NEVER down)
exec_on_node(ip, "cd /apps/k2s && sudo docker compose pull && sudo docker compose up -d --remove-orphans")
```

- `docker compose up -d` is the only way to apply `.env` changes (`docker restart` reuses the old env).
- If sidecar & k2s need different tags: `sudo sed -i "s#k2s:\${K2_VERSION:-latest}#k2s:v0.4.6-<sha>#" /apps/k2s/docker-compose.yml`.
- Brand-new node (from nothing)? That's provisioning → `references/provisioning.md`.
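The tag-pin `sed` above only matches if the compose `image:` line uses the `${K2_VERSION:-latest}` form; otherwise it silently changes nothing. A sketch that wraps it and fails loudly when nothing matched (`pin_k2s_tag` is hypothetical; run it as root on the node, with your known-good tag in place of the placeholder):

```bash
# Sketch: pin only the k2s image while K2_VERSION keeps driving the sidecar.
# pin_k2s_tag COMPOSE_FILE TAG   (hypothetical helper; GNU sed -i assumed)
pin_k2s_tag() {
  local file=$1 tag=$2
  # \$ keeps ${K2_VERSION:-latest} literal; ${tag} expands to the chosen tag.
  sed -i "s#k2s:\${K2_VERSION:-latest}#k2s:${tag}#" "$file"
  if grep -qF "k2s:${tag}" "$file"; then
    echo "pinned k2s:${tag}"
  else
    echo "pattern not found: compose image line is not in the expected form" >&2
    return 1
  fi
}

# pin_k2s_tag /apps/k2s/docker-compose.yml v0.4.6-<sha>
# then apply with: cd /apps/k2s && docker compose up -d
```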
Script execution (mandatory form)
For scripts in docker/scripts/, pipe via SSH stdin — never inline a large script or base64:
```
exec_on_node(ip, "sudo bash -s", { scriptPath: "docker/scripts/provision-node.sh", timeout: 300 })  # root
exec_on_node(ip, "bash -s", { scriptPath: "docker/scripts/enable-ipv6.sh" })                        # non-root
```

| Script (`docker/scripts/`) | Purpose | Warning |
|---|---|---|
| `provision-node.sh` | Full OS prep: TZ Asia/Singapore + Docker CE + IPv6 + BBR + nftables + `daemon.json` + UFW-Docker + SSH→1022 + journald + auto-update cron | Destructive (stops containers). Fresh/rebuild only. sudo + confirmation. |
| `auto-update.sh` | Daily pull/compare/restart + Slack notify | Safe. Cron 04:00 Beijing (needs Asia/Singapore TZ). |
| `enable-ipv6.sh` / `totally-reinstall-docker.sh` | Subsets of `provision-node.sh` | Superseded by `provision-node.sh`. |
| `simple-docker-pull-restart.sh` | Pull + restart | Safe routine update. |
§5 Batch Fleet Operations
Local scripts in this skill dir (.claude/skills/kaitu-node-ops/). Need KAITU_CENTER_URL + KAITU_ACCESS_KEY; SSH via KAITU_SSH_USER (default ubuntu) port KAITU_SSH_PORT (default 1022).
| Script | Purpose | Flags |
|---|---|---|
| `deploy-compose.sh` | SCP `docker/docker-compose.yml` to all active nodes (MD5-skip, no restart) | `--all`, `--dry-run` |
| `update-compose.sh` | `pull` + `up -d` across active nodes, rolling | `--sleep=N`, `--node=IP`, `--dry-run` |
| `deploy-auto-update.sh` | SCP `auto-update.sh` + install daily cron (idempotent) | `--all`, `--node=IP`, `--dry-run` |
⚠ DIR-MIGRATION GUARD (active until the whole fleet is on `/apps/k2s`): the canonical compose now carries `name: k2s`. Do NOT run `deploy-compose.sh` / `update-compose.sh` / `deploy-auto-update.sh` against the full fleet while any node is still at `/apps/kaitu-slave`. Pushing the `name: k2s` compose to an un-migrated node makes its next `up -d` / nightly `auto-update` resolve project `k2s` → empty `k2s_*` volumes (cert + metering state lost) + `container_name` collision. During the migration window use `--node=<already-migrated-IP>` only, or run fleet-wide after the sweep completes. Remove this note once all nodes are on `/apps/k2s`. (Background: spec `2026-06-23-node-deploy-dir-k2s-migration-design.md`.)
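A sketch of a rolling `pull` + `up -d` that enforces the guard per node and stops the roll on the first failure. It is hypothetical (not `update-compose.sh`), assumes plain SSH with the `KAITU_SSH_USER` / `KAITU_SSH_PORT` defaults from §5, treats a node as migrated only when `/apps/k2s` exists and `/apps/kaitu-slave` does not (an assumption; adjust if migrated nodes keep the old dir), and defaults to a dry run:

```bash
# Sketch: rolling pull + up -d with a per-node migration guard. NOT update-compose.sh.
SSH_USER=${KAITU_SSH_USER:-ubuntu}
SSH_PORT=${KAITU_SSH_PORT:-1022}
GUARD='test -d /apps/k2s && ! test -d /apps/kaitu-slave'
APPLY='cd /apps/k2s && sudo docker compose pull && sudo docker compose up -d --remove-orphans'

# rolling_update SLEEP_SECONDS IP...   DRY_RUN defaults to 1 (print only)
rolling_update() {
  local sleep_s=$1 ip
  shift
  for ip in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "would run on $ip: $GUARD && $APPLY"
      continue
    fi
    # A failed guard or a failed pull/up stops the whole roll.
    ssh -p "$SSH_PORT" "$SSH_USER@$ip" "$GUARD && $APPLY" \
      || { echo "stopping: $ip failed" >&2; return 1; }
    sleep "$sleep_s"
  done
}

# DRY_RUN=0 rolling_update 60 <migrated-ip-1> <migrated-ip-2>
```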
Node activity heuristic (no explicit status field): active = tunnelCount > 0 and a real name (hk.aliyun.wm01, not IP-as-name) and SSH on :1022. tunnelCount==0 + IP-name = decommissioned; scripts skip them by default (--all includes them).
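The activity heuristic as a sketch. `is_active` is hypothetical, treats only dotted IPv4 strings as IP-as-name, and reports anything that fits neither rule as `unclear` rather than guessing:

```bash
# Sketch of the §5 heuristic (hypothetical helper).
# is_active TUNNEL_COUNT NAME SSH_PORT -> active | decommissioned | unclear
is_active() {
  local tunnels=$1 name=$2 ssh_port=$3
  local ip_re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$' ip_name=0
  if [[ $name =~ $ip_re ]]; then ip_name=1; fi
  if (( tunnels > 0 && ip_name == 0 )) && [ "$ssh_port" = 1022 ]; then
    echo active
  elif (( tunnels == 0 && ip_name == 1 )); then
    echo decommissioned
  else
    echo unclear          # mixed signals: check by hand before including or skipping
  fi
}
```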
MCP tools: list_nodes(country?,name?), exec_on_node(ip,command,timeout?=60,scriptPath?), ping_node(ip), delete_node(ip). exec_on_node returns status (success/ssh_error/timeout); check exitCode for pass/fail; stdout capped 10k, stderr 2k, both secret-redacted.
§6 Post-Deployment Verification
MANDATORY after deploying to a new/restarted node. In order:
| # | Check | Command | Expected |
|---|---|---|---|
| 0 | Timezone | `timedatectl \| grep 'Time zone'` | `Asia/Singapore (+08, +0800)` |
| 1 | Containers up | `cd /apps/k2s && docker compose ps` | all Up, sidecar `(healthy)`, no `k2-oc` orphan |
| 2 | Sidecar registered | `docker logs --tail 30 k2-sidecar 2>&1 \| grep "Registration completed"` | `tunnels=1` |
| 3 | Tunnel domain | `docker logs --tail 30 k2-sidecar 2>&1 \| grep "Tunnel registered"` | `{ipv4-dashes}.sslip.io`, `created=true` |
| 4 | k2s ready | `docker logs --tail 20 k2s 2>&1 \| grep "server ready"` | `k2s server ready listen=:443` |
| 5 | Container net | `docker exec k2-sidecar wget -qO- --timeout=5 https://api.ipify.org` | node's public IP |
| 6 | MCP cross-check | `list_nodes(name=<node>)` | one `tunnels` entry with the sslip.io domain |
| 7 | Port mapping | `docker port k2s` | 22 maps: 443/tcp + 443/udp + 40000-40019/udp |

`2>&1` matters in rows 2–4: `docker logs` replays the container's stderr on stderr, so without it `grep` can miss the log line.
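A sketch that runs checks #2–#4 against saved log output. The marker strings are copied from the table above; `check_markers` and the `/tmp` paths are hypothetical:

```bash
# Sketch: checks #2-#4 on saved logs (hypothetical helper; markers copied from the table).
# On the node, capture both streams first:
#   docker logs --tail 30 k2-sidecar > /tmp/sidecar.log 2>&1
#   docker logs --tail 20 k2s > /tmp/k2s.log 2>&1
#   check_markers /tmp/sidecar.log /tmp/k2s.log
check_markers() {
  local sidecar_log=$1 k2s_log=$2 fail=0
  grep -q 'Registration completed' "$sidecar_log" || { echo 'FAIL #2 sidecar registered'; fail=1; }
  grep 'Tunnel registered' "$sidecar_log" | grep -q 'sslip\.io' || { echo 'FAIL #3 tunnel domain'; fail=1; }
  grep -q 'server ready' "$k2s_log" || { echo 'FAIL #4 k2s ready'; fail=1; }
  if [ "$fail" -eq 0 ]; then echo 'PASS #2-#4'; fi
  return "$fail"
}
```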
Metering nodes — also (when K2_NODE_BILLING_START_DATE is set):
| Check | Command | Expected |
|---|---|---|
| Meter initialized | `docker logs k2-sidecar 2>&1 \| grep "Traffic monitor initialized"` | `billingDate=… limitGB=… rx=… tx=…` |
| Reporter running | `docker logs --tail 80 k2-sidecar 2>&1 \| grep usage-reporter` | `usage-reporter-start` + periodic `usage-reporter-cycle-ok cumulative=… quotaTotal=…` (cumulative climbs) |
If the reporter line is absent, confirm that `K2_NODE_BILLING_START_DATE` (+ `K2_NODE_TRAFFIC_LIMIT_GB`) and `K2_NODE_SECRET` are present. Without a billing date the sidecar logs `Metering disabled: no billing date` and runs uncapped. Deep metering ops → `references/metering.md`.
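To confirm that `cumulative` climbs without eyeballing, a sketch that pulls the `cumulative=` values out of `usage-reporter-cycle-ok` lines. It assumes the value is a plain number in `key=value` form (not confirmed here); `reporter_climbing` is hypothetical:

```bash
# Sketch: is cumulative non-decreasing and rising across reporter cycles? (hypothetical helper)
# On the node: docker logs --tail 80 k2-sidecar 2>&1 | reporter_climbing
reporter_climbing() {
  grep 'usage-reporter-cycle-ok' | grep -o 'cumulative=[0-9.]*' | cut -d= -f2 |
    awk 'NR == 1 { first = $1; prev = $1; next }
         $1 < prev { print "regressed"; bad = 1; exit }
         { prev = $1 }
         END {
           if (bad) exit 1
           if (NR < 2) { print "too few cycles"; exit 1 }
           if (prev > first) print "climbing"; else print "flat (idle node?)"
         }'
}
```

A drop right after the billing-cycle anchor may be a legitimate reset rather than a bug; check the date before acting.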
§7 Troubleshooting
iptables-nft incompatibility (Ubuntu 20.04)
Symptom: sidecar loops Failed to get IPv4 address: all ipv4 services failed (i/o timeout); host curl fine but docker exec net calls fail.
Cause: iptables on nf_tables backend; Docker NAT needs legacy. MASQUERADE fails silently.
Detect: iptables --version → BAD (nf_tables), GOOD (legacy).
Fix:
```bash
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo systemctl restart docker
cd /apps/k2s && sudo docker compose up -d --remove-orphans
```
(provision-node.sh step 2 handles this for new nodes.)
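The detection step as a sketch that classifies the `iptables --version` string. The helper is hypothetical; any output without `(nf_tables)` or `(legacy)` is reported as unknown:

```bash
# Sketch: classify `iptables --version` output from stdin (hypothetical helper).
# On the node: iptables --version | iptables_backend
iptables_backend() {
  case $(cat) in
    *'(nf_tables)'*) echo 'BAD: nf_tables, switch to legacy'; return 1 ;;
    *'(legacy)'*)    echo 'GOOD: legacy' ;;
    *)               echo 'unknown: inspect manually'; return 2 ;;
  esac
}
```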
Sidecar restart loop (no registration)
docker compose ps shows sidecar restarting; logs repeat Detecting missing network info… → error → restart. Causes: (1) container net broken (above), (2) missing/invalid K2_NODE_SECRET, (3) K2_CENTER_URL unreachable. Diagnose via sidecar logs + the container-net check (§6 #5).
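A sketch that walks the three causes in the order listed, reading a saved sidecar log and `.env`. It only tests that `K2_NODE_SECRET` is non-empty (with `grep -q`, never printing it), so an *invalid* secret still needs a Center-side check; `sidecar_triage` is hypothetical:

```bash
# Sketch: triage a sidecar restart loop (hypothetical helper).
# sidecar_triage SIDECAR_LOG ENV_FILE
#   e.g. docker logs --tail 50 k2-sidecar > /tmp/sidecar.log 2>&1
#        sidecar_triage /tmp/sidecar.log /apps/k2s/.env
sidecar_triage() {
  local log=$1 env=$2
  if grep -q 'Failed to get IPv4 address' "$log"; then
    echo 'cause 1: container network (see iptables-nft fix)'
  elif ! grep -qE '^K2_NODE_SECRET=.+' "$env"; then          # presence only; value never shown
    echo 'cause 2: K2_NODE_SECRET missing or empty'
  elif ! grep -qE '^K2_CENTER_URL=https?://.+' "$env"; then
    echo 'cause 3: K2_CENTER_URL missing or malformed'
  else
    echo 'env looks complete: test K2_CENTER_URL reachability from inside the container'
  fi
}
```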
§8 Cloud Provider Notes
- AWS nodes are Lightsail, not EC2. Use `aws lightsail …` (`get-instances`, `reboot-instance`), not `aws ec2`. Profile `default` reaches the Lightsail account.
- Lightsail data-transfer billing = calendar month + first-month proration + `max(in,out)`. This drives how `K2_NODE_BILLING_START_DATE` / `K2_NODE_TRAFFIC_LIMIT_GB` must be set. Full rules + worked examples in `references/metering.md`.
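A worked-arithmetic sketch of the two Lightsail rules. `max(in,out)` is as stated above; the proration formula (allowance × remaining days including the creation day ÷ days in month, rounded down) is an ASSUMPTION for illustration only, and the function names are hypothetical. The authoritative rule is in `references/metering.md`:

```bash
# Worked-arithmetic sketch. max(in,out) is the rule stated above; the proration
# formula is an ASSUMPTION for illustration (see references/metering.md for the real rule).

# billable_gib IN_GIB OUT_GIB -> the larger direction counts
billable_gib() {
  local inbound=$1 outbound=$2
  echo $(( inbound > outbound ? inbound : outbound ))
}

# prorated_allowance_gib ALLOWANCE CREATED_DAY DAYS_IN_MONTH (integer GiB, rounded down)
prorated_allowance_gib() {
  local allowance=$1 created_day=$2 days_in_month=$3
  echo $(( allowance * (days_in_month - created_day + 1) / days_in_month ))
}

billable_gib 300 420               # 420
prorated_allowance_gib 1024 16 30  # 512 under the assumption above
```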
Reference files (Read when needed)
- `references/provisioning.md`: provision a NEW private node end-to-end from a Center work item (claim intent → create VPS → OS prep → `.env` → deploy → self-register → verify). It's the §4 deploy + a Center wrapper + claim env.
- `references/metering.md`: build/push the sidecar image (Part A), per-provider quota knobs (Part B), configure quota & seed mid-cycle usage incl. proration calc (Part C), AWS Lightsail billing rules, observe the recorder (Part D), operate & test the cutoff (Part E).
- `references/device-logs.md`: triage a user's uploaded device logs; MCP query/download tools + the k2 client DIAG-log analysis workflow.