kaitu-support

star 0

Use when triaging Kaitu VPN feedback tickets (无法连接 / 速度慢 / 应用受影响 / App 卡死 / 订阅失败). Maps user symptoms to specific k2 DIAG log patterns, distinguishes client-only vs server-required diagnoses, and enforces confidence floors for performance complaints.

kaitu-io By kaitu-io schedule Updated 4/30/2026

name: kaitu-support description: Use when triaging Kaitu VPN feedback tickets (无法连接 / 速度慢 / 应用受影响 / App 卡死 / 订阅失败). Maps user symptoms to specific k2 DIAG log patterns, distinguishes client-only vs server-required diagnoses, and enforces confidence floors for performance complaints. triggers: - support ticket - feedback ticket - user issue - device log - diagnose - troubleshoot - customer support - 无法连接 - 速度慢 - k2 log - DIAG

Kaitu Technical Support

Use this skill when triaging user feedback tickets. All operations use kaitu-center MCP tools.

Available Tools

Tool Purpose
lookup_user Find user by email or UUID
list_user_devices List user's registered devices
query_device_logs Find device logs in S3 (filter by feedback_id to link to a ticket)
download_device_log Download + extract log files to /tmp/kaitu-device-logs/
query_feedback_tickets Search feedback tickets
list_ticket_replies List all replies on a ticket
reply_feedback_ticket Reply to user (triggers aggregated email after 5min)
resolve_feedback_ticket Mark ticket as resolved
close_feedback_ticket Close ticket (not actionable)

Triage Workflow

Step 1 — Identify the Ticket

query_feedback_tickets(id=<ticket_id>)

Extract: userId, email, meta.os, meta.appVersion, meta.vpnState, content, logCount.

Step 2 — User Context

lookup_user(uuid=<user_uuid>)        # membership, plan, cloud instance
list_user_devices(uuid=<user_uuid>)  # UDID list, app versions, last seen

Step 3 — Read Prior Replies

list_ticket_replies(id=<ticket_id>)

Avoid duplicate responses.

Step 4 — Pull Device Logs

query_device_logs(feedback_id=<feedback_uuid>)   # preferred — links log to this ticket
download_device_log(s3_key=<key>)                # extracts to /tmp/kaitu-device-logs/

Desktop logs: desktop.log (Tauri/Rust shell) + system--k2.log (Go core — DIAG events live here). Mobile logs: platform-specific, same DIAG events.

Step 5 — Quick Diag

bash scripts/k2-quick-diag.sh /tmp/kaitu-device-logs/<dir>/system--k2.log

Run this FIRST — it prints last heartbeat, DIAG event counts, health transitions, panics.

Step 6 — Symptom-Driven Investigation

Pick the row matching the user's complaint. Run the greps against system--k2.log (or mobile equivalent).

User symptom Primary grep What to conclude
无法连接 grep "DIAG: transport-race-fail|wire-handshake-fail|wire-error|DIAG: connected" No DIAG: connected after attempts → never handshaked. transport-race-fail with all three of quic443Err/quicHopErr/tcpwsErr populated → all transports blocked (likely GFW escalation or ISP). wire-error code 570/503 → server-side; 401/402/403 → account/auth.
速度慢 grep "DIAG: heartbeat" | tail -20 + grep "DIAG: dns-slow|proxy-dial-slow|udp-relay-timeout|transport-switch" Heartbeat loss/rttMs/fallback tell the story: fallback=true = TCP-WS degraded, loss>0.05 = lossy link. Many udp-relay-timeout or transport-switch = UDP hostile network. Client-only caps at 5/10 — MUST do §8 server-log correlation to go higher.
连接不稳定 grep "DIAG: wake|transport-rerace|echo-probe-fail|transport-switch" DIAG: wake sleepS=... → system sleep caused the break (expected). transport-rerace with reason=3-echo-fails → link silently died, re-raced. Repeated transport-switch QUIC↔TCP-WS → flaky UDP path.
微信/WhatsApp/通话受影响 grep "DIAG: dns-proxy-|udp-relay-timeout|DIAG: proxy-dial-fail" and filter by dest= matching the app's domain Voice/video need UDP: udp-relay-timeout correlated to the app's relay hosts = UDP starved. Many dns-proxy-recv-no-callback = DNS proxy overloaded, domains never resolved. proxy-dial-fail on specific dest = rule routed that host wrong.
App crash / VPN 把手机卡死 grep -i panic then grep "DIAG: heartbeat" | tail -5 and grep "DIAG: pipe-watchdog" Panic stack → CLIENT_BUG. Heartbeats suddenly stop (no 30s tick) = daemon hung. pipe-watchdog firstExitDir=... = stuck half-closed pipe was force-closed. For iOS "梯子卡住导致手机断网" — the Network Extension didn't cleanly tear down; check session-end is logged.
订阅 / 节点刷新失败 grep "DIAG: subs-refresh-fail" + check Center /api/subs directly endpoint+err fields are definitive. If endpoint returns 5xx → SERVER_ISSUE; if TLS handshake fails → network / GFW block on the subs domain.
[Auto] bad connection experience Ticket body already names Server, Duration, Rule. Grep the last DIAG: session-end and the 20 events before it. Semantics: [Auto] = user manually clicked disconnect AND rated session bad on the post-disconnect prompt. The tunnel did successfully connect (otherwise no rating prompt) — Duration is how long the user actively used it before deciding it was bad. Bad rating reflects subjective in-session experience, not connect failure. Reads: Duration=0–10s repeated across multiple tickets = user connected, immediately found nothing worked, disconnected — most likely cause is post-handshake breakage (e.g., DNS-via-proxy timing out, proxy-dial-fail storm, udp-relay-timeout). Duration minutes-to-hours = quality degraded mid-session — check heartbeat trends, transport-switch, transport-rerace, server-side k2s.log loss. Do NOT classify these as "couldn't connect" — connect succeeded; the issue is what happened after.
登录失败 / 收不到验证码 Don't read k2 logs — this is Center API, not k2 tunnel. Invoke the center-ops skill to grep Center app logs for /api/auth/code, /api/auth/login, /api/auth/web-login on the user's email. MUST query BOTH center-1 (35.77.181.30) and center-2 (13.230.22.35) — ALB load-balances, the record may live on either server, not both. Also check lookup_user for isFirstOrderDone / isActivated to see if the account eventually did log in and purchase. If no matching record on either server → code truly wasn't sent (check rate-limit / bounce). If code was 200 + login succeeded later → transient provider delay (especially @qq.com / @outlook), often self-heals. If user insists they didn't get code but log shows 200 → confirm email spelling (typo like wrong prefix / wrong domain); common cause: friend-reports-for-friend mix-up.
已付款但会员未到账 / 能退吗我微信支付的 Don't read k2 logs — payment flow, not tunnel. Follow reference_wordgate_webhook_integration.md in memory. Core query: JOIN kaitu.orders k ON wordgate.orders ww.is_paid=1 AND (k.is_paid=0 OR NULL). Grep /apps/wordgate/wordgate.log 看 WordGate 那边是否收到 Stripe 通知。Grep Kaitu /apps/kaitu/logs/app.log 两台都查 — ALB 轮询,webhook 只落其中一台。关键 reqId 链:[Webhook] received → MarkOrderAsPaid → addProExpiredDays → status:500/200 WordGate 付款成 + Kaitu 未到账 = webhook 没处理完。两种根因:(a) Kaitu 返 5xx(binary 有 bug / schema 漂移 / 死锁)+ WordGate SQS fallback 凭据 AKIASWWJ4TKXW7XCPUGPInvalidClientTokenId → 通知永久丢;(b) 真的没发(少见,看 WordGate log)。处理:先修 Kaitu 侧根因,再用 reference memory 里的 curl 模板 replay webhook(state-changing,走 center-deploy)。Replay 是幂等的(FOR UPDATE + localIsPaid 检查)。不要直接改 DB — 会跳过返现/邀请奖励/tier 同步。
macOS 11.x 不支持 Don't read logs. Tauri v2 requires macOS 12+. Reply with the supported range; no fix.

Step 7 — Confidence Ladder (MANDATORY)

Confidence is a function of how many independent sources confirm the same root cause. State current tier and max-reachable-tier before replying.

Tier 1 (≤ 5/10)  Client log only
Tier 2 (≤ 7/10)  + Server k2s.log from §8 covering the same time window, same user IP
Tier 3 (≤ 9/10)  + Code read at user's exact commit/version from §9, symptom reproduces in code path
Tier 4 (= 10/10) + Panic stack trace pointing to a specific line in §9-resolved code

Per-symptom caps on top of the ladder:

Symptom Max tier without server log Rationale
速度慢 / 不稳定 / 丢包 Tier 1 only (5/10) Client rxMB = direct + proxy mixed; client loss = uplink only. Downlink truth lives in k2s.log.
无法连接 with clear transport-race-fail (all 3 transports err set) Tier 2 (7/10) client-only OK All-transport failure is conclusive GFW/ISP signal; server log adds little.
无法连接, partial evidence Tier 1 (5/10) Need server log to confirm client never reached node.
微信/WhatsApp 受影响 Tier 1 (5/10) UDP starvation may be server-side; must cross-check k2s.log.
App crash / panic Tier 4 possible if stack + code match Panic is self-contained evidence.
Login / verification code Up to 9/10 from Center API logs + DB Not a k2 issue — different evidence chain.
已付款未到账 Up to 10/10 from kaitu.orders ⟷ wordgate.orders JOIN + WordGate + Kaitu logs with full reqId trace Two authoritative DBs; root cause usually visible in one side's log. See reference_wordgate_webhook_integration.md.
No evidence UNKNOWN — ask user for specifics, do NOT resolve

Hard rules:

  • Below Tier 2 for any non-panic complaint → MUST escalate to §8 before resolving.
  • Below Tier 3 for any reply that names a code-level cause → MUST do §9 first.
  • Stay UNKNOWN if evidence doesn't triangulate. claude-supportclaude; resolving at UNKNOWN is a process violation.

Step 8 — Server-Side Log Correlation (§8 escalation)

Invoked when Tier 1 is insufficient. Requires kaitu-node-ops skill for node ops.

8.1 — Identify the node

# From client log
grep "DIAG: connected" system--k2.log | tail -5
# → server=www.<province>.people.cn  ← SNI cover, NOT real geography

Resolve real node: list_nodes() (kaitu-center MCP), match the server domain to a node record, take ip field. The province-cover domain maps to a specific node via tunnels[].sniDomain.

8.2 — Identify the time window + user IP From client log: connect time (DIAG: connected timestamp) and session end (DIAG: session-end). Convert to node's timezone (most nodes: UTC or local TZ from list_nodes metadata).

k2s.log is indexed by client public IP, not UDID. If the user IP isn't known:

  • Check if ticket meta leaked it (rare)
  • Cross-match by DNS-fingerprint timing: pick a uniquely-timed DNS query from client log (e.g. a 02:14:23.451 lookup for a rare domain), grep k2s.log in the ±2s window for a matching incoming request. See memory reference_udid_to_public_ip.md.

8.3 — Pull the real k2s.log Do NOT use docker logs k2s — that's only the 5-line startup tail (see memory reference_k2s_log_location.md). Real log: /apps/k2s/logs/k2s.log on the node.

# Via kaitu-node-ops (exec_on_node):
exec_on_node(
  ip=<node-ip>,
  command="grep -E '<user-ip>' /apps/k2s/logs/k2s.log | awk '$0 >= \"<YYYY-MM-DDTHH:MM:SS>\" && $0 <= \"<YYYY-MM-DDTHH:MM:SS>\"' | head -200"
)
# For rotated days:
exec_on_node(ip=<node-ip>, command="zgrep '<user-ip>' /apps/k2s/logs/k2s-*.log.gz | head -100")

8.4 — Read what matters

  • mode=app-limited loss=0 → user connected but never pushed real traffic (not a VPN problem — see memory reference_k2cc_app_limited.md)
  • loss > 0.05 on the server side during the session = real downlink degradation (SERVER_ISSUE or user's ISP path)
  • Any wire-error emitted server-side → already the cause, no more digging needed
  • Ignore sidecar metricsnetIn/conn in sidecar logs reflect the sidecar itself, not user traffic (memory reference_sidecar_metrics_misleading.md)

Raise confidence to Tier 2 when client + server agree on the same fault window. If they disagree (client says loss, server says clean), that IS the finding — it localizes the problem to the path between them (usually user's ISP).

Step 9 — Code at User's Exact Version (§9 escalation)

Invoked when you need to read code to confirm a bug path. Don't read HEAD — it may have diverged from what the user is running.

9.1 — Identify the version From ticket meta:

  • meta.appVersion → e.g. "0.4.3" (always present)
  • meta.commit → e.g. "9e12d0b" (present on 0.4.2+ builds)

Client log also stamps the build at startup — grep -i "build\|version\|commit" system--k2.log | head -5.

9.2 — Read code without mutating the submodule The k2 submodule is read-only from the parent worktree. Do NOT git checkout inside k2/ from here. Use one of:

# A. Read a single file at the user's commit (preferred, zero state change)
cd k2 && git show <commit>:engine/health.go | less

# B. Grep across files at the user's commit
cd k2 && git grep "DIAG: transport-rerace" <commit>

# C. Temporary worktree at the user's commit (for larger investigations)
cd k2 && git worktree add /tmp/k2-at-<commit> <commit>
# ... read files under /tmp/k2-at-<commit>/ ...
cd k2 && git worktree remove /tmp/k2-at-<commit>

If meta.commit is empty (legacy clients): git show v<appVersion>:path using the release tag, or find the commit from git log --grep "release <appVersion>".

9.3 — Trace the symptom Given a DIAG event name, grep the commit to find emit site and callers:

cd k2 && git grep -n "DIAG: <event-name>" <commit>

Cross-reference with the architecture map in k2/CLAUDE.md at that commit.

9.4 — Bug is fixed on main?

cd k2 && git log --oneline <commit>..HEAD -- <file-with-bug>

If a fix commit exists post-<commit>, classification becomes KNOWN_FIXED; identify the release it shipped in (git tag --contains <fix-commit>) and tell the user which version to upgrade to.

Step 10 — Reply

reply_feedback_ticket(id=<ticket_id>, content="...")

Guidelines:

  • Write in the user's language (detect from ticket content).
  • Be concise: state the problem, then solution/workaround.
  • Include specific version numbers when recommending an upgrade.
  • NEVER expose internal infra details — no server IPs, stack traces, DIAG event names, error codes, node hostnames.
  • If user action is required, give clear step-by-step instructions.

Step 11 — Resolve or Close

Situation Action
Diagnosed, reply sent resolve_feedback_ticket(id, resolved_by="claude")
Fixed in later version Reply with version info → resolve_feedback_ticket
Not actionable / spam / feature request out of scope close_feedback_ticket(id)
Cannot determine, need more info Reply asking specifics, do NOT resolve yet

Step 12 — Cleanup

rm -rf /tmp/kaitu-device-logs/<extract-dir>/

k2 DIAG Event Reference (cheat sheet)

Three layers in system--k2.log:

  • Heartbeat (every 30s): DIAG: heartbeat health=... transport=... loss=... rttMs=... fallback=... heapMB=... goroutines=...
  • Events (threshold-gated): DIAG: <kebab-name> with context fields
  • DEBUG (off by default): full per-operation trace

Full reserved event table lives in k2/CLAUDE.md § Diagnostic Logging Constitution. Common ones surface-relevant for support:

  • Connection: connected, session-end, transport-race-start/winner/fail, wire-handshake, wire-handshake-fail, wire-error
  • Runtime health: heartbeat, wake, transport-switch, transport-rerace, echo-probe-fail
  • DNS: dns-slow, dns-fail, dns-proxy-timeout, dns-proxy-recv-no-callback, dns-proxy-conn-dead
  • Proxy: proxy-dial-fail, proxy-dial-slow, udp-relay-timeout
  • Subs / misc: subs-refresh-fail, pipe-watchdog, datagram-readloop-exit

Diagnostic Anti-Patterns (do not make these mistakes)

"节点远 → 速度慢 / 卡顿" — WRONG

Distance to a node only adds RTT (latency); it does NOT cause throughput loss, app hangs, or pipe-watchdog bursts. Kaitu's k2cc congestion control + BBR are designed for long links — a 200ms-RTT path to AU is not "slower" than a 60ms-RTT path to HK in terms of bandwidth or reliability. RTT 100ms vs 250ms is a UX nuance for interactive apps, not a root cause for connectivity failure.

Never recommend "switch to a closer node" as a fix for:

  • Slow downloads / video buffering
  • Apps not loading
  • pipe-watchdog / proxy-dial-fail bursts
  • Goroutine spikes
  • "VPN connected but nothing works"

If the heartbeat shows healthy loss=0 and stable rttMs, distance is not the cause — keep digging. Real causes for those symptoms: server-side egress saturation, destination geo-blocking, GFW interference on the specific path, or app-side issues. None of those are fixed by picking a node 100ms closer.

Acceptable distance-related advice is narrow: only mention RTT when the user's complaint is explicitly about interactive latency (gaming ping, voice call lag, SSH responsiveness) — and even then, frame it as "lower RTT improves interactive feel", not "fixes speed".

Classification Guide

Classification Meaning Reply Template
CLIENT_BUG Bug in app code (panic, logic error) Acknowledge + workaround if any + "will fix in next version"
CLIENT_CONFIG User config issue (wrong mode, wrong server) Step-by-step fix instructions
SERVER_ISSUE Server/node problem (confirmed via k2s.log) "We've identified the issue and are working on it"
NETWORK User's ISP / network (GFW, ISP throttling, captive portal) Network troubleshooting — restart router, try different network
KNOWN_FIXED Fixed in later version "Please update to version X.Y.Z"
PLATFORM_UNSUPPORTED macOS <12, old iOS, etc. State supported range
NOT_K2_ISSUE Login / verification code / account / billing Route to Center API diagnosis, not k2 logs
UNKNOWN Cannot determine Ask user for specifics — do NOT resolve

Safety Rules

  • NEVER expose internal details to users (server IPs, DIAG events, stack traces, error codes, node names).
  • NEVER modify code during diagnosis — read-only analysis only.
  • Follow the §7 confidence ladder. Do not resolve below Tier 2 for non-panic complaints.
  • Always check list_ticket_replies before replying to avoid duplicates.
  • Clean up /tmp/kaitu-device-logs/ after finishing.
  • 登录 / 验证码 类工单不要花时间读 k2 日志 — 和 k2 核心无关。
Install via CLI
npx skills add https://github.com/kaitu-io/k2app --skill kaitu-support
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator