---
name: kaitu-support
description: Use when triaging Kaitu VPN feedback tickets (无法连接 / 速度慢 / 应用受影响 / App 卡死 / 订阅失败, i.e. cannot connect / slow speed / apps affected / app freeze / subscription failure). Maps user symptoms to specific k2 DIAG log patterns, distinguishes client-only vs server-required diagnoses, and enforces confidence floors for performance complaints.
triggers:
- support ticket
- feedback ticket
- user issue
- device log
- diagnose
- troubleshoot
- customer support
- 无法连接
- 速度慢
- k2 log
- DIAG
---
Kaitu Technical Support
Use this skill when triaging user feedback tickets. All operations use kaitu-center MCP tools.
Available Tools
| Tool | Purpose |
|---|---|
| `lookup_user` | Find user by email or UUID |
| `list_user_devices` | List user's registered devices |
| `query_device_logs` | Find device logs in S3 (filter by feedback_id to link to a ticket) |
| `download_device_log` | Download + extract log files to /tmp/kaitu-device-logs/ |
| `query_feedback_tickets` | Search feedback tickets |
| `list_ticket_replies` | List all replies on a ticket |
| `reply_feedback_ticket` | Reply to user (triggers aggregated email after 5min) |
| `resolve_feedback_ticket` | Mark ticket as resolved |
| `close_feedback_ticket` | Close ticket (not actionable) |
Triage Workflow
Step 1 — Identify the Ticket
query_feedback_tickets(id=<ticket_id>)
Extract: userId, email, meta.os, meta.appVersion, meta.vpnState, content, logCount.
Step 2 — User Context
lookup_user(uuid=<user_uuid>) # membership, plan, cloud instance
list_user_devices(uuid=<user_uuid>) # UDID list, app versions, last seen
Step 3 — Read Prior Replies
list_ticket_replies(id=<ticket_id>)
Avoid duplicate responses.
Step 4 — Pull Device Logs
query_device_logs(feedback_id=<feedback_uuid>) # preferred — links log to this ticket
download_device_log(s3_key=<key>) # extracts to /tmp/kaitu-device-logs/
Desktop logs: desktop.log (Tauri/Rust shell) + system--k2.log (Go core — DIAG events live here).
Mobile logs: platform-specific, same DIAG events.
Step 5 — Quick Diag
bash scripts/k2-quick-diag.sh /tmp/kaitu-device-logs/<dir>/system--k2.log
Run this FIRST — it prints last heartbeat, DIAG event counts, health transitions, panics.
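If the script is not at hand, the event-count part of its output can be approximated manually. The function below is a minimal sketch, not the script's actual implementation: `diag_event_counts` is a hypothetical name, and it assumes every event is written as `DIAG: <kebab-name>`.

```bash
# Hypothetical helper (not part of scripts/k2-quick-diag.sh).
# Counts DIAG events by name, assuming each is logged as "DIAG: <kebab-name>".
diag_event_counts() {
  awk '
    match($0, /DIAG: [a-z0-9-]+/) {
      # Skip the 6-character "DIAG: " prefix to keep only the event name.
      name = substr($0, RSTART + 6, RLENGTH - 6)
      count[name]++
    }
    END { for (name in count) print count[name], name }
  ' "$1" | sort -k1,1nr -k2,2
}

# Usage: diag_event_counts /tmp/kaitu-device-logs/<dir>/system--k2.log
```

Output is one `count name` pair per line, most frequent first, which is enough to spot an event storm before reading individual lines.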
Step 6 — Symptom-Driven Investigation
Pick the row matching the user's complaint. Run the greps against system--k2.log (or mobile equivalent).
| User symptom | Primary grep | What to conclude |
|---|---|---|
| 无法连接 (cannot connect) | `grep -e "DIAG: transport-race-fail" -e "wire-handshake-fail" -e "wire-error" -e "DIAG: connected"` | No `DIAG: connected` after attempts → the handshake never completed. `transport-race-fail` with all three of `quic443Err`/`quicHopErr`/`tcpwsErr` populated → all transports blocked (likely GFW escalation or ISP). `wire-error` code 570/503 → server-side; 401/402/403 → account/auth. |
| 速度慢 (slow speed) | `grep "DIAG: heartbeat" \| tail -20`, plus `grep -e "DIAG: dns-slow" -e "proxy-dial-slow" -e "udp-relay-timeout" -e "transport-switch"` | Heartbeat `loss`/`rttMs`/`fallback` tell the story: `fallback=true` = degraded to TCP-WS, `loss>0.05` = lossy link. Many `udp-relay-timeout` or `transport-switch` events = UDP-hostile network. Client-only evidence caps at 5/10; MUST do §8 server-log correlation to go higher. |
| 连接不稳定 (unstable connection) | `grep -e "DIAG: wake" -e "transport-rerace" -e "echo-probe-fail" -e "transport-switch"` | `DIAG: wake sleepS=...` → system sleep caused the break (expected). `transport-rerace` with `reason=3-echo-fails` → the link died silently and was re-raced. Repeated `transport-switch` QUIC↔TCP-WS → flaky UDP path. |
| 微信/WhatsApp/通话受影响 (WeChat / WhatsApp / calls affected) | `grep -e "DIAG: dns-proxy-" -e "udp-relay-timeout" -e "DIAG: proxy-dial-fail"`, then filter by `dest=` matching the app's domain | Voice/video need UDP: `udp-relay-timeout` correlated with the app's relay hosts = UDP starved. Many `dns-proxy-recv-no-callback` = DNS proxy overloaded, domains never resolved. `proxy-dial-fail` on a specific `dest` = a rule routed that host wrong. |
| App crash / VPN 把手机卡死 (VPN freezes the phone) | `grep -i panic`, then `grep "DIAG: heartbeat" \| tail -5` and `grep "DIAG: pipe-watchdog"` | Panic stack → CLIENT_BUG. Heartbeats suddenly stop (no 30s tick) = daemon hung. `pipe-watchdog firstExitDir=...` = a stuck half-closed pipe was force-closed. For iOS "梯子卡住导致手机断网" ("the VPN got stuck and the phone lost network"): the Network Extension didn't tear down cleanly; check that `session-end` is logged. |
| 订阅 / 节点刷新失败 (subscription / node refresh failure) | `grep "DIAG: subs-refresh-fail"` + check Center /api/subs directly | The `endpoint` and `err` fields are definitive. If the endpoint returns 5xx → SERVER_ISSUE; if the TLS handshake fails → network / GFW block on the subs domain. |
| [Auto] bad connection experience | Ticket body already names Server, Duration, Rule. Grep the last `DIAG: session-end` and the 20 events before it. | Semantics: [Auto] = user manually clicked disconnect AND rated the session bad on the post-disconnect prompt. The tunnel did connect successfully (otherwise there is no rating prompt); Duration is how long the user actively used it before deciding it was bad. A bad rating reflects subjective in-session experience, not connect failure. How to read it: Duration=0–10s repeated across multiple tickets = user connected, immediately found nothing worked, and disconnected; the most likely cause is post-handshake breakage (e.g., DNS-via-proxy timing out, a proxy-dial-fail storm, udp-relay-timeout). Duration of minutes to hours = quality degraded mid-session; check heartbeat trends, transport-switch, transport-rerace, and server-side k2s.log loss. Do NOT classify these as "couldn't connect": connect succeeded; the issue is what happened after. |
| 登录失败 / 收不到验证码 (login failure / verification code not received) | Don't read k2 logs; this is the Center API, not the k2 tunnel. Invoke the center-ops skill to grep Center app logs for /api/auth/code, /api/auth/login, /api/auth/web-login on the user's email. MUST query BOTH center-1 (35.77.181.30) and center-2 (13.230.22.35): the ALB load-balances, so the record may live on either server, not both. Also check lookup_user for isFirstOrderDone / isActivated to see whether the account eventually did log in and purchase. | If there is no matching record on either server → the code truly wasn't sent (check rate-limit / bounce). If the code request returned 200 and login succeeded later → transient provider delay (especially @qq.com / @outlook), often self-heals. If the user insists they didn't get the code but the log shows 200 → confirm the email spelling (typos such as a wrong prefix or wrong domain); a common cause is one person reporting on behalf of a friend and mixing up the address. |
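| 已付款但会员未到账 / 能退吗我微信支付的 (paid but membership not credited / "can I get a refund, I paid with WeChat Pay") | Don't read k2 logs; this is the payment flow, not the tunnel. Follow reference_wordgate_webhook_integration.md in memory. Core query: JOIN kaitu.orders k ON wordgate.orders w to find rows with w.is_paid=1 AND (k.is_paid=0 OR NULL). Grep /apps/wordgate/wordgate.log to see whether WordGate received the Stripe notification. Grep Kaitu /apps/kaitu/logs/app.log on both servers: the ALB round-robins, so the webhook lands on only one of them. Key reqId chain: [Webhook] received → MarkOrderAsPaid → addProExpiredDays → status:500/200. | WordGate shows paid + Kaitu not credited = the webhook was not fully processed. Two root causes: (a) Kaitu returned 5xx (binary bug / schema drift / deadlock) and the WordGate SQS fallback credential AKIASWWJ4TKXW7XCPUGP is already InvalidClientTokenId → the notification is permanently lost; (b) the notification was truly never sent (rare; check the WordGate log). Handling: fix the Kaitu-side root cause first, then replay the webhook with the curl template from the reference memory (state-changing, so go through center-deploy). Replay is idempotent (FOR UPDATE + localIsPaid check). Do not edit the DB directly; that skips cashback / invite rewards / tier sync. |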
| macOS 11.x 不支持 (not supported) | Don't read logs. | Tauri v2 requires macOS 12+. Reply with the supported range; no fix. |
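For the 无法连接 row, the `wire-error` code split can be written down as a lookup. This is a minimal sketch: `classify_wire_code` is a hypothetical name, and only the five codes named in the table are mapped.

```bash
# Hypothetical helper: maps a wire-error code to the conclusion given in the table.
classify_wire_code() {
  case "$1" in
    570|503)     echo "server-side" ;;
    401|402|403) echo "account/auth" ;;
    *)           echo "unclassified" ;;  # not covered by the table; do not guess
  esac
}

# Usage: classify_wire_code 570
```

Any other code falls through to `unclassified` on purpose, so an unfamiliar value is investigated rather than forced into one of the two buckets.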
Step 7 — Confidence Ladder (MANDATORY)
Confidence is a function of how many independent sources confirm the same root cause. State current tier and max-reachable-tier before replying.
Tier 1 (≤ 5/10) Client log only
Tier 2 (≤ 7/10) + Server k2s.log from §8 covering the same time window, same user IP
Tier 3 (≤ 9/10) + Code read at user's exact commit/version from §9, symptom reproduces in code path
Tier 4 (= 10/10) + Panic stack trace pointing to a specific line in §9-resolved code
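Read as a cumulative ladder, each tier requires all the evidence of the tier below it. A minimal sketch of that reading follows; `confidence_cap` is a hypothetical helper, and it deliberately ignores the per-symptom caps, which are a separate layer.

```bash
# Hypothetical helper. Each argument is 1 if that evidence is in hand, else 0:
#   $1 client log
#   $2 server k2s.log (same time window, same user IP)
#   $3 code read at the user's exact commit/version
#   $4 panic stack trace pointing to a specific line
confidence_cap() {
  [ "$1" = 1 ] || { echo "UNKNOWN"; return 0; }
  tier=1; cap=5
  if [ "$2" = 1 ]; then
    tier=2; cap=7
    if [ "$3" = 1 ]; then
      tier=3; cap=9
      if [ "$4" = 1 ]; then tier=4; cap=10; fi
    fi
  fi
  echo "tier=$tier cap=$cap/10"
}

# Usage: confidence_cap 1 1 0 0
```

The printed cap is an upper bound only; the actual score can still be lower when the sources do not point to the same root cause.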
Per-symptom caps on top of the ladder:
| Symptom | Max tier without server log | Rationale |
|---|---|---|
| 速度慢 / 不稳定 / 丢包 (slow / unstable / packet loss) | Tier 1 only (5/10) | Client rxMB = direct + proxy mixed; client loss = uplink only. Downlink truth lives in k2s.log. |
| 无法连接 (cannot connect) with clear transport-race-fail (all 3 transport errors set) | Tier 2 (7/10), client-only OK | All-transport failure is a conclusive GFW/ISP signal; server log adds little. |
| 无法连接, partial evidence | Tier 1 (5/10) | Need server log to confirm client never reached node. |
| 微信/WhatsApp 受影响 (WeChat / WhatsApp affected) | Tier 1 (5/10) | UDP starvation may be server-side; must cross-check k2s.log. |
| App crash / panic | Tier 4 possible if stack + code match | Panic is self-contained evidence. |
| Login / verification code | Up to 9/10 from Center API logs + DB | Not a k2 issue; different evidence chain. |
| 已付款未到账 (paid but not credited) | Up to 10/10 from kaitu.orders ⟷ wordgate.orders JOIN + WordGate + Kaitu logs with full reqId trace | Two authoritative DBs; root cause usually visible in one side's log. See reference_wordgate_webhook_integration.md. |
| No evidence | UNKNOWN: ask user for specifics, do NOT resolve | |
Hard rules:
- Below Tier 2 for any non-panic complaint → MUST escalate to §8 before resolving.
- Below Tier 3 for any reply that names a code-level cause → MUST do §9 first.
- Stay UNKNOWN if evidence doesn't triangulate. `claude-support≠claude`; resolving at UNKNOWN is a process violation.
Step 8 — Server-Side Log Correlation (§8 escalation)
Invoked when Tier 1 is insufficient. Requires kaitu-node-ops skill for node ops.
8.1 — Identify the node
# From client log
grep "DIAG: connected" system--k2.log | tail -5
# → server=www.<province>.people.cn ← SNI cover, NOT real geography
Resolve real node: list_nodes() (kaitu-center MCP), match the server domain to a node record, take ip field. The province-cover domain maps to a specific node via tunnels[].sniDomain.
8.2 — Identify the time window + user IP
From client log: connect time (DIAG: connected timestamp) and session end (DIAG: session-end). Convert to node's timezone (most nodes: UTC or local TZ from list_nodes metadata).
k2s.log is indexed by client public IP, not UDID. If the user IP isn't known:
- Check if ticket meta leaked it (rare)
- Cross-match by DNS-fingerprint timing: pick a uniquely-timed DNS query from the client log (e.g. a 02:14:23.451 lookup for a rare domain), then grep `k2s.log` in the ±2s window for a matching incoming request. See memory `reference_udid_to_public_ip.md`.
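The ±2s window is easy to get wrong by hand near minute boundaries. A minimal sketch of the arithmetic follows; `window_bounds` is a hypothetical helper, it drops the fractional seconds, and it clamps at midnight instead of crossing days.

```bash
# Hypothetical helper: $1 = HH:MM:SS[.mmm] from the client log, $2 = half-width in seconds.
# Prints "<start> <end>" as HH:MM:SS, clamped to the same day.
window_bounds() {
  echo "$1" | awk -F: -v w="$2" '{
    t  = $1 * 3600 + $2 * 60 + int($3)   # seconds since midnight
    lo = t - w; if (lo < 0) lo = 0
    hi = t + w; if (hi > 86399) hi = 86399
    printf "%02d:%02d:%02d %02d:%02d:%02d\n",
      int(lo / 3600), int(lo % 3600 / 60), lo % 60,
      int(hi / 3600), int(hi % 3600 / 60), hi % 60
  }'
}

# Usage: window_bounds 02:14:23.451 2
```

Both values are in the client's clock; convert them to the node's timezone before grepping the server log.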
8.3 — Pull the real k2s.log
Do NOT use docker logs k2s — that's only the 5-line startup tail (see memory reference_k2s_log_location.md). Real log: /apps/k2s/logs/k2s.log on the node.
# Via kaitu-node-ops (exec_on_node). grep -F treats the dots in the IP literally.
exec_on_node(
  ip=<node-ip>,
  command="grep -F '<user-ip>' /apps/k2s/logs/k2s.log | awk '$0 >= \"<YYYY-MM-DDTHH:MM:SS>\" && $0 <= \"<YYYY-MM-DDTHH:MM:SS>\"' | head -200"
)
# For rotated days:
exec_on_node(ip=<node-ip>, command="zgrep -F '<user-ip>' /apps/k2s/logs/k2s-*.log.gz | head -100")
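The same IP-plus-window filter can be rehearsed locally on a copied log excerpt before it is run on a node. A minimal sketch follows, assuming each k2s.log line begins with a `YYYY-MM-DDTHH:MM:SS` timestamp, which is what the string comparison in the node command also relies on; `filter_window` is a hypothetical name.

```bash
# Hypothetical helper: $1 = log file, $2 = client public IP,
# $3 = window start, $4 = window end (both YYYY-MM-DDTHH:MM:SS).
filter_window() {
  # -F: match the IP literally; -w: do not match it inside a longer address.
  grep -Fw -- "$2" "$1" | awk -v s="$3" -v e="$4" '
    {
      ts = substr($0, 1, 19)   # compare on the timestamp only, ignoring fractions
      if (ts >= s && ts <= e) print
    }'
}

# Usage: filter_window k2s.log <user-ip> 2024-05-01T10:00:00 2024-05-01T10:05:00
```

Comparing only the first 19 characters keeps a line stamped exactly at the end second inside the window even when the log carries fractional seconds.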
8.4 — Read what matters
- `mode=app-limited loss=0` → user connected but never pushed real traffic (not a VPN problem; see memory `reference_k2cc_app_limited.md`)
- `loss > 0.05` on the server side during the session = real downlink degradation (SERVER_ISSUE or user's ISP path)
- Any `wire-error` emitted server-side → already the cause, no more digging needed
- Ignore sidecar metrics: `netIn`/`conn` in sidecar logs reflect the sidecar itself, not user traffic (memory `reference_sidecar_metrics_misleading.md`)
Raise confidence to Tier 2 when client + server agree on the same fault window. If they disagree (client says loss, server says clean), that IS the finding — it localizes the problem to the path between them (usually user's ISP).
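The agree/disagree check can be made explicit with the 0.05 loss threshold used elsewhere in this document. A minimal sketch follows; `compare_loss` is a hypothetical helper, and it assumes both loss values are fractions taken from the same fault window.

```bash
# Hypothetical helper: $1 = client-side loss, $2 = server-side loss (fractions, e.g. 0.08).
compare_loss() {
  awk -v c="$1" -v s="$2" -v t=0.05 'BEGIN {
    cl = (c + 0 > t); sl = (s + 0 > t)
    if (cl && sl)       print "agree: lossy on both sides"
    else if (cl && !sl) print "disagree: client lossy, server clean"
    else if (!cl && sl) print "disagree: server lossy, client clean"
    else                print "agree: clean on both sides"
  }'
}

# Usage: compare_loss 0.08 0.00
```

The "client lossy, server clean" outcome is the disagreement case described above: it is a finding in itself and points at the path between client and node.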
Step 9 — Code at User's Exact Version (§9 escalation)
Invoked when you need to read code to confirm a bug path. Don't read HEAD — it may have diverged from what the user is running.
9.1 — Identify the version
From ticket meta:
- `meta.appVersion` → e.g. `"0.4.3"` (always present)
- `meta.commit` → e.g. `"9e12d0b"` (present on 0.4.2+ builds)
Client log also stamps the build at startup — grep -i "build\|version\|commit" system--k2.log | head -5.
9.2 — Read code without mutating the submodule
The k2 submodule is read-only from the parent worktree. Do NOT git checkout inside k2/ from here. Use one of:
# A. Read a single file at the user's commit (preferred, zero state change)
cd k2 && git show <commit>:engine/health.go | less
# B. Grep across files at the user's commit
cd k2 && git grep "DIAG: transport-rerace" <commit>
# C. Temporary worktree at the user's commit (for larger investigations)
cd k2 && git worktree add /tmp/k2-at-<commit> <commit>
# ... read files under /tmp/k2-at-<commit>/ ...
cd k2 && git worktree remove /tmp/k2-at-<commit>
If meta.commit is empty (legacy clients): git show v<appVersion>:path using the release tag, or find the commit from git log --grep "release <appVersion>".
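The commit-or-tag choice can be folded into one step so the read commands stay identical for new and legacy clients. A minimal sketch follows, assuming release tags are named `v<appVersion>` as in the fallback above; `resolve_ref` is a hypothetical name.

```bash
# Hypothetical helper: $1 = meta.commit (may be empty), $2 = meta.appVersion.
# Prints the git ref to read code at.
resolve_ref() {
  if [ -n "$1" ]; then
    echo "$1"
  else
    echo "v$2"   # release tag, as in "git show v<appVersion>:path"
  fi
}

# Usage (read-only, no checkout):
#   (cd k2 && git show "$(resolve_ref "$commit" "$app_version")":engine/health.go)
```

If the tag does not exist, `git show` fails loudly, which is the cue to fall back to searching `git log --grep`.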
9.3 — Trace the symptom
Given a DIAG event name, grep the commit to find the emit site and its callers:
cd k2 && git grep -n "DIAG: <event-name>" <commit>
Cross-reference with the architecture map in k2/CLAUDE.md at that commit.
9.4 — Bug is fixed on main?
cd k2 && git log --oneline <commit>..HEAD -- <file-with-bug>
If a fix commit exists post-<commit>, classification becomes KNOWN_FIXED; identify the release it shipped in (git tag --contains <fix-commit>) and tell the user which version to upgrade to.
Step 10 — Reply
reply_feedback_ticket(id=<ticket_id>, content="...")
Guidelines:
- Write in the user's language (detect from ticket content).
- Be concise: state the problem, then solution/workaround.
- Include specific version numbers when recommending an upgrade.
- NEVER expose internal infra details — no server IPs, stack traces, DIAG event names, error codes, node hostnames.
- If user action is required, give clear step-by-step instructions.
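A draft reply can be screened mechanically before calling reply_feedback_ticket. The sketch below is a hypothetical pre-send check, not an existing tool: it flags IPv4 addresses, `DIAG:` markers, and Go panic/goroutine lines. Error codes and node hostnames have no fixed shape, so they still need a manual read.

```bash
# Hypothetical helper: reads the draft reply on stdin.
# Prints each offending line (with its line number) and returns 1 if any is found.
reply_leak_check() {
  if grep -nE '([0-9]{1,3}\.){3}[0-9]{1,3}|DIAG:|panic:|goroutine [0-9]+'; then
    return 1
  fi
  return 0
}

# Usage: printf '%s\n' "$draft" | reply_leak_check || echo "rewrite before sending"
```

Version numbers such as 0.4.3 pass, because the address pattern requires four dot-separated groups.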
Step 11 — Resolve or Close
| Situation | Action |
|---|---|
| Diagnosed, reply sent | resolve_feedback_ticket(id, resolved_by="claude") |
| Fixed in later version | Reply with version info → resolve_feedback_ticket |
| Not actionable / spam / feature request out of scope | close_feedback_ticket(id) |
| Cannot determine, need more info | Reply asking specifics, do NOT resolve yet |
Step 12 — Cleanup
rm -rf /tmp/kaitu-device-logs/<extract-dir>/
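Because the directory name is substituted by hand, an empty or path-like value would widen what `rm -rf` removes. A minimal guard is sketched below; `cleanup_extract_dir` and the `KAITU_LOG_BASE` override are hypothetical names, and the default base is the extraction directory used throughout this document.

```bash
# Hypothetical wrapper around the rm command. KAITU_LOG_BASE is an assumed
# override (useful for dry runs); it defaults to /tmp/kaitu-device-logs.
cleanup_extract_dir() {
  base="${KAITU_LOG_BASE:-/tmp/kaitu-device-logs}"
  case "$1" in
    ""|.|..|*/*)
      # Empty, dot, or slash-containing names could escape the extract dir.
      echo "refusing to remove '$1'" >&2
      return 1
      ;;
  esac
  rm -rf -- "$base/$1"
}

# Usage: cleanup_extract_dir <extract-dir>
```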
k2 DIAG Event Reference (cheat sheet)
Three layers in system--k2.log:
- Heartbeat (every 30s): `DIAG: heartbeat health=... transport=... loss=... rttMs=... fallback=... heapMB=... goroutines=...`
- Events (threshold-gated): `DIAG: <kebab-name>` with context fields
- DEBUG (off by default): full per-operation trace
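Since heartbeats are space-separated `key=value` pairs, the last few can be summarized without reading them one by one. A minimal sketch follows, using the `loss>0.05` and `fallback=true` readings from the symptom table; `heartbeat_flags` is a hypothetical name, and the exact field formatting is assumed from the pattern shown in the heartbeat bullet.

```bash
# Hypothetical helper: summarizes the last 20 heartbeats in a k2 log.
# Assumes fields appear as space-separated key=value pairs.
heartbeat_flags() {
  grep 'DIAG: heartbeat' "$1" | tail -n 20 | awk '
    {
      loss = ""; fb = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^loss=/)     loss = substr($i, 6)    # text after "loss="
        if ($i ~ /^fallback=/) fb   = substr($i, 10)   # text after "fallback="
      }
      total++
      if (loss + 0 > 0.05) lossy++
      if (fb == "true")    degraded++
    }
    END { printf "heartbeats=%d lossy=%d fallback=%d\n", total, lossy, degraded }'
}

# Usage: heartbeat_flags /tmp/kaitu-device-logs/<dir>/system--k2.log
```

A non-zero `lossy` or `fallback` count is only client-side evidence, so it stays within the Tier 1 cap until the server log confirms it.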
The full reserved event table lives in k2/CLAUDE.md § Diagnostic Logging Constitution. The events most relevant to support:
- Connection: `connected`, `session-end`, `transport-race-start/winner/fail`, `wire-handshake`, `wire-handshake-fail`, `wire-error`
- Runtime health: `heartbeat`, `wake`, `transport-switch`, `transport-rerace`, `echo-probe-fail`
- DNS: `dns-slow`, `dns-fail`, `dns-proxy-timeout`, `dns-proxy-recv-no-callback`, `dns-proxy-conn-dead`
- Proxy: `proxy-dial-fail`, `proxy-dial-slow`, `udp-relay-timeout`
- Subs / misc: `subs-refresh-fail`, `pipe-watchdog`, `datagram-readloop-exit`
Diagnostic Anti-Patterns (do not make these mistakes)
"节点远 → 速度慢 / 卡顿" — WRONG
Distance to a node only adds RTT (latency); it does NOT cause throughput loss, app hangs, or pipe-watchdog bursts. Kaitu's k2cc congestion control + BBR are designed for long links — a 200ms-RTT path to AU is not "slower" than a 60ms-RTT path to HK in terms of bandwidth or reliability. RTT 100ms vs 250ms is a UX nuance for interactive apps, not a root cause for connectivity failure.
Never recommend "switch to a closer node" as a fix for:
- Slow downloads / video buffering
- Apps not loading
- pipe-watchdog / proxy-dial-fail bursts
- Goroutine spikes
- "VPN connected but nothing works"
If the heartbeat shows healthy loss=0 and stable rttMs, distance is not the cause — keep digging. Real causes for those symptoms: server-side egress saturation, destination geo-blocking, GFW interference on the specific path, or app-side issues. None of those are fixed by picking a node 100ms closer.
Acceptable distance-related advice is narrow: only mention RTT when the user's complaint is explicitly about interactive latency (gaming ping, voice call lag, SSH responsiveness) — and even then, frame it as "lower RTT improves interactive feel", not "fixes speed".
Classification Guide
| Classification | Meaning | Reply Template |
|---|---|---|
| CLIENT_BUG | Bug in app code (panic, logic error) | Acknowledge + workaround if any + "will fix in next version" |
| CLIENT_CONFIG | User config issue (wrong mode, wrong server) | Step-by-step fix instructions |
| SERVER_ISSUE | Server/node problem (confirmed via k2s.log) | "We've identified the issue and are working on it" |
| NETWORK | User's ISP / network (GFW, ISP throttling, captive portal) | Network troubleshooting — restart router, try different network |
| KNOWN_FIXED | Fixed in later version | "Please update to version X.Y.Z" |
| PLATFORM_UNSUPPORTED | macOS <12, old iOS, etc. | State supported range |
| NOT_K2_ISSUE | Login / verification code / account / billing | Route to Center API diagnosis, not k2 logs |
| UNKNOWN | Cannot determine | Ask user for specifics — do NOT resolve |
Safety Rules
- NEVER expose internal details to users (server IPs, DIAG events, stack traces, error codes, node names).
- NEVER modify code during diagnosis — read-only analysis only.
- Follow the §7 confidence ladder. Do not resolve below Tier 2 for non-panic complaints.
- Always check `list_ticket_replies` before replying to avoid duplicates.
- Clean up `/tmp/kaitu-device-logs/` after finishing.
- Do not spend time reading k2 logs for login / verification-code tickets; they are unrelated to the k2 core.