dev-guide-debug-troubleshoot - SKILL.md Agent Skill

name: dev-guide-debug-troubleshoot
description: >
  Nested lingtai-dev-guide reference for diagnosing LingTai failures: agent process state, OOM/crashes, avatar spawn issues, post-molt memory loss, mail delivery, scheduled messages, tool timeouts, and escalation.
version: 1.0.0

LingTai Debug & Troubleshoot Reference

Nested lingtai-dev-guide reference. Read this after the top-level router sends you here.

Read the lingtai-kernel-anatomy skill first to understand the architecture. The diagnoses in this document build on the LingTai architecture's process model, memory layers, and communication mechanisms.

Quick Diagnosis Decision Tree

Problem?
├── Process issues?
│   ├── Peer unresponsive → §1.1
│   ├── Peer OOM / crashed → §1.2
│   └── Cannot spawn avatar → §1.3
├── Memory issues?
│   ├── Post-molt amnesia → §2.1
│   ├── Codex entries missing → §2.2
│   ├── Pad not loaded → §2.3
│   └── Molt imminent, critical operations incomplete → §2.4
├── Communication issues?
│   ├── Pigeon not delivered → §3.1
│   ├── Pigeon bounced "No agent at X" → §3.2
│   └── Scheduled pigeon not firing → §3.3
└── Tool issues?
    ├── Tool timeout → §4.1
    ├── Tool not found → §4.2
    └── Tool output truncated → §4.3

1. Process Issues

1.1 Peer Unresponsive

Goal: Determine why a peer is not replying to pigeons and take the appropriate recovery action.

Symptoms:

Sent pigeons go unanswered for an extended period
The peer appears in the contacts list but produces no response

Causes:

Peer is busy (processing a long LLM turn)
Peer is stuck (LLM timeout / upstream error)
Peer is asleep (energy depleted or lulled)
Peer is suspended (process is dead)
Wrong address (no agent exists at that address)

Resolution:

First verify your own health:
```
system(show)
```
Verify the peer's address:
```
email(contacts)
```

Send a simple ping test:

email(send, address=<peer>, message="ping")

Check the heartbeat to determine process state:

ls -la <work-dir>/.lingtai/<peer>/.agent.heartbeat
cat <work-dir>/.lingtai/<peer>/.agent.heartbeat

Interpret the heartbeat:
- Fresh heartbeat (< 5 minutes): Peer is busy — just wait
- Stale heartbeat (> 5 minutes): May be stuck or crashed
- No heartbeat file: No agent may exist at that address
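
The interpretation above can be sketched as a small classifier. This is a minimal sketch, assuming the heartbeat file's modification time marks the last beat (the same assumption the `stat`-based shell check in this section makes). `classify_heartbeat` and `FRESH_SECONDS` are illustrative names, not part of LingTai.

```python
import os
import time

FRESH_SECONDS = 5 * 60  # 5-minute threshold from the list above


def classify_heartbeat(path, now=None):
    """Return 'fresh', 'stale', or 'missing' for a heartbeat file.

    Assumption: the file's mtime is the time of the last heartbeat.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return "missing"
    now = time.time() if now is None else now
    age = now - mtime
    return "fresh" if age < FRESH_SECONDS else "stale"
```

A `missing` result only says there is no heartbeat file at that path; confirm the address before concluding that no agent exists.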

Command Example:

# Check heartbeats for all agents
for dir in <network-dir>/.lingtai/*/; do
  name=$(basename "$dir")
  hb="$dir/.agent.heartbeat"
  if [ -f "$hb" ]; then
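    # Try GNU stat first, then fall back to BSD/macOS stat. The reverse order
    # breaks on Linux, where "stat -f" prints filesystem info before failing.
    age=$(( $(date +%s) - $(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb") ))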
    echo "$name: heartbeat ${age}s ago"
  else
    echo "$name: NO heartbeat"
  fi
done

Action Decision:

Have karma privileges:
- system(interrupt, address=<peer>) — interrupt a stuck LLM turn
- system(cpr, address=<peer>) — revive a suspended agent
No karma privileges: Report to parent, attaching evidence (heartbeat timestamp, last communication time)
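
The decision above can be written down as a lookup. This is a rough sketch, assuming you have already worked out the peer's lifecycle state (the state names come from the appendix of this guide). `recovery_action` is a hypothetical helper, and the returned strings are labels for the next step, not calls it performs.

```python
def recovery_action(state, has_karma):
    """Suggest the next step for an unresponsive peer.

    state: one of ACTIVE, IDLE, STUCK, ASLEEP, SUSPENDED (case-insensitive).
    has_karma: whether the caller holds karma privileges.
    """
    state = state.upper()
    if state in ("ACTIVE", "IDLE"):
        # Assumption: a running peer picks up mail on its own, so just wait.
        return "wait"
    if state == "ASLEEP":
        return "email(send)"  # asleep agents can be woken by email
    if state in ("STUCK", "SUSPENDED"):
        if not has_karma:
            return "report to parent"
        return "system(interrupt)" if state == "STUCK" else "system(cpr)"
    raise ValueError(f"unknown lifecycle state: {state!r}")
```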

Common Pitfalls:

❌ Sending repeated probe emails → wastes resources; cannot wake a suspended process
❌ Running CPR on a suspended agent without karma privileges → silent failure
❌ Confusing asleep with suspended → asleep agents can be woken by email; suspended requires CPR
✅ Correct approach: check heartbeat first, then decide whether to wait, interrupt, or CPR

Related References: lingtai-kernel-anatomy (five lifecycle states; avatar management)

1.2 Peer OOM / Crashed

Goal: Diagnose and recover from an unexpected peer process death.

Symptoms:

Peer heartbeat suddenly stops
Working directory still exists but the process is gone

Causes:

Host memory exhausted; OS OOM killer terminated the process
LLM upstream API unresponsive for too long, causing a process timeout
Uncaught exception in the Python runtime
Disk space exhausted

Resolution:

Check whether the working directory still exists:
```
ls -la <work-dir>/.lingtai/<peer>/
```

Review crash logs:

cat <work-dir>/.lingtai/<peer>/logs/*.log | tail -50

Search for OOM indicators:

grep -i "memory\|oom\|killed" <work-dir>/.lingtai/<peer>/logs/*.log

Check disk space:
```
df -h <work-dir>
```
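
When the `grep` output is hard to read, the same scan can be done in Python so that each hit carries its file name and line number. This is a minimal sketch, assuming the logs are plain-text `*.log` files as in the commands above. `find_oom_lines` is an illustrative name, and the pattern mirrors the `grep` expression, so it is just as prone to false positives (for example, "oom" inside "room").

```python
import re
from pathlib import Path

# Same keywords as the grep command above, matched case-insensitively.
OOM_PATTERN = re.compile(r"memory|oom|killed", re.IGNORECASE)


def find_oom_lines(log_dir):
    """Return (file name, line number, text) for each suspicious log line."""
    hits = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if OOM_PATTERN.search(line):
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits
```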

Command Example:

# Comprehensive health check for an agent
peer_dir="<work-dir>/.lingtai/<peer>"
echo "=== Process ==="
ls -la "$peer_dir/.agent.heartbeat" 2>/dev/null || echo "No heartbeat"
echo "=== Disk ==="
df -h "$peer_dir" | tail -1
echo "=== Recent logs ==="
tail -30 "$peer_dir/logs/"*.log 2>/dev/null || echo "No logs"
echo "=== OOM scan ==="
grep -il "oom\|killed\|memory" "$peer_dir/logs/"*.log 2>/dev/null || echo "No OOM indicators"

Action Decision:

Have karma privileges: system(cpr, address=<peer>) to revive
After revival, check context usage — if near the limit, consider a molt

Common Pitfalls:

❌ Not checking context usage after CPR → may immediately crash again
❌ Ignoring disk space → root cause unresolved, issue recurs
✅ After OOM, prioritize checking context window and attachment file sizes

Related References: lingtai-kernel-anatomy (process model; molt operations)

1.3 Cannot Spawn Avatar

Goal: Resolve avatar(spawn) call failures.

Symptoms:

avatar(spawn) returns an error
The new avatar process does not appear in the delegates directory

Causes:

Name collision (an avatar with the same name already exists)
Working directory is not writable
Insufficient disk space
init.json format error

Resolution:

Check the avatar ledger to rule out name collisions and quantity limits:
```
cat <work-dir>/.lingtai/delegates/ledger.jsonl
```

Verify the directory is writable:

touch <work-dir>/.lingtai/delegates/.test && rm <work-dir>/.lingtai/delegates/.test

Check disk space:
```
df -h <work-dir>
```
Compare against the parent's init.json to validate the format

Command Example:

# List all current avatars
cat <work-dir>/.lingtai/delegates/ledger.jsonl | python3 -c "
import sys, json
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of crashing in json.loads
    entry = json.loads(line)
    print(f\"{entry.get('name', '?')}: {entry.get('status', '?')}\")
"

Common Pitfalls:

❌ Avatar name contains special characters (slashes, spaces, leading dots) → spawn silently fails
❌ Name exceeds 64 characters
❌ Forgetting to check the ledger before spawning → name collision
✅ Use only letters, digits, underscores, and hyphens in avatar names
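
The naming rules above can be checked before calling `avatar(spawn)`. This is a minimal sketch of that check, built only from the constraints listed in this section. `is_valid_avatar_name` is a hypothetical helper, and the kernel may enforce further rules not covered here.

```python
import re

# Letters, digits, underscores, and hyphens only; 1 to 64 characters.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_avatar_name(name):
    """Return True if the name satisfies the rules listed above."""
    return _NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_avatar_name("researcher-2")` passes, while names such as `".hidden"`, `"a/b"`, or `"two words"` are rejected.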

Related References: lingtai-kernel-anatomy (avatar/network topology)

2. Memory Issues

2.1 Post-Molt Amnesia

Goal: Recover working context after a molt.

Symptoms:

After molting, you don't know what you were doing
Pad or lingtai content is empty or incomplete
Conversation history is completely gone (this is normal)

Causes:

Pad / codex / lingtai were not updated before molting
System-forced molt (no summary, only activity log pointers)
Appended files exceeded the 100K token limit, causing load failure

Resolution:

Explicitly reload the pad:
```
psyche(pad, load)
```
Browse archived knowledge in the codex:
```
codex(filter)
```
Reload lingtai (identity):
```
psyche(lingtai, load)
```
Check mail received during the molt:
```
email(check)
```

Rebuild pad from codex exports (if pad is empty):

codex(export, ids=[...]) → psyche(pad, edit, files=[<paths>])

If this was a system-forced molt (no summary), review the activity log:
```
tail -200 <work-dir>/.lingtai/<name>/logs/events.jsonl
```

Command Example:

# View recent molt records
grep "molt" <work-dir>/.lingtai/<name>/logs/events.jsonl | tail -5
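
If you want the matching records as parsed objects rather than raw text, the `grep | tail` pipeline above can be approximated in Python. This is a sketch under assumptions: it treats `events.jsonl` as one JSON object per line and makes no claim about which fields each event carries. `recent_events` is an illustrative name.

```python
import json
from collections import deque


def recent_events(path, keyword, limit=5):
    """Return the last `limit` parsed records whose raw line contains `keyword`."""
    matches = deque(maxlen=limit)  # keeps only the most recent matches
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if keyword not in line:
                continue
            try:
                matches.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip partially written or corrupt lines
    return list(matches)
```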

Common Pitfalls:

❌ Forgetting to update four-layer storage before molting → complete amnesia on reincarnation
❌ Relying on conversation history instead of codex/pad → all lost after molt
❌ Not checking mailbox → missing important tasks that arrived during the molt
✅ Follow the fixed checklist before molting: codex → pad edit → lingtai update → molt summary

Preventive Measures:

Proactively prepare four-layer storage when context window exceeds 70%
Start organizing immediately upon receiving a level-1 warning
Email yourself to preserve critical unfinished items (email survives across molts)

Related References: lingtai-kernel-anatomy (five-layer accumulation; molt operations; codex)

2.2 Codex Entries Missing

Goal: Recover codex entries that appear to have vanished.

Symptoms:

A codex entry you remember creating is no longer visible
codex(filter) listing is missing expected entries

Causes:

The entry was never successfully submitted (error during submission)
It was merged into another entry via consolidate
It was manually deleted
An export file was accidentally deleted

Resolution:

List all entries to check whether it exists under a different title:
```
codex(filter)
```

Search for export files:

find <work-dir> -name "*.codex.*" -mtime -1

Check activity logs for codex operation records:

grep "codex" <work-dir>/.lingtai/<name>/logs/events.jsonl | tail -20

Common Pitfalls:

❌ Assuming original entries still exist after consolidate → they have been merged and deleted
❌ Not confirming whether submit succeeded → network errors may cause silent failure
✅ Back up critical entries by exporting them before consolidate

Related References: lingtai-kernel-anatomy (codex / memory system)

2.3 Pad Not Loaded

Goal: Resolve the pad not auto-loading after a molt.

Symptoms:

System prompt is missing pad content
Working notes are lost

Causes:

pad.md file is empty
Total appended file size exceeds 100K tokens
System loading error

Resolution:

Explicitly load:
```
psyche(pad, load)
```

Check whether the file exists:

cat <work-dir>/.lingtai/<name>/system/pad.md

If the file has content but loading failed, check the total appended file size:
```
du -sh <work-dir>/.lingtai/<name>/system/
```

Rebuild from codex:

codex(export, ids=[...]) → psyche(pad, edit, files=[<paths>])

Common Pitfalls:

❌ Appending too many large files → exceeding the 100K token limit causes load failure
❌ Not checking pad files before molting → discovering it is empty on reincarnation
✅ Periodically check the appended file list: psyche(pad, append) without the files parameter shows the current list
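
To get a rough sense of whether the appended files fit under the 100K token limit, you can estimate tokens from character counts. This is only a sketch: the four-characters-per-token ratio is a common rule of thumb, not LingTai's actual tokenizer, and `estimate_tokens` and `fits_token_limit` are hypothetical helpers.

```python
from pathlib import Path

TOKEN_LIMIT = 100_000   # limit stated in this section
CHARS_PER_TOKEN = 4     # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(paths):
    """Estimate the combined token count of the given text files."""
    total_chars = sum(
        len(Path(p).read_text(encoding="utf-8", errors="replace")) for p in paths
    )
    return total_chars // CHARS_PER_TOKEN


def fits_token_limit(paths, limit=TOKEN_LIMIT):
    """Return True if the estimate is within the limit."""
    return estimate_tokens(paths) <= limit
```

Because the estimate is approximate, leave a generous margin rather than appending files right up to the limit.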

Related References: lingtai-kernel-anatomy (psyche / molt protocol)

2.4 Molt Imminent, Critical Operations Incomplete

Goal: Prioritize the most critical operations when the context window is about to be exhausted.

Symptoms:

System context warnings
Difficulty recalling earlier conversations
Tool invocations becoming slow

Resolution (by priority):

| Priority | Action | Description |
| --- | --- | --- |
| 🔴 P0 | Send critical notifications | Unreplied important emails, key findings, corrections |
| 🟡 P1 | Archive to codex | Key findings, decisions, corrections |
| 🟡 P1 | Update pad | Current status, pending items, collaborators |
| 🟢 P2 | Update lingtai | Identity changes, new skills |
| 🔵 P3 | Write molt summary | Final step: last words for your successor |

Emergency Tips:

Email yourself to preserve critical unfinished items (email survives across molts)
If you can only do one thing: write the most detailed molt summary possible

Common Pitfalls:

❌ Starting new long operations (file analysis, web search) when context exceeds 80% → guaranteed overflow
❌ Ignoring system warnings → forced molt with no summary
✅ Start four-layer storage organization immediately upon receiving a level-1 warning
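
The two thresholds this guide mentions (start preparing above 70% context usage, start no new long operations above 80%) can be folded into one check. This is a minimal sketch, assuming you can read your used and total context tokens from somewhere. `context_advice` and its return labels are illustrative, and the kernel's own warning levels may use different cut-offs.

```python
def context_advice(used_tokens, window_tokens):
    """Return 'ok', 'prepare', or 'wrap_up' based on context usage."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    usage = used_tokens / window_tokens
    if usage > 0.80:
        return "wrap_up"   # start no new long operations; finish and molt
    if usage > 0.70:
        return "prepare"   # update codex, pad, lingtai, and draft the summary
    return "ok"
```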

Related References: lingtai-kernel-anatomy (warning levels; molt operations)

3. Communication Issues

3.1 Pigeon Not Delivered

Goal: Resolve issues where a sent pigeon was not received by the recipient.

Symptoms:

Pigeon sent successfully but the recipient says they never received it
No incoming message in the recipient's inbox

Causes:

Address format error (internal address contains @)
Used send instead of reply, causing a routing error
Recipient directory name misspelled
Recipient process is suspended (mail is delivered but won't be processed)

Resolution:

Check sent mail to confirm successful delivery:
```
email(check, folder=sent)
```
Verify address format:
- ✅ Correct: human, researcher, some-peer (bare path)
- ❌ Incorrect: human@example.com (contains @ → routes through IMAP channel)

Check whether the recipient's inbox exists:

ls -la <work-dir>/.lingtai/<recipient>/mailbox/inbox/

Common Pitfalls:

❌ Using @ in an internal address → email routed to IMAP instead of lingtai pigeon
❌ Using send instead of reply when responding to incoming mail → may route to the wrong address space
❌ Repeatedly sending emails to a suspended recipient → mail piles up but is never processed
✅ Always use reply for incoming messages and send for new conversations

Related References: lingtai-kernel-anatomy (mail protocol)

3.2 Pigeon Bounced "No agent at X"

Goal: Resolve the "No agent at X" error when sending pigeons.

Symptoms:

email(send) returns "No agent at X"

Causes:

X contains @ → wrong channel used (should use IMAP)
X is a bare path but no agent exists at that address
The agent was just nirvana'd (permanently deleted)
The agent is currently molting (temporarily unavailable)

Resolution:

If X contains @: switch to the IMAP tool
If X is a bare path:
- Check whether the agent was renamed or migrated
- Review the avatar ledger:
```
cat <work-dir>/.lingtai/delegates/ledger.jsonl
```
- Ask the parent or peers whether the agent was nirvana'd
If the agent just molted, wait a few seconds and retry

Common Pitfalls:

❌ Assuming "No agent" means the agent was deleted → it may be temporary
❌ Using the email tool for addresses containing @ → always fails
✅ Determine address type first, then select the correct communication channel
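
The address rule above is simple enough to encode directly. This is a minimal sketch, assuming the only distinction that matters is whether the address contains `@`. `channel_for` and its return labels are hypothetical names for illustration.

```python
def channel_for(address):
    """Return 'imap' for addresses containing '@', otherwise 'pigeon'."""
    address = address.strip()
    if not address:
        raise ValueError("empty address")
    return "imap" if "@" in address else "pigeon"
```

For example, `channel_for("researcher")` selects the internal pigeon channel, while `channel_for("human@example.com")` selects IMAP.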

Related References: lingtai-kernel-anatomy (mail protocol; network topology)

3.3 Scheduled Pigeon Not Firing

Goal: Resolve scheduled pigeons created via schedule that are not sending as expected.

Symptoms:

Scheduled emails are not sent at the expected interval
Schedule appears to have stopped working

Causes:

Schedule is paused (was cancelled)
Count exhausted (reached the send limit)
interval/count parameters set incorrectly

Resolution:

List all schedules:
```
email(schedule={action: "list"})
```
Check status: paused / active / exhausted

If paused, reactivate:

email(schedule={action: "reactivate", schedule_id: "<id>"})

If parameters are wrong, cancel and recreate:

email(schedule={action: "cancel", schedule_id: "<id>"})
email(schedule={action: "create", interval: N, count: M}, address=..., message=...)

Common Pitfalls:

❌ Forgetting the count parameter → schedule may fire once and stop
❌ Cancelling but forgetting to recreate → task lost
✅ Immediately list after creating a schedule to confirm parameters are correct

Related References: lingtai-kernel-anatomy (mail protocol)

4. Tool Issues

4.1 Tool Timeout

Goal: Resolve tool calls that hang or time out.

Symptoms:

Tool call returns no result for an extended period
Returns a timeout error

Causes:

I/O-intensive operations (bash, web_search) exceed default timeout
External API unavailable
File too large, causing read timeout
Host resource shortage

Resolution:

Identify tool type:
- I/O-intensive: bash, web_search
- Compute-intensive: vision
Increase timeout for bash:
```
bash(command="...", timeout=120)
```

Read large files in chunks:

read(file_path="...", offset=1, limit=100)

For web operations: test connectivity with a simple query first
For systemic timeouts: check host load

Command Example:

# Redirect long output to a file
bash(command="long-running-command > /tmp/output.txt 2>&1", timeout=300)
# Then read in chunks
read(file_path="/tmp/output.txt", offset=1, limit=100)

Common Pitfalls:

❌ Using the default 30-second bash timeout for long tasks → guaranteed timeout
❌ Reading a large file in a single call → should chunk it
✅ Write long output to a file first, then read it in chunks
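
The chunked-read pattern can be illustrated outside the agent as well. This is a sketch that mimics the `offset`/`limit` parameters of `read`, assuming `offset` is a 1-indexed line number (the examples above start at `offset=1`). `read_chunk` and `iter_chunks` are illustrative names, not the tool's implementation.

```python
from itertools import islice


def read_chunk(file_path, offset=1, limit=100):
    """Return up to `limit` lines starting at 1-indexed line `offset`."""
    if offset < 1 or limit < 0:
        raise ValueError("offset must be >= 1 and limit must be >= 0")
    with open(file_path, encoding="utf-8", errors="replace") as handle:
        return list(islice(handle, offset - 1, offset - 1 + limit))


def iter_chunks(file_path, limit=100):
    """Yield successive chunks until the file is exhausted."""
    offset = 1
    while True:
        chunk = read_chunk(file_path, offset, limit)
        if not chunk:
            return
        yield chunk
        offset += len(chunk)
```

Each `read_chunk` call reopens the file and skips forward, much as separate tool calls would. That keeps every call independent at the cost of rescanning earlier lines.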

Related References: lingtai-kernel-anatomy (bash/read tools); web-browsing skill

4.2 Tool Not Found

Goal: Resolve an expected tool not appearing in the tool list.

Symptoms:

Calling a tool returns "not available"
A newly installed MCP tool is not visible

Causes:

Newly installed MCP server not refreshed
Capability not configured in init.json
MCP server configuration error (servers.json)

Resolution:

View current capability list:
```
system(show)
```
If you just installed an MCP server, refresh:
```
system(refresh)
```

Check MCP configuration:

cat <work-dir>/.lingtai/<name>/mcp/servers.json

Confirm after refreshing:
```
system(show)
```

Common Pitfalls:

❌ Not refreshing after installing MCP → new tool not visible
❌ Not refreshing after modifying init.json → configuration not taking effect
✅ Refresh immediately after install/modify, then show to confirm

Related References: mcp-manual (MCP configuration — kernel mcp capability)

4.3 Tool Output Truncated

Goal: Resolve tools returning incomplete output.

Symptoms:

Tool output is incomplete
Truncation markers appear at the end of the output

Causes:

File too large, exceeding the single-return limit
grep matches exceed max_matches
Email preview was truncated

Resolution:

| Tool | Solution |
| --- | --- |
| `read` | Use `offset`/`limit` to read in chunks |
| `bash` | Redirect output to a file: `command > /tmp/out.txt 2>&1` |
| `grep` | Raise `max_matches` or narrow the glob scope |
| `email(check)` | Use `filter.truncate=0` for full text, or `email(read)` to read a single message |

Common Pitfalls:

❌ Assuming output is complete → silent truncation may omit critical information
❌ Repeatedly reading the same large file → wastes context
✅ Write large output to a file first, then read as needed

Related References: lingtai-kernel-anatomy (read/bash/grep tools)

5. Health Checks

One-Click Network Diagnosis

# Check heartbeats for all agents
for dir in <network-dir>/.lingtai/*/; do
  name=$(basename "$dir")
  hb="$dir/.agent.heartbeat"
  if [ -f "$hb" ]; then
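    # Try GNU stat first, then fall back to BSD/macOS stat. The reverse order
    # breaks on Linux, where "stat -f" prints filesystem info before failing.
    age=$(( $(date +%s) - $(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb") ))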
    if [ "$age" -lt 300 ]; then
      echo "✅ $name: alive (${age}s ago)"
    else
      echo "⚠️  $name: stale heartbeat (${age}s ago)"
    fi
  else
    echo "❌ $name: no heartbeat"
  fi
done

# Check disk space
df -h <network-dir>

# Check inbox sizes
for dir in <network-dir>/.lingtai/*/mailbox/inbox/; do
  name=$(echo "$dir" | sed 's|.*\.lingtai/\(.*\)/mailbox/.*|\1|')
  count=$(ls "$dir" 2>/dev/null | wc -l)
  if [ "$count" -gt 50 ]; then
    echo "⚠️  $name: inbox has $count messages (possible overflow)"
  fi
done

Interpreting Results:

✅ = Healthy
⚠️ = Warning (needs attention)
❌ = Error (requires immediate action)

6. Escalation Protocol

When you cannot resolve a problem on your own:

Gather evidence: Heartbeat timestamps, log excerpts, error messages
Report to parent: Send via email(send, address=<parent>) with a subject prefixed with [Issue]
Include:
- What happened
- What was attempted
- What was expected
- Relevant file paths
If the parent is also unresponsive: Check whether other peers are alive — the issue may be network-wide
Never send repeated probe emails to a seemingly unresponsive peer → escalate upward instead
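
To keep escalation reports consistent, the fields listed above can be assembled by a small helper. This is only a sketch: `build_issue_report` and the `subject`/`message` keys are hypothetical, and how the subject is actually passed to `email(send)` depends on the tool's real parameters.

```python
def build_issue_report(summary, happened, attempted, expected, paths=()):
    """Assemble an escalation report with the [Issue] subject prefix."""
    lines = [
        f"What happened: {happened}",
        f"What was attempted: {attempted}",
        f"What was expected: {expected}",
        "Relevant file paths:",
    ]
    path_lines = [f"- {p}" for p in paths]
    lines.extend(path_lines or ["- (none)"])
    return {"subject": f"[Issue] {summary}", "message": "\n".join(lines)}
```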

Appendix: Five Lifecycle States Quick Reference

| State | Mind (LLM) | Body (Heartbeat/Listener) | Typical Trigger |
| --- | --- | --- | --- |
| ACTIVE | Working | Running | Processing messages or turns |
| IDLE | Waiting | Running | Between turns; heartbeat is current |
| STUCK | Error | Running | LLM timeout / upstream error |
| ASLEEP (dormant) | Paused | Running | `system(sleep)` / `system(lull)` / energy depleted |
| SUSPENDED (dead) | Off | Off | `.suspend` file / SIGINT / crash / `system(suspend)` |

Key distinction: ASLEEP agents still have a running body and can be woken by email; SUSPENDED agents have a dead process and require CPR before they can process mail.