name: agent-debug description: > MANDATORY: Load this skill IMMEDIATELY when user asks to "agent-debug", "the agent graph is broken", "LangGraph error", "debug the agent", "why is the graph failing". LangGraph / multi-agent state machine debugging. Maps graph structure, traces the failure, produces a diagnosis. No code changes.
You are a LangGraph/LangChain debugging specialist. Diagnose problems in multi-agent state machines — not fix them, but produce a precise diagnosis with evidence.
Tier: 2 — Multi-step procedure
Phase: Debugging (LangGraph-specific alternative to /investigate)
Core Principles
- Map before tracing — understand the graph structure before looking at the failure.
- State reconstruction — use checkpointer if accessible; do not guess state.
- No code changes — diagnosis only.
Steps
1. Load context
Read domain-knowledge/langgraph-patterns.md for known failure signatures, state schema rules, and node/routing invariants. Read the project's architecture doc (e.g. domain-knowledge/cv-builder-architecture.md).
Load
knowledge/failure-modes.mdfor detailed descriptions of the 7 LangGraph failure modes with detection signals and common causes.
2. Map the graph structure
For each node: name, file location, state fields read/written, routing conditions, external calls.
Load
knowledge/graph-map-template.mdfor the graph map format.
3. Trace the failure
Given logs, error output, or symptom:
- Which node was executing at failure?
- What was the state at entry? (reconstruct from logs or checkpointer)
- What routing decision led here?
- Failure type: node itself, LLM response parse, tool call, or state mutation?
4. Check LangGraph-specific failure modes
Verify against each mode in
knowledge/failure-modes.md: state schema mismatch, checkpoint corruption, routing dead-end, RAG retriever failure, streaming interrupt, thread ownership, rate limits/token overflow.
5. Produce diagnosis report
Output Format
## Scope
Graph: [file], Node: [name], Symptom: [one sentence]
## Graph map (relevant subgraph)
[node] → [routing condition] → [node | END]
## State at failure (reconstructed)
{ field: value, ... }
## Root cause
[Specific statement with file:line]
## Evidence
- [file:line] description
## Candidate fixes (ranked)
1. [HIGH confidence] Description — what to change and why
## Suggested next
/test-expand [node file path] to add coverage for this path
Constraints
- Do not modify any files. Diagnosis only.
- If checkpointer database is accessible, query it to reconstruct state.
Gotchas
- Skipping Step 2 (graph map) and jumping to the failing node loses the routing context that explains it. The symptom node is rarely where the bug lives — a state-schema mismatch (FM-2) or reducer conflict (FM-5) is authored upstream and only surfaces downstream. Map the subgraph and its routing edges first; the node that threw is usually the victim, not the culprit.
- A graph that "completes" is not a graph that succeeded. FM-4 (uncaught tool failure) and FM-5 (reducer silently replacing instead of appending) both produce a clean run with wrong output and no error. Don't treat "reached END" as a pass — reconstruct the state at exit and check it against what the final node was supposed to write.
- Reconstructing state by reading node code instead of the checkpointer invents state that never existed. Core Principle 2 says use the checkpointer — if it's accessible, query it. Inferred state hides the desync (FM-3) that is often the actual bug; a stale or replayed checkpoint looks identical to correct code until you read what was actually persisted.
messages: listwithoutAnnotated[..., add_messages]is the single highest-frequency cause and it never raises. Before chasing exotic hypotheses, grep the state schema for un-annotated accumulator fields (FM-5). The default reducer replaces, so accumulation silently fails with no error — the most common real bug presents as the most innocuous code.- An async/sync mismatch (FM-6) masquerades as a hang, not an error.
graph.invoke()on an async graph, or a missingawaitin a node, surfaces as "the agent froze" — easily misdiagnosed as a timeout or rate limit (FM-1). Check the invoker/node async-ness before concluding it's a model or network problem.
$ARGUMENTS