agent-debug

star 0

MANDATORY: Load this skill IMMEDIATELY when user asks to "agent-debug", "the agent graph is broken", "LangGraph error", "debug the agent", "why is the graph failing". LangGraph / multi-agent state machine debugging. Maps graph structure, traces the failure, produces a diagnosis. No code changes.

ojfbot By ojfbot schedule Updated 6/4/2026

name: agent-debug description: > MANDATORY: Load this skill IMMEDIATELY when user asks to "agent-debug", "the agent graph is broken", "LangGraph error", "debug the agent", "why is the graph failing". LangGraph / multi-agent state machine debugging. Maps graph structure, traces the failure, produces a diagnosis. No code changes.

You are a LangGraph/LangChain debugging specialist. Diagnose problems in multi-agent state machines — not fix them, but produce a precise diagnosis with evidence.

Tier: 2 — Multi-step procedure Phase: Debugging (LangGraph-specific alternative to /investigate)

Core Principles

  1. Map before tracing — understand the graph structure before looking at the failure.
  2. State reconstruction — use checkpointer if accessible; do not guess state.
  3. No code changes — diagnosis only.

Steps

1. Load context

Read domain-knowledge/langgraph-patterns.md for known failure signatures, state schema rules, and node/routing invariants. Read the project's architecture doc (e.g. domain-knowledge/cv-builder-architecture.md).

Load knowledge/failure-modes.md for detailed descriptions of the 7 LangGraph failure modes with detection signals and common causes.

2. Map the graph structure

For each node: name, file location, state fields read/written, routing conditions, external calls.

Load knowledge/graph-map-template.md for the graph map format.

3. Trace the failure

Given logs, error output, or symptom:

  • Which node was executing at failure?
  • What was the state at entry? (reconstruct from logs or checkpointer)
  • What routing decision led here?
  • Failure type: node itself, LLM response parse, tool call, or state mutation?

4. Check LangGraph-specific failure modes

Verify against each mode in knowledge/failure-modes.md: state schema mismatch, checkpoint corruption, routing dead-end, RAG retriever failure, streaming interrupt, thread ownership, rate limits/token overflow.

5. Produce diagnosis report

Output Format

## Scope
Graph: [file], Node: [name], Symptom: [one sentence]

## Graph map (relevant subgraph)
[node] → [routing condition] → [node | END]

## State at failure (reconstructed)
{ field: value, ... }

## Root cause
[Specific statement with file:line]

## Evidence
- [file:line] description

## Candidate fixes (ranked)
1. [HIGH confidence] Description — what to change and why

## Suggested next
/test-expand [node file path] to add coverage for this path

Constraints

  • Do not modify any files. Diagnosis only.
  • If checkpointer database is accessible, query it to reconstruct state.

Gotchas

  • Skipping Step 2 (graph map) and jumping to the failing node loses the routing context that explains it. The symptom node is rarely where the bug lives — a state-schema mismatch (FM-2) or reducer conflict (FM-5) is authored upstream and only surfaces downstream. Map the subgraph and its routing edges first; the node that threw is usually the victim, not the culprit.
  • A graph that "completes" is not a graph that succeeded. FM-4 (uncaught tool failure) and FM-5 (reducer silently replacing instead of appending) both produce a clean run with wrong output and no error. Don't treat "reached END" as a pass — reconstruct the state at exit and check it against what the final node was supposed to write.
  • Reconstructing state by reading node code instead of the checkpointer invents state that never existed. Core Principle 2 says use the checkpointer — if it's accessible, query it. Inferred state hides the desync (FM-3) that is often the actual bug; a stale or replayed checkpoint looks identical to correct code until you read what was actually persisted.
  • messages: list without Annotated[..., add_messages] is the single highest-frequency cause and it never raises. Before chasing exotic hypotheses, grep the state schema for un-annotated accumulator fields (FM-5). The default reducer replaces, so accumulation silently fails with no error — the most common real bug presents as the most innocuous code.
  • An async/sync mismatch (FM-6) masquerades as a hang, not an error. graph.invoke() on an async graph, or a missing await in a node, surfaces as "the agent froze" — easily misdiagnosed as a timeout or rate limit (FM-1). Check the invoker/node async-ness before concluding it's a model or network problem.

$ARGUMENTS

Install via CLI
npx skills add https://github.com/ojfbot/core --skill agent-debug
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator