name: "why-ai-agents-systematically"
description: >
  Diagnose and fix systematic failure modes in LLM-based multi-agent systems performing root cause analysis on cloud incidents. Applies the 12-pitfall taxonomy from Kim et al. (2026) to audit agent architectures, harden inter-agent communication, and eliminate hallucinated reasoning in diagnostic workflows.
  Use when:
  - "audit my agent pipeline for common failure modes"
  - "why does my RCA agent keep hallucinating root causes"
  - "harden multi-agent communication for incident response"
  - "review this agent workflow for reasoning pitfalls"
  - "my agents lose context across handoffs"
  - "fix incomplete exploration in my diagnostic agent"

Diagnosing and Fixing Systematic Failures in LLM Agent Systems

This skill equips Claude to audit, diagnose, and remediate the 12 systematic failure modes that affect LLM-based agents performing complex diagnostic tasks such as cloud Root Cause Analysis (RCA). It draws on a taxonomy developed across 1,675 agent runs on the OpenRCA benchmark and provides a concrete methodology for identifying whether a failure originates in intra-agent reasoning, inter-agent communication, or agent-environment interaction. For each category it prescribes targeted fixes rather than blanket prompt engineering, which the research reports as ineffective for the dominant pitfalls.

When to Use

When a user has a multi-agent system that produces incorrect or hallucinated conclusions during incident investigation or diagnostic workflows
When agents in a pipeline lose critical context during handoffs, producing contradictory or degraded analyses
When an LLM agent terminates investigation prematurely, missing the actual root cause
When building or reviewing a ReAct-style agent that interacts with monitoring tools, logs, or APIs and misinterprets their outputs
When an agent enters circular reasoning loops, repeatedly querying the same data without progressing
When designing a multi-agent architecture for cloud operations, SRE automation, or any complex diagnostic task and wanting to avoid known pitfalls
When debugging why an agent system works on simple cases but fails on multi-step, multi-system failure scenarios

Key Technique: The 12-Pitfall Taxonomy

The core insight from this research is that LLM agent failures in diagnostic tasks are architectural, not model-dependent. The same failure patterns appear across all model capability tiers (from smaller to frontier models), which means upgrading your model alone will not fix them. The failures cluster into three interaction boundaries, each requiring different mitigations:

Intra-agent reasoning failures (35-40% of all failures) include hallucinated data interpretation (the agent fabricates plausible but wrong conclusions from telemetry), circular reasoning (looping through identical steps), incomplete exploration (terminating after symptom identification instead of pursuing causal chains), and tool misalignment (misunderstanding what a diagnostic tool actually does). These are the most prevalent and the hardest to fix with prompt engineering alone.

Inter-agent communication failures (20-25%) include information loss across handoffs, conflicting conclusions between agents investigating the same incident, coordination breakdowns in sequencing, and context degradation where understanding erodes across transitions. The paper shows enriching the communication protocol (structured message schemas, explicit state passing) reduces these by up to 15 percentage points.

Agent-environment interaction failures (30-35%) include tool output misinterpretation (misreading JSON structures in 12-14% of queries), incorrect state assumptions, resource/token exhaustion before analysis completes, and misinterpreting error messages from the environment.

Step-by-Step Workflow

1. Collect Agent Execution Traces

Gather complete execution logs for the agent system: every LLM call, tool invocation, inter-agent message, and final output. If traces are not available, instrument the system to capture them. You need the full observation-reasoning-action cycle for each agent.
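
If the system is not yet instrumented, a trace recorder along the following lines may be enough to start with. This is a minimal sketch: the `TraceEvent` and `TraceLog` names, the `kind` values, and the in-memory storage are all assumptions for illustration, not something the paper specifies.

```python
from dataclasses import dataclass, field, asdict
import json
import time


@dataclass
class TraceEvent:
    """One step in an agent's observation-reasoning-action cycle."""
    run_id: str
    agent: str
    kind: str      # e.g. "llm_call", "tool_call", "agent_message", "final_output"
    payload: dict  # prompt/response, tool arguments/output, or message body
    timestamp: float = field(default_factory=time.time)


class TraceLog:
    """Append-only, in-memory trace store; swap in a file or database as needed."""

    def __init__(self):
        self.events = []

    def record(self, run_id, agent, kind, payload):
        event = TraceEvent(run_id, agent, kind, payload)
        self.events.append(event)
        return event

    def for_run(self, run_id):
        """Return the events of one run in the order they were recorded."""
        return [e for e in self.events if e.run_id == run_id]

    def to_jsonl(self):
        """Serialize all events as JSON Lines for offline analysis."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)
```

Calling `record(...)` around every LLM call, tool invocation, and inter-agent message gives the per-run sequence that the next step reconstructs.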

2. Reconstruct Reasoning Paths

For each agent run, map the sequential chain: what the agent observed, what hypothesis it formed, what action it took, and what it concluded. Identify where the chain diverges from ground truth or expected diagnostic logic.

3. Classify Each Failure Against the 12-Pitfall Taxonomy

Examine each failed run and assign it to one or more pitfall categories:

Intra-Agent Reasoning:

Pitfall	Signature
Hallucinated Interpretation	Agent states facts not present in tool output
Circular Reasoning	Agent repeats same query/action 2+ times with no parameter change
Incomplete Exploration	Agent declares root cause after examining <50% of relevant signals
Tool Misalignment	Agent calls a tool for a purpose it does not serve, or expects output the tool does not produce

Inter-Agent Communication:

Pitfall	Signature
Information Loss	Downstream agent lacks findings that upstream agent discovered
Conflicting Conclusions	Two agents attribute the same incident to different root causes
Coordination Breakdown	Agents duplicate work or skip steps assuming another agent handled them
Context Degradation	Later agents operate on progressively less accurate summaries

Agent-Environment Interaction:

Pitfall	Signature
Tool Misinterpretation	Agent extracts wrong field from structured output (JSON, table)
State Assumption Error	Agent acts on assumed system state that contradicts actual state
Resource Exhaustion	Agent hits token limit or API quota before completing analysis
Feedback Misunderstanding	Agent treats a warning as success or an error as irrelevant

4. Quantify Failure Distribution

Count the frequency of each pitfall across your failure corpus. This determines where to focus remediation. If 40% of failures are hallucinated interpretation, for example, verification loops and deterministic parsing of tool outputs will have more impact than communication protocol changes.
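
The counting step can be sketched as follows. The snake_case label names and the `failure_distribution` helper are assumptions made for this example; the pitfall-to-category mapping follows the three tables in step 3.

```python
from collections import Counter

# Pitfall -> category, following the three tables in step 3.
TAXONOMY = {
    "hallucinated_interpretation": "intra_agent",
    "circular_reasoning": "intra_agent",
    "incomplete_exploration": "intra_agent",
    "tool_misalignment": "intra_agent",
    "information_loss": "inter_agent",
    "conflicting_conclusions": "inter_agent",
    "coordination_breakdown": "inter_agent",
    "context_degradation": "inter_agent",
    "tool_misinterpretation": "agent_environment",
    "state_assumption_error": "agent_environment",
    "resource_exhaustion": "agent_environment",
    "feedback_misunderstanding": "agent_environment",
}


def failure_distribution(labels):
    """Return (per-pitfall shares, per-category shares) for a list of labels.

    `labels` holds one pitfall name per classified failure; a run assigned to
    several pitfalls contributes one entry per pitfall.
    """
    unknown = set(labels) - set(TAXONOMY)
    if unknown:
        raise ValueError(f"labels outside the taxonomy: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        return {}, {}
    by_pitfall = Counter(labels)
    by_category = Counter(TAXONOMY[label] for label in labels)
    return (
        {name: count / total for name, count in by_pitfall.most_common()},
        {name: count / total for name, count in by_category.most_common()},
    )
```

Both dictionaries are ordered from most to least frequent, so the first entry of each is the natural place to start remediation.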

5. Apply Targeted Mitigations by Category

For intra-agent reasoning failures:

Add explicit verification loops: after forming a hypothesis, require the agent to cite the specific data point supporting it and verify that data point exists in the tool output
Implement circular-reasoning detection: track the last N actions and halt + redirect if repetition is detected
Enforce exploration budgets: require the agent to examine a minimum set of diagnostic signals before declaring a root cause
Provide structured tool specifications with exact output schemas so the agent knows what fields to expect
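
As an illustration of the circular-reasoning check in the list above, the sketch below tracks recent actions and flags a repeat. The `RepetitionGuard` class, the window size of 5, and the choice to compare actions by tool name plus canonicalized parameters are assumptions for this example.

```python
import json
from collections import deque


class RepetitionGuard:
    """Flags an action once it appears `max_identical` times in the last `window` actions."""

    def __init__(self, window=5, max_identical=2):
        self.recent = deque(maxlen=window)
        self.max_identical = max_identical

    def check(self, tool, params):
        """Record the action; return True if the agent should be redirected."""
        # Canonical JSON so that parameter order does not hide a repeat.
        key = (tool, json.dumps(params, sort_keys=True))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_identical
```

When `check` returns True, the orchestrator can halt the step and inject a redirect prompt instead of executing the repeated query.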

For inter-agent communication failures:

Replace free-text handoffs with structured message schemas containing: findings (list of verified observations, each with its source), hypotheses (ranked, each with a confidence value), eliminated_causes, and remaining_questions
Implement a shared artifact store where each agent writes its findings to a persistent document that downstream agents read directly (not through summarization)
Add a validation agent that checks for contradictions across agent outputs before producing a final answer
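
One simple check a validation agent could run before synthesis is sketched below. It assumes handoff messages are dictionaries with `from_agent`, `hypotheses`, and `eliminated_causes` keys, and it compares causes by exact string match, which is a simplification; a real system would likely need to normalize cause names first.

```python
def find_contradictions(messages):
    """Return (cause, asserting_agent, eliminating_agent) triples.

    A contradiction here means one agent lists a cause among its hypotheses
    while another agent lists the same cause as eliminated.
    """
    conflicts = []
    for a in messages:
        hypothesised = {h["cause"] for h in a.get("hypotheses", [])}
        for b in messages:
            if a is b:
                continue
            for cause in b.get("eliminated_causes", []):
                if cause in hypothesised:
                    conflicts.append((cause, a["from_agent"], b["from_agent"]))
    return conflicts
```

An empty result does not prove the agents agree; it only means no cause was both proposed and ruled out under identical wording.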

For agent-environment interaction failures:

Wrap tool outputs in a parsing layer that validates expected schema before passing to the agent
Implement explicit token budgeting with graceful degradation (partial results > no results)
Add output format examples to tool descriptions so the agent knows exactly what a successful vs. failed response looks like
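
A parsing layer of the kind described above might look like the following sketch. The `ToolOutputError` and `parse_tool_output` names are hypothetical, and the example assumes the tool returns a single JSON object; adapt the shape check to whatever your tools actually emit.

```python
import json


class ToolOutputError(ValueError):
    """Raised when a tool response does not match the expected shape."""


def parse_tool_output(raw, expected_fields):
    """Parse a JSON tool response and check required fields and types.

    `expected_fields` maps field name -> Python type (or tuple of types).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolOutputError(f"tool returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolOutputError(f"expected a JSON object, got {type(data).__name__}")
    for name, kind in expected_fields.items():
        if name not in data:
            raise ToolOutputError(f"missing field {name!r}")
        if not isinstance(data[name], kind):
            raise ToolOutputError(
                f"field {name!r} has unexpected type {type(data[name]).__name__}"
            )
    # Pass on only the validated fields so the agent never sees unexpected extras.
    return {name: data[name] for name in expected_fields}
```

Raising an explicit error keeps malformed output from reaching the LLM, where it could otherwise be interpreted as if it were valid data.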

6. Implement Communication Protocol Enrichment

This is the single highest-ROI intervention from the paper. Replace any free-form inter-agent messaging with a structured protocol:

{
  "from_agent": "metrics-analyzer",
  "to_agent": "root-cause-synthesizer",
  "findings": [
    {"signal": "CPU utilization on node-3", "value": "98.7%", "source": "prometheus query at 14:32 UTC"},
    {"signal": "memory pressure on node-3", "value": "normal", "source": "prometheus query at 14:32 UTC"}
  ],
  "hypotheses": [
    {"cause": "CPU throttling on node-3 due to noisy neighbor", "confidence": 0.7},
    {"cause": "Application-level infinite loop", "confidence": 0.3}
  ],
  "eliminated_causes": ["network partition", "disk I/O saturation"],
  "remaining_questions": ["Is the CPU spike correlated with a specific deployment?"]
}
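
A receiving agent can reject malformed handoffs before acting on them. The sketch below checks only the parts of the message shown above that carry evidence and ranking (`findings` and `hypotheses`); the `validate_handoff` name and the specific rules are assumptions for illustration.

```python
def validate_handoff(message):
    """Return a list of problems; an empty list means the message is acceptable."""
    problems = []
    for key in ("from_agent", "to_agent", "findings", "hypotheses"):
        if key not in message:
            problems.append(f"missing key: {key}")
    for i, finding in enumerate(message.get("findings", [])):
        # Every finding must say what was observed, its value, and where it came from.
        for name in ("signal", "value", "source"):
            if not str(finding.get(name, "")).strip():
                problems.append(f"findings[{i}] has no {name}")
    for i, hyp in enumerate(message.get("hypotheses", [])):
        if not hyp.get("cause"):
            problems.append(f"hypotheses[{i}] has no cause")
        conf = hyp.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"hypotheses[{i}] confidence must be in [0, 1]")
    return problems
```

Returning a list of problems rather than raising on the first one lets the sender fix everything in a single retry.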

7. Add Verification Checkpoints

Insert explicit verification steps in the agent workflow where the agent must confirm its intermediate conclusions against raw data before proceeding. This reduces hallucination propagation by 25-30%.
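
One way to implement such a checkpoint is to require each intermediate claim to carry a verbatim quote and to confirm that the quote appears in the raw tool output it cites. The claim format and the `unsupported_claims` helper below are assumptions made for this sketch.

```python
def unsupported_claims(claims, raw_outputs):
    """Return the claims whose quoted evidence is absent from the cited tool output.

    claims: list of {"statement": str, "source": str, "quote": str}
    raw_outputs: dict mapping a source id to the raw text the tool returned
    """
    failed = []
    for claim in claims:
        raw = raw_outputs.get(claim["source"], "")
        quote = claim["quote"].strip()
        # An empty quote or a quote not found verbatim counts as unsupported.
        if not quote or quote not in raw:
            failed.append(claim)
    return failed
```

If the returned list is non-empty, the workflow can send those claims back to the agent for correction instead of passing them downstream.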

8. Test Against Known Failure Scenarios

Re-run the agent against the same failure corpus and re-classify failures. Confirm that the targeted pitfall category has reduced in frequency without introducing regressions in other categories.
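
The before/after comparison can be sketched as below, assuming each corpus has already been classified into a list of pitfall labels. The `compare_runs` helper and its return shape are hypothetical.

```python
from collections import Counter


def compare_runs(before, after, target):
    """Compare pitfall counts between two classified failure corpora.

    before / after: lists of pitfall labels, one per classified failure.
    Returns the change in the target pitfall and any other pitfalls that got worse.
    """
    b, a = Counter(before), Counter(after)
    regressions = {
        name: a[name] - b[name]
        for name in a
        if name != target and a[name] > b[name]
    }
    return {"target_delta": a[target] - b[target], "regressions": regressions}
```

A negative `target_delta` with an empty `regressions` dictionary is the outcome this step is looking for.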

Concrete Examples

Example 1: Auditing an existing multi-agent RCA system

User: "My multi-agent incident response system gets the wrong root cause about 50% of the time. Can you help me figure out why?"

Approach:

Request agent execution traces (LLM call logs, tool outputs, inter-agent messages) for 10-20 failed runs
For each failed run, reconstruct the reasoning path and identify the exact step where the agent diverged from correct analysis
Classify each failure using the 12-pitfall taxonomy
Produce a distribution: e.g., "6/20 failures are hallucinated interpretation, 5/20 are information loss at handoffs, 4/20 are incomplete exploration, 3/20 are tool misinterpretation, 2/20 are circular reasoning"
Recommend targeted fixes: structured communication protocol for the handoff losses, verification loops for hallucination, exploration budgets for premature termination

Output:

## Failure Audit Report

Analyzed: 20 failed agent runs
Failure distribution:
  Intra-agent reasoning:   12/20 (60%)
    - Hallucinated interpretation: 6
    - Incomplete exploration: 4
    - Circular reasoning: 2
  Inter-agent communication: 5/20 (25%)
    - Information loss: 5
  Agent-environment:        3/20 (15%)
    - Tool misinterpretation: 3

Top priority fix: Hallucinated interpretation (30% of all failures)
  -> Add citation-required verification: agent must quote exact
     tool output line supporting each claim
  -> Expected reduction: 25-30% of these failures

Second priority: Information loss at agent handoffs (25%)
  -> Replace free-text summaries with structured finding schemas
  -> Expected reduction: up to 15 percentage points

Example 2: Designing a new diagnostic agent with built-in pitfall resistance

User: "I'm building a multi-agent system to diagnose Kubernetes pod failures. How do I avoid the common pitfalls?"

Approach:

Design the agent architecture with pitfall-resistant patterns from the start
Define structured inter-agent communication protocol
Add verification checkpoints and exploration minimums
Implement tool output validation

Output:

# Agent architecture with pitfall-resistant patterns

AGENT_CONFIG = {
    "log_analyzer": {
        "role": "Analyze pod logs and extract error patterns",
        "tools": ["kubectl_logs", "log_search"],
        "verification": {
            # Anti-hallucination: require citing log lines
            "require_source_citation": True,
            # Anti-incomplete-exploration: minimum signals
            "min_signals_before_conclusion": 3,
            # Anti-circular-reasoning: detect repeated actions
            "max_identical_actions": 2,
        },
        "token_budget": 8000,  # Anti-resource-exhaustion
        "graceful_degradation": "return partial findings if budget exceeded",
    },
    "metrics_analyzer": {
        "role": "Query Prometheus for resource metrics around failure time",
        "tools": ["prometheus_query"],
        "verification": {
            "require_source_citation": True,
            "min_signals_before_conclusion": 4,
            "max_identical_actions": 2,
        },
        "tool_output_validation": {
            # Anti-tool-misinterpretation: parse and validate
            "prometheus_query": {
                "expected_fields": ["metric", "value", "timestamp"],
                "parse_before_llm": True,
            }
        },
    },
}

# Structured inter-agent handoff schema (anti-information-loss)
HANDOFF_SCHEMA = {
    "findings": [{"signal": str, "value": str, "source": str}],
    "hypotheses": [{"cause": str, "confidence": float}],
    "eliminated_causes": [str],
    "remaining_questions": [str],
    "raw_evidence_refs": [str],  # pointers to raw data, not summaries
}

# Validation agent (anti-conflicting-conclusions)
SYNTHESIS_CONFIG = {
    "role": "Cross-check findings from all agents, flag contradictions",
    "contradiction_check": True,
    "require_consensus_or_explain_divergence": True,
}
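
The configuration above is declarative, so something in the runtime has to enforce it. As a hedged sketch, the hypothetical `conclusion_gate` function below shows how an orchestrator might apply `min_signals_before_conclusion` from an agent's `verification` block; how signals are named and tracked is an assumption.

```python
def conclusion_gate(verification, examined_signals):
    """Return "conclude" or an instruction telling the agent to keep exploring.

    verification: an agent's "verification" block from the configuration
    examined_signals: names of the signals the agent has looked at so far
    """
    required = verification.get("min_signals_before_conclusion", 1)
    # Repeated looks at the same signal do not count as extra exploration.
    missing = required - len(set(examined_signals))
    if missing <= 0:
        return "conclude"
    return f"Examine at least {missing} more distinct signal(s) before declaring a root cause."
```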

Example 3: Fixing a specific hallucination problem

User: "My RCA agent keeps saying 'memory usage spiked to 95%' when the actual Prometheus data shows 45%. How do I stop this?"

Approach:

This is a classic "hallucinated data interpretation" pitfall -- the agent fabricates plausible metric values
Implement a citation-and-verify pattern between the tool call and the agent's interpretation
Add output parsing that extracts the actual numeric value before the LLM sees free-form text

Output:

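# Before (vulnerable to hallucination):
# Agent sees raw Prometheus JSON and interprets freely

# After (citation-required pattern):
import json

# Assumed helpers, not defined here:
#   prometheus_client - placeholder for your own Prometheus query wrapper
#                       (the PyPI package of that name is an instrumentation library)
#   extract_metrics   - deterministic parser returning a flat dict of values
#   approx_match      - True if the response states a number close to `value`

def query_and_verify(agent, metric_query):
    raw_output = prometheus_client.query(metric_query)

    # Step 1: Parse structured values BEFORE LLM interpretation
    parsed = extract_metrics(raw_output)  # deterministic parsing
    # parsed = {"memory_usage_percent": 45.2, "timestamp": "..."}

    # Step 2: Present parsed values to agent with verification requirement
    prompt = f"""
    Parsed metric values (these are ground truth, do not modify):
    {json.dumps(parsed, indent=2)}

    Based on EXACTLY these values (do not infer different numbers),
    what does this tell us about the system state?
    You MUST quote the exact numeric values from above in your response.
    """

    response = agent.generate(prompt)

    # Step 3: Post-hoc verification -- check that the numeric values the
    # agent states match the parsed output. Non-numeric fields such as
    # timestamps are skipped, since the agent is not required to quote them.
    for key, value in parsed.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if str(value) not in response and not approx_match(value, response):
            # The caller is expected to re-run the agent with this correction.
            return (
                f"VERIFICATION FAILED: Agent misquoted {key}. "
                f"Actual: {value}. Re-running with correction."
            )

    return response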
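
The `approx_match` helper called in the function above is not defined in this skill. One possible implementation, offered only as a sketch, extracts every number from the response and accepts any that lies within a relative tolerance of the parsed value; the 1% default tolerance is an arbitrary assumption.

```python
import math
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def approx_match(value, response, rel_tol=0.01):
    """True if `response` mentions a number within `rel_tol` of `value`."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return any(
        math.isclose(float(token), value, rel_tol=rel_tol)
        for token in _NUMBER.findall(response)
    )
```

This tolerates rounding such as "about 45%" for 45.2 only if the tolerance is widened, so choose `rel_tol` to match how much rounding you are willing to accept.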

Best Practices

Do:

Classify failures before attempting fixes. The pitfall taxonomy exists precisely because different failure types require different mitigations. Shotgun prompt engineering wastes effort.
Use structured schemas for all inter-agent communication. Free-text summaries are the primary vector for information loss and context degradation.
Require agents to cite specific data sources for every factual claim. "Memory is high" is not acceptable; "memory_usage=94.2% from prometheus at 14:32" is.
Implement deterministic parsing of tool outputs before passing them to the LLM. Let code extract the numbers; let the LLM interpret what the numbers mean.
Set explicit exploration minimums. Require agents to examine N distinct signal types before declaring a root cause, preventing premature termination.

Avoid:

Relying on prompt engineering alone to fix hallucinated interpretation or incomplete exploration. The research shows these pitfalls persist across all model capability tiers and are architectural problems.
Letting agents pass free-text summaries to downstream agents. Each summarization step degrades context. Pass structured artifacts and raw evidence references instead.
Assuming a more capable model will eliminate failures. The same 12 pitfalls appear in every model tested. Fix the architecture, not the model.
Ignoring token budgets. Resource exhaustion causes agents to produce truncated, unreliable analyses. Budget explicitly and implement graceful degradation.

Error Handling

Agent produces no output (resource exhaustion): Implement token budget monitoring with a 75% threshold warning that triggers the agent to produce a partial summary of findings so far before hitting the hard limit.
Contradictory conclusions from multiple agents: Route both conclusions plus their supporting evidence to a synthesis agent that must explain the contradiction and identify which evidence chain is stronger.
Circular reasoning detected: After detecting 2 repeated identical actions, inject a redirect prompt: "You have queried this metric twice with the same parameters. Either change the parameters, investigate a different signal, or state why the current data is sufficient to proceed."
Tool returns unexpected format: Fail loud, not silent. If a tool output does not match the expected schema, return a clear error to the agent rather than letting it interpret malformed data (which causes tool misinterpretation pitfalls).
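
The 75% threshold for resource exhaustion can be sketched as a small budget tracker. The `TokenBudget` class and its three return values are hypothetical, and how tokens are counted is left to the caller.

```python
class TokenBudget:
    """Tracks token usage against a hard limit with an early-warning threshold."""

    def __init__(self, limit, warn_ratio=0.75):
        self.limit = limit
        self.warn_ratio = warn_ratio
        self.used = 0

    def spend(self, tokens):
        """Record usage and return "ok", "summarize", or "stop"."""
        self.used += tokens
        if self.used >= self.limit:
            return "stop"
        if self.used >= self.limit * self.warn_ratio:
            # Time to ask the agent for a partial summary of findings so far.
            return "summarize"
        return "ok"
```

On "summarize", the orchestrator asks the agent to write up its findings so far, so that a later "stop" still leaves a usable partial result.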

Limitations

The 12-pitfall taxonomy was developed specifically against cloud RCA scenarios using the OpenRCA benchmark. While the categories generalize to other diagnostic domains (security incident response, debugging, medical diagnosis triage), the specific frequency distributions may differ in non-cloud contexts.
The mitigations reduce failure rates significantly but do not eliminate them. Combined interventions reduce failures from ~50% to ~25-30%, meaning complex multi-system failures still challenge even hardened agent architectures.
Human operators remain superior at recognizing novel failure patterns and adapting to unprecedented configurations. Agent systems should be designed for augmentation, not full replacement, in high-stakes diagnostic scenarios.
The structured communication protocol adds overhead (more tokens, more latency). For simple, single-cause incidents, a lighter-weight architecture may be more appropriate.

Reference

Kim, T., Park, W., Yun, H., & Lee, K. (2026). Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis? arXiv:2602.09937v1. https://arxiv.org/abs/2602.09937v1

When reading the paper, look for: the table or figure enumerating the 12 pitfall types with frequency data across models, the structured communication protocol specification that achieved the 15pp improvement, and the controlled mitigation experiment results showing which interventions work for which pitfall categories.