longcat-flash-thinking-2601-technical-report - SKILL.md Agent Skill

name: "longcat-flash-thinking-2601-technical-report"
description: "Build robust multi-tool agentic pipelines with noise-aware execution, parallel reasoning, and environment scaling patterns from the LongCat-Flash-Thinking architecture. Use when: 'build a multi-tool agent pipeline', 'make my agent robust to API failures', 'scale tool orchestration across domains', 'add parallel reasoning to my agent', 'handle noisy tool responses', 'design a fault-tolerant agentic workflow'."

Robust Multi-Tool Agentic Pipeline Design (LongCat-Flash-Thinking Patterns)

This skill teaches Claude to design and implement robust multi-tool agentic systems using architectural patterns from LongCat-Flash-Thinking-2601. The core insight: real-world tool-use agents fail not because of reasoning quality but because of noisy environments, brittle tool chains, and poor error recovery. This skill applies the paper's three key innovations — dependency-aware tool graph construction, noise-aware execution with curriculum-based hardening, and parallel reasoning with aggregation — to build agent pipelines that work reliably in production.

When To Use

When the user asks to build a multi-step agent that calls multiple external APIs or tools in sequence
When designing a pipeline where tool outputs feed into subsequent tool calls (dependency chains)
When the user needs their agent to handle flaky APIs, partial responses, or inconsistent tool behavior
When orchestrating parallel tool calls that must be aggregated into a single coherent result
When scaling an agent system across many domains or tool types (e.g., search + code execution + database queries)
When implementing retry logic, fallback strategies, or graceful degradation for agentic workflows
When the user says "my agent breaks when the API returns unexpected data" or "how do I make this pipeline more reliable"

Key Technique

Tool Dependency Graphs with Controlled Expansion. Instead of treating tool calls as a flat sequence, model them as a directed acyclic graph (DAG) where nodes are tools and edges represent parameter dependencies. A tool node is only invoked when all its upstream dependencies have resolved successfully. This prevents cascading failures: if tool B depends on tool A's output, B is never called with stale or missing data. When adding new tools to an existing pipeline, use BFS-style expansion — only add a new tool node if all its dependencies are already satisfied by previously instantiated tools. This guarantees executability at every expansion step.
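
The expansion rule above can be sketched as a breadth-first walk over candidate tools. This is a minimal illustration, not the paper's implementation: the function name expand_tool_graph, the candidates mapping, and the "translator"/"glossary" entries are assumptions made for the example. Seed tools are assumed to have no unmet dependencies.

```python
from collections import deque


def expand_tool_graph(seed_tools, candidates):
    """Grow a tool set so every added tool has all of its dependencies present.

    seed_tools: tool names that need no upstream tool.
    candidates: dict mapping tool name -> set of dependency names.
    Returns (tools in the order they were admitted, tools that were left out).
    """
    instantiated = list(seed_tools)
    present = set(instantiated)

    # Reverse index: which candidates consume each tool's output.
    dependents = {}
    for name, deps in candidates.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    frontier = deque(instantiated)
    while frontier:
        current = frontier.popleft()
        for name in dependents.get(current, []):
            if name in present:
                continue
            # Admit a tool only once every dependency is already instantiated.
            if candidates[name] <= present:
                instantiated.append(name)
                present.add(name)
                frontier.append(name)

    left_out = set(candidates) - present
    return instantiated, left_out


candidates = {
    "paper_finder": {"web_search"},
    "news_search": {"web_search"},
    "pdf_extractor": {"paper_finder", "news_search"},
    "summarizer": {"pdf_extractor"},
    "translator": {"glossary"},  # hypothetical tool whose dependency is never provided
}
order, left_out = expand_tool_graph(["web_search"], candidates)
```

Because a tool is admitted only after all of its dependencies, tools that depend on each other in a cycle are never admitted and end up in left_out, so the resulting graph stays acyclic.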

Noise-Aware Execution with Progressive Hardening. Real-world tool responses are noisy: APIs return partial data, hit rate limits, time out, or change response schemas. Rather than hoping for clean data, explicitly categorize noise into instruction noise (ambiguous user queries, underspecified parameters) and tool noise (execution failures, inconsistent responses, partial results). Build a curriculum of increasing noise severity: start with clean tool responses during development, then progressively inject realistic failure modes (timeouts, malformed JSON, missing fields, rate limits) to harden the pipeline. This mirrors the paper's finding that noise-trained agents maintain clean-environment performance (29.3% vs 28.6%) while dramatically improving under noisy conditions (20.5% vs 13.3%).
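
One way to apply this curriculum to a pipeline under test is to wrap each tool in a fault injector whose failure rates rise level by level. The sketch below is an assumption-laden illustration: with_noise, CURRICULUM, and fake_search are hypothetical names, and the rates are the ones suggested in the workflow's noise-injection step rather than measured values.

```python
import asyncio
import random

# Hypothetical curriculum: each level adds one more failure mode.
CURRICULUM = [
    {"timeout": 0.0, "malformed": 0.0, "missing_field": 0.0},    # clean
    {"timeout": 0.10, "malformed": 0.0, "missing_field": 0.0},
    {"timeout": 0.10, "malformed": 0.05, "missing_field": 0.0},
    {"timeout": 0.10, "malformed": 0.05, "missing_field": 0.15},
]


def with_noise(tool_fn, rates, rng):
    """Wrap an async tool so it misbehaves at the given rates."""

    async def noisy(*args, **kwargs):
        if rng.random() < rates["timeout"]:
            raise asyncio.TimeoutError("injected timeout")
        result = await tool_fn(*args, **kwargs)
        if rng.random() < rates["malformed"]:
            return "<<not json>>"  # stand-in for an unparseable payload
        if isinstance(result, dict) and result and rng.random() < rates["missing_field"]:
            result = dict(result)           # copy so the original is untouched
            result.pop(next(iter(result)))  # drop the first field
        return result

    return noisy


async def fake_search(query):
    """Placeholder tool used only for this demonstration."""
    return {"query": query, "hits": ["a", "b"]}


async def demo():
    rng = random.Random(0)  # seeded so test runs are reproducible
    clean = with_noise(fake_search, CURRICULUM[0], rng)
    return await clean("llm agents")


clean_result = asyncio.run(demo())
```

Passing a seeded random.Random instance keeps a failing test run reproducible, which matters when a bug only appears under a specific sequence of injected faults.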

Parallel Reasoning with Reflective Aggregation (Heavy Thinking Pattern). For complex decisions requiring multiple perspectives, generate N candidate reasoning paths in parallel, then run a second-pass aggregation step that synthesizes the parallel results into a final answer. This expands both reasoning depth (each path can go deep) and width (multiple approaches explored simultaneously). Practically, this means: fan out tool calls or reasoning branches in parallel, collect all results, then use a dedicated aggregation step that has access to all intermediate reasoning — not just final outputs.

Step-by-Step Workflow

Map the tool dependency graph. For each tool the agent can call, list its required inputs and outputs. Draw edges from each tool that produces an output to every tool that consumes it. Represent this as an adjacency list or DAG structure in code. Verify the graph is acyclic.
Validate executability before invocation. Before calling any tool, check that all upstream dependencies have resolved with valid data. Implement a readiness check: tool.ready = all(dep.status == 'success' for dep in tool.dependencies). Never invoke a tool whose dependencies have not fully resolved.
Categorize expected noise patterns. For each tool, enumerate its failure modes: timeout, rate limit, partial response, schema change, authentication expiry, empty result. Tag each as instruction noise (ambiguity in what was asked) or tool noise (execution-level failure). Store these as a noise profile per tool.
Implement layered error handling per noise type. For tool noise: add retries with exponential backoff, response schema validation, and fallback to cached results. For instruction noise: add parameter normalization, disambiguation prompts, and default value injection. Each layer handles one noise category independently.
Build the execution engine with topological ordering. Sort the tool DAG topologically. Execute tools in order, parallelizing tools at the same dependency depth. Use an async task queue where completed tools trigger readiness checks on their downstream dependents.
Add context management for multi-turn interactions. Track cumulative context size. When context exceeds a threshold (e.g., 80K tokens or a domain-appropriate limit), apply summary-based compression: distill completed tool results into concise summaries and drop the raw intermediate data. If the turn count exceeds a configured maximum, reset with a fresh context seeded by the summary.
Implement parallel reasoning branches for critical decisions. At decision points where the next action is ambiguous, fan out N parallel reasoning paths (N=3-5). Each path independently selects and executes its tool chain. Collect all terminal results.
Aggregate parallel results with reflective synthesis. Pass all N parallel results to an aggregation step that: (a) identifies consensus across paths, (b) flags contradictions, (c) selects the best-supported conclusion. The aggregator sees intermediate reasoning, not just final answers.
Test with progressive noise injection. Start with clean tool responses. Then systematically inject: (a) 10% timeout rate, (b) malformed responses on 5% of calls, (c) missing fields in 15% of responses, (d) rate limiting after N calls. Verify the pipeline degrades gracefully at each level.
Monitor and log the dependency graph execution. Log each tool invocation with: tool name, inputs received, dependency status, response validity, retry count, and wall-clock time. Use this to identify bottleneck tools and noise hotspots for targeted hardening.
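
The context-management step in the workflow above can be sketched as a small budget tracker. Everything here is an assumption for illustration: ContextManager, estimate_tokens (a crude four-characters-per-token stand-in for a real tokenizer), the max_turns default, and naive_summarize, which in practice would be a model call. For simplicity the sketch handles a budget overflow and a turn-limit overflow the same way.

```python
from dataclasses import dataclass, field


def estimate_tokens(text):
    """Rough token estimate: about four characters per token, rounded up."""
    return (len(text) + 3) // 4


@dataclass
class ContextManager:
    budget: int = 80_000   # token budget before compression kicks in
    max_turns: int = 50    # hypothetical turn limit
    summary: str = ""
    raw_results: list = field(default_factory=list)
    turns: int = 0

    def used(self):
        raw = sum(estimate_tokens(text) for _, text in self.raw_results)
        return estimate_tokens(self.summary) + raw

    def add_result(self, tool_name, text, summarize):
        self.turns += 1
        self.raw_results.append((tool_name, text))
        if self.used() > self.budget or self.turns > self.max_turns:
            self.compress(summarize)

    def compress(self, summarize):
        # Distill raw results into the running summary, then drop the raw data.
        self.summary = summarize(self.summary, self.raw_results)
        self.raw_results = []
        self.turns = 0


def naive_summarize(previous, raw_results):
    """Placeholder summarizer: keeps the first 20 characters of each result."""
    parts = [previous] if previous else []
    parts += [f"{name}: {text[:20]}" for name, text in raw_results]
    return " | ".join(parts)


ctx = ContextManager(budget=10)
ctx.add_result("web_search", "short", naive_summarize)         # stays under budget
ctx.add_result("pdf_extractor", "x" * 100, naive_summarize)    # triggers compression
```

The summarizer is passed in as a callable so the budget logic can be tested without a model in the loop.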

Concrete Examples

Example 1: Multi-API Research Agent with Noise Handling

User: "Build me an agent that takes a research question, searches the web, finds relevant papers, extracts key findings, and produces a summary. It needs to handle API failures gracefully."

Approach:

Map the tool dependency graph:

[user_query] --> [web_search] --> [paper_finder] --> [pdf_extractor] --> [summarizer]
                               \-> [news_search] -/

Implement with topological execution and noise handling:

import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class ToolStatus(Enum):
    PENDING = "pending"
    READY = "ready"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"

@dataclass
class ToolNode:
    name: str
    dependencies: list[str] = field(default_factory=list)
    status: ToolStatus = ToolStatus.PENDING
    result: Any = None
    retries: int = 0
    max_retries: int = 3
    noise_profile: dict = field(default_factory=dict)

class AgentPipeline:
    def __init__(self):
        self.tools: dict[str, ToolNode] = {}
        self.context_budget = 80_000  # tokens
        self.context_summary = ""

    def add_tool(self, name, dependencies=None, noise_profile=None):
        """BFS-style: only add if all dependencies already exist."""
        deps = dependencies or []
        for dep in deps:
            if dep not in self.tools:
                raise ValueError(f"Dependency '{dep}' not registered. Add it first.")
        self.tools[name] = ToolNode(
            name=name,
            dependencies=deps,
            noise_profile=noise_profile or {}
        )

    def is_ready(self, name: str) -> bool:
        tool = self.tools[name]
        return all(
            self.tools[dep].status == ToolStatus.SUCCESS
            for dep in tool.dependencies
        )

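    def validate_response(self, tool: ToolNode, result: Any) -> bool:
        """Placeholder check; replace with real per-tool schema validation."""
        return result is not None

    async def execute_with_retry(self, tool: ToolNode, executor):
        """Noise-aware execution with layered error handling."""
        # Readiness gate: never call a tool whose upstream dependencies failed.
        if not self.is_ready(tool.name):
            tool.status = ToolStatus.FAILED
            tool.result = {"error": "upstream dependency failed", "skipped": True}
            return
        inputs = {dep: self.tools[dep].result for dep in tool.dependencies}
        last_error = None
        while tool.retries <= tool.max_retries:
            try:
                tool.status = ToolStatus.RUNNING
                result = await asyncio.wait_for(
                    executor(tool.name, inputs),
                    timeout=30.0
                )
                # Validate response against expected schema
                if not self.validate_response(tool, result):
                    raise ValueError("Schema validation failed")
                tool.result = result
                tool.status = ToolStatus.SUCCESS
                return
            except asyncio.TimeoutError:
                last_error = "timeout"
                tool.retries += 1
                if tool.retries <= tool.max_retries:
                    await asyncio.sleep(2 ** tool.retries)  # exponential backoff
            except Exception as e:
                last_error = str(e)
                tool.retries += 1
        # Retries exhausted (timeout or other error): record the failure
        # explicitly so the tool is never left in the RUNNING state.
        tool.status = ToolStatus.FAILED
        tool.result = {"error": last_error, "fallback": True}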

    async def run(self, executor):
        """Topological execution with parallelism at each depth."""
        from graphlib import TopologicalSorter
        ts = TopologicalSorter(
            {name: set(t.dependencies) for name, t in self.tools.items()}
        )
        ts.prepare()
        while ts.is_active():
            ready_batch = list(ts.get_ready())
            await asyncio.gather(*[
                self.execute_with_retry(self.tools[name], executor)
                for name in ready_batch
            ])
            for name in ready_batch:
                ts.done(name)

pipeline = AgentPipeline()
pipeline.add_tool("web_search", noise_profile={"timeout_rate": 0.1})
pipeline.add_tool("paper_finder", dependencies=["web_search"],
                   noise_profile={"partial_response_rate": 0.15})
pipeline.add_tool("news_search", dependencies=["web_search"])
pipeline.add_tool("pdf_extractor",
                   dependencies=["paper_finder", "news_search"],
                   noise_profile={"schema_change_rate": 0.05})
pipeline.add_tool("summarizer", dependencies=["pdf_extractor"])

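Output: A pipeline designed to retry timed-out or rate-limited calls with exponential backoff and to record an explicit failure with an error payload when a tool cannot recover, instead of crashing mid-run. Note that pdf_extractor is registered with both paper_finder and news_search as dependencies, so continuing on news_search results alone when paper_finder fails requires extra logic, such as marking paper_finder as an optional dependency; the code above does not provide that fallback by itself.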

Example 2: Parallel Reasoning for Ambiguous Tool Selection

User: "My agent needs to decide whether to query a SQL database, call a REST API, or search a document store based on the user's question. Sometimes multiple sources are needed."

Approach:

Implement the Heavy Thinking parallel fan-out pattern:

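import asyncio

# select_tools_for_query, execute_tool, detect_contradictions, and merge_results
# are application-specific helpers assumed to be defined elsewhere.
async def parallel_reasoning(query: str, tools: list, n_paths: int = 3):
    """Fan out N reasoning paths, each independently selecting tools."""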

    async def reasoning_path(path_id: int):
        # Each path independently analyzes the query and picks tools
        selected = await select_tools_for_query(query, tools, temperature=0.7)
        results = []
        for tool in selected:
            result = await execute_tool(tool, query)
            results.append({"tool": tool, "result": result, "path": path_id})
        return results

    # Stage 1: Parallel exploration
    all_paths = await asyncio.gather(*[
        reasoning_path(i) for i in range(n_paths)
    ])

    # Stage 2: Reflective aggregation
    return aggregate_with_reflection(query, all_paths)


def aggregate_with_reflection(query: str, paths: list) -> dict:
    """Synthesize parallel results — sees intermediate reasoning, not just answers."""
    tool_votes = {}
    all_results = []

    for path in paths:
        for step in path:
            tool_name = step["tool"]
            tool_votes[tool_name] = tool_votes.get(tool_name, 0) + 1
            all_results.append(step)

    # Consensus: tools selected by a strict majority of paths
    # (">" rather than ">=" so that 2 of 4 paths does not count as a majority)
    consensus_tools = {t for t, v in tool_votes.items() if v > len(paths) / 2}

    # Contradiction detection: flag conflicting results from same tool
    contradictions = detect_contradictions(all_results)

    return {
        "consensus_tools": consensus_tools,
        "contradictions": contradictions,
        "merged_result": merge_results(all_results, consensus_tools),
        "confidence": len(consensus_tools) / max(len(tool_votes), 1)
    }

Output: For query "What were our Q3 sales in the Northwest region?", path 1 queries SQL (SELECT sum(amount) FROM sales WHERE quarter='Q3' AND region='NW'), path 2 calls the REST API (/api/sales?q=Q3&region=NW), path 3 searches documents. Aggregator finds SQL and API agree on $2.4M, document search returns a slightly different figure from a draft report. Aggregator flags the contradiction and returns the SQL/API consensus with a note about the draft discrepancy.
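
The aggregation code above calls detect_contradictions and merge_results without defining them. One possible shape for these helpers is sketched below, following the code comment that contradictions are conflicting results from the same tool. The implementations and the sample data are assumptions for illustration; the 2_350_000 draft figure is invented for the demo.

```python
from collections import Counter, defaultdict


def _group_by_tool(all_results):
    grouped = defaultdict(list)
    for step in all_results:
        grouped[step["tool"]].append(step)
    return grouped


def detect_contradictions(all_results):
    """List tools that returned different results on different paths."""
    contradictions = []
    for tool, steps in _group_by_tool(all_results).items():
        # repr() allows unhashable results (dicts, lists) to be compared.
        distinct = {repr(step["result"]) for step in steps}
        if len(distinct) > 1:
            contradictions.append(
                {"tool": tool, "paths": sorted(step["path"] for step in steps)}
            )
    return contradictions


def merge_results(all_results, consensus_tools):
    """For each consensus tool, keep the result reported most often."""
    merged = {}
    for tool, steps in _group_by_tool(all_results).items():
        if tool not in consensus_tools:
            continue
        counts = Counter(repr(step["result"]) for step in steps)
        winner, _ = counts.most_common(1)[0]
        merged[tool] = next(
            step["result"] for step in steps if repr(step["result"]) == winner
        )
    return merged


steps = [
    {"tool": "sql", "result": 2_400_000, "path": 0},
    {"tool": "sql", "result": 2_400_000, "path": 1},
    {"tool": "docs", "result": 2_400_000, "path": 0},
    {"tool": "docs", "result": 2_350_000, "path": 1},  # made-up draft figure
    {"tool": "docs", "result": 2_400_000, "path": 2},
]
conflicts = detect_contradictions(steps)
merged = merge_results(steps, {"sql", "docs"})
```

Detecting disagreement between different tools, as in the SQL versus document-store case described above, would need a domain-specific notion of which results are comparable, which this sketch does not attempt.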

Example 3: Progressive Noise Testing for a Deployment Pipeline

User: "I have a CI/CD agent that calls GitHub API, runs tests, deploys to staging, then runs smoke tests. It's flaky in production. Help me harden it."

Approach:

Define the noise profile for each tool in the pipeline:

noise_profiles = {
    "github_api": {
        "timeout": {"rate": 0.08, "retry_strategy": "exponential_backoff"},
        "rate_limit": {"rate": 0.15, "retry_strategy": "wait_for_reset"},
        "partial_response": {"rate": 0.03, "handler": "re_fetch_missing_fields"}
    },
    "test_runner": {
        "flaky_test": {"rate": 0.10, "handler": "retry_failed_only"},
        "timeout": {"rate": 0.05, "retry_strategy": "extend_timeout"}
    },
    "deploy_staging": {
        "rollback_needed": {"rate": 0.02, "handler": "auto_rollback"},
        "partial_deploy": {"rate": 0.04, "handler": "verify_then_retry"}
    },
    "smoke_tests": {
        "environment_not_ready": {"rate": 0.20, "handler": "poll_with_backoff"},
        "false_negative": {"rate": 0.05, "handler": "retry_twice_before_fail"}
    }
}

Inject noise progressively in test environment:

# Level 1: Single-tool noise (one tool fails at a time)
# Level 2: Multi-tool noise (2+ tools fail simultaneously)
# Level 3: Cascading noise (upstream failure causes downstream issues)
# Level 4: Realistic production profile (all noise at observed rates)

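# run_pipeline_with_noise is a hypothetical test harness; the fields on its
# report object (success_rate, avg_retries, false_failures) are illustrative.
for level in [1, 2, 3, 4]: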
    results = run_pipeline_with_noise(pipeline, noise_level=level, iterations=100)
    print(f"Level {level}: {results.success_rate}% success, "
          f"{results.avg_retries} avg retries, "
          f"{results.false_failures} false failures caught")

Output: Level 1: 98% success. Level 2: 94% success. Level 3: 87% success (identified that deploy_staging + smoke_tests cascading failure needed a health-check gate). Level 4: 91% success after adding the gate — up from the original 72% observed in production.
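
A harness for this kind of progressive test could look roughly like the simulation below. It is a simplified stand-in, not the run_pipeline_with_noise used above: simulate_noise_run, NoiseReport, and the LEVELS mapping are assumed names, the pipeline is modeled as a linear chain of steps that each fail independently, and cascading failures (Level 3) are not modeled. The rates are borrowed from the noise profile in this example purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class NoiseReport:
    success_rate: float  # percentage of runs that completed every step
    avg_retries: float   # retries consumed per run, averaged over all runs


def simulate_noise_run(failure_rates, max_retries=2, iterations=100, seed=0):
    """Simulate a linear pipeline where each step fails at its given rate."""
    rng = random.Random(seed)
    successes = 0
    total_retries = 0
    for _ in range(iterations):
        run_ok = True
        for rate in failure_rates.values():
            failures = 0
            # Each loop iteration is one failed attempt at this step.
            while rng.random() < rate:
                failures += 1
                if failures > max_retries:
                    run_ok = False
                    break
            total_retries += min(failures, max_retries)
            if not run_ok:
                break  # downstream steps are skipped once a step gives up
        successes += run_ok
    return NoiseReport(100.0 * successes / iterations, total_retries / iterations)


# Hypothetical levels: noise on one tool, then two tools, then all four.
LEVELS = {
    1: {"github_api": 0.08, "test_runner": 0.0, "deploy_staging": 0.0, "smoke_tests": 0.0},
    2: {"github_api": 0.08, "test_runner": 0.0, "deploy_staging": 0.0, "smoke_tests": 0.20},
    4: {"github_api": 0.08, "test_runner": 0.05, "deploy_staging": 0.04, "smoke_tests": 0.20},
}

reports = {level: simulate_noise_run(rates) for level, rates in LEVELS.items()}
```

A simulation like this is useful for checking retry budgets before wiring fault injection into the real tools; it does not replace testing against the actual pipeline.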

Best Practices

Do: Model tool relationships as explicit DAGs. This catches impossible execution orders at design time, not runtime.
Do: Classify every failure mode as instruction noise or tool noise before writing error handling. Different noise types need different mitigation strategies.
Do: Use parallel fan-out (Heavy Thinking pattern) when a decision depends on ambiguous or incomplete information. A few independent explorations followed by aggregation are more likely to surface the right source than a single serial guess, at the price of extra tool calls.
Do: Compress context aggressively in multi-turn pipelines. Summarize completed tool results; keep only active branch context in full.
Avoid: Retrying blindly without classifying the failure. A rate limit needs a wait; a schema change needs adaptation; a timeout needs backoff. Generic retries waste budget.
Avoid: Treating tool responses as always well-formed. Validate every response against an expected schema before passing it downstream; a single malformed response can corrupt every step that depends on it.
Avoid: Building pipelines that only work under clean conditions. If you haven't tested with injected noise, your pipeline is not production-ready.
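
The advice to classify a failure before retrying can be sketched as a small classifier plus a delay policy. The names Failure, ToolError, classify, and next_delay are assumptions for this example, and the sketch assumes the Retry-After value has already been parsed into seconds (the header can also carry an HTTP date).

```python
import asyncio
from enum import Enum


class Failure(Enum):
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    SCHEMA = "schema"
    UNKNOWN = "unknown"


class ToolError(Exception):
    """Hypothetical error type carrying an HTTP status and a Retry-After value."""

    def __init__(self, status=None, retry_after=None, message=""):
        super().__init__(message)
        self.status = status
        self.retry_after = retry_after


def classify(exc):
    """Map an exception to a failure category."""
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return Failure.TIMEOUT
    if isinstance(exc, ToolError):
        if exc.status == 429:
            return Failure.RATE_LIMIT
        if exc.status in (408, 504):
            return Failure.TIMEOUT
    if isinstance(exc, (KeyError, ValueError, TypeError)):
        return Failure.SCHEMA  # includes json.JSONDecodeError, a ValueError subclass
    return Failure.UNKNOWN


def next_delay(failure, attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next attempt, or None if a retry will not help."""
    if failure is Failure.RATE_LIMIT:
        # Honor the server's reset window when it is known.
        return float(retry_after) if retry_after is not None else cap
    if failure is Failure.TIMEOUT:
        return min(cap, base * 2 ** attempt)  # capped exponential backoff
    return None  # schema or unknown failures need adaptation, not a blind retry
```

For example, next_delay(classify(ToolError(status=429, retry_after=7)), attempt=0, retry_after=7) waits for the reset window, while a schema failure returns None so the caller can re-request with explicit format instructions or fall back to a cached response.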

Error Handling

| Failure Mode | Detection | Recovery |
| --- | --- | --- |
| Tool timeout | `asyncio.TimeoutError` or HTTP 408/504 | Exponential backoff, max 3 retries, then mark as degraded and continue with partial data |
| Rate limiting | HTTP 429 or `Retry-After` header | Wait for reset window, use secondary API key if available |
| Malformed response | Schema validation failure (missing required fields, wrong types) | Re-request with explicit format instructions; fall back to cached last-known-good response |
| Cascading dependency failure | Upstream tool in FAILED state | Skip dependent tools, produce partial result with explicit gap annotation |
| Context overflow in multi-turn | Token count exceeds budget | Summarize completed branches, discard raw intermediate data, seed fresh context with summary |
| Parallel path divergence | Aggregator detects contradictory results across paths | Flag contradiction explicitly, prefer result supported by majority of paths, log minority result for review |
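
The recovery for a cascading dependency failure (skip dependents and annotate the gap) can be sketched as a planning pass over the dependency graph. The function plan_after_failures and its return shape are assumptions for illustration; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def plan_after_failures(dependencies, failed):
    """Split tools into runnable and skipped, given a set of failed tools.

    dependencies: dict mapping tool -> list of upstream tools.
    failed: set of tools already known to have failed.
    Returns (runnable, gaps), where gaps maps each skipped tool to the
    failed upstream tools responsible for the skip.
    """
    graph = {tool: set(deps) for tool, deps in dependencies.items()}
    blocked = {}   # tool -> set of root-cause failures
    runnable = []
    for tool in TopologicalSorter(graph).static_order():
        if tool in failed:
            blocked[tool] = {tool}
            continue
        causes = set()
        for dep in dependencies.get(tool, []):
            causes |= blocked.get(dep, set())
        if causes:
            blocked[tool] = causes   # skipped because something upstream failed
        else:
            runnable.append(tool)
    gaps = {tool: sorted(causes) for tool, causes in blocked.items() if tool not in failed}
    return runnable, gaps


dependencies = {
    "web_search": [],
    "paper_finder": ["web_search"],
    "news_search": ["web_search"],
    "pdf_extractor": ["paper_finder", "news_search"],
    "summarizer": ["pdf_extractor"],
}
runnable, gaps = plan_after_failures(dependencies, failed={"paper_finder"})
```

The gaps mapping can be attached to the partial result so a reader sees which outputs are missing and which upstream failure caused each omission.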

Limitations

Parallel fan-out multiplies cost. Running 3-5 parallel reasoning paths means 3-5x the tool calls and API spend. Only use Heavy Thinking for genuinely ambiguous decisions, not routine operations.
DAG modeling adds upfront complexity. For simple linear pipelines (A -> B -> C) with reliable tools, the overhead of formal dependency graphs is not justified. Use this pattern when you have 5+ tools with branching dependencies.
Noise profiles require real production data. The failure rates and types must come from actual observations, not guesses. Incorrect noise profiles give false confidence. Start by instrumenting your current pipeline before building the noise-hardened version.
Context summarization loses information. Compressing multi-turn context into summaries inevitably drops details. For tasks requiring exact recall of earlier tool outputs, keep full context and accept the token cost.
This approach does not fix fundamentally broken tools. If an API is down 50% of the time, no amount of retry logic makes it reliable. Fix the tool first, then harden the pipeline around realistic failure rates.

Reference

Paper: LongCat-Flash-Thinking-2601 Technical Report — Focus on Section 4 (Environment Scaling and Tool Graph Construction), Section 5 (Noise-Aware Training with curriculum-based hardening), and Section 6 (Heavy Thinking mode for parallel reasoning with reflective aggregation). The DORA framework architecture in Section 3 provides the async execution model underlying the pipeline design.