---
name: agent-builder
description: Design, evaluate, and improve agentic harnesses for developer tools, assistants, workflow runtimes, copilots, and AI-powered products, including agents built on local or open-source models (Ollama, llama.cpp, vLLM, Llama, Qwen, DeepSeek, Mistral). Use when work involves tool-use architecture, permissions, approval gates, workflow state, durability, context and memory systems, evaluation strategy, observability, operator visibility, framework selection (LangGraph, CrewAI, Pydantic AI, smolagents, DSPy, AutoGen, DeepAgents), memory substrate choice, or phased implementation plans for an AI system. Trigger when symptoms imply harness gaps too, such as stale context, surprising tool calls, sessions that die on crash, missing approval controls, costs spiraling without clear visibility, tool counts crossing ~50, context windows routinely hitting 92%+ capacity, local-model agents hallucinating tool calls, or on-device agents failing silently after model swaps.
author: Tyrone Ross
version: 0.2.0
tags: [agentic-harness, agents, architecture, evaluation, memory, observability, frameworks, multi-agent, local-models, open-source, workflow, permissions]
category: developer-tools
difficulty: advanced
---
Agent Builder
Problem
Most AI products do not break because the model is too weak. They break at the harness layer: unclear tool boundaries, missing approval policy, brittle state, sloppy context assembly, no evaluation loop, and weak operator visibility. This skill turns those vague issues into concrete primitives, boundaries, phases, and checks — and grounds every decision in empirical evidence from production systems.
Two complementary bodies of knowledge ship with this skill:
- references/methodology/: how to decide. Prescriptive design and evaluation playbooks covering principles, shapes, tools, state, context, extensibility, UX, and output patterns.
- references/catalog/: what exists to choose from. Empirical inventory of architecture types (I–V), the six-component harness model, frameworks (LangGraph, CrewAI, Pydantic AI, smolagents, DSPy, AutoGen, Bedrock), memory substrates (filesystem, vector, in-context), and lab-specific patterns (Anthropic, OpenAI, Perplexity, Manus, Google, Devin, Cursor).
Trigger Conditions
Activate when any of the following hold:
- The user is designing or rebuilding an agent, assistant, copilot, or AI workflow
- The request mentions harness architecture, tool-use architecture, tool registries, permission layers, approval gates, workflow state, session persistence, retries, resumability, memory, evals, observability, or multi-agent design
- The user wants to evaluate an existing harness for risks, missing primitives, UX gaps, or operational weakness
- The user is choosing between frameworks (LangGraph vs CrewAI vs Pydantic AI vs smolagents vs DSPy vs AutoGen vs Bedrock), memory substrates, or coordination patterns
- The symptoms point to harness problems even if the word "harness" never appears:
  - tools fire without clear permission
  - sessions fail on crash or long waits
  - context gets stale or bloated (routinely hitting 92%+ capacity)
  - tool count climbs past ~50 and quality drops
  - operators cannot see what happened or why
  - costs, retries, or handoffs are drifting out of control
  - multi-agent setup is producing loops or systemic failures
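The two quantitative symptoms in this list can be turned into a cheap automated check. The following is a minimal sketch, not part of the skill itself: the `HarnessSnapshot` class and its field names are hypothetical, and the ~50-tool and 92% thresholds are copied from the list above as heuristics rather than hard limits.

```python
from dataclasses import dataclass

# Thresholds taken from the symptom list above; treat them as heuristics.
TOOL_COUNT_THRESHOLD = 50
CONTEXT_FILL_THRESHOLD = 0.92


@dataclass
class HarnessSnapshot:
    """Hypothetical per-session metrics; the field names are illustrative."""
    tool_count: int
    context_used_tokens: int
    context_limit_tokens: int


def harness_symptoms(snapshot: HarnessSnapshot) -> list[str]:
    """Return the quantitative harness symptoms present in a snapshot."""
    symptoms = []
    if snapshot.tool_count > TOOL_COUNT_THRESHOLD:
        symptoms.append("tool count past ~50")
    fill = snapshot.context_used_tokens / snapshot.context_limit_tokens
    if fill >= CONTEXT_FILL_THRESHOLD:
        symptoms.append("context at 92%+ capacity")
    return symptoms


print(harness_symptoms(HarnessSnapshot(64, 190_000, 200_000)))
```

The qualitative symptoms (surprising tool calls, poor operator visibility) still need a human read of logs or transcripts; this check only covers what can be counted.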
Default Posture
- Bias toward lean, solo-maintainable architecture.
- Start with a single-agent design unless clear constraints justify more.
- Require an evaluation plan even for greenfield builds.
- Prefer explicit system boundaries, permission policy, and workflow state over prompt cleverness.
- Translate ideas into implementation phases, success criteria, and failure tests.
- When justifying multi-agent, cite empirical cost: single agent ≈ 4× chat tokens, multi-agent ≈ 15× chat tokens, 70%+ of multi-agent failures are systemic (MAST), and only 11% of orgs run production agentic systems (Deloitte 2025). See references/catalog/01-architecture-taxonomy.md for sources.
- When the target is a local or open-source model, apply the stricter local-model posture: always start single-agent, cull tools aggressively (Vercel 80% reduction pattern), treat compaction as non-negotiable (4K–32K context windows), and treat evals as load-bearing, not optional. See references/catalog/06-local-and-open-source-models.md.
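To make the cost figures concrete, here is a back-of-the-envelope calculation. It is a sketch that assumes the ≈4× and ≈15× multipliers quoted above hold for your workload; the 2,000-token chat baseline and the function name are made up for illustration.

```python
# Multipliers relative to a plain chat interaction, as quoted above.
TOKEN_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def estimated_tokens(chat_baseline_tokens: int, architecture: str) -> int:
    """Scale a chat-sized token budget to an agent architecture."""
    return chat_baseline_tokens * TOKEN_MULTIPLIER[architecture]


baseline = 2_000  # hypothetical tokens for one chat interaction
for arch in TOKEN_MULTIPLIER:
    print(f"{arch}: {estimated_tokens(baseline, arch):,} tokens")
```

On these numbers a multi-agent design spends 3.75× the tokens of a single agent on the same task, so it has to buy a comparable gain in quality or throughput to be worth it.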
Step 0 — Gather Context
Before routing, make sure you have enough to work with.
For design work, confirm:
- what product or system the harness serves
- what actions the agent will take
- who the users are
- any known constraints (solo maintenance, existing stack, timeline)
For evaluation work, inspect the harness itself:
- read the codebase, agent config, skills, hooks, or architecture docs
- if evidence is missing, ask for the narrowest missing input and keep moving
- do not evaluate from vibes alone
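For design work, the checklist above reduces to a small completeness check. A minimal sketch, assuming the gathered context is held in a plain dictionary; the key names are illustrative and not defined by the skill.

```python
# Inputs Step 0 asks for before design work; the keys are illustrative names.
REQUIRED_DESIGN_CONTEXT = ("product", "agent_actions", "users", "constraints")


def missing_design_context(context: dict[str, str]) -> list[str]:
    """Return required inputs that are absent or blank, in checklist order."""
    return [
        key for key in REQUIRED_DESIGN_CONTEXT
        if not context.get(key, "").strip()
    ]


gaps = missing_design_context({"product": "PR review bot", "users": "solo maintainer"})
print(gaps)
```

Asking only about the first gap, then continuing, matches the "narrowest missing input" rule better than sending the user a full questionnaire.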
Step 1 — Classify The Request
Choose one mode before reading reference files.
design
User is creating a new harness, planning a major rebuild, or asking for architecture, MVP shape, or implementation sequencing.
Default reads: methodology/01-principles-and-solo-dev-defaults.md, methodology/02-harness-shapes-and-architecture.md, methodology/08-design-and-build-playbook.md, catalog/01-architecture-taxonomy.md, references/templates/design-deliverable.md.
evaluation
User has a harness and wants gaps, risks, missing primitives, UX upgrades, or architectural cleanup.
Default reads: methodology/01-principles-and-solo-dev-defaults.md, methodology/09-evaluation-and-improvement-playbook.md, catalog/02-harness-components.md, references/templates/evaluation-deliverable.md.
design + evaluation
User wants a target architecture and a way to verify it, compare it with an existing system, or define acceptance criteria before building.
Default reads: union of the two above.
catalog-lookup
User is asking a factual question about what exists — "which framework", "how does Anthropic's orchestrator work", "what memory substrate", "what's the adoption rate of Type III". Route straight to the catalog. Do not dump methodology files for this mode.
Default reads: only the catalog file(s) relevant to the question. Cite the exact file and section. Surface trade-offs. Do not turn framework questions into flame wars.
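The routing in this step amounts to a lookup from mode to default reads. The sketch below illustrates that lookup; the constant and function names are hypothetical, the file paths are copied from the defaults listed above, and `catalog-lookup` returns an empty list because its reads depend on the question.

```python
DESIGN_READS = [
    "methodology/01-principles-and-solo-dev-defaults.md",
    "methodology/02-harness-shapes-and-architecture.md",
    "methodology/08-design-and-build-playbook.md",
    "catalog/01-architecture-taxonomy.md",
    "references/templates/design-deliverable.md",
]
EVALUATION_READS = [
    "methodology/01-principles-and-solo-dev-defaults.md",
    "methodology/09-evaluation-and-improvement-playbook.md",
    "catalog/02-harness-components.md",
    "references/templates/evaluation-deliverable.md",
]


def default_reads(mode: str) -> list[str]:
    """Return the default reference files for a request mode."""
    if mode == "design":
        return list(DESIGN_READS)
    if mode == "evaluation":
        return list(EVALUATION_READS)
    if mode == "design + evaluation":
        # Union of both lists, order preserved, shared files listed once.
        return list(dict.fromkeys(DESIGN_READS + EVALUATION_READS))
    if mode == "catalog-lookup":
        # No fixed defaults: pick only the catalog file(s) the question needs.
        return []
    raise ValueError(f"unknown mode: {mode!r}")


print(len(default_reads("design + evaluation")))
```

The combined mode yields eight files rather than nine because the principles file appears in both default sets.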
Step 2 — Classify The Product Shape
Pick the closest shape and state the assumption if ambiguous:
| Shape (methodology) | Maps to Catalog Type |
|---|---|
| chat assistant | Type I (Augmented Assistant) |
| workflow orchestrator | Type II (Workflow Automaton) or Type III (Orchestrated Team) |
| code agent | Type III (Claude Code / Devin / Cursor patterns) |
| internal copilot | Type I or Type II |
| embedded AI product feature | Type I or Type II |
| hybrid system | Type III+ |
Step 3 — Read The Smallest Useful Reference Set
Read only the files the request actually needs. This file is the index — do not rely on reference-to-reference chains.
Methodology (how to decide)
- methodology/01-principles-and-solo-dev-defaults.md: almost every request. Defines default decision posture.
- methodology/02-harness-shapes-and-architecture.md: choosing system shape, boundaries, lifecycle, transports, deployment.
- methodology/03-tools-execution-and-permissions.md: tool registries, tool calling, approval gates, sandboxes, trust tiers.
- methodology/04-state-sessions-and-durability.md: sessions, resumability, retries, idempotency, approval waits, long-running work.
- methodology/05-context-memory-and-evaluation.md: context windows, retrieval, memory, provenance, evals, replay tests, regression detection.
- methodology/06-agents-and-extensibility.md: multi-agent design, plugins, hooks, skills, extension surfaces.
- methodology/07-ux-observability-and-operations.md: streaming UX, health checks, logs, analytics, budgets, supportability.
- methodology/08-design-and-build-playbook.md: build-ready plan from idea to implementation.
- methodology/09-evaluation-and-improvement-playbook.md: findings, missing primitives, upgrade priorities, acceptance tests.
- methodology/10-example-requests-and-output-patterns.md: prompt examples, response structure examples.
- methodology/11-codex-translation-notes.md: adapting this skill for Codex or other LLM clients; cross-client portability notes.
Catalog (what exists)
- catalog/01-architecture-taxonomy.md: Type I–V classification, adoption rates, 4 debates (single-vs-multi, frameworks-vs-raw, scaffolding-vs-minimal, augment-vs-automate), 10 verified stats, coordination patterns, architecture timeline.
- catalog/02-harness-components.md: six-component harness model (prompt / tools / memory / context / error / observability) and its mapping to the methodology topic files.
- catalog/03-frameworks.md: LangGraph, CrewAI, Pydantic AI, smolagents, DSPy, AutoGen, Bedrock AgentCore. Decision tree for framework selection.
- catalog/04-memory-substrates.md: filesystem-as-memory, vector DB, in-context, CoALA framework, Claude Code memory tiers, Voyager skill library, DSPy optimization formats, self-improvement patterns (MCTS, OPRO, PromptBreeder, Gödel Agent).
- catalog/05-lab-patterns.md: production architecture patterns from Anthropic, OpenAI, Perplexity, LangChain DeepAgents, Manus, Google ADK, Microsoft AutoGen/Copilot, Meta Llama Stack, DeepSeek, Cohere, Devin, xAI Grok, Cursor, Windsurf.
- catalog/06-local-and-open-source-models.md: constraints and patterns for agents on local/open-source models (Ollama, llama.cpp, vLLM, Llama, Qwen, DeepSeek, Mistral, Phi, Gemma). Tool-call reliability tiers, framework fit for local deployment, failure modes, decision tree by hardware, three non-obvious insights for local agents.
Templates (output shapes)
- references/templates/design-deliverable.md: use when producing a design output.
- references/templates/evaluation-deliverable.md: use when producing an evaluation output.
Examples (calibration)
- examples/design-solo-pr-review-agent.md: worked design deliverable for a solo-maintainer PR review agent. Reference this to calibrate output format and level of detail.
- examples/evaluation-research-orchestrator.md: worked evaluation deliverable for a Type III research orchestrator with 9 specialists. Reference this to calibrate findings ordering, severity rationale, and upgrade-path structure.
Operating Rules
- Convert vague ambitions into concrete harness primitives.
- Push back on unnecessary complexity.
- Treat workflow state, permissions, context assembly, and evaluation as first-class architecture, not cleanup tasks.
- Separate universal harness primitives from product-specific manifestation.
- For evaluation requests, present findings first and improvement sequence second.
- For design requests, include how the design will be tested before calling it done.
- When recommending a framework, memory substrate, or multi-agent pattern, cite the catalog file you pulled it from.
Output Contract
For design
- recommended harness shape
- core primitives and subsystem boundaries
- MVP boundary
- phased implementation plan
- verification and acceptance criteria
For evaluation
- findings ordered by severity or leverage
- missing or weak primitives
- user experience and operational gaps
- prioritized upgrade path
- tests or checks that confirm the fixes
For design + evaluation
- target architecture
- comparison against current or likely failure modes
- implementation phases
- acceptance criteria
- evaluation plan covering regressions, safety, and UX
For catalog-lookup
- direct answer to the factual question
- trade-offs relative to alternatives in the same catalog file
- source citation (catalog/NN-filename.md § Section)
- one-line pointer to the methodology file that operationalizes the choice, if applicable
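The output contract can be checked mechanically before a deliverable goes out. A minimal sketch, assuming a deliverable is represented as a set of section labels; the `OUTPUT_CONTRACT` table and `missing_sections` helper are hypothetical, and the labels mirror the four lists above.

```python
OUTPUT_CONTRACT = {
    "design": [
        "recommended harness shape",
        "core primitives and subsystem boundaries",
        "MVP boundary",
        "phased implementation plan",
        "verification and acceptance criteria",
    ],
    "evaluation": [
        "findings ordered by severity or leverage",
        "missing or weak primitives",
        "user experience and operational gaps",
        "prioritized upgrade path",
        "tests or checks that confirm the fixes",
    ],
    "design + evaluation": [
        "target architecture",
        "comparison against current or likely failure modes",
        "implementation phases",
        "acceptance criteria",
        "evaluation plan covering regressions, safety, and UX",
    ],
    "catalog-lookup": [
        "direct answer to the factual question",
        "trade-offs relative to alternatives in the same catalog file",
        "source citation",
        # The methodology pointer is "if applicable", so it is not enforced here.
    ],
}


def missing_sections(mode: str, present: set[str]) -> list[str]:
    """List contract sections for a mode that the deliverable lacks."""
    return [section for section in OUTPUT_CONTRACT[mode] if section not in present]


draft = {"recommended harness shape", "MVP boundary", "phased implementation plan"}
print(missing_sections("design", draft))
```

A draft like the one above, which has a plan but no primitives or acceptance criteria, is the usual way a design response ends up describing construction without verification.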
Final Check Before Responding
- Did you keep the design lean enough for a solo developer unless the request clearly demanded more?
- Did you avoid recommending multi-agent coordination by default?
- Did you include evaluation, not just construction?
- Did you give the user an operational path forward instead of abstract theory?
- If you recommended multi-agent, a framework, or a memory substrate, did you cite the catalog file you pulled it from?
- If the target is a local/open-source model, did you apply the stricter posture from catalog/06-local-and-open-source-models.md (single-agent always, cull tools, compaction mandatory, evals non-optional)?
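Some of these final checks are mechanical enough to encode. The sketch below covers only the countable ones (evaluation plan present, catalog citation present, local-model posture); the `DraftResponse` fields are hypothetical names, and the judgment calls such as "lean enough" or "operational path forward" remain a manual review.

```python
from dataclasses import dataclass, field


@dataclass
class DraftResponse:
    """Hypothetical summary of a drafted recommendation."""
    agent_count: int = 1
    targets_local_model: bool = False
    has_compaction: bool = False
    has_eval_plan: bool = False
    # True when the draft recommends a framework or a memory substrate.
    recommends_catalog_choice: bool = False
    cited_catalog_files: list[str] = field(default_factory=list)


def final_check(draft: DraftResponse) -> list[str]:
    """Return the mechanically checkable problems in a draft response."""
    problems = []
    if not draft.has_eval_plan:
        problems.append("no evaluation plan")
    needs_citation = draft.recommends_catalog_choice or draft.agent_count > 1
    if needs_citation and not draft.cited_catalog_files:
        problems.append("catalog-backed recommendation without a catalog citation")
    if draft.targets_local_model:
        if draft.agent_count > 1:
            problems.append("local-model target must stay single-agent")
        if not draft.has_compaction:
            problems.append("local-model target needs compaction")
    return problems


print(final_check(DraftResponse(agent_count=3, targets_local_model=True)))
```

An empty list means only that the countable checks pass, not that the response is ready to send.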