name: ai-agent-design
description: >
  Domain-agnostic expertise in designing effective AI agent systems. Use this skill when the user asks about "writing agent instructions", "prompt engineering for agents", "system prompts", "tool design", "agent evaluation", "agent evals", "reducing hallucinations", "agent cost optimization", "model selection", "agent reliability", "tool descriptions", "agent guardrails", "multi-agent coordination", or needs help making agents work better, behave more reliably, or cost less to run. Complements the mastra skill by providing the "why" behind agent design.
version: 0.1.0
license: MIT
metadata:
  author: Bala
  version: "0.1.0"
  repository: https://github.com/bala-design/mastra-claude-plugin
AI Agent Design
How to design AI agents that actually work. This skill covers the craft of agent design independent of any specific framework — the principles apply to Mastra and any other agent system.
Note on code examples: The design principles in this skill are stable and framework-agnostic, but any Mastra-specific code examples (model names, constructor params, `maxSteps` values) are illustrative. Before using them, verify current API signatures via the `mastra` skill's documentation lookup: check embedded docs first (`node_modules/@mastra/*/dist/docs/`), then fall back to remote docs (https://mastra.ai/llms.txt). Never rely on training data for Mastra API details.
Writing Effective Agent Instructions
Agent instructions (system prompts) are the single most important factor in agent quality. They define the agent's identity, boundaries, and decision-making process.
The Instruction Hierarchy
Structure instructions from most to least important:
- Identity & role — Who is this agent? What domain does it operate in?
- Boundaries — What should it NOT do? When should it refuse or escalate?
- Tool usage guidance — When to use which tool, in what order
- Output format — How to structure responses
- Edge cases — What to do when uncertain, when data is missing, etc.
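The ordering above can be enforced mechanically by assembling the prompt from named sections. The following is a minimal sketch, assuming a hypothetical `InstructionSections` shape and `buildInstructions` helper; neither is part of Mastra or any other framework.

```typescript
// Hypothetical shape: one field per level of the hierarchy above.
interface InstructionSections {
  identity: string;
  boundaries: string;
  toolGuidance?: string;
  outputFormat?: string;
  edgeCases?: string;
}

// Emit sections in priority order, skipping any that are missing or blank.
function buildInstructions(s: InstructionSections): string {
  const ordered: Array<[string, string | undefined]> = [
    ["Identity & role", s.identity],
    ["Boundaries", s.boundaries],
    ["Tool usage", s.toolGuidance],
    ["Output format", s.outputFormat],
    ["Edge cases", s.edgeCases],
  ];
  const parts: string[] = [];
  for (const [title, body] of ordered) {
    if (body !== undefined && body.trim() !== "") {
      parts.push(`${title}:\n${body.trim()}`);
    }
  }
  return parts.join("\n\n");
}

const supportInstructions = buildInstructions({
  identity: "You are a customer support agent for an e-commerce platform.",
  boundaries: "You do NOT handle billing disputes. Escalate those to the billing team.",
  toolGuidance: "Use orderLookupTool before returnTool.",
  edgeCases: "If the order ID is missing, ask for it before calling any tool.",
});
```

Keeping the order in one helper means a later edit cannot accidentally bury the boundaries below the formatting rules.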
Writing Principles
Be specific, not vague:
BAD: "You are a helpful assistant."
GOOD: "You are a customer support agent for an e-commerce platform.
You help users with order tracking, returns, and product questions.
You do NOT handle billing disputes — escalate those to the billing team."
Name your tools and explain when to use them:
BAD: "Use the available tools to help the user."
GOOD: "You have access to these tools:
- orderLookupTool: Use when the user asks about an order status or needs order details. Requires an order ID or customer email.
- returnTool: Use to initiate a return. Only use after confirming the order exists via orderLookupTool.
- faqSearchTool: Use for general product questions before crafting your own answer."
Define behavior at boundaries:
"If the user asks about something outside your scope (pricing changes, account deletion, legal matters):
1. Acknowledge their request
2. Explain you can't help with that specific topic
3. Suggest who can help (billing team, account management, legal@company.com)"
Include examples for ambiguous situations:
"When a user complains about a late delivery:
1. Look up the order first
2. Check the current shipping status
3. If the package is in transit, provide the tracking info and expected date
4. If the package is lost (no updates for 5+ days), offer a replacement or refund
5. Always apologize for the inconvenience regardless of the cause"
Dynamic Instructions
Use dynamic instructions (async functions) when:
- Instructions change based on user role, subscription tier, or context
- You want to A/B test different instruction variants
- Instructions reference data that changes frequently (feature flags, policy documents)
Do NOT use dynamic instructions when:
- Instructions are static and universal
- The overhead of fetching instructions adds latency without value
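As a rough illustration of the first two cases, the sketch below varies instructions by subscription tier and assigns a stable A/B variant per user. The `RequestContext` shape, `pickVariant`, and `resolveInstructions` are hypothetical names, not a framework API; check how your framework passes request context before adapting this.

```typescript
// Hypothetical request context; real frameworks expose their own shape.
interface RequestContext {
  userId: string;
  tier: "free" | "pro";
}

const BASE_INSTRUCTIONS = "You are a customer support agent for an e-commerce platform.";

// Deterministic A/B bucket: the same user always lands in the same variant.
function pickVariant(userId: string): "A" | "B" {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash % 2 === 0 ? "A" : "B";
}

// Async so it could await a feature-flag or policy lookup; this sketch does not.
async function resolveInstructions(ctx: RequestContext): Promise<string> {
  const tierLine =
    ctx.tier === "pro"
      ? "This user has priority support: offer a callback if the issue is unresolved."
      : "This user is on the free plan: point to self-service docs first.";
  const styleLine =
    pickVariant(ctx.userId) === "A"
      ? "Keep answers under three sentences."
      : "Explain your reasoning step by step.";
  return [BASE_INSTRUCTIONS, tierLine, styleLine].join("\n");
}
```

Hashing the user ID instead of calling a random generator keeps a user's experience consistent across turns, which also makes variant comparisons cleaner.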
Tool Design
Tools are the hands and eyes of an agent. Poorly designed tools cause agents to fail even with perfect instructions.
The Tool Design Checklist
One tool, one job. A tool should do exactly one thing. If you're tempted to add a "mode" parameter, make two tools instead.
Descriptions are for the LLM. Write the description as if explaining to a smart colleague what this tool does and when to use it. Be specific about inputs and expected outcomes.
Make schema fields self-documenting. Use descriptive names and `.describe()` on every field:

    // BAD
    z.object({ q: z.string(), n: z.number() })

    // GOOD
    z.object({
      query: z.string().describe("The search query to execute"),
      maxResults: z.number().default(10).describe("Maximum number of results to return (1-100)"),
    })

Constrain inputs. Use enums, min/max, and regex patterns to prevent invalid inputs:

    z.object({
      priority: z.enum(["low", "medium", "high"]).describe("Ticket priority level"),
      daysBack: z.number().min(1).max(90).describe("Number of days to search back"),
    })

Return structured errors. Never let tools throw unhandled exceptions; return error information the agent can reason about:

    execute: async (input) => {
      try {
        const data = await fetchData(input);
        return { success: true, data, error: null };
      } catch (e) {
        // `e` is `unknown` in strict TypeScript, so narrow it before reading `.message`.
        const message = e instanceof Error ? e.message : String(e);
        return { success: false, data: null, error: `Failed to fetch: ${message}` };
      }
    }

Minimize output size. Don't return entire database rows when the agent only needs two fields. Large outputs waste tokens and confuse the model.
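One way to apply the last item is to project each row down to the fields the agent needs before returning it. A small sketch, where `pickFields` and the `orderRow` data are hypothetical and only illustrate the idea:

```typescript
// Keep only the listed keys so the tool result stays small.
function pickFields<T extends object, K extends keyof T>(
  row: T,
  keys: readonly K[],
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of keys) {
    out[key] = row[key];
  }
  return out;
}

// Hypothetical database row carrying far more than the agent needs.
const orderRow = {
  id: "ord_123",
  status: "shipped",
  customerEmail: "a@example.com",
  internalNotes: "customer called twice",
  auditLog: ["created", "paid", "packed", "shipped"],
};

// The tool returns two fields instead of the whole row.
const toolOutput = pickFields(orderRow, ["id", "status"]);
```

Projecting at the tool boundary also keeps internal fields such as notes or audit logs out of the model's context.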
Tool Granularity Spectrum
Too granular: getUser, getUserEmail, getUserName, getUserOrders...
Too coarse: doEverything(action, params)
Sweet spot:
getUser(userId) → { name, email, role }
getUserOrders(userId, { limit, status }) → Order[]
createOrder(userId, items) → Order
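The sweet-spot signatures might look like this in TypeScript. This is a sketch only: the `User` and `Order` fields beyond those named above, and the in-memory arrays standing in for a data store, are assumptions made for illustration.

```typescript
interface User { name: string; email: string; role: string }
interface Order { id: string; userId: string; status: "open" | "shipped"; items: string[] }

// In-memory stand-ins for a real data store (illustrative only).
const users: Record<string, User> = {
  u1: { name: "Ada", email: "ada@example.com", role: "admin" },
};
const orders: Order[] = [
  { id: "o1", userId: "u1", status: "shipped", items: ["book"] },
  { id: "o2", userId: "u1", status: "open", items: ["pen"] },
];

// One call returns the commonly needed fields together.
function getUser(userId: string): User | undefined {
  return users[userId];
}

// Optional filters narrow the result without adding a "mode" parameter.
function getUserOrders(
  userId: string,
  opts: { limit?: number; status?: Order["status"] } = {},
): Order[] {
  const matches = orders.filter(
    (o) => o.userId === userId && (opts.status === undefined || o.status === opts.status),
  );
  return matches.slice(0, opts.limit ?? matches.length);
}

// Writes get their own tool, separate from reads.
function createOrder(userId: string, items: string[]): Order {
  const order: Order = { id: `o${orders.length + 1}`, userId, status: "open", items };
  orders.push(order);
  return order;
}
```

Each function maps to one intent the agent can name, while `limit` and `status` stay as filters on the data rather than switches that change what the tool does.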
When NOT to Use Tools
Not everything needs to be a tool. The LLM can:
- Format and restructure data it already has
- Do math on small numbers
- Generate text, summaries, translations
- Make decisions based on provided context
Use tools for things the LLM can't do: fetch live data, execute code, interact with external systems, perform precise calculations.
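Money arithmetic is a typical case where a tool beats in-model math, since even floating-point code gets it wrong (`0.1 + 0.2 !== 0.3` in JavaScript). The sketch below sums decimal strings as integer cents; `toCents` and `sumAmounts` are hypothetical helper names, and the two-decimal format is an assumption.

```typescript
// Parse a decimal string such as "19.99" into integer cents; null if malformed.
function toCents(amount: string): number | null {
  const match = /^(\d+)(?:\.(\d{1,2}))?$/.exec(amount.trim());
  if (!match) return null;
  const whole = parseInt(match[1] ?? "0", 10);
  const frac = match[2] ?? "";
  const fraction =
    frac === "" ? 0 : frac.length === 1 ? parseInt(frac, 10) * 10 : parseInt(frac, 10);
  return whole * 100 + fraction;
}

type SumResult =
  | { success: true; total: string }
  | { success: false; error: string };

// Sum amounts exactly and return a structured result the agent can reason about.
function sumAmounts(amounts: string[]): SumResult {
  let totalCents = 0;
  for (const amount of amounts) {
    const cents = toCents(amount);
    if (cents === null) {
      return { success: false, error: `Invalid amount: ${amount}` };
    }
    totalCents += cents;
  }
  const rem = totalCents % 100;
  const total = `${Math.floor(totalCents / 100)}.${rem < 10 ? "0" + rem : rem}`;
  return { success: true, total };
}
```

Invalid input comes back as a structured error rather than an exception, so the agent can ask the user to correct the amount.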
Model Selection Strategy
Matching Model to Task
| Task characteristics | Recommended tier | Examples |
|---|---|---|
| Simple routing, classification, extraction | Small/fast (gpt-4o-mini, haiku) | Ticket classification, entity extraction |
| General reasoning, tool use, conversation | Mid-tier (gpt-4o, sonnet) | Customer support, research, analysis |
| Complex reasoning, nuanced judgment, coding | Large (o1/o3, opus) | Architecture design, code review, legal analysis |
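The table can be encoded as a lookup so the tier decision lives in one place. A minimal sketch: the `TaskKind` labels are taken loosely from the table's examples, and the model IDs are deliberate placeholders to be replaced with whatever your provider currently offers.

```typescript
type Tier = "small" | "mid" | "large";

// Task kinds drawn from the table's examples; extend for your own workload.
type TaskKind =
  | "classification" | "extraction" | "routing"
  | "conversation" | "tool-use" | "research"
  | "code-review" | "architecture" | "legal-analysis";

const TIER_BY_TASK: Record<TaskKind, Tier> = {
  classification: "small",
  extraction: "small",
  routing: "small",
  conversation: "mid",
  "tool-use": "mid",
  research: "mid",
  "code-review": "large",
  architecture: "large",
  "legal-analysis": "large",
};

// Placeholder IDs, not real model names.
const MODEL_BY_TIER: Record<Tier, string> = {
  small: "your-small-model-id",
  mid: "your-mid-model-id",
  large: "your-large-model-id",
};

function selectModel(kind: TaskKind): string {
  return MODEL_BY_TIER[TIER_BY_TASK[kind]];
}
```

Because `TIER_BY_TASK` is typed as `Record<TaskKind, Tier>`, adding a new task kind without assigning it a tier is a compile error.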
Cost Optimization Patterns
Tiered model selection: Use cheap models for easy tasks, expensive ones for hard tasks. In agent networks, the routing agent can use a cheaper model while specialist agents use more capable ones.
Minimize context: Don't send the agent's entire conversation history when it only needs the last message. Use `lastMessages` judiciously.
Limit `maxSteps`: Set reasonable limits to prevent runaway tool loops. Start with 3-5 and increase only if agents consistently need more.
Cache tool results: If a tool fetches slowly-changing data, cache it rather than re-fetching every turn.
Structured output over parsing: Using schema-based structured output is cheaper and more reliable than asking the model to format JSON in its response.
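For the caching pattern, a time-to-live wrapper around the tool's fetch function is often enough. The sketch below is one possible shape, with `withTtlCache` as a hypothetical helper and an injectable clock so expiry can be tested; it is not a Mastra feature.

```typescript
// Wrap a fetcher so repeated calls with the same key reuse a recent result.
function withTtlCache<R>(
  fetcher: (key: string) => R,
  ttlMs: number,
  now: () => number = () => Date.now(),
): (key: string) => R {
  const entries = new Map<string, { value: R; expiresAt: number }>();
  return (key: string): R => {
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      return hit.value; // still fresh: skip the fetch
    }
    const value = fetcher(key);
    entries.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage: a stand-in for a slow lookup of rarely changing data.
let policyFetches = 0;
const getPolicy = withTtlCache((region: string) => {
  policyFetches++;
  return `return policy for ${region}`;
}, 60000);

getPolicy("EU");
getPolicy("EU"); // served from cache; policyFetches stays at 1
```

If the fetcher returns a promise, the wrapper caches the promise itself, so a rejected fetch stays cached until the entry expires; evict on failure if that matters for your tool.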
Evaluation & Reliability
What to Measure
| Metric | What it tells you | How to measure |
|---|---|---|
| Task completion | Does the agent achieve the goal? | Manual review, automated checks |
| Tool accuracy | Does it call the right tools with right inputs? | Log tool calls, compare to expected |
| Hallucination rate | Does it make things up? | Compare claims against ground truth |
| Latency | How long does a response take? | End-to-end timing |
| Token usage | How much does each interaction cost? | Sum input + output tokens per turn |
| User satisfaction | Do users find it helpful? | Thumbs up/down, follow-up questions |
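Several of these metrics can be computed directly from per-turn logs. A sketch under assumptions: the `TurnLog` shape is hypothetical, and expected tools come from a labeled test set rather than production traffic.

```typescript
// Hypothetical log record for one agent turn.
interface TurnLog {
  expectedTool: string | null; // null means no tool call was expected
  actualTool: string | null;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

interface MetricsSummary {
  toolAccuracy: number;     // fraction of turns with the expected tool
  avgTokensPerTurn: number; // input + output tokens
  avgLatencyMs: number;
}

function summarize(logs: TurnLog[]): MetricsSummary {
  const n = logs.length;
  if (n === 0) {
    return { toolAccuracy: 0, avgTokensPerTurn: 0, avgLatencyMs: 0 };
  }
  const correct = logs.filter((l) => l.expectedTool === l.actualTool).length;
  const tokens = logs.reduce((sum, l) => sum + l.inputTokens + l.outputTokens, 0);
  const latency = logs.reduce((sum, l) => sum + l.latencyMs, 0);
  return {
    toolAccuracy: correct / n,
    avgTokensPerTurn: tokens / n,
    avgLatencyMs: latency / n,
  };
}
```

This only checks which tool was called; comparing tool inputs against expected arguments would need a per-tool equality rule.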
Guardrails
Input guardrails (before the agent processes):
- Validate user input format and length
- Detect prompt injection attempts
- Filter out-of-scope requests early
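An input guardrail can be a plain function that runs before the agent is invoked. The sketch below is deliberately naive: the length limit and phrase list are assumptions for illustration, and pattern matching alone will not catch a determined prompt injection.

```typescript
type GuardrailResult =
  | { allowed: true }
  | { allowed: false; reason: string };

const MAX_INPUT_CHARS = 2000; // assumed limit; tune per application

// Illustrative phrases only; real detection needs more than a fixed list.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (your )?system prompt/i,
];

function checkInput(text: string): GuardrailResult {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    return { allowed: false, reason: "empty input" };
  }
  if (trimmed.length > MAX_INPUT_CHARS) {
    return { allowed: false, reason: "input too long" };
  }
  for (const pattern of SUSPICIOUS_PATTERNS) {
    if (pattern.test(trimmed)) {
      return { allowed: false, reason: "possible prompt injection" };
    }
  }
  return { allowed: true };
}
```

Returning a reason alongside the verdict lets the caller log rejections and show the user a specific message instead of a generic failure.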
Output guardrails (before the response reaches the user):
- Check for PII in responses
- Validate factual claims against tools
- Ensure response format matches requirements
Execution guardrails (during agent processing):
- `maxSteps` to prevent infinite loops
- Timeout limits on tool execution
- Budget caps on token usage per session
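The step limit and token budget can be pictured as a loop that stops for whichever reason fires first. This is a framework-independent sketch: `runWithLimits` and `StepResult` are hypothetical, and frameworks that expose `maxSteps` implement the loop for you. Tool timeouts are omitted here.

```typescript
interface StepResult {
  done: boolean;      // true once the agent has a final answer
  tokensUsed: number; // tokens consumed by this step
}

type StopReason = "completed" | "max_steps" | "token_budget";

function runWithLimits(
  step: (stepIndex: number) => StepResult,
  limits: { maxSteps: number; tokenBudget: number },
): { reason: StopReason; steps: number; tokens: number } {
  let tokens = 0;
  for (let i = 0; i < limits.maxSteps; i++) {
    const result = step(i);
    tokens += result.tokensUsed;
    if (result.done) {
      return { reason: "completed", steps: i + 1, tokens };
    }
    if (tokens >= limits.tokenBudget) {
      return { reason: "token_budget", steps: i + 1, tokens };
    }
  }
  return { reason: "max_steps", steps: limits.maxSteps, tokens };
}
```

Reporting the stop reason makes it possible to tell a runaway tool loop apart from a task that simply needs a higher limit.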
The Testing Pyramid for Agents
            /  Manual testing   \          ← Exploratory, edge cases
           /  Integration tests  \         ← Full agent + tools + memory
          /   Agent eval suites   \        ← Automated scoring on test cases
         /     Tool unit tests     \       ← Inputs → outputs for each tool
        /  Schema validation tests  \      ← Schema edge cases, types
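The eval-suite layer can start as a table of cases scored against the agent's behavior. A minimal sketch: `runEvalSuite` and `stubAgent` are hypothetical, with the stub standing in for a real agent call, and the tool names are borrowed from the instruction examples earlier in this skill.

```typescript
interface EvalCase {
  input: string;
  expectedTool: string;
}

// Simplified agent signature: reports which tool it chose for an input.
type Agent = (input: string) => { toolCalled: string };

function runEvalSuite(
  agent: Agent,
  cases: EvalCase[],
): { passRate: number; failures: string[] } {
  const failures = cases
    .filter((c) => agent(c.input).toolCalled !== c.expectedTool)
    .map((c) => c.input);
  const passRate =
    cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { passRate, failures };
}

// Stub agent for illustration: routes on a single keyword.
const stubAgent: Agent = (input) => ({
  toolCalled: /return/i.test(input) ? "returnTool" : "orderLookupTool",
});

const evalReport = runEvalSuite(stubAgent, [
  { input: "Where is my order?", expectedTool: "orderLookupTool" },
  { input: "I want to return this item", expectedTool: "returnTool" },
  { input: "Do you ship to Canada?", expectedTool: "faqSearchTool" },
]);
```

Listing the failing inputs, not just the pass rate, shows which instruction or tool description to revise next.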
Resources
- Tool design patterns: `references/tool-design-patterns.md`
- Instruction templates: `references/instruction-templates.md`