mastra-testing - SKILL.md Agent Skill

name: mastra-testing
description: >
  Testing strategies for Mastra AI applications. Use this skill when the user asks about "testing agents", "testing workflows", "unit tests for tools", "agent evals", "evaluating agent quality", "Mastra scorers", "integration testing", "testing with Studio", "mocking tools", "test fixtures", "agent testing strategies", "workflow testing", "snapshot testing agents", or needs help writing tests for any Mastra component. Also triggers when discussing quality assurance, reliability, or production readiness of agents.
version: 0.1.0
license: MIT
metadata:
  author: Bala
  version: "0.1.0"
  repository: https://github.com/bala-design/mastra-claude-plugin

Mastra Testing

How to test Mastra applications at every level — from unit tests on individual tools to end-to-end eval suites that measure agent quality in production.

Note on test examples: The testing patterns and strategies here are stable, but specific Mastra APIs used in examples (agent.generate(), workflow.createRun(), Memory constructor, scorer setup) may change across versions. Before copying test code, verify current method signatures via the mastra skill's embedded docs (node_modules/@mastra/*/dist/docs/). The testing methodology itself is evergreen.

The Testing Pyramid for Mastra

                /  Studio / Manual  \        ← Exploratory testing, edge cases
               /   Agent Eval Suites \       ← Automated quality scoring
              /   Integration Tests    \     ← Full agent + tools + memory
             /     Workflow Tests        \   ← Step chains, branching, state
            /       Tool Unit Tests       \  ← Input → output for each tool
           /   Schema Validation Tests     \ ← Type safety, edge cases

Start at the bottom and work up. Many Mastra projects start with no tests at all, so even a handful of schema and tool tests is a real improvement.

Level 1: Schema Validation Tests

Test that your Zod schemas accept valid data and reject invalid data. These are the cheapest tests to write and run, and they catch malformed input before it reaches a tool or an LLM call.

import { describe, it, expect } from "vitest";
import { z } from "zod";

// In a real project, import this schema from your tool module.
// It is defined inline here so the example is self-contained.
const orderInputSchema = z.object({
  orderId: z.string().regex(/^ORD-\d+$/),
  includeHistory: z.boolean().default(false),
});

describe("Order Input Schema", () => {
  it("accepts valid order ID", () => {
    const result = orderInputSchema.safeParse({ orderId: "ORD-12345" });
    expect(result.success).toBe(true);
  });

  it("rejects invalid order ID format", () => {
    const result = orderInputSchema.safeParse({ orderId: "12345" });
    expect(result.success).toBe(false);
  });

  it("applies default for includeHistory", () => {
    const result = orderInputSchema.parse({ orderId: "ORD-1" });
    expect(result.includeHistory).toBe(false);
  });

  it("rejects missing required fields", () => {
    const result = orderInputSchema.safeParse({});
    expect(result.success).toBe(false);
  });
});

Level 2: Tool Unit Tests

Test tool execute functions in isolation. Mock external dependencies.

import { describe, it, expect, vi } from "vitest";

// Mock the external API before importing the tool
vi.mock("../lib/api-client", () => ({
  fetchOrder: vi.fn(),
}));

import { orderLookupTool } from "../tools/order-lookup";
import { fetchOrder } from "../lib/api-client";

describe("Order Lookup Tool", () => {
  it("returns order data for valid ID", async () => {
    const mockOrder = {
      id: "ORD-123",
      status: "shipped",
      items: [{ name: "Widget", qty: 2 }],
    };
    vi.mocked(fetchOrder).mockResolvedValue(mockOrder);

    const result = await orderLookupTool.execute({
      orderId: "ORD-123",
    });

    expect(result.order).toEqual(mockOrder);
    expect(fetchOrder).toHaveBeenCalledWith({ orderId: "ORD-123" });
  });

  it("returns error for non-existent order", async () => {
    vi.mocked(fetchOrder).mockResolvedValue(null);

    const result = await orderLookupTool.execute({
      orderId: "ORD-999",
    });

    expect(result.error).toBeTruthy();
    expect(result.order).toBeNull();
  });

  it("handles API failures gracefully", async () => {
    vi.mocked(fetchOrder).mockRejectedValue(new Error("API timeout"));

    const result = await orderLookupTool.execute({
      orderId: "ORD-123",
    });

    expect(result.error).toContain("API timeout");
    expect(result.order).toBeNull();
  });
});
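
The three tests above assume the tool turns both a missing order and a thrown API error into an `{ order, error }` result instead of letting the exception escape. A minimal sketch of execute logic that would satisfy that contract follows. `lookupOrder`, `Order`, `LookupResult`, and `FetchOrder` are hypothetical names, and the fetcher is passed in as a parameter so the sketch runs without Mastra or a mocking library.

```typescript
// Hypothetical shapes; a real tool would derive these from its Zod schemas.
type Order = { id: string; status: string };
type LookupResult = { order: Order | null; error: string | null };
type FetchOrder = (args: { orderId: string }) => Promise<Order | null>;

// Core logic that a tool's execute function could delegate to.
async function lookupOrder(
  fetchOrder: FetchOrder,
  input: { orderId: string },
): Promise<LookupResult> {
  try {
    const order = await fetchOrder({ orderId: input.orderId });
    if (order === null) {
      return { order: null, error: `Order ${input.orderId} not found` };
    }
    return { order, error: null };
  } catch (err) {
    // Return the failure as data so the caller can report it instead of crashing.
    const message = err instanceof Error ? err.message : String(err);
    return { order: null, error: `Lookup failed: ${message}` };
  }
}
```

Keeping this logic in a plain function also lets you test it by passing a fake fetcher directly, without `vi.mock`.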

Testing Tool Schemas End-to-End

Verify the full input validation → execute → output validation pipeline:

it("validates input schema before execution", async () => {
  // This should fail at the schema level, not the execute level
  await expect(
    orderLookupTool.execute({ orderId: 12345 }) // number, not string
  ).rejects.toThrow();
});

Level 3: Workflow Tests

Test workflow step chains, branching, and state management.

Testing Individual Steps

import { describe, it, expect } from "vitest";
import { formatStep } from "../workflows/pipeline";

describe("Format Step", () => {
  it("uppercases the input message", async () => {
    const result = await formatStep.execute({
      inputData: { message: "hello world" },
    });

    expect(result).toEqual({ formatted: "HELLO WORLD" });
  });
});

Testing Full Workflow Execution

import { describe, it, expect } from "vitest";
import { contentPipeline } from "../workflows/content-pipeline";

describe("Content Pipeline Workflow", () => {
  it("processes input through all steps", async () => {
    const run = await contentPipeline.createRun();
    const result = await run.start({
      inputData: { topic: "AI testing strategies" },
    });

    expect(result.status).toBe("success");
    expect(result.result).toBeDefined();
    expect(result.result.final).toContain("AI");
  });

  it("handles failure in intermediate steps", async () => {
    const run = await contentPipeline.createRun();
    const result = await run.start({
      inputData: { topic: "" }, // Empty topic should fail
    });

    expect(result.status).toBe("failed");
    expect(result.error).toBeDefined();
  });
});

Testing Branching Workflows

describe("Ticket Router Workflow", () => {
  it("routes billing queries to billing step", async () => {
    const run = await ticketRouter.createRun();
    const result = await run.start({
      inputData: { message: "I was charged twice" },
    });

    expect(result.status).toBe("success");
    expect(result.steps["billing-handler"]).toBeDefined();
    expect(result.steps["technical-handler"]).toBeUndefined();
  });

  it("routes technical queries to technical step", async () => {
    const run = await ticketRouter.createRun();
    const result = await run.start({
      inputData: { message: "The API returns 500 errors" },
    });

    expect(result.status).toBe("success");
    expect(result.steps["technical-handler"]).toBeDefined();
  });
});

Testing Suspend/Resume

describe("Approval Workflow", () => {
  it("suspends for approval and resumes", async () => {
    const run = await approvalWorkflow.createRun();

    // Start — should suspend
    const result1 = await run.start({
      inputData: { amount: 5000, description: "New laptop" },
    });
    expect(result1.status).toBe("suspended");
    expect(result1.suspendPayload).toBeDefined();

    // Resume with approval
    const result2 = await run.resume({
      resumeData: { approved: true, approverNotes: "Looks good" },
    });
    expect(result2.status).toBe("success");
    expect(result2.result.approved).toBe(true);
  });

  it("handles rejection on resume", async () => {
    const run = await approvalWorkflow.createRun();

    await run.start({
      inputData: { amount: 50000, description: "Sports car" },
    });

    const result = await run.resume({
      resumeData: { approved: false, approverNotes: "Over budget" },
    });
    expect(result.status).toBe("success");
    expect(result.result.approved).toBe(false);
  });
});

Testing Workflow State

describe("Stateful Workflow", () => {
  it("accumulates state across steps", async () => {
    const run = await statefulWorkflow.createRun();
    const result = await run.start({
      inputData: { items: ["a", "b", "c"] },
      initialState: { processedCount: 0 },
    });

    expect(result.status).toBe("success");
    expect(result.state.processedCount).toBe(3);
  });
});

Level 4: Agent Integration Tests

Test agents with real (or mocked) LLM calls and tools.

Basic Agent Response Test

import { describe, it, expect } from "vitest";
import { supportAgent } from "../agents/support-agent";

describe("Support Agent", () => {
  it("responds to greeting", async () => {
    const result = await supportAgent.generate("Hello, I need help");

    expect(result.text).toBeTruthy();
    expect(result.text.length).toBeGreaterThan(10);
  });

  it("uses order lookup tool when given an order ID", async () => {
    const result = await supportAgent.generate(
      "Where is my order ORD-12345?",
      { maxSteps: 5 }
    );

    // Check that the tool was called
    const toolCalls = result.steps.flatMap(s => s.toolCalls || []);
    const orderLookup = toolCalls.find(tc => tc.toolName === "orderLookupTool");
    expect(orderLookup).toBeDefined();
    expect(orderLookup?.args.orderId).toBe("ORD-12345");
  });
});

Testing Tool Selection

it("selects the right tool for the query", async () => {
  const queries = [
    { input: "Track order ORD-555", expectedTool: "orderLookupTool" },
    { input: "I want to return my purchase", expectedTool: "returnTool" },
    { input: "What's your refund policy?", expectedTool: "faqSearchTool" },
  ];

  for (const { input, expectedTool } of queries) {
    const result = await agent.generate(input, { maxSteps: 3 });
    const toolCalls = result.steps.flatMap(s => s.toolCalls || []);
    const usedTools = toolCalls.map(tc => tc.toolName);
    expect(usedTools).toContain(expectedTool);
  }
});
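
The agent tests in this section repeat the same `flatMap` over `result.steps`. A small helper can keep that in one place. The sketch below assumes the result shape used in this section (`steps[].toolCalls[].toolName`), which may differ in your Mastra version. `toolNamesUsed`, `ToolCall`, and `GenerateResult` are hypothetical names.

```typescript
// Minimal structural types matching the result shape assumed in this section.
type ToolCall = { toolName: string; args?: Record<string, unknown> };
type GenerateResult = { text: string; steps: Array<{ toolCalls?: ToolCall[] }> };

// Return the names of all tools called during a generate() run, in call order.
function toolNamesUsed(result: GenerateResult): string[] {
  return result.steps
    .flatMap((step) => step.toolCalls ?? [])
    .map((call) => call.toolName);
}

// Example with a hand-written result object (no LLM call involved).
const sample: GenerateResult = {
  text: "Your order has shipped.",
  steps: [
    { toolCalls: [{ toolName: "orderLookupTool", args: { orderId: "ORD-555" } }] },
    {}, // a step with no tool calls
  ],
};
console.log(toolNamesUsed(sample)); // [ 'orderLookupTool' ]
```

In a test, the assertion then reads `expect(toolNamesUsed(result)).toContain(expectedTool)`.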

Cost-Aware Testing

For development, use cheaper models or mock the LLM:

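// Create a test-specific agent with a cheaper model.
// Spread a plain config object rather than the Agent instance: an instance's
// own properties are not guaranteed to match the constructor options.
// supportAgentConfig is a hypothetical export holding those options.
const testAgent = new Agent({
  ...supportAgentConfig,
  model: "openai/gpt-4o-mini", // Cheaper for tests
});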

// Or mock the Agent class entirely for deterministic tests of code that calls an agent
vi.mock("@mastra/core/agent", () => ({
  Agent: vi.fn().mockImplementation((config) => ({
    generate: vi.fn().mockResolvedValue({
      text: "Mocked response",
      steps: [],
    }),
  })),
}));
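
Mocking the whole `Agent` class replaces the code under test, so it mostly suits tests of code that calls an agent rather than tests of the agent itself. One alternative is a hand-written stub that implements only the part of the interface your code uses. The sketch below is an illustration under assumptions: `AgentLike`, `createStubAgent`, and `answerCustomer` are hypothetical names, and the `generate` return shape mirrors the one used in this section.

```typescript
// The slice of the agent interface that application code depends on (assumed shape).
interface AgentLike {
  generate(input: string): Promise<{ text: string; steps: unknown[] }>;
}

// Build a stub that returns canned text for matching inputs and records every call.
function createStubAgent(responses: Record<string, string>, fallback = "Stub response") {
  const calls: string[] = [];
  const agent: AgentLike = {
    async generate(input: string) {
      calls.push(input);
      // Use the first canned response whose key appears in the input.
      const match = Object.entries(responses).find(([key]) => input.includes(key));
      return { text: match ? match[1] : fallback, steps: [] };
    },
  };
  return { agent, calls };
}

// Application code under test: accepts any AgentLike, so a real agent also fits.
async function answerCustomer(agent: AgentLike, message: string): Promise<string> {
  const result = await agent.generate(message);
  return result.text.trim();
}
```

Because the stub is deterministic and makes no network calls, tests built on it cost nothing to run and can assert on exact strings.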

Level 5: Eval Suites (Production Quality)

Mastra has built-in scorers for evaluating agent quality. Set these up for any agent going to production.

Setting Up Scorers

import { Agent } from "@mastra/core/agent";

const agent = new Agent({
  id: "support-agent",
  // ... agent config ...
  scorers: {
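    // Built-in Mastra scorers, assumed to be imported or constructed elsewhere
    // (for example from @mastra/evals); verify names and config shape in the current docs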
    "tool-call-accuracy": toolCallAccuracyScorer,
    "response-relevance": responseRelevanceScorer,
    "hallucination": hallucinationScorer,
  },
});

Creating Custom Scorers

// Score whether the agent followed the escalation policy
const escalationScorer = {
  name: "escalation-compliance",
  score: async ({ input, output, context }) => {
    const shouldEscalate = input.includes("billing dispute")
      || input.includes("legal")
      || input.includes("account deletion");

    const didEscalate = output.includes("billing@")
      || output.includes("legal@")
      || output.includes("escalat");

    if (shouldEscalate && !didEscalate) {
      return { score: 0, reason: "Should have escalated but didn't" };
    }
    if (!shouldEscalate && didEscalate) {
      return { score: 0.5, reason: "Unnecessary escalation" };
    }
    return { score: 1, reason: "Correct escalation behavior" };
  },
};

Running Eval Suites

// Define test cases
const evalCases = [
  {
    input: "Where is order ORD-123?",
    expectedBehavior: "Should call orderLookupTool with ORD-123",
    expectedToolCalls: ["orderLookupTool"],
  },
  {
    input: "I have a billing dispute",
    expectedBehavior: "Should escalate to billing team",
    expectedOutput: /billing@|escalat/i,
  },
];

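// Run each case and record whether it met its expectations
for (const testCase of evalCases) {
  const result = await agent.generate(testCase.input, { maxSteps: 5 });
  const usedTools = result.steps.flatMap(s => s.toolCalls || []).map(tc => tc.toolName);
  const toolsOk = (testCase.expectedToolCalls ?? []).every(t => usedTools.includes(t));
  const outputOk = testCase.expectedOutput?.test(result.text) ?? true;
  // Log one JSON line per case so results can be tracked over time
  console.log(JSON.stringify({ input: testCase.input, toolsOk, outputOk }));
}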
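
To track quality over time, it helps to reduce each eval run to a few numbers and fail the build when they drop. The sketch below assumes each case produces a score between 0 and 1 with a reason, like the custom scorer earlier in this section. `CaseScore`, `summarize`, and the 0.8 threshold are illustrative choices, not Mastra APIs.

```typescript
type CaseScore = { input: string; score: number; reason: string };

// Reduce per-case scores to a mean, a minimum, and the list of failing cases.
function summarize(scores: CaseScore[], passThreshold = 0.8) {
  if (scores.length === 0) {
    throw new Error("No eval cases were scored");
  }
  const total = scores.reduce((sum, s) => sum + s.score, 0);
  const failures = scores.filter((s) => s.score < passThreshold);
  return {
    mean: total / scores.length,
    min: Math.min(...scores.map((s) => s.score)),
    failures,
    passed: failures.length === 0,
  };
}

const summary = summarize([
  { input: "Where is order ORD-123?", score: 1, reason: "Correct escalation behavior" },
  { input: "I have a billing dispute", score: 0, reason: "Should have escalated but didn't" },
]);
console.log(summary.mean, summary.passed); // 0.5 false
```

Reporting the failing cases alongside the mean matters because a single badly handled case can hide behind a high average.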

Studio Testing

Mastra Studio (started with mastra dev, served at http://localhost:4111 by default) provides interactive testing:

What to Test in Studio

Agent conversations: Send messages and inspect tool calls, responses, and reasoning
Workflow execution: Run workflows with different inputs, inspect step-by-step results
Tool behavior: Call tools directly with various inputs
Memory: Verify conversation history persists across turns
Edge cases: Test with empty inputs, very long inputs, adversarial inputs

Studio Testing Checklist

Agent responds correctly to the "happy path"
Agent handles missing/invalid inputs gracefully
Agent calls the right tools for different query types
Agent respects boundaries (refuses out-of-scope requests)
Memory persists across conversation turns
Workflow completes all steps in order
Workflow handles suspension and resumption correctly
Error messages are user-friendly

Test Configuration

vitest.config.ts

import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    globals: true,
    environment: "node",
    setupFiles: ["./test/setup.ts"],
    testTimeout: 30000, // Agent tests can be slow
    hookTimeout: 10000,
  },
});

test/setup.ts

import "dotenv/config";

// Verify required env vars
const required = ["OPENAI_API_KEY"];
for (const key of required) {
  if (!process.env[key]) {
    throw new Error(`Missing required env var: ${key}`);
  }
}
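
The check above throws as soon as any test file loads, including schema and tool tests that never call an LLM. If that is too strict for your project, one option is to compute the missing keys once and let the LLM-backed suites decide whether to skip. The sketch below is a hedged illustration: `missingEnvVars` is a hypothetical helper, and it takes the environment as a parameter so it can be exercised without touching `process.env`.

```typescript
// Return the names of required variables that are unset or empty.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((key) => !env[key]);
}

// Example with an explicit env object; in test/setup.ts you would pass process.env.
const missing = missingEnvVars(["OPENAI_API_KEY"], { OPENAI_API_KEY: "" });
console.log(missing); // [ 'OPENAI_API_KEY' ]
```

An agent suite could then be wrapped in Vitest's `describe.skipIf(missing.length > 0)` so that the lower levels of the pyramid still run without an API key.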

Resources

Test templates: references/test-templates.md