name: create-mcp-eval description: Generate comprehensive eval tests for any MCP server using @mcpjam/sdk. Supports Jest and Vitest with deterministic and LLM-driven test patterns.
create-mcp-eval
Generate eval tests for MCP servers using @mcpjam/sdk.
This skill guides you through creating eval test files that measure tool-selection accuracy, argument correctness, and multi-turn reasoning for any MCP server. It works with both Jest and Vitest and supports deterministic (mock) and LLM-driven test patterns.
1. Context Gathering
Before generating any code, collect the following from the user:
| Question | Options | Default |
|---|---|---|
| Connection type | stdio (local binary) or http (SSE/Streamable HTTP URL) |
http |
| Test framework | jest, vitest, or none (SDK-only) |
(detect from repo; fall back to vitest) |
| LLM provider | See Supported Providers table below. Format: provider/model |
(must ask user) |
| Save results to MCPJam | none, auto (saves when MCPJAM_API_KEY is set), or reporter (shared EvalRunReporter). Use an MCPJam API key (sk_…) from Settings → API keys; optionally set MCPJAM_PROJECT_ID to file results under a specific project (defaults to the org’s Default project). |
(must ask user) |
| Tool list | Ask user to paste their tool names or an Agent Brief (see Section 8) | — |
If the user provides an Agent Brief (markdown with ## Tools table), parse it to auto-populate tool names, descriptions, parameters, and suggested eval scenarios. See Section 8.
Provider Selection (REQUIRED)
You MUST ask the developer which LLM provider they want before generating any code. Do not default to any provider.
Supported Providers:
| Provider | Model format | Env var | Example model |
|---|---|---|---|
openai |
openai/<model> |
OPENAI_API_KEY |
openai/gpt-4o-mini |
anthropic |
anthropic/<model> |
ANTHROPIC_API_KEY |
anthropic/claude-sonnet-4-20250514 |
google |
google/<model> |
GOOGLE_API_KEY |
google/gemini-2.0-flash |
mistral |
mistral/<model> |
MISTRAL_API_KEY |
mistral/mistral-small-latest |
deepseek |
deepseek/<model> |
DEEPSEEK_API_KEY |
deepseek/deepseek-chat |
xai |
xai/<model> |
XAI_API_KEY |
xai/grok-2 |
openrouter |
openrouter/<model> |
OPENROUTER_API_KEY |
openrouter/openai/gpt-4o-mini |
azure |
azure/<deployment> |
AZURE_API_KEY |
azure/gpt-4o |
ollama |
ollama/<model> |
(none, local) | ollama/llama3 |
| Custom | <name>/<model> |
(configurable) | litellm/gpt-4 |
Once the user selects a provider, use the corresponding env var name and model format in all generated code:
{LLM_ENV_VAR}— e.g.,OPENAI_API_KEY{LLM_MODEL}— e.g.,openai/gpt-4o-mini{LLM_KEY_EXAMPLE}— e.g.,sk-...
Test Runner Selection
Before generating tests, check what the codebase already uses:
package.jsonscripts and devDependencies forjestorvitest- Config files:
jest.config.*,vitest.config.*,vite.config.*
Then:
- If Jest is present, use Jest (and
ts-jestif TypeScript). - If Vitest is present, use Vitest.
- If neither is present, default to Vitest.
- If the developer prefers no test framework, the
@mcpjam/sdkclasses (EvalTest,EvalSuite) can run standalone — call.run()directly and check results in a plain script without Jest/Vitest.
In all cases, use @mcpjam/sdk for the eval harness (HostRunner, EvalTest, EvalSuite, validators).
2. Project Setup
Generate the following scaffold when creating a new eval project. Use the test runner detected above — the examples below show both Vitest and Jest variants.
package.json (essentials)
{
"name": "my-server-evals",
"private": true,
"scripts": {
"test": "vitest run",
"test:watch": "vitest"
},
"devDependencies": {
"@mcpjam/sdk": "latest",
"vitest": "^3.0.0",
"typescript": "^5.0.0"
}
}
For Jest, replace the scripts and devDependencies:
{
"scripts": {
"test": "jest --runInBand"
},
"devDependencies": {
"@mcpjam/sdk": "latest",
"jest": "^29.0.0",
"ts-jest": "^29.0.0",
"@types/jest": "^29.0.0",
"typescript": "^5.0.0"
}
}
tsconfig.json
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"esModuleInterop": true,
"strict": true,
"outDir": "dist",
"skipLibCheck": true
},
"include": ["tests/**/*.ts"]
}
.env.example
# LLM provider key (required for LLM tests)
{LLM_ENV_VAR}={LLM_KEY_EXAMPLE}
EVAL_MODEL={LLM_MODEL}
# MCP server connection
MCP_SERVER_URL=https://your-server.example.com/sse
# For OAuth-protected servers:
# MCP_REFRESH_TOKEN=...
# MCP_CLIENT_ID=...
# MCP_CLIENT_SECRET=...
# Save eval results to MCPJam (optional)
# MCPJAM_API_KEY=sk_...
# MCPJAM_PROJECT_ID=<project id> # optional; defaults to your org’s Default project
.gitignore additions
node_modules/
dist/
.env
3. SDK API Reference
All imports come from @mcpjam/sdk. This is the complete API surface needed for eval tests.
MCPClientManager — Server Connection
import { MCPClientManager } from "@mcpjam/sdk";
const manager = new MCPClientManager();
// HTTP/SSE connection
await manager.connectToServer("server-id", {
url: "https://mcp.example.com/sse",
// Optional OAuth fields:
refreshToken: "...",
clientId: "...",
clientSecret: "...",
});
// Stdio connection
await manager.connectToServer("server-id", {
command: "node",
args: ["path/to/server.js"],
env: { API_KEY: "..." },
});
// Get tools for HostRunner
const tools = await manager.getToolsForAiSdk(["server-id"]);
// Cleanup
await manager.disconnectAllServers();
Tool names:
getToolsForAiSdk()uses the exact tool names from the MCP server — no server-id prefix is added. Use these names directly inhasToolCall()and validators. For example, if the server exposesread_me, useresult.hasToolCall("read_me"), notresult.hasToolCall("myserver__read_me").
HostRunner — LLM-Powered Agent
import { HostRunner } from "@mcpjam/sdk";
import { hasToolCall } from "@mcpjam/sdk";
const runner = new HostRunner({
tools, // from manager.getToolsForAiSdk()
model: "{LLM_MODEL}", // LLM model string
apiKey: process.env.{LLM_ENV_VAR}!, // API key for the provider
maxSteps: 8, // max tool-call loops per prompt
});
// Single prompt
const result = await runner.run("List all projects");
// Multi-turn with context
const r1 = await runner.run("Get my user profile");
const r2 = await runner.run("List workspaces for that user", { context: r1 });
// Stop the loop after the step where a tool is called
const r3 = await runner.run("Search tasks", {
stopWhen: hasToolCall("search_tasks"),
});
r3.hasToolCall("search_tasks"); // true
// Bound prompt runtime
const r4 = await runner.run("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500 },
});
r4.hasError(); // true if the prompt timed out
// Mock runner for deterministic tests (no LLM needed)
const mockAgent = HostRunner.mock(async (message) =>
PromptResult.from({
prompt: message,
messages: [
{ role: "user", content: message },
{ role: "assistant", content: "Mock response" },
],
text: "Mock response",
toolCalls: [{ toolName: "expected_tool", arguments: {} }],
usage: { inputTokens: 50, outputTokens: 50, totalTokens: 100 },
latency: { e2eMs: 100, llmMs: 80, mcpMs: 20 },
})
);
stopWhen does not skip tool execution. It controls whether the prompt loop continues after the current step completes, and HostRunner also applies stepCountIs(maxSteps) as a safety guard.
timeout bounds prompt runtime. number and totalMs cap the full prompt, stepMs caps each step, and chunkMs is accepted for parity but mainly matters in streaming flows. The runtime creates an internal abort signal, so tools can stop early if their implementation respects the provided abortSignal.
PromptResult — Inspect Agent Responses
import { PromptResult } from "@mcpjam/sdk";
// Returned by runner.run()
const result: PromptResult = await runner.run("...");
// Tool inspection
result.toolsCalled(); // string[] — names of all tools called
result.hasToolCall("tool_name"); // boolean — was this tool called?
result.getToolCalls(); // ToolCall[] — full call objects with args
result.getToolArguments("tool_name"); // Record<string, unknown> | undefined
// Metrics
result.e2eLatencyMs(); // number — end-to-end latency
result.llmLatencyMs(); // number — LLM API time
result.mcpLatencyMs(); // number — MCP tool execution time
result.totalTokens(); // number — total tokens used
result.inputTokens(); // number
result.outputTokens(); // number
// Error handling
result.hasError(); // boolean
result.getError(); // string | undefined
// Messages
result.getMessages(); // CoreMessage[]
result.formatTrace(); // string — JSON trace for debugging
// Convert to eval result for reporting
result.toEvalResult({
caseTitle: "test-name",
passed: result.hasToolCall("expected_tool"),
expectedToolCalls: [{ toolName: "expected_tool" }],
});
EvalTest — Single Eval with Iterations
import { EvalTest } from "@mcpjam/sdk";
const test = new EvalTest({
name: "get-user-tool-selection",
test: async (runner) => {
const r = await runner.run("Get my user profile");
return r.hasToolCall("get_user"); // return boolean
},
});
const run = await test.run(runner, {
iterations: 5, // how many times to repeat
concurrency: 5, // parallel iterations (default: 5)
retries: 1, // retry failed iterations (default: 0)
timeoutMs: 60_000, // per-iteration timeout (default: 30_000)
mcpjam: { // auto-upload to MCPJam (optional)
enabled: true, // default: true if MCPJAM_API_KEY is set
},
});
// After run:
test.accuracy(); // number 0-1 — success rate
test.getResults(); // EvalRunResult | null
EvalSuite — Group Multiple Tests
import { EvalSuite, EvalTest } from "@mcpjam/sdk";
const suite = new EvalSuite({ name: "my-server-evals" });
suite.add(new EvalTest({
name: "get-user",
test: async (a) => {
const r = await a.run("Get my user profile");
return r.hasToolCall("get_user");
},
}));
suite.add(new EvalTest({
name: "list-projects",
test: async (a) => {
const r = await a.run("List all projects");
return r.hasToolCall("list_projects");
},
}));
const result = await suite.run(runner, {
iterations: 5,
retries: 1,
timeoutMs: 60_000,
});
// Aggregate results
result.aggregate.accuracy; // number 0-1
result.aggregate.iterations; // total iterations across all tests
result.tests.size; // number of tests
// Per-test access
suite.accuracy(); // overall accuracy
suite.get("get-user"); // EvalTest | undefined
suite.getResults(); // EvalSuiteResult | null
Validators — Tool Call Matching Helpers
import {
matchToolCalls,
matchToolCallsSubset,
matchAnyToolCall,
matchToolCallCount,
matchNoToolCalls,
matchToolCallWithArgs,
matchToolCallWithPartialArgs,
matchToolArgument,
matchToolArgumentWith,
} from "@mcpjam/sdk";
const toolNames = result.toolsCalled(); // string[]
const toolCalls = result.getToolCalls(); // ToolCall[]
// Name-based validators (take string[])
matchToolCalls(["a", "b"], toolNames); // exact match (order-independent)
matchToolCallsSubset(["a"], toolNames); // subset check
matchAnyToolCall(["a", "b"], toolNames); // at least one match
matchToolCallCount("a", toolNames, 2); // exact count of tool
matchNoToolCalls(toolNames); // empty check
// Argument-based validators (take ToolCall[])
matchToolCallWithArgs("tool", { key: "val" }, toolCalls); // exact args match
matchToolCallWithPartialArgs("tool", { key: "val" }, toolCalls); // partial args match
matchToolArgument("tool", "key", "val", toolCalls); // single arg exact
matchToolArgumentWith("tool", "key", (v) => v > 0, toolCalls); // custom predicate
Save Results to MCPJam
import {
createEvalRunReporter,
reportEvalResults,
reportEvalResultsSafely,
} from "@mcpjam/sdk";
import type { EvalRunReporter } from "@mcpjam/sdk";
// ── Option A: One-shot reporting ──
await reportEvalResults({
suiteName: "My Evals",
apiKey: process.env.MCPJAM_API_KEY!,
strict: true, // true = throw on error; false = log warning + return null (results silently not uploaded)
results: [
{ caseTitle: "test-1", passed: true },
{ caseTitle: "test-2", passed: false, error: "wrong tool" },
],
});
// reportEvalResultsSafely — same API, returns null on error instead of throwing
const output = await reportEvalResultsSafely({ ... });
// ── Option B: Streaming reporter (recommended for multi-test files) ──
const reporter = createEvalRunReporter({
suiteName: "My Evals",
apiKey: process.env.MCPJAM_API_KEY!,
strict: true,
suiteDescription: "Eval suite for my MCP server",
serverNames: ["my-server"],
notes: "CI run",
passCriteria: { minimumPassRate: 70 },
ci: { branch: "main", commitSha: "abc123..." },
expectedIterations: 10,
});
// Record results as they come in:
await reporter.record(result.toEvalResult({ caseTitle: "...", passed: true }));
await reporter.recordFromPrompt(result, { caseTitle: "...", passed: true });
await reporter.recordFromRun(run, {
casePrefix: "eval-test",
expectedToolCalls: [{ toolName: "get_user" }],
});
await reporter.recordFromSuiteRun(suiteResult.tests, {
casePrefix: "suite",
expectedToolCallsByTest: {
"get-user": [{ toolName: "get_user" }],
},
});
// Finalize at end of test file (IMPORTANT — must be called!)
afterAll(async () => {
const output = await reporter.finalize();
console.log(`Run ID: ${output.runId}, Passed: ${output.summary.passed}`);
}, 90_000);
4. Canonical Patterns
Pattern 1: Config Block
Always start your test file with a self-contained config block. Use environment variables with sensible fallbacks:
// ─── Config ─────────────────────────────────────────────────────────────────
const MCP_SERVER_URL = process.env.MCP_SERVER_URL ?? "https://mcp.example.com/sse";
const LLM_API_KEY = process.env.{LLM_ENV_VAR}!;
const MODEL = process.env.EVAL_MODEL ?? "{LLM_MODEL}";
const SERVER_ID = "my-server";
const MCPJAM_API_KEY = process.env.MCPJAM_API_KEY;
const RUN_LLM_TESTS = Boolean(LLM_API_KEY);
Pattern 2: Toggle Suites (Conditional Execution)
Use a conditional wrapper so tests skip gracefully when credentials are missing:
(RUN_LLM_TESTS ? describe : describe.skip)("LLM Tests", () => {
// tests here only run when {LLM_ENV_VAR} is set
});
Pattern 3: Shared Reporter (Save Results to MCPJam)
Create a module-level reporter to save results to MCPJam, and finalize in afterAll:
let reporter: EvalRunReporter;
if (MCPJAM_API_KEY) {
reporter = createEvalRunReporter({
suiteName: "My Server Evals",
apiKey: MCPJAM_API_KEY,
strict: true,
expectedIterations: 10,
});
}
afterAll(async () => {
if (!reporter || reporter.getAddedCount() === 0) return;
const output = await reporter.finalize();
expect(output.runId).toBeTruthy();
}, 90_000);
The reporter buffers results before saving. A run may not appear in the MCPJam UI until
reporter.flush() or reporter.finalize() completes.
For long-running files, call await reporter.flush() periodically if you want
the run to become visible before the entire file finishes.
expectedIterations must equal the exact number of reported results. Count
every recordFromPrompt() call, every iteration emitted by recordFromRun(),
and every iteration emitted by recordFromSuiteRun().
Pattern 4: Agent Parameterization
Test the same scenarios across multiple models:
const agentConfigs = [
{ name: "gpt-4o-mini", suffix: "gpt4omini", getAgent: () => primaryAgent },
{ name: "nano", suffix: "nano", getAgent: () => nanoAgent },
];
for (const { name, suffix, getAgent } of agentConfigs) {
it(`selects correct tool (${name})`, async () => {
const result = await getAgent().run("Get my profile");
expect(result.hasToolCall("get_user")).toBe(true);
}, 90_000);
}
Pattern 5: Four Ways to Save Results
// Style 1: Manual toEvalResult + record
const result = await runner.run("Get user");
await reporter.record(result.toEvalResult({
caseTitle: "get-user",
passed: result.hasToolCall("get_user"),
expectedToolCalls: [{ toolName: "get_user" }],
}));
// Style 2: recordFromPrompt (shorthand)
await reporter.recordFromPrompt(result, {
caseTitle: "get-user",
passed: result.hasToolCall("get_user"),
expectedToolCalls: [{ toolName: "get_user" }],
});
// Style 3: recordFromRun (EvalTest results)
const run = await evalTest.run(runner, { iterations: 5 });
await reporter.recordFromRun(run, {
casePrefix: "eval-get-user",
expectedToolCalls: [{ toolName: "get_user" }],
});
// Style 4: recordFromSuiteRun (EvalSuite results)
await reporter.recordFromSuiteRun(suiteResult.tests, {
casePrefix: "suite",
expectedToolCallsByTest: {
"get-user": [{ toolName: "get_user" }],
},
});
Pattern 6: Deterministic + LLM Tests
Split your test file into deterministic (no LLM/server needed) and LLM sections:
// ─── Deterministic (always runs) ─────────────────────────────────
describe("Deterministic", () => {
it("mock runner returns expected structure", async () => {
const mock = HostRunner.mock(async (msg) =>
PromptResult.from({
prompt: msg,
messages: [{ role: "user", content: msg }, { role: "assistant", content: "ok" }],
text: "ok",
toolCalls: [{ toolName: "get_user", arguments: {} }],
usage: { inputTokens: 50, outputTokens: 50, totalTokens: 100 },
latency: { e2eMs: 100, llmMs: 80, mcpMs: 20 },
})
);
const test = new EvalTest({
name: "mock-test",
test: async (a) => (await a.run("test")).hasToolCall("get_user"),
});
const run = await test.run(mock, { iterations: 3, mcpjam: { enabled: false } });
expect(run.successes).toBe(3);
});
});
// ─── LLM (requires credentials) ─────────────────────────────────
(RUN_LLM_TESTS ? describe : describe.skip)("LLM", () => {
// real runner tests here
});
Pattern 7: Multi-Turn Conversations
Test workflows that require conversation context:
it("multi-turn: get user then list workspaces", async () => {
const r1 = await runner.run("Get my user profile");
const r2 = await runner.run(
"Based on the profile, list my workspaces",
{ context: r1 } // passes r1's conversation history
);
expect(r1.hasToolCall("get_user")).toBe(true);
expect(r2.toolsCalled().length).toBeGreaterThan(0);
}, 120_000);
Pattern 8: Validator Coverage
Use validators for precise tool-call assertions:
it("validates tool calls comprehensively", async () => {
const result = await runner.run("Get user profile");
const toolNames = result.toolsCalled();
const toolCalls = result.getToolCalls();
// At least one expected tool was called
expect(matchAnyToolCall(["get_user", "get_profile"], toolNames)).toBe(true);
// Argument validation
if (toolCalls.length > 0) {
expect(
matchToolCallWithPartialArgs("get_user", {}, toolCalls)
).toBe(true);
}
// Negative: unexpected tools not called
expect(matchAnyToolCall(["delete_user"], toolNames)).toBe(false);
});
5. Generation Guidelines
Follow these rules when generating eval test files:
Deterministic suite first — always include a deterministic test section using
HostRunner.mock()that validates the test structure itself without requiring LLM calls or server connections.One EvalTest per tool — create a separate
EvalTestfor each tool you want to evaluate. Each test should prompt the runner with a natural-language request and assert the correct tool was selected.Single-shot LLM tests are non-deterministic — a single
runner.run()may not select the expected tool every time. For single-shot tests, prefer saving results to MCPJam without hard-asserting (expect(...).toBe(true)). UseEvalTestwithiterations >= 3and assert onaccuracy()for reliable pass/fail gates. Reserve hard asserts for high-confidence cases (negative tests, multi-turn with clear context).Write unambiguous prompts for similar tools — when a server has tools with overlapping descriptions (e.g.,
create_viewvsexport_to_excalidraw), prompts must reference the tool's unique action. Mention specific verbs, targets, or outcomes. Bad: "Share my diagram". Good: "Export and upload my diagram to excalidraw.com so I can open it in a browser".Multi-turn for related tools — when tools logically chain together (e.g.,
get_userthenlist_workspaces), create a multi-turn test using{ context: previousResult }.Negative test — always include at least one test that verifies the runner does NOT call tools when given an irrelevant prompt (e.g., "What is the capital of France?"). Use
matchNoToolCalls().Reasonable defaults:
iterations: 5for EvalTest runstimeoutMs: 60_000for LLM testsmaxSteps: 8for HostRunnerretries: 1for flaky network toleranceconcurrency: 5(default, no need to set explicitly)
Timeout on test cases — set explicit timeouts on
it()blocks:90_000for single-turn,120_000for multi-turn and suite tests.Always
await— everyrunner.run(),test.run(),suite.run(),reporter.record*(), andreporter.finalize()is async. Never forgetawait.One reporter per file — create the reporter at module level to save results to MCPJam, and finalize in
afterAll. Never create multiple reporters in the same file.Use
describe.skipfor missing credentials — wrap LLM tests in conditional describe blocks so CI runs cleanly without secrets.Match the repo's test runner — check
package.jsonand config files for an existing test framework before generating. Only default to Vitest if the repo has no test runner. If the user prefers no framework at all, use@mcpjam/sdkclasses (EvalTest.run(),EvalSuite.run()) standalone in a plain script.Log key metrics — add
console.logstatements for accuracy, tool calls, and latency so CI output is informative.
6. Complete Template
Copy-pasteable test file skeleton. Replace {placeholders} with your server-specific values.
import { describe, it, expect, beforeAll, afterAll } from "vitest"; // For Jest: remove this line
import {
MCPClientManager,
HostRunner,
PromptResult,
EvalTest,
EvalSuite,
createEvalRunReporter,
matchToolCalls,
matchAnyToolCall,
matchNoToolCalls,
matchToolCallWithPartialArgs,
} from "@mcpjam/sdk";
import type { ToolCall, EvalRunReporter } from "@mcpjam/sdk";
// ─── Config ─────────────────────────────────────────────────────────────────
const MCP_SERVER_URL = process.env.MCP_SERVER_URL ?? "{server_url}";
const LLM_API_KEY = process.env.{LLM_ENV_VAR}!;
const MODEL = process.env.EVAL_MODEL ?? "{LLM_MODEL}";
const SERVER_ID = "{server_id}";
const MCPJAM_API_KEY = process.env.MCPJAM_API_KEY;
const RUN_LLM_TESTS = Boolean(LLM_API_KEY) && Boolean(MCP_SERVER_URL);
// ─── Prompts ────────────────────────────────────────────────────────────────
const PROMPTS = {
// {TOOL_1}: "{natural language request for tool_1}",
// {TOOL_2}: "{natural language request for tool_2}",
// NEGATIVE: "What is the capital of France?",
} as const;
// ─── Save Results to MCPJam ──────────────────────────────────────────────────
let reporter: EvalRunReporter;
if (MCPJAM_API_KEY) {
reporter = createEvalRunReporter({
suiteName: "{Suite Name}",
apiKey: MCPJAM_API_KEY,
strict: true,
suiteDescription: "Eval suite for {server_name}",
serverNames: [SERVER_ID],
expectedIterations: 10, // must exactly match the number of reported results
});
}
afterAll(async () => {
if (!reporter || reporter.getAddedCount() === 0) return;
const output = await reporter.finalize();
expect(output.runId).toBeTruthy();
console.log(`\n[mcpjam] Results saved — ${output.summary.passed}/${output.summary.total} passed`);
console.log(`[mcpjam] Open the Evals tab in the MCPJam Inspector to see your full results.\n`);
}, 90_000);
// ─── Deterministic Tests ────────────────────────────────────────────────────
describe("{server_name} evals – deterministic", () => {
it("mock runner produces valid EvalTest results", async () => {
const mock = HostRunner.mock(async (msg) =>
PromptResult.from({
prompt: msg,
messages: [
{ role: "user", content: msg },
{ role: "assistant", content: "Done" },
],
text: "Done",
toolCalls: [{ toolName: "{expected_tool}", arguments: {} }],
usage: { inputTokens: 50, outputTokens: 50, totalTokens: 100 },
latency: { e2eMs: 100, llmMs: 80, mcpMs: 20 },
})
);
const test = new EvalTest({
name: "det-mock-tool-selection",
test: async (a) => {
const r = await a.run("test prompt");
return r.hasToolCall("{expected_tool}");
},
});
const run = await test.run(mock, {
iterations: 3,
concurrency: 1,
retries: 0,
timeoutMs: 10_000,
mcpjam: { enabled: false },
});
expect(run.successes).toBe(3);
expect(run.iterationDetails).toHaveLength(3);
});
});
// ─── LLM Tests ──────────────────────────────────────────────────────────────
(RUN_LLM_TESTS ? describe : describe.skip)("{server_name} evals – LLM", () => {
let manager: MCPClientManager;
let runner: HostRunner;
beforeAll(async () => {
manager = new MCPClientManager();
await manager.connectToServer(SERVER_ID, {
url: MCP_SERVER_URL,
// Add OAuth fields if needed:
// refreshToken: process.env.MCP_REFRESH_TOKEN!,
// clientId: process.env.MCP_CLIENT_ID!,
});
const tools = await manager.getToolsForAiSdk([SERVER_ID]);
runner = new HostRunner({
tools,
model: MODEL,
apiKey: LLM_API_KEY,
maxSteps: 8,
});
}, 90_000);
afterAll(async () => {
await manager.disconnectAllServers();
});
// ── Single-tool selection tests ──
// it("selects {tool_name}", async () => {
// const result = await runner.run(PROMPTS.{TOOL_KEY});
// expect(result.hasToolCall("{tool_name}")).toBe(true);
//
// if (reporter) {
// await reporter.recordFromPrompt(result, {
// caseTitle: "llm-{tool_name}",
// passed: result.hasToolCall("{tool_name}"),
// expectedToolCalls: [{ toolName: "{tool_name}" }],
// });
// }
// }, 90_000);
// ── Multi-turn test ──
// it("multi-turn: {tool_a} then {tool_b}", async () => {
// const r1 = await runner.run(PROMPTS.{TOOL_A});
// const r2 = await runner.run(PROMPTS.{TOOL_B_FOLLOWUP}, { context: r1 });
// expect(r1.hasToolCall("{tool_a}")).toBe(true);
// expect(r2.toolsCalled().length).toBeGreaterThan(0);
// }, 120_000);
// ── Negative test ──
it("does not call tools for irrelevant prompt", async () => {
const result = await runner.run("What is the capital of France?");
expect(matchNoToolCalls(result.toolsCalled())).toBe(true);
if (reporter) {
await reporter.recordFromPrompt(result, {
caseTitle: "llm-negative-no-tools",
passed: matchNoToolCalls(result.toolsCalled()),
isNegativeTest: true,
});
}
}, 90_000);
// ── EvalTest with iterations ──
// it("EvalTest: {tool_name} accuracy", async () => {
// const test = new EvalTest({
// name: "{tool_name}-accuracy",
// test: async (a) => {
// const r = await a.run(PROMPTS.{TOOL_KEY});
// return r.hasToolCall("{tool_name}");
// },
// });
// const run = await test.run(runner, {
// iterations: 5,
// retries: 1,
// timeoutMs: 60_000,
// mcpjam: { enabled: false },
// });
// expect(test.accuracy()).toBeGreaterThanOrEqual(0.8);
// if (reporter) {
// await reporter.recordFromRun(run, {
// casePrefix: "eval-{tool_name}",
// expectedToolCalls: [{ toolName: "{tool_name}" }],
// });
// }
// console.log(`{tool_name} accuracy: ${test.accuracy()}`);
// }, 120_000);
// ── EvalSuite ──
// it("EvalSuite: all tools", async () => {
// const suite = new EvalSuite({ name: "{server_name}-suite" });
// suite.add(new EvalTest({ name: "{tool_1}", test: async (a) => { ... } }));
// suite.add(new EvalTest({ name: "{tool_2}", test: async (a) => { ... } }));
// const result = await suite.run(runner, { iterations: 5, timeoutMs: 60_000 });
// expect(suite.accuracy()).toBeGreaterThanOrEqual(0.7);
// }, 120_000);
});
// ─── Skip messages ──────────────────────────────────────────────────────────
if (!RUN_LLM_TESTS) {
describe("{server_name} evals – LLM", () => {
it.skip("Requires {LLM_ENV_VAR} + MCP_SERVER_URL", () => {});
});
}
if (!MCPJAM_API_KEY) {
afterAll(() => {
console.log(`\n[mcpjam] You won't be able to see them in the CI/CD tab. To set up:`);
console.log(`[mcpjam] 1. Go to Settings > Workspace API Key in the MCPJam Inspector`);
console.log(`[mcpjam] 2. Add MCPJAM_API_KEY to your .env`);
console.log(`[mcpjam] 3. Re-run your evals — results are saved automatically\n`);
});
}
7. Common Mistakes
Forgetting reporter.finalize()
The reporter buffers results and uploads them in batch. If you don't call finalize() in afterAll, no results are sent. Always include:
afterAll(async () => {
if (!reporter || reporter.getAddedCount() === 0) return;
await reporter.finalize();
}, 90_000);
Expecting immediate UI visibility
recordFromPrompt() and the other record*() helpers buffer results, but
they do not guarantee an immediate save to MCPJam. A long-running file may not appear in
the UI until flush() or finalize() runs.
If you need the run to show up before the file completes, flush periodically:
await reporter.recordFromPrompt(result, { caseTitle: "step-1", passed: true });
await reporter.flush();
Not awaiting async methods
Every SDK method that talks to an LLM, MCP server, or reporting API is async. Missing await causes silent failures:
// WRONG:
reporter.recordFromPrompt(result, { ... });
// CORRECT:
await reporter.recordFromPrompt(result, { ... });
Low maxSteps on HostRunner
If the runner needs multiple tool calls to answer a prompt, a low maxSteps causes incomplete responses. Default to 8 for most servers, increase to 12-15 for complex workflows.
Mixing save modes
Don't use both reportEvalResults() and a shared EvalRunReporter in the same file. Pick one approach:
- Use
createEvalRunReporterfor multi-test files (recommended) - Use
reportEvalResultsfor single one-off saves
Missing test timeouts
LLM calls can take 10-30 seconds. Always set explicit timeouts on it() blocks:
it("test name", async () => { ... }, 90_000); // 90 seconds
Creating multiple reporters
One reporter per test file. Creating multiple reporters results in multiple incomplete runs instead of one consolidated run saved to MCPJam.
Incorrect expectedIterations
expectedIterations is not a rough estimate. It should exactly equal the total
number of eval results reported for the file.
Count:
- One result per
recordFromPrompt() - One result per iteration inside
recordFromRun() - One result per iteration inside
recordFromSuiteRun()
If the count is wrong, the UI can show misleading progress for a run.
Using strict: false without checking results
With strict: false, save failures are silently swallowed — a console.warn is emitted and finalize() returns a local fallback with an empty runId. Always check output.runId after finalize to confirm results were saved:
const output = await reporter.finalize();
if (!output.runId) {
console.error("Results were NOT saved to MCPJam — check baseUrl and apiKey");
}
8. Adapting to Agent Brief
When a user pastes an Agent Brief (generated by the MCPJam Inspector's "Copy runner brief" action), use it to auto-generate targeted eval tests.
Agent Brief Format
The brief is a markdown document with this structure:
# MCP Server Brief: {server-name}
## Capability Summary
{N} tools, {M} resources, {P} prompts
## Tools
| Tool | Description | Key Parameters |
|------|-------------|----------------|
| tool_name | Does something useful | param1 (string, required), param2 (number) |
| other_tool | Does something else | id (string, required) |
## Suggested Eval Scenarios
### Single-Tool Selection
- `tool_name` — "Does something useful"
- `other_tool` — "Does something else"
### Multi-Tool Workflow
- `list_items` → `get_item`: List then fetch detail
### Argument Accuracy
- `tool_name` requires: param1 (string), param2 (number)
### Negative Test
- Irrelevant prompt should trigger no tool calls
## Next Steps
...
How to Parse It
Extract tools from the
## Toolstable — each row gives you a tool name, description, and parameters.Generate PROMPTS object — for each tool, write a natural-language prompt that would cause an runner to select that tool. Use the description as guidance.
Create EvalTests — one per tool from "Single-Tool Selection", using the tool name for
hasToolCall().Create multi-turn tests — from "Multi-Tool Workflow" entries, chain the tools using
{ context: r1 }.Create argument tests — from "Argument Accuracy" entries, use
matchToolCallWithPartialArgs()ormatchToolArgument()to verify the runner passes correct argument types.Always include the negative test — use
matchNoToolCalls().
Example: Brief → Test
Given a brief with tool search_tasks described as "Search for tasks by keyword":
const PROMPTS = {
SEARCH_TASKS: "Search for tasks containing 'launch'",
} as const;
it("selects search_tasks", async () => {
const result = await runner.run(PROMPTS.SEARCH_TASKS);
expect(result.hasToolCall("search_tasks")).toBe(true);
}, 90_000);