name: agent-bench description: "Benchmark and compare the agentic performance of multiple LLM models on the same task. Use this skill when the user wants to compare models, benchmark LLMs, test which model is better at a task, race models against each other, or evaluate model performance side-by-side. Triggers on phrases like 'benchmark models', 'compare opus vs sonnet', 'which model is better at', 'race these models', 'test model performance', 'agent benchmark'."
Agent Bench — Multi-Model Agentic Benchmark
You are the judge agent. Your job is to give the same task to multiple models running as subagents in parallel, then evaluate and compare their outputs fairly.
Inputs
The user provides:
- Task: A description of the work each agent should perform
- Models: Which models to benchmark (e.g., "opus, sonnet, haiku")
- Timeout (optional): Max time per agent. Default: 10 minutes
Workflow
1. Parse Inputs
Extract the task, model list, and timeout from the user's message.
2. Prepare Isolated Workspaces
Each subagent needs its own directory so their files don't collide. This is critical — without isolation, subagents running in parallel will overwrite each other's output files.
Create a timestamped benchmark directory with one subdirectory per model:
.agent-bench/<YYYYMMDD-HHmmss>/
├── opus/
├── sonnet/
├── haiku/
└── results.json
Create these directories before spawning any subagents. Add .agent-bench/ to .gitignore if the project uses git.
3. Spawn Subagents
Launch one subagent per model, all in the same turn (parallel). Each subagent receives an identical prompt — just the task, as if it came directly from a user. Do not mention benchmarking, scoring, or evaluation. The subagent should behave exactly as it would for a real user request.
Prompt template:
<task, verbatim from the user>
Use <absolute-path-to-model-directory> as your working directory for any files you create or modify.
Each subagent gets its own unique directory path (e.g., .agent-bench/20260314-100000/opus/). Keep the prompt natural — the workspace instruction is a reasonable constraint that wouldn't seem unusual to any agent.
If the runtime also supports additional isolation mechanisms (e.g., isolation: "worktree" in Claude Code), use them as an extra layer of protection. But always assign per-model directories regardless — this is the universal, runtime-agnostic safeguard.
Long-result handling: Before spawning, assess whether the task is likely to produce long output (e.g., writing, research, analysis, or code generation tasks). If so, add an instruction to the subagent's prompt telling it to save its full result to a file in its workspace directory and return the file path. This prevents long responses from being truncated by the runtime. For short-output tasks (e.g., simple Q&A, small scripts), this is unnecessary.
Other implementation details:
- Each subagent must start with a clean context (no shared history)
- If the runtime supports model selection for subagents, use it to assign the correct model
- If the runtime supports background/async execution, use it so all models run in parallel
- If the runtime supports timeout for subagents, apply the configured timeout
- Do NOT include
streamTowhen spawning subagents — it is incompatible withruntime: "subagent"and will cause a fatal error.
4. Collect Results
As each subagent completes (or times out), record:
- Duration: Wall-clock time from spawn to completion (seconds)
- Usage stats: Record all token and cost fields the runtime provides, as-is. Do not calculate derived fields (e.g., don't sum uncached + cached to produce a total). Do not estimate costs if the runtime doesn't return them. Just pass through what's available.
- Tool uses: Number of tool calls if available (otherwise mark as "N/A")
- Status: "completed", "timed_out", or "error"
- Response: The subagent's final response text
- Workspace path: Where the model's files live
If a subagent times out, record what it managed to produce (partial results still get evaluated).
5. Classify the Task
Auto-classify the benchmark task into one or more categories from: coding, writing, analysis, documents, creative, skills, automation, research, productivity. A task can belong to multiple categories (e.g., "automate PDF processing" → ["automation", "documents"]). Also generate up to 5 short descriptive tags.
If running on OpenClaw, record the version by running openclaw --version.
6. Collect Cost Data
After all subagents have finished (important — wait until all are done!), retrieve the USD cost for each subagent session so you can score value-for-money.
On OpenClaw, run the bundled cost collection script:
python3 ~/.openclaw/skills/agent-bench/scripts/collect_session_costs.py
The script path is absolute because your working directory may not be the skill directory.
This reads OpenClaw session JSONL files and outputs JSON with each session's model and USD cost (costUsd field). When a provider doesn't report costs in session data, the script automatically estimates cost from token counts using OpenRouter's pricing API — check the costEstimated field (true = estimated from pricing, false = reported by provider). Match each subagent session to its model using the model field. Exclude your own judge session (you'll recognize it by your own model name or session ID). Record the costUsd value for each benchmarked model. Use the costUsd value for Cost-Effectiveness scoring regardless of whether it was reported or estimated — both are reliable.
On other runtimes, use sessions_list or equivalent to get cost data. If no cost data is available from any source, record cost as null and note this — the Cost-Effectiveness score will be based on token usage as a rough proxy.
7. Evaluate Each Model
Now act as an impartial judge. For each model:
- Read the subagent's response text
- List and read the files in the model's workspace directory
- If the task involved code, attempt to run/test it (within the model's workspace)
- If the task involved writing, research, or analysis, evaluate the substance and presentation
- Score on four dimensions:
| Metric | Scale | What it measures |
|---|---|---|
| Completeness | 0–10 | Did the agent address all parts of the task? 10 = every requirement met, 0 = nothing done |
| Quality | 0–10 | How good is the output? Accuracy, depth, clarity, structure, attention to detail. Adapt criteria to the task type — code quality for coding tasks, writing quality for writing tasks, analytical rigor for research tasks, etc. |
| Performance | 0–10 | Holistic score weighing completeness and quality, plus any other impressions (creativity, efficiency, going above and beyond). Pure output quality — cost is irrelevant. |
| Cost-Effectiveness | 0–10 | How much quality per dollar? Consider performance relative to USD cost. A model scoring 8/10 for $0.05 is more cost-effective than one scoring 9/10 for $0.50. If no cost data, use token usage as proxy. |
For each score, write a 1-2 sentence justification. Be specific — reference actual outputs, files, or content. Cost-Effectiveness justification must reference actual cost.
Evaluation integrity matters: evaluate each model's output independently. Read and score one model fully before moving to the next, so earlier scores don't anchor later ones. If you catch yourself comparing during scoring, reset and evaluate against the task requirements only.
8. Present Results
Output a summary table in markdown:
## Benchmark Results
**Task**: <brief task description>
**Timeout**: <value>
**Date**: <timestamp>
| Model | Completeness | Quality | Performance | Cost-Effectiveness | Duration | Cost | Msgs | Tool Calls | Status | ... (runtime-reported usage) |
|-------|-------------|---------|-------------|-------------------|----------|------|------|------------|--------|------------------------------|
| opus | 9 | 8 | 9 | 6 | 45s | $0.12 | 12 | 15 | completed | (add columns for whatever usage fields the runtime returns) |
| sonnet| 8 | 9 | 8 | 8 | 32s | $0.05 | 8 | 10 | completed | |
| haiku | 6 | 5 | 5 | 9 | 18s | $0.003 | 4 | 6 | completed | |
Then for each model, show:
- Score justifications (the 1-2 sentences per metric)
- Notable strengths or weaknesses
- Key outputs produced (with file paths if applicable)
End with a brief verdict naming two winners:
- Performance Winner: best output regardless of cost
- Cost-Effectiveness Winner: best quality-per-dollar
If the same model wins both, state it explicitly. Note any interesting tradeoffs.
9. Save Results
Write structured results to results.json in a .agent-bench/<YYYYMMDD-HHmmss>/ directory within the current working directory (add .agent-bench/ to .gitignore if the project uses git):
{
"task": "the task description",
"timeout_seconds": 600,
"timestamp": "2025-01-15T10:30:00Z",
"categories": ["coding", "automation"],
"tags": ["python", "cli", "csv"],
"openclawVersion": "1.2.3",
"models": [
{
"model": "anthropic/claude-sonnet-4",
"status": "completed",
"duration_seconds": 45,
"cost_usd": 0.0523,
"usage": {
"tokens_in": 1200,
"cached_tokens_in": 800,
"tokens_out": 350
},
"rounds": {
"messages": 12,
"tool_calls": 15
},
"scores": {
"completeness": 9,
"quality": 8,
"performance": 9,
"costEffectiveness": 7
},
"justifications": {
"completeness": "...",
"quality": "...",
"performance": "...",
"costEffectiveness": "..."
},
"response": "the model's final response text",
"workspace": "<path to model's workspace>"
}
],
"verdict": {
"performanceWinner": "anthropic/claude-sonnet-4",
"costEffectivenessWinner": "anthropic/claude-haiku-4",
"summary": "One-sentence summary of who won and why",
"reasoning": "2-3 sentence detailed explanation of the comparison and tradeoffs"
}
}
Schema notes:
categories: One or more from the predefined list. Always include.tags: Up to 5 short descriptive tags. Always include.openclawVersion: Include if running on OpenClaw, omit otherwise.models[].model: Use the full model identifier (e.g.,anthropic/claude-sonnet-4, not justsonnet)models[].cost_usd: USD cost for this model's session. Usenullif cost data is unavailable.models[].usage: Pass through runtime token fields. Usetokens_infor uncached input tokens,cached_tokens_infor cached input tokens,tokens_outfor output tokens. Omit fields the runtime doesn't provide.verdict: Must be a JSON object withperformanceWinner(full model ID of performance winner, ornull),costEffectivenessWinner(full model ID of cost-effectiveness winner, ornull),summary, andreasoning— not a plain string. The oldwinnerfield is deprecated.
Tell the user where the benchmark directory and each model's workspace are so they can inspect individual outputs.
Runtime-Specific Notes
Token reporting varies by runtime. Report fields as-is using whatever names the runtime uses — do not rename or reinterpret them.
OpenClaw — Device Pairing: When spawning subagents on OpenClaw, the gateway requires device pairing approval. Each subagent connection triggers a pending pairing request that must be approved before the subagent can run.
Before spawning subagents, approve pending requests proactively:
# Approve the most recent pending device request
openclaw devices approve --latest
# Or list pending requests and approve by ID
openclaw devices list
openclaw devices approve <requestId>
If a subagent fails with "gateway closed (1008): pairing required", approve pending requests and retry. You may need to run openclaw devices approve --latest multiple times (once per subagent). This only needs to be done once per device — subsequent connections from the same subagent are remembered.
OpenClaw — Token Reporting: The session_status endpoint returns token stats with misleading labels:
- "Tokens: Xk in / Yk out" —
inmeans uncached input tokens only (does NOT include cached input).outis output tokens. - "Cache: N% hit · Xk cached, Yk new" —
cachedis cached input tokens (cache read hits),newis cache writes. - There is no total input token field returned. Do not sum uncached + cached to fabricate one.
session_statusdoes NOT return USD cost. Use the Python script from Step 6 to get actual costs from session JSONL files instead.- The completion event stats line (e.g.,
tokens 29.7k (in 29.6k / out 82)) matches the session_status values and has the same caveat —inis uncached only.
Tips for Good Benchmark Tasks
When the user's task is vague, you can suggest they make it more specific. Good benchmark tasks have clear success criteria so scoring is less subjective.
Examples across different categories:
- Coding: "Write a Python CLI that converts CSV to JSON, handling quoted commas and multiline fields. Include tests."
- Research: "Find the top 5 trending AI papers this week, summarize each in 2-3 sentences, and explain why they matter."
- Writing: "Write a professional email declining a vendor proposal while keeping the relationship warm for future opportunities."
- Analysis: "Here's a sales CSV [attach file]. Identify the top 3 trends and create a summary report with key insights."
- Productivity: "Create a weekly meal plan for a family of 4 with a $150 budget, including a shopping list organized by store section."
- Multi-step: "Research the current weather in Tokyo, write a short travel advisory, and save it as a formatted markdown file."