name: benchflow description: Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance. user-invocable: true allowed-tools: - Read - Write - Edit - Bash
BenchFlow — Agent Benchmarking
BenchFlow runs AI coding agents against tasks in sandboxed environments and scores their output via ACP (Agent Communication Protocol).
Arguments passed: $ARGUMENTS
Dispatch on arguments
No args or status — show current state
- Check if benchflow is installed:
uv tool list | grep benchflow - Check if API keys are set (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.)
- Check available agents:
bench agent list - Show recent eval results if any exist under
jobs/(the default--jobs-dir) - Point to next action based on state
run <task-path> — run a single task
bench eval run \
--tasks-dir <task-path> \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona
Or via Python SDK:
import asyncio
import benchflow as bf
from benchflow import RolloutConfig, Scene
from benchflow._utils.benchmark_repos import resolve_source
async def main():
config = RolloutConfig(
task_path=resolve_source("benchflow-ai/skillsbench", path="tasks/edit-pdf"),
scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
environment="daytona",
)
result = await bf.run(config)
print(f"Reward: {result.rewards}, Tools: {result.n_tool_calls}")
asyncio.run(main())
Note: resolve_source() is required for remote repos in the SDK. The CLI
handles this transparently via --source-repo / --source-path.
API keys are auto-inherited from os.environ into the sandbox.
eval <tasks-dir> — run a benchmark suite
bench eval run \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--concurrency 64
Or via YAML config:
bench eval run --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml
YAML format:
source:
repo: benchflow-ai/skillsbench
path: tasks
agent: gemini
model: gemini-3.1-flash-lite-preview
environment: daytona
concurrency: 64
max_retries: 1
metrics <jobs-dir> — analyze results
bench eval metrics jobs/ # aggregate pass-rate / tokens / cost (add --json to pipe)
bench eval list jobs/ # per-rollout table
view <rollout-dir> — view a trajectory
Results land under jobs/<job-name>/<rollout-name>/ (the default --jobs-dir is jobs/):
rollout-dir/
├── result.json # rewards, agent, timing
├── prompts.json # prompts sent
├── trajectory/
│ └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
├── reward.txt # reward value
└── ctrf.json # test results
create-task — create a new benchmark task
bench tasks init my-task # native task.md format (default)
bench tasks init my-task --no-pytest --no-oracle
bench tasks check tasks/my-task # structural validation
Quick structure (native task.md format, the default):
my-task/
├── task.md # YAML frontmatter (config) + prompt body
├── environment/
│ └── Dockerfile # sandbox setup
├── verifier/
│ ├── test.sh # verifier entrypoint -> writes /logs/verifier/reward.txt
│ └── test_outputs.py
└── oracle/ # optional reference solution (solve.sh)
--format legacy is retired in v0.6.2: bench tasks init always scaffolds a
native task.md package. To bring an existing split-layout task forward, run
bench tasks migrate <dir> --remove-legacy.
skills — discover and evaluate agent skills
bench skills list # discover skills on disk
bench skills eval skills/citation-management \
--agent claude-agent-acp # score a skill against its evals/evals.json
hub — check external-environment-hub compatibility
bench hub check # inventory/structurally-check representative Harbor-registry tasks
agents — list available agents
bench agent list
| Agent | Protocol | Auth |
|---|---|---|
gemini |
ACP | GEMINI_API_KEY or host login |
claude-agent-acp (alias: claude) |
ACP | ANTHROPIC_API_KEY or host login |
codex-acp (alias: codex) |
ACP | OPENAI_API_KEY or host login |
opencode |
ACP | inferred from model |
openhands (alias: oh) |
ACP | LLM_API_KEY |
harvey-lab-harness (alias: harvey-lab) |
ACP | Provider key matching model |
Any agent can be prefixed with acpx/ to run via ACPX (https://acpx.sh/):
bench eval run --tasks-dir tasks/edit-pdf --agent acpx/gemini --model gemini-3.1-flash-lite-preview --sandbox daytona
ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent's install, env vars, credentials, and skill paths are preserved.
compare — multi-agent comparison
Compare by running one config per agent (the agent: key lives in each YAML)
and printing the aggregate scores:
import asyncio
from benchflow.evaluation import Evaluation
async def main():
for config_path in [
"benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml",
"benchmarks/harvey-lab/harvey-lab-harness-parity.yaml",
]:
result = await Evaluation.from_yaml(config_path).run()
print(f"{config_path}: {result.passed}/{result.total} ({result.score:.1%})")
asyncio.run(main())
Setup
# Install benchflow 0.6.0 from PyPI (--prerelease allow is for the pinned LiteLLM rc dep):
uv tool install --prerelease allow benchflow
# (or from source: uv sync --extra dev --locked)
export GEMINI_API_KEY=... # or ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.
export DAYTONA_API_KEY=... # for cloud sandboxes
Sandboxes
| Sandbox | Flag | Best for |
|---|---|---|
docker |
--sandbox docker |
Local dev, small runs (<=10 tasks) |
daytona |
--sandbox daytona |
Cloud runs with concurrency (needs DAYTONA_API_KEY) |
modal |
--sandbox modal |
Serverless, high concurrency (needs Modal auth) |
Use daytona for benchmarks. Docker is limited by network exhaustion.
Skills in tasks
Two approaches for deploying skills:
Baked into Docker image (existing tasks)
COPY skills /root/.claude/skills
Runtime deployment via --skills-dir
bench eval run \
--tasks-dir task-dir \
--agent claude-agent-acp \
--sandbox daytona \
--skills-dir skills/ \
--skill-mode with-skill \
--agent-env BENCHFLOW_SKILL_NUDGE=name
--skill-mode with-skill is required whenever you pass --skills-dir (omitting
it errors). Skills are uploaded to /skills/ in the sandbox and symlinked to
agent-specific paths.
Tips
- Use
gemini-3.1-flash-lite-previewfor testing. Use Pro/Sonnet for real benchmarks. - Evaluations resume — re-running the same
jobs_dirskips completed tasks. Nonein prompts list gets replaced withinstruction.mdcontent.- Partial rewards work (verifier can write
0.5to reward.txt). - GEMINI_API_KEY requires explicit
--agent-env GEMINI_API_KEY=...in CLI; SDK auto-inherits from os.environ.