benchmark

Benchmark models head-to-head or eval skills (with-skill vs bare baseline) on a calibrated mid-weight task — Claude models at any effort level, plus external CLI agents like codex, gemini, or cursor-agent. Parallel sub-agents do the work, a blind judge scores it, and an HTML report lands in benchmarks/ with a CursorBench-style leaderboard (score %, cost/task, tokens/task, steps/task). Use when the user wants to benchmark, eval, compare, or A/B test models, skills, or coding agents, asks "which model is better at X" or "does this skill actually help", or says /benchmark, /benchmark model, /benchmark skill. Quick mode is the default (one task, minimal questions); "deep" runs more tasks and contenders.
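As a rough illustration of the leaderboard columns named above (score %, cost/task, tokens/task, steps/task), the sketch below averages per-task results for each contender and sorts by score. The skill's real implementation could not be inspected, so everything here is an assumption made for illustration: the `TaskRun` record, the `leaderboard` function, the field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One contender's result on one task (hypothetical record shape)."""
    contender: str
    score: float      # judge score, assumed to lie in [0, 1]
    cost_usd: float   # cost of the run in US dollars
    tokens: int       # total tokens consumed
    steps: int        # number of agent steps taken


def leaderboard(runs: list[TaskRun]) -> list[dict]:
    """Average each metric per contender and sort by score, best first."""
    by_contender: dict[str, list[TaskRun]] = {}
    for run in runs:
        by_contender.setdefault(run.contender, []).append(run)

    rows = []
    for name, group in by_contender.items():
        rows.append({
            "contender": name,
            "score_pct": round(100 * mean(r.score for r in group), 1),
            "cost_per_task": round(mean(r.cost_usd for r in group), 4),
            "tokens_per_task": round(mean(r.tokens for r in group)),
            "steps_per_task": round(mean(r.steps for r in group), 1),
        })
    return sorted(rows, key=lambda row: row["score_pct"], reverse=True)


# Made-up sample data, not real benchmark results.
runs = [
    TaskRun("model-a", 0.80, 0.12, 15000, 9),
    TaskRun("model-a", 0.60, 0.08, 11000, 7),
    TaskRun("model-b", 0.90, 0.30, 28000, 14),
]
for row in leaderboard(runs):
    print(row)
```

Averaging per task keeps contenders comparable when they complete different numbers of tasks, which matters if "deep" mode runs more tasks than quick mode.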

By h00mankind. Updated 6/12/2026

Install via CLI

npx skills add https://github.com/h00mankind/workflow --skill benchmark

Repository Details

Stars: 0

Forks: 0

Branch: main

Path: SKILL.md