jg-benchmark-ops - SKILL.md Agent Skill

name: jg-benchmark-ops
description: "Benchmark collection and evaluation workflow for agent model assignment reviews. Use when pulling benchmarks, evaluating cost/performance, or deciding which models to use for which agents."

New model release available for any agent in the project
User requests benchmark collection, cost/performance evaluation, or model assignment review
Periodic review (e.g. quarterly) of agent model assignments

1. Identify sources
   Use project-defined or common sources: LiveBench, SWE-Bench, Artificial Analysis, or other leaderboards. Use WebSearch to find the latest URLs and dates.
2. Fetch and parse
   Use WebFetch to retrieve page content. If the result is empty or the page is JS-rendered, use the browser MCP or the project's fallback. Parse into a structured format (YAML/JSON). Record the source URL and retrieval date for every score.
3. Store
   Write to the project-defined path (e.g. benchmarks/snapshots/YYYY-MM-DD.yaml). Never overwrite an existing snapshot; if a same-day file exists, write to a new timestamped filename.
4. Validate
   Run the project's schema validator (e.g. python scripts/benchmark_schema.py --validate <path>) before considering the snapshot complete.
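The storage and provenance rules above can be sketched in Python. This is a minimal illustration, not the project's tooling: the function names, the timestamp suffix format, and the field names `model`, `source_url`, and `retrieved_on` are all assumptions; use the project's own schema where one exists.

```python
from datetime import datetime
from pathlib import Path


def next_snapshot_path(directory: Path, now: datetime) -> Path:
    """Pick a snapshot filename that never collides with an existing file."""
    day = now.strftime("%Y-%m-%d")
    candidate = directory / f"{day}.yaml"
    if not candidate.exists():
        return candidate
    # A same-day snapshot exists: switch to a timestamped name
    # (the suffix format is an assumption, not a project convention).
    stamped = f"{day}T{now.strftime('%H%M%S')}"
    candidate = directory / f"{stamped}.yaml"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stamped}-{counter}.yaml"
        counter += 1
    return candidate


def missing_provenance(scores: list[dict]) -> list[str]:
    """Return the models whose score lacks a source URL or retrieval date."""
    # "model", "source_url", and "retrieved_on" are hypothetical field names.
    return [
        s.get("model", "<unknown>")
        for s in scores
        if not s.get("source_url") or not s.get("retrieved_on")
    ]
```

Calling `missing_provenance` before running the schema validator gives an early list of scores that still need a source URL or retrieval date.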

If the project has an eval script (e.g. make benchmark-eval, scripts/benchmark_evaluate.py): run it and read the output. Use its verdicts and metrics in the report.
If not: combine the snapshot data with model pricing; for each agent, compare the current model with alternatives on its primary metrics; assign a verdict (Excellent / Correct / Monitor / Tune / Upgrade) and note the cost impact.

Verdict	Meaning
Excellent	Current model leads its cost tier; no change needed.
Correct	Adequate; within ~5% of the tier leader, and no cheaper model outperforms it.
Monitor	Trails the leader by ~5–15%, or a cheaper option scores within ~3%; schedule a review.
Tune	A same-cost or cheaper model outperforms by >5%; recommend a change.
Upgrade	A higher-cost model outperforms on a critical-path role; recommend only if the cost is justified.
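One way to apply the verdict table mechanically is sketched below. It rests on several assumptions that the table does not settle: scores are "higher is better", gaps are measured as a percentage of the current model's score, Tune takes precedence where the Monitor and Tune bands overlap, and "leader" in the Monitor row means the best alternative in any tier. The `Alternative` type and the tier labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    name: str
    score: float  # primary metric, higher is better (assumption)
    tier: str     # "cheaper", "same", or "premium" relative to the current model


def pct_gap(current: float, other: float) -> float:
    """Percent by which `other` beats `current`; negative when it trails."""
    return (other - current) / current * 100.0


def assign_verdict(current: float, alternatives: list[Alternative],
                   critical_path: bool = False) -> str:
    gaps = {tier: [pct_gap(current, a.score) for a in alternatives if a.tier == tier]
            for tier in ("cheaper", "same", "premium")}
    lowest = float("-inf")
    best_same = max(gaps["same"], default=lowest)
    best_cheaper = max(gaps["cheaper"], default=lowest)
    best_premium = max(gaps["premium"], default=lowest)
    best_overall = max(best_same, best_cheaper, best_premium)

    # Assumption: where the Monitor and Tune bands overlap, Tune wins.
    if max(best_same, best_cheaper) > 5.0:
        return "Tune"
    if critical_path and best_premium > 0.0:
        return "Upgrade"
    # Monitor: a cheaper model scores within 3% of the current one, or the
    # overall leader (assumed to include premium models) is ahead by 5-15%.
    if best_cheaper >= -3.0 or 5.0 < best_overall <= 15.0:
        return "Monitor"
    if best_same > 0.0:
        return "Correct"  # trails the tier leader, but by no more than 5%
    return "Excellent"
```

The thresholds in the table are approximate ("~"), so a function like this is best treated as a first pass whose output is reviewed before it goes into the report.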

Use input/output pricing (per token or per MTok) from provider docs or analysis sites.
Compare: same-tier alternatives (same cost band), cheaper tier, premium tier.
In recommendations, state: current model, suggested model, metric delta, cost delta.
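The cost comparison can be sketched as follows, assuming prices are quoted in USD per million tokens (MTok) and that a representative token volume for the agent is known. The `Pricing` type, the function names, and the output format are hypothetical; real prices should come from provider docs, not from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


def workload_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload with the given token volumes."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


def recommendation_line(current: str, suggested: str,
                        current_score: float, suggested_score: float,
                        current_cost: float, suggested_cost: float) -> str:
    """State current model, suggested model, metric delta, and cost delta."""
    metric_delta = suggested_score - current_score
    cost_delta = suggested_cost - current_cost
    return (f"{current} -> {suggested}: "
            f"metric {metric_delta:+.1f}, cost {cost_delta:+.2f} USD")
```

Computing both costs on the same token volume keeps the cost delta comparable across tiers.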

Collection: Sources, dates, snapshot path, list of models collected (and any missing).
Evaluation: Table (Agent | Model | Verdict | Key metrics); Recommendations (agent, change, before/after, cost impact).
No agent or rule file updates unless the user explicitly asks to apply recommendations.
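The evaluation table in the report can be produced with a small helper like the one below. It is only a sketch: the row shape (agent, model, verdict, metrics dict) and the `key=value` rendering of metrics are assumptions, and it only formats text; it does not touch agent or rule files.

```python
def render_evaluation_table(rows: list[tuple[str, str, str, dict]]) -> str:
    """Render rows as 'Agent | Model | Verdict | Key metrics' lines."""
    lines = ["Agent | Model | Verdict | Key metrics"]
    for agent, model, verdict, metrics in rows:
        # Metrics are rendered as key=value pairs; this layout is an assumption.
        metric_text = ", ".join(f"{name}={value}" for name, value in metrics.items())
        lines.append(f"{agent} | {model} | {verdict} | {metric_text}")
    return "\n".join(lines)
```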