gspdev-benchmark - SKILL.md Agent Skill

name: gspdev-benchmark description: Benchmark GSP token budget — capture snapshots, compare between versions/PRs, track improvements over time allowed-tools: - Read - Bash - Glob - Grep - AskUserQuestion argument-hint: "[capture|compare|trend|pr] default: capture"

Benchmarks GSP's token budget, rate limit risk, and test health. Wraps `dev/scripts/benchmark.sh` with interactive analysis.

Snapshots live in dev/benchmarks/ as JSON files. The workflow is release-centric:

At each release, capture a baseline: benchmark.sh release → {version}.release.json
During development, capture snapshots: benchmark.sh capture [label]
Before PR, compare against release: benchmark.sh compare (auto-selects release baseline)

This measures the cumulative impact of changes heading into the next release.

## Step 0: Parse mode

Read $ARGUMENTS to determine the mode:

Input	Mode
(no args)	Capture + auto-compare against release baseline
`capture [label]`	Capture with optional label
`release`	Capture as the release baseline for this version
`compare [v1] [v2]`	Compare two snapshots (default: release vs latest)
`trend`	Show trajectory across all snapshots

Step 1: Execute mode

Default mode (no args)

Run bash dev/scripts/benchmark.sh capture
If a release baseline exists, automatically run benchmark.sh compare
Present the comparison table and analysis

This is the most common invocation — "where are we vs the last release?"

Capture mode

Run bash dev/scripts/benchmark.sh capture [label].

After capture, read the JSON from dev/benchmarks/ and present a summary:

Total weight and red-zone count
Top 5 heaviest skills with their breakdown
Rate limit risk (API conversations, double-dispatch count)
If a release baseline exists, auto-compare and highlight changes

Release mode

Run bash dev/scripts/benchmark.sh release.

This is part of the publish flow — capture the definitive baseline for this version. All future compare calls will diff against this baseline until the next release.

Remind the user: "Add benchmark.sh release to your /gspdev-publish checklist."

Compare mode

Run bash dev/scripts/benchmark.sh compare [v1] [v2].

Default (no args) compares release baseline → latest capture. After the table output, analyze:

Flag any skill that moved from green/yellow to red
Flag any skill whose exec_context increased (potential pass-through regression)
Summarize net pipeline path changes
Rate limit risk changes
Highlight the biggest wins and regressions

Trend mode

Run bash dev/scripts/benchmark.sh trend.

After the table, identify:

Overall trajectory (improving, stable, regressing)
Largest single-version improvement
Skills that consistently stay in red zone

Step 2: Suggest next actions

Based on the results, suggest relevant actions:

If red-zone skills exist: "Consider optimizing {skill} — {specific suggestion}"
If exec_context is non-zero for agent-spawning skills: "Pass-through exec_context detected in {skill}"
If rate limit risk increased: "New double-dispatch skill adds API conversation overhead"
If tests have failures: "Fix test failures before benchmarking"
If total weight decreased significantly: celebrate the win with specific numbers