create-eval - SKILL.md Agent Skill

name: create-eval description: "Create trigger evaluation setup for a toolkit skill. Use when the user wants to test whether a skill's description triggers correctly, set up eval workspaces, or generate trigger test queries for a skill. Use when user says 'create eval', 'test triggers', 'eval skill', or wants to measure skill triggering accuracy." argument-hint: "[toolkit] [skill]"

Create trigger eval for a skill

Scaffold a trigger eval for $ARGUMENTS (format: toolkit skill or toolkit/skill).

Step 1: Locate the skill

Parse $ARGUMENTS into toolkit and skill name. Find the skill at workbench/<toolkit>/skills/<skill>/SKILL.md. Read the skill's frontmatter (name, description) and body to understand what it does and when it should trigger.

Step 2: Create eval directory

Create evals/<toolkit>/<skill>/ if it doesn't exist.

Step 3: Determine eval workspaces

Ask the user which workspace configurations to test. Each workspace represents a different set of installed toolkits — this tests how the skill behaves when competing with other skills.

Common patterns:

init-only — just dlthub ai init (minimum skills: setup-secrets, toolkit-dispatch). Tests cold-start triggering.
with-<toolkit> — init + the skill's own toolkit installed. Tests triggering with competing sibling skills.

Write config.json:

{
  ".eval-workspaces": {
    "init-only": {"toolkits": []},
    "with-rest-api": {"toolkits": ["rest-api-pipeline"]}
  }
}

Ask the user if they want additional workspace configurations. Each entry adds a workspace with different toolkit combinations.

Step 4: Generate trigger eval queries

Read the skill's SKILL.md description carefully. Then read all competing skill descriptions from the selected toolkits:

uv run python tools/list_skill_descriptions.py workbench/<toolkit1> workbench/<toolkit2> ...

Use the competing descriptions to understand clash surfaces — which skills have overlapping vocabulary or intent.

Generate 20 eval queries — a mix of should-trigger (10) and should-not-trigger (10).

Query quality rules

Queries must be realistic — what a real user would actually type. Include personal context, specific details, file paths, API names, error messages, casual phrasing. Mix formal and informal, long and short.

Bad: "Format this data", "Build a pipeline", "Deploy something"

Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"

Should-trigger queries (10)

Think about coverage — different phrasings of the same intent:

Some formal, some casual
Cases where the user doesn't name the skill explicitly but clearly needs it
Uncommon use cases at the edges of the skill's scope
Cases where this skill competes with another but should win

Should-not-trigger queries (10)

The most valuable negatives are near-misses — queries that share keywords or concepts with the skill but actually need something different:

Adjacent domains or overlapping vocabulary
Ambiguous phrasing where a keyword match would trigger but shouldn't
Queries that touch on the skill's domain but in a context where another tool is better
Specific in-progress tasks that belong to sibling skills

Avoid obviously irrelevant negatives — "write a fibonacci function" as a negative for a pipeline skill doesn't test anything. The negatives should be genuinely tricky.

Disabled queries

If during analysis a query turns out to be an undertrigger (Claude handles it directly without any skill), mark it as disabled instead of removing:

{"query": "...", "should_trigger": true, "disabled": true, "reason": "undertrigger — Claude uses MCP directly"}

Write trigger-eval.json.

Step 5: Review with user

Present the generated queries grouped by should-trigger/should-not-trigger. Explain the reasoning for tricky cases. Let the user edit, add, or remove queries before finalizing.

Step 6: Build workspaces

Run:

uv run python tools/create_eval_workspace.py evals/<toolkit>/<skill>

This creates all workspaces defined in config.json.

Step 7: Continue to run-eval

Ask the user if they want to run the eval now. If yes, hand over to /run-eval <toolkit> <skill>. Do not duplicate the eval running and analysis logic here.