orchestra-eval - SKILL.md Agent Skill

name: orchestra-eval description: "Grade the outputs of an orchestra skill run against its eval assertions — reads produced files, checks each assertion, and writes a grading report." when_to_use: "Use after running an orchestra skill to verify it produced correct outputs. Run orchestra-eval [work-item-id] in the project where the skill was just executed." allowed-tools: Read, Glob, Grep, Bash, Write argument-hint: " [work-item-id]"

Orchestra Eval

Grade an orchestra skill's output against its eval assertions. This skill does not run the skill under test — it grades what was already produced.

Variables

SKILL_NAME: first word of $ARGUMENTS WORK_ITEM: second word of $ARGUMENTS (optional)

Step 1: Load the Eval Suite

Find the skill's evals.json:

${CLAUDE_SKILL_DIR}/../{SKILL_NAME}/evals/evals.json

If the file doesn't exist, stop: "No evals found for {SKILL_NAME}. Add evals/evals.json to the skill directory."

Read the file. Note how many test cases exist.

Step 2: Identify the Work Item

The grader needs to know which output to evaluate.

If WORK_ITEM is provided in $ARGUMENTS:

Use .orchestra/work/{WORK_ITEM}/ as the output directory

If no WORK_ITEM:

Glob .orchestra/work/*/ to find all work items
Find the most recently modified one:
```
ls -t .orchestra/work/ | head -1
```
Use that as the work item
Report which work item was auto-detected

If no work items exist, stop: "No .orchestra/work/ items found in this project. Run the skill first, then eval."

Step 3: Grade Each Test Case

For each test case in evals.json, grade every assertion. Record PASS or FAIL with specific evidence.

Assertion types and how to check them

File existence — "X file exists at path":

ls .orchestra/work/{id}/prd.md 2>/dev/null && echo "EXISTS" || echo "MISSING"

PASS if file exists and is non-empty. FAIL if missing or empty.

Section presence — "contains a ## X section":

grep -c "^## X" .orchestra/work/{id}/prd.md

PASS if count > 0. FAIL if absent.

Frontmatter field — "frontmatter has status: approved":

grep "^status:" .orchestra/work/{id}/prd.md

PASS if value matches. FAIL if missing or wrong value.

Absence checks — "contains no X": Read the file content. Look for the prohibited term or pattern.

For "no framework or library names": scan for: pytest, click, argparse, zoneinfo, pytz, flask, fastapi, django, httpx, requests, sqlalchemy, pydantic
For "no HTTP method names": scan for: GET, POST, PUT, DELETE, PATCH as standalone words
For "no endpoint paths": scan for patterns like /api/, /v1/, starting with /
For "no technical implementation language": use model judgment — read the section and evaluate whether it describes outcomes or implementations PASS if prohibited content is absent. FAIL with the specific term or line found.

Count assertions — "at least N X": Count the relevant items and compare.

Model judgment assertions — assertions about quality, tone, or semantic content that can't be mechanically verified: Read the relevant section and reason about whether it passes. State explicitly: "Model judgment: [reasoning]. Result: PASS/FAIL."

Grading rules

Require evidence for every PASS. Quote the file content or show the grep result.
Require specifics for every FAIL. Name the exact term, missing section, or wrong value.
Never give benefit of the doubt. If an assertion says "contains no technical language" and the file has one framework name, that's a FAIL even if everything else is clean.

Step 4: Write the Grading Report

Write to .orchestra/eval/{skill-name}/{date}-{work-item}.json:

{
  "skill": "{skill-name}",
  "work_item": "{id}",
  "graded_at": "{YYYY-MM-DD}",
  "test_cases": [
    {
      "id": 1,
      "prompt": "{prompt from evals.json}",
      "assertion_results": [
        {
          "text": "{assertion text}",
          "passed": true,
          "evidence": "{quoted content or command output that confirms it}"
        },
        {
          "text": "{assertion text}",
          "passed": false,
          "evidence": "{exactly what was found or missing}"
        }
      ],
      "summary": {
        "passed": N,
        "failed": N,
        "total": N,
        "pass_rate": 0.NN
      }
    }
  ],
  "overall": {
    "passed": N,
    "failed": N,
    "total": N,
    "pass_rate": 0.NN
  }
}

Create .orchestra/eval/ directory if it doesn't exist.

Step 5: Report to User

Present a clean summary:

## Eval: {skill-name} — {work-item}

Overall: {passed}/{total} assertions passed ({pass_rate}%)

### Test Case 1
✓ prd.md exists at .orchestra/work/{id}/prd.md
✓ prd.md frontmatter has status: approved
✓ prd.md contains a ## Problem section
✗ prd.md contains no framework or library names
  → Found "pytest" in line 12: "Success criteria: pytest tests pass"
✓ spec.md contains a ### Unit Tests subsection
...

### Failed Assertions ({N} total)
1. [test-1] prd.md contains no framework or library names
   → Found "pytest" in line 12
2. [test-1] gherkin-spec.md contains at least one error scenario
   → No scenario with error, invalid, or fail in its name

### Next Steps
Fix the {N} failing assertions in {skill-name}/SKILL.md then re-run the skill and eval.

Rules

Grade the assertions in evals.json for the relevant test case, not all test cases — match by which test case fits what was produced
If multiple test cases exist and only one was run, grade the one whose prompt best matches the produced work item
Do not modify the skill under test — only read and grade
The grading.json is the authoritative record; the console output is for the human

Hard Stop

After presenting the report and writing grading.json, stop completely. Do not:

Run any other skill
Start any new work
Interpret project state as a signal to do something
Invoke orchestra-plan, orchestra-implement, or any other skill

Your job is grading. It ends when the report is written.