name: j-rig-eval description: Runs the j-rig seven-layer binary evaluation on a Claude skill and reports a ship or no-ship rollout decision. Activates when a user asks to evaluate, grade, score, or gate a SKILL.md before release.
j-rig Skill Evaluation
Evaluate a Claude skill (SKILL.md) with the j-rig binary-eval harness and
return a clear ship or no-ship decision backed by evidence.
When to use this skill
Use it whenever someone wants to know whether a skill change is safe to ship — before opening a release PR, after editing a skill body, or when gating a skill in CI. The decision is binary: every criterion is a yes or no, and a single blocker failure blocks release regardless of the average.
Instructions
- Locate the skill directory that contains the
SKILL.mdunder evaluation and the eval contract that lists its criteria and test cases. - Run the full seven-layer evaluation against the skill: package integrity, trigger quality, functional execution, judgment, scoring, evidence persistence, and the rollout report.
- Read the rollout decision. A
shipdecision means every blocker passed and no regression-critical criterion failed. Ablockdecision means at least one blocker failed. Awarndecision means non-blocking criteria need attention. Anobsolete_reviewdecision means the naked model matched the skill and the skill may add no value. - Report the decision plainly, then list any failing criteria with their reasons so the author knows exactly what to fix.
Decision rules
- A blocker failure cannot be averaged out by passing criteria — it blocks.
- A regression on a sacred case blocks release even if the overall score rose.
- Observed behavior outranks claimed behavior: grade what the skill actually produced, not what its description promises.
- The evaluator is always separate from the skill under test; the skill never judges itself.
Examples
Input: a request to evaluate a commit-message skill before release. Output: a ship-or-no-ship decision plus the per-criterion pass and fail rows, with one-line reasons for every failure.
Input: a request to gate a skill in a pull request. Output: a no-ship decision naming the single blocker that failed, so the author can fix exactly that one thing and re-run.