askit-evaluate

name: askit-evaluate
description: Evaluates a skill or plugin against the Advanced Skill Library Standard across three modes, producing deterministic conformance findings and a tier, an opt-in behavioral pass, and a qualitative review. Use when you want to audit conformance, judge whether a skill behaves and triggers correctly, get a qualitative review, or see what blocks the next tier.
chain:
  - askit-evaluator
  - askit-quality-grader
  - askit-reviewer
metadata:
  version: 0.1.1
  tier: universal
  audience: intermediate

Purpose

Assess a known, local component or plugin against STANDARD.md in one of three modes:

conformance (the default) runs the deterministic portable scripts and returns a per-rule report (pass / warn / error), the satisfied tier, and concrete remediation.
behavioral runs a skill against its eval-set and judges whether it triggers and behaves as expected, delegating to askit-quality-grader.
review forms a qualitative judgment (correctness, altitude, naming, whether a component is warranted), delegating to askit-reviewer.

Only conformance is deterministic and gate-safe; behavioral and review are opt-in LLM-judged passes that produce evidence, never a CI gate result (Design Principle 3, ADR 0023).

When to use

Use this skill when the user asks to evaluate, audit, or check a skill or plugin, or asks "what tier is this" or "what is blocking the next tier" (conformance); asks whether a skill actually triggers and behaves correctly (behavioral); or wants a qualitative review (review).

conformance mode (default, deterministic)

Determine the target path (a plugin root with library.json, or a single skill directory with SKILL.md).
Run: node scripts/evaluate.mjs <path> --json.
Present the findings grouped by rule, the tier (for a plugin), and the remediation. Lead with errors, then warnings.
For a shareable, designed report, render the same object: node scripts/evaluate.mjs <path> --format=html --out report.html (a self-contained page for a non-engineer) or --format=md (the Markdown twin for PR review and agents). It renders the same deterministic object the terminal shows, adds no judgment, and does not change the verdict. See references/report-format.md.
If there are warnings or errors, point the user at askit-build-skill in improve mode to fix them.
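
The presentation order in the steps above (findings grouped by rule, errors before warnings) can be sketched as a small function. This is only an illustration: the finding shape `{ rule, severity, message }`, the rule names, and the function name are assumptions. The real shape of the `--json` output is whatever references/report-format.md defines.

```javascript
// Hypothetical finding shape: { rule, severity, message }.
// The actual --json output of scripts/evaluate.mjs may differ.
const SEVERITY_ORDER = { error: 0, warn: 1, pass: 2 };

function groupFindingsByRule(findings) {
  const groups = new Map();
  for (const finding of findings) {
    if (!groups.has(finding.rule)) groups.set(finding.rule, []);
    groups.get(finding.rule).push(finding);
  }
  // A group's rank is that of its most severe finding; unknown severities sort last.
  const rank = (items) =>
    Math.min(...items.map((f) => SEVERITY_ORDER[f.severity] ?? 3));
  return [...groups.entries()]
    .map(([rule, items]) => ({ rule, worst: rank(items), items }))
    .sort((a, b) => a.worst - b.worst || a.rule.localeCompare(b.rule));
}

// Made-up sample data, for illustration only.
const sample = [
  { rule: "rule-b", severity: "warn", message: "example warning" },
  { rule: "rule-a", severity: "pass", message: "example pass" },
  { rule: "rule-c", severity: "error", message: "example error" },
];
const ordered = groupFindingsByRule(sample).map((g) => g.rule);
console.log(ordered); // errors first, then warnings, then passes
```

Ranking each rule by its worst finding keeps a rule with one error and several passes at the top of the report, which matches the "lead with errors" instruction.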

behavioral mode (opt-in, LLM-judged)

Locate the target's eval-set under evals/ (triggering {query, should_trigger} cases and {given, expect} behavior cases). The evals/ convention is forward-looking: most real targets do not ship one. If it is absent, the grader DERIVES a case set instead: should-fire queries and adversarial near-misses from the target's description (read sibling skills' descriptions to make the no-fire cases genuinely competitive), plus behavior cases from the documented workflow. It then judges by static analysis of the artifact and says so in the evidence.
Delegate to askit-quality-grader: it runs the skill against the cases (on-disk or derived) and judges fire / no-fire and output quality.
Report the verdict per case with evidence. This is evidence, not a gate result; it never fails CI.
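
The triggering half of the eval-set can be illustrated with a small fire / no-fire comparison. The field names `query` and `should_trigger` come from the convention described above. The sample queries, the `observed` map, and the function name are hypothetical. In practice askit-quality-grader forms this judgment with an LLM rather than a lookup, so treat this as a sketch of the report shape only.

```javascript
// Triggering cases in the {query, should_trigger} shape. The queries are made up.
const triggeringCases = [
  { query: "audit this plugin against the standard", should_trigger: true },
  { query: "write a new skill from scratch", should_trigger: false },
];

// Hypothetical: `observed` maps each query to whether the skill fired.
function judgeTriggering(cases, observed) {
  return cases.map((c) => {
    const fired = observed[c.query] === true;
    return {
      query: c.query,
      expected: c.should_trigger ? "fire" : "no-fire",
      actual: fired ? "fire" : "no-fire",
      verdict: fired === c.should_trigger ? "pass" : "fail",
    };
  });
}

const report = judgeTriggering(triggeringCases, {
  "audit this plugin against the standard": true,
  "write a new skill from scratch": true, // fired on a near-miss: a false positive
});
console.log(report.map((r) => `${r.verdict}: ${r.query}`));
```

The result is a list of per-case verdicts with the expected and actual behavior side by side. It is evidence to report, not a value to feed into a CI exit code.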

review mode (opt-in, qualitative)

Delegate to askit-reviewer: it reviews the component or change for correctness, altitude, naming, and whether it is warranted.
Report findings by severity with remediation. Like behavioral, this is advisory, not a gate result.

Scope

The conformance core is deterministic and is what CI runs (Design Principle 3). Behavioral and review are opt-in LLM-judged layers beside the gate: the Standard defers the full eval engine to the roadmap, and G3 requires only the deterministic baseline (ADR 0023). See references/report-format.md for the report shape. The delegations are permitted in agents/_chain-permitted.yaml (askit-evaluate may invoke askit-evaluator for conformance, askit-quality-grader for behavioral, and askit-reviewer for review).
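
The mode-to-delegate mapping and the gate-safety rule described above can be summarized as a lookup table. The delegate names and the deterministic / LLM-judged split come from the text. The `MODES` structure and `isGateSafe` helper are hypothetical and do not reflect the contents of agents/_chain-permitted.yaml.

```javascript
// Mode-to-delegate mapping as described in the Scope section.
const MODES = new Map([
  ["conformance", { delegate: "askit-evaluator", deterministic: true }],
  ["behavioral", { delegate: "askit-quality-grader", deterministic: false }],
  ["review", { delegate: "askit-reviewer", deterministic: false }],
]);

// Hypothetical helper: only a deterministic mode may produce a CI gate result.
function isGateSafe(mode) {
  const entry = MODES.get(mode);
  if (!entry) throw new Error(`unknown mode: ${mode}`);
  return entry.deterministic;
}

for (const [mode, { delegate }] of MODES) {
  console.log(`${mode} -> ${delegate} (gate-safe: ${isGateSafe(mode)})`);
}
```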