name: bkit-evals
classification: capability
classification-reason: Eval runner is a development-time quality tool, not a workflow phase
deprecation-risk: none
effort: low
description: |
  Run skill evals via evals/runner.js. The wrapper validates skill names, captures stdout/stderr, and persists JSON results.
  Triggers: bkit evals, evals run, skill quality, eval runner, 스킬 평가, 評価実行, 评估运行, evaluación, évaluation.
argument-hint: "run | list"
user-invocable: true
allowed-tools:
- Bash
- Read
- Glob
- Grep
imports: []
next-skill: null
pdca-phase: null
task-template: "[Evals] {action}"
bkit Evals — Skill Quality Evaluation Runner
v2.1.11 Sprint β FR-β2. Wraps `evals/runner.js` with input validation, result persistence, and structured reporting. Replaces the bare `node evals/runner.js <skill>` invocation, which required users to remember the argv structure and ignored timeout and sandbox concerns.
Arguments
| Argument | Description | Example |
|---|---|---|
| `run <skill>` | Execute the eval suite for one skill | `/bkit-evals run gap-detector` |
| `list` | List all skills that have an `eval.yaml` definition | `/bkit-evals list` |
If no argument is provided, render the same output as list.
Behavior
run <skill>
- Validate `skill` against `/^[a-z][a-z0-9-]{0,63}$/`. Reject anything else (no shell metacharacters, no slashes, no spaces); see Security below.
- Spawn `node evals/runner.js --skill <skill>` via `child_process.spawnSync` (argv form, no shell). Default timeout 30 s, max 120 s. The `--skill` flag form is mandated by the runner CLI and locked by L3 contract test.
- Capture stdout / stderr. Parse the trailing JSON block via balanced-brace fallback (string-aware).
- Apply fail-closed defense: if `parsed === null` and stdout includes `Usage:`, return `reason: 'argv_format_mismatch'`; if `parsed === null` otherwise, return `reason: 'parsed_null'`. Exit code 0 alone NEVER implies success: the parsed JSON must be present.
- Persist the structured result to `.bkit/runtime/evals-{skill}-{ISO timestamp}.json` with stdout/stderr tails (2000 chars each), `parsed` payload, and `reason` field.
- Render a one-line summary in the chat:
  - exit code
  - parsed pass/fail counts (if available)
  - path of the persisted result file
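The parsing and fail-closed steps above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the helper names `tryParseObject`, `extractTrailingJson`, and `classifyOutput` are hypothetical, and returning `reason: null` on success is an assumption, since this document only specifies the two failure reasons.

```javascript
// Hypothetical helpers; the real internals of lib/evals/runner-wrapper.js are not shown here.

// Parse `candidate` and return it only if it is a plain JSON object, else null.
function tryParseObject(candidate) {
  try {
    const value = JSON.parse(candidate);
    const isObject = value !== null && typeof value === 'object' && !Array.isArray(value);
    return isObject ? value : null;
  } catch (_err) {
    return null;
  }
}

// Return the last top-level {...} block in `text` that parses as a JSON object, or null.
function extractTrailingJson(text) {
  // Fast path: the whole output is a single JSON object.
  const whole = tryParseObject(text.trim());
  if (whole !== null) return whole;

  // Fallback: balanced-brace scan that ignores braces inside JSON strings.
  let last = null;
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i += 1) {
    const ch = text[i];
    if (depth === 0) {
      if (ch === '{') {
        depth = 1;
        start = i;
        inString = false;
        escaped = false;
      }
      continue;
    }
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth += 1;
    else if (ch === '}') {
      depth -= 1;
      if (depth === 0) {
        const parsed = tryParseObject(text.slice(start, i + 1));
        if (parsed !== null) last = parsed;
      }
    }
  }
  return last;
}

// Fail closed: without a parsed payload the run is never reported as a success.
function classifyOutput(stdout) {
  const parsed = extractTrailingJson(stdout);
  if (parsed !== null) return { parsed, reason: null }; // `null` on success is an assumption
  const reason = stdout.includes('Usage:') ? 'argv_format_mismatch' : 'parsed_null';
  return { parsed: null, reason };
}
```

Note that the exit code plays no part in `classifyOutput`; a caller would report it alongside the reason rather than use it to decide success.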
list
- Read `evals/config.json` to enumerate skill classifications.
- For each classification (`workflow`, `capability`, `hybrid`), list skills that have `evals/{classification}/{skill}/eval.yaml`.
- Render a category-grouped table with skill name + a one-line note from the eval YAML (`description` field if present).
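A rough sketch of the discovery step, under stated assumptions: the schema of `evals/config.json` is not described in this document, so the three classifications named above are hardcoded instead of read from it, and the function name `listEvalSkills` is hypothetical. Reading the `description` field is left out because it would need a YAML parser.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Assumption: hardcoded here; the real skill reads these from evals/config.json.
const CLASSIFICATIONS = ['workflow', 'capability', 'hybrid'];

// Group skill directory names by classification, keeping only those with an eval.yaml.
function listEvalSkills(evalsRoot) {
  const grouped = {};
  for (const classification of CLASSIFICATIONS) {
    const dir = path.join(evalsRoot, classification);
    let entries = [];
    try {
      entries = fs.readdirSync(dir, { withFileTypes: true });
    } catch (_err) {
      // Missing classification directory: treat it as having no skills.
    }
    grouped[classification] = entries
      .filter((entry) => entry.isDirectory())
      .map((entry) => entry.name)
      .filter((name) => fs.existsSync(path.join(dir, name, 'eval.yaml')))
      .sort();
  }
  return grouped;
}
```

Calling `listEvalSkills('evals')` from the repository root would return an object keyed by classification, which maps directly onto the category-grouped table.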
Security
- Skill name regex prevents argument injection. Anything outside `[a-z][a-z0-9-]{0,63}` is rejected with `reason: invalid_skill_name`.
- argv-array spawn (no shell). No template-string concatenation into command lines.
- Result file path is composed from a hardcoded base + sanitized skill name + timestamp; no traversal possible.
- Subprocess timeout enforced (default 30 s, hard cap 120 s) so a buggy eval cannot block the session indefinitely.
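The guarantees above might look roughly like this in code. It is a sketch, not the contents of `lib/evals/runner-wrapper.js`: `clampTimeout`, `runEval`, and the `opts.runnerPath` / `opts.timeoutMs` options are hypothetical names, and using `process.execPath` for the `node` binary is a choice made for this example.

```javascript
const { spawnSync } = require('node:child_process');

const SKILL_NAME_RE = /^[a-z][a-z0-9-]{0,63}$/;
const DEFAULT_TIMEOUT_MS = 30_000; // default 30 s
const MAX_TIMEOUT_MS = 120_000; // hard cap 120 s

// Lowercase letter first, then up to 63 lowercase letters, digits, or hyphens.
function isValidSkillName(name) {
  return typeof name === 'string' && SKILL_NAME_RE.test(name);
}

// Fall back to the default for missing or nonsensical values, and never exceed the cap.
function clampTimeout(requestedMs) {
  if (!Number.isFinite(requestedMs) || requestedMs <= 0) return DEFAULT_TIMEOUT_MS;
  return Math.min(requestedMs, MAX_TIMEOUT_MS);
}

// Hypothetical entry point: validate first, then spawn with an argv array and no shell.
function runEval(skill, opts = {}) {
  if (!isValidSkillName(skill)) {
    return { spawned: false, reason: 'invalid_skill_name' };
  }
  const result = spawnSync(
    process.execPath, // the running Node binary (assumption; the document says `node`)
    [opts.runnerPath || 'evals/runner.js', '--skill', skill],
    { shell: false, encoding: 'utf8', timeout: clampTimeout(opts.timeoutMs) }
  );
  return {
    spawned: true,
    exitCode: result.status,
    // spawnSync reports a timeout through result.error with code ETIMEDOUT.
    timedOut: Boolean(result.error && result.error.code === 'ETIMEDOUT'),
    stdout: result.stdout || '',
    stderr: result.stderr || '',
  };
}
```

Because the skill name is passed as its own argv element and has already matched the regex, a value such as `x; rm -rf /` is rejected before any process is started.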
Module Dependencies
| Module | Function | Usage |
|---|---|---|
| `lib/evals/runner-wrapper.js` | `invokeEvals(skill, opts)` | Validate + spawn + persist |
| `lib/evals/runner-wrapper.js` | `isValidSkillName(name)` | Regex pre-check shared with `list` |
| `evals/runner.js` | (subprocess) | Existing eval execution engine |
Result Schema
.bkit/runtime/evals-{skill}-{timestamp}.json:
{
"skill": "gap-detector",
"invokedAt": "<ISO 8601>",
"exitCode": 0,
"timedOut": false,
"stdoutTail": "...",
"stderrTail": "...",
"parsed": { /* whatever runner.js prints as JSON, or null */ }
}
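One way to assemble and persist a record of this shape is sketched below. The function names `tail`, `buildResult`, and `persistResult` are hypothetical. The `reason` field is included because the Behavior section says it is persisted, although the schema above does not list it. Replacing `:` in the timestamp part of the file name is an assumption made here so the name is also valid on Windows.

```javascript
const fs = require('node:fs');
const path = require('node:path');

const TAIL_CHARS = 2000; // stdout/stderr tails are capped at 2000 chars each

// Keep only the last `max` characters of `text`.
function tail(text, max = TAIL_CHARS) {
  return text.length > max ? text.slice(-max) : text;
}

// Build the record described by the schema, plus the `reason` field.
function buildResult(run, now = new Date()) {
  return {
    skill: run.skill,
    invokedAt: now.toISOString(),
    exitCode: run.exitCode,
    timedOut: run.timedOut,
    stdoutTail: tail(run.stdout),
    stderrTail: tail(run.stderr),
    parsed: run.parsed,
    reason: run.reason,
  };
}

// Write the record under a fixed base directory and return the file path.
// `result.skill` is assumed to have passed the skill-name regex already.
function persistResult(result, baseDir = path.join('.bkit', 'runtime')) {
  fs.mkdirSync(baseDir, { recursive: true });
  const stamp = result.invokedAt.replace(/:/g, '-'); // assumption: avoid ':' in file names
  const file = path.join(baseDir, `evals-${result.skill}-${stamp}.json`);
  fs.writeFileSync(file, JSON.stringify(result, null, 2));
  return file;
}
```

The returned path is what the one-line chat summary would report as the persisted result file.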
Examples
# Single eval
/bkit-evals run gap-detector
# Discovery
/bkit-evals list
Related
- `/control trust`: eval results contribute to trust score
- `/code-review`: uses eval data when assessing skills
- `/bkit explore` (FR-β1): explore evals as a category