name: eval-writer
description: Turn acceptance criteria into a runnable Smithers eval suite (JSONL cases + rubric) and wire it to smithers eval. Use when a workflow's quality must be measured and regression-tested — not "looks good" once, but a repeatable check that fails when the model OR the harness regresses.
Eval Writer
This skill is about the backpressure layer: the suite that pushes evidence back against the agent's claim that a workflow is done. A single passing run proves nothing repeatable. An eval suite turns acceptance criteria into cases (input + expected + rubric), runs the whole workflow over them, and exits non-zero when any case regresses. That is the difference between "the agent said it works" and a gate that can fail.
The key insight: an eval evaluates the model AND the harness together. You're
not scoring a prompt in isolation — you run the real <Workflow> (its agents,
schemas, retries, branches, loops) against fixed inputs and assert on the
persisted output. A regression anywhere in that stack — a worse model, a broken
prompt, a dropped field, a mis-wired branch — turns a case red.
When to reach for it
- A workflow ships something whose quality matters (release notes, a triage decision, a generated patch) and you need to know if it gets worse next week.
- You're about to accept "looks good" as verification. Encode a check that can fail instead.
- You changed a prompt, swapped a model, or refactored the graph and need to prove you didn't regress behavior.
- You want a baseline to optimize against (smithers optimize runs a suite twice).
Skip it for one-off prompts nothing downstream depends on. Backpressure is for behavior you'll need to hold steady over time.
Cases: input + expected + rubric, as JSONL
A suite is a .jsonl file under .smithers/evals/, one case per line. Each case
is an input for the workflow plus an expected assertion. Assertions support
status (run reached finished), output (exact match), and outputContains
(partial / deep-subset match — the usual choice).
{"id":"happy-path","input":{"prompt":"Draft release notes"},"expected":{"status":"finished"}}
{"id":"lists-breaking-changes","input":{"prompt":"Release notes for v2"},"expected":{"status":"finished","outputContains":{"notes":{"breakingChanges":[{"severity":"high"}]}}}}
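To make the outputContains semantics concrete, here is a minimal TypeScript sketch of a deep-subset check. It is an illustration, not the Smithers implementation: the helper name isDeepSubset is hypothetical, and the array rule (each expected element must match some actual element, in any order) is an assumption read off the second case above.

```typescript
// Hypothetical helper: one plausible reading of "deep-subset match".
// This is NOT taken from the Smithers source.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isPlainObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// True when everything named in `expected` appears in `actual` with a
// matching value. Extra fields in `actual` are ignored.
function isDeepSubset(expected: Json, actual: Json): boolean {
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return false;
    const actualItems: Json[] = actual;
    // Assumption: each expected element must match at least one actual
    // element, regardless of position.
    return expected.every((want) =>
      actualItems.some((got) => isDeepSubset(want, got)),
    );
  }
  if (isPlainObject(expected)) {
    if (!isPlainObject(actual)) return false;
    const expectedFields = expected;
    const actualFields = actual;
    return Object.keys(expectedFields).every(
      (key) =>
        key in actualFields &&
        isDeepSubset(expectedFields[key], actualFields[key]),
    );
  }
  // Primitives and null must match exactly.
  return expected === actual;
}
```

Under these assumptions, an output with extra fields (a summary, additional low-severity changes) still satisfies the lists-breaking-changes case, while an output with no high-severity entry does not.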
Turn each acceptance criterion into at least one case: a happy path, the
quality-gate criterion itself, and an adversarial/edge case that should trip a
weak run. Keep outputContains keyed to the load-bearing fields of the output
schema (see skills/schema-author/SKILL.md) — assert on the typed fields a human
would actually check, not on prose.
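One way to keep a suite honest is to author the cases as typed objects and serialize them to JSONL, rather than editing raw lines by hand. The sketch below assumes the case shape described above (id, input, expected). The EvalCase type, the toJsonl helper, the duplicate-id guard, and the third (adversarial) case are all hypothetical additions for illustration.

```typescript
// Hypothetical types mirroring the case shape described in this section.
type Expected = {
  status?: string;
  output?: unknown;
  outputContains?: unknown;
};

type EvalCase = {
  id: string;
  input: Record<string, unknown>;
  expected: Expected;
};

// Serialize cases to JSONL: one JSON object per line.
// The duplicate-id check is a guard chosen for this sketch, not a
// documented Smithers requirement.
function toJsonl(cases: EvalCase[]): string {
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) throw new Error(`duplicate case id: ${c.id}`);
    seen.add(c.id);
  }
  return cases.map((c) => JSON.stringify(c)).join("\n") + "\n";
}

const releaseQuality: EvalCase[] = [
  // Happy path: the run simply has to finish.
  {
    id: "happy-path",
    input: { prompt: "Draft release notes" },
    expected: { status: "finished" },
  },
  // The quality-gate criterion itself.
  {
    id: "lists-breaking-changes",
    input: { prompt: "Release notes for v2" },
    expected: {
      status: "finished",
      outputContains: { notes: { breakingChanges: [{ severity: "high" }] } },
    },
  },
  // Hypothetical adversarial case: a prompt that makes it easy to skip
  // the breaking change.
  {
    id: "breaking-change-buried",
    input: { prompt: "Short release notes for v2, keep it upbeat" },
    expected: {
      status: "finished",
      outputContains: { notes: { breakingChanges: [{ severity: "high" }] } },
    },
  },
];

const jsonl = toJsonl(releaseQuality);
```

The resulting string can be written to a file under .smithers/evals/ with whatever file API the project already uses.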
Run it
bunx smithers-orchestrator eval .smithers/workflows/release.tsx \
--cases .smithers/evals/release-quality.jsonl \
--suite release-quality --force
- --suite <name> is a stable ID used in run IDs and the report path; reuse it so runs are comparable over time.
- The report lands at .smithers/evals/<suite>.json. The command exits non-zero on any failure, so wire that into CI as the gate.
- --dry-run plans run IDs without launching (cheap shape check before spend).
- -j/--concurrency N runs cases in parallel; --max-cases N smoke-tests a subset.
- --optimization <artifact.json> runs the suite with GEPA-patched prompts.
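If the gate is launched from a script rather than typed by hand, building the argument list in one place keeps the suite name and flags consistent between local runs and CI. The sketch below only uses flags named in this section. The EvalOptions type and buildEvalArgs helper are hypothetical, and nothing here launches a process.

```typescript
// Hypothetical options type covering only the flags named in this section.
type EvalOptions = {
  workflow: string;
  cases: string;
  suite: string;
  force?: boolean;
  dryRun?: boolean;
  concurrency?: number;
  maxCases?: number;
  optimization?: string;
};

// Build the argv that would follow `bunx`. Pure function: it returns the
// arguments and leaves process spawning to the caller.
function buildEvalArgs(options: EvalOptions): string[] {
  const args = [
    "smithers-orchestrator",
    "eval",
    options.workflow,
    "--cases",
    options.cases,
    "--suite",
    options.suite,
  ];
  if (options.force) args.push("--force");
  if (options.dryRun) args.push("--dry-run");
  if (options.concurrency !== undefined) {
    args.push("--concurrency", String(options.concurrency));
  }
  if (options.maxCases !== undefined) {
    args.push("--max-cases", String(options.maxCases));
  }
  if (options.optimization) args.push("--optimization", options.optimization);
  return args;
}

// A cheap smoke run: two cases, planned only.
const smokeArgs = buildEvalArgs({
  workflow: ".smithers/workflows/release.tsx",
  cases: ".smithers/evals/release-quality.jsonl",
  suite: "release-quality",
  dryRun: true,
  maxCases: 2,
});
```

A CI step could hand these arguments to bunx through its usual process API and fail the job on a non-zero exit status.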
Attach scorers for graded, non-binary quality
Assertions are pass/fail; scorers grade quality on a Task and run after
completion (they never block the run). Attach them to the <Task> whose output
you care about, then read them with smithers scores.
import {
  schemaAdherenceScorer,
  faithfulnessScorer,
  relevancyScorer,
  llmJudge,
} from "smithers-orchestrator/scorers";

// outputs, writer, and claude are defined elsewhere in the workflow file.
<Task id="draft" output={outputs.notes} agent={writer}
  scorers={{
    schema: { scorer: schemaAdherenceScorer() },
    grounded: { scorer: faithfulnessScorer() },
    onTopic: { scorer: relevancyScorer() },
    // Expensive judge: score roughly 10% of runs.
    quality: { scorer: llmJudge({ model: claude, prompt: "Rate completeness 0-1" }),
               sampling: { kind: "ratio", ratio: 0.1 } },
  }}>
  Draft the release notes.
</Task>
faithfulness (grounded in source), relevancy (on-topic), schemaAdherence
(shape held), and llmJudge(...) (rubric-as-judge) are the workhorses. Sample
expensive judges with { kind: "ratio", ratio: 0.1 }. Inspect:
bunx smithers-orchestrator scores <run-id>
Use assertions for the hard gate (must-be-true), scorers for the trend (is it getting better or worse).
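Reading scorers as a trend means comparing averages across runs rather than judging any single score. The sketch below shows one way to do that. The ScoreRecord shape is an assumption (this section does not describe the output of smithers scores), and the helper names and the 0.05 tolerance are hypothetical.

```typescript
// Assumed record shape: one graded score per scorer per Task execution.
// The real `smithers scores` output may differ.
type ScoreRecord = { scorer: string; score: number };

// Average the scores for each scorer name.
function meanByScorer(records: ScoreRecord[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const record of records) {
    const entry = sums.get(record.scorer) ?? { total: 0, count: 0 };
    entry.total += record.score;
    entry.count += 1;
    sums.set(record.scorer, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((entry, scorer) => means.set(scorer, entry.total / entry.count));
  return means;
}

// Names of scorers whose mean dropped by more than `tolerance` between a
// baseline run and the current run. Scorers missing from the current run
// (for example, unsampled judges) are skipped rather than flagged.
function findRegressions(
  baseline: ScoreRecord[],
  current: ScoreRecord[],
  tolerance = 0.05,
): string[] {
  const before = meanByScorer(baseline);
  const after = meanByScorer(current);
  const regressed: string[] = [];
  before.forEach((previousMean, scorer) => {
    const currentMean = after.get(scorer);
    if (currentMean !== undefined && previousMean - currentMean > tolerance) {
      regressed.push(scorer);
    }
  });
  return regressed;
}
```

Because sampled judges score only a fraction of runs, a comparison like this is more meaningful over several runs than over a single pair.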
The automated path: the eval-author workflow
You don't have to hand-write the suite. The seeded eval-author workflow turns
plain-English acceptance criteria into a JSONL fixture (id, input, expected,
rubric) under .smithers/evals/, then reports the exact smithers eval command:
bunx smithers-orchestrator workflow run eval-author \
--input '{"prompt":"Release notes must list every breaking change","workflow":".smithers/workflows/release.tsx"}'
Reach for it to bootstrap a suite from criteria, then hand-tighten the cases and
add scorers. See skills/smithers/SKILL.md for the runtime/CLI surface and
docs/llms-core.txt ("Eval suites for regressions", "Scorers") for the exact
report format and the full scorer list.