name: evaluation description: Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML. allowed-tools: Read, Bash, Write, Grep, Glob
Model Evaluation
Config-first evaluation framework for testing model responses against YAML-defined correctness assertions.
The evaluator does not hardcode a specific tool family, manager id, wrapper name, or behavior rule as correctness. Scenarios define the prompt and the acceptable response shape directly under correct.
Quick Reference
| Task | Command |
|---|---|
| Interactive menu | ./run.sh then Evaluate |
| Tool CLI eval | python -m Evaluator.cli --backend vllm --model MODEL --scenario tool_prompts.yaml --host 127.0.0.1 --port 8011 |
| Full configured eval | python -m Evaluator.cli --backend lmstudio --model MODEL --preset full |
| Quick smoke test | python -m Evaluator.cli --backend lmstudio --model MODEL --preset quick |
| Tag filter | python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --tags storageManager |
| Dry run config load | python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --dry-run |
| Eval with environment runtime | python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --env-backend local |
| Eval with LLM judge | python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --judge --judge-rubrics tool_call_quality |
| Eval + upload to HF | python -m Evaluator.cli --backend unsloth --model PATH --upload-to-hf user/model |
Status System
| Status | Meaning | When |
|---|---|---|
| PASS | Configured checks passed | correct assertions passed, and optional environment/judge checks passed |
| FAIL | Configured checks failed or request errored | No correct.any path matched, required environment checks failed, judge failed, or backend errored |
Schema/structural validation may still be reported for debugging, but it is not the source of task correctness. Correctness belongs in scenario YAML.
Key Directories
Evaluator/- Core evaluation codeEvaluator/config/scenarios/- YAML test scenariosEvaluator/config/tool_schema.yaml- Current CLI wrapper/tool schema metadataEvaluator/config/rubrics/- LLM-as-judge rubricsEvaluator/results/- Evaluation output JSON and Markdown
Progressive Reference
| Reference | When to Load | Path |
|---|---|---|
| CLI Commands | Running evaluations, all flags and examples | reference/cli-commands.md |
| Scenario Authoring | Writing or modifying YAML test scenarios | reference/scenario-authoring.md |
| Backends | Configuring vLLM, LM Studio, Ollama, Unsloth, and others | reference/backends.md |
| Results & Metrics | Interpreting JSON/Markdown output and failures | reference/results-metrics.md |
| Presets & Tags | Using presets and tag filters | reference/presets-tags.md |
Active Scenario Pattern
Every test should define what counts as correct:
tests:
- id: storage_copy_runbook
question: Copy the incident runbook into a template file.
tags: [storageManager, single-tool]
system: |
<session_context>
sessionId: "session_eval"
workspaceId: "ws_eval"
</session_context>
correct:
any:
- name: copy_cli
assertions:
- type: jsonpath_equals
path: $.tool_calls[0].name
value: useTools
- type: jsonpath_regex
path: $.tool_calls[0].arguments.tool
pattern: '^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)'
Use correct.any for multiple valid answers, such as command by id or by name. Use correct.all or nested all/any/not assertions for stricter structures.
Response View
Assertions query a generic response view. This is syntax normalization only:
$.rawpreserves the raw assistant response.$.contentis assistant text.$.content_jsonis parsed JSON content when content is JSON.$.tool_callsis a normalized list of emitted tool calls.- OpenAI-style
function.argumentsJSON strings are parsed into objects. - Plain text blocks like
tool_call: useToolsplusarguments: {...}are parsed into the same view.
The response view must not map CLI commands to old manager tool ids or decide correctness. Scenario YAML decides what is correct.
Tips
- Keep all task-specific expectations in YAML under
correct. - Do not add evaluator code for a specific tool, wrapper, or use case.
- Prefer regex or JSONPath assertions for tool CLI commands, because shell quoting and argument order can vary.
- If a schema allows equivalent forms, represent them as separate
correct.anypaths. - Use
--limitand--tagsfor fast iteration. - Use
--validate-contextonly when the scenario includes context fields that should be structurally checked. - Use
--env-backend localore2bonly when you need runtime execution checks beyond response correctness.