evaluation

star 23

Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.

ProfSynapse By ProfSynapse schedule Updated 4/24/2026

name: evaluation description: Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML. allowed-tools: Read, Bash, Write, Grep, Glob

Model Evaluation

Config-first evaluation framework for testing model responses against YAML-defined correctness assertions.

The evaluator does not hardcode a specific tool family, manager id, wrapper name, or behavior rule as correctness. Scenarios define the prompt and the acceptable response shape directly under correct.

Quick Reference

Task Command
Interactive menu ./run.sh then Evaluate
Tool CLI eval python -m Evaluator.cli --backend vllm --model MODEL --scenario tool_prompts.yaml --host 127.0.0.1 --port 8011
Full configured eval python -m Evaluator.cli --backend lmstudio --model MODEL --preset full
Quick smoke test python -m Evaluator.cli --backend lmstudio --model MODEL --preset quick
Tag filter python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --tags storageManager
Dry run config load python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --dry-run
Eval with environment runtime python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --env-backend local
Eval with LLM judge python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --judge --judge-rubrics tool_call_quality
Eval + upload to HF python -m Evaluator.cli --backend unsloth --model PATH --upload-to-hf user/model

Status System

Status Meaning When
PASS Configured checks passed correct assertions passed, and optional environment/judge checks passed
FAIL Configured checks failed or request errored No correct.any path matched, required environment checks failed, judge failed, or backend errored

Schema/structural validation may still be reported for debugging, but it is not the source of task correctness. Correctness belongs in scenario YAML.

Key Directories

  • Evaluator/ - Core evaluation code
  • Evaluator/config/scenarios/ - YAML test scenarios
  • Evaluator/config/tool_schema.yaml - Current CLI wrapper/tool schema metadata
  • Evaluator/config/rubrics/ - LLM-as-judge rubrics
  • Evaluator/results/ - Evaluation output JSON and Markdown

Progressive Reference

Reference When to Load Path
CLI Commands Running evaluations, all flags and examples reference/cli-commands.md
Scenario Authoring Writing or modifying YAML test scenarios reference/scenario-authoring.md
Backends Configuring vLLM, LM Studio, Ollama, Unsloth, and others reference/backends.md
Results & Metrics Interpreting JSON/Markdown output and failures reference/results-metrics.md
Presets & Tags Using presets and tag filters reference/presets-tags.md

Active Scenario Pattern

Every test should define what counts as correct:

tests:
  - id: storage_copy_runbook
    question: Copy the incident runbook into a template file.
    tags: [storageManager, single-tool]
    system: |
      <session_context>
      sessionId: "session_eval"
      workspaceId: "ws_eval"
      </session_context>
    correct:
      any:
        - name: copy_cli
          assertions:
            - type: jsonpath_equals
              path: $.tool_calls[0].name
              value: useTools
            - type: jsonpath_regex
              path: $.tool_calls[0].arguments.tool
              pattern: '^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)'

Use correct.any for multiple valid answers, such as command by id or by name. Use correct.all or nested all/any/not assertions for stricter structures.

Response View

Assertions query a generic response view. This is syntax normalization only:

  • $.raw preserves the raw assistant response.
  • $.content is assistant text.
  • $.content_json is parsed JSON content when content is JSON.
  • $.tool_calls is a normalized list of emitted tool calls.
  • OpenAI-style function.arguments JSON strings are parsed into objects.
  • Plain text blocks like tool_call: useTools plus arguments: {...} are parsed into the same view.

The response view must not map CLI commands to old manager tool ids or decide correctness. Scenario YAML decides what is correct.

Tips

  • Keep all task-specific expectations in YAML under correct.
  • Do not add evaluator code for a specific tool, wrapper, or use case.
  • Prefer regex or JSONPath assertions for tool CLI commands, because shell quoting and argument order can vary.
  • If a schema allows equivalent forms, represent them as separate correct.any paths.
  • Use --limit and --tags for fast iteration.
  • Use --validate-context only when the scenario includes context fields that should be structurally checked.
  • Use --env-backend local or e2b only when you need runtime execution checks beyond response correctness.
Install via CLI
npx skills add https://github.com/ProfSynapse/Synaptic-Tuner --skill evaluation
Repository Details
star Stars 23
call_split Forks 3
navigation Branch main
article Path SKILL.md
More from Creator